It’s possible to uncover or create some really knotty problems when porting applications to a new environment, whether that’s on-premises or in the cloud. More and more organisations are looking to move to the new Arm v8 servers, so we decided to take a look at what would happen.
We recently released Arm versions of all our tools.
Porting our tools to ARM
Step 0: run them on a Raspberry Pi. Although our tools are designed for scale and are run in colossal high performance computing clusters, you can run them anywhere so we started with a Raspberry Pi that included an Arm v8 chipset.
This very quickly answered the question “will it work?”, but as you will shortly see, it didn’t actually save us any time.
Step 1: Cross compile our tools and migrate them to AArch64. When we first migrated our tools to 32-bit Arm devices we had to use a commercially supported compiler. However, the Arm ecosystem has come a long way since then and the cross compiler is now part of GCC (aarch64-linux-gnu-gcc). We were recommended version 6 or higher, so we used version 6.3.0.
There was some fiddling to get it working with our build because we use inline assembly and we have to exactly match the version of glibc used, which is glibc 2.17 for AArch64. From then on it was easy, as we have few dependencies. Most applications won’t have those problems and will simply need a change of compiler.
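As an illustration of the glibc matching mentioned above, here is a minimal Python sketch that reports a host's architecture and C library version and compares them with an expected target. It is not part of our build. The constants EXPECTED_MACHINE and EXPECTED_GLIBC and the helper names are hypothetical, and the exact-match rule simply mirrors the constraint described in the text; many applications only need the host glibc to be at least as new as the one they were built against.

```python
import platform

# Assumed target values, taken from the text above (hypothetical constants).
EXPECTED_MACHINE = "aarch64"
EXPECTED_GLIBC = "2.17"


def parse_version(text):
    """Turn a dotted version string such as '2.17' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def describe_host():
    """Return (machine, libc name, libc version) for the running interpreter."""
    lib, version = platform.libc_ver()
    return platform.machine(), lib, version


def matches_target(machine, lib, version,
                   expected_machine=EXPECTED_MACHINE,
                   expected_glibc=EXPECTED_GLIBC):
    """True when the host is the expected architecture with the exact glibc."""
    return (
        machine == expected_machine
        and lib == "glibc"
        and parse_version(version) == parse_version(expected_glibc)
    )


if __name__ == "__main__":
    machine, lib, version = describe_host()
    print(f"machine={machine} libc={lib or 'unknown'} version={version or 'unknown'}")
    print("matches target:", matches_target(machine, lib, version))
```

Running this on the build host and on the target server gives a quick way to spot a mismatch before chasing stranger failures.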
Step 2: Run on a real Arm v8 server. We used Packet.net, but others are available. We chose a Cavium ThunderX and tested with all available operating systems: CentOS 7, Custom iPXE, Ubuntu 16.04 LTS or 17.04. We then chose to run the server in Amsterdam since that was the closest to us in the UK.
It worked first time, so it was cheaper than buying a Raspberry Pi!
Testing our tools on ARM
Since our in-house test framework passed first time, we thought we would run the tools on something real.
We took one of the HPC applications ported to Arm by the Barcelona Supercomputing Center: Quantum Espresso. We profiled the application with Mistral to highlight some of the things you might like to look for when moving your applications to ARM or to any other new compute environment.
We only ran the application with some sample options and small data sets, so it is not very representative of real HPC workloads, but even so we were able to see some interesting results. One of the biggest issues in I/O after bad metadata is small I/O. Small I/O can make your application look busy, but in fact it is wasting time spinning the processor as well as overloading the network and storage.
With some careful programming you can get the OS to buffer some reads and writes, but this shouldn’t be relied upon. Often, any automatic buffering will be too small to be of much benefit to performance. We’ve seen a lot of applications do one-byte reads and writes, but anything less than 32KB is too small.
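To make the cost of small I/O concrete, here is a minimal Python sketch, under stated assumptions, that counts how many write calls reach the underlying layer when the same one-byte records are written with and without an explicit 32KB buffer. CountingRaw is a hypothetical stand-in for a file or socket that only records the size of each write it receives. BLOCK_SIZE follows the 32KB figure above, and the record count is arbitrary.

```python
import io

BLOCK_SIZE = 32 * 1024  # 32KB threshold from the text


class CountingRaw(io.RawIOBase):
    """Hypothetical sink that records the size of every write it receives."""

    def __init__(self):
        super().__init__()
        self.write_sizes = []

    def writable(self):
        return True

    def write(self, data):
        self.write_sizes.append(len(data))
        return len(data)


def write_records(stream, records):
    """Write each record with its own call, as a naive application might."""
    for record in records:
        stream.write(record)


def compare(record_count=100_000):
    """Return the write sizes seen by the sink without and with buffering."""
    records = [b"x"] * record_count

    unbuffered = CountingRaw()
    write_records(unbuffered, records)

    raw = CountingRaw()
    buffered = io.BufferedWriter(raw, buffer_size=BLOCK_SIZE)
    write_records(buffered, records)
    buffered.flush()  # push out whatever is still held in the buffer

    return unbuffered.write_sizes, raw.write_sizes


if __name__ == "__main__":
    direct, coalesced = compare()
    print("writes without buffering:", len(direct))
    print("writes with a 32KB buffer:", len(coalesced))
```

The application code is identical in both cases; only the explicit buffer size changes, which is why choosing it deliberately is more dependable than hoping the default is large enough.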
Results by size range
Espresso was run with the Phonon package across all the examples supplied. The following graphs show reads and writes broken down by size range.
You can clearly see the boundaries between each example, while some examples give much better I/O profiles than others. There is very little data written and read one byte at a time, but plenty read or written in blocks smaller than 32KB, which could be improved. Clearly the first few examples and a larger example halfway down the list are the worst offenders.
We can get an idea of where the problem originates from a quick inspection of the log, which samples problem I/O giving the program and file. It looks like examples 1-6 and example 14 are the ones to improve. An inspection of what those examples do and how their I/O paths are coded would be worthwhile.
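The breakdown by size range described above can be approximated for any list of I/O sizes. The following Python sketch is illustrative only: the bucket boundaries and labels are our own assumptions rather than the ranges Mistral reports, and the input is assumed to be a plain list of operation sizes in bytes.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bucket boundaries in bytes; each value starts a new range.
BOUNDS = [2, 4 * 1024, 32 * 1024, 1024 * 1024]
LABELS = ["1 B", "2 B to <4 KB", "4 KB to <32 KB", "32 KB to <1 MB", ">=1 MB"]
SMALL_LIMIT = 32 * 1024  # anything below this counts as small I/O


def bucket(size):
    """Return the label of the size range that an I/O of `size` bytes falls in."""
    if size < 1:
        raise ValueError("I/O size must be at least one byte")
    return LABELS[bisect_right(BOUNDS, size)]


def histogram(sizes):
    """Count operations per size range."""
    return Counter(bucket(size) for size in sizes)


def small_fraction(sizes):
    """Fraction of operations smaller than SMALL_LIMIT (0.0 for no input)."""
    sizes = list(sizes)
    if not sizes:
        return 0.0
    return sum(1 for size in sizes if size < SMALL_LIMIT) / len(sizes)


if __name__ == "__main__":
    sample = [1, 512, 8192, 8192, 65536, 2 * 1024 * 1024]  # made-up sizes
    for label in LABELS:
        print(f"{label:>15}: {histogram(sample)[label]}")
    print("fraction below 32 KB:", small_fraction(sample))
```

Computing this per example, rather than for the whole run, is what makes the worst offenders stand out.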