As compute power grows, optimisation is vital

New storage solutions are starting to fundamentally alter the HPC storage arena. Organisations are keen to switch to the most efficient solution, be it a cloud architecture or a new low-power offering.

However, many systems are not able to cope with these changes. Each architecture poses such different performance challenges that if applications are transferred without being tuned for the new environment, everyone’s work could grind to a halt. We need a whole new, sustainable approach to optimisation that is managed by all users, not just the IT experts.

The unicorn

At the moment, the optimisation needed to make software run well on an HPC organisation’s hardware infrastructure is usually carried out by someone who is an expert both in the domain that the company works in, such as cancer research, and in IT infrastructure. These people are absolute gold dust, and are often self-taught.

However, their optimisation is usually quite informal and, understandably, focused on their current system. It is very easy to miss the big picture. It’s rare to find someone doing generalised work that will be of benefit in two years’ time when that application is running on a completely different architecture.

For example, consider what happens to a memory-bound application when you change the compute architecture. Without enough memory, you will get poor utilisation of the storage and compute. You might spend a lot of money increasing the memory, but then the application becomes I/O bound and the file system falls over because it is under such a heavy load. In short, the resulting outages could leave the file system performing worse under heavy load than it did before, when the load was light.

Such challenges will arise when companies look to switch to low-power, high-density architectures such as ARM or IBM’s Power architecture. Applications that have been highly tuned for today’s HPC hardware rarely run well on anything else.

In the cloud arena, architecture changes can be even more frequent. A company could spin up software on any configuration and spend six months tuning for that architecture, only to find that the cloud service has changed the price and it’s no longer economical. All that optimisation is then lost.

Optimisation 2.0

Before considering any kind of storage migration, companies first need to take a completely different approach to optimisation. The pursuit of optimising and tuning software for part A and chasing every little percentage of performance is potentially a thing of the past, particularly as compute is getting a lot cheaper.

Crucially, the responsibility for optimisation shouldn’t just lie in the domain of the experts. It’s all too easy for an application to do something completely stupid, so the focus has got to be on making it very, very easy for non-expert users to discover the colossal mistakes they are making.

For example, it doesn’t matter much if poor design makes an application 10% slower, provided it runs just about OK no matter what the compute, memory or storage profile is. It is far more important to discover that, say, 80% of the run time is being wasted by a script that stats everything on the file system. That scenario is dangerous on any architecture and can be identified in a way that non-expert users can understand.
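To make the "stats everything" scenario concrete, here is a minimal Python sketch of how such a mistake might be surfaced as a single number that a non-expert can read. The names `StatMeter`, `measure_stat` and `stat_share` are hypothetical helpers invented for this illustration, not part of any existing tool. The sketch assumes the workload is Python code that calls `os.stat` directly, so it will not see stat calls made by compiled extensions or child processes, which would need a system-level tracer.

```python
import os
import time
from contextlib import contextmanager


class StatMeter:
    """Hypothetical helper: counts os.stat calls and the time spent in them."""

    def __init__(self):
        self.calls = 0
        self.seconds = 0.0


@contextmanager
def measure_stat(meter):
    """Temporarily wrap os.stat so every call is counted and timed."""
    real_stat = os.stat

    def timed_stat(*args, **kwargs):
        start = time.perf_counter()
        try:
            return real_stat(*args, **kwargs)
        finally:
            meter.calls += 1
            meter.seconds += time.perf_counter() - start

    os.stat = timed_stat
    try:
        yield meter
    finally:
        # Always restore the original function, even if the workload fails.
        os.stat = real_stat


def stat_share(workload):
    """Run workload() and return (number of stat calls, share of run time)."""
    meter = StatMeter()
    start = time.perf_counter()
    with measure_stat(meter):
        workload()
    total = time.perf_counter() - start
    share = min(1.0, meter.seconds / total) if total > 0 else 0.0
    return meter.calls, share


def stat_everything(root):
    """Example workload: the kind of script that stats every entry in a directory."""
    for name in os.listdir(root):
        os.stat(os.path.join(root, name))


if __name__ == "__main__":
    calls, share = stat_share(lambda: stat_everything("."))
    print(f"{calls} stat calls took {share:.0%} of the run time")
```

The point of reporting one percentage is that a user needs no profiler expertise to act on it: a high share says the script is spending its time interrogating the file system rather than doing useful work.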

We need a different kind of optimisation from the fine-grained tuning that many of today’s debuggers and profilers deliver, which require the user to be an expert in the tool to get anything done. The next generation of tools must either be a lot more general or target applications that run at a much larger scale, where it is actually worth chasing every percentage point.