Managing HPC systems: When to use I/O profiling

I/O profiling is not just about improving performance. It is also about understanding I/O patterns so you can make the right choices about your storage infrastructure. Whether you have an on-prem system, a cloud system or a hybrid cloud system, understanding application I/O patterns and resource use will save you an awful lot of time and money in the long run.

“Improving run time often doesn’t require extensive rewrites. Knowing where to look is key.”

– Keiran Raine, Bioinformatician, Wellcome Sanger Institute

Debug and triage

Firefighting downtime comes before any other task in system administration. When applications are slow, or fail to run at all, the cause is often an I/O issue or a dependency problem.

Are you sourcing the right version of each library and program? Are you running applications out of the home directory instead of on the fast storage? Are the temporary files being sent over the network instead of being stored on local scratch?
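One of the questions above, whether temporary files are going over the network rather than to local scratch, can be roughly checked by looking up which mount a path lives on. The following is a minimal sketch, assuming a Linux-style mount table in the format of /proc/mounts. The function names and the list of network file system types are hypothetical choices made for this illustration, and mount points containing escaped spaces are not handled.

```python
import os

# Illustrative, incomplete set of file system types that are served
# over a network (an assumption for this sketch; adjust for your site).
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "smb3", "lustre", "gpfs", "beegfs"}


def find_mount(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point that
    contains `path`, or None if no entry matches.

    `mounts_text` is text in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the most specific (longest) matching mount point.
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best


def is_on_network_storage(path, mounts_text):
    """True if `path` sits on a file system type listed in NETWORK_FS_TYPES."""
    mount = find_mount(path, mounts_text)
    return mount is not None and mount[1] in NETWORK_FS_TYPES
```

On a Linux node, one might read the mount table with open("/proc/mounts").read() and pass tempfile.gettempdir() as the path, to see whether the default temporary directory is on local or network storage.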

Forecasting for procurement

How can you give vendors a concise set of requirements if you don’t know what is currently running on your system? Understanding application I/O is key to specifying what you need to buy and how those needs will change.

What are you running now? What will you be running in 3-5 years’ time? What are the I/O patterns of those systems?

Benchmarking new systems

Synthetic I/O benchmarks tell you how the benchmark itself performs on a storage system. Unless you have matched the benchmark to the I/O patterns of the applications you actually run, they tell you little about how those applications will perform.

Performance tuning of third-party software

Optimizing I/O is often the first thing people think about when considering I/O profiling, but it’s the last thing you have time for. I/O profiling used to be something left to expert developers, but even if you have no access to the code base, there are often benefits to profiling third-party tools on your system.

For example, a program can be well behaved in itself and still suffer I/O problems caused by its configuration. Are you storing all the files in the right place? Is the software hanging on a slow license server? Are there hundreds of locations in the PATH variable, causing file system trawls every time a command is looked up?
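To illustrate the PATH question, the sketch below counts how many PATH entries have to be probed before a command is found. It is a simplified approximation of a PATH lookup, not the exact logic of any particular shell, and the function name count_path_probes is hypothetical. Each probe is at least one metadata operation, which adds up when the directories are on shared storage.

```python
import os


def count_path_probes(command, path=None):
    """Return (probes, location): the number of PATH entries checked
    before `command` is found, and the full path where it was found.

    If the command is not found, every entry has been probed and
    location is None.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    # Ignore empty entries for simplicity (an assumption of this sketch).
    entries = [p for p in path.split(os.pathsep) if p]
    for probes, directory in enumerate(entries, start=1):
        candidate = os.path.join(directory, command)
        # Each check touches the file system at least once.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return probes, candidate
    return len(entries), None
```

Calling count_path_probes("python3") on a login node gives a rough idea of how deep in PATH a commonly used tool sits. A high count for tools that are launched thousands of times per job is a hint that the PATH ordering is worth reviewing.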

Free tools vs Ellexus’ I/O profiling solutions

Tools such as iotop, dstat and strace can tell you what the disk is doing and what the application is up to. However, unless you have a large team on hand to take that information and write a front end to the data, it can take a long time to get any meaningful results. This is particularly true when running applications at scale.
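As a small illustration of the post-processing that raw trace output needs, the sketch below tallies system calls and failed calls from strace-style text lines. It assumes the common one-call-per-line layout, such as openat(...) = -1 ENOENT (...), optionally prefixed with a process ID. The function name and regular expression are hypothetical, and lines for unfinished or resumed calls and for signals are simply skipped.

```python
import re
from collections import Counter

# Matches lines such as:
#   openat(AT_FDCWD, "/lib/x.so", O_RDONLY) = -1 ENOENT (No such file or directory)
#   [pid 4242] read(3, "...", 832) = 832
LINE_RE = re.compile(
    r'^(?:\[pid\s+\d+\]\s+|\d+\s+)?'   # optional PID prefix
    r'(\w+)\(.*\)\s+=\s+(-?\d+)'       # call name, arguments, return value
    r'(?:\s+(E[A-Z]+))?'               # optional errno name on failure
)


def summarise_strace(lines):
    """Return (calls, errors): a Counter of call names, and a Counter
    of (call name, errno name) pairs for calls that reported an error."""
    calls = Counter()
    errors = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # signals, exit notices, unfinished calls, and so on
        name, _ret, errno_name = match.groups()
        calls[name] += 1
        if errno_name:
            errors[(name, errno_name)] += 1
    return calls, errors
```

A large number of openat calls failing with ENOENT, for instance, can point to a long library or PATH search. Even this much only summarises a single process on a single node, which is part of why hand-built tooling becomes expensive at scale.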

Ellexus has done the hard work for you, not just by collecting the I/O profiling data in a reliable way, but also by providing high-level access to the data so you can quickly see what is going on and where the bottlenecks are.

For more information, visit the product pages for Breeze and Mistral.