Ellexus contributes to global paper on how to analyse I/O

Ellexus is a firm believer that collaboration will enable the HPC industry to create new tools that will help all of us succeed. In this vein, our CEO Rosemary has collaborated with universities and research bodies across Europe and the US to produce a workshop paper entitled ‘Tools for analysing parallel I/O’.

The paper, available for download on the Cornell University Library website, will be published in Springer Lecture Notes in Computer Science (LNCS). It contains an overview of available I/O profiling solutions including Breeze and Mistral.

The paper was written in response to the need for more sophisticated tools to capture, analyse, understand and tune application I/O. Parallel application I/O performance often does not meet user expectations, and this gap will only become more apparent as compute continues to speed up.

However, there have been significant steps forward in monitoring tools to address this problem, and the paper gives an overview of those currently available.

It also describes best practices, identifies issues in measurement and analysis, and provides practical approaches to translate parallel I/O analysis into actionable outcomes for users, facility operators, and researchers.

Get in touch if you’d like to talk to the Ellexus team about parallel I/O.

Introduction: Tools for Analyzing Parallel I/O

The efficient use of I/O systems is of prime interest for data-intensive applications, since storage systems are increasing in size and the use cases of a single system are highly diverse, especially in the scientific community. As computing centers grow in size, and the high-performance computing (HPC) community approaches exascale, it has become increasingly important to understand how these systems are operating and how they are being used.

Additionally, understanding system behavior helps light the path for future storage system development and lets the purveyors of these systems ensure that performance is adequate for work to continue unimpeded. While industry systems are typically well understood, shared storage systems in the HPC community are not.

The reason is their sheer size and their concurrent use by many users, who typically submit disparate workloads. Most applications achieve only a fraction of the theoretically available performance. Hence, optimization and tuning of the available hardware and software knobs are important. This requires that the user understand the I/O behavior arising from the interaction between application and system, since it determines the runtime of the applications. However, measuring and assessing observed performance are already nontrivial tasks, raising various challenges around hardware, software, deployment, and management.
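To make the measurement point concrete, here is a minimal sketch (not from the paper) of the simplest possible observed-performance measurement: timing a sequential write, including the sync to storage, to estimate achieved bandwidth. Real profilers such as those surveyed in the paper capture far richer data, but even this toy benchmark illustrates the gap between observed and theoretical throughput; the function name and parameters are illustrative only.

```python
import os
import tempfile
import time

def measure_write_bandwidth(total_mb=64, block_kb=1024):
    """Write total_mb of zeros in block_kb chunks and return achieved MB/s.

    A toy sequential-write benchmark: fsync is included in the timing so the
    result reflects data actually reaching storage, not just the page cache.
    """
    block = b"\0" * (block_kb * 1024)
    n_blocks = (total_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(n_blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # force data to storage before stopping the clock
        elapsed = time.perf_counter() - start
        return total_mb / elapsed
    finally:
        os.remove(path)

print(f"Sequential write: {measure_write_bandwidth():.1f} MB/s")
```

Comparing this figure against the storage system's nominal bandwidth already hints at why users see only a fraction of theoretical performance, and why dedicated tools are needed to explain where the rest goes.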

This paper describes the current state of the practice with respect to such tools.