Supercharging the Ellexus HPC monitoring tools with the Hartree Centre

Ellexus specialises in application monitoring tools for high performance computing (HPC). While many HPC clusters are run as farms for single-core applications, others utilise a combination of scheduler technologies and MPI libraries to run a single application – not just across many cores, but also across many host machines with the compute infrastructure operating as one supercomputer.

To provide system monitoring and I/O profiling tools for this industry, it is necessary to integrate with the relevant schedulers and MPI libraries running on the system. With that integration in place, applications distributed across thousands of cores can be profiled.

Ellexus has been working with the Science and Technology Facilities Council’s (STFC) Hartree Centre to perform extensive testing of the Ellexus monitoring and profiling tools against a range of commercial schedulers and MPI libraries, in particular the latest versions of Intel MPI. By using the HPC facilities at the Hartree Centre, the Ellexus tools can be tested in a production environment that better matches the HPC environments that our customers use.

“By supporting UK industry with specialist expertise, the Hartree Centre ensures that British companies stay at the forefront for high performance computing,” said Dr Rosemary Francis, Ellexus CEO. “The Ellexus team gained valuable insights into how our tools performed at scale by doing extensive and transparent testing and having good access and control over experiments in a way we don’t usually get in customer engagements.”

How the Ellexus tools work

The Ellexus tools trace MPI applications and distributed jobs by intercepting the remote launch calls and re-injecting the Ellexus monitoring technology into the environment. This ensures that users can continue to treat the supercomputer as a single machine and that jobs that span multiple hosts can be profiled seamlessly. Ellexus has built up a comprehensive framework to allow the tracing technology to work in a wide variety of distributed environments both on-premises and in the cloud.
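To make the idea of intercepting a remote launch and re-injecting monitoring settings more concrete, here is a minimal sketch in Python. It is not the Ellexus implementation: the function name `reinject`, the variable name `MONITOR_JOB_ID`, the library path and the simple `launcher host command` form are all assumptions made for illustration. Real schedulers and MPI launchers pass many more options, and they differ from site to site.

```python
import shlex

# Hypothetical monitoring settings; these names and paths are illustrative
# assumptions and are not taken from the Ellexus tools.
MONITOR_ENV = {
    "LD_PRELOAD": "/opt/monitor/libtrace.so",
    "MONITOR_JOB_ID": "job-1234",
}


def reinject(launch_argv, monitor_env):
    """Rewrite a `launcher host command [args...]` call so that the remote
    command starts with the monitoring environment already set.

    Assumes the simplest launch form, with no launcher options before the host.
    """
    if len(launch_argv) < 3:
        raise ValueError("expected: launcher host command [args...]")
    launcher, host, *remote_cmd = launch_argv
    # Build KEY=VALUE assignments, quoted so they survive the remote shell.
    assignments = [f"{key}={shlex.quote(value)}"
                   for key, value in sorted(monitor_env.items())]
    # Prefix the original command with `env ...` so the settings follow the
    # job onto the remote host instead of being lost at the host boundary.
    remote = " ".join(["env", *assignments,
                       *(shlex.quote(arg) for arg in remote_cmd)])
    return [launcher, host, remote]


if __name__ == "__main__":
    print(reinject(["ssh", "node01", "./solver", "-parallel"], MONITOR_ENV))
```

The point of the sketch is that the remote process inherits the monitoring settings even though it starts on a different host, which is what allows a multi-host job to be profiled as if it ran on one machine.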

Our work with the Hartree Centre concentrated on testing the wide range of ways in which schedulers and MPI libraries interact. No two HPC environments are the same, so while many of Ellexus’ customers use technologies such as LSF and Intel MPI, there is no guarantee that any two customers will have set up the technology in the same way. It is therefore vital for SMEs like Ellexus to work with organisations such as the Hartree Centre, which can provide access to leading HPC platforms, so that as many combinations and configurations of these environments as possible can be tested.

Using OpenFOAM as a benchmark for testing, the team at the Hartree Centre and the Ellexus engineers were able to pool knowledge and resources and identify new, undocumented ways in which certain schedulers and MPI libraries were distributing workloads. By incorporating this knowledge into the Ellexus tool suite, we can ensure that our I/O monitoring tools, Breeze and Mistral, are future-proof and ready for the latest updates in HPC environments.

To read more about our work with the Hartree Centre, download the whitepaper on the performance improvements we identified in the HPC application DL_POLY using I/O profiling.