How to manage I/O in the biggest supercomputers in the world

The Texas Advanced Computing Center (TACC) designs and operates some of the most powerful computing resources in the world. Its new supercomputer, known as Frontera, will be the fastest at any US university once it is up and running in 2019.

Because its systems handle many different users and workloads, the IT team has come across many I/O-related challenges. To combat these challenges, the team uses tools such as Ellexus Mistral, which catches problem users on Frontera. Predicting failures is the holy grail of monitoring solutions. Currently the TACC team is using Mistral to monitor metadata, specifically to catch users who delete a lot of files at once, and over time they will add further aspects to monitor.

In a recent talk at the TACC STAR Program event, Dan Stanzione, executive director of TACC, described the work to develop Frontera and how I/O monitoring has changed over the years.

Specific I/O challenges

Dan highlighted two specific I/O-related problems his team has come across:

  • 40,000 10-byte files were created in 10 minutes across 40 machines.
  • An open/close loop, in which a program repeatedly opens and closes a file or set of files, often writing just one byte each time or even doing nothing.
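The open/close loop can be illustrated with a minimal Python sketch. It is not taken from Dan's talk or from any TACC code; the function names and the record count are hypothetical and chosen only for illustration. The first function opens and closes the file once per record, while the second opens it once and lets Python's buffered I/O batch the writes.

```python
import os
import tempfile


def write_open_close_loop(path, records):
    """Antipattern: one open/close pair per record written."""
    opens = 0
    for record in records:
        # Every iteration asks the file system for an open and a close.
        with open(path, "a") as handle:
            handle.write(record)
        opens += 1
    return opens


def write_single_open(path, records):
    """Alternative: open once and let buffered I/O batch the writes."""
    with open(path, "a") as handle:
        for record in records:
            handle.write(record)
    return 1  # a single open/close pair, whatever the record count


def demo(count=1000):
    """Write the same one-byte records both ways and compare the results."""
    records = ["x"] * count  # hypothetical workload: one byte per record
    with tempfile.TemporaryDirectory() as tmp:
        loop_path = os.path.join(tmp, "loop.txt")
        once_path = os.path.join(tmp, "once.txt")
        loop_opens = write_open_close_loop(loop_path, records)
        once_opens = write_single_open(once_path, records)
        with open(loop_path) as a, open(once_path) as b:
            same_content = a.read() == b.read()
    return loop_opens, once_opens, same_content
```

Both functions produce the same file contents, but the first issues an open and a close for every record. On a shared parallel file system, each of those calls is a request to the metadata service, which is why a loop like this can hurt other users even though it moves very little data.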

A single user's load rarely causes a problem at TACC, and usually IOPS, not bandwidth, is the limiter. However, metadata servers can get swamped and file count is often a problem: 3 billion files is a push for Lustre.
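One simple way to spot the kind of metadata burst described here is to count each user's operations in a sliding time window. The sketch below is a hypothetical illustration only: it is not how Mistral or TACC Stats works, and the class name, the event format (a user name plus a timestamp in seconds) and the thresholds are all assumptions.

```python
from collections import defaultdict, deque


class MetadataBurstDetector:
    """Flags users whose metadata operations in a sliding window exceed a limit.

    Hypothetical sketch: the window length, limit and event format are
    assumptions, not settings from any real monitoring tool.
    """

    def __init__(self, window_seconds, max_ops):
        self.window_seconds = window_seconds
        self.max_ops = max_ops
        self._events = defaultdict(deque)  # user -> timestamps of recent ops

    def record(self, user, timestamp):
        """Record one operation (create, delete, open...) and report the status.

        Returns True if the user now has more than max_ops operations
        inside the window ending at this timestamp.
        """
        events = self._events[user]
        events.append(timestamp)
        # Discard timestamps that have fallen out of the window.
        while events and events[0] <= timestamp - self.window_seconds:
            events.popleft()
        return len(events) > self.max_ops
```

For example, with `MetadataBurstDetector(window_seconds=60, max_ops=3)`, a fourth operation by the same user within a minute is flagged, while the same user is no longer flagged once the earlier operations have aged out of the window. A real deployment would need to choose limits that fit the file system and the workloads it serves.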

To address this, Frontera will have four scratch file systems: one 3PB flash file system and three others to which users' scratch space is randomly assigned. TACC plans to stay with Lustre for the time being.

Politics, performance and performance management

HPC doesn’t look like it used to, and TACC, which works with many scientific organisations, has evolved with it. Streaming data is common as applications access large data sets, often from external sources. Synthetic biology and manufacturing have online bodies of data and play a big part in many TACC projects. Getting on top of I/O challenges is therefore increasingly critical.

For machine learning, there is a lot to learn in the application of the technology as well as in tuning the tools. At TACC, the team is working on constraints for machine learning to avoid physics-free modelling; for example, the car must stay on the road. To ensure performance, they plan to use auto-tuned MPI stacks, with compile-time system telemetry that affects internal I/O settings. They will treat all data as protected data, since so much of it will be anyway.

Monitoring solutions

As well as third-party monitoring tools such as Mistral, TACC has a number of home-grown solutions, including TACC Stats and XALT. Because the data collected by Mistral can be plugged into other dashboards, Mistral is complementary to both.

Policing will always be a large part of managing a service as big as TACC. For example, bitcoin mining has been tried a few times on TACC clusters. Users can be clever at hiding workloads, but they are always found and banned.

As HPC continues to evolve and Frontera gets up to speed, no doubt the team at TACC will encounter more I/O problems and find innovative ways to manage them.