XALT: Tracking job-level activity on supercomputers

XALT is a tool that allows supercomputer support staff to collect and understand job-level information about the libraries and executables that end users access during their jobs. The tool can also work with a system’s module software to provide additional information about module usage. XALT is a collaboration between Mark Fahey of the University of Chicago, formerly of the National Institute for Computational Sciences, and Robert McLay of the Texas Advanced Computing Center.

Ellexus offers commercial support for XALT as part of our commitment to open source technology and collaboration. Come and talk to us if you would like help installing and running XALT. We also offer a range of professional services, including custom integration and bug fixing.

XALT can be used in combination with Mistral, so that you can see the I/O patterns of the jobs on your cluster alongside the dependency information from XALT.

Case study

Using XALT to reduce queue time

The Texas Advanced Computing Center (TACC) at the University of Texas at Austin uses the Slurm job scheduler to manage several queues. One of those queues is reserved for jobs that use a lot of memory, so that they can be scheduled with the resources they need.

This worked for a while, but the wait time on that queue grew long and users complained.

Robert McLay at TACC went through the XALT log data to see what was running on that queue and found that half the jobs were running the same geological application. This application can be run in parallel on more cores with much less memory per core, so it didn’t need to be in the high-memory queue.

After that application was moved, the wait time on the high-memory queue fell from four days to two. This shows the value of knowing what you are running and spending a little time optimising the common cases.
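
The analysis in this case study comes down to counting which executables account for the jobs on one queue. The sketch below illustrates that idea in Python. It is not XALT's own tooling or schema: the record layout, the field names (`queue`, `executable`), the queue names and the application names are all hypothetical, and in practice the records would come from wherever your site stores its XALT data.

```python
from collections import Counter

# Hypothetical job records. The field names and values are illustrative
# assumptions, not XALT's actual schema or real TACC data.
jobs = [
    {"queue": "largemem", "executable": "geo_app"},
    {"queue": "largemem", "executable": "geo_app"},
    {"queue": "largemem", "executable": "solver_a"},
    {"queue": "largemem", "executable": "solver_b"},
    {"queue": "normal", "executable": "geo_app"},
]


def executable_share(records, queue):
    """Return (executable, fraction of jobs) pairs for one queue,
    ordered from most to least common."""
    counts = Counter(r["executable"] for r in records if r["queue"] == queue)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(exe, n / total) for exe, n in counts.most_common()]


shares = executable_share(jobs, "largemem")
for exe, fraction in shares:
    print(f"{exe}: {fraction:.0%} of jobs")
```

An executable that dominates a specialised queue is a candidate for a closer look at whether it needs that queue's resources at all.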