Named Mistral, the tool will solve the noisy neighbour problem: a situation in which one user submits a job to the cluster that performs so much IO that it prevents others from using the shared file system.
Mistral will detect applications with unexpectedly high IO, notify a central control system and then automatically throttle the rogue job so that others can continue unaffected at full speed. Through lightweight, low-impact profiling of jobs, the tool will also build a history of IO patterns that can be used for more advanced tuning, to educate users and to improve IT infrastructure design. This history will also enable storage-aware scheduling through integration with schedulers such as IBM Platform LSF, which is used by ARM’s IT department.
This technology takes our existing tool Breeze a step further to confront a problem that is common to every compute cluster with shared storage.
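The detect-and-throttle idea described above can be illustrated with a minimal sketch. This is not Mistral's implementation or interface: the class `JobIOMonitor`, its methods, the job names and the 100 MB/s budget are all hypothetical, chosen only to show how a per-job IO rate might be sampled, compared against a limit and turned into a throttling delay.

```python
from dataclasses import dataclass, field


@dataclass
class JobIOMonitor:
    """Hypothetical per-job IO monitor (illustrative only, not Mistral's API)."""
    limit_mb_per_s: float                          # assumed per-job bandwidth budget
    history: dict = field(default_factory=dict)    # job id -> list of observed rates

    def record(self, job_id: str, mb_transferred: float, interval_s: float) -> float:
        """Store one sample for a job and return the observed rate in MB/s."""
        rate = mb_transferred / interval_s
        self.history.setdefault(job_id, []).append(rate)
        return rate

    def rogue_jobs(self) -> list:
        """Return the jobs whose most recent rate exceeds the budget."""
        return [job for job, rates in self.history.items()
                if rates[-1] > self.limit_mb_per_s]

    def throttle_delay(self, job_id: str, interval_s: float) -> float:
        """Seconds of delay per interval needed to bring the job back to the budget."""
        rate = self.history[job_id][-1]
        if rate <= self.limit_mb_per_s:
            return 0.0
        # Spreading the same data over a longer window lowers the effective rate.
        return interval_s * (rate / self.limit_mb_per_s - 1.0)


monitor = JobIOMonitor(limit_mb_per_s=100.0)
monitor.record("job-a", mb_transferred=50.0, interval_s=1.0)
monitor.record("job-b", mb_transferred=400.0, interval_s=1.0)
flagged = monitor.rogue_jobs()                              # jobs over the budget
delay = monitor.throttle_delay("job-b", interval_s=1.0)     # delay for the rogue job
```

In this sketch the per-job history doubles as the record of IO patterns, so the same samples that trigger throttling could later inform tuning or scheduling decisions.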
Olly Stephens, Engineering Systems Architect at ARM, says of the project with Ellexus: “One of our key long-term objectives is to develop a better understanding of the storage requirements of each compute job so that we can be more dynamic in the way we manage and provision storage.
“We want to develop a system that will allow the infrastructure to protect itself somewhat against behaviour that is considered a risk to the collective whole. In particular, we want the ability for aggressive, and potentially dangerous, use of the storage infrastructure to be automatically detected early and remedial steps taken quickly, hopefully preventing it from escalating to a system-wide issue.
“Currently this activity is done by the HPC support staff, who are able to monitor and detect issues then attempt to trace them back to the culprit jobs. This is a slow and difficult process, primarily due to the lack of available information. The data and system control provided by the new software from Ellexus will automate this process and give us a lot more information to learn from.”
Become a development partner
We’re currently looking for more development partners to take the project forward. If you’re interested in learning more, get in touch with Rosemary at firstname.lastname@example.org.