Silicon Valley case study: Catching rogue jobs before they overload shared storage

At a recent Spectrum Scale User Group meeting in Berkeley, California, Rosemary presented tales from a Silicon Valley start-up – namely, some work we’ve been doing to catch rogue jobs at an admittedly huge start-up, which has been acquired by a large organisation.

The company’s host-leaf architecture is as follows:

Host compute nodes:

  • Native Spectrum Scale client
  • 4 Gigabit/s InfiniBand
  • 32 Slots

Leaf compute nodes:

  • Send data over RCP to host nodes
  • Clustered NFS


  • 8 Spectrum Scale servers

The company suffers from two recurring problems with bad I/O patterns. Firstly, jobs are supposed to aggregate data on the host node, but sometimes they write directly from the leaf nodes instead. If they hammer a particular mount point, that can overload one of the filers.

Secondly, a job is debugged on one core then handed over to run at scale – but the debug flag is left on accidentally. This happens weekly.

Introducing Mistral

The company decided to introduce Mistral to solve the problem. Mistral works by wrapping production jobs with its technology via an LSF job starter.

Once running, Mistral generates an alert if:

  • Leaf nodes write to the file system
  • Host nodes write too much data

As the company puts it, this breaks down the problem into two cases:

  • My house is on fire
  • My house was on fire, but I missed it and I want to know what happened

In the first case, Mistral can send an alert. In the second case, Mistral helps the company to track the problem, for example if it occurred at night or over the weekend. This has been made possible through integration with IBM LSF RTM, the LSF analytics dashboard.
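
As a rough sketch of the second case, the snippet below picks out recorded alerts that fired at night or over the weekend, when nobody was watching. It does not use the LSF RTM or Mistral interfaces; the `(timestamp, message)` tuples, the function names and the 08:00 to 18:00 working day are assumptions made for illustration.

```python
from datetime import datetime
from typing import List, Tuple

Alert = Tuple[datetime, str]  # hypothetical (timestamp, message) record


def out_of_hours(ts: datetime, day_start: int = 8, day_end: int = 18) -> bool:
    """True if ts falls on a weekend or outside the assumed working day."""
    is_weekend = ts.weekday() >= 5  # 5 = Saturday, 6 = Sunday
    in_working_day = day_start <= ts.hour < day_end
    return is_weekend or not in_working_day


def missed_alerts(alerts: List[Alert]) -> List[Alert]:
    """Return the alerts that fired while nobody was likely to be watching."""
    return [alert for alert in alerts if out_of_hours(alert[0])]
```

Filtering a stored alert history this way gives a short list of incidents to review the next working morning.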

Since introducing this framework, the company has seen an overall reduction in the number of rogue jobs through user education, and a dramatic reduction in the time it takes to find and resolve an issue when it occurs.

“We really like Mistral – it’s really smart. We now want to look at long term QA as we have nearly eliminated rogue jobs on the cluster.”

Hear all about it at ISC 2017

Rosemary will be presenting the work described in this post at the Spectrum Scale and LSF User Group meeting at ISC. Come along to hear more about it and ask any questions.

The meeting will be held on Monday 19th June from 12 – 2:30pm at the IBM meeting room Konstant. For a detailed agenda contact Ulf Troppens at