In a shared environment such as an HPC cluster, it’s easy for users to ‘noisy neighbour’ each other and themselves. They can make mistakes that slow down everyone’s work or halt the system completely. System rules and best practices are usually developed through years of experience in firefighting common problems, but how do you proactively develop best practices for new systems? Crucially, how do you ensure that users follow shared storage best practice?
I recently met Adelina, the geek whisperer. She runs presentation skills courses for geeks. Instead of “influencer training”, she runs classes on “How to tell someone that they are wrong and make them believe you”. As a sysadmin, you often have to be the bearer of bad news. I think the HPC industry can learn from Adelina.
Training and education
The first step in making users behave well is to tell them what good behaviour is. The way you present the data is key. Simple, comprehensive sets of rules win every time, along with clear ways to know what is good and what is bad.
The traffic light view in Ellexus’ tools shows you exactly how much good, average and bad I/O your applications are doing. This is a very handy tool to show to users and to develop shared storage best practice.
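To make the traffic light idea concrete, here is a minimal sketch of how I/O operations could be sorted into good, average and bad. It is not how the Ellexus tools work internally: the size thresholds and the function names classify_io and traffic_light_summary are assumptions made up for illustration, and a real site would pick limits that suit its own storage.

```python
from collections import Counter

# Hypothetical thresholds, chosen only for illustration.
SMALL_IO_BYTES = 4 * 1024       # operations smaller than this count as "red"
GOOD_IO_BYTES = 1024 * 1024     # operations at least this large count as "green"


def classify_io(size_bytes):
    """Return 'red', 'amber' or 'green' for one read or write of the given size."""
    if size_bytes < SMALL_IO_BYTES:
        return "red"
    if size_bytes < GOOD_IO_BYTES:
        return "amber"
    return "green"


def traffic_light_summary(io_sizes):
    """Return the percentage of operations that fall into each colour."""
    counts = Counter(classify_io(size) for size in io_sizes)
    total = sum(counts.values())
    colours = ("green", "amber", "red")
    if total == 0:
        return {colour: 0.0 for colour in colours}
    return {colour: 100.0 * counts[colour] / total for colour in colours}


if __name__ == "__main__":
    # Two tiny reads, one medium read and one large read.
    print(traffic_light_summary([512, 512, 65536, 2 * 1024 * 1024]))
```

A three-way split like this is easy to show to a user: they do not need to understand the file system to see that half of their operations are red.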
Blame and public shaming
Seeing who has just launched a thousand jobs on a login node and then gone to lunch is a powerful motivator to get things right. Peer pressure works because most people are team players, but it does penalise the inexperienced.
Good design, including design for error, is the best way to keep users on the right path. However, as most systems comprise a custom combination of tools and requirements, most organisations will need custom error checking as well. This is expensive and time-consuming, but it can create a legacy of efficiency, so it is worth the investment. We’ve seen the best results where DevOps is treated like software engineering.
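As one example of custom error checking, a launch wrapper could refuse to start a large batch of jobs on a login node and point the user at the scheduler instead. The sketch below is only an assumption about what such a check might look like: the function name check_launch, the host names and the limit of four jobs are all hypothetical.

```python
def check_launch(hostname, job_count, login_nodes, max_jobs_on_login=4):
    """Return an error message if the launch looks like a mistake, else None.

    login_nodes is the site's own set of login host names (hypothetical here).
    """
    if hostname in login_nodes and job_count > max_jobs_on_login:
        return (
            f"Refusing to start {job_count} jobs on login node {hostname}; "
            "please submit them through the scheduler instead."
        )
    return None


if __name__ == "__main__":
    login_nodes = {"login01", "login02"}  # hypothetical host names
    print(check_launch("login01", 1000, login_nodes))
    print(check_launch("compute17", 1000, login_nodes))
```

A check like this tells the user what went wrong and what to do next, which is kinder to the inexperienced than finding out from their colleagues.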
Monitoring and triage
You don’t know what the next problem will be, and you don’t know what you need to monitor in order to catch it, but basic performance metrics and chargeback are a good start. Spotting abnormal behaviour starts with knowing what is normal and having good processes in place to triage problems when they are detected.
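Knowing what is normal can start very simply. The sketch below compares the latest value of a metric with its recent history and flags it when it sits more than a chosen number of standard deviations from the mean. The function name is_abnormal, the default threshold of three and the sample figures are assumptions for illustration; real metrics are often skewed or seasonal and may need something more robust.

```python
import statistics


def is_abnormal(history, latest, threshold=3.0):
    """Return True if latest is more than threshold standard deviations from the mean of history."""
    if len(history) < 2:
        # Not enough data to say what normal looks like.
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # A perfectly flat history: any change at all is unusual.
        return latest != mean
    return abs(latest - mean) / stdev > threshold


if __name__ == "__main__":
    # Hypothetical metadata operations per second for one user.
    history = [100, 102, 98, 101, 99]
    print(is_abnormal(history, 101))
    print(is_abnormal(history, 150))
```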
Don’t force staff to reinvent the wheel every time something goes wrong. Having a diagnostic pipeline that quickly rules out common problems will save time and empower your engineers.
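A diagnostic pipeline does not have to be sophisticated to be useful. As a rough sketch, it can be an ordered list of cheap checks, each of which either rules out a common problem or reports a finding. The names run_diagnostics and check_disk_space, the 5% free space limit and the path in the example are assumptions for illustration; the real list of checks would come from your own history of incidents.

```python
import shutil


def check_disk_space(path, min_free_fraction=0.05):
    """Return a finding if the file system holding path is nearly full, else None."""
    usage = shutil.disk_usage(path)
    if usage.total > 0 and usage.free / usage.total < min_free_fraction:
        return f"less than {min_free_fraction:.0%} free space on {path}"
    return None


def run_diagnostics(checks):
    """Run each (name, check) pair in order and return a list of (name, finding).

    A check returns None when it rules its problem out, or a message otherwise.
    A check that raises is reported as a finding rather than stopping the run.
    """
    findings = []
    for name, check in checks:
        try:
            result = check()
        except Exception as exc:
            result = f"check failed to run: {exc}"
        if result is not None:
            findings.append((name, result))
    return findings


if __name__ == "__main__":
    checks = [
        ("disk space", lambda: check_disk_space("/")),  # "/" is a placeholder path
    ]
    for name, finding in run_diagnostics(checks):
        print(f"{name}: {finding}")
```

Because each check is just a function, engineers can add a new one after every incident, and the pipeline gradually captures what the team has learned.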
In short, enforcing shared storage best practice requires a combination of the right technology and the right people management. It’s all about placing knowledge in the right people’s hands. The tools from Ellexus can provide a valuable helping hand.