We are often approached by customers who have an I/O issue in the field. Of course we can help them fix problems as they come up, but it’s so much better to prevent them from affecting your end users in the first place. As datasets get larger and applications grow in complexity, there has never been a better time to start caring about the way you access your data.
In this blog I’ll be sharing some case studies of common I/O issues and what we look for to catch them early using Mistral and Breeze. No test suite is going to catch every potential problem, but by looking for bad I/O patterns in your existing tests you can find a lot of problems before they reach users.
Once you’ve read these case studies, read our blog on how to integrate Mistral and Breeze into Jenkins so that you can toggle I/O profiling on and off with varying levels of detail. Most of the issues in this blog post are concerned with file-system I/O to local or shared storage.
How do I decide what is “good” I/O?
We are often asked by customers how to write tests that detect bad I/O when they are not sure what good I/O looks like. Sometimes a period of profiling and inspection is needed to find out what is normal, but most of the time bad I/O is really obvious.
When testing for production you shouldn’t look for small performance wins: you should be looking for catastrophic events that will take down your infrastructure or your users’ infrastructure.
For example, several of our customers had a common problem: their shared file system was being taken out by users running applications with the debug flag left on. This can cause a distributed application to generate more than a million times more data than normal. Users could easily work on a problem in debug mode, then leave the flag on by mistake when submitting the job to be run in their test suite.
To solve this problem, set up a test that detects when an application is generating enough data to crash your file system. Start with a high threshold and then adjust it up or down depending on the rate of false positives and false negatives. The acceptable range is very wide, so the threshold does not need to be precise.
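As a sketch of the idea, suppose your test harness can report the total bytes an application wrote during a run. The function, baseline, and multiplier below are illustrative assumptions, not part of Mistral or Breeze:

```python
# Illustrative data-volume guard for a test suite. The baseline and
# multiplier are assumptions to be tuned against your own workloads.
BASELINE_BYTES = 50 * 1024**2   # typical output of a healthy run (assumed)
MULTIPLIER = 1000               # start generous; tune on false positives

def write_volume_ok(bytes_written):
    """Return True if a run's output volume looks safe."""
    return bytes_written <= BASELINE_BYTES * MULTIPLIER
```

A forgotten debug flag that multiplies output by a million trips the check immediately, while normal run-to-run variation stays well inside the wide acceptable range.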
Trawling the file system
It’s very easy to trawl a file system, either with a “find” command or with application code that lists every file in a directory to locate the one you want. This can run completely undetected in a directory with few files, but can be a disaster in a very full file system. The most common trawl we come across is applications checking every file in the home directory.
One customer had a major problem when they shipped an application that searched the home directory for a config file on a system where the home directory also held many other tools and config files. Because the home directory was located on a shared file system, the time it took to load the application rose to over a minute. Most users lose interest after less than 30 seconds, so this kind of delay is unacceptable.
To catch this class of problem, look for file-system trawls. They don’t have to take a long time to be flagged: a trawl that is fast in a test environment can be crippling in production. This is a higher-overhead test than some, but it will pick up every issue of this type, so it doesn’t need to run with every commit.
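To make the pattern concrete, here is a minimal sketch of the difference between a trawl and a direct lookup, in plain Python with hypothetical function names:

```python
import os

def find_config_by_trawl(directory, name):
    """Anti-pattern: read every directory entry to find one file.
    Cost grows with the number of files in the directory."""
    for entry in os.listdir(directory):
        if entry == name:
            return os.path.join(directory, name)
    return None

def find_config_direct(directory, name):
    """Better: ask for the known path and let the file system do the lookup."""
    path = os.path.join(directory, name)
    return path if os.path.exists(path) else None
```

Both return the same answer, but the trawl costs one directory read per entry while the direct lookup is a single path resolution, which is why the difference only shows up once the directory fills.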
Hard coded paths and home directory dependencies
It’s common for developers to start working on something new in their home directory, using it as a personal scratch space. Even with company policies in place, this can be hard to police.
Once a feature has been developed, it is easy to accidentally leave some dependencies in the home directory and for those to make it into the final product. Clean-room testing should pick up on this, but it is not always completed for quick releases or internal releases, and mistakes happen.
Our tools look for files and programs in your home directory as well as for files and programs in other users’ home directories. You can expand this to include other forbidden areas of the file system and save yourself a lot of pain when it comes to removing those dependencies in release candidates and deployed applications.
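One way to automate that kind of check yourself, assuming you can export the list of paths a test run touched, is a simple forbidden-prefix filter. The prefixes here are examples to extend with your own no-go areas, not output from our tools:

```python
# Illustrative forbidden-path check over a run's accessed files.
FORBIDDEN_PREFIXES = ("/home/", "/Users/")  # assumed list; add other no-go areas

def forbidden_accesses(paths):
    """Return every accessed path that falls in a forbidden area."""
    return [p for p in paths if p.startswith(FORBIDDEN_PREFIXES)]
```

A non-empty result in a release-candidate test run points straight at the dependency that needs to move out of someone’s home directory before the product ships.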