In the spirit of the New Year, the Ellexus team has put together our top five I/O resolutions. Big data is getting bigger and fast compute is getting faster – now is the time to clean up your act so you’re ready for the year to come.
1. I will tidy up my data on the shared file systems
If you have a project or user space quota then don’t exceed it. Some file systems get very slow when they get full. Some of our customers have checks that warn users who are nearing their quota, but if you work on a system without that check then it’s up to you to be a good citizen.
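If your system has no quota warning, a rough check of your own is easy to script. The sketch below uses Python's `shutil.disk_usage` to report how full the file system holding a path is. Note that this measures the whole file system rather than your personal quota, and the 90% threshold is our own assumption, not a figure that applies to every file system.

```python
import shutil


def usage_fraction(path):
    """Return the fraction (0.0 to 1.0) of the file system holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some virtual file systems report no capacity; treat them as empty.
        return 0.0
    return usage.used / usage.total


def near_limit(path, threshold=0.9):
    """True when usage is at or above the warning threshold (0.9 is an assumed default)."""
    return usage_fraction(path) >= threshold


if __name__ == "__main__":
    print(f"Current directory's file system is {usage_fraction('.'):.0%} full")
```

Running something like this at the start of a job is a cheap way to find out early that space is tight, rather than halfway through writing results.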
It’s also important to delete files as you go along. Deleting a lot of files all at once, at the end of a large job or as you run out of space, puts a lot of pressure on the file system. One of our customers has had trouble with users deleting large numbers of files in parallel, sometimes without splitting the files between the parallel jobs, so that every job tries to delete the same files. This is enough to bring an entire cluster to a halt, so don’t do it.
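If you really do have to delete in parallel, the least you can do is give each job its own share of the files. Here is a minimal sketch of that idea; the helper names `partition` and `delete_slice` are ours, invented for illustration, and how you launch the workers is left to your scheduler.

```python
import os


def partition(paths, n_workers):
    """Split paths into n_workers disjoint slices, dealt out round-robin."""
    ordered = sorted(paths)
    return [ordered[i::n_workers] for i in range(n_workers)]


def delete_slice(paths):
    """Delete the given files and return how many were actually removed."""
    removed = 0
    for path in paths:
        try:
            os.remove(path)
            removed += 1
        except FileNotFoundError:
            # Already gone: nothing to do, and no reason to retry.
            pass
    return removed
```

Each worker then calls `delete_slice` on its own slice only, so no two jobs ever fight over the same file. This does not make a mass delete cheap, it just stops it being needlessly expensive.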
2. I will check application dependencies
Do you regularly check that you are using the correct versions of files or libraries? When you call programs, are you always calling the same version or do you have (for example) multiple versions of perl or python installed? It’s not just C programs that get bit rot from this kind of inconsistency – the portability of all code degrades over time if you don’t keep track of dependencies.
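A quick way to see whether you have more than one copy of a program installed is to walk the PATH yourself. The following is a rough sketch, not a full dependency audit: `find_all_on_path` is a hypothetical helper of ours, and it does not resolve symlinks, so two entries may turn out to be the same binary.

```python
import os


def find_all_on_path(name, path=None):
    """Return every executable called `name` along PATH, in search order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    for program in ("perl", "python3"):
        print(program, find_all_on_path(program))
```

The first entry in each list is the one a plain call to the program will pick up. If the list has more than one entry, it is worth checking that every machine you run on agrees about which comes first.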
Another area to check is your hard-coded paths. It’s all too easy to set something up quickly in your home directory or scratch space and to let that work its way into production code.
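A simple text search will catch the most obvious offenders. This sketch flags anything that looks like a path under `/home/` or `/scratch/`; those two prefixes are assumptions on our part, so swap in whatever your site uses.

```python
import re

# Assumed prefixes for personal and scratch areas; adjust for your site.
HARD_CODED = re.compile(r"/(?:home|scratch)/[\w./-]+")


def find_hard_coded_paths(text):
    """Return (line number, path) pairs for every suspicious path in the text."""
    found = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in HARD_CODED.finditer(line):
            found.append((lineno, match.group(0)))
    return found
```

Run it over your scripts and config files before a release. It will not find paths that are built up at runtime, but it is a useful first pass.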
Don’t know how to check for application dependencies? Take a look at Breeze.
3. I will check for file system trawls and failed I/O
File system trawls are a really easy way for bad performance to make its way into the production environment. For example, does your application look for a file by reading everything in that directory? Do you use “find *”? Have you tested it on a full file system? Are you relying on the PATH variable to call programs and find libraries?
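To make the first of those questions concrete, here is a small sketch of the two ways to look for a file whose name you already know. The function names are ours, and the demo directory of 100 files is tiny compared with a production one, which is exactly why this kind of trawl goes unnoticed in testing.

```python
import os
import tempfile


def find_by_trawl(directory, name):
    """Anti-pattern: read every entry in the directory to find one file."""
    for entry in os.listdir(directory):
        if entry == name:
            return os.path.join(directory, entry)
    return None


def find_directly(directory, name):
    """Ask the file system about the one path we care about."""
    candidate = os.path.join(directory, name)
    return candidate if os.path.exists(candidate) else None


def demo():
    """Build a small directory and check that both lookups agree."""
    with tempfile.TemporaryDirectory() as directory:
        for i in range(100):
            open(os.path.join(directory, f"file{i}.dat"), "w").close()
        target = "file42.dat"
        return find_by_trawl(directory, target) == find_directly(directory, target)
```

Both functions give the same answer, but the trawl has to read the whole directory to get there, while the direct lookup asks about a single path. On a shared file system with millions of entries that difference is the one your users will feel.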
Often, test environments are much cleaner than the real world. When testing, examine every failed I/O operation, because ten failed opens in test could mean a 10-second delay on startup in production. One of our customers recently lost 24 hours of runtime on a short application due to failed I/O.
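One low-tech way to make failed opens visible is to record them instead of silently moving on to the next candidate. This is only a sketch of the idea, with a hypothetical `open_first` helper; a tracing tool will show you the same thing without touching the code.

```python
import errno


def open_first(candidates, failures):
    """Open the first candidate that exists, logging every failed attempt.

    Each failure is appended to `failures` as a (path, errno) pair.
    Returns an open binary file object, or None if nothing could be opened.
    """
    for path in candidates:
        try:
            return open(path, "rb")
        except OSError as exc:
            failures.append((path, exc.errno))
    return None


if __name__ == "__main__":
    failures = []
    handle = open_first(["/nonexistent/a.conf", "/nonexistent/b.conf"], failures)
    missing = [path for path, code in failures if code == errno.ENOENT]
    print(f"{len(failures)} failed opens, {len(missing)} of them missing files")
    if handle is not None:
        handle.close()
```

If the list of failures is longer than you expected in a clean test environment, it will be longer still in production.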
4. I will profile I/O patterns in applications in both development and deployment
It’s important to profile the I/O in your application at every stage. Problems such as small I/O operations or very big I/O can be introduced by code changes or third-party libraries, but there can also be other problems introduced when an application is deployed.
For example, we’ve had customers distribute an application with an option disabled. An input file wasn’t needed so it was set to /dev/null. The application still tried to read the file, wasting 10% of the run time on null operations. The I/O patterns were good when there was a file to read, but a broken assumption caused a significant slowdown.
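A guard for this case can be very small. The sketch below is our own illustration, not the customer's code: it assumes a POSIX-style system where the null device can be compared with `os.path.samefile`, and it skips the read loop entirely when the input is missing or points at the null device.

```python
import os


def is_null_input(path):
    """True if `path` refers to the null device (for example /dev/null)."""
    try:
        return os.path.samefile(path, os.devnull)
    except OSError:
        # The path does not exist or cannot be inspected.
        return False


def read_optional_input(path, chunk_size=1 << 20):
    """Read an optional input file in chunks and return the bytes consumed.

    Returns 0 immediately when the option is disabled, either because
    no path was given or because it points at the null device.
    """
    if path is None or is_null_input(path):
        return 0
    total = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            total += len(chunk)
    return total
```

The point is not the few lines of code but the habit: when an input is optional, check for the disabled case before doing any I/O at all.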
5. I will think about the way I access data right from the start of each project
If you work on a modern distributed system, it’s no longer OK to assume that accessing data is local, fast and scalable. The design of the data, where and how to store it and how to monitor its use all need to be considered from the beginning.
For example, in a large web-facing SaaS application, which clicks affect the database? If you don’t understand the I/O patterns of a system then you are shooting in the dark when you carry out any kind of optimisation.
We wish you a happy 2018!