Video from InsideHPC: I/O architectures and technology, comparing Lustre and Spectrum Scale

If you were absorbed with festivities in December, you might have missed the video InsideHPC presented from the Argonne Training Program on Extreme-Scale Computing. Glenn Lockwood from NERSC presents an excellent overview of I/O architectures and technology, maintaining that good I/O is about using the right APIs and understanding the technology underneath. We couldn’t agree more.

Lustre vs Spectrum Scale

Glenn includes a comparison of Lustre and Spectrum Scale, formerly GPFS, a subject that Ellexus is often asked about. Our tools detect good and bad I/O patterns, and we are often asked whether we categorise those patterns differently depending on the file system used.

Both Lustre and Spectrum Scale are parallel file systems that allow a file to reside on many disks managed by multiple storage nodes, which means the file can be read in parallel at far greater bandwidth than a single disk can deliver. Some I/O patterns are very well suited to this design and some are not. Both Lustre and Spectrum Scale are designed to give high aggregate bandwidth to very large checkpointing-style jobs that read and write large, contiguous streams of data. Neither is designed to serve small I/O or random I/O well.
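To make that distinction concrete, the sketch below contrasts the two access patterns in Python. The file path and sizes are purely illustrative, not taken from the talk.

```python
import random

PATH = "/lustre/scratch/demo.dat"   # illustrative path on a parallel file system
FILE_SIZE = 1 << 30                 # assume a 1 GiB file for the example

# The pattern parallel file systems are built for: large, contiguous, streaming reads.
def streaming_read(path, chunk_size=8 << 20):          # 8 MiB per request
    with open(path, "rb") as f:
        while f.read(chunk_size):
            pass

# The pattern they serve poorly: many tiny reads at random offsets.
def random_small_read(path, n_reads=100_000, io_size=4096):
    with open(path, "rb") as f:
        for _ in range(n_reads):
            f.seek(random.randrange(0, FILE_SIZE - io_size))
            f.read(io_size)
```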

Lustre has a separate meta-data server. This means you can size your meta-data server independently from the data store and it can be optimised for random access over streaming access.

However, Lustre was designed for large files with contiguous access, so even when the meta-data server is built on state-of-the-art solid-state storage, meta-data access rates are never very good. Lustre is also open to noisy neighbour problems, where one user performs a large number of meta-data operations and denies other users access to meta-data resources.
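The sketch below shows the kind of job that causes this: a single script that walks a large shared directory tree and stats every file, issuing one metadata request per entry. The directory path is hypothetical.

```python
import os

# Anti-pattern: millions of metadata operations (readdir plus a stat per file)
# against a shared meta-data server, starving other users of the same resource.
def metadata_storm(root="/lustre/project/shared"):      # hypothetical directory
    total_bytes = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.stat(os.path.join(dirpath, name))   # one metadata request per file
            total_bytes += st.st_size
    return total_bytes
```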

In comparison, Spectrum Scale lets you interleave the meta-data with the data, which means the meta-data scales with your storage capacity. Large directories should be avoided, though, as they can span multiple blocks and multiple physical volumes, which degrades performance.
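A common way to keep individual directories small is to shard files across a fixed set of subdirectories keyed by a hash of the file name. The sketch below is a minimal illustration; the bucket count and layout are assumptions on our part, not a Spectrum Scale requirement.

```python
import hashlib
import os

def sharded_path(base_dir, filename, n_buckets=256):
    """Return a path that places the file in one of n_buckets subdirectories,
    chosen by a hash of its name, so no single directory grows without bound."""
    bucket = int(hashlib.md5(filename.encode()).hexdigest(), 16) % n_buckets
    subdir = os.path.join(base_dir, f"{bucket:03d}")
    os.makedirs(subdir, exist_ok=True)
    return os.path.join(subdir, filename)

# Example: sharded_path("/gpfs/project/output", "sample_000123.h5")
# might return "/gpfs/project/output/042/sample_000123.h5".
```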

The topology of the file system is also more configurable in Spectrum Scale, as there are many more options to choose from. Most problems in Spectrum Scale are therefore networking problems, with the added complexity that no two deployments are the same.

Stripe size and SSDs

Both file systems rely on striping files across multiple nodes. In Lustre the stripe size is configurable, usually larger than 1MB, and can be tuned to fit the application's I/O patterns and data. It can even be configured per file.
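In practice the striping is usually set with the lfs utility before a file is written. The sketch below shells out to lfs setstripe from Python; it assumes a Lustre client with lfs on the PATH, and the stripe count and size shown are only example values to be tuned per workload.

```python
import subprocess

def set_lustre_striping(path, stripe_count=8, stripe_size="4m"):
    """Create `path` (or mark a directory) so that its data is striped across
    `stripe_count` storage targets in chunks of `stripe_size`.
    Striping must be set before the file's data is written."""
    subprocess.run(
        ["lfs", "setstripe", "-c", str(stripe_count), "-S", stripe_size, path],
        check=True,
    )

# Example: stripe a large checkpoint file widely before the job writes it.
# set_lustre_striping("/lustre/scratch/checkpoint.dat", stripe_count=16, stripe_size="16m")
```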

In Spectrum Scale the stripe size is fixed and it is not possible to choose where the blocks in a file go. Spreading files across many disks is great for parallel reads, but it does make the file system vulnerable to issues caused by a single slow disk. In this way applications with bad I/O patterns can easily affect the performance of hundreds of other well-behaved applications.

Both file systems benefit from the increasing availability of solid-state drives (SSDs), which are faster than spinning disk and cope better with small and random I/O. SSDs are still not perfect, however; nothing comes for free. Random or small I/O still degrades performance because SSDs have to write and erase data in large blocks (roughly 128 KB to 2 MB), so large contiguous I/O is still preferred.
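One mitigation that helps on either file system, and on SSDs, is to aggregate small records in memory and write them out as large contiguous chunks. Below is a minimal sketch; the flush threshold is an illustrative value, not a recommendation from the talk.

```python
class BufferedRecordWriter:
    """Accumulate small records in memory and flush them as large,
    contiguous writes instead of many tiny ones."""

    def __init__(self, path, flush_bytes=8 << 20):   # flush in roughly 8 MiB chunks
        self._file = open(path, "wb")
        self._buffer = bytearray()
        self._flush_bytes = flush_bytes

    def write_record(self, record: bytes):
        self._buffer += record
        if len(self._buffer) >= self._flush_bytes:
            self.flush()

    def flush(self):
        if self._buffer:
            self._file.write(self._buffer)
            self._buffer.clear()

    def close(self):
        self.flush()
        self._file.close()
```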

While the two file systems have different architectures that cope differently with specific I/O patterns, the configuration of the system matters just as much as the file system used. A broad message holds for both, however: streaming I/O is great, small random I/O is not, and, if you can, avoid meta-data operations such as stat(), open(), create() and delete().
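As a final illustration of that last point, the sketch below contrasts opening a file once per record with opening it once for the whole run; each open() and close() adds metadata and lock traffic on a shared file system. The function names and pattern are ours, not from the talk.

```python
# Anti-pattern: one open()/close() pair per record, so two extra
# file-system operations for every few bytes of data written.
def append_records_slow(path, records):
    for record in records:
        with open(path, "ab") as f:
            f.write(record)

# Better: a single open() for the whole run, then stream the data through it.
def append_records_fast(path, records):
    with open(path, "ab") as f:
        for record in records:
            f.write(record)
```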