Optimising genome pipelines is in all of our interests – 1 in 2 people born after 1960 will get cancer at some point in their lifetime, according to Cancer Research UK. It’s a race against time to find cures.
However, each cancer sample generates around 250GB of data after initial processing, so the storage involved in genome projects is huge. Optimisation is crucial.
Most pipelines are written and tested on local machines and then run in parallel on compute clusters with shared storage. However, I/O behaviour on a cluster is very different. Unless the bioinformatician has access to comprehensive I/O profiling tools, there are likely to be inefficient I/O patterns that harm storage performance and can prevent other users from getting their work done.
Pipelines at the Sanger Institute
Ellexus profiled one of the genome pipelines at the Sanger Institute with Mistral to look for such I/O patterns. The pipeline had been optimised in some areas, but our profiling showed that there was still potential for it to be improved.
The Sanger Institute supplied us with the following, which we ran first on a native CentOS machine and then on a virtual machine to see how performance might vary in a cloud environment:
- A public pipeline (as a Docker image)
- A public reference genome
- A public mutant sample
We set up the Mistral I/O contract to sample every 20s and generate an alert when the following limits were exceeded:
- >1,000 metadata operations
- >1,000,000 read or write operations in the size ranges between 1B, 4kB, 32kB and 1MB
- >1,000 read or write operations in the 32kB-1MB and 1MB+ size ranges
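To make the idea of per-window limits concrete, here is a minimal Python sketch of how counts for one 20s window could be checked against thresholds. It is not Mistral's contract syntax or API. The bucket names, the `LIMITS` table and the functions `bucket_for`, `count_window` and `exceeded` are all hypothetical, and the assignment of limits to buckets is only one possible reading of the list above.

```python
from bisect import bisect_right
from collections import Counter

# Bucket boundaries in bytes, using the sizes named in the list above.
BOUNDARIES = [4 * 1024, 32 * 1024, 1024 * 1024]
BUCKETS = ["1B-4kB", "4kB-32kB", "32kB-1MB", "1MB+"]

# ASSUMPTION: one reading of the limits above, applied per 20s window.
# This is an illustration only, not the format of a real Mistral contract.
LIMITS = {
    "metadata": 1_000,
    "1B-4kB": 1_000_000,
    "4kB-32kB": 1_000_000,
    "32kB-1MB": 1_000,
    "1MB+": 1_000,
}


def bucket_for(size_bytes):
    """Return the name of the size bucket an operation of size_bytes falls into."""
    return BUCKETS[bisect_right(BOUNDARIES, size_bytes)]


def count_window(op_sizes, metadata_ops=0):
    """Count the operations seen in one sampling window, grouped by size bucket."""
    counts = Counter(bucket_for(size) for size in op_sizes)
    counts["metadata"] = metadata_ops
    return counts


def exceeded(counts, limits=LIMITS):
    """Return the names of the limits whose count was exceeded in this window."""
    return [name for name, limit in limits.items() if counts.get(name, 0) > limit]


if __name__ == "__main__":
    # A window dominated by 1-byte reads, with a handful of metadata operations.
    window = count_window([1] * 1_000_001, metadata_ops=12)
    print(exceeded(window))
```

Keeping the limits in a plain table means the thresholds can be tightened or relaxed without touching the counting logic.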
Fig 1 shows the alerts generated when there were more than a million small reads in a 20s period.
At the start there were up to a million 1-byte reads per second. Small reads like these harm computational performance and create sub-optimal I/O patterns on the shared storage. The pattern soon settles into a longer period of good streaming I/O, but it would still be worth optimising the early small reads.
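The cost of small reads is easy to see by counting calls. The sketch below is our own illustration and is not taken from the Sanger pipeline: it reads the same file with unbuffered reads of different sizes and counts how many read calls return data. The function name `count_reads` and the 64kB scratch file are assumptions made for the example.

```python
import os
import tempfile


def count_reads(path, chunk_size):
    """Read a whole file with unbuffered reads of chunk_size bytes.

    Returns (number of read calls that returned data, total bytes read).
    """
    calls = 0
    total = 0
    # buffering=0 disables Python's own buffer, so each read() reaches the OS.
    with open(path, "rb", buffering=0) as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            calls += 1
            total += len(data)
    return calls, total


if __name__ == "__main__":
    # Hypothetical 64kB scratch file standing in for pipeline input.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 64 * 1024)
    try:
        for size in (1, 4 * 1024, 32 * 1024):
            calls, total = count_reads(tmp.name, size)
            print(f"{size:>6} byte reads: {calls} calls for {total} bytes")
    finally:
        os.unlink(tmp.name)
```

On a local disk the operating system's page cache hides much of this cost, but on shared storage each extra call is more work for the file system. In most languages, adding a buffered reader or reading in larger chunks is enough to turn the first pattern into the last.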
Fig 2 shows the better streaming reads.
This is the kind of I/O we want to see because these access patterns allow the shared storage to run at maximum bandwidth with minimum impact on other jobs. Both the pipeline and the jobs sharing the storage run faster as a result.
Fig 3 shows the small reads and writes performed when the pipeline was run on a virtual machine with only a small amount of memory.
Virtualisation slows down the I/O and therefore lengthens the run time, but the small amount of memory will have had a bigger impact in this case. The sub-optimal I/O patterns are far less severe in this run because the job runs much more slowly and so puts less strain on the file system.
The conclusion of our profiling work: buy lots of memory, profile your file I/O, and prefer streaming I/O to small reads and writes.