Optimising DNA sequencing pipelines in the New Pipeline Group at the Sanger Institute

As a forerunner in the world’s fight against cancer and genetic diseases, the Sanger Institute also has to be a trendsetter in the field of High-Performance Computing. To this end, the institute uses a wide variety of tools to profile and optimise its genome pipelines, including Ellexus Breeze. The vast data sets that come out of the Illumina sequencing machines push even the latest genome sequencing and analysis algorithms to their limits, and managing the compute and storage they need is no easy task.

The Sanger’s New Pipeline Group (NPG) has been profiling pipelines with Breeze on the team’s on-prem OpenStack cluster. Specifically, they are using it to monitor I/O performance and look for areas for optimisation.

OpenStack is a flexible framework for deploying compute resources. It can be used to implement a hybrid cloud architecture that allows organisations like the Sanger Institute to expand into cloud resources when their on-prem setup is not enough.

In particular, the NPG was keen to see the I/O patterns of a program called fastqcheck. They suspected that it was a bottleneck and had started work on improving it, but they wanted to make sure that it was worth the effort. The team selected a small part of their pipeline to profile in detail and ran it with the Breeze tracing wrapper.
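As a rough illustration of what wrapping one pipeline step with an I/O tracer looks like (the exact Breeze wrapper invocation is not shown in this article), here is a minimal Python sketch that uses strace as a generic stand-in; the fastqcheck arguments are hypothetical.

    import subprocess
    import sys

    def trace_step(command, log_path):
        """Run `command` under strace, following child processes and
        logging timestamped file-related system calls to `log_path`."""
        wrapper = [
            "strace",
            "-f",                      # follow forked/child processes
            "-ttt",                    # absolute timestamps on every call
            "-e", "trace=open,openat,read,write,close",
            "-o", str(log_path),
        ]
        return subprocess.run(wrapper + list(command), check=False)

    if __name__ == "__main__":
        # Hypothetical arguments: trace a single fastqcheck run.
        result = trace_step(["fastqcheck", "sample.fastq"], "fastqcheck.strace")
        sys.exit(result.returncode)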

Read on to find out what the team discovered when the data was loaded into Breeze.

Overview of good and bad I/O

In Breeze, the traffic light report immediately gives a breakdown of how much time is spent doing good and bad I/O. In this case, although the majority of I/O is good, fast streaming I/O, there is a large amount of time spent doing small I/O – 251s in a 351s pipeline is a lot even when the I/O is done in parallel.

Small I/O is a problem for file systems because the operating system doesn’t necessarily do a good job of grouping I/O operations into larger chunks. A lot of small operations have to be sent over the network and acknowledged. On shared file systems, even reads are effectively writes because the file system has to keep track of where in a file the program is reading in order to implement correct locking policies.
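A minimal sketch of that effect, assuming a hypothetical input file: reading the same data with tiny unbuffered reads rather than megabyte-sized reads multiplies the number of system calls, and on a shared file system each of those calls is a network round trip.

    import os
    import time

    def read_in_chunks(path, chunk_size):
        """Read a whole file with unbuffered os.read() calls of
        `chunk_size` bytes; return (number_of_calls, elapsed_seconds)."""
        fd = os.open(path, os.O_RDONLY)
        calls = 0
        start = time.perf_counter()
        try:
            while os.read(fd, chunk_size):
                calls += 1
        finally:
            os.close(fd)
        return calls, time.perf_counter() - start

    if __name__ == "__main__":
        path = "example.fastq"  # hypothetical input file
        for size in (64, 1024 * 1024):
            calls, elapsed = read_in_chunks(path, size)
            print(f"{size:>8}-byte reads: {calls} read() calls in {elapsed:.3f}s")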

Clicking on the entry for small I/O in the Breeze traffic light report brings up a list of programs that did small I/O. The program under investigation, fastqcheck, appears as the top three entries, accounting for most of the small I/O. This means the NPG was correct to target this application for optimisation.

Checking the file dependencies

As well as looking at which programs are doing small I/O, you can see which files have been targeted. In this case, fastqcheck only reads from one file so it is easy to see what it is doing, but in other programs it might be necessary to narrow it down further.

The files view in Breeze gives you a list of which files have been accessed and a table detailing the I/O patterns. You can sort the table to see which files were accessed by the most small read operations. In this case it was three bamtofastq files. Clicking on the top one tells us that it was read by fastqcheck, as we would expect.
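For readers who want to reproduce that kind of per-file ranking outside Breeze, here is a sketch that assumes a hypothetical CSV trace export with file, operation and bytes columns and a made-up threshold for what counts as a small read; Breeze’s own files view does this for you.

    import csv
    from collections import Counter

    SMALL_READ_BYTES = 4096  # assumed threshold for a "small" read

    def small_reads_per_file(csv_path):
        """Count reads smaller than SMALL_READ_BYTES for each file in a
        hypothetical trace export with columns: file, operation, bytes."""
        counts = Counter()
        with open(csv_path, newline="") as handle:
            for row in csv.DictReader(handle):
                if row["operation"] == "read" and int(row["bytes"]) < SMALL_READ_BYTES:
                    counts[row["file"]] += 1
        return counts

    if __name__ == "__main__":
        for path, count in small_reads_per_file("trace.csv").most_common(5):
            print(f"{count:>10}  {path}")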

Finally, we return to the traffic light report to see if there are any other areas worth optimising. There is a large amount of time spent opening files that are not read or written much. Selecting this segment tells us that the time spent opening files was almost all spent by one call to cmp, which seems odd. Selecting this program and looking at its list of operations in the event view shows that it was just one open operation that took over 21s to complete.

In this case it is likely that the delay is an artefact of the pipeline: the open is blocking on another process being ready, and so is not a problem in itself. This finding can be fed back to the storage administrators and monitored on an ongoing basis. If network and file system logs are available for the time of that operation, they may also explain why the open took so long.
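One way to spot-check this sort of thing on an ongoing basis, sketched here with a placeholder path, is simply to time occasional open() calls on the shared file system and flag any outliers.

    import os
    import time

    def time_open(path):
        """Return how long a single open()/close() pair takes, in seconds."""
        start = time.perf_counter()
        fd = os.open(path, os.O_RDONLY)
        os.close(fd)
        return time.perf_counter() - start

    if __name__ == "__main__":
        for path in ["/path/to/shared/input.bam"]:  # placeholder path
            print(f"{path}: open took {time_open(path):.3f}s")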

Even focusing a profiling exercise on one small part of the pipeline produced useful information about how fastqcheck could be optimised. In an industry as critical as cancer research, any performance gain is a win, but this exercise proved what could be achieved on a greater scale through I/O profiling.