Containerisation is a powerful tool for encapsulating complex applications and their dependencies, helping to ensure that results are reproducible and scalable.
Genomic pipelines are well suited to data-driven workflows, but until recently efforts to orchestrate multiple containers in a single pipeline were home-grown and hard to maintain. Nextflow is a pipeline orchestration tool that is gaining popularity and goes a long way towards filling this gap.
Ellexus worked with the Science and Technology Facilities Council’s Hartree Centre to profile genomic pipelines orchestrated with Nextflow and Singularity.
How Nextflow is a step towards data-driven workflows
Many HPC centres are turning to hybrid cloud to balance the efficiency of running predictable workloads on dedicated on-premise resources against the rapid scaling, tuning and burst capabilities of the cloud.
To take advantage of hybrid cloud it is important to abstract the temporal, spatial and sequential nature of traditional genomic pipelines so that compute and storage resources can be sized correctly for each stage of the pipeline. For example, in the cloud there are great advantages to running the memory-intensive stages on an expensive high-memory machine, then moving to a cheaper instance for the next stage, which may be I/O-bound instead.
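In Nextflow, this per-stage sizing is expressed through process resource directives. The sketch below is illustrative only; the process names, tools and resource values are hypothetical and not taken from YAMP:

```groovy
// Hypothetical memory-intensive stage: request a high-memory node or instance.
process alignReads {
    memory '64 GB'
    cpus 16

    input:
    path reads

    script:
    """
    bwa mem -t ${task.cpus} reference.fa ${reads} > aligned.sam
    """
}

// Hypothetical I/O-bound stage: a cheaper, smaller instance is sufficient.
process sortAlignments {
    memory '8 GB'
    cpus 4

    input:
    path sam

    script:
    """
    samtools sort -@ ${task.cpus} ${sam} -o sorted.bam
    """
}
```

With a cloud executor, Nextflow can map each process's directives to a different machine type, so the expensive high-memory instance is only paid for during the stage that needs it.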
Nextflow is a first step in moving to data-driven workflow architectures and unlocking the potential that hybrid cloud can bring.
The project: Singularity, The Hartree Centre, YAMP and Breeze
Singularity is a containerisation technology developed for HPC workloads at the Lawrence Berkeley National Laboratory in California, with security, scalability and multi-tenancy at the heart of its design. It is now maintained by Sylabs, which is building a commercial ecosystem around the core technology.
The Hartree Centre helps keep the UK at the forefront of life sciences and other HPC disciplines, supporting academia and industry by providing expert facilities. Ellexus recently worked with the team at the Hartree Centre to profile and improve DL_POLY, a general-purpose classical molecular dynamics (MD) simulation package.
During the Nextflow project, Ellexus and the Hartree Centre used the YAMP pipeline, which is built on Nextflow. To profile YAMP, the team used Ellexus Breeze. Breeze profiles application I/O and visualises process relationships and dependencies, as well as details such as arguments and environment. It is an excellent discovery tool for finding out more about a third-party application and for profiling for performance checking.
Profiling Nextflow and Singularity
There was a small overhead on the Singularity-based pipeline compared with the “bare metal” workflow. This overhead is likely to shrink over time as the Singularity product matures.
The Nextflow profile looked clear and well designed (see picture below). Calls out to the genome pipeline were easy to identify in Breeze.
The I/O was dominated by a large number of reads early on (see picture below), which is typical of genome pipelines. The reads were a mix of small and medium-sized operations taking just a few seconds in total.
Potential for improvement
We did notice a large number of stat calls (more than 700,000, shown in the picture below) performed by the genome pipeline itself, not by Nextflow or Singularity. Although these took only a few hundred milliseconds in total, they place a high load on the metadata server, so it would be good practice to remove them.
Using the files view in Breeze to identify the source of the stat calls, you can easily see that the majority of them are made by the Java pipeline on the two large block files. It looks as though there is a stat call for every read and seek. Eliminating these stat calls would clearly improve the I/O patterns of this pipeline, even if it would make little difference to overall performance.
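The pattern Breeze revealed, a stat before every read, and the obvious fix can be sketched in Python. This is a minimal illustrative sketch, not the actual Java pipeline code:

```python
import os

# Anti-pattern suggested by the profile: re-stat and re-open the file before
# every read. On a parallel file system each os.stat() is a round trip to the
# metadata server, so the metadata load grows with the number of reads.
def read_chunks_with_stat(path, chunk_size=4096):
    chunks = []
    offset = 0
    while True:
        size = os.stat(path).st_size   # one metadata call per chunk read
        if offset >= size:
            break
        with open(path, "rb") as f:
            f.seek(offset)
            chunks.append(f.read(chunk_size))
        offset += chunk_size
    return chunks

# Better: open once and rely on read() returning b"" at end of file,
# so no per-read stat calls are issued at all.
def read_chunks_cached(path, chunk_size=4096):
    chunks = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunks.append(chunk)
    return chunks
```

Both functions return the same data; the second simply avoids asking the file system for the file size before every read, which is the behaviour the profile flagged.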