Whitepaper: Accelerating Wellcome Sanger Institute’s cloud-based genomics pipelines through I/O profiling

Ellexus has published a whitepaper presenting an overview of the work by the Wellcome Sanger Institute to make one of their cancer pipelines portable and to tune it for cloud deployment using the I/O profiling tools from Ellexus.

The full title of the whitepaper is: “Accelerating cloud-based genomics pipelines through I/O profiling for analysis of more than 3,000 whole genome pairs on AWS”.

The paper includes a comparison of the time and cost of running the pipeline on various storage options on AWS (Amazon Web Services). Ellexus profiled the containerised workload on AWS using Breeze and determined that the default storage option offers better value for money than the faster, more expensive options. This cost saving is possible because of the effort the scientists at the Wellcome Sanger Institute have put into making the pipeline fast, agile and cloud-ready.

Read the opening below and download the full whitepaper.

Cancer, Ageing and Somatic Mutation Group at the Wellcome Sanger Institute

The research of the Wellcome Sanger Institute’s Cancer, Ageing and Somatic Mutation (CASM) Group focuses on understanding the mutational processes that lead to various age-related diseases and cancer. This is a diverse and rapidly changing field using both model organisms and patient samples.

CASM-IT is a dedicated and motivated team of bioinformaticians and software developers who support scientific staff, simplifying the use of tools, incorporating them into large-scale pipelines and presenting data in more easily digestible forms. The team develop many of their own tools.

Migration to containers and cloud compute

In 2013, two of the largest funders of genomic research agreed to one of the most ambitious data analysis projects at that time: the ICGC/TCGA PanCancer Analysis of Whole Genomes (PCAWG). The goal of the project was to analyse 2,000 donors, generating uniform and comparable data that was to be processed at many sites.

In order to participate, the “Sanger” pipeline (CASM-IT) needed to be usable at many sites. CASM-IT started working to solve the tie-ins. In 2015, the decision was made to switch to Docker and for significant processing to be carried out using Amazon Web Services (AWS). The project went well, although the initial deployments were handled without containers. Final versions of all analysis flows exist on dockstore.org.

The result of this migration to containers and the cloud is that tools are now far more portable. More users can access the tools and it is easier for the IT team to get buy-in for changes to working practices. Tools are now being employed in a PanProstate analysis project, which is of a similar size to PCAWG but restricted to a single cancer type.

Containers aren’t always as simple as they seem

It is often assumed that containers just run. However, many variables can mean that a container will not perform as well as it should, or will not run at all. Problems arise particularly when the software being containerised has not been written by a software developer. Such software can be poorly defined or unstable, with many unclear dependencies that can be left behind when the container is generated.

The ideal candidate for containerisation is a well-understood tool with no other dependencies and in-house support. Ideally it will have been designed to be species agnostic and with I/O taken into consideration. An example of the ideal scenario is dockstore-cgpwgs, the CASM-IT whole genome sequencing (WGS) analysis container. The team knows the order of events and can easily map them back to a CPU utilisation trace. It is clear where CPU usage is expected to be low.

This is where the tools from Ellexus are very useful: they were able to identify why the team was hitting I/O bottlenecks in areas where they expected full CPU utilisation.
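
The idea of mapping known pipeline stages back to a CPU utilisation trace can be sketched in a few lines of Python. This is a minimal illustration only, not how Breeze or the CASM-IT tooling works: the stage names, the sample values and the 0.7 threshold are all hypothetical. It flags stages that are expected to be CPU-bound but show low average utilisation, which is the kind of region where an I/O bottleneck would be worth investigating.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str        # hypothetical stage label
    start: int       # first sample index (inclusive)
    end: int         # last sample index (exclusive)
    cpu_bound: bool  # True if full CPU utilisation is expected


def find_suspect_stages(trace, stages, threshold=0.7):
    """Return names of CPU-bound stages whose mean utilisation is below threshold.

    trace is a list of CPU utilisation samples in the range 0.0 to 1.0.
    The threshold is an arbitrary value chosen for this sketch.
    """
    suspects = []
    for stage in stages:
        samples = trace[stage.start:stage.end]
        if not samples:
            continue  # no samples recorded for this stage
        mean = sum(samples) / len(samples)
        if stage.cpu_bound and mean < threshold:
            suspects.append(stage.name)
    return suspects


# Invented sample data, for illustration only.
trace = [0.2, 0.3, 0.95, 0.9, 0.92, 0.4, 0.35, 0.5, 0.9, 0.88]
stages = [
    Stage("load_input", 0, 2, cpu_bound=False),    # low CPU is expected here
    Stage("align", 2, 5, cpu_bound=True),
    Stage("call_variants", 5, 8, cpu_bound=True),  # low CPU here is unexpected
    Stage("annotate", 8, 10, cpu_bound=True),
]

print(find_suspect_stages(trace, stages))
```

With this invented data only the "call_variants" stage is reported, because it is marked as CPU-bound yet averages well under the threshold. A stage such as "load_input", where low CPU usage is expected, is deliberately ignored.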