Building for the cloud: Dataflows not workflows

Once you’ve made the decision to tackle cloud migration, the next step is to make it happen. Migrating applications to the cloud can be a daunting prospect. Where do you begin?

Some people choose to ‘lift and shift’ to get applications into the cloud by mirroring on-premises infrastructure. This is a great way to get started, but it is not a cost-effective or efficient way to operate in the long term. It certainly doesn’t take advantage of the dynamic nature of public cloud.

I/O-centric orchestration

For many applications, I/O is going to be the bottleneck. We need to build frameworks that can move data back and forth with an understanding of what the application needs, the constraints of the hardware and how the environment around them is changing. A truly intelligent storage solution needs to understand all of these factors in a way no human ever could.

Instead of designing workflows where the algorithm is the main factor in the architecture, design dataflows where the location and access patterns of the data are key (a minimal sketch of the pattern follows the list):

  • Spin up data and compute
  • Process data on fast local storage
  • Save results back
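
Here is a minimal sketch of a single dataflow stage along those lines, assuming an S3 bucket as the data store and local NVMe scratch on the compute instance; the bucket name, object keys and the run_tool command are placeholders rather than a real pipeline.

    import subprocess
    import tempfile
    from pathlib import Path

    import boto3

    s3 = boto3.client("s3")

    def run_stage(input_key, output_key, bucket="my-data-lake"):
        """Stage data in, process it on fast local storage, save the results back."""
        with tempfile.TemporaryDirectory(dir="/scratch") as scratch:
            local_in = Path(scratch) / "input.dat"
            local_out = Path(scratch) / "output.dat"

            # Spin up: pull the input from object storage onto local scratch.
            s3.download_file(bucket, input_key, str(local_in))

            # Process the data on fast local storage (placeholder command).
            subprocess.run(["run_tool", str(local_in), "-o", str(local_out)], check=True)

            # Save the results back; the scratch space disappears with the instance.
            s3.upload_file(str(local_out), bucket, output_key)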

Multi-stage pipelines can be processed on different compute nodes. Data can be staged to object storage if the pipeline steps are separated in time as well as space, giving complete abstraction between compute steps.
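
As a sketch of what that abstraction can look like, the only contract between the steps below is an object-storage URI. submit_job stands in for whatever batch scheduler or service you happen to use and is hypothetical here, as are the image names and keys; each step can land on a different node and run hours or days after the previous one, as long as its input object exists.

    STEPS = [
        {"name": "align", "image": "aligner:latest",
         "input": "s3://my-pipeline-data/raw/sample1.fastq",
         "output": "s3://my-pipeline-data/aligned/sample1.bam"},
        {"name": "call-variants", "image": "caller:latest",
         "input": "s3://my-pipeline-data/aligned/sample1.bam",
         "output": "s3://my-pipeline-data/calls/sample1.vcf"},
    ]

    def submit_pipeline(steps, submit_job):
        """Submit each step so it depends only on its predecessor's output object."""
        previous = None
        for step in steps:
            # Each job stages its input in, computes on local storage and
            # writes its output key; no shared filesystem between steps.
            previous = submit_job(
                name=step["name"],
                image=step["image"],
                env={"INPUT_URI": step["input"], "OUTPUT_URI": step["output"]},
                depends_on=previous,
            )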

A lot of the infrastructure can be deployed and torn down dynamically along with the data and applications. In the short term, that might mean you need schedulers to match your on-premises infrastructure and deployment methods, but as hybrid cloud matures, we’ll start to see the job-submission frameworks settle down and a few technologies win out.

Trading off CPU, memory and I/O

A lot of HPC applications support checkpointing, with varying degrees of success. One way to exploit this is to run applications on low-memory instances and migrate them to larger instances when they run out of memory. This is a great cost-saver for workloads with data-driven memory consumption, but you will need to pay for extra storage to hold your checkpoint snapshots, and that cost might outweigh the benefits.
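
A quick back-of-the-envelope check makes that trade-off explicit. The rates below are made-up numbers, not real cloud prices; the point is simply to compare the instance saving against the cost of keeping the snapshots around for the duration of the run.

    def checkpointing_pays_off(run_hours, small_rate, large_rate,
                               snapshot_gb, storage_rate_gb_month):
        """True if the saving from the smaller instance beats the snapshot storage cost."""
        instance_saving = run_hours * (large_rate - small_rate)
        # Pro-rate the monthly storage price over the run (roughly 730 hours per month).
        snapshot_cost = snapshot_gb * storage_rate_gb_month * (run_hours / 730)
        return instance_saving > snapshot_cost

    # Illustrative numbers only: a 100-hour run keeping 500 GB of checkpoints.
    print(checkpointing_pays_off(run_hours=100, small_rate=0.20, large_rate=0.35,
                                 snapshot_gb=500, storage_rate_gb_month=0.05))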

A better approach might be to run everything on cheaper, low-memory instances and simply re-run, on larger instances, the jobs that run out of memory. Assuming you have the time to do this, it’s easy to orchestrate and easy to tune for cost efficiency, particularly if only a small number of runs have large memory requirements.
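
A sketch of that ‘run cheap, re-run the failures’ strategy might look like the following. submit_job and wait_for are hypothetical wrappers around your scheduler, and the instance type names are only illustrative.

    SMALL, LARGE = "r6i.large", "r6i.4xlarge"   # illustrative instance types

    def run_with_oom_retry(jobs, submit_job, wait_for):
        """Run every job on a cheap low-memory instance first, then re-run
        only the ones that ran out of memory on a larger instance."""
        handles = [submit_job(job, instance_type=SMALL) for job in jobs]
        results = [wait_for(handle) for handle in handles]

        # Exit code 137 (SIGKILL) is the usual signature of the kernel OOM killer.
        retries = [submit_job(job, instance_type=LARGE)
                   for job, result in zip(jobs, results)
                   if result.exit_code == 137]
        retried = [wait_for(handle) for handle in retries]
        return results, retried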

Similarly, for applications with bursty I/O, it can be cheaper to run on slower storage and have the CPU wait for the data. We found this to be the case when sizing AWS storage for genome pipelines at the Sanger Institute, even though we were using large compute instances with lots of memory.

The main point to keep in mind about migrating applications to the cloud, though, is that the industry and its offerings are still evolving, at the same time as each organisation’s own requirements. Keep an open mind and you may find the cost trade-offs look quite different from the ones you expected.