Building for the cloud: Dataflows not workflows

Once you’ve made the decision to switch to cloud storage, the next step is to make it happen. Migrating applications to the cloud can be a daunting prospect. Where do you begin?

Some people choose to ‘lift and shift’ applications into the cloud by mirroring their on-premises infrastructure. This is a great way to get started, but it is not a cost-effective or efficient way to operate in the long term, and it certainly doesn’t take advantage of the dynamic nature of the public cloud.

The following ideas for creating dataflows, not workflows, will help you make the most of cloud storage. Get in touch with us to find out how the tools from Ellexus can help you develop your go-to-cloud strategy.

I/O-centric orchestration

For many applications, I/O is going to be the bottleneck. We need to build frameworks that move data back and forth with an understanding of what the application needs, the constraints of the hardware and the charging models of the environment. A truly intelligent storage solution needs to weigh all of these factors at a speed and scale no human ever could.

Instead of designing workflows where the algorithm is the main factor in the architecture, design dataflows where the location and access patterns of the data are key, as sketched below:

  • Spin up data and compute
  • Process data on fast local storage
  • Save results back
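
As a minimal sketch of that pattern, the stage below pulls its input from object storage onto fast local scratch, runs the compute, and pushes the results back so the instance can be torn down afterwards. It assumes AWS S3 via boto3; the bucket name, keys, scratch path and application command are all illustrative.

import subprocess

import boto3  # assumes the AWS SDK for Python is installed

s3 = boto3.client("s3")
SCRATCH = "/scratch"  # fast local NVMe on the compute instance (illustrative)

def run_stage(input_key, output_key, bucket="my-data-bucket"):
    local_in = f"{SCRATCH}/input.dat"
    local_out = f"{SCRATCH}/output.dat"

    # 1. Spin up data alongside the compute: stage the input onto local storage
    s3.download_file(bucket, input_key, local_in)

    # 2. Process the data on fast local storage
    subprocess.run(["my_app", "--in", local_in, "--out", local_out], check=True)

    # 3. Save the results back to object storage before the node is torn down
    s3.upload_file(local_out, bucket, output_key)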

The steps of a multi-stage pipeline can be processed on different compute nodes. If the steps are separated in time as well as space, the data can be staged to object storage in between, giving complete abstraction between compute steps.
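
To illustrate that decoupling, the sketch below names each pipeline step only by the object-storage key it reads and the key it writes, reusing the hypothetical run_stage() helper from above. Each step could be submitted as a separate job on a separate instance, hours apart; the only contract between them is the key in object storage.

# Hypothetical two-step pipeline: the keys are illustrative and each entry
# could be run on a different node at a different time.
PIPELINE = [
    # (input key,               output key)
    ("raw/sample-001.dat",      "aligned/sample-001.dat"),
    ("aligned/sample-001.dat",  "variants/sample-001.dat"),
]

for input_key, output_key in PIPELINE:
    run_stage(input_key, output_key)  # defined in the sketch above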

A lot of the infrastructure can be deployed and torn down dynamically along with the data and applications. In the short term, that might mean you need schedulers that match your on-premises infrastructure and deployment methods, but as hybrid cloud matures we’ll start to see the job submission frameworks settle and a few technologies win out.

Trading off CPU, memory and I/O

A lot of HPC applications support checkpointing, with varying degrees of success. One way to apply this is to run applications on low-memory instances and checkpoint and migrate them to larger instances when they run out of memory. This is a great cost-saver for workloads whose memory consumption is data-driven, but you will have to pay for the extra storage your snapshots consume, which might outweigh the benefits.
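
A wrapper along these lines is sketched below. It assumes the application writes a checkpoint to shared or object storage when it receives SIGUSR1, a convention some HPC codes follow but by no means all; the memory threshold and polling interval are illustrative.

import signal
import subprocess
import time

import psutil  # third-party process-monitoring library

MEMORY_LIMIT_BYTES = 14 * 1024**3  # leave headroom on a 16 GB instance
POLL_SECONDS = 30

def run_with_checkpoint(cmd):
    """Run cmd, asking it to checkpoint if memory approaches the limit.

    Returns 'done' or 'failed' if the run finished here, or 'migrate' if it
    was checkpointed so the orchestrator can resume it on a larger instance.
    """
    proc = subprocess.Popen(cmd)
    watcher = psutil.Process(proc.pid)
    while proc.poll() is None:
        try:
            rss = watcher.memory_info().rss
        except psutil.NoSuchProcess:
            break  # the process exited between the poll and the check
        if rss > MEMORY_LIMIT_BYTES:
            proc.send_signal(signal.SIGUSR1)  # ask the app to checkpoint
            proc.wait()
            return "migrate"
        time.sleep(POLL_SECONDS)
    proc.wait()
    return "done" if proc.returncode == 0 else "failed"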

A better approach might be to run everything on cheaper, low-memory instances and simply re-run the jobs that run out of memory on larger instances. Assuming you have time to do this, it’s easy to orchestrate and easy to tune for cost efficiency, particularly if only a small number of runs have large memory requirements.
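
A sketch of that escalation logic is below. The instance type names and the submit() callable are hypothetical stand-ins for whatever your scheduler or cloud API provides; exit code 137 (SIGKILL, usually the OOM killer on Linux) is used as the out-of-memory signal.

# Hypothetical ladder of instance sizes, cheapest first.
INSTANCE_LADDER = ["small-8gb", "medium-32gb", "large-128gb"]

def run_with_escalation(job, submit):
    """Try the job on the cheapest instance first, escalating only on OOM.

    submit(job, instance_type) is assumed to run the job to completion and
    return its exit code.
    """
    for instance_type in INSTANCE_LADDER:
        exit_code = submit(job, instance_type)
        if exit_code == 0:
            return instance_type  # succeeded at this size
        if exit_code != 137:
            raise RuntimeError(f"{job} failed for a non-memory reason")
    raise RuntimeError(f"{job} ran out of memory even on {INSTANCE_LADDER[-1]}")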

Similarly, it can be cheaper to run applications with bursty I/O on slower storage and have the CPU wait for the data. We found this to be the case when sizing AWS storage for genome pipelines at the Sanger Institute, even though we were using large compute instances with lots of memory.
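
As a back-of-the-envelope illustration with purely hypothetical prices (substitute your own instance and storage rates), waiting on slower storage can still come out cheaper overall:

INSTANCE_PER_HOUR = 2.00      # hypothetical compute price, $/hour
FAST_STORAGE_PER_HOUR = 1.50  # hypothetical provisioned-IOPS volume, $/hour
SLOW_STORAGE_PER_HOUR = 0.20  # hypothetical throughput-optimised volume, $/hour

def run_cost(runtime_hours, storage_per_hour):
    return runtime_hours * (INSTANCE_PER_HOUR + storage_per_hour)

# Fast storage finishes in 10 hours; on slow storage the CPU waits on the
# bursty I/O and the run stretches to 12 hours, yet the bill is still lower.
print(run_cost(10, FAST_STORAGE_PER_HOUR))  # 35.0
print(run_cost(12, SLOW_STORAGE_PER_HOUR))  # 26.4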

Keeping pace as the cloud evolves

The main point to keep in mind about migrating applications to the cloud, though, is that the industry and its offerings are still evolving, at the same time as each organisation’s own requirements. By keeping an open mind, you might find yourself making some very different cost trade-offs from the ones you expected.