The How, Why and Where of HPC in the Cloud: Understanding the network topology

Network Topology

When you move from an on-prem environment to the cloud, you hand over much of the control over network administration. This can be a blessing and a curse.

On the plus side, the cloud vendors invest heavily in the network as this is critical to the delivery of a flexible on-demand service. This means it is likely to be far better than the network infrastructure that you had on-prem.

On the other hand, the network you had on-prem was set up for you and the architecture was known to you. For most applications it is reasonable to treat an AWS availability zone as a single physical data centre, but that is a crude approximation of a complex compute infrastructure. An availability zone may comprise multiple physical data centres. Applications that span thousands of machines are likely to find their VMs spread across several data centres, with variable latencies between them as a result.

In order to balance out the blessings and the curses, planning and effective data collection are vital. At this point, turning to Douglas Adams could help.

How, why and where

The Hitchhiker’s Guide to the Galaxy says that every civilisation passes through three phases, known as “how, why and where”. At the moment, most hybrid cloud orchestration for HPC is still at the “how do we run in the cloud?” stage. The follow-on question is usually: “How do we run in the cloud efficiently?” This is really just an extension of the original PoC, with better profiling and optimisation tools.

As applications become more portable and bursting gets easier, we’ll enter the “why” stage. Fine-tuning the business case for what should stay on-prem and what should move to the cloud will be possible using the data collected on costs and efficiency. This is where investment in effective telemetry will pay off, because the data collected at the PoC stage can be used to forecast and plan for the “why” stage.

Finally, once we have our applications bursting into the cloud efficiently, we will be ready for the “where” stage. Fine-tuning up to this point is likely to have focused on optimising application I/O, sizing the compute instances correctly and designing the data management strategy. The latter is likely to shape a lot of the big-picture “where” questions, as we pick regions that suit data protection policies and choose availability zones to de-risk critical infrastructure.

Challenges of orchestration

At the moment, on-prem schedulers are competing for a share of the hybrid cloud market. It’s not clear whether they will be able to adapt to such a different environment or whether we’ll see completely new technology win. One challenge for the orchestration technology of tomorrow will be to adapt dynamically to an unknown network topology within an availability zone.

It would be unusual for an on-prem cluster to be spread across multiple data centres, but it should be expected in the cloud. The schedulers of the future will need to be able to derive the locality of available resources and allocate tasks accordingly. To do this, organisations will need to monitor compute resources, I/O latency and applications to be able to predict the communication patterns and choose the appropriate machines for each task.
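As a rough illustration of what “deriving locality” could mean in practice, the sketch below groups nodes into locality clusters from measured pairwise latencies. It is only a minimal example under stated assumptions: the function name group_by_latency, the node names, the latency figures and the 100 microsecond threshold are all hypothetical, and a real scheduler would use far richer telemetry.

```python
from itertools import combinations


def group_by_latency(latency_us, threshold_us):
    """Group nodes that are linked by low-latency paths.

    latency_us maps frozenset({node_a, node_b}) to a measured round-trip
    latency in microseconds (hypothetical telemetry). Two nodes end up in
    the same group if a chain of links at or below threshold_us joins them.
    """
    nodes = sorted({n for pair in latency_us for n in pair})
    parent = {n: n for n in nodes}

    def find(n):
        # Walk up to the group representative, shortening the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(nodes, 2):
        lat = latency_us.get(frozenset((a, b)))
        if lat is not None and lat <= threshold_us:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


# Made-up measurements: a/b are close to each other, as are c/d.
sample = {
    frozenset(("a", "b")): 20,
    frozenset(("c", "d")): 25,
    frozenset(("a", "c")): 400,
    frozenset(("a", "d")): 410,
    frozenset(("b", "c")): 390,
    frozenset(("b", "d")): 405,
}
clusters = group_by_latency(sample, threshold_us=100)
```

A scheduler could then try to place tightly coupled tasks within one cluster and reserve cross-cluster placement for work that communicates rarely.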

Another advantage of the cloud is virtually unlimited scaling when you need it. Why wait a week for the answer when you can scale up and get the results in an hour? As cloud deployments scale far beyond what you would see in an on-prem environment (and as on-prem edges closer to exascale), we’ll see far more nodes fail, and the way we run jobs will have to change to expect and adapt to that.
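To see why failures become the norm at scale, a back-of-the-envelope calculation helps. The sketch below assumes that nodes fail independently and that each has the same chance of failing during a job; both are simplifications, and the 0.1% per-node figure is purely illustrative, not a measured value.

```python
def prob_any_failure(nodes, p_node):
    """Probability that at least one of `nodes` machines fails during a job,
    assuming independent failures with per-node probability p_node."""
    return 1.0 - (1.0 - p_node) ** nodes


# Illustrative per-node failure probability for the duration of one job.
P_NODE = 0.001

small_job = prob_any_failure(100, P_NODE)     # roughly one job in ten is hit
large_job = prob_any_failure(10_000, P_NODE)  # a failure is almost certain
```

Under these assumptions, a job that could once treat a node failure as a rare event has to treat it as the expected case once it spans thousands of machines.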

A far more transactional approach to computing will suit this compute model, but are we close to being able to do that? Many areas of HPC are, in theory, ideal for this kind of compute model, and many organisations already run this way.
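One way to picture a transactional approach is to split a job into small, independent tasks that can simply be run again if the node underneath them disappears. The sketch below is a hypothetical illustration rather than any particular scheduler's behaviour: run_with_retries and flaky_task are made-up names, and a RuntimeError stands in for a node failure.

```python
def run_with_retries(task, args, max_attempts=3):
    """Run an idempotent task, retrying if it fails part-way through.

    Retrying is only safe because the task is assumed to be idempotent:
    running it twice gives the same result as running it once.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task(*args)
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up and surface the failure to the caller


# A stand-in task that "loses its node" on the first two attempts.
calls = {"count": 0}


def flaky_task(x):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated node failure")
    return x * x


result = run_with_retries(flaky_task, (7,))
```

The smaller each task is, the less work is lost when one has to be repeated.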

Making predictions when technology is still in such an emergent phase is extremely difficult. To have the best chance of success, however, we owe it to the network to monitor data effectively.