This article was first published on InsideHPC.
We can all admit to being pressured into investing in a new solution as soon as it becomes available. We want to be the fastest, the best-informed, the best-equipped. However, sometimes there can be a real cost benefit to mulling things over for a while before leaping in with a purchase order.
Take the switch to cloud storage, which many HPC organizations are considering. While many are put off by the cost of rearchitecting their infrastructure, they should also consider all the secondary costs that come with adopting a significantly different compute paradigm. Beyond everything you have to budget for, there are always a dozen things that can delay a project, hamper productivity and incur huge costs in lost opportunity.
The following are several secondary costs that an HPC organization should mull over before leaping in with the purchase of a whole new storage system:
Access to the system
On-premises systems usually have a scheduler with a well-honed set of submission scripts and policies to ensure that everything ends up in the right place. These scripts are often informally documented and written without modern development practices such as code review, which makes these scripted environments hard to untangle.
It is easy for legacy dependencies to trip up the best-laid plans when migrating to a new system, and those trips can waste weeks or even months if not timetabled from the start.
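One cheap way to surface those legacy dependencies before they derail a migration is to grep the submission-script repository for hard-coded storage paths. The sketch below is a minimal audit, assuming hypothetical mount points like `/gpfs` and `/lustre`; substitute whatever your own scripts actually reference.

```python
import re
from pathlib import Path

# Hypothetical legacy mount points that will not exist in the new
# environment; adjust these patterns to match your own site.
LEGACY_PATHS = [r"/gpfs/\S+", r"/lustre/\S+", r"/opt/cluster/\S+"]
PATTERN = re.compile("|".join(LEGACY_PATHS))

def find_legacy_dependencies(script_dir):
    """Return {script name: [offending lines]} for every submission
    script in script_dir that references a hard-coded legacy path."""
    hits = {}
    for script in Path(script_dir).glob("*.sh"):
        matches = [line.strip()
                   for line in script.read_text().splitlines()
                   if PATTERN.search(line)]
        if matches:
            hits[script.name] = matches
    return hits
```

Running something like this against the whole script collection early gives you a concrete punch list, rather than discovering each broken path one failed job at a time.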
Telemetry and analytics
Existing telemetry may not work in the new environment and may not give you the same information. Not only will you have to set up a new analytics framework, but you will have to re-learn what normal means for you and how to react to the new measurements.
This is less of a problem than it sounds because few datacenters have a comprehensive analytics framework. For a lot of HPC organizations, the adoption of a new system is an opportunity to get things right.
Maintenance know-how
Every complex product has skeletons in the closet and HPC hardware and software are no exception. No matter how ropey the existing system is, chances are it works well a lot of the time and knowing how to maintain that represents significant investment and IP.
Knowing how to maintain and tune the systems you have chosen creates vendor lock-in and is a major hurdle to adopting new systems. Do you retrain your staff? Hire new people? Do you need consultants to get you off the starting blocks?
User resistance
People don’t like change and users are people (yes, they really are). No matter how simple you make the new system to use, if it is different from the old system it will be met with resistance. This is especially true of any new technology that requires users to learn new skills or recode any applications.
For example, object storage offers a lot of advantages for many applications, but adapting workflows to pull data from object is quite a bit of work. Users continue to create new applications that are tied to block storage.
Tuning and optimization
This work is sometimes given over to a separate team of experts, but more often than not, power users act like lone rangers pulling performance out of midnight hours at the command line. Some optimizations will benefit all systems – removing failed I/O will always make an application faster. Other optimizations, such as removing small reads and random reads, will become less important as super-fast random-access storage becomes more affordable.
Some tuning activities are so system specific that they need to be undone for newer systems and that creates a race that leaves the users always a step behind. The science of tuning has a long way to go to ensure resources are spent on permanent improvements.
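The small-read problem mentioned above is easy to demonstrate in miniature. The sketch below contrasts reading a file in tiny unbuffered chunks (every read is a separate OS call, and on a parallel file system a separate round trip) with a single large read; both produce identical bytes, so consolidating the reads is a pure win. The chunk size is an arbitrary illustration.

```python
def read_small(path, chunk=16):
    """Read a file in tiny unbuffered chunks -- each read() goes
    straight to the OS, mimicking a small-read-heavy application."""
    data = bytearray()
    with open(path, "rb", buffering=0) as f:
        while True:
            piece = f.read(chunk)
            if not piece:
                break
            data += piece
    return bytes(data)

def read_large(path):
    """Fetch the same data in one large request."""
    with open(path, "rb") as f:
        return f.read()
```

A 64 KiB file costs `read_small` four thousand system calls versus one for `read_large`; on networked HPC storage that gap is what midnight tuning sessions are usually chasing.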
Conclusion: Being late to the party isn’t always a bad thing. Sometimes it’s worth really putting thought into the repercussions of what you order from the bar.