Scientific Computing Options Maturing in the Cloud

By Agam Shah

August 31, 2023

Supercomputing remains largely an on-premise affair for many reasons that include horsepower, security, and system management. Companies need more time to move workloads to the cloud, but the options are increasing. (See the recently posted HPC-AI forecast from Intersect360 Research.)

In August, Google Cloud and Amazon Web Services announced high-performance computing virtual machines, which effectively are online versions of the computing provided by on-premise systems. The HPC VMs are built on cloud providers’ proprietary tech, including the latest processors, superfast interconnects, security features, and memory capacity.

The HPC VMs support hybrid deployments, where companies can split workloads between on-premise systems and virtual machines offered by AWS and Google. Some HPC users prefer to dispatch low-priority workloads to the cloud, which frees up on-premise computing resources to run more critical workloads.
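The hybrid pattern described above can be sketched in a few lines of Python. This is only an illustration: the Job class, the priority scale, the threshold of 5, and the "on_prem" and "cloud" labels are all hypothetical, and a production setup would rely on a workload scheduler or an integrator's tooling rather than a hand-written router.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int  # hypothetical scale: higher means more critical


def route(job: Job, threshold: int = 5) -> str:
    """Return 'on_prem' for critical jobs and 'cloud' for low-priority ones."""
    return "on_prem" if job.priority >= threshold else "cloud"


def split_workloads(jobs, threshold: int = 5):
    """Group job names by the destination chosen by route()."""
    plan = {"on_prem": [], "cloud": []}
    for job in jobs:
        plan[route(job, threshold)].append(job.name)
    return plan


# Hypothetical job queue, for illustration only.
queue = [
    Job("crash-simulation", 9),
    Job("nightly-regression", 2),
    Job("parameter-sweep", 3),
]
print(split_workloads(queue))
```

With this threshold, only the crash simulation stays on-premise, and the two lower-priority jobs go to the cloud.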

The biggest disadvantage of HPC in the cloud remains bandwidth limitations, given the slow network speeds over large geographical distances. Nevertheless, many engineering and pharmaceutical companies are turning to the cloud because of the rich development tools, a laundry list of data sets, analytical and database tools, and other middleware available to customers. Integrators like Rescale and Altair provide software and support to create shared hybrid environments for HPC applications.

The new VMs from the cloud providers are focused squarely on conventional scientific computing. The systems are not targeted at AI and are not bundled with GPUs. AWS and Google offer pricey instances of Nvidia’s H100 GPUs, targeted at parallel computing and AI applications.

AWS recently announced EC2 Hpc7a, which is a VM based on AMD’s fourth-generation Epyc chips code-named Genoa. The x86-based Hpc7a is an upgrade from the recent EC2 Hpc6a instances, which are based on AMD’s previous-generation Epyc chips code-named Milan.

The Hpc7a has double the memory capacity in its fully loaded VM configurations and 300Gbps network bandwidth. Amazon claimed that Hpc7a is 2.5 times faster than Hpc6a instances. The largest hpc7a.96xlarge instance offers 192 CPU cores and 768GB of DDR5 memory. The VMs support Elastic Fabric Adapter and file systems such as Lustre, which are popular in HPC.

AWS offers other HPC VMs, including the ARM-based Hpc7g, which runs on the homegrown Graviton3E chip. The Riken Center for Computational Science has built a “virtual Fugaku” for Hpc7g, or a cloud version of the software stack in Fugaku, the world’s second fastest supercomputer, on AWS. Fugaku is also built on ARM processors, making replicating the software environment possible.

Google announced the H3 VM instance for HPC in August, which balances price with performance with the help of fast network speeds and a large bevy of CPU cores.

The H3 configurations are based on Intel’s latest Sapphire Rapids CPUs, with each node aggregating 88 CPU cores and 352GB of memory. The VMs are targeted at applications that are not parallelized and are run in single-threaded environments.
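The core and memory figures quoted for the two providers can be compared directly. The short Python sketch below divides memory by core count, using only the numbers cited in this article for the hpc7a.96xlarge instance and the H3 node; the function name is just an illustrative choice.

```python
def memory_per_core_gb(memory_gb: float, cores: int) -> float:
    """Memory available per CPU core, in GB."""
    return memory_gb / cores


# (memory in GB, CPU cores), as quoted in the article.
instances = {
    "hpc7a.96xlarge": (768, 192),
    "H3": (352, 88),
}

for name, (memory_gb, cores) in instances.items():
    print(f"{name}: {memory_per_core_gb(memory_gb, cores):.1f} GB per core")
```

Both configurations work out to 4 GB of memory per core.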

The virtual machines are built on top of the Intel-Google co-developed custom data processor E2000, code-named Mount Evans, which has 16 ARM-based Neoverse N1 CPU cores. The H3 nodes can communicate at speeds of 200 Gbps.
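To put the 200 Gbps figure in context, the Python sketch below estimates idealized transfer times. The 1 TB data set and the 1 Gbps wide-area link are hypothetical values chosen for illustration, and the calculation ignores protocol overhead and latency, so real transfers would take longer.

```python
def transfer_seconds(data_gigabytes: float, link_gbps: float) -> float:
    """Idealized transfer time: data size in gigabits divided by line rate."""
    data_gigabits = data_gigabytes * 8  # 1 byte = 8 bits
    return data_gigabits / link_gbps


DATASET_GB = 1000  # hypothetical 1 TB data set

links = [
    ("H3 node-to-node, 200 Gbps", 200),
    ("hypothetical WAN link, 1 Gbps", 1),
]

for label, gbps in links:
    print(f"{label}: {transfer_seconds(DATASET_GB, gbps):.0f} s")
```

Under these assumptions, the data set moves between H3 nodes in 40 seconds but needs 8,000 seconds, more than two hours, over the slower long-distance link.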

Google’s benchmarks compared the H3 to previous C2 VMs based on Intel’s Cascade Lake CPUs, which are two generations behind Sapphire Rapids. The H3 CPU-only VM is three times faster in performance-per-node and can save customers 50% in costs.
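One way to read the two figures together is sketched below in Python. It assumes, as an interpretation rather than anything Google has stated, that both claims refer to the same fixed workload, that job cost equals node-hours multiplied by the price per node-hour, and that a node three times faster needs one-third of the node-hours.

```python
def implied_price_ratio(speedup: float, cost_savings: float) -> float:
    """Ratio of new to old price per node-hour implied by the two claims.

    Assumes job cost = node-hours * price per node-hour, and that the
    job needs 1/speedup as many node-hours on the faster node.
    """
    relative_cost = 1.0 - cost_savings  # 0.5 means the job costs half as much
    return relative_cost * speedup


ratio = implied_price_ratio(speedup=3.0, cost_savings=0.5)
print(f"Implied H3-to-C2 price per node-hour: {ratio:.2f}x")
```

Under those assumptions, an H3 node-hour could cost about 1.5 times as much as a C2 node-hour and the job would still be half the price.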

The comparison is not apples-to-apples, as server chips are typically benchmarked against previous-generation chips, in this case, Ice Lake. But Google’s comparison is more in line with server upgrade cycles, which occur every two to three years.

At its recent Google Cloud Next summit, the company expanded its high-performance computing options for AI. The company announced pods with its latest TPU v5e AI chips and the general availability of its A3 supercomputing systems, which can host 26,000 Nvidia GPUs and support parallel computing. Both chips are targeted at training and inference in AI applications.

Google Cloud’s Hugo Saleh, director of product management for HPC, answered some questions from HPCwire about the H3 and its design.

HPCwire: As a public preview, who can test H3? When will it become publicly available?

Saleh: We’ve gotten valuable feedback from select customers and partners over the last few weeks while H3 was in private preview. We announced the start of our public preview period, where any interested customer can access H3 VMs free of charge. To begin using H3 instances, customers can select H3 under the Compute Optimized machine family when creating a new VM or GKE node pool in the Google Cloud console. H3 VMs are currently available in the us-central1 (Iowa) and europe-west4 (Netherlands) regions. Following the public preview window, general availability will be announced later this year.

HPCwire: Does Google provide help in moving HPC workloads from on-prem to the new instances?

Saleh: There are a number of options to help HPC customers on their journey to Google Cloud. We recommend connecting with Google Cloud’s HPC specialists, who can help with most questions and can bring in additional resources as needed to help with migrations. For customers needing specialized support, we also have a Professional Services organization as well as an extensive list of partners ready to help HPC users migrate their workloads from on-premises or other clouds.

HPCwire: Is real-time a priority here? HPC users care about speed, but bandwidth to deliver results over the Internet is a bottleneck.

Saleh: Google invests heavily in making access to the cloud seamless, secure, and reliable at a worldwide scale. Time to insight and results is key … which is why we have designed the H3 platform with 200 Gbps low-latency networking, twice the bandwidth of our previous generation VMs. H3 machines also support compact placements and are deployed in large, dense pools to reduce latency and network jitter, improving HPC application scalability.

HPCwire: Why are partners like Rescale.AI important? How do they bridge the gap between HPC users and Google Cloud?

Saleh: HPC users and their workloads span a wide spectrum of needs and tend to have a diverse set of requirements. There is already a well-established and rich ecosystem of software and services companies adept at supporting and delivering solutions to address those users’ needs. Partnering with companies like Rescale, Altair, and Parallel Works, among others, to support custom end-to-end solutions enables customers to use Google Cloud products best. In some cases, this might look like supporting a customer’s move to the cloud, optimizing for a hybrid environment, or deploying specific applications at scale. In other cases, it might be the need to support a specific operating system or scheduler that’s key to a customer’s workload and environment.
