With ROCm software and Instinct MI200 GPUs, AMD has critical ecosystem mass

Sponsored Feature: Quality hardware is the foundation of any computing platform. But hardware, by itself, is never enough to create a platform. Two other important things are needed for a platform to be realized.

After hardware, a platform requires a full stack of software that can be easily and, in our view, relatively inexpensively adopted for application development on that hardware base.

This software stack should be both inclusive and comprehensive, encompassing algorithms and libraries tailored to target markets as well as a software development kit that supports familiar programming languages and programming models. Even if the software layer of this platform is not complete at the start – none ever is – it must reach a reasonable level of completeness within a few years.

In terms of speed of development and adoption, it helps if this software layer is open source, but that is not a condition for success. And when it comes to deploying the platform in production environments, the software stack should have enterprise-grade technical support and a reasonable pricing model. The latter allows companies to bring to production, at scale, what developers have created on their laptops and proof-of-concept systems.

The third thing needed to build a computing platform is an ecosystem of developers who build applications that run on the combined hardware and software. Just as a tree falling in the woods makes no sound if no one is there to hear it, great hardware and great software do not make a credible platform until a critical mass of people adopts them.

By this definition, the AMD ROCm open software stack is what turns a server or cluster of servers using AMD EPYC processors (or another vendor's CPUs) and AMD's Radeon and Instinct accelerators into a true platform. We believe this is a platform that can now compete with Nvidia's hardware plus its CUDA stack and Intel's hardware plus its oneAPI stack. As an added touch, ROCm can also deploy GPU code to Nvidia accelerators and to any other accelerator that supports C++.

The software stack is always the hardest part of the platform to develop, and ROCm is no exception in this regard. AMD had been working on heterogeneous CPU-GPU computing, including its Heterogeneous System Architecture, for many years when Nvidia suddenly took the HPC market by storm with its CUDA software stack in 2008 and then rode the shockwave of the AI explosion in 2012, extending CUDA to cover machine learning training and inference workloads in addition to HPC simulation and modeling.

ROCm was first unveiled as “Project Boltzmann” at the SC15 supercomputing conference in November 2015 as a way to provide an open source alternative to Nvidia’s closed source CUDA stack. Project Boltzmann made it possible to write C++ code that could be offloaded not only to AMD GPUs but also to Nvidia GPUs via the Heterogeneous-compute Interface for Portability, or HIP, API – a CUDA-like API whose runtimes can target a variety of processors as well as Nvidia and AMD GPUs.
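To give a sense of what that looks like in practice, here is a minimal HIP sketch of a vector add (a hypothetical example, not code from AMD's documentation); the kernel syntax and runtime calls mirror CUDA almost one-for-one, which is the whole point:

    #include <hip/hip_runtime.h>
    #include <vector>
    #include <cstdio>

    // Kernel syntax is the same as CUDA: __global__, blockIdx, threadIdx.
    __global__ void vector_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);
        float *da, *db, *dc;
        hipMalloc(&da, n * sizeof(float));   // mirrors cudaMalloc
        hipMalloc(&db, n * sizeof(float));
        hipMalloc(&dc, n * sizeof(float));
        hipMemcpy(da, ha.data(), n * sizeof(float), hipMemcpyHostToDevice);
        hipMemcpy(db, hb.data(), n * sizeof(float), hipMemcpyHostToDevice);
        hipLaunchKernelGGL(vector_add, dim3((n + 255) / 256), dim3(256), 0, 0,
                           da, db, dc, n);
        hipMemcpy(hc.data(), dc, n * sizeof(float), hipMemcpyDeviceToHost);
        printf("hc[0] = %f\n", hc[0]);       // expect 3.0
        hipFree(da); hipFree(db); hipFree(dc);
        return 0;
    }

Compiled with hipcc, the same source targets AMD GPUs through ROCm or Nvidia GPUs through HIP's CUDA back end.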

The GPU runtimes are included in the ROCm stack, which is open source and available from AMD and on GitHub, and the CPU runtimes are open source on GitHub as well. The ROCm stack also supports OpenMP for C and C++ programming on multithreaded CPUs and GPUs; here too, the GPU runtime for OpenMP is included in ROCm while the CPU compiler and runtime are available on GitHub.
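To illustrate the OpenMP path (a minimal sketch, assuming a ROCm compiler built with target offload support), a single directive is enough to send a loop to the GPU:

    #include <cstdio>

    int main() {
        const int n = 1 << 20;
        static float a[1 << 20], b[1 << 20], c[1 << 20];
        for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        // The map() clauses move the arrays to and from the GPU;
        // the loop body itself is unchanged C++.
        #pragma omp target teams distribute parallel for map(to: a, b) map(from: c)
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];

        printf("c[0] = %f\n", c[0]);  // expect 3.0
        return 0;
    }

With ROCm's LLVM-based compiler this would be built with -fopenmp plus an offload-target flag naming the GPU architecture; the exact compiler driver name and flags vary by release.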

The ROCm stack has matured at an accelerated rate, right alongside ever-improving AMD Instinct GPU compute hardware, starting with the common math libraries used in HPC and AI as well as the OpenCL effort to run C and C++ on heterogeneous compute engines, which originally came out of Apple and became a standard in 2008.

Building a platform is always a journey, and it is one AMD has been on since throwing its weight behind the OpenCL approach to building heterogeneous applications and then launching Project Boltzmann in 2015 to create a larger platform that could rival Nvidia’s CUDA.

AMD first focused on the AI opportunity with ROCm, and 2018’s ROCm 2.0 stack was primarily aimed at machine learning applications. But make no mistake: AMD has a long history in traditional high performance computing, and it was only able to win exascale-class systems because of its commitment to turning ROCm into a full, comprehensive stack for HPC and AI workloads alike, which arguably happened in 2020 when ROCm 4.0 shipped.

Significantly, ROCm 4.5 – released in November 2021 alongside the preview of the AMD Instinct MI200 series accelerators that are deployed in the 1.5 exaflops “Frontier” supercomputer at Oak Ridge National Laboratory, the first US system to break the exaflops barrier at 64-bit floating point precision – added unified memory support across CPUs and GPUs.
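A minimal sketch of what that unification buys at the code level, using HIP's managed memory allocation (a hypothetical example; behavior depends on hardware and driver support):

    #include <hip/hip_runtime.h>
    #include <cstdio>

    __global__ void scale(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        const int n = 1024;
        float *x;
        // One allocation visible to both CPU and GPU; no explicit copies.
        hipMallocManaged(&x, n * sizeof(float));
        for (int i = 0; i < n; i++) x[i] = 1.0f;                    // CPU writes
        hipLaunchKernelGGL(scale, dim3(4), dim3(256), 0, 0, x, n);  // GPU updates
        hipDeviceSynchronize();
        printf("x[0] = %f\n", x[0]);                                // expect 2.0
        hipFree(x);
        return 0;
    }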

Cache coherency between CPUs and GPUs has also evolved in the ROCm software stack over time, as shown in the table below, all with the goal of making GPU compute engines more transparent and streamlined to program and of improving performance through better hardware (AMD’s Infinity Fabric links between CPUs and GPUs are key here) and better memory management techniques.

With ROCm 5.0 generally available, AMD is turning its attention back to developers, ensuring that the software stack works not only on the new AMD Instinct MI200 series accelerators (and exploits 64-bit floating point in their GPU matrix engines), but also on workstations using Radeon PRO W6800 GPUs.

“With ROCm, we have done a lot of work over the years to deliver a production-ready software stack that can go live in deployments, but that’s not the end of the story,” Tushar Oza, director of software product management for datacenter GPUs at AMD, tells The Next Platform. “Developers can create and test their code on a workstation, then move it to a supercomputer and have it run well there. And we also want to give developers choice – not just in hardware, but in programming environments through our support for OpenMP target offloading and the HIP APIs.”

Compatibility with the Nvidia CUDA stack will be key to the rapid adoption of AMD GPUs and the ROCm software stack by the HPC and AI communities. It all starts with an equivalence between the math and communication libraries.
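To give a flavor of that equivalence, here is an illustrative sampling of the mapping (not AMD's complete list):

    cuBLAS    →  rocBLAS / hipBLAS      (dense linear algebra)
    cuFFT     →  rocFFT / hipFFT        (fast Fourier transforms)
    cuSPARSE  →  rocSPARSE / hipSPARSE  (sparse linear algebra)
    cuRAND    →  rocRAND / hipRAND      (random number generation)
    cuDNN     →  MIOpen                 (deep learning primitives)
    NCCL      →  RCCL                   (collective communication)
    Thrust    →  rocThrust              (parallel algorithms)

The hip-prefixed libraries are thin portability layers that dispatch to the roc libraries on AMD hardware and to the corresponding CUDA libraries on Nvidia hardware.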

But when it comes to getting many applications ported to AMD GPUs and the ROCm environment, HIP is absolutely the most important tool in the AMD toolbox. As we said above, it is the key to creating a widely adopted platform because it makes it relatively easy to port C++ codes that have been running in accelerated mode on Nvidia GPUs to a variant of C++ that can be deployed on Nvidia GPUs, AMD GPUs, or X86 processors without further code changes. And the evidence suggests that such a port is relatively easy and does not significantly affect performance.
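As a sketch of that porting workflow (hypothetical file names; hipify-perl ships with ROCm, and hipify-clang is the more thorough clang-based variant):

    # Translate CUDA source to HIP: cudaMalloc becomes hipMalloc,
    # cudaMemcpy becomes hipMemcpy, kernel code is left essentially intact.
    hipify-perl vector_add.cu > vector_add.hip.cpp

    # Build the HIPified source for an AMD Instinct MI200 (gfx90a).
    hipcc --offload-arch=gfx90a vector_add.hip.cpp -o vector_add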

Portability works both ways with HIP, which is one of the things that makes it so appealing. Once code has been “HIPified” to move it from Nvidia GPUs to AMD GPUs – with only a 1 percent or 2 percent performance overhead, which is negligible given the value of portability – it supports the build path for both hardware platforms from a single code base, allowing easy testing, performance comparison, and migration. It’s another way openness goes both ways for AMD – and keeps both vendors honest and competitive.
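Concretely, once the code has been HIPified, the same source builds for either vendor's GPUs (a sketch; the environment variable and exact flags vary by ROCm release):

    # On an AMD system, hipcc drives the ROCm compiler.
    hipcc vector_add.hip.cpp -o vector_add_amd

    # On an Nvidia system, HIP's CUDA back end has hipcc wrap nvcc.
    HIP_PLATFORM=nvidia hipcc vector_add.hip.cpp -o vector_add_nvidia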

The most important thing for the competition, however, is that AMD has reached critical mass with ROCm 5.0 and the AMD Instinct MI200 GPU accelerators, garnering support and adoption from developers, ISVs, and the open source community. As commercially supported applications on ROCm 5.0 become available this year, the AMD ecosystem will expand from its exascale system wins to a wide range of customers, partners, and systems in the HPC and AI markets.

Sponsored by AMD
