October 5, 2022 – Junchao Zhang, a software engineer at the US Department of Energy’s (DOE) Argonne National Laboratory, leads a team of researchers working to prepare PETSc (the Portable, Extensible Toolkit for Scientific Computation) for the nation’s exascale supercomputers, including Aurora, the exascale system being prepared for deployment at the Argonne Leadership Computing Facility (ALCF), a DOE Office of Science user facility located at Argonne.
PETSc is a math library for the scalable solution of models generated with continuous partial differential equations (PDEs). PDEs, fundamental to describing the natural world, are ubiquitous in science and engineering. As such, PETSc is used in many disciplines and industrial sectors, including aerodynamics, neuroscience, computational fluid dynamics, seismology, fusion, materials science, ocean dynamics, and the petroleum industry.
As researchers in both science and industry strive to create increasingly high-fidelity simulations and apply them to increasingly large-scale problems, PETSc stands to benefit directly from advances in exascale computing power. In addition, the technology developed for exascale can also be applied to less powerful computing systems, making PETSc applications on those systems faster and cheaper, which in turn should lead to wider adoption.
Furthermore, every exascale machine slated to come online at DOE facilities has adopted an accelerator-based architecture and draws most of its computing power from graphics processing units (GPUs). This has made porting PETSc for efficient use on GPUs an absolute must.
However, each exascale vendor has adopted its own programming model and corresponding ecosystem. Moreover, portability between the different models, where it has been attempted at all, remains in relative infancy for practical purposes.
To avoid getting locked into a particular vendor’s programming model and to benefit from broad user support and math libraries, Zhang’s team chose to prepare PETSc for GPUs using the vendor-independent Kokkos as a portability layer and as its backend wherever possible (otherwise relying on CUDA, SYCL, and HIP).
Instead of writing multiple interfaces to different vendor libraries, the researchers use the Kokkos math library, known as Kokkos-Kernels, as a wrapper. By virtue of PETSc being a library, this approach also lets the team accommodate its users’ choice of programming model, enabling smooth and natural GPU support.
GPU Support Expansion
Before the efforts of Zhang’s team, which are supported by DOE’s Exascale Computing Project (ECP), PETSc support for GPUs was limited to NVIDIA processors and required many of its compute kernels to execute on host machines. This had the effect of reducing both the portability and the capability of the code.
“So far, we believe the adoption of Kokkos is successful, because we only need one source code,” Zhang said. “We had direct support for NVIDIA GPUs with CUDA. We tried to duplicate the code to support AMD GPUs with HIP directly. We found it painful to maintain duplicate code: the same feature has to be implemented in multiple places, and the same bug has to be fixed in multiple places. Once the CUDA and HIP application programming interfaces (APIs) diverge, it becomes difficult to duplicate the code.”
However, while PETSc is written in C, enough GPU programming models use C++ that Zhang’s team found it necessary to add an increasing number of C++ files.
“Within the ECP project, bearing in mind the formula in computer architecture known as Amdahl’s law, which indicates that any non-accelerated piece of code can become a bottleneck for overall speedup,” Zhang explained, “we tried to consider GPU porting as a whole and to port code to the GPU in blanket terms.”
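The bottleneck Zhang describes can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the sample fractions are assumptions chosen for the example, not figures from the PETSc team. It evaluates the standard form of Amdahl's law, in which a fraction p of the runtime is accelerated by a factor s and the rest is left untouched.

```python
def amdahl_speedup(accelerated_fraction, acceleration):
    """Overall speedup when `accelerated_fraction` of the runtime
    is sped up by a factor of `acceleration` (Amdahl's law)."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must lie in [0, 1]")
    if acceleration <= 0.0:
        raise ValueError("acceleration must be positive")
    # The part of the code that stays on the host runs at its old speed.
    unaccelerated = 1.0 - accelerated_fraction
    return 1.0 / (unaccelerated + accelerated_fraction / acceleration)


if __name__ == "__main__":
    # Hypothetical fractions of runtime ported to the GPU,
    # each assumed to run 50x faster once ported.
    for fraction in (0.50, 0.90, 0.99):
        print(f"{fraction:.0%} ported -> {amdahl_speedup(fraction, 50.0):.2f}x overall")
```

However fast the GPU is, the overall speedup can never exceed 1 / (1 - p). With 90 percent of the runtime ported, for instance, the ceiling is 10x, which is why leaving even a small piece of code on the host can dominate.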
Improved communication and computation
The team is working to improve GPU functionality on two fronts: communication and computation.
As the team discovered, CPU-GPU data synchronizations must be carefully isolated to avoid the tricky and elusive bugs they can otherwise introduce.
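One common way to isolate synchronization is to route every access to shared data through a pair of accessors that track which copy is current. The sketch below is a hypothetical illustration of that idea in plain Python, with an ordinary list standing in for GPU memory. The class and method names are invented for this example and do not reflect PETSc's actual implementation.

```python
class DualBuffer:
    """Host/device buffer pair whose copies are synchronized in one place only."""

    def __init__(self, data):
        self._host = list(data)
        self._device = []        # plain list standing in for GPU memory
        self._valid = "host"     # where the current data lives: "host", "device", or "both"
        self.transfers = 0       # count of simulated host<->device copies

    def _sync(self, target):
        # The only place in the class where data moves between the two sides.
        if self._valid in (target, "both"):
            return               # target copy is already current: no transfer
        if target == "device":
            self._device = list(self._host)
        else:
            self._host = list(self._device)
        self.transfers += 1
        self._valid = "both"

    def host(self, write=False):
        """Return the host copy, marking the device copy stale if it will be written."""
        self._sync("host")
        if write:
            self._valid = "host"
        return self._host

    def device(self, write=False):
        """Return the device copy, marking the host copy stale if it will be written."""
        self._sync("device")
        if write:
            self._valid = "device"
        return self._device


if __name__ == "__main__":
    buf = DualBuffer([1, 2, 3])
    d = buf.device(write=True)   # first transfer: host -> device
    d[0] = 10                    # a "kernel" updates the device copy
    buf.device()                 # device copy is current: no transfer
    print(buf.host(), buf.transfers)
```

Because callers never copy data themselves, a forgotten or redundant transfer can only originate in `_sync`, which makes such bugs far easier to find.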
To improve communication, the researchers added support for GPU-aware Message Passing Interface (MPI) implementations, enabling data to pass directly between GPUs rather than being buffered on CPUs. Furthermore, to remove the GPU synchronizations that result from current MPI constraints on asynchronous computation, the team researched GPU-stream-aware communication that bypasses MPI entirely and passes data using the NVIDIA NVSHMEM library. The team is also collaborating with Argonne’s MPICH group to test new extensions that address MPI limitations, as well as a stream-aware MPI feature developed by the group.
For improved GPU computation, Zhang’s team ported a number of functions to the device to reduce the copying of data back and forth between host and device. For example, while matrix assembly, which is essential to using PETSc, was previously carried out on the host, its APIs, convenient as they are on CPUs, are practically impossible to parallelize on GPUs. The team added new matrix assembly APIs suitable for GPUs, which improved performance.
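The article does not name the new assembly APIs. One GPU-friendly style, documented in PETSc as `MatSetPreallocationCOO` and `MatSetValuesCOO`, hands the library all (row, column, value) triplets at once in coordinate (COO) format; treating that as the interface meant here is an assumption. The pure-Python sketch below (the function name is hypothetical, and this is not GPU code) shows why the bulk form suits parallel hardware: assembly reduces to a sort followed by a sum over duplicates, both of which parallelize well, instead of a sequence of individual insertions.

```python
def assemble_coo_to_csr(n_rows, coo_i, coo_j, coo_v):
    """Build a CSR matrix from COO triplets, summing duplicate (i, j) entries."""
    # Sort entry positions by (row, column) so duplicates become adjacent.
    order = sorted(range(len(coo_v)), key=lambda k: (coo_i[k], coo_j[k]))

    row_ptr = [0] * (n_rows + 1)
    col_idx, values = [], []
    previous = None
    for k in order:
        key = (coo_i[k], coo_j[k])
        if key == previous:
            values[-1] += coo_v[k]        # duplicate entry: accumulate
        else:
            col_idx.append(coo_j[k])      # new nonzero in this row
            values.append(coo_v[k])
            row_ptr[coo_i[k] + 1] += 1
            previous = key

    # Turn per-row counts into row offsets with a prefix sum.
    for r in range(n_rows):
        row_ptr[r + 1] += row_ptr[r]
    return row_ptr, col_idx, values


if __name__ == "__main__":
    # Two 1D elements on nodes (0, 1) and (1, 2), each contributing
    # the local matrix [[1, -1], [-1, 1]]; node 1 receives both.
    coo_i = [0, 0, 1, 1, 1, 1, 2, 2]
    coo_j = [0, 1, 0, 1, 1, 2, 1, 2]
    coo_v = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
    print(assemble_coo_to_csr(3, coo_i, coo_j, coo_v))
```

In this example the two contributions to entry (1, 1) are summed to 2.0, giving the familiar tridiagonal stencil with seven stored nonzeros.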
Improved code development
Apart from recognizing the importance of avoiding code duplication and of encapsulating and isolating data synchronization between processors, the team learned to profile often (relying on NVIDIA nvprof and Nsight Systems) and to inspect the timeline of GPU activities in order to identify, and then eliminate, hidden and unexpected activities.
One crucial difference between the Intel Xe GPUs that will power Aurora and the GPUs found in other exascale machines is that the Xe GPUs have multiple subchips, indicating that optimal performance hinges on NUMA-aware programming. (NUMA, or non-uniform memory access, is a way to configure a group of processors to share memory locally.)
Relying on a single source code allows PETSc to run readily on Intel, AMD, and NVIDIA GPUs, albeit with some tradeoffs. By making Kokkos an intermediary between PETSc and the vendors, PETSc becomes dependent on the quality of Kokkos. The Kokkos-Kernels APIs to vendor libraries must therefore be optimized to avoid poor performance. The researchers have discovered that some key Kokkos-Kernels functions are not optimized for vendor libraries, and they are contributing fixes to address issues as they arise.
As part of the project’s next steps, the researchers will help the Kokkos-Kernels team add interfaces to the Intel oneMKL math kernel library before testing them with PETSc. This in turn will help the Intel oneMKL team as they prepare the library for Aurora.
Zhang noted that to further expand PETSc’s GPU capabilities, his team will work to support more low-level data structures in PETSc along with more high-level, user-facing GPU interfaces. The researchers also plan to work with users to help ensure effective use of PETSc on Aurora.
The GPU Code Development Best Practices series highlights researchers’ efforts to improve code to run efficiently on ALCF’s Aurora exascale supercomputer.
The Argonne Leadership Computing Facility provides supercomputing capabilities to the scientific and engineering community to advance fundamental discovery and understanding in a wide range of disciplines. Supported by the US Department of Energy’s (DOE) Office of Science, Advanced Scientific Computing Research (ASCR) program, the ALCF is one of two DOE leadership computing facilities in the country dedicated to open science.
Source: Nils Heinonen, ALCF