
Waseem Kamleh

Evolving the COLA software library

Abstract

COLA is a software library for lattice QCD, written in a combination of modern Fortran and C/C++. Intel and NVIDIA have dominated the HPC domain in the years leading up to the exascale era, but the status quo has changed with the arrival of Frontier and other AMD-based systems in the supercomputing Top 500. Setonix is a next generation HPE Cray EX system hosted at the Pawsey Supercomputing Centre in Perth, Australia. Setonix features AMD EPYC CPUs and AMD Instinct GPUs. This report describes some of my experiences in evolving COLA to adapt to the current hardware landscape.

The first supercomputer in the world to achieve an HPL benchmark greater than 1 exaflop/s was Frontier, based at the Oak Ridge National Laboratory. Frontier is an HPE Cray EX system featuring AMD EPYC “Trento” cores and AMD Instinct MI250X accelerators. Topping both the June and November Top 500 lists in 2022 with an $R_{\rm max}$ score of 1.102 exaflop/s, Frontier displaced the Fugaku supercomputer at RIKEN, which features a bespoke chipset based on the ARM architecture. The previous crown holder was Summit, another Oak Ridge entry, but with NVIDIA accelerators on an IBM Power system. The diversity of hardware architectures on leading systems represents one of the challenges of high performance computing. Research groups typically have allocations on a variety of machines, and with the recent resurgence of AMD the range of target architectures has only widened. It is not just platform portability that is essential for scientific codes, but also performance portability.

The challenges of portability in the context of scientific computing are not new of course, but there are aspects of the contemporary architecture ecosystem that are distinct. Historically, there has always been a range of CPU chipsets deployed at HPC facilities. Performant scientific codes are typically written in C/C++ or Fortran. These languages are not tied to a specific vendor, and it is reasonable to expect that compilers for them are available on any system. In this sense, platform portability for CPU codes presents a relatively low barrier for code development. As for performance portability, Fortran's tight language restrictions around aliasing mean that compilers developed by hardware vendors such as Intel and Cray have traditionally been very successful at generating optimised code. With regard to C/C++, the use of architecture-specific intrinsics has often been required to get the most benefit from particular processor features (such as vectorisation).
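
As a minimal illustration of this contrast (a standalone sketch, not code taken from COLA), the same axpy-style kernel is shown below written first as a portable loop that an optimising compiler may auto-vectorise, and again using x86 AVX2/FMA intrinsics. The intrinsic version assumes the vector length is a multiple of 8 and must be compiled with the appropriate flags (e.g. -mavx2 -mfma).

    #include <immintrin.h>   // x86 AVX2/FMA intrinsics
    #include <stdio.h>

    // Portable version: with the no-aliasing promise (implicit in Fortran,
    // supplied via __restrict__ here) an optimising compiler may auto-vectorise this loop.
    void axpy_portable(int n, float a, const float* __restrict__ x, float* __restrict__ y)
    {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    // Architecture-specific version: 8-wide fused multiply-add via AVX2 intrinsics.
    // Assumes n is a multiple of 8; compile with -mavx2 -mfma.
    void axpy_avx2(int n, float a, const float* x, float* y)
    {
        __m256 va = _mm256_set1_ps(a);
        for (int i = 0; i < n; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);
            __m256 vy = _mm256_loadu_ps(y + i);
            _mm256_storeu_ps(y + i, _mm256_fmadd_ps(va, vx, vy));
        }
    }

    int main(void)
    {
        float x[16], y[16];
        for (int i = 0; i < 16; i++) { x[i] = 1.0f; y[i] = (float)i; }
        axpy_portable(16, 2.0f, x, y);
        axpy_avx2(16, 2.0f, x, y);
        printf("y[3] = %f\n", y[3]);   // expect 3 + 2 + 2 = 7
        return 0;
    }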

COLA is a custom in-house code that I began developing in Fortran as a graduate student. The code is more or less in a constant state of change, either evolving to adapt to new challenges or expanding to add new features. Key algorithms are solvers for linear systems and eigenmodes [1], and gauge field generation with Hybrid Monte Carlo [2]. The latter features a number of variants such as the RHMC algorithm [3] and a selection of filtering techniques [4, 5, 6]. Specific physics features are tailored to the CSSM lattice research programme, for which the COLA software library has formed the computational foundation for some time [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64].

The fastest machine in Australia in the November 2022 Top 500 is Setonix, ranked at #15 with an HPL score of 27 petaflop/s and hosted at the Pawsey Supercomputing Centre in Perth. Similar to Frontier, Setonix is based on the AMD EPYC platform and also features the AMD Instinct MI250X accelerators. Owing to the energy efficiency of these accelerators, Setonix ranks at #4 in the Green 500 list based on performance per watt, two places higher than Frontier at #6. As the world moves towards a carbon-neutral future, there is a growing focus on optimising the energy efficiency [65] of traditionally power-hungry high performance computing systems.

To accompany the arrival of this new architecture on the Australian supercomputing scene, the Pawsey Centre for Extreme Scale Readiness (PaCER) scheme was created. Several projects were chosen to partner with Pawsey to optimise codes and workflows for the next generation of supercomputers. The CSSM is partnered with the PaCER scheme via one of these projects, Emergent phenomena revealed in subatomic matter. User community initiatives with similar goals have accompanied the launch of other recent AMD-based systems.

Figure 1: Performance of the COLA fermion matrix on the AMD EPYC “Milan” CPU with the Cray EX, GCC, and AOCC Fortran compilers, measured relative to the Xeon “Cascade Lake” performance with the Intel Fortran compiler.

As the Fortran components of COLA do not use any vendor- or architecture-specific features, adapting the software to the AMD EPYC platform was fairly straightforward. The enforcement of standard compliance does vary between the different compilers, requiring some minor changes to code that has primarily evolved on the Intel platform. There are three programming environments available on Setonix, namely Cray EX, GNU GCC, and AOCC (which is based on LLVM). Figure 1 shows the relative performance of the fermion matrix code under these three compilers on a single dual-socket “Milan” CPU node with a total of 128 cores, as compared to the Intel compiler on a dual-socket Xeon “Skylake” node with 48 cores. As would be expected given the effort Cray has put into Fortran compiler optimisation, the Cray compiler performs best of the three programming environments, closely followed by the GNU compiler. At the time of writing, modern Fortran support within the AOCC programming environment should be considered preliminary.

GPU acceleration for the COLA fermion matrix inverter was first introduced via NVIDIA CUDA C/C++ around the time that the Fermi architecture was released. This mixed-language approach persists to this day. As the CPU aspects of the code are written in modern Fortran, I utilise the interoperability provided by the intrinsic ISO_C_BINDING module to interface with the GPU-accelerated routines that are implemented in CUDA C/C++. A consequence of this approach is that for many of the key algorithms in COLA there exist two independent implementations – one in Fortran for CPUs and one in CUDA for GPUs. The ability to cross-check the two implementations as a form of validation has proved very beneficial during code development. A significant amount of utility code is also reused, with the Fortran code that handles the reading of runtime parameters, the input and output of data, and the initialisation of the MPI process topology being common to both the CPU and GPU implementations.
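
A minimal sketch of this interoperability pattern is shown below, with hypothetical names (cola_gpu_axpy is illustrative, not COLA's actual interface). The extern "C" wrapper gives the CUDA C/C++ routine C linkage, and the matching Fortran interface block declared with ISO_C_BINDING is included as a comment.

    #include <cuda_runtime.h>

    __global__ void axpy_kernel(int n, float a, const float* x, float* y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] += a * x[i];
    }

    // C-linkage wrapper callable from Fortran; x and y are device pointers
    // allocated and managed on the C/C++ side of the code.
    extern "C" void cola_gpu_axpy(int n, float a, const void* xp, void* yp)
    {
        const float* x = static_cast<const float*>(xp);
        float*       y = static_cast<float*>(yp);
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        axpy_kernel<<<blocks, threads>>>(n, a, x, y);
        cudaDeviceSynchronize();
    }

    // Matching Fortran interface block (hypothetical), declared on the Fortran side:
    //
    //   interface
    //     subroutine cola_gpu_axpy(n, a, x, y) bind(C, name="cola_gpu_axpy")
    //       use, intrinsic :: iso_c_binding, only: c_int, c_float, c_ptr
    //       integer(c_int), value :: n
    //       real(c_float),  value :: a
    //       type(c_ptr),    value :: x, y   ! opaque device pointers
    //     end subroutine cola_gpu_axpy
    //   end interface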

Lattice codes have a relatively low flops/byte ratio and as such are typically memory bandwidth limited. The intrinsic geometric parallelism of the lattice maps naturally onto the massively parallel nature of the GPU platform. Whilst on a CPU a subvolume of the lattice would typically be mapped to a single core, on a GPU each lattice site is mapped to a single thread. The latency-hiding capabilities of the GPU execution model, coupled with its high computational power, mean that opportunities to perform additional computation in order to reduce traffic from global memory generally result in an overall speedup of the code. Similarly, the use of mixed-precision techniques also brings a significant benefit [66]. Historically, the amount of device memory available was relatively small compared to the size of a fermion field. The available device memory on GPU accelerators has increased significantly with successive generations, such that limiting the number of vectors stored is less of a concern than it once was (similarly for register pressure).
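
The sketch below illustrates the site-per-thread mapping for a simple field operation; the data layout and names are hypothetical rather than COLA's actual structures. Each thread owns one lattice site and loops over the twelve colour-spin components stored there, with a component-major ordering so that neighbouring threads access contiguous global memory.

    #include <cuda_runtime.h>
    #include <cuComplex.h>

    // psi_out(site) = a * psi_in(site), with 4 x 3 = 12 complex components per site.
    // Launched with one thread per lattice site, e.g.
    //   scale_fermion_field<<<(volume + 255) / 256, 256>>>(volume, a, psi_in, psi_out);
    __global__ void scale_fermion_field(int volume, float a,
                                        const cuFloatComplex* psi_in,
                                        cuFloatComplex* psi_out)
    {
        int site = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per site
        if (site >= volume) return;

        for (int c = 0; c < 12; ++c) {
            // component-major layout: neighbouring sites (threads) touch
            // neighbouring addresses, giving coalesced global memory access
            cuFloatComplex z = psi_in[c * volume + site];
            psi_out[c * volume + site] =
                make_cuFloatComplex(a * cuCrealf(z), a * cuCimagf(z));
        }
    }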

Figure 2: Performance of the COLA GPU-accelerated conjugate gradient inverter on various device architectures, measured relative to the Tesla K40 performance. Results are shown for various NVIDIA Tesla cards with generations ranging from Kepler to Ampere, and the AMD MI100 accelerator.

Figure 2 shows the relative performance of the GPU-accelerated COLA conjugate gradient inverter on a variety of GPU architectures, using the NVIDIA Tesla K40 as a reference. Starting with Kepler, NVIDIA results are provided for generations up to Ampere with the A100. There is a single data point for AMD Instinct, the MI100. A mixed-precision solver is used for the benchmark, with 32-bit precision for the inner iterations and 64-bit precision for the outer solver.
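
The following standalone host-side sketch illustrates the general outer/inner structure of such a mixed-precision solver (double-precision defect correction wrapped around a single-precision conjugate gradient), using a toy one-dimensional Laplacian in place of the fermion matrix; it is not COLA's actual solver.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Toy SPD matrix: the 1D Laplacian stencil [-1 2 -1], applied in precision T.
    template <typename T>
    void apply_A(const std::vector<T>& x, std::vector<T>& y)
    {
        int n = (int)x.size();
        for (int i = 0; i < n; ++i) {
            T v = T(2) * x[i];
            if (i > 0)     v -= x[i - 1];
            if (i < n - 1) v -= x[i + 1];
            y[i] = v;
        }
    }

    // Plain single-precision conjugate gradient, run to a loose inner tolerance.
    void cg_single(const std::vector<float>& b, std::vector<float>& x, float tol)
    {
        int n = (int)b.size();
        std::vector<float> r = b, p = b, Ap(n);
        std::fill(x.begin(), x.end(), 0.0f);
        float rr = 0.0f;
        for (float v : r) rr += v * v;
        for (int k = 0; k < 10 * n && std::sqrt(rr) > tol; ++k) {
            apply_A(p, Ap);
            float pAp = 0.0f;
            for (int i = 0; i < n; ++i) pAp += p[i] * Ap[i];
            float alpha = rr / pAp;
            for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            float rr_new = 0.0f;
            for (float v : r) rr_new += v * v;
            float beta = rr_new / rr;
            rr = rr_new;
            for (int i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        }
    }

    int main()
    {
        const int n = 64;
        std::vector<double> x(n, 0.0), b(n, 1.0), r(n), Ax(n);

        // Outer defect-correction loop in 64-bit precision.
        for (int outer = 0; outer < 20; ++outer) {
            apply_A(x, Ax);
            double rnorm = 0.0;
            for (int i = 0; i < n; ++i) { r[i] = b[i] - Ax[i]; rnorm += r[i] * r[i]; }
            rnorm = std::sqrt(rnorm);
            std::printf("outer %2d  |r| = %.3e\n", outer, rnorm);
            if (rnorm < 1e-9) break;

            // Solve A e = r approximately in 32-bit precision ...
            std::vector<float> rf(n), ef(n, 0.0f);
            for (int i = 0; i < n; ++i) rf[i] = (float)r[i];
            cg_single(rf, ef, 1e-4f * (float)rnorm);

            // ... and accumulate the correction in 64-bit precision.
            for (int i = 0; i < n; ++i) x[i] += (double)ef[i];
        }
        return 0;
    }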

To run on the MI100 architecture the CUDA components of COLA were converted to AMD's Heterogeneous-compute Interface for Portability (HIP). The HIP SDK provides scripts that automate much of the conversion from CUDA code, although some manual effort was required to complete the process. The HIP compiler can target both NVIDIA and AMD accelerators, and on the Volta platform at least it appeared to provide performance equivalent to nvcc. At the time of writing, the MI250X was not yet available to the author for benchmarking. Setonix will be fully launched in early 2023, when the MI250X partition becomes available to users. While platform portability has been demonstrated for the COLA software, it will be interesting to see what can be achieved in terms of performance on the MI250X.
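
As an illustration of the kind of mechanical renaming that the conversion scripts (hipify-perl or hipify-clang) perform, the snippet below shows a small CUDA host routine with the corresponding HIP equivalents indicated in comments; the axpy kernel is the hypothetical one from the earlier sketch.

    #include <cuda_runtime.h>   // becomes #include <hip/hip_runtime.h>

    __global__ void axpy_kernel(int n, float a, const float* x, float* y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] += a * x[i];
    }

    void run_axpy(int n, float a, const float* x_host, float* y_host)
    {
        size_t bytes = n * sizeof(float);
        float *x_dev, *y_dev;

        cudaMalloc((void**)&x_dev, bytes);                          // hipMalloc
        cudaMalloc((void**)&y_dev, bytes);                          // hipMalloc
        cudaMemcpy(x_dev, x_host, bytes, cudaMemcpyHostToDevice);   // hipMemcpy, hipMemcpyHostToDevice
        cudaMemcpy(y_dev, y_host, bytes, cudaMemcpyHostToDevice);

        // the triple-chevron launch is also accepted by hipcc
        // (hipify can alternatively rewrite it as hipLaunchKernelGGL)
        axpy_kernel<<<(n + 255) / 256, 256>>>(n, a, x_dev, y_dev);
        cudaDeviceSynchronize();                                    // hipDeviceSynchronize

        cudaMemcpy(y_host, y_dev, bytes, cudaMemcpyDeviceToHost);   // hipMemcpy, hipMemcpyDeviceToHost
        cudaFree(x_dev);                                            // hipFree
        cudaFree(y_dev);                                            // hipFree
    }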

It seems that one of the aims of HIP is to mirror the functionality provided by CUDA, though this approach necessarily means there is a delay between a CUDA feature release and the appearance of the HIP equivalent. The dominance of CUDA for GPU-accelerated codes can be attributed to the fact that NVIDIA was the first vendor to successfully bring to market devices that targeted the HPC community. Arguably, their success would not have been possible without the large amount of effort put into developing the CUDA programming environment. Many researchers have implemented their code in CUDA due to a combination of the rapid maturity it achieved relative to other programming models (such as OpenCL), and the near-monopoly NVIDIA has had on accelerated HPC systems for many years.

AMD has had recent success at the hardware level with the launch of high-profile HPC systems such as Frontier in the US, Lumi in Europe, and Setonix in Australia. They must now ensure that their programming environment rapidly achieves maturity in order to sustain this momentum. It is also interesting to note that Intel is attempting to (re-)enter the accelerator space with the Xe HPC platform.

This again raises the issue of (platform and performance) portability for accelerated computing. CPUs can generally be targeted by Fortran and C/C++ code in a platform-independent manner whilst maintaining performance (of course, platform-specific optimisations can always improve upon this). For accelerators, it is a different story: vendor-specific programming environments are typically required to get the best performance. Maintaining divergent branches of the same code adds significant overhead to development time, which is already at a premium in the academic research environment, where effort spent on code must ultimately translate into research output, the de facto currency of the field.

In an ideal world we would treat the accelerator space in much the same way as we treat the traditional compute space, that is, through the establishment of platform-independent programming models with agreed-upon standards that compiler providers implement to target their respective hardware platforms. Arguably, the natural way for this to proceed would be to have vendor-agnostic accelerated programming extensions to C/C++ and Fortran. Fortran has included intrinsic parallel computing features since the 1990s, and accelerator-based programming would seem like a natural extension. Language standards tend to evolve fairly slowly however, so in the short term we must look elsewhere. There are vendor-led candidates for cross-platform heterogeneous programming such as NVIDIA's OpenACC, AMD's HIP, and Intel's DPC++ for oneAPI. Of course, these can be expected to perform well on the respective vendor's hardware, but support for (and performance on) the competitors' hardware is not guaranteed. There are also a number of open candidates for heterogeneous programming such as OpenCL, SYCL, and Kokkos. The extent to which the various candidates above provide performance portability is being investigated [67, 68, 69, 70, 71]. Future work will explore this question in the context of continuing to evolve the COLA software library.
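
As a flavour of the single-source style offered by one of the open candidates, the sketch below shows a trivial Kokkos program (illustrative only, not COLA code) in which the same parallel loops can be compiled against CUDA, HIP, or host back-ends depending on how Kokkos is configured.

    #include <Kokkos_Core.hpp>
    #include <cstdio>

    int main(int argc, char* argv[])
    {
        Kokkos::initialize(argc, argv);
        {
            const int n = 1 << 20;
            const double a = 2.0;
            Kokkos::View<double*> x("x", n), y("y", n);

            // fill the views in the default (possibly device) memory space
            Kokkos::parallel_for("init", n, KOKKOS_LAMBDA(const int i) {
                x(i) = 1.0;
                y(i) = double(i);
            });

            // the axpy loop is expressed once, independently of the back-end
            Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
                y(i) += a * x(i);
            });

            // reduce on the device and return the result to the host
            double sum = 0.0;
            Kokkos::parallel_reduce("sum", n,
                KOKKOS_LAMBDA(const int i, double& local) { local += y(i); }, sum);
            std::printf("sum = %f\n", sum);
        }
        Kokkos::finalize();
        return 0;
    }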

Acknowledgements

The author is supported by the Pawsey Supercomputing Centre through the Pawsey Centre for Extreme Scale Readiness (PaCER) program. This work is supported by the Australian Research Council through Grants No. DP190102215 and DP210103706.

References