Speaker
Dr
Mathias Wagner
(NVIDIA)
Description
In the 10 years since the creation of the QUDA library for lattice QCD on NVIDIA GPUs, the hardware and software features of GPU systems have evolved dramatically. Not only has the raw Dslash kernel performance on a single GPU improved by more than an order of magnitude, but modern GPUs are also often deployed in "fat nodes" with up to 8 GPUs. We report on the techniques QUDA implements to achieve high performance on these modern GPU architectures by exploiting features of modern NVIDIA GPUs such as Unified Memory, GPUDirect, and NVLink connections between GPUs and to IBM Power CPUs. We discuss the impact of these optimizations and present scaling results for QUDA on DGX-1-based clusters and on Summit. Finally, we will give an outlook on future directions. In particular, we preview strong-scaling and programmability improvements from using NVSHMEM, an OpenSHMEM implementation for GPUs, as well as QUDA on NVSwitch-based systems such as the DGX-2 with 16 fully interconnected GPUs.
Primary author
Dr
Mathias Wagner
(NVIDIA)
Co-author
Dr
Kate Clark
(NVIDIA)