Speaker
Cory Lueninghoener
(Argonne National Lab)
Description
Large-scale high performance computing (HPC) systems pose special problems for system administrators, particularly with respect to configuration management. These systems operate at a larger scale than typical environments, run synchronized workloads, and must be left untouched while jobs are running. Coupled with the need to keep compute systems as uniform as possible, these problems can put considerable stress on infrastructure and administrators alike.
At the same time, HPC systems are perfect candidates for complete configuration management, since they generally exhibit high levels of uniformity and administrator control. With a strong configuration management tool, keeping compute nodes identical, login nodes clean, and management nodes secure all become much more manageable. All of this can be done while helping administrators document and understand their environments better than they could with ad hoc systems.
In this talk, we will give an overview of the challenges we face in managing the 500 TF Blue Gene/P system at Argonne National Laboratory's Leadership Computing Facility and its infrastructure. In particular, we will focus on the configuration tradeoffs we face in this environment and the level of automation we have achieved by using Bcfg2, an open-source configuration management tool that we have developed in Python at Argonne.
Primary author
Cory Lueninghoener
(Argonne National Lab)