The Argonne Leadership Computing Facility is joining the PBS Pro open source community

Greetings!

We just wanted to say hello, introduce ourselves and say how excited we are to be joining the PBS Pro open source community.

First, let me introduce the team. We currently have four people whose primary task is the workload manager. In alphabetical order, they are Lisa Childers, Eric Pershey, Paul Rich, and Brian Toonen. They will likely be fairly active in the community. Some other folks will be contributing as well: Mike Zhang is on our business intelligence team and will be working on issues related to extracting data from the system. Zach Nault is technically part of the storage team, but has been contributing a good bit. George Rojas will help us with architecture and design; we also have projects driving the need for a REST (or similar) API for interacting with the workload manager, and George will likely engage in that arena as well. I manage the team, which supports the workload manager, the sbank cluster accounting system, our user management system, and several other smaller internal systems. Also, since we run a supercomputing center, we get very direct feedback from our systems administrators and occasionally some code contributions.

Until now, we have developed and supported an internal workload manager called Cobalt. It has worked very well for our needs, but it is getting antiquated, and scaling is becoming an issue for us. So, after a lot of debate, we set out to do a complete rewrite of Cobalt. Cobalt has always been weak on the resource management side of the house, and that is not our forte, so we went looking for an open source resource manager that we could use as a base. We looked at a number of possibilities: Mesos, Peloton, Flux, Kraken, Slurm, and of course, PBS Pro. We met with Altair to discuss the possibility of extracting the resource management functionality from PBS Pro. However, the team came back and unanimously agreed that our best course of action was to engage with the PBS Pro open source community as a whole and work from there.

Right now, we are still working to wrap our heads around the code base and to figure out how best to manage the transition from Cobalt to PBS Pro. You will start seeing some small contributions from our team: a minor bug fix, a proposal for a hook or two to get us some additional information we need to drive our internal reporting machinery, etc. As this will be our first foray into this area, we are looking forward to feedback from the community on these contributions. Long term, we hope to be steady contributors to the project, picking off proposed features that match our needs and proposing some ideas of our own along the way.

We are truly very excited about this and we are looking forward to engaging with you all in the future.

Regards,

The Cobalt team

P.S.: Many of you may be familiar with the National Labs, but for those of you who are not, here is some background that might be of interest.

The Argonne Leadership Computing Facility (ALCF) is a national scientific user facility that provides supercomputing resources and expertise to the scientific and engineering community to accelerate the pace of discovery and innovation in a broad range of disciplines. We are part of Argonne National Laboratory, which is part of the broader US Department of Energy National Laboratory System.

So what does that really mean? We are a supercomputing center. We are generally running one or two of the largest systems in the world, including the future Aurora system, which will be the first exascale system in the US, along with some smaller clusters, tens (and soon hundreds) of PB of disk, and large archival tape facilities. Until recently, our primary machines were IBM Blue Gene based systems. In 2006, when we first started running Blue Gene machines, we were also tasked with supporting operating systems research. Thus we had to be able to boot alternate, non-standard operating systems on our Blue Genes. There was only one workload manager at that time that supported the Blue Gene, IBM's LoadLeveler, and it didn’t support booting alternate operating systems. As a result, we teamed up with the Argonne Mathematics and Computer Science Division to modify the in-house developed Cobalt workload manager.
