11 October 2021

Join Me for TechEXtra21 on Oct 15: Running a 380 fp32 petaFLOPS GPU Burst for Multi-Messenger Astrophysics With IceCube



Register for the Running a 380PFLOP32s GPU Burst for Multi-Messenger Astrophysics With IceCube event to be held Friday, October 15, at 1:30 p.m. ET.

By Igor Sfiligoi, Lead Scientific Software Developer and Researcher at UCSD-SDSC


The IceCube Neutrino Observatory is the National Science Foundation’s premier facility to detect neutrinos and a pillar of the Windows on the Universe – Multi-Messenger Astrophysics (WoU-MMA) program, one of the NSF’s 10 Big Ideas.

The detector is composed of more than 5,000 optical sensors buried deep in the ice at the South Pole. Understanding the properties of ice as a natural medium is paramount to science, as one observes drastic changes in the reconstructed position of detected neutrinos with different ice models.

The problem, however, is too complex for a parameterized approach. So, brute-force photon propagations, also known as ray-tracing simulations, are used instead. Because ray-tracing applications are notoriously well-suited for GPU computing, IceCube’s code is optimized for that platform. 

In an effort to both support IceCube science and explore the feasibility of large-scale high-throughput computing (HTC) in the cloud, the NSF awarded grant funding to support IceCube simulations using cloud computing.

We executed a series of runs in which we aggregated several fp32 exaFLOP hours' worth of GPU computing across multiple commercial cloud providers and used it to run the IceCube simulations that produce the much-needed calibration data. Given the exploratory nature of the work, each run explored different aspects of infrastructure provisioning and operation, with the end result being complete commoditization of the process.

A Closer Look at Each Cloud Run

The first cloud run was focused on sheer size and provisioning speed. In a matter of hours, we harvested all available-for-sale GPUs across Amazon Web Services, Microsoft Azure, and Google Cloud Platform – reaching over 51K GPUs total and 380 fp32 petaFLOPS, with GPU types spanning the full range of generations from the NVIDIA GRID K520 to the most modern NVIDIA T4 and V100. While we did not sustain the peak for very long, this was – at peak – the most performant cloud-based compute pool ever to be created.

The second run was aimed at demonstrating the feasibility of cloud computing in a more production-like setting, including using on-premises data sources, provisioning only cost-effective compute resources, and sustaining the peak performance for a significantly longer period of time. The provisioned pool was unsurprisingly smaller, but still reached about 170 fp32 petaFLOPS and was sustained for a whole workday, integrating about one fp32 exaFLOP hour of compute and delivering about 50% more science output compared to the first run.

The third run was instead focused on exploring the feasibility of using dedicated network paths for routing IceCube data artifacts back to on-premises storage. The primary driver for this work was the high connectivity fees imposed by the cloud providers, although dedicated network paths do offer other potential benefits, too. We executed this run in collaboration with both local networking groups and Internet2, delivering 130 terabytes of data to on-premises storage and integrating about 200 fp32 petaFLOP hours in the process.

All of the previous runs used ad-hoc setups, at least partially, to minimize the risks inherent in exploratory work. We thus finished the series by exposing the cloud resources through a standard OSG Compute Entrypoint (CE) and expanded the regular IceCube HTCondor pool with up to 2K GPUs, integrating 16K GPU days or about 3.1 fp32 exaFLOP hours of compute over a period of two weeks. We are thus confident that we can support cloud-based computing for IceCube, and indeed for most OSG communities, on a regular basis if funding for cloud resources were available.
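For readers who want to relate the units quoted above (petaFLOPS, exaFLOP hours, GPU days), here is a minimal back-of-the-envelope sketch in Python. It is not part of the IceCube tooling; the function names are hypothetical, and it assumes that "51K" and "16K" mean 51,000 and 16,000 and that "a whole workday" is 8 hours.

```python
# Back-of-the-envelope unit conversions for the figures quoted in this post.
# All helper names are illustrative; the inputs are rounded numbers from the
# text, so the results are rough estimates only.

TERA = 1e12
PETA = 1e15
EXA = 1e18


def avg_tflops_per_gpu(total_pflops: float, gpu_count: int) -> float:
    """Average fp32 TFLOPS per GPU for a pool with the given peak."""
    return total_pflops * PETA / gpu_count / TERA


def implied_tflops_per_gpu(exaflop_hours: float, gpu_days: float) -> float:
    """Average fp32 TFLOPS per GPU implied by an integrated total."""
    gpu_hours = gpu_days * 24
    return exaflop_hours * EXA / gpu_hours / TERA


def exaflop_hours(pflops: float, hours: float) -> float:
    """Integrated compute if a rate in PFLOPS were held flat for `hours`."""
    return pflops * PETA * hours / EXA


if __name__ == "__main__":
    # First run: 380 fp32 PFLOPS peak across roughly 51,000 GPUs (assumed).
    print(f"{avg_tflops_per_gpu(380, 51_000):.2f} TFLOPS per GPU at peak")
    # Final run: 16,000 GPU days (assumed) integrating about 3.1 exaFLOP hours.
    print(f"{implied_tflops_per_gpu(3.1, 16_000):.2f} TFLOPS per GPU on average")
    # Second run: upper bound if 170 PFLOPS were held for an assumed 8-hour day.
    print(f"{exaflop_hours(170, 8):.2f} exaFLOP hours upper bound")
```

Under these assumptions the first run averages roughly 7.5 fp32 TFLOPS per GPU at peak, and the final run about 8 fp32 TFLOPS per GPU. A flat 170 petaFLOPS over 8 hours would come to about 1.36 exaFLOP hours, which fits the "about one fp32 exaFLOP hour" quoted for the second run once ramp-up and ramp-down are allowed for.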

Join Me for the Talk

I hope you will join me on Friday, October 15, at 1:30 p.m. ET as I share more details about these exploratory IceCube simulations using cloud compute, including the outcomes and costs incurred.

More About TechEXtra21

TechEXtra21 offers the Internet2 and InCommon technical community a wide selection of opportunities to convene virtually through small bites of the Technology Exchange experience, dubbed “TechEXtras.” View the TechEXtra21 schedule or submit a proposal to round out our fall programming!