E-CAS Project Updates

Fall 2021

Want to learn more about E-CAS sub-awardees’ recent achievements and the benefits and challenges of using cloud platforms to support research? Watch recordings of the Sept. 22 “Exploring Clouds for Acceleration of Science Phase II” workshop.

Phase 2 Projects: From September 2020 to September 2021

Two of the six projects were selected for the program’s second phase, which runs from September 2020 to September 2021. These projects are currently developing and repeatedly running their science workloads at scale. The two selected projects are:

“Deciphering the Brain’s Neural Code,” William Lytton and Salvador Dura-Bernal, SUNY Downstate Medical Center

This project aims to help decipher the brain’s neural coding mechanisms with far-reaching applications, including developing treatments for brain disorders, advancing brain-machine interfaces for people with paralysis, and developing novel artificial intelligence algorithms. Using a software tool for brain modeling, researchers will run thousands of parallelized simulations exploring different conditions and inputs to the simulation of brain cortical circuits.

“Heterogeneous computing for the Large Hadron Collider,” Philip Harris, MIT

Only a small fraction of the 40 million collisions per second at the Large Hadron Collider (LHC) are stored and analyzed, because of the huge volumes of data and the compute power required to process them. This project proposes a redesign of the algorithms using modern machine learning techniques that can be incorporated into heterogeneous computing systems, allowing more data to be processed, which could increase the physics output and potentially lead to foundational discoveries in the field.

Project Progress

The project is running on schedule and within budget. All six teams presented their work at a virtual workshop in early April 2020 and submitted their phase one final reports by the end of April.
The cloud providers (Amazon Web Services and Google Cloud Platform) have committed around $750,000 in cloud credits to this project. This includes extra credits for committed spending beyond the initial commitments, representing a more than 70% extra value (or the equivalent of more than 40% savings).
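The two figures quoted for the extra credits are consistent with each other, as a quick check shows. The sketch below is illustrative arithmetic only; the function name `savings_from_bonus` is hypothetical and not part of any E-CAS tooling.

```python
def savings_from_bonus(bonus_fraction: float) -> float:
    """Return the discount equivalent to receiving `bonus_fraction` extra credits.

    Paying 1 unit and receiving (1 + bonus_fraction) units of value is the same
    as paying 1 / (1 + bonus_fraction) per unit of value received.
    """
    return 1.0 - 1.0 / (1.0 + bonus_fraction)


# 70% extra value corresponds to roughly 41% savings.
print(f"{savings_from_bonus(0.70):.1%}")
```

With 70% extra value, each dollar of committed spending buys $1.70 of credits, so each dollar of credits costs about 59 cents.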
Subaward contracting has not been an issue, and teams have been able to use their existing cloud contracts. However, one team had significant delays in accessing GCP credits while its institution was negotiating a “whole of organization” GCP contract.
The project highlights the potential procurement issues of awarding a large sum of money to an institution to spend with a certain cloud provider without going through a tender process. It also shows the need to use institutional contracts that are compliant with state procurement laws and guidelines.
One obvious takeaway from each team’s presentation (and from discussions in team meetings) is that every project needs a team with diverse skills, including research and technology support personnel, to succeed.

Summary of Phase One Projects and Achievements

“Deciphering the Brain’s Neural Code”

Key achievements:
- Utilized more than 100,000 CPU cores simultaneously on GCP over several hours to enable very detailed models of the motor cortex
- Scalable resources enabled the introduction of evolutionary algorithms to further refine the accuracy of the models over 68 generations, using more than 10,000 cores over a period of 2 weeks and totaling more than 1.8 million CPU hours
- Completed several long-running neuronal avalanche simulations spanning more than 10 days

“Heterogeneous computing for the Large Hadron Collider”

Key achievements:
- Used gRPC calls to offload GPU-, FPGA-, and ASIC-accelerated deep learning algorithms to the cloud platforms while running CPU-intensive code at Fermilab and MIT.
- Enabled rapid prototyping of new deep learning algorithms and retraining of existing models that will influence the design of future computational facilities for the LHC.
- Developed GPUaaS and FPGAaaS tools, shared on GitHub.
- Ran workshops and training for the HEP community on Fast Machine Learning.
- Note: Existing GPU resources, including DOE HPCs, are not configured to perform the tests required by this project. No large-scale publicly funded FPGA cloud exists.

CIPRES is a web portal that allows scientists around the world to analyze DNA and protein sequence data to determine the natural history of a group or groups of living things. For example, one can ask where mammals originated, how the Ebola virus spreads, whether a given plant is really a new species or an unwelcome imported one, or how a given species interacts with other species and its environment over long periods of time.

CIPRES helps answer these kinds of questions by providing access to parallel phylogenetics codes run on large HPC clusters provided by the NSF XSEDE program. CIPRES currently runs analyses for about 12,000 scientists per year, and that number is growing each year. CIPRES accelerates research by increasing each researcher’s throughput. Jobs run faster using parallel codes, and users can run many jobs simultaneously on large clusters. For example, CIPRES provides access to P100 GPUs that can speed up some jobs by 100-fold relative to a single-core run. But GPUs are in short supply in the XSEDE portfolio, so usage must be strictly limited. This project will develop the infrastructure needed to cloudburst CIPRES jobs to newer, faster V100 GPUs at AWS. As a result, individual jobs will run up to 1.5-fold faster, and users will have access to twice as many GPU nodes as they did in the previous year. The infrastructure created will also open the door to scalable access to AWS cloud resources through CIPRES for all users.
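To make the idea of cloudbursting concrete, here is a minimal sketch of one possible bursting rule: send a job to the cloud when it is expected to finish sooner there than on the local cluster. This is not the CIPRES implementation. The `Job` class, the `should_burst` function, the 15-minute cloud startup overhead, and the sample jobs are all assumptions made for illustration; only the 1.5-fold speedup figure comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    local_runtime_h: float     # estimated runtime on the on-premises GPU
    local_queue_wait_h: float  # estimated wait in the on-premises queue


def should_burst(job: Job, cloud_speedup: float = 1.5,
                 cloud_startup_h: float = 0.25) -> bool:
    """Return True when the cloud is expected to finish the job sooner.

    cloud_speedup: assumed speedup of the cloud GPU over the local GPU.
    cloud_startup_h: assumed overhead for provisioning and data staging.
    """
    local_finish = job.local_queue_wait_h + job.local_runtime_h
    cloud_finish = cloud_startup_h + job.local_runtime_h / cloud_speedup
    return cloud_finish < local_finish


# Hypothetical jobs: a short job with no queue wait and a long, queued job.
jobs = [Job("small-tree", 0.5, 0.0), Job("large-tree", 48.0, 12.0)]
decisions = {job.name: should_burst(job) for job in jobs}
print(decisions)
```

Under these assumptions the short job stays local, because the startup overhead outweighs the faster hardware, while the long queued job is sent to the cloud. A production policy would also need to weigh cost and data movement.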

Key achievements:
- Developed a method of bursting CIPRES jobs into AWS to reduce queue wait time and job processing time.
- Used Internet2 Cloud Connect so that data, intermediate results, and checkpoints could stay on site at SDSC (West Coast) while jobs used the newest GPU hardware available at AWS in Ashburn (East Coast).
- Ran 812 jobs on AWS using approx. 25,000 GPU hours and 85,000 CPU hours benefiting many end-user studies in phylogenetics, including models and variants of SARS-CoV-2.
- Saw speedups of up to 1.4x using V100 GPUs in AWS compared with P100 GPUs on campus, and 48x compared with Haswell CPUs. This exceptional speedup means large analyses that would require months on CPUs can be completed in days in AWS.
- Highlighted some performance issues on intermediate-size jobs in the cloud that could not be explained or rectified by the provider. These jobs ran approximately 20% slower than on campus; however, small and large jobs were faster in the cloud.
- Found cloud very good for short queue wait times, and good for long-running jobs.

The IceCube Neutrino Observatory, located at the South Pole, supports science in a number of disciplines, including astrophysics, particle physics, and geographical sciences. It operates continuously and is sensitive to the whole sky at all times. Astrophysical neutrinos yield an understanding of the most energetic events in the universe and could reveal the origin of cosmic rays. Being able to burst into the cloud supports follow-up computations of observed events and alerts to and from the community, such as other telescopes and LIGO. This project plans to use custom spot instances and FPGA-based filters in AWS and GPU/TensorFlow machine learning in GCP.

Key achievements:
- Used the Open Science Grid platform and a fleet of around 20,000 CPU cores to reduce multi-messenger astrophysics (MMA) reconstruction from 6 or 7 hours to a consistent 1 hour.
- Ran a large-scale test of distributed computation across multiple providers, regions, and GPU hardware models, creating a pool of >50,000 GPU cores for 1 hour.
- Identified that “social engineering” (building relationships and excitement about the project) was required to lift restrictions on access to large amounts of resources.
Related presentations:
- Workshop presentation recording
- Final Report

This Exploring Clouds for Acceleration of Science (E-CAS) project will exploit computational power and network connectivity to provide a world-scalable solution for generating building-level information for urban canopy parameters and for improving the information used to estimate local climate zones, both of which are critical to high-resolution urban meteorological and environmental models. Current computational models have a bottleneck, not just in the physics and processes within the land surface and boundary layer schemes but, even more critically, in the need for a robust means of generating the parameter values that define the urban landscape. This is where the proposed E-CAS inverse modeling approach comes into play. By utilizing images and worldwide input about building properties, we can infer a sampling of 3D building models at world scale that contain more than just geometric shape information, enabling world-scale urban weather modeling.

Key achievements:
- Ported Photo2Building (P2B) tool to containers with a template for accelerated photogrammetry and object recognition using GPU accelerated neural networks.
- Created a cloud deployment of P2B to allow submission and processing of building images by the public (project collaborators).
- Deployed WRF climate modeling software in the cloud using containers but found poor scaling, with performance degrading when using > 240 CPUs. This needs further investigation into inter-node latency. Logs indicate that MPI wait time was significant and that CPU density per node was an important factor. Placement groups were not used.

BioCompute Objects allow researchers to describe bioinformatic analyses composed of any number of algorithmic steps and variables, making computational experimental results clearly understandable and easier to repeat. Galaxy is a widely used bioinformatics platform that aims to make computational biology accessible to research scientists who do not have programming experience. The project will create a library of BioCompute Objects that describe bioinformatic workflows on Amazon Web Services, which Galaxy users from all over the world can access and contribute to. This project also plans to use AWS Direct Connect over Internet2 to connect the library of BioCompute Objects to the campus HPC environment at George Washington University.

Key achievements:
- Created a Galaxy instance in the AWS cloud, and modules to create BioCompute Objects (BCOs).
- Created BioComputeDB and a portal for the outside community to submit, store, annotate, and validate BCOs.
- Populated BioComputeDB with eight sample pipelines.
- Created cost estimation tools for BioCompute pipelines.
- Recruited users from 10+ pharma companies, 12+ global research universities, the NIH, and the FDA. Attracted additional grants from the NIH and FDA.