CIPRES is a web portal that allows scientists around the world to analyze DNA and protein sequence data to determine the natural history of a group or groups of living things. For example, one can ask where mammals originated, how Ebola virus spreads, whether a given plant is a new species or an unwelcome imported one, or how a given species interacts with other species and its environment over long periods of time.
CIPRES helps answer these kinds of questions by providing access to parallel phylogenetics codes run on large HPC clusters provided by the NSF XSEDE program. CIPRES currently runs analyses for about 12,000 scientists per year, and that number is growing each year. CIPRES accelerates research by increasing each researcher’s throughput: individual jobs run faster with parallel codes, and users can run many jobs simultaneously on large clusters. For example, CIPRES provides access to P100 GPUs that can speed up some jobs by 100-fold relative to a single-core run. But GPUs are in short supply in the XSEDE portfolio, so usage must be strictly limited. This project will develop the infrastructure needed to cloudburst CIPRES jobs to newer, faster V100 GPUs at AWS. As a result, individual jobs will run up to 1.5-fold faster, and users will have access to twice as many GPU nodes as they did in the previous year. The infrastructure created will also open the door to scalable access to AWS cloud resources through CIPRES for all users.
Key achievements:
- Developed a method of bursting CIPRES jobs into AWS to reduce queue wait time and job processing time.
- Used Internet2 Cloud Connect so that data, including intermediate results and checkpoints, could stay onsite at SDSC (West Coast) while jobs ran on the newest GPU hardware available at AWS in Ashburn (East Coast).
- Ran 812 jobs on AWS using approximately 25,000 GPU hours and 85,000 CPU hours, benefiting many end-user studies in phylogenetics, including models and variants of SARS-CoV-2.
- Saw speedups of up to 1.4x using V100 GPUs in AWS compared to P100 GPUs on campus, and of up to 48x compared to Haswell CPUs. This exceptional speedup means large analyses that would require months on CPUs can be completed in days in AWS.
- Highlighted some performance issues on intermediate-size jobs in the cloud that could not be explained or rectified by the provider. These jobs ran approximately 20% slower than on campus; small and large jobs, however, were faster in the cloud.
- Found the cloud very good for short queue wait times, and good for long-running jobs.
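The "months on CPUs, days in AWS" claim follows directly from the reported speedup factors. The sketch below works through that arithmetic. The 90-day CPU baseline and 10-hour P100 baseline are hypothetical inputs chosen for illustration, not measurements from the project, and the reported factors are upper bounds ("up to"), so actual run times for a given job may be longer.

```python
def accelerated_time(baseline: float, speedup: float) -> float:
    """Return the run time after applying a speedup factor to a baseline run time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline / speedup


# Speedup factors reported in the achievements list (both are "up to" figures).
V100_VS_HASWELL = 48.0  # V100 GPU in AWS vs. Haswell CPUs
V100_VS_P100 = 1.4      # V100 GPU in AWS vs. P100 GPU on campus

# Hypothetical baselines, for illustration only.
cpu_days = 90.0     # an analysis taking about three months on CPUs
p100_hours = 10.0   # a job taking ten hours on an on-campus P100

v100_days = accelerated_time(cpu_days, V100_VS_HASWELL)
v100_hours = accelerated_time(p100_hours, V100_VS_P100)

print(f"{cpu_days:.0f} CPU days -> {v100_days:.3f} days on a V100")
print(f"{p100_hours:.0f} P100 hours -> {v100_hours:.2f} hours on a V100")
```

Under these assumptions, a 90-day CPU analysis would finish in just under two days on a V100 at the best-case 48x speedup.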
Related presentations: