Reducing bioinformatic analysis costs with AWS Spot instances


Although genome sequencing costs have dropped dramatically over the past few years, analyzing large amounts of genomic data remains expensive. As the scale of genomic projects continues to grow, cost-efficient bioinformatic analysis is key to gaining insight from the estimated 100 million to 2 billion human genomes that will be sequenced by 2025.

One way to lower bioinformatic analysis costs is to use cloud computing resources efficiently. In particular, on the Amazon Web Services (AWS) deployment of the Seven Bridges Platform, as well as on the Cancer Genomics Cloud and Cavatica, analysis costs can be significantly reduced by running tasks on Spot instances.

An estimated 100 million to 2 billion human genomes will be sequenced by 2025, representing 2–40 exabytes of data. Image from Stephens et al. PLoS Biol. 2015 / CC-BY-4.0

Spot instances can reduce analysis costs by 75%

On the cloud, analysis tasks are run on different computation instances based on the amounts of CPU, RAM, and storage required. By default, tasks on Seven Bridges environments run on AWS On-Demand instances. Alternatively, users can choose to run tasks on Spot instances, which are spare AWS computing capacity that comes at a significant discount. Seven Bridges has moved to support these more cost-effective instances since AWS updated HIPAA compliance in May 2017.

To illustrate how Spot instances reduce the cost of bioinformatic analyses, we ran our combined BWA and GATK whole exome analysis pipeline on sequencing read files of different sizes. Running the pipeline on Spot instances provided substantial cost savings across all read file sizes, with an average cost reduction of 75%. Absolute cost differences were greater for larger input files.

We ran a whole exome analysis pipeline on four sequencing read files of different sizes, using the default c4.2xlarge instance. The average cost savings from using a Spot instance instead of an On-Demand instance was 75%.

A cost reduction of this magnitude becomes significant when scaling up to large, complex analyses. For example, we analyzed four RNA-Seq paired-end samples with our Trinity pipeline for RNA-Seq assembly and analysis. We used an r3.8xlarge instance for the de novo assembly and one c4.8xlarge instance per sample to align reads and estimate transcript abundance. When we used On-Demand instances, the cost for the entire analysis was $101.47. In contrast, the analysis cost was only $23.75 when we used Spot instances.
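The relative savings implied by the Trinity figures quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `percent_savings` is our own and is not part of any Seven Bridges or AWS tooling, and the two dollar amounts are the ones reported in the text.

```python
def percent_savings(on_demand_cost: float, spot_cost: float) -> float:
    """Return the percentage saved by paying spot_cost instead of on_demand_cost."""
    if on_demand_cost <= 0:
        raise ValueError("on_demand_cost must be positive")
    return 100.0 * (on_demand_cost - spot_cost) / on_demand_cost


# Costs reported in the text for the four-sample Trinity RNA-Seq analysis (USD).
TRINITY_ON_DEMAND = 101.47
TRINITY_SPOT = 23.75

trinity_savings = percent_savings(TRINITY_ON_DEMAND, TRINITY_SPOT)
print(f"Trinity analysis savings: {trinity_savings:.1f}%")
```

For these figures the saving works out to roughly 77%, in line with the 75% average observed for the whole exome pipeline.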

While there is a significant cost difference between using On-Demand and Spot instances, the instances are identical in terms of computational resources. Because of this, there is no difference in task execution time, barring an instance interruption.

Best practices for using Spot instances

An important consideration when using Spot instances is that AWS can interrupt them while tasks are running. If a Spot instance is interrupted, Seven Bridges’ job retry functionality automatically restarts in-progress and remaining unfinished jobs on an On-Demand instance.

Although an interruption does not affect the reliability of task execution, it may impact the cost savings from using a Spot instance and can result in a longer overall runtime. For example, we analyzed a 102.8 GB read file with our whole exome analysis pipeline, using a c4.4xlarge Spot instance. Without interruption, the analysis cost $1.78 and took 8 hours and 52 minutes to complete. When there was a Spot instance interruption, the analysis cost $8.56 and ran for a total of 11 hours and 6 minutes. For reference, the analysis would have cost approximately $9.55 if performed using an On-Demand instance. In order to take advantage of the potential cost savings from using Spot instances, users should choose instances with a low risk of interruption.
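To make the trade-off in this example concrete, the short Python sketch below compares the three reported costs and estimates an expected Spot cost for a given interruption probability. The costs and runtimes are taken from the text; the function `expected_spot_cost` and the probability value passed to it are hypothetical, and the model assumes an interrupted run always costs the single interrupted figure reported here, which will not hold in general.

```python
# Figures reported in the text for the 102.8 GB whole exome run (USD, minutes).
SPOT_UNINTERRUPTED_COST = 1.78
SPOT_INTERRUPTED_COST = 8.56
ON_DEMAND_COST = 9.55  # approximate, as stated in the text
UNINTERRUPTED_MINUTES = 8 * 60 + 52
INTERRUPTED_MINUTES = 11 * 60 + 6


def percent_savings(on_demand_cost: float, spot_cost: float) -> float:
    """Percentage saved relative to the On-Demand cost."""
    return 100.0 * (on_demand_cost - spot_cost) / on_demand_cost


def expected_spot_cost(p_interrupt: float) -> float:
    """Expected cost if a run is interrupted with probability p_interrupt.

    Simplifying assumption: every interrupted run costs SPOT_INTERRUPTED_COST.
    """
    if not 0.0 <= p_interrupt <= 1.0:
        raise ValueError("p_interrupt must be between 0 and 1")
    return (1.0 - p_interrupt) * SPOT_UNINTERRUPTED_COST + p_interrupt * SPOT_INTERRUPTED_COST


savings_uninterrupted = percent_savings(ON_DEMAND_COST, SPOT_UNINTERRUPTED_COST)
savings_interrupted = percent_savings(ON_DEMAND_COST, SPOT_INTERRUPTED_COST)
extra_minutes = INTERRUPTED_MINUTES - UNINTERRUPTED_MINUTES

print(f"Savings without interruption: {savings_uninterrupted:.1f}%")
print(f"Savings with interruption:    {savings_interrupted:.1f}%")
print(f"Extra runtime when interrupted: {extra_minutes} minutes")
# 0.1 is a hypothetical interruption probability, not a measured value.
print(f"Expected cost at p=0.1: ${expected_spot_cost(0.1):.2f}")
```

In this particular example even the interrupted run cost less than the On-Demand estimate, but the saving shrank from about 81% to about 10%, and the run took 2 hours and 14 minutes longer.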

In general, Spot instance types have a lower price and risk of interruption if they are in less demand. To better determine the risk of interruption for specific Spot instance types, we recommend using the AWS Spot Bid Advisor. For example, the tool shows that for c3.2xlarge and c3.8xlarge instances, the probabilities of interruption over the course of a week are low and medium, respectively.

The AWS Spot Bid Advisor tool shows the interruption frequency of each Spot instance type, as well as an estimate of cost savings over On-Demand instances. Image from aws.amazon.com/ec2/spot/bid-advisor/

We recommend that users of the Seven Bridges Platform (AWS deployment), Cancer Genomics Cloud, and Cavatica switch to Spot instances to save on computing costs. Spot instances can be enabled as a global default for a project or on a task-by-task basis. More information is available in the Knowledge Center.

Our commitment to optimizing data analysis

Taking advantage of Spot instances is just one of the many ways that Seven Bridges is bringing down bioinformatic analysis costs, in order to improve researchers’ ability to analyze and gain insight from large-scale genomic data.

Contact us to learn more about Spot instances and our other optimizations that can reduce your analysis costs.