Although genome sequencing costs have dropped dramatically over the past few years, analyzing large amounts of genomic data remains expensive. As the scale of genomic projects continues to grow, cost-efficient bioinformatic analysis is key to gaining insight from the estimated 100 million to 2 billion human genomes that will be sequenced by 2025.
One way to lower bioinformatic analysis costs is to efficiently use the computing resources of the cloud. In particular, analysis costs can be significantly reduced by running tasks with Spot instances on the Seven Bridges Platform.
Spot Instances Can Reduce Analysis Costs by up to 75%
On the cloud, analysis tasks are run on different computation instances based on the amounts of CPU, RAM, and storage required. By default, tasks on the Seven Bridges Platform run on AWS On-Demand instances. Alternatively, users can choose to run tasks on Spot instances, which use spare AWS computing capacity and come at a significant discount.
To illustrate how much Spot instances can reduce the cost of bioinformatic analyses, we ran our combined BWA and GATK whole exome analysis workflow on sequencing read files of different sizes. Running the workflow on Spot instances provided substantial cost savings across all read file sizes, with an average cost reduction of 75%*. Absolute cost differences varied for larger input files.
A cost reduction of this magnitude becomes significant when scaling up to large, complex analyses. For example, we analyzed four RNA-Seq paired-end samples with our Trinity workflow for RNA-Seq assembly and analysis. During this analysis, we used an r3.8xlarge instance for the de novo assembly and one c4.8xlarge instance per sample to align reads and estimate transcript abundance. When On-Demand instances were used, the cost for the entire analysis was $101.47. In contrast, the analysis cost was only $23.75 when Spot instances were used.
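The Trinity figures above can be turned into a relative saving with a few lines of arithmetic. The following is a minimal sketch in Python. The function name `percent_savings` is ours for illustration and is not part of any Seven Bridges tooling, and the only inputs are the two dollar amounts reported above.

```python
def percent_savings(on_demand_cost: float, spot_cost: float) -> float:
    """Return the cost reduction of Spot relative to On-Demand, in percent."""
    if on_demand_cost <= 0:
        raise ValueError("on_demand_cost must be positive")
    return (on_demand_cost - spot_cost) / on_demand_cost * 100.0


# Costs reported above for the four-sample Trinity RNA-Seq analysis (USD).
trinity_on_demand = 101.47
trinity_spot = 23.75

absolute_savings = trinity_on_demand - trinity_spot
relative_savings = percent_savings(trinity_on_demand, trinity_spot)

print(f"Saved ${absolute_savings:.2f} "
      f"({relative_savings:.1f}% less than On-Demand)")
```

For this particular analysis the relative saving works out to a little over 76%, in line with the average reduction observed for the whole exome workflow.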
While there is a significant cost difference between using On-Demand and Spot instances, the instances are identical in terms of computational resources. Because of this, there is no difference in task execution time, barring an instance interruption.
Best Practices for Using Spot Instances
An important consideration when using Spot instances is that AWS can interrupt them while tasks are running. If a Spot instance is interrupted, job retry functionality automatically restarts in-progress and remaining unfinished jobs on an On-Demand instance.
Although an interruption does not affect the reliability of task execution, it can reduce the cost savings from using a Spot instance and result in a longer overall runtime. For example, a 102.8 GB read file was analyzed with our whole exome analysis pipeline, using a c4.4xlarge Spot instance. Without interruption, the analysis cost $1.78 and took 8 hours and 52 minutes to complete. When there was a Spot instance interruption, the analysis cost $8.56 and ran for a total of 11 hours and 6 minutes. For reference, the analysis would have cost approximately $9.55 if performed using an On-Demand instance. To take full advantage of the potential cost savings from Spot instances, users should choose instances with a low risk of interruption.
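To see why the risk of interruption matters, the three costs above can be combined into a rough expected cost. The sketch below is illustrative only: the interruption probabilities are hypothetical values chosen for the example, not figures from AWS or from our runs, and the model assumes an interrupted task always costs the $8.56 observed in this one case. In practice, the cost of an interrupted run depends on when the interruption happens.

```python
from datetime import timedelta

# Costs reported above for the 102.8 GB whole exome analysis (USD).
ON_DEMAND_COST = 9.55          # approximate On-Demand reference cost
SPOT_COST = 1.78               # Spot, no interruption
SPOT_INTERRUPTED_COST = 8.56   # Spot, with one interruption


def percent_savings(on_demand_cost: float, other_cost: float) -> float:
    """Cost reduction relative to On-Demand, in percent."""
    return (on_demand_cost - other_cost) / on_demand_cost * 100.0


def expected_spot_cost(p_interrupt: float) -> float:
    """Expected cost of a Spot run for a given interruption probability.

    Simplifying assumption: an interrupted run always costs
    SPOT_INTERRUPTED_COST, regardless of when the interruption occurs.
    """
    if not 0.0 <= p_interrupt <= 1.0:
        raise ValueError("p_interrupt must be between 0 and 1")
    return (1.0 - p_interrupt) * SPOT_COST + p_interrupt * SPOT_INTERRUPTED_COST


# Runtime penalty of the interrupted run reported above.
extra_runtime = (timedelta(hours=11, minutes=6)
                 - timedelta(hours=8, minutes=52))
print(f"Extra runtime after interruption: {extra_runtime}")

# Hypothetical interruption probabilities, for illustration only.
for p in (0.05, 0.20, 0.50):
    cost = expected_spot_cost(p)
    print(f"p={p:.2f}: expected cost ${cost:.2f}, "
          f"{percent_savings(ON_DEMAND_COST, cost):.1f}% below On-Demand")
```

Under these assumptions the interrupted run still costs less than On-Demand, but the saving shrinks from roughly 81% to roughly 10%, which is why instance types with a low risk of interruption are the better choice.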
In general, Spot instance types that are in lower demand have both a lower price and a lower risk of interruption. To better determine the risk of interruption for specific Spot instance types, we recommend using the AWS Spot Instance Advisor. For example, the tool shows that for c3.2xlarge and c3.8xlarge instances, the probabilities of interruption over the course of a week are low and medium, respectively.
We recommend that users of the Seven Bridges Platform switch to Spot instances to easily save on computing costs. Spot instances can be enabled as a global default for a project or on a task-by-task basis. More information is available in the Knowledge Center.
Our Commitment to Optimizing Data Analysis
Taking advantage of Spot instances is just one of the many ways that Seven Bridges is bringing down bioinformatic analysis costs, in order to improve researchers’ ability to analyze and gain insight from large-scale genomic data.
*Cost savings will vary depending on file size and timing selected.