Ion Proton RNA-Seq: in search of the best alignment method

Tweaking technology with informatics

Each sequencing technology comes with a unique and unavoidable error profile due to the chemistry, biology, and hardware involved [1,2,3]. If we want to avoid analysis artifacts and arrive safely at the biological reality underlying the data, we must account for these errors during informatic analysis. Recently, researchers have begun tackling bioinformatic problems specific to the semiconductor sequencing technology employed by the Ion Torrent PGM and Ion Proton sequencers [4, 5].

Aligning Ion Torrent RNA-Seq data

This past October, at the Ion World conference, we presented research comparing the ability of four bioinformatic protocols to align RNA-Seq reads generated by the Ion Proton. We found that using two aligners in sequence, STAR then Bowtie 2, yielded the best alignment, and we were pleased to see our findings independently confirmed in a poster presented by a Senior Bioinformatics Engineer from Life Technologies, Yongming Sun.

You can take a look at the full poster, or read the summary below. If you’re running RNA-Seq on a Proton, or considering doing so, we hope you find this useful in guiding your analysis. Don’t hesitate to post any comments or questions below, or send them our way.

The setup

Life Technologies has provided guidance on how to align Ion Proton RNA-Seq reads using open-source tools in a technical note, “Two Step Alignment Method for Ion Proton Transcriptome Data” [6]. This protocol aligns reads to a virtual transcriptome using TopHat2, and then takes any unmapped reads and aligns them to the genome using Bowtie 2 in local mode [7]. This second step uses “soft clipping” to accommodate homopolymer errors and achieve greater alignment percentages. We used this protocol as a control to test other open-source methods of alignment. We tested alignment using TopHat2 alone, to illuminate the role of Bowtie 2 in aligning reads. We tested alignment using STAR alone, to see how well this proclaimed “universal RNA-Seq aligner” handled Proton reads. And lastly, we created our own modified “two-step” protocol, using STAR followed by Bowtie 2.

Four protocols

  1. Alignment using STAR then alignment of unmapped reads using Bowtie 2 in local mode
  2. Alignment using STAR alone
  3. Alignment using TopHat2 [8] then alignment of unmapped reads using Bowtie 2 in local mode
  4. Alignment using TopHat2 alone
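
As a rough illustration of how protocol 1 chains the two aligners, here is a minimal Python sketch that only builds the two command lines. The function name, file names, index paths, and thread count are hypothetical, and the options shown (`--outReadsUnmapped Fastx` for STAR, `--local` for Bowtie 2) are standard flags of those tools rather than the exact parameters used in the poster.

```python
def build_two_step_commands(reads_fastq, star_index_dir, bowtie2_index,
                            out_dir, threads=4):
    """Build (star_cmd, bowtie2_cmd) argument lists for a STAR -> Bowtie 2 run.

    All paths are placeholders chosen for illustration.
    Nothing is executed here; pass each list to subprocess.run to run it.
    """
    prefix = out_dir.rstrip("/") + "/star_"

    # Step 1: spliced alignment with STAR. "--outReadsUnmapped Fastx" asks
    # STAR to write reads it could not align to <prefix>Unmapped.out.mate1.
    star_cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", star_index_dir,
        "--readFilesIn", reads_fastq,
        "--outReadsUnmapped", "Fastx",
        "--outFileNamePrefix", prefix,
    ]
    unmapped_reads = prefix + "Unmapped.out.mate1"

    # Step 2: give the leftovers to Bowtie 2 in local mode, which may
    # soft-clip read ends instead of requiring an end-to-end match.
    bowtie2_cmd = [
        "bowtie2",
        "--local",
        "-p", str(threads),
        "-x", bowtie2_index,
        "-U", unmapped_reads,
        "-S", prefix + "bowtie2_local.sam",
    ]
    return star_cmd, bowtie2_cmd


if __name__ == "__main__":
    star_cmd, bowtie2_cmd = build_two_step_commands(
        "reads.fastq", "star_index", "bowtie2_index/genome", "results")
    print(" ".join(star_cmd))
    print(" ".join(bowtie2_cmd))
```

Merging the two resulting alignment files into one is left out of the sketch. Protocol 3 has the same shape, with TopHat2 in place of STAR in the first step.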

Two datasets

We tested alignment performance on real and simulated reads.

  • Three healthy controls from a tumor-normal dataset sequenced with the P1v2 chip [9]
  • Reads simulated from a human transcriptome using DWGSIM [10]

The results

We did a lot of messing around with parameters on both the real and simulated data. You can find our full report in the poster. But, to cut to the chase, here’s the figure from the killer experiment. This is where we simulated reads from only exonic regions and then mapped them back to the genome with the four protocols. A perfect alignment would put every read back where it came from, in the exonic region.

[Figure: simulated read alignment]

As you can see above, our champion protocol, STAR followed by Bowtie 2, did the best, mapping nearly 90% of reads to exons [11]. “Mismapped reads” are ones that the protocols placed into introns or intergenic regions. “Unmapped reads” are ones that the protocols failed to place anywhere.
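
To make the three categories concrete, here is a small Python sketch of how simulated reads could be scored. It is not the code behind the figure: the function names, the toy coordinates, and the rule that any overlap with an exon counts as "exonic" are all assumptions made for illustration.

```python
from collections import Counter


def classify_read(alignment, exons):
    """Label one read as 'exonic', 'mismapped', or 'unmapped'.

    alignment: None for an unaligned read, else (chrom, start, end), half-open.
    exons: dict mapping chrom -> list of (start, end) half-open intervals.
    Assumption: overlapping any exon is enough to count as exonic.
    """
    if alignment is None:
        return "unmapped"
    chrom, start, end = alignment
    for exon_start, exon_end in exons.get(chrom, []):
        if start < exon_end and exon_start < end:  # intervals overlap
            return "exonic"
    # Aligned, but to an intron or an intergenic region.
    return "mismapped"


def summarize(alignments, exons):
    """Return the fraction of reads falling in each category."""
    counts = Counter(classify_read(a, exons) for a in alignments)
    total = len(alignments)
    return {label: (counts[label] / total if total else 0.0)
            for label in ("exonic", "mismapped", "unmapped")}


if __name__ == "__main__":
    # Toy annotation and alignments, invented for this example.
    exons = {"chr1": [(100, 200), (500, 600)]}
    alignments = [
        ("chr1", 150, 250),  # overlaps the first exon
        ("chr1", 300, 400),  # falls between the two exons
        None,                # aligner found no placement
        ("chr2", 10, 110),   # no exons annotated on this chromosome
    ]
    print(summarize(alignments, exons))
```

Because every simulated read comes from an exon, the "exonic" fraction acts as an upper bound on correct placements, which is the caveat raised in footnote 11.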

Footnotes

  1. Nakamura, K., Oshima, T., Morimoto, T., Ikeda, S., Yoshikawa, H., Shiwa, Y., et al. (2011). Sequence-specific error profile of Illumina sequencers. Nucleic Acids Research, 39(13), e90. doi: 10.1093/nar/gkr344
  2. Minoche, A. E., Dohm, J. C., & Himmelbauer, H. (2011). Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and genome analyzer systems. Genome Biology, 12(11), R112. doi: 10.1186/gb-2011-12-11-r112
  3. In a recent post, Nate gave an overview of the errors associated with Ion Torrent data and highlighted Golan and Medvedev’s efforts to improve the process of inferring sequences from flowgrams.
  4. Dobin, A, Davis, CA, Schlesinger, F, et al. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15–21. doi: 10.1093/bioinformatics/bts635
  5. Golan, D, & Medvedev, P. (2013). Using state machines to model the Ion Torrent sequencing process and to improve read error rates. Bioinformatics. doi: 10.1093/bioinformatics/btt212
  6. Two Step Alignment Method for Ion Proton Transcriptome Data. Accessed 1 March 2013 http://ioncommunity.lifetechnologies.com/docs/DOC-7062
  7. Langmead, B, & Salzberg, SL. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357–359. doi: 10.1038/nmeth.1923
  8. Kim, D, Pertea, G, Trapnell, C, et al. (2013). TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology, 14(4), R36. doi: 10.1186/gb-2013-14-4-r36
  9. Ion Proton™ Whole Transcriptome RNA-Seq using Breast Cancer Cell Lines. http://ioncommunity.lifetechnologies.com/docs/DOC-8125
  10. Whole Genome Simulation. http://sourceforge.net/apps/mediawiki/dnaa/index.php?title=Whole_Genome_Simulation.
  11. We realize that some reads counted as exonic may, in fact, not be correctly mapped, but we have no reason to believe that this error wouldn’t affect the four protocols evenly.