Better sequences (and fewer homopolymer errors) for Ion Torrent


The short read files that Ion Torrent’s sequencing machines give us still contain many homopolymer errors: errors in the number of bases called when a single nucleotide occurs several times in a row. This makes alignment harder and drowns real indels in a sea of noise. These errors are a consequence of the technology, but that technology has both a hardware and a software component. While most of us must simply wait for better hardware (sequencing machines), many of us can work right now to improve the software. Indeed, David Golan and Paul Medvedev already have.

The Homopolymer Error Problem

Ion Torrent machines, like other sequencers, do not directly detect nucleotides. Rather, they apply a series of nucleotide washes to nucleic acid fragments, recording the pH level at every wash. The pH level is relevant because a hydrogen ion is released each time a nucleotide is incorporated into the growing strand; thus we can infer how many nucleotides (if any) were incorporated from those pH levels. Those pH data are recorded as “flowgrams.” What happens next depends on the machine and the setting:

  • The machine may output a .sff file describing those flowgrams.

  • It may directly output a sequence read file in FASTQ format.

  • It may give us a different sequence file–commonly, a BAM file that does not contain alignment information but does contain data listing sequences and quality scores.

Whichever of these happens, two facts are important: first, it is the sequence files (FASTQ or BAM) that will be used in downstream analysis; second, what is in those files is inferred, not translated, from what the sequencing machine directly measures. The flowgrams are continuous measurements of acidity, whereas nucleotide sequences are series of letters, each representable with two bits (or up to four, if you count not just A, C, G, and T but also ambiguity codes such as N and R). The structure of the latter information is fundamentally different from that of the former, which means that how we get one from the other is a substantive scientific question, not a matter of bookkeeping.

Enter Golan and Medvedev

This raises a question: Is this information being converted well or badly? It’s an important question, but almost nobody is asking it. That’s not because we’re lazy, but rather because we are collectively breaking ground in a big new discipline. Given that we still quite often see fundamental advances in alignment, it’s not a big surprise that our collective attention has not yet turned to details of inferring sequences from flowgrams.

Whether or not we’re doing this job perfectly, at least we’re doing it quickly: right now those conversions happen with an algorithm that can easily be performed in linear time (in the length of the sequence), and that we learn in elementary school: it’s done by rounding. This might seem naive, but for many flowgram values it works unambiguously well. Once you normalize the pH data, many of the values cluster around small integers, and it will usually be a good idea to interpret values very close to n as the incorporation of n nucleotides.
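To make the rounding idea concrete, here is a minimal sketch in Python. It is an illustration of the approach described above, not any vendor's actual base caller. The function name, the four-base flow order "TACG", and the sample values are all assumptions made for the example.

```python
import math


def call_bases_by_rounding(flow_order, flow_values):
    """Naive base caller: round each normalized flow value to an integer.

    flow_order  -- bases washed over the chip, repeated cyclically
                   (the "TACG" cycle used below is an assumption)
    flow_values -- one normalized signal per wash
    """
    read = []
    for i, value in enumerate(flow_values):
        base = flow_order[i % len(flow_order)]
        # Round half up, and never call a negative number of bases.
        n = max(0, math.floor(value + 0.5))
        read.append(base * n)
    return "".join(read)


# Hypothetical normalized signals for two cycles of T, A, C, G washes.
signals = [1.02, 0.05, 2.10, 0.97, 0.03, 1.45, 0.0, 0.9]
print(call_bases_by_rounding("TACG", signals))  # TCCGAG
```

The sixth value, 1.45, is called as a single A simply because it falls below 1.5. The rule takes one pass over the data and looks at each value in isolation.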

It’s less clear, though, that a normalized value of (say) 1.45 indicates the incorporation of one, and not two, nucleotides. Deciding what to do with these values matters, especially given the tendency of sequencers to give fuzzier and fuzzier data as the read gets longer and longer. This makes for false indels and wrong homopolymer lengths. Here Golan and Medvedev, in a recent Bioinformatics paper, urge us to think more carefully about the situation.

Notice how bases cluster less tightly around integer values as the read length increases. (Modified figure from Golan and Medvedev 2013.)

We have more information than we might first realize when we look at the flowgram data. When we’re deciding how to interpret an ambiguous flowgram measurement, it’s not just that numerical value that is relevant to the decision. An a priori judgment is available given the results of previous washes. Suppose the previous step was a T wash; however many nucleotides were incorporated, it was (if all went well) all of the Ts that could have been incorporated, so T won’t be the next nucleotide. Moreover, several washes in a row with no incorporation can give us even more specific information: if we wash a strand with A and T successively, and neither sticks, the probability that the next nucleotide is G is not 25% but approximately 50%, because only G and C remain as possibilities. This in turn might be enough for us to look at a normalized pH value of 0.42 and infer that an incorporation did occur, even though rounding would say otherwise.
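The bookkeeping behind that a priori judgment can be sketched in a few lines of Python. This is a toy model, not the one in the paper. It assumes that the bases still possible are equally likely, and that the signal follows Gaussian noise of equal spread around 0 and 1. The function names and the value of `sigma` are invented for the illustration.

```python
import math

BASES = "ACGT"


def candidate_bases(history):
    """Return the set of bases that could come next in the read.

    history is a list of (washed_base, incorporated) pairs, oldest first.
    """
    candidates = set(BASES)
    for base, incorporated in history:
        if incorporated:
            # A successful wash consumes the whole homopolymer, so every
            # base except the one just washed becomes possible again.
            candidates = set(BASES) - {base}
        else:
            # A failed wash rules that base out until something incorporates.
            candidates.discard(base)
    return candidates


def incorporation_prior(history, next_wash):
    """Prior probability that next_wash incorporates, assuming the
    remaining candidate bases are equally likely (a simplification)."""
    candidates = candidate_bases(history)
    if next_wash not in candidates:
        return 0.0
    return 1.0 / len(candidates)


def posterior_incorporation(signal, prior, sigma=0.25):
    """Posterior probability of one incorporation versus none, under an
    assumed Gaussian noise model centred on 0 and 1 with equal spread."""
    like_zero = math.exp(-(signal - 0.0) ** 2 / (2 * sigma ** 2))
    like_one = math.exp(-(signal - 1.0) ** 2 / (2 * sigma ** 2))
    evidence = prior * like_one + (1 - prior) * like_zero
    return prior * like_one / evidence


# A and T were washed and neither stuck, so only C and G remain.
history = [("A", False), ("T", False)]
prior = incorporation_prior(history, "G")
print(prior)  # 0.5
print(posterior_incorporation(0.42, 0.25), posterior_incorporation(0.42, prior))
```

In this toy model the posterior for a signal of 0.42 rises as the prior moves from 25% to 50%. Whether it rises far enough to change the call depends on the noise model assumed.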

Golan and Medvedev begin with such observations and end with a very clever method for deducing–very quickly, thanks to a Viterbi technique–a sequence of nucleotides from a set of flowgram data. We won’t give all the details here, because they’ve already done it so well, but their idea is to view the whole series of washes as a long path through the space of sets of possible nucleotides. Before every wash a set of nucleotides is possible, given the results of previous washes. Whether the result of that wash represents an incorporation changes that set of possibilities accordingly. Golan and Medvedev’s technique can be viewed as a way to transform the “What nucleotide series does this set of flowgram data represent?” question into a different one: “Which path through possible-nucleotide-space does this set of flowgram data represent?” Thinking in terms of paths allows us to exploit the not-entirely-local nature of incorporation data–that is, the way that previous incorporations or non-incorporations change the a priori likelihoods of future incorporations. Moreover, once we have such a path, it is a short step to translate it into the series of nucleotides.
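The path view can be illustrated with a heavily simplified Viterbi sketch in Python. This is not Golan and Medvedev's model, which is considerably richer; it is a toy built on stated assumptions. A state is the set of bases that could come next. The remaining candidates are taken to be equally likely, homopolymer lengths follow the distribution of a uniformly random sequence, and noise is Gaussian with a fixed `sigma`. Every name and parameter below is hypothetical.

```python
import math

BASES = frozenset("ACGT")
MAX_HP = 8  # longest homopolymer considered (an arbitrary cap)


def log_gauss(x, mean, sigma):
    """Gaussian log-likelihood up to a constant (sigma is fixed)."""
    return -((x - mean) ** 2) / (2 * sigma ** 2)


def transitions(state, base):
    """Yield (n, log_prior, next_state) for washing `base` in `state`."""
    if base not in state:
        # This base was already ruled out, so nothing can incorporate.
        yield 0, 0.0, state
        return
    p_inc = 1.0 / len(state)
    if p_inc < 1.0:
        # No incorporation: the washed base leaves the candidate set.
        yield 0, math.log(1.0 - p_inc), state - {base}
    for n in range(1, MAX_HP + 1):
        # Run length n in a uniformly random sequence: (1/4)^(n-1) * (3/4).
        log_len = (n - 1) * math.log(0.25) + math.log(0.75)
        # After an incorporation, any base but this one may follow.
        yield n, math.log(p_inc) + log_len, BASES - {base}


def viterbi_call(flow_order, signals, sigma=0.25):
    """Return the read on the best-scoring path through candidate sets."""
    # best[state] = (log score, homopolymer length called at each flow)
    best = {BASES: (0.0, [])}
    for i, x in enumerate(signals):
        base = flow_order[i % len(flow_order)]
        nxt = {}
        for state, (score, calls) in best.items():
            for n, log_prior, new_state in transitions(state, base):
                s = score + log_prior + log_gauss(x, n, sigma)
                if new_state not in nxt or s > nxt[new_state][0]:
                    nxt[new_state] = (s, calls + [n])
        best = nxt
    _, calls = max(best.values(), key=lambda t: t[0])
    return "".join(
        flow_order[i % len(flow_order)] * n for i, n in enumerate(calls)
    )


print(viterbi_call("TACG", [1.02, 0.05, 2.10, 0.97, 0.03, 1.0, 0.0, 0.9]))
print(viterbi_call("TACG", [0.0, 0.0, 0.0, 0.3]))
```

On clean signals this toy agrees with rounding. In the second call, three zero-signal washes leave G as the only possible base, so the weak G signal of 0.3 is called as a single G, where rounding would call nothing.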

Again, check out the paper–the exposition is quite clear, there is a useful discussion of homopolymer modeling, their results are impressive, and it’s an elegant application of state machines and Viterbi algorithms. The more discrete mathematics you have, the better off you’ll be, but anyone with some college-level mathematics and a willingness to Google will find plenty to enjoy. More than anything, though, the paper is a demonstration that there are big questions everywhere in bioinformatics. With some logic and undergraduate-level biochemistry, Golan and Medvedev raise good questions; with fancier techniques, they make some progress solving them.

References

Golan, D., & Medvedev, P. (2013). Using state machines to model the Ion Torrent sequencing process and to improve read error rates. Bioinformatics, 29(13), i344–i351. doi:10.1093/bioinformatics/btt212