Short read alignment: seeding

Science

Alignment: a quick review

In my last post I explained some of the basics of short read alignment algorithms. Go read it if you like; if not, recall that:

  1. Many modern alignment algorithms rely on what is called seeding and extending.
  2. “Seeding” is finding exact matches of part of the read with part of the target.
  3. Finding exact matches is very fast, so seeding is very fast.
  4. “Extending” seeds is done with a technique called “dynamic programming,” which is infallible (relative to a scoring scheme) and exploits the fact that optimal alignments can be built up from the optimal alignments of their parts.
  5. Extension is pretty fast, but is much slower than seeding.

Speed vs. accuracy

Working with computers constantly requires us to manage a tradeoff between speed and accuracy–implementing the seed-and-extend method is a nice example of this. We gain speed by skipping directly to the stage when we have a viable proto-alignment: a perfect match of the seed. We lose accuracy because we might miss the best alignment; the best alignment of the read with the target sequence might involve a difference between between the read and the target that is inside the seed. If we require an exact match at the seeding stage, we will miss optimal alignments that have this feature.

Seed and Extend
Figure 1. An illustration of alignment by extension alone (left) and alignment using a seed and extend method (right).

Choosing better seeds

The read, we assume, comes from a location that corresponds to a unique place in the target sequence. Whatever that location is, we want to hit it with the seed; but we don’t want too much noise, so we want it to be as likely as possible that any given hit is actually that uniquely correct one.

Thinking about the speed-accuracy tradeoff helps us understand why the length of the seed matters. As you use longer seeds, you gain speed but lose accuracy. You gain speed for two reasons: you begin your extension with a longer match already in place, so the extension goes more quickly; and you get fewer matches of the seed in the target, so there’s less extension work to do. The loss in accuracy arises from the higher chance that you miss the location where the seed actually comes from.

Figure 2. A seed is a part of a read and comprise just a few base pairs or more (A). The size of seed chosen affects the number of matches found during the seeding stage or alignment. In general, the larger seeds yield fewer exact matches while smaller seeds will match at more locations in the target (B).

Choosing the length of the read correctly is just one way to balance the goals of speed and accuracy. Modern algorithms use some other tricks. One of them is to use nonconsecutive sets of letters as seeds: if your read is TCCGAAGTGCTAA, you might try to match up TCCGAAGTGCTAA instead of TCCGAAGTGCTAA. The second seed will match “AAGT;” the first will match “GA?GT,” where the question mark indicates that anything is allowed.

If genetic sequences were informationally similar to random strings of letters, this wouldn’t help: any four letters would be as good as any other. But genetic information, like English writing, is very much nonrandom. There is structure; the order of the elements matters; some subsequences recur much more often than others. One consequence of this is that spaced seeds tell you more about reads than non-spaced ones (the way that a fingerprint tells you more about the body it comes from than a fingerprint-sized patch of elbow skin would).

An example: spaced vs. non-spaced seeds

Figure 3. Illustration of a continuous seed (A) and a spaced seed (B).

For a simple demonstration of this, I took ten thousand continuous 10-letter sequences from human chromosome 1. They matched an average of 1322 times in the chromosome–this suggests that the average ten-letter non-spaced seed from a read from chromosome 1 will leave you with 1322 extensions to complete and choose from.

I also took over five thousand 10-letter spaced seeds from the same chromosome. My method was simple: I just took 13-letter unspaced seeds and ignored three of the letters (so that 10 letters, as before, would be what the program looked for in the chromosome).

Even this very simple tweak helps us a lot: the average number of hits decreased to 1190: if we use spaced seeds in this context, we have almost 10% fewer hits to worry about, and (under reasonable assumptions) we are just as likely to have found the location we want.

The details of seeding get much more complicated than this, but the goal remains simple: find ways to search that sacrifice as little accuracy as possible but get you closer to your goal faster.