Short read alignment: an introduction

Science

Biologists today often find themselves with lots of–say, 10^6–short sequences of DNA from a sample, and their ability to do scientifically useful things with those sequences depends on their ability to align those sequences to a reference sequence. Many of the hard and important projects in genomics either are alignment or depend on alignment.

Short read alignment is a nuanced and rapidly developing subject. This post describes the problem and describes some techniques for solving it.

What alignment is and why it is hard

Short read alignment is the process of figuring out where in the genome a sequence is from. This is tricky for several reasons:

  1. The reference genome is really big. Searching big things is harder than searching small things.
  2. You aren’t always looking for exact matches in the reference genome–or, at least, probably not.

After all, the whole project of collecting the data being usually involves an interest in where that data is different from reference data! But looking for both matches and near-misses makes the search way, way harder. Say your reference sequence is 10^8 bp long and your target sequence is 10^2 bp long. If what we were looking for were an exact match, we’d have to look at approximately 10^8 possibilities. (For each possible match, there is a location where it begins, and it could begin at any of the 10^8 locations except for the last 99.)

Notice, though, that if we’re looking for an exact match or a match that differs in one location base pair, that number goes up to roughly 3 x 10^14. (There are 10^8 starting locations, 10^6 possible locations for the different base pair, and 3 ways in which the base pair could be different.) That exceeds the number of exact matches by a factor of three million, and allowing for the possibility of insertions and deletions would drive that number even higher. That is far too many candidate matches for either existing computers or computers we could reasonably foresee ourselves having any time soon to search for by brute force.

Why alignment is hard but (mostly, qualifiedly) tractable

The trick is to figure out a way to search these enormous sequences for matches and near-matches with a method that doesn’t require exhaustively going through every near-match. This is a difficult task, but we are lucky: these bioinformatic questions intersect with areas that have been intensively studied.

  1. Searching. Alignment requires searching. Efficient searching is a well-studied problem! We have not only knowledge of how to search efficiently but a whole conceptual framework for thinking about efficient searching.
  2. Compression. Working with big sequences requires compression. (Making big files smaller is very helpful.) Compression is a well-studied problem! (The “BW” in “BWA” is for a reversible compression algorithm [Burrows-Wheeler] used far beyond genomics.) This makes it more practically feasible to implement whatever we want to do.
  3. Dynamic programmingAlignment happens to be the sort of problem that a technique called dynamic programming, studied in modern computer science, effectively handles. Dynamic programming is described in the next section:

From slow to medium speed: dynamic programming

The problem of aligning two sequences–e.g., a target sequence and a reference sequence–has optimal substructure. Sometimes we can guarantee that the solution to a large problem entails something about the parts of the solution to the larger problem. The classic example is that of shortest paths: if you know that the shortest path from Orlando to Boston goes through New York, then you know that the shortest path from Orlando to New York is part of the path to Orlando to Boston. You can ignore the larger question (getting from Orlando to Boston) and focus on the smaller one (getting from Orlando to New York), knowing that you’ll be able to construct the solution to the larger question out of a solution to the smaller one.

Notice that in this situation, if you don’t already know which of Boston’s neighbors the shortest path passes through, it becomes very useful to have a list of the shortest paths from Orlando to all of Boston’s neighbors: you can easily do an exhaustive search to figure out the shortest path from Orlando to Boston. How? Although it is artificial to pretend we know from the beginning that the shortest path goes through New York, it is much less contrived to pretend we know that the path goes through some city near Boston. That is why the imaginary list of shortest paths to all the cities near Boston is so useful: you could test them one by one, adding those shortest paths with the distance to Boston, to see which gives the shortest total distance.

Aligning a sequence with a target sequence has a similar structure: the answers to slightly smaller questions allow us to construct answers to bigger questions. Let’s say we’re aligning sequence A with a reference genome, that A is 1,000 bp long, and that B (the reference genome) is 1,000,000 bp long. The key fact is that the optimal alignment has to be yielded by one of three computations, each of which depends only on some trivial arithmetic and the best alignments of smaller parts of the sequences. Suppose we already know the best alignments of all the subsequences of A with all the subsequences of B. We can then look at:

  1. The best alignment of all but the last base pair of A with all but the last base pair of B, and see how adding the last base pairs to each changes the quality of the alignment;
  2. The best alignment of all but the last base pair of A with all of B, and see how adding the last base pair to A changes the quality of the alignment;
  3. The best alignment of all of A with all but the last base pair of B, and see how adding the last base pair to B changes the quality of the alignment.

The precise mechanism by which we see how the addition of the last base pair(s) changes the quality of the alignment depends on several things: which specific algorithm we are using, how we have set certain parameters, and so on. Quite generally, though, we can reason from the “optimal substructure” of the alignment problem–that is, the fact that the something is the best alignment of two whole sequences entails that certain parts of that alignment are also the best alignments of certain of those sequences’ parts–to a process for building up optimal alignments of larger sequences from optimal alignments of shorter sequences.

To summarize: dynamic programming:

  • is much faster than any naive algorithm but not fast enough to use exclusively for alignment
  • is guaranteed to find the best alignment relative to the background assumptions enacted by your scoring scheme
  • works by exploiting the optimal substructure of the alignment problem to generate longer alignments from shorter ones

Even faster: beyond dynamic programming

Dynamic programming is much faster than naive searching algorithms, but if we are willing to make small sacrifices in accuracy we can use algorithms that are even much faster than dynamic-programming algorithms. Under most conditions the accuracy sacrifices are small, so it is standard to use these faster “seed-and-extend” techniques.

  1. First, choose short “words”–subsequences of the target sequences– as seeds. (These are chosen in a complicated and very clever way–e.g., you want unique words so that when they match the reference sequence it’s more likely to mean something. BLAST and FASTA differ, among other ways, in how they choose these words.)
  2. Then, look for those words in the reference sequence. (This is a fast operation.)
  3. When you find a match, try to extend the match from each end to see if you can get an even better match. (This can be done in many ways, hence all the command-line options in alignment software packages like BLAST.)

The best alignment is likely to be yielded by (1)-(3). Notice that:

  1. This requires some luck–there’s usually no guarantee that you have chosen the best words to search with, or that you’ve chosen to stop trying to extend the matches from each end at the right time, etc…
  2. …but it doesn’t require much luck. The nature of the data and the ability to choose various parameters intelligently take most of the luck out of the process. [sets of filters, etc]
  3. Most of the steps in this process can be done very quickly. (1) and (2) are very fast. Step (3) is slower, and involves the dynamic-programming approach outlined in the previous section, but narrowing the search with the seeding process speeds up the whole process drastically.
  4. It is, perhaps, more intuitive than the dynamic programming method. Suppose someone gave you an untitled stack of 50 pages with no formatting or punctuation or spacing, just a bunch of letters one after the other. And say your life depended on finding one of Hamlet’s soliloquies in the pages. (And maybe there were some transcription errors.) You would probably just start skimming through the text for distinctive bits of text (“peasant slave,” “slings and arrows,” whatever). If you managed to match one of these phrases, you’d look to either side of it to see if you could extend it, and so on. This is very close to what modern alignment algorithms do.

Fast enough: combining seeding with dynamic programming

Modern alignment algorithms, for all their diversity, largely share a conceptual core: they use seeding to find potential matches, but they then extend those seeded matches using techniques based on dynamic programming. After a seeding match has been found, we test adjacent bases to see if we can improve the match by extending it forward and backward in the target. A dynamic-programming algorithm is used to assign scores to each possible extension of the alignment. Matches “earn” the alignment points; mismatches, insertions, and deletions lose points for the alignment.

This is a middle way between the infallible but slower method of using dynamic programming from the beginning and the very fast but incomplete method of simply scanning the target for seeds. The round of seeding constrains the search enough to let the search go quickly enough to be practical; the round of extension gives us the precision required to find the best alignment.

[There is now a follow-up to this post that takes a closer look at seeding! Click here to read “Short read alignment: seeding”]

References

Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25.

Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60.

Li H. and Homer N. (2010) A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics 11:473-83.