Algorithms for Bioinformaticians

Uncategorized
Back to Blog

Algorithms for Bioinformaticians

One opportunity that comes from working in an infant discipline is that of being able to examine and influence its basic tools. You are more likely to fundamentally improve alignment or assembly than you are to improve the foundations of thermodynamics. And given the state of bioinformatics software, it is not always much harder to rewrite a tool yourself than it is to install and use an existing one. (Even if using a cloud platform is–::wink::–far easier than both.)

If you are curious about the guts of the tools we use, and if you have a taste for the fundamental, you will spend time thinking about bioinformatics algorithms, and this might well lead you to a more general study of algorithms. When you get to that point, though, you might not know what to study. Although high-quality information about algorithms is abundant, little of it is geared explicitly to bioinformaticians–and there’s so much of it that you might not know where to start.

Here is a hopefully useful, if necessarily incomplete, summary of the resources that are out there and how you might use them.

Free online courses on algorithms

Tim Roughgarden’s two-part course on Coursera is, in my view, the current place to begin among algorithm MOOCs. It is fast-moving, but not at the expense of clear, memorable exposition. It’s also agnostic among programming languages, which is very good for bioinformaticians: we have tools in many languages, ranging from C (in which, as of this writing, 83.8% of Tabix and 78.8% of SAMtools are written) to Python and beyond. The exercises are very useful–sometimes you simply have to create a viable implementation of the algorithm you just learned, and sometimes you have to optimize the algorithms to tackle much tougher data sets.

Wayne and Sedgewick’s Coursera course is also a valuable resource, but perhaps less useful for bioinformaticians than Roughgarden’s. Although they claim that Java is merely an expository language, the student does spend a lot of time worrying about details of Java. This can help those who want to examine the guts of Picard and GATK, but it might be an unwelcome distraction for everyone else. That said, it contains excellent and deep discussions of union-find and other methods, and it ought to usefully supplement material you find elsewhere.

Bill Howe’s introduction to data science is not a course in algorithms proper, but it does cover, e.g., clustering algorithms and decision-tree-construction algorithms. Moreover, even what is not strictly about algorithms is still relevant to the manipulation of large data sets, and the data bioinformaticians work with is big and only getting bigger (by eric). Howe includes valuable insight on which techniques are underrated, which are more broadly applicable than most practitioners realize, which will become more relevant in the near future, and which to use if you have a data set of a given type. The exercises are well-chosen, instructive, and often fun.

Guttag and Grimson’s EdX course is an introduction to programming and not an algorithms course, but it does cover algorithmic topics including asymptotic notation, basic algorithms, hashing, decision trees, and depth- and breadth-first searching. Moreover, as an introduction, it’s first-rate. It treats computer science as a subfield of epistemology: the focus is not on hardware architecture but on the basics of how to use computers to think about the world, evaluate beliefs, and extend knowledge. It therefore cultivates a crucial modern aspect of good scientific judgment.

Searching MIT’s OpenCourseWare for algorithms courses can keep anyone busy for a while. I have worked through a good part of Leiserson and Demaine’s introduction to algorithms, and found it rigorous and excellent, even if somewhat less attention went into the production of the lecture videos than a MOOC veteran might have come to expect.

Last fall there was a bioinformatics-specific algorithms course on Coursera. They have not announced a date when they will repeat the course, but a syllabus [pdf] that would make a nice study guide or checklist is still available, as is an online textbook [login required] by Philip Compeau and Pavel Pevzner.

Books on algorithms

The first algorithms book I bought was CLSR’s Introduction to Algorithms. (Price-wise, CSLR is similar to other seminal textbooks: although the newest edition is expensive, previous versions are quite acceptable, and you can find used copies of those for just a few dollars online.) The book is so good that I haven’t looked through any other algorithm-specific textbooks. The exposition is clear, major subjects are covered in detail, and there are many useful exercises. Bioinformaticians will appreciate the extensive treatment of dynamic programming and of search-friendly data structures, even ones (such as radix trees) about which it can be hard to find good information.

Some further reading on algorithms

I’ve written an introduction to short read alignment and a discussion of seeding.

You will want to go to the source and read major academic papers, including Li and Durbin’s BWA article; in particular, section 2 provides an important introduction to prefix tries.

The Wikipedia article on the Smith-Waterman algorithm is a nice place to begin a study of alignment algorithms, especially if you already have some familiarity with dynamic programming (another subject with a good Wikipedia entry).

C. Titus Brown gave a nice talk at PyCon 2011 on using Bloom filters to assemble reads.