Microsoft Research CABI 2013


This Monday we went to Microsoft’s beautiful NERD building for Microsoft Research’s 2013 conference on Computational Aspects of Biological Information (CABI).

The pastries were good and the presentations were even better. Jim Collins spoke about implementing flip-flops in bacteria, and about the prospects of biocomputation more generally; Pardis Sabeti discussed her lab’s work on pathogenic “footprints” in the human genome, complete with cool visualizations; David Gifford spoke about pioneer factors and their effects on chromatin geometry. A full list of the talks is here.

Beyond the details of individual presentations and posters, a few general issues were hard to miss:


Several speakers discussed phenotyping: how difficult it is to phenotype accurately, how often we get it wrong, and how our current taxonomies of subtypes might be failing us. The most emphatic cautions here came from Doug Lauffenburger, whose work calls into question our current taxonomy of ovarian cancer; Jill Mesirov also discussed the problems in classifying subtypes of MS.

The assembly, maintenance, and accuracy of data sets

Worries about phenotyping relate closely to worries about the accuracy of our data sets. The conference provided ample evidence, if you still needed any, that gathering and cleaning data is very often the majority of the work in this field. Isaac “Zak” Kohane gave a truly remarkable talk about health-care data and metadata; perhaps its most interesting aspect was his discussion of the political and technical challenges of assembling the data. He described his confrontations with the metadata problems and the hurdles to combining Harvard hospitals’ data sets, noting that it’s easier for a Harvard hospital to collaborate with a hospital in San Francisco than with another Harvard hospital.

You really can’t escape philosophy

The general excitement of being in such a rapidly developing field, and the general frustration with the hard parts of data collection and classification–a problem of which phenotyping is a special case–led to many discussions about the philosophy of science. John Quackenbush highlighted these issues in his talk, generally championing the role of data access in scientific breakthroughs and offering some reflections, roughly in the spirit of Nancy Cartwright and Bas van Fraassen, on the epistemology and metaphysics of model-building.

How should we design our models? What are our models, anyway? How much should we rely on our messy, fuzzy phenotype data? How do we responsibly assess and improve the quality of the biggest data sets we’ve ever worked with? Given the urgency of such questions, it was heartening to see not only the exciting research results–Quackenbush’s team’s network diagrams, Martha Bulyk’s team’s results on transcription factor structure, Rameen Beroukhim’s and Ben Raphael’s presentations on cancer–but also the collective attempt to address the hard epistemological questions surrounding bioinformatics.


If you’re interested in reading more, check out these papers by some of the presenters:

1) Murphy S, Churchill S, Bry L, Chueh H, Weiss S, Lazarus R, Zeng Q, Dubey A, Gainer V, Mendis M, Glaser J, Kohane I (2009). Instrumenting the healthcare enterprise for discovery research in the genomic era. Genome Res. 19(9): 1675–1681.

Isaac Kohane and his collaborators used Informatics for Integrating Biology and the Bedside (i2b2), free open-source software that mines electronic health records for phenotypes, to identify patients with common diseases (rheumatoid arthritis, asthma, and major depressive disorder) efficiently and inexpensively.

2) Grossman SR, Shlyakhter I, Karlsson EK, Byrne EH, Morales S, Frieden G, Hostetter E, Angelino E, Garber M, Zuk O, Lander ES, Schaffner S, Sabeti PC (2010). A composite of multiple signals distinguishes causal variants in regions of positive selection. Science 327(5967):883-6.

In a paper that appeared in January 2010, Pardis Sabeti and her colleagues developed the Composite of Multiple Signals (CMS) test to distinguish causal variants within regions of the human genome under positive selection. The CMS method enabled them to drastically reduce the size of candidate regions thought to be under positive selection.

3) Leiserson MDM, Blokh D, Sharan R, Raphael BJ (2013) Simultaneous Identification of Multiple Driver Pathways in Cancer. PLoS Comput Biol 9(5): e1003054.

A “typical” cancer genome might have thousands of benign passenger mutations and just ten or twenty “driver mutations.” Ben Raphael and his colleagues have developed software to help us distinguish driver mutations from benign ones. In a 2013 paper they describe Multi-Dendrix, a tool that infers cancer pathways from mutation data.

4) Berger M, Philippakis A, Qureshi M, He F, Estep P, Bulyk M (2006). Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nature Biotechnology 24(11): 1424-1439.

In this 2006 paper, Martha Bulyk and her colleagues developed a compact, universal protein binding microarray that can be used for determining the binding preferences of transcription factors. This design represents all possible sequences of a given length on a single microarray. Using this design, they determined binding specificities for five transcription factors from yeast, worm, mouse, and human.