We know that our whole genome is distributed to (almost) every cell of our bodies. This fact can be used both to surprise introductory biology students and to usefully refine a fundamental scientific question. Instead of merely asking how it comes to be that different parts of our bodies have different features, we can ask: Which transcripts are expressed in the different regions of the human body, and at what levels? Next-generation sequencing has fundamentally changed how we investigate that question. We can search more broadly: we simply have more data about more transcripts to work with. We can also search in a less hypothesis-driven way. Indeed, we might even discover transcripts we didn’t know existed, which is not a possible result of (say) a microarray experiment.
To get our feet wet with RNA-Seq, we re-analyzed the Body Map 2.0 dataset from Illumina. This dataset comprises short reads from 16 human tissues (lung, heart, testis, etc.). We aligned the reads with TopHat and used Cuffdiff to run a differential expression (DE) analysis, comparing all 16 tissues in a pairwise fashion. We then visualized the DE output with Spectacle. To get a feel for the data, and to give a bioinformatician’s take on the old question of brains versus brawn, we focused on a single comparison: muscle vs. brain. Volcano plots are great at giving an overview of differential expression data, because you get significance and fold change in one view. The X-axis gives the base-two logarithm of the fold change in expression between brain and muscle; the Y-axis gives the negative base-ten logarithm of the p-value, so more significant changes sit higher on the plot.
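For readers who want to see the volcano-plot arithmetic spelled out, here is a minimal sketch in Python. It is not the code behind Spectacle or Cuffdiff; the function name `volcano_point`, its parameters, and the choice of brain over muscle as the direction of the ratio are assumptions made for illustration. It also assumes both expression values are non-zero.

```python
import math
from itertools import combinations


def volcano_point(muscle_expr, brain_expr, p_value):
    """Return (x, y) volcano coordinates for one transcript.

    x is log2(brain / muscle), so muscle-enriched transcripts get a
    negative x (assumed direction); y is -log10(p), so smaller p-values
    plot higher. Both expression values must be greater than zero.
    """
    log2_fold_change = math.log2(brain_expr / muscle_expr)
    neg_log10_p = -math.log10(p_value)
    return log2_fold_change, neg_log10_p


# Comparing 16 tissues pairwise gives 16 * 15 / 2 comparisons.
n_pairs = len(list(combinations(range(16), 2)))
print(n_pairs)  # 120
```

A transcript that is eight times more abundant in muscle than in brain lands at x = -3 under this convention, which puts it on the left-hand side of the plot.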
It is hard to know where to start with all this data, so let’s look at the ten transcripts at the top left: these combine extreme differential expression (in this case, in muscle vs. brain) with high statistical significance.
These ten are:
| Gene ID | Fold Difference | P Value |
Most of these are genes that you would expect to find: they code for proteins that constitute muscle tissue, mediate some aspect of its behavior, or regulate some aspect of its environment.
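Picking out the "top left" of a volcano plot can be sketched as a filter followed by a sort. The snippet below is only an illustration under stated assumptions: the record fields (`id`, `log2_fc`, `p_value`), the 0.05 cutoff, and the example values are hypothetical placeholders, not numbers from the Body Map analysis.

```python
def top_left(records, n=10, alpha=0.05):
    """Return up to n significant records with the most negative log2 fold change.

    Ties on fold change are broken by the smaller p-value.
    """
    significant = [r for r in records if r["p_value"] < alpha]
    ranked = sorted(significant, key=lambda r: (r["log2_fc"], r["p_value"]))
    return ranked[:n]


# Placeholder records, not values from the Body Map analysis.
example = [
    {"id": "tx_A", "log2_fc": -9.0, "p_value": 1e-6},
    {"id": "tx_B", "log2_fc": -11.0, "p_value": 0.2},  # extreme but not significant
    {"id": "tx_C", "log2_fc": -10.0, "p_value": 1e-4},
    {"id": "tx_D", "log2_fc": 8.0, "p_value": 1e-8},   # right-hand side of the plot
]
print([r["id"] for r in top_left(example, n=2)])  # ['tx_C', 'tx_A']
```

Filtering on significance first keeps a transcript like `tx_B` out of the list: its fold change is the most extreme, but its p-value does not support it.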
But what about ‘loc100507537,’ the seventh item on the list (and the yellow dot in the volcano plot)? It is not well-studied enough even to have a name other than the one assigned to it by virtue of the locus at which it occurs. We took a look at the other 15 tissue samples and didn’t see it there. A glance at the NCBI UniGene database confirms that it’s muscle-specific, and Ensembl tells us that it’s a long intergenic non-coding RNA (lincRNA).
It doesn’t code for any protein, but its high and specific expression in muscle suggests it may be doing something important. We don’t know what that something might be. We wonder if anyone else does. Are there any muscle experts out there who can chime in?
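The check across the other tissues described above can be made a little more concrete. This is a rough sketch, not the method used for the analysis: `tissue_share` is a hypothetical helper, and the expression values are invented placeholders. It simply reports what fraction of a transcript's summed expression comes from one tissue.

```python
def tissue_share(expression_by_tissue, tissue):
    """Fraction of a transcript's total expression found in one tissue.

    Returns 0.0 when the transcript is not expressed anywhere.
    """
    total = sum(expression_by_tissue.values())
    if total == 0:
        return 0.0
    return expression_by_tissue[tissue] / total


# Invented placeholder values, not measurements from the dataset.
example = {"muscle": 9.0, "brain": 0.0, "heart": 1.0}
print(tissue_share(example, "muscle"))  # 0.9
```

A share close to 1.0 for a single tissue is what "muscle-specific" would look like in this simplified view.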
- We based our method on Box 1B from Trapnell et al. 2012. Full details of our method, including tool versions and parameter settings, are available at igor.sbgenomics.com in the “Human Body Map Differential Expression” project. You can email Kate at firstname.lastname@example.org for an invitation to this project.
- Spectacle was presented at Biology of Genomes 2013; you can find the poster at F1000 Posters.
- Again, if you’d like to see the rest of the comparisons, drop Kate a line.