Taxonomic profiling of metagenomics samples: Get to know your loyal residents


Numerous colonies of different organisms live virtually everywhere on Earth, even in and on our bodies. They are called microbes and we all know about them. But do we really?

In fact, humans knew nothing about microbes before the 17th century. There were assumptions and hypotheses, but the existence of microbes was first confirmed by Antonie van Leeuwenhoek and Robert Hooke, who observed them through early microscopes. Since then, technology has changed tremendously, and we now know that all three domains of life (Archaea, Bacteria, and Eukarya) include organisms that are too small to be seen with the naked eye but too important for our existence to overlook. We learned how to cultivate some of them in the 19th century and uncovered the genetic basis of their physiology in the second half of the 20th century. Most of what we know about microbes is “laboratory-based”: it was gathered by observing culturable microorganisms living in isolation under artificial conditions. Beyond that, we have very little information about their ecological context or their interactions with other organisms in their habitat [1].

However, metagenomics, the science that studies the mixed genetic material of microbes taken directly from their natural environment, is developing fast on the wings of high-throughput sequencing and bioinformatics. It will hopefully make it possible to investigate these complex communities as a whole and to reveal which microbes are there, what they do, and what roles they play in environmental and physiological processes. The first step is to see who is actually there. Which microbes live in different environments? Has the composition of a microbial community changed after some treatment? Can we associate certain microbes with particular health conditions and diseases? For these and similar questions, we use taxonomic profiling of metagenomic samples.

What is taxonomic profiling?

Taxonomic profiling answers the question “Who is there?”, giving us insight into the taxonomic composition of each analyzed sample. Apart from identifying the taxa present in a sample, this kind of analysis also estimates the relative abundances of the organisms. As a result, a taxonomic profile contains a list of detected taxa, their estimated relative abundances, and various diversity indices.
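To make these quantities concrete, here is a minimal Python sketch that turns per-taxon read counts into relative abundances and computes one common diversity index, the Shannon index. The taxon names and counts are made up for illustration, and the Shannon index is only one of many indices a profiling tool might report.

```python
import math


def relative_abundances(read_counts):
    """Convert per-taxon read counts into fractions that sum to 1."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads were assigned to any taxon")
    return {taxon: count / total for taxon, count in read_counts.items()}


def shannon_index(abundances):
    """Shannon diversity H = -sum(p * ln p) over taxa with non-zero abundance."""
    return -sum(p * math.log(p) for p in abundances.values() if p > 0)


# Hypothetical number of reads assigned to each taxon in one sample.
counts = {"Bacteroides": 500, "Escherichia": 300, "Lactobacillus": 200}

profile = relative_abundances(counts)
for taxon, fraction in sorted(profile.items(), key=lambda item: -item[1]):
    print(f"{taxon}\t{fraction:.2%}")
print(f"Shannon index: {shannon_index(profile):.3f}")
```

A sample dominated by a single taxon yields a Shannon index close to zero, while an even mixture of many taxa yields a higher value.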

Creating a taxonomic profile is not an easy job, because metagenomic samples contain genetic material from millions of organisms belonging to thousands of different species. These molecules are sequenced jointly, so the reads we get from a sequencer could come from any of those organisms. Short read lengths, high sequence similarity between genes and genomes, differences in their length, low DNA yield in some cases, and a lack of suitable bioinformatics tools and databases make microbial identification and taxonomic profiling very challenging.

There are typically two approaches to taxonomic profiling of a metagenomic sample. One uses a single genetic marker (such as the 16S rRNA gene for prokaryotes, and 18S or ITS for eukaryotes), and the other targets the whole genomes of the organisms present in a sample.

The first, the marker-gene-based approach [2], can detect only species that carry the chosen gene, but it is usually much cheaper and more widely used. It also cannot distinguish all species; for example, Escherichia coli and Shigella spp. share almost identical 16S rRNA gene sequences. With this approach, reads are clustered by sequence similarity into so-called Operational Taxonomic Units (OTUs). Using a reference database for the marker gene, taxonomic information can be assigned directly to all reads or to the aforementioned OTUs, depending on the sequence similarity threshold (typically 95% identity for genus-level identification and 97% for species-level identification). This step is followed by a diversity analysis.
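The idea of grouping reads into OTUs at a similarity threshold can be sketched as follows. This is a deliberately simplified greedy clustering that assumes equal-length, pre-aligned sequences and counts matching positions. Real OTU-picking tools align sequences and use far more efficient heuristics, and the helper names here are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch compares equal-length sequences only")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otu_clustering(reads, threshold=0.97):
    """Put each read into the first OTU whose centroid it matches at or
    above the threshold; otherwise the read founds a new OTU."""
    otus = []  # list of (centroid, members) pairs
    for read in reads:
        for centroid, members in otus:
            if identity(read, centroid) >= threshold:
                members.append(read)
                break
        else:
            otus.append((read, [read]))
    return otus


def mutate(seq, positions):
    """Return seq with the base at each given position swapped for another."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for i in positions:
        chars[i] = swap[chars[i]]
    return "".join(chars)


# Hypothetical 100-nt reads: the original, one with 2 mismatches (98%
# identity) and one with 5 mismatches (95% identity).
base = "ACGT" * 25
reads = [base, mutate(base, [10, 50]), mutate(base, [5, 20, 40, 60, 80])]

otus = greedy_otu_clustering(reads, threshold=0.97)
print(f"{len(otus)} OTUs with sizes {[len(members) for _, members in otus]}")
```

At the 97% threshold the read with two mismatches joins the first OTU and the read with five mismatches founds its own. Lowering the threshold to 95% merges all three reads into one OTU, which mirrors the difference between species-level and genus-level grouping.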

In the second approach, sequencing reads can originate from anywhere in whole microbial genomes. Further analysis can be performed on all of the reads, or only on the subset that comes from unique clade-specific marker genes. Narrowing the set of reads in this way improves computational performance while still achieving higher resolution (up to the strain level) [3]. On the other hand, there are methods that process reads from whole genomes. Thanks to advances in computer and data science, those methods also perform very well [4].

Seven Bridges workflows for taxonomic profiling of metagenomic samples

To support metagenomics analyses on the Seven Bridges Platform, we have developed three main workflows (with several additional, supporting ones) for taxonomic profiling. They are based on tools commonly used in the bioinformatics and metagenomics research community. For profiling bacterial and archaeal communities from 16S rRNA gene sequences, we have developed the QIIME2 16S rRNA Metagenomic Profiling workflow, built from QIIME2 tools. Whole metagenome sequences can be analyzed with either the Metagenomics WGS analysis – Centrifuge 1.0.3 or the Metagenomics WGS analysis – MetaPhlAn 2.0 workflow.

All three workflows generate, among other workflow-specific files, a report listing all taxa detected in the samples together with their relative abundances. To visualize the results, we use Krona [5] interactive hierarchical graphs (Figure 1), which make it easy to browse each sample at different taxonomic levels.

Figure 1. HTML report with a Krona interactive chart for each sample.

In addition, you can further analyze differential abundances between samples to see whether groups of samples significantly differ in their composition from one another. This analysis is based on the R metagenomeSeq package [6] and can be run on the Seven Bridges Platform as an interactive notebook (Figure 2 shows examples of plots from this analysis).

Figure 2. Examples of plots from interactive analysis of differential abundance between samples.

Metagenomics WGS analysis – Centrifuge 1.0.3

Metagenomics WGS analysis – Centrifuge 1.0.3 is a workflow for analyzing metagenomic samples against a custom reference, allowing researchers to assign reads in their samples to a likely species of origin and quantify each species’ abundance in the sample. This workflow is based on the Centrifuge toolkit, which implements memory-efficient indexing schemes for the classification of microbial sequences based on the FM-index [4]. The index used for classification of reads is compressed by omitting redundant sequences from highly similar microbial genomes. In this way, the size of the index is reduced, and the search operations become very fast. Another advantage of using Centrifuge is the possibility to create the index from a custom set of microbial genomes in a very convenient and user-friendly way.
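To give a feel for how an FM-index supports fast exact matching, here is a toy Python sketch of the underlying backward search. It is an illustration of the general technique only: it builds the suffix array naively, stores uncompressed tables, and does none of the genome compression described above, so it should not be read as Centrifuge's implementation. The example sequence is made up.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index (suffix array, C table, occurrence table)."""
    text += "$"  # sentinel that sorts before A, C, G and T
    # Naive suffix array: fine for a demo, far too slow for real genomes.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    # c_table[c] = number of characters in the text smaller than c.
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in counts:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return suffix_array, c_table, occ


def find(pattern, index):
    """Return the sorted start positions of pattern using backward search."""
    suffix_array, c_table, occ = index
    lo, hi = 0, len(suffix_array)  # half-open range of matching suffixes
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


genome = "ACGTACGTTACG"  # hypothetical reference sequence
index = build_fm_index(genome)
print(find("ACG", index))
```

The search takes one table lookup per pattern character regardless of the reference length, which is why index size, rather than search time, tends to be the limiting factor.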

To run this workflow on the Seven Bridges Platform, you will need FASTQ files for one or more samples and a Centrifuge index. If you do not have your own index, you can choose the one from our Public Repository that best fits your research. You can also create a custom Centrifuge index with our Reference Index Creation workflow.

If you input FASTQ files from multiple samples, they will be analyzed in parallel. Two Centrifuge reports are output for each sample individually: one with the classification result for each read (Figure 3), and one with a list of all taxa found in the sample along with their relative abundances (Figure 4).

Figure 3. Centrifuge result for one sample; the taxID to which each read in the FASTQ file is classified
Figure 4. Centrifuge report for one sample; a list of all taxa found in a sample with their relative abundances
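A per-sample report like the one in Figure 4 is a tab-separated table, so it is easy to post-process. The sketch below picks the most abundant species from such a report. The column names (name, taxID, taxRank, genomeSize, numReads, numUniqueReads, abundance) are an assumption based on the layout described in the Centrifuge manual, so check them against your own file, and all values in the sample text are invented.

```python
import csv
import io

# Invented report content; column names are assumed, see the note above.
SAMPLE_REPORT = (
    "name\ttaxID\ttaxRank\tgenomeSize\tnumReads\tnumUniqueReads\tabundance\n"
    "Escherichia coli\t562\tspecies\t5000000\t1200\t900\t0.60\n"
    "Lactobacillus\t1578\tgenus\t2000000\t150\t100\t0.00\n"
    "Bacteroides fragilis\t817\tspecies\t5200000\t600\t550\t0.30\n"
    "Staphylococcus aureus\t1280\tspecies\t2800000\t200\t180\t0.10\n"
)


def top_taxa(report_handle, rank="species", n=3):
    """Return the n most abundant taxa of the given rank as (name, abundance)."""
    reader = csv.DictReader(report_handle, delimiter="\t")
    rows = [row for row in reader if row["taxRank"] == rank]
    rows.sort(key=lambda row: float(row["abundance"]), reverse=True)
    return [(row["name"], float(row["abundance"])) for row in rows[:n]]


for name, abundance in top_taxa(io.StringIO(SAMPLE_REPORT), n=2):
    print(f"{name}\t{abundance:.2f}")
```

With a real report, replace the `io.StringIO` object with an open file handle.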

As final reports, our workflow generates two HTML files: one with bar charts and GraPhlAn [7] charts for the most abundant species in every sample (Figure 5), and the other with Krona interactive graphs (Figure 1).

Figure 5. HTML report with bar chart and GraPhlAn chart per sample

Details about the Metagenomics WGS analysis – Centrifuge 1.0.3 workflow can be found on its public app page, where you can see the steps it consists of, the complete list of parameters, and more information on how to use it.

Metagenomics WGS analysis – Metaphlan 2.0

Another option for profiling metagenomic samples from whole genome sequencing reads is the Metagenomics WGS analysis – Metaphlan 2.0 workflow, based on the MetaPhlAn 2 toolkit [3]. MetaPhlAn 2 relies on about 1M unique clade-specific marker genes identified from about 17,000 reference genomes (roughly 13,500 bacterial and archaeal, 3,500 viral, and 110 eukaryotic). This allows reads to be assigned unambiguously to microbial clades, up to species-level resolution, and organismal relative abundances to be estimated accurately.
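The intuition behind marker-based abundance estimation can be shown with a short Python sketch: reads hitting each marker are normalized by marker length, averaged per clade, and rescaled to sum to 1. This is a simplified illustration under stated assumptions, not MetaPhlAn's actual estimator, and the marker names, lengths, and read counts are hypothetical.

```python
from statistics import mean


def clade_abundances(marker_hits, marker_info):
    """Estimate relative clade abundances from reads mapped to markers.

    marker_hits maps marker name -> number of mapped reads.
    marker_info maps marker name -> (clade, marker length in bases).
    """
    coverage_by_clade = {}
    for marker, (clade, length) in marker_info.items():
        # Normalize by length so long markers do not inflate a clade.
        coverage = marker_hits.get(marker, 0) / length
        coverage_by_clade.setdefault(clade, []).append(coverage)
    clade_coverage = {c: mean(v) for c, v in coverage_by_clade.items()}
    total = sum(clade_coverage.values())
    if total == 0:
        return {clade: 0.0 for clade in clade_coverage}
    return {clade: cov / total for clade, cov in clade_coverage.items()}


# Hypothetical markers: clade A has two short markers, clade B one long one.
marker_info = {
    "marker_1": ("clade_A", 1000),
    "marker_2": ("clade_A", 500),
    "marker_3": ("clade_B", 2000),
}
marker_hits = {"marker_1": 100, "marker_2": 50, "marker_3": 200}

for clade, fraction in clade_abundances(marker_hits, marker_info).items():
    print(f"{clade}\t{fraction:.2f}")
```

In this example clade B attracts more reads in total (200 versus 150), yet both clades end up with the same estimated abundance once marker length is taken into account.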

MetaPhlAn was used in the Human Microbiome Project and, in our experience, it tends to be faster than Centrifuge (by up to 25%). That is why we decided to offer a workflow based on this software to our users. However, reads are classified against a pre-built database that cannot be customized or extended easily, which may be necessary for some research.

Running Metagenomics WGS analysis – Metaphlan 2.0 requires a FASTQ or FASTA file with whole genome sequencing reads and the MetaPhlAn database (already available on our Platform). The workflow produces a report with the estimated abundances of taxa per sample (Figure 6).

Figure 6. MetaPhlAn report for one sample with estimated relative abundances per taxon
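A profile like the one in Figure 6 can be filtered to a single taxonomic rank with a few lines of Python. The sketch assumes the usual MetaPhlAn 2 text layout: comment lines starting with `#`, then one tab-separated line per clade, where the clade is a `|`-separated lineage with rank prefixes such as `k__`, `g__`, and `s__`. The sample lines and abundance values below are invented.

```python
RANK_PREFIXES = {
    "kingdom": "k", "phylum": "p", "class": "c", "order": "o",
    "family": "f", "genus": "g", "species": "s", "strain": "t",
}

# Invented profile lines in the assumed MetaPhlAn 2 layout.
SAMPLE_PROFILE = [
    "#SampleID\tMetaphlan2_Analysis",
    "k__Bacteria\t100.0",
    "k__Bacteria|p__Proteobacteria\t60.0",
    "k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria"
    "|o__Enterobacteriales|f__Enterobacteriaceae|g__Escherichia"
    "|s__Escherichia_coli\t60.0",
    "k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales"
    "|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides_fragilis\t40.0",
]


def parse_profile(lines, rank="species"):
    """Return {taxon name: relative abundance in percent} for one rank."""
    wanted = RANK_PREFIXES[rank]
    result = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        clade, abundance = line.split("\t")[:2]
        # The deepest level of the lineage decides the rank of the row.
        prefix, _, name = clade.split("|")[-1].partition("__")
        if prefix == wanted:
            result[name.replace("_", " ")] = float(abundance)
    return result


for name, percent in parse_profile(SAMPLE_PROFILE).items():
    print(f"{name}\t{percent}")
```

Passing `rank="phylum"` or `rank="genus"` returns the same profile summarized at a coarser level.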

As with the Centrifuge-based workflow, Metagenomics WGS analysis – Metaphlan 2.0 also produces final reports with bar charts, GraPhlAn charts, and interactive Krona graphs (Figures 1 and 5).

Details about the Metagenomics WGS analysis – Metaphlan 2.0 workflow can be found on its public app page, where you can see the steps it consists of, the complete list of parameters, and more information on how to use it.

QIIME2 16S rRNA Metagenomic Profiling

One type of metagenomics analysis is the analysis of sequences coming from marker genes, usually for taxonomic profiling of samples. The most commonly used marker gene is 16S rRNA gene for bacterial and archaeal taxonomic profiling. Compared to whole genome sequencing, marker-gene sequencing is much cheaper and still sufficient for the identification of species (or other taxa) present in a sample.

QIIME2 16S rRNA Metagenomic Profiling is a workflow based on the QIIME2 toolkit [2], used to analyze microbiome samples from 16S rRNA gene sequences. Different analyses can be run by combining tools from QIIME2. We focused on one of the possible solutions, presented in the QIIME2 “Moving pictures” tutorial, which should cover the majority of cases: it includes importing data into the QIIME2 artifact (QZA) format, demultiplexing and quality filtering, OTU picking, taxonomic assignment, and phylogenetic reconstruction of the given samples.

To run this workflow on the Seven Bridges Platform, you need FASTQ files for one or more samples, the corresponding metadata file, and a classifier in QZA format (a pre-trained Naive Bayes classifier). It is important to train the classifier on the specific dataset of interest, so we recommend using the QIIME2 16S rRNA Feature Classifier Training workflow to do so. Alternatively, if it is more convenient, you can choose a classifier from our Public Repository.

The output from each step of the analyses is given in QIIME2 artifact format, in case a user wants to analyze it further (QZA files) or view it on the QIIME2 website (visualization QIIME2 artifacts – QZV files). An example of such a visualization file is given in Figure 7.

Finally, taxonomic profiles are given in the form of interactive Krona charts, and in TSV format for further differential abundance analysis. The final feature table shows the presence of each OTU in each analyzed sample (Figure 8).

Figure 7. QIIME2 visualization artifact with detected taxa, presented on QIIME2 web viewer.
Figure 8. QIIME2 final feature table


Details about the QIIME2 16S rRNA Metagenomic Profiling workflow can be found on its public app page, where you can see the steps it consists of, the complete list of parameters and more information on how to use it.

Where do we go from here?

Creating a taxonomic profile is not the end of a research project, but rather one of its initial steps. After profiling, further analysis is needed to see, for example, whether there are any differences between profiles obtained from different samples or at different time points. Seven Bridges offers a Jupyter notebook with an interactive analysis of differential abundance in metagenomics samples. The code from the notebook can easily be connected to files from other projects on our Platform and customized in different ways, depending on your needs.

The same goes for the workflows themselves: if you have specific requirements for modifying or extending these workflows, you can do that (with or without our help) on the Seven Bridges Platform.

All apps mentioned here are already available on our Public Apps page. We would love to hear from you about your experience using them or anything else you think is relevant to us, so we could keep developing new versions that suit your needs. Send us an email at


[1] National Research Council (US) Committee on Metagenomics: Challenges and Functional Applications. The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet. Washington (DC): National Academies Press (US); (2007).
[2] QIIME2 official web page
[3] MetaPhlAn v2.0 official web page
[4] Kim D, Song L, Breitwieser FP, and Salzberg SL. “Centrifuge: rapid and sensitive classification of metagenomic sequences”, Genome Research (2016).
[5] Krona official web page
[6] metagenomeSeq: Statistical analysis for sparse high-throughput sequencing
[7] GraPhlAn official web page