Data Cruncher Public Interactive Analyses

PlatformScience
Back to Blog

Data Cruncher Public Interactive Analyses

We are excited to announce a suite of public interactive analyses readily available for researchers to use with their own data. Data Cruncher Interactive Analyses is a public project created with the goal of enhancing end-to-end bioinformatics analysis on the Platform. The suite is currently comprised of five interactive analyses (Figure 1).

The list of available Data Cruncher Analyses
Figure 1. The list of available Data Cruncher Analyses.

To help researchers transform raw, NGS-based data into clinically actionable knowledge, Seven Bridges strives to efficiently bridge the gap between secondary and tertiary analysis. One of the features we offer to achieve this goal is the Seven Bridges Data Cruncher, which enables scientists to perform interactive computing and open-ended exploration of data on the cloud. Data Cruncher allows scientists to initialize a computation instance and write custom Python, R, and Julia scripts using JupyterLab — an interactive environment that enables users to readily view results. For an introduction to Data Cruncher, we encourage you to read the post we published in June 2017.

Seven Bridges Data Cruncher Analyses help researchers bridge the gap between secondary and tertiary analysis and get the most out of their data.

In this blog post, we will provide a step by step explanation of each analysis in the public project. We will explain what the analysis is used for and describe examples of outputs.

Read on to learn more about the five demo analyses, or scroll down to the Getting Started section to get your feet wet!

Ballgown Interactive Analysis

Ballgown Interactive Analysis tests for differential expression at the gene and transcript level using FPKM expression measurement data.

After loading the data, initial data inspection and quality control are performed along with removing low expression genomic features through variance filtration. Then, two differential expression tests are run, principal component analysis (PCA) and hierarchical clustering. Principal component analysis plots (Figure 2a) provide dimensionality reduction of the data while retaining most of the variation in the dataset. By using principal components, each sample can be easily visualized, making it possible to assess similarities and differences between samples. Hierarchical clustering (Figure 2b) represents another exploratory analysis tool which allows grouping the samples or genes together based on similarity of their features.

PCA and Hierarchical clustering plot
Figure 2. (A) PCA plot. (B) Hierarchical clustering plot.

After the differential expression tests have finished, the results are summarized in simple tables showing the number of differentially expressed features for different levels of significance. These results are visualized through volcano plots (Figure 3) and histograms of p and q values.

Volcano plot
Figure 3. Volcano plot showing statistically significant and insignificant differentially expressed genes in different colors.

The data processed in this analysis consists of 12 human RNA-Seq samples containing only reads aligned on chromosome X and is used as example data in Pertea et al., 2016.

VCF Visualization Analysis

VCF Visualization Interactive Analysis offers several quality control analyses of variant call format (VCF) data. This analysis accepts both raw and annotated VCF files without the need for compression or indexing.

The primary purpose of this analysis is to provide researchers with a broad overview of the data. All computation and visualization functions are open and can be modified by the user according to their specific use case. The analysis is written in Python, data-crunched with pandas and scikit-allel libraries, and the results are displayed using the bokeh package.

The main part of the application includes the following analyses:

  • Variant Density (Figure 4)
  • TS/TV Ratio
  • Base Changes
  • Variant Types
  • Insertions and Deletions Lengths
  • Variant Quality
  • Variant Allele Frequency
Variant density plot
Figure 4. Variant density plot for chromosome 1. Middle positions have zero coverage because of the centromere.

All of the mentioned Python functions can be performed on the entire VCF file or on a specific chromosome. The analysis also includes two more advanced plots: a plot displaying the genes with the most mutations and the mutation type (Figure 5) as well as a plot showing diseases from the CIViC database which correlate with the majority of called mutations.

Most mutated genes
Figure 5. Diagram showing the most mutated genes. The colors represent different mutation types.

Structural Variation Interactive Analysis

Structural Variation (SV) Interactive Analysis is a Jupyter Notebook designed for a quick overview of a VCF file containing structural variant calls. The analysis parses the SV caller’s VCF output into a pandas dataframe for simple data filtering and visualization.

This analysis contains eight customizable visualizations displaying various metrics that can be used to assess the impact of structural variants on a genome and evaluate the performance of an SV caller. Several simple visualizations featured in this analysis examine the different types of structural variants as well as their distribution within the genome. This information can help researchers inspect the biases of different variant callers and sequencing techniques. For instance, certain techniques are better suited for calling particular SV types over others.

The following plots are available through this analysis:

  • Bar plot of variants per variant type (Figure 6a)
  • Bar plot of variants per chromosome (Figure 6b)
  • Heatmap of specific variant types per chromosome
  • Histogram of SV lengths
  • Distribution of SVs within a chromosome
Total numbers charts
Figure 6. (A) Total number of duplications, inversions, and deletions with respect to their length. (B) Total number of structural variants per chromosome.

Finally, the SV analysis contains additional plots and graphics which demonstrate the ability to work with multiple files and external databases. These include:

  • Pie charts for inter- vs. intrachromosomal breakends
  • Histogram depicting distribution of SNPs in the vicinity of breakends
  • Histogram showing the number of SVs overlapping with a gene
  • Graphic displaying the highlighted gene affected by structural variation in a pathway (not easily customizable)

These visualizations demonstrate several of the capabilities of interactive analyses and can serve as a starting point for more complex projects.

ChIP-seq Interactive Analysis

The Chromatin Immunoprecipitation Sequencing (ChIP-Seq) Interactive Analysis provides a visualization of the likely locations of transcription factor binding sites or histone modifications.

After performing chromatin immunoprecipitation followed by high-throughput DNA sequencing, sequencing reads are aligned to a reference genome. Genomic regions which are enriched for aligned sequencing reads are then tested for statistical significance, and these data are plotted such that peaks indicate locations of transcription factor binding sites or histone modifications.

ChIP-seq Interactive Analysis uses the bokeh library to enable the visual representation of peaks (Figure 7) identified by MACS2 peak-caller, a tool widely used in ChIP-seq bioinformatic workflows.

ChIP-seq peak visualization
Figure 7. ChIP-seq peak visualization.

Inputs to this analysis include a pre-generated annotation file and MACS2 bedGraph files obtained from the treatment sample and an optional control sample containing information on reads pileups. Researchers can obtain bedGraph files for their own samples by using the SBG public pipeline — ChIP-seq BWA Alignment and Peak Calling MACS2 2.1.1.

Microbiome Differential Abundance Analysis

The goal of the Microbiome Differential Abundance Interactive Analysis is to detect differential abundance of microbes between two predetermined classes of samples. This experimental design is applicable to case-control clinical studies and other settings for which there is prior assumption about the existing microbiological conditions within the different groups of samples.

This analysis relies on results of the metagenomic workflows performing the profiling of microbial communities from metagenomic shotgun or 16S rRNA sequencing data. Inputs to this analysis include tables containing information on mapped read counts, taxonomy, and clinical data relevant to the sample (provided by the user). Counts and taxonomy tables are default outputs of MetaPhlAn 2, Centrifuge, and QIIME2 metagenomic workflows, all of which are available on the Platform. The BIOM file format, which also contains these tables, is supported as well.

Microbiome differential abundance volcano plot
Figure 8. Microbiome differential abundance volcano plot.

The analysis is written in R and Python combined and runs within Jupyter Notebook. The code enables the visualization of relative abundance of the most abundant taxa and the differential abundance analysis (Figure 8) using the fitFeatureModel() or fitZig() functions of the R package metagenomeSeq.

Additionally, this notebook comes with a distinctive “flavor” — metagenomeSeq methods are called directly from Python, using rpy2 which conveniently enables an interface between Python and R. Since both languages can be interchangeably used in cases of data exploration and visualization, this analysis can serve as a practical guideline for combining the languages within the same notebook and hence help researchers get the most out of the analysis.

Getting started

Data Cruncher Interactive Analysis project can be found under the Public Projects tab on the platform toolbar (Figure 9).

Public Projects on the Seven Bridges Platform
Figure 9. Public Projects on the Seven Bridges Platform.

Once in the project, you can see the list of interactive analyses by clicking the “Interactive Analysis” button in the upper-right corner and then choosing Data Cruncher. Alternatively, you can navigate to a specific analysis by clicking the link in the project description found on the left-hand side. Each analysis can be copied to projects in the same way apps and workflows can be copied.

The Jupyter Notebooks come with demo files which can easily be found by filtering by the unique tag associated with each analysis and copied. For example, to filter for files for the Ballgown Interactive Analysis, you can go to the project files and choose BALLGOWN in the tags dropdown menu, as shown in Figure 10. After that, you can select all the filtered files and copy them to your own project. Once both the analysis and files are copied to the project, you can start the analysis.

Filtering project files by tag
Figure 10. Filtering project files by tag.

Finally, if you intend to run these or similar analyses with your own data, each notebook is annotated with comprehensive descriptions and explanations to facilitate customization.

Interested in learning more about Data Cruncher Interactive Analyses or have questions about it? Contact us at support@sbgenomics.com!