Pan-Genome Analysis Takes Center Stage

Written by Devin Locke in Graph Genome on January 31st, 2019

The concept of a pan-genome reference is straightforward: a reference structure that represents all the known genetic variation for a particular population or species. The representation of the human reference genome as a linear haploid DNA sequence poses limitations when trying to incorporate the known genetic diversity of human populations. Given these challenges, many have turned to graph-based reference systems and are beginning to see improvements in biomarker discovery, clinical genomics, metagenomics, and population studies.

In the linear alignment and variant calling paradigm, reads are aligned to the reference, and the task is then to characterize true mismatches accurately while filtering out mismatches caused by technical error. This is a familiar signal-to-noise problem, and it becomes harder in regions of the genome prone to high population diversity or structural variation. Graph-based pan-genome references offer the possibility of increased accuracy in detecting insertions and deletions, as well as longer structural variants. The ultimate goal is to improve precision, speed, and recall without compromising specificity.
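
To make the terms precision and recall concrete, here is a minimal Python sketch that scores a set of variant calls against a truth set. The names `score_calls` and `CallMetrics` and the toy variants are hypothetical and chosen for illustration only. Matching variants by exact (chromosome, position, ref, alt) identity is a simplifying assumption; real benchmarking tools normalize variant representation before comparing.

```python
from typing import NamedTuple, Set, Tuple

# A variant is identified here by (chromosome, 1-based position, ref allele, alt allele).
# Exact-match comparison is an assumption made to keep the sketch short.
Variant = Tuple[str, int, str, str]


class CallMetrics(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    precision: float
    recall: float


def score_calls(called: Set[Variant], truth: Set[Variant]) -> CallMetrics:
    """Compare a call set with a truth set and summarize the agreement."""
    tp = len(called & truth)   # calls that are real variants
    fp = len(called - truth)   # calls that are not in the truth set
    fn = len(truth - called)   # real variants that were missed
    precision = tp / (tp + fp) if called else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return CallMetrics(tp, fp, fn, precision, recall)


# Toy data: one SNP, one deletion, and one insertion in the truth set.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "AT", "A"),
    ("chr2", 40, "C", "CTTG"),
}
# The caller finds the SNP and the insertion, misses the deletion,
# and reports one spurious SNP.
called = {
    ("chr1", 100, "A", "G"),
    ("chr2", 40, "C", "CTTG"),
    ("chr2", 90, "G", "T"),
}

metrics = score_calls(called, truth)
print(metrics)
```

In this toy case the missed deletion lowers recall and the spurious SNP lowers precision, which is the trade-off the paragraph above describes.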

The Seven Bridges graph development team recently published a manuscript in Nature Genetics titled “Fast and Accurate Genomic Analyses Using Genome Graphs.” This collaborative effort by a dedicated and diverse team of scientists and programmers showcases the advantage of graph-based references for alignment and variant calling, especially for indels. For example, using a graph constructed from 1000 Genomes Project variants, the graph aligner accurately mapped >99% of simulated reads containing indels >10 bp. The linear paradigm, using BWA-MEM, mismapped reads bearing indels >10 bp at a rate two to three times higher.

One challenge with adopting a new analysis paradigm is compatibility with existing data formats. The workflow for our directed acyclic graph approach parallels the linear paradigm with minor differences, promoting ease of use. Let’s look at those steps:

The Seven Bridges tools create a graph in memory from reference data in common formats: a FASTA file representing the linear reference, which also establishes the coordinate system for the graph, and a VCF file that defines the variants to be incorporated into the graph.
Our aligner takes next-generation sequencing (NGS) reads in FASTQ format, aligns them to the graph, and outputs a BAM file with graph-specific tagging information.
Next, our reassembly-based variant caller takes the BAM, generates variant calls, and outputs a VCF.
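
The first step above, building a directed acyclic graph from a linear reference plus a list of variants, can be sketched in a few lines of Python. This is a toy illustration and not the Seven Bridges implementation: the functions `build_graph` and `spell_paths`, the node naming scheme, and the in-memory tuples standing in for FASTA and VCF records are all assumptions. The sketch also assumes every variant has reference sequence on both sides.

```python
from typing import Dict, List, Tuple

# A VCF-like record: (1-based position, ref allele, alt allele).
VariantRecord = Tuple[int, str, str]


def build_graph(
    reference: str, variants: List[VariantRecord]
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Build a toy variation graph: (node id -> sequence, node id -> successors)."""
    breakpoints = {0, len(reference)}
    for pos, ref, _alt in variants:
        begin, end = pos - 1, pos - 1 + len(ref)
        if reference[begin:end] != ref:
            raise ValueError(f"ref allele {ref!r} does not match reference at {pos}")
        if begin == 0 or end >= len(reference):
            raise ValueError("this sketch needs reference sequence flanking each variant")
        breakpoints.update((begin, end))

    nodes: Dict[str, str] = {}
    edges: Dict[str, List[str]] = {}
    starts_at: Dict[int, str] = {}  # reference coordinate -> node starting there
    ends_at: Dict[int, str] = {}    # reference coordinate -> node ending there

    # Cut the reference at every breakpoint; node ids keep linear coordinates.
    cuts = sorted(breakpoints)
    previous = None
    for begin, end in zip(cuts, cuts[1:]):
        node_id = f"ref:{begin}-{end}"
        nodes[node_id] = reference[begin:end]
        edges[node_id] = []
        starts_at[begin], ends_at[end] = node_id, node_id
        if previous is not None:
            edges[previous].append(node_id)
        previous = node_id

    # Each alt allele becomes a node that bypasses the reference allele it replaces.
    for index, (pos, ref, alt) in enumerate(variants):
        begin, end = pos - 1, pos - 1 + len(ref)
        node_id = f"alt:{index}"
        nodes[node_id] = alt
        edges[node_id] = [starts_at[end]]
        edges[ends_at[begin]].append(node_id)
    return nodes, edges


def spell_paths(nodes: Dict[str, str], edges: Dict[str, List[str]], start: str) -> List[str]:
    """Enumerate every sequence spelled by a path from `start` to a sink."""
    if not edges[start]:
        return [nodes[start]]
    return [
        nodes[start] + rest
        for successor in edges[start]
        for rest in spell_paths(nodes, edges, successor)
    ]


reference = "ACGTACGT"
variants = [(3, "G", "T"), (5, "AC", "A")]  # one SNP and one 1 bp deletion
nodes, edges = build_graph(reference, variants)
haplotypes = spell_paths(nodes, edges, "ref:0-2")
print(sorted(haplotypes))
```

Because the reference nodes are named by their linear coordinates, the reference sequence still defines the coordinate system, as described in the first step. Enumerating every path is only practical for a toy graph of this size.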

The VCF can then be interrogated or further filtered using tools such as Smart Variant Filter. So while the dataflow is similar to the linear paradigm (FASTQ → BAM → VCF), the underlying tools take advantage of our graph-based approach.

As part of our commitment to the community, Seven Bridges has made the binaries used to generate the data in the Nature Genetics manuscript available for download. Because development has continued since the paper was submitted, the code released with the paper is not the most current version. An exciting roadmap is planned for the graph product line, including a Common Workflow Language (CWL)-based whole genome sequencing analysis workflow and much, much more.
