The representation of the human reference genome as a linear haploid DNA sequence poses limitations when trying to incorporate the known genetic diversity of human populations. This has led to the development of graph-based references, able to naturally represent all polymorphisms, including insertions, deletions, and structural variation.
Seven Bridges GRAF comprises bioinformatics workflows and tools for secondary analysis of next-generation sequencing (NGS) data, based on a pan-genome graph reference. These tools are able to call variants with superior accuracy, without compromising on speed or cost. GRAF uses standard genomic data formats (FASTA, VCF, FASTQ, BAM, BED), ensuring compatibility with existing workflows. All Seven Bridges GRAF workflows are specified in CWL 1.0, ensuring portability across compute environments.
Seven Bridges GRAF enables:
- Large human population studies by using a population-specific genome graph comprising variation from millions of individuals
- Rare disease studies by using a curated graph containing mutations associated with the disease
- Precise analysis of individual genomes by making use of bespoke genome graphs and family genome graphs, leading to accurate de novo mutation detection
- Sensitive detection of somatic mutations in tumor samples by using cancer genome and normal sample genome graphs
Seven Bridges GRAF consists of:
- GRAF Germline Variant Detection Workflow
- GRAF Aligner – maps sample reads against a Genome Graph Reference, implicitly considering many alternate haplotypes at each locus, thereby minimizing reference bias.
- GRAF Variant Caller – enables integrated calling of SNPs and INDELs, as well as structural variants present in the Genome Graph.
- GRAF Pan Genome Reference – contains both small variants (SNPs and INDELs up to several dozen base pairs in length) and larger structural variants, which are typically difficult to identify from short-read sequencing data. Consequently, graph references provide the means both to detect structural variation accurately and to reduce spurious variant calls.
Graph Genome is a powerful genomic data structure
Seven Bridges GRAF is powered by a directed acyclic graph-based data structure and is a fundamental rethinking of the representation of genomic variation that impacts health. Unlike standard linear references, this structure makes use of information from an entire population to characterize genetic variants with unprecedented accuracy. Graph structures have the potential to learn from every new person sequenced, meaning that the graph-based reference can improve with each additional genome. Better still, this improvement happens with only minimal increases in file size, allowing analysis of genomes at population scale.
The GRAF Pan Genome Reference organizes genomic data from a population into an edge-based sequence variation graph. In such graphs, edges are the primary data carrier elements and alternative haplotypes are represented as different paths through the graph. The linear reference assembly forms the graph backbone, and additional variants are added as new edges in the graph. A longer genomic haplotype can be obtained by following a path through the graph and concatenating the (sub)sequences contained by the visited edges. We call this structure a genome graph.
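The path-and-concatenate idea can be illustrated with a toy example. This is a minimal sketch and not the GRAF data structure itself: the `Edge` class, the `spell_haplotype` function, and the tiny three-edge backbone are all hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge type: the sequence lives on the edge, not the vertex."""
    source: int    # vertex the edge leaves
    target: int    # vertex the edge enters
    sequence: str  # bases carried by the edge (empty string models a deletion)


def spell_haplotype(path):
    """Concatenate the edge sequences along a path through the graph."""
    for prev, nxt in zip(path, path[1:]):
        if prev.target != nxt.source:
            raise ValueError(f"edges do not connect: {prev} -> {nxt}")
    return "".join(edge.sequence for edge in path)


# Toy backbone standing in for the linear reference: ACG | T | ACC
backbone = [Edge(0, 1, "ACG"), Edge(1, 2, "T"), Edge(2, 3, "ACC")]

# Variants are added as extra edges between existing vertices.
snp = Edge(1, 2, "G")       # T>G substitution
deletion = Edge(1, 2, "")   # deletion of the T

reference_haplotype = spell_haplotype(backbone)
snp_haplotype = spell_haplotype([backbone[0], snp, backbone[2]])
deletion_haplotype = spell_haplotype([backbone[0], deletion, backbone[2]])
```

Each alternative haplotype here is simply a different choice of edge between vertices 1 and 2, which is why adding a variant costs one extra edge rather than a second copy of the sequence.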
A graph genome reference can contain both small variants (SNPs and indels up to several dozen base pairs in length) and larger structural variants, which are typically difficult to handle using short-read sequencing. Consequently, graph references provide the means both to detect structural variation accurately and to reduce the errors it causes in small variant calls.
Given the vast quantities of sequencing data being produced, attempts are being made to further reduce the size of the data. Recent approaches introduced the concept of reference-based compression, which yields an improvement of 75% and is being implemented in the CRAM file format for storing sequencing reads. Reference-based compression works by performing a fast, rough alignment of each sequence against a reference genome and storing only the differences.
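The "store only the differences" principle can be sketched in a few lines. This toy example is not the CRAM encoding: it assumes the read has already been aligned at a known position, handles substitutions only, and uses hypothetical function names (`encode_read`, `decode_read`) and a made-up record layout.

```python
def encode_read(reference, read, pos):
    """Return (pos, length, diffs), where diffs lists (offset, base) mismatches."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [(i, base) for i, (ref_base, base) in enumerate(zip(window, read))
             if ref_base != base]
    return (pos, len(read), diffs)


def decode_read(reference, record):
    """Rebuild the read from the reference plus the stored differences."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTAC"
read = "GTAGGT"  # matches reference[2:8] except at offset 3 (C>G)

record = encode_read(reference, read, pos=2)
restored = decode_read(reference, record)
```

A read that matches the reference exactly is reduced to a position and a length, so the saving grows as reads agree more closely with the reference they are compared against.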
Graph structures enable accurate variant calling
GRAF enables accurate variant calling, the process of determining the set of genomic variants present in a sample. Current state-of-the-art variant callers are based on local reassembly of reads mapped in the region surrounding a putative variant. This approach significantly boosts the accuracy of variant calls but is limited by the choice of the reassembly region size and the expected size of the assembled contig. Set these values too small and the results are compromised; set them too large and the computation tends to become prohibitive.
Using GRAF can alleviate this issue: reads (partially) aligned to a known variant edge allow a better adaptive selection of the necessary reassembly window, because the edge from a graph reference carries information about its size. Furthermore, GRAF enables a completely different paradigm in variant calling: for variants present in the graph, the sequence and location are already known, so it becomes possible to genotype directly against the graph.
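Both ideas can be made concrete with a deliberately simple sketch. It does not describe how the GRAF Variant Caller works: the function names, the fixed `padding`, and the allele-fraction threshold are assumptions chosen for illustration, and a real caller would use a statistical model rather than a hard cutoff.

```python
def reassembly_window(variant_start, ref_span, alt_length, padding=50):
    """Size the reassembly window from the known edge instead of a fixed guess.

    variant_start: reference coordinate where the variant edge begins
    ref_span:      number of reference bases the edge replaces
    alt_length:    number of bases carried by the variant edge
    """
    span = max(ref_span, alt_length)
    return (max(0, variant_start - padding), variant_start + span + padding)


def genotype_from_counts(ref_reads, alt_reads, min_fraction=0.2):
    """Call a diploid genotype from reads supporting the reference vs. variant edge."""
    total = ref_reads + alt_reads
    if total == 0:
        return "./."  # no informative reads at this site
    alt_fraction = alt_reads / total
    if alt_fraction < min_fraction:
        return "0/0"  # homozygous reference
    if alt_fraction > 1 - min_fraction:
        return "1/1"  # homozygous variant
    return "0/1"      # heterozygous


# A 300-base insertion edge widens the window automatically.
window = reassembly_window(variant_start=1000, ref_span=1, alt_length=300)
call = genotype_from_counts(ref_reads=12, alt_reads=11)
```

Because the variant edge already supplies the sequence and coordinates, the genotyping step only needs to count which edge each read was aligned to; no new sequence has to be discovered.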
The GRAF Aligner utilizes a pan-genome graph structure, created by augmenting the reference genome with a set of known variations, significantly alleviating the effects of reference bias. The graph incorporates variants from a variety of sources, including the 1000 Genomes Project and the Simons Genome Diversity Project. By preserving all the variant information, GRAF facilitates better alignment than linear alternatives, resulting in more accurate variant calls.