Introducing the world’s first enterprise-ready Graph Genome tools
Seven Bridges builds self-improving systems to analyze millions of genomes, including Graph Genome Suite — the most advanced population genomics tools in the world.
Developed in conjunction with Genomics England to support the UK Government’s 100,000 Genomes Project, Graph Genome Suite is a set of bioinformatics tools for highly accurate analysis of whole genome sequencing data provided on the Seven Bridges Platform. This toolkit leverages vast databases of genomic variants and our innovative algorithms to deliver the world’s most accurate variant calling.
Graph Genome Suite consists of:
- The Graph Genome Reference data structure, which uses a regular reference genome (e.g. GRCh38) as the backbone of a graph. Variants from genomic variation databases are added on top of this backbone as edges, carrying information on the many alternatives to the reference genome haplotype observed across populations.
- The Graph Genome Aligner, which uses the Graph Genome Reference to map sample reads considering many alternate haplotypes for each locus, thereby minimizing reference bias.
- The Graph-Genome-Assisted Variant Caller, which enables integrated calling of small variants such as SNPs and indels together with common population structural variants up to thousands of base pairs in length.
Graph Genome is a powerful genomic data structure
The Graph Genome Suite is powered by a directed acyclic graph-based data structure, and is a fundamental rethinking of the representation of genomic variation that impacts health. Unlike standard linear references, this structure makes use of information from an entire population to characterize genetic variants with unprecedented accuracy. Graph structures have the potential to learn from every new person sequenced, meaning that the graph-based reference can improve with each additional genome. Better still, this improvement happens with only minimal increases in file size, allowing analysis of genomes at population scale.
The Graph Genome Reference organizes genomic data from a population into an edge-based Sequence Variation Graph. In such graphs, edges are the primary data carrier elements and alternative haplotypes are represented as different paths through the graph. The linear reference assembly forms the graph backbone, and additional variants are added as new edges in the graph. A longer genomic haplotype can be obtained by following a path through the graph, and concatenating the (sub)sequences contained by the visited edges. We call this structure a Genome Graph.
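The edge-based layout described above can be illustrated with a minimal Python sketch. This is a toy model rather than the Graph Genome Reference implementation: the class name `GenomeGraph`, its methods, and the eight-base reference are all hypothetical, chosen only to show how edges carry sequence and how paths spell haplotypes.

```python
class GenomeGraph:
    """Toy edge-based sequence graph (hypothetical, for illustration only).

    Vertices are plain integer IDs; every edge carries a (sub)sequence.
    """

    def __init__(self):
        # vertex -> list of (next_vertex, sequence carried by the edge)
        self.out_edges = {}

    def add_edge(self, src, dst, seq):
        self.out_edges.setdefault(src, []).append((dst, seq))

    def haplotypes(self, start, end):
        """Yield the sequence spelled by every path from start to end.

        Assumes the graph is acyclic, so the recursion always terminates.
        """
        if start == end:
            yield ""
            return
        for dst, seq in self.out_edges.get(start, []):
            for tail in self.haplotypes(dst, end):
                yield seq + tail


graph = GenomeGraph()

# Backbone: the linear reference "ACGTTGCA", cut at vertices 0-1-2-3.
graph.add_edge(0, 1, "ACG")
graph.add_edge(1, 2, "T")
graph.add_edge(2, 3, "TGCA")

# Known variants become extra edges between existing vertices.
graph.add_edge(1, 2, "C")  # SNP: T > C
graph.add_edge(1, 2, "")   # deletion of the single base T

haps = list(graph.haplotypes(0, 3))
print(haps)
```

Each extra edge adds one alternative path without duplicating the backbone, which is why such a graph can grow with new variants while its size increases only slightly.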
A Graph Genome Reference can contain both small variants (SNPs and indels up to several dozen base pairs in length) and larger structural variants, which are typically difficult to handle with short-read sequencing. Consequently, graph references provide the means both to detect structural variation accurately and to reduce the errors it causes in small variant calls.
Given the vast quantities of sequencing data being produced, attempts are being made to further reduce the size of the data. Recent approaches introduced the concept of reference-based compression, which yields an improvement of 75% and is being implemented in the CRAM file format for storing sequencing reads. Reference-based compression works by performing a fast, rough alignment of each sequence against a reference genome and storing only the differences.
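The idea of storing only the differences can be sketched in a few lines of Python. This is a simplified illustration, not the CRAM codec: it assumes the alignment position is already known and that the read differs from the reference by substitutions only, and the function names `encode_read` and `decode_read` are hypothetical.

```python
def encode_read(reference, read, pos):
    """Store a read as (start, length, mismatches) relative to the reference.

    Assumes a gapless alignment at `pos`; real formats also handle indels,
    clipping, and base qualities.
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs past the end of the reference")
    diffs = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if ref_base != base
    ]
    return pos, len(read), diffs


def decode_read(reference, record):
    """Rebuild the original read from the reference and the stored record."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTTGCAACGT"
read = "GTTGAAAC"  # matches reference[2:10] except for one base

record = encode_read(reference, read, pos=2)
print(record)
print(decode_read(reference, record))
```

A read that matches the reference exactly is reduced to a position and a length, so the better the reference represents the sample, the less there is to store.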
Graph structures enable better variant calling for better discovery
Graph Genome enables accurate variant calling, the process of determining the set of genomic variants present in a sample. Current state-of-the-art variant callers are based on local reassembly of the reads mapped to the region surrounding a putative variant. This approach significantly boosts the accuracy of variant calls, but it is limited by the chosen size of the reassembly region and the expected size of the assembled contig. Set these values too small and the results are compromised; set them too large and the computation tends to become prohibitive.
Using Graph Genome structures can alleviate this issue: because an edge in the Graph Genome Reference carries information about the size of its variant, reads (partially) aligned to a known variant edge allow a better, adaptive selection of the necessary reassembly window. Furthermore, Graph Genome technologies enable a completely different paradigm in variant calling. For the variants already present in the graph, both the sequence and the location are known, so it will be possible to genotype directly against the graph.
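To make the idea of genotyping directly against the graph concrete, here is a minimal Python sketch. It is not the Graph-Genome-Assisted Variant Caller's method: it assumes a diploid sample, a single biallelic site, and a simple binomial read-count model, and the function name `genotype_from_edge_counts` and the default `error_rate` are hypothetical.

```python
import math


def genotype_from_edge_counts(ref_reads, alt_reads, error_rate=0.01):
    """Pick the most likely diploid genotype from per-edge read counts.

    ref_reads / alt_reads: number of reads aligned to the reference edge
    and to the known variant edge at one site.
    error_rate: assumed chance that a read lands on the wrong edge;
    must lie strictly between 0 and 1.
    """
    # Probability that a single read supports the variant edge
    # under each candidate genotype.
    alt_prob = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}

    # Binomial log-likelihoods; the binomial coefficient is identical for
    # all genotypes, so it is left out of the comparison.
    log_likelihood = {
        genotype: alt_reads * math.log(p) + ref_reads * math.log(1.0 - p)
        for genotype, p in alt_prob.items()
    }
    return max(log_likelihood, key=log_likelihood.get)


print(genotype_from_edge_counts(ref_reads=20, alt_reads=0))
print(genotype_from_edge_counts(ref_reads=11, alt_reads=9))
print(genotype_from_edge_counts(ref_reads=0, alt_reads=18))
```

Because the variant's sequence and position come from the graph, no reassembly is needed in this sketch; the only inputs are the counts of reads that aligned to each edge.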
The Graph Genome Aligner uses a pan-genomic graph structure, created by augmenting the reference genome with a set of known variants, which significantly alleviates the effects of reference bias. The graph incorporates variants from a variety of sources, including the 1000 Genomes Project and the Simons Genome Diversity Project. By preserving all the variant information, the Graph Genome facilitates better alignment than linear alternatives, resulting in more accurate variant calls.
Graph Genome is driving innovative genomics research
The Graph Genome Suite enables highly accurate variant calling to support basic and clinical genomics research in humans and other organisms, including personalized cancer analyses, family trio studies, and sublineage mapping of important viruses.
Seven Bridges granted early access to the Graph Genome Suite to three of the world’s leading research institutions: the US National Cancer Institute’s Cancer Genomics Research Laboratory, Canada’s Michael Smith Genome Sciences Center, and the Sidra Medical and Research Center of Qatar. Our scientists work with researchers at these institutions to deliver insights into infectious disease characterization, personalized cancer medicine, and human population genomics.