Graph Genome Academic Release

You now have access to the graph genome binaries needed to reproduce the results of our publication in Nature Genetics: Rakocevic et al. (2019). Fast and Accurate Genomic Analysis using Genome Graphs. This page describes the tools and data available for download and provides a brief primer to help you get started.

The components include an aligner that uses a directed acyclic graph representation of the reference genome, incorporating population variation information, and a reassembly-based variant caller. Together they improve precision, speed, and recall without compromising specificity. The GRCh37-based graph described in the manuscript is also included. The tools are optimized for commodity hardware, and this method allows you to analyze large sets of human genomes at a lower cost per run than other methods. The Graph Genome Pipeline is compatible with common data formats (FASTQ, FASTA, BAM, VCF), which facilitates integration with standard bioinformatics workflows.

Please note: These tools are now obsolete and are only offered for the reproduction of the published results. The new version, Seven Bridges GRAF, contains significant performance improvements and supports both GRCh38 and GRCh37 genome graphs. The Seven Bridges Pan-Genome Graph references have also been published along with the GRAF Pan-Genome workflow to facilitate accurate analysis of next-generation sequencing data.

Download

First, download the tools. Download and use of the software require submission of contact information and acceptance of the licensing terms defined in the End User License Agreement.

Distribution

The graph tools are distributed as two compressed Docker images, one for read mapping (bpa aligner version 0.9.1.1) and one for reassembly-based variant calling (rasm variant caller version 0.5.20). The images will work on most Linux systems whose CPUs support the AVX2 instruction set.
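
Before loading the images, it can be useful to confirm that the host CPU advertises AVX2. The following is a minimal sketch, assuming a Linux host that exposes CPU flags in /proc/cpuinfo; the function name has_avx2 is hypothetical and not part of the release.

```shell
# Succeed if the CPU flags in a cpuinfo-style file include AVX2.
# Defaults to /proc/cpuinfo, which exists on Linux only (assumption).
has_avx2() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # -w matches "avx2" as a whole word, so "avx" alone does not count.
    grep -qw avx2 "$cpuinfo" 2>/dev/null
}

if has_avx2; then
    echo "AVX2 supported"
else
    echo "AVX2 not found"
fi
```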

We also include a pan-genome graph (SBG.Graph.B37.V6.rc6.vcf.gz) constructed from a collection of known alternate alleles published from multiple public population variation data sources. All branches in the pan-genome graph are specified relative to the GRCh37 human genome reference assembly (b37 version with decoy sequences as described here).

FASTA Reference File

Another required input is a FASTA file representing the genome of interest. In this case we suggest using human_g1k_v37_decoy.fasta, available here.

Getting Started

A basic workflow for analyzing a single sample consists of read mapping followed by variant calling. The aligner requires the input of a FASTA file representing the reference genome, a VCF representing the genetic variation to incorporate into the graph (created by the aligner in memory), and reads to map to the graph. The aligner outputs a BAM file which is input into the variant caller, which itself outputs a VCF file.

The supplied test reads should finish alignment in approximately 15 minutes, and variant calling should subsequently take approximately 5 minutes on a machine with a 2.7 GHz CPU, 16 GB free RAM and 512 GB storage. We recommend at least 32 GB RAM for 30x human whole genome data.

There are eight files in the download package:
1) docker-bpa-0.9.1.tar
2) docker-rasm-0.5.20.tar
3) SBG.Graph.B37.V6.rc6.vcf.gz
4) LICENSE.pdf
5) opensource-software-terms-and-copyrights.md
6) example_human_Illumina.pe_1.fastq
7) example_human_Illumina.pe_2.fastq
8) SAMPLE.sort.vcf
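
The file list above can be checked mechanically after unpacking. This is a minimal sketch, assuming the eight files sit together in one directory; the function name check_package is hypothetical.

```shell
# Succeed only if every file listed in the download package exists in a directory.
check_package() {
    dir="${1:-.}"
    missing=0
    for f in \
        docker-bpa-0.9.1.tar \
        docker-rasm-0.5.20.tar \
        SBG.Graph.B37.V6.rc6.vcf.gz \
        LICENSE.pdf \
        opensource-software-terms-and-copyrights.md \
        example_human_Illumina.pe_1.fastq \
        example_human_Illumina.pe_2.fastq \
        SAMPLE.sort.vcf
    do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Check the current directory and report if anything is absent.
check_package . || echo "download package is incomplete"
```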

The Docker settings used for both the aligner and the variant caller are:
CPUs 4
Memory 15.0 GiB
Swap 1.0 GiB

The binaries are distributed in Docker containers. Load the tar files into Docker with the commands:

docker load --input docker-bpa-0.9.1.tar

docker load --input docker-rasm-0.5.20.tar

Once the Docker images are loaded, the following command should list them along with their IDs:

docker images

To run a Docker image use the command:

docker run -ti -v /name_of_directory_with_data:/mountedcwd image_id

Both the aligner and the variant caller executables are located in /usr/local/bin.

Read Mapping

The following command demonstrates how to align reads from a single paired-end NGS sample against a pan-genome graph (--vcf SBG.Graph.B37.V6.rc6.vcf.gz --reference human_g1k_v37_decoy.fasta).

The paired-end read data are supplied in two complementary FASTQ files (-q fragment_1.fastq.gz -Q fragment_2.fastq.gz). Read group information contained in the BAM tags can be set using the --read_group_sample and --read_group_library options. The output BAM file name is specified using the -o option. The --threads parameter can be used to specify the number of threads to use.

/usr/local/bin/aligner --vcf SBG.Graph.B37.V6.rc6.vcf.gz --reference human_g1k_v37_decoy.fasta -q fragment_1.fastq.gz -Q fragment_2.fastq.gz -o sample.bam --read_group_sample 'SAMPLE_READ_GROUP' --read_group_library 'lib' --threads 4

Variant Detection

The BAM file output by the aligner must be sorted and indexed before it can be used for variant calling. We recommend samtools for this:

samtools sort sample.bam > sample.sort.bam
samtools index sample.sort.bam

The reassembly_variant_caller detects SNPs, indels, and structural variants from aligned reads in a sorted BAM file (-b sample.sort.bam). The pan-genome reference graph is specified with the -g and -f parameters (-g SBG.Graph.B37.V6.rc6.vcf.gz -f human_g1k_v37_decoy.fasta). The detected variants are written to the VCF file specified with the -v parameter.

/usr/local/bin/reassembly_variant_caller -b sample.sort.bam -f human_g1k_v37_decoy.fasta -g SBG.Graph.B37.V6.rc6.vcf.gz -v results.vcf
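
The mapping, sorting, indexing, and variant calling steps can be chained in one script. The following is a sketch only: it assumes the aligner, samtools, and the variant caller are all reachable from the same shell, which may not hold if samtools is installed outside the containers. The helper names run and pipeline and the DRY_RUN variable are hypothetical. It uses samtools sort -o in place of shell redirection; the tool options are otherwise those shown above.

```shell
# Print a command, and execute it unless DRY_RUN is set.
run() {
    printf '+ %s\n' "$*"
    if [ -z "${DRY_RUN:-}" ]; then
        "$@"
    fi
}

# Map, sort, index, and call variants for one paired-end sample.
pipeline() {
    sample="$1"   # output prefix and read group sample name
    fq1="$2"      # first FASTQ of the pair
    fq2="$3"      # second FASTQ of the pair
    graph="SBG.Graph.B37.V6.rc6.vcf.gz"
    ref="human_g1k_v37_decoy.fasta"

    run /usr/local/bin/aligner --vcf "$graph" --reference "$ref" \
        -q "$fq1" -Q "$fq2" -o "$sample.bam" \
        --read_group_sample "$sample" --read_group_library lib \
        --threads 4 || return 1
    run samtools sort -o "$sample.sort.bam" "$sample.bam" || return 1
    run samtools index "$sample.sort.bam" || return 1
    run /usr/local/bin/reassembly_variant_caller -b "$sample.sort.bam" \
        -f "$ref" -g "$graph" -v "$sample.vcf" || return 1
}

# Dry run: only print the four commands that would be executed.
DRY_RUN=1
pipeline sample fragment_1.fastq.gz fragment_2.fastq.gz
```

Unsetting DRY_RUN executes the commands, stopping at the first step that fails.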

Testing Binary Installation

There are two small FASTQ files included in the download package for verification:
example_human_Illumina.pe_1.fastq and example_human_Illumina.pe_2.fastq

Copy them into the same directory as the reference files. After installing the aligner and variant caller, the commands to process the test data are:

In Docker, run read mapping:

./aligner --vcf /mountedcwd/SBG.Graph.B37.V6.rc6.vcf.gz --threads 4 -o /mountedcwd/SAMPLE.bam --reference /mountedcwd/human_g1k_v37_decoy.fasta -q /mountedcwd/example_human_Illumina.pe_1.fastq -Q /mountedcwd/example_human_Illumina.pe_2.fastq --read_group_sample 'SAMPLE_READ_GROUP' --read_group_library 'lib'

In the directory holding the output BAM file, sort and index SAMPLE.bam

samtools sort SAMPLE.bam > SAMPLE.sort.bam
samtools index SAMPLE.sort.bam

In Docker again, run variant calling:

./reassembly_variant_caller -b /mountedcwd/SAMPLE.sort.bam -f /mountedcwd/human_g1k_v37_decoy.fasta -g /mountedcwd/SBG.Graph.B37.V6.rc6.vcf.gz -v /mountedcwd/results.vcf

The output file results.vcf can now be compared against the expected results, supplied as SAMPLE.sort.vcf in the archive.
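
One way to make this comparison is to ignore header lines, which may differ between runs, and compare only the called sites. The sketch below assumes that matching CHROM, POS, REF, and ALT columns (tab-separated columns 1, 2, 4, and 5 of a VCF record) is a sufficient check; the release does not state how closely the files should agree. The function names vcf_sites and same_calls are hypothetical.

```shell
# Print CHROM, POS, REF, ALT for every non-header VCF record, sorted.
vcf_sites() {
    grep -v '^#' "$1" | cut -f 1,2,4,5 | sort
}

# Succeed if both VCFs contain the same sites and alleles; print a diff otherwise.
same_calls() {
    tmp_a=$(mktemp) || return 2
    tmp_b=$(mktemp) || return 2
    vcf_sites "$1" > "$tmp_a"
    vcf_sites "$2" > "$tmp_b"
    rc=0
    diff "$tmp_a" "$tmp_b" || rc=$?
    rm -f "$tmp_a" "$tmp_b"
    return "$rc"
}

# Compare the test output with the supplied expected results, if both exist.
if [ -f results.vcf ] && [ -f SAMPLE.sort.vcf ]; then
    if same_calls results.vcf SAMPLE.sort.vcf; then
        echo "calls match the expected results"
    else
        echo "calls differ from the expected results"
    fi
fi
```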