You now have access to graph genome binaries that enhance the scalability and accuracy of your genomic analyses.
The download package consists of several components required for reproducing the results of our recent publication in Nature Genetics: Rakocevic et al. (2019). Fast and Accurate Genomic Analysis using Genome Graphs. This page describes the tools and data available for download and provides a brief primer to help you get started.
The components include an aligner that uses a directed acyclic graph representation of the reference genome, including population variation information, and a reassembly-based variant caller. Together they improve precision, speed, and recall without compromising specificity. The GRCh37-based graph described in the manuscript is also included. The tools are optimized for commodity hardware, so you can analyze large sets of human genomes at a lower cost per run than with other methods. The Graph Genome Pipeline is compatible with common data formats (FASTQ, FASTA, BAM, VCF), which facilitates integration with standard bioinformatics workflows.
More releases will follow, so stay tuned for updates!
Let’s get started.
First, download the tools. Downloading and using the software requires submitting contact information and accepting the licensing terms defined in the End User License Agreement.
The graph tools are distributed as two compressed Docker images, one for read mapping (bpa aligner version 0.9.1.1) and one for reassembly-based variant calling (rasm variant caller version 0.5.20). The images will work on the majority of Linux systems whose CPUs support the AVX2 instruction set.
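Since the images depend on AVX2, it can help to check the host CPU before loading them. The following is a minimal sketch, assuming a Linux host that exposes CPU flags in /proc/cpuinfo; the function name has_avx2 is illustrative and not part of the download package.

```shell
# Succeed if the CPU flags listed in a cpuinfo-style file include avx2.
# The default path /proc/cpuinfo is Linux-specific (an assumption of this sketch).
has_avx2() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    # -w matches avx2 as a whole word, so flags such as avx or avx512f do not count.
    grep -qw 'avx2' "$cpuinfo" 2>/dev/null
}

if has_avx2; then
    echo "AVX2 reported by this CPU"
else
    echo "AVX2 not reported by this CPU"
fi
```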
We also include a pan-genome graph (SBG.Graph.B37.V6.rc6.vcf.gz) constructed from a collection of known alternate alleles drawn from multiple public population variation data sources. All branches in the pan-genome graph are specified relative to the GRCh37 human genome reference assembly (b37 version with decoy sequences as described here).
FASTA Reference File
Another required input is a FASTA file representing the genome of interest. In this case we suggest using human_g1k_v37_decoy.fasta, available here.
A basic workflow for analyzing a single sample consists of read mapping followed by variant calling. The aligner requires a FASTA file representing the reference genome, a VCF file representing the genetic variation to incorporate into the graph (which the aligner constructs in memory), and the reads to map to the graph. The aligner outputs a BAM file, which is the input to the variant caller, which in turn outputs a VCF file.
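Because the aligner needs all of its inputs in place, a quick check before starting a run can save time. The sketch below is only an illustration: check_inputs is a hypothetical helper that takes a data directory followed by the file names you expect to find there.

```shell
# Report each expected file that is missing from a data directory.
# Usage: check_inputs DIR FILE [FILE...]
# Returns 0 only when every listed file exists in DIR.
check_inputs() {
    local dir="$1" missing=0 f
    shift
    for f in "$@"; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_inputs /name_of_directory_with_data SBG.Graph.B37.V6.rc6.vcf.gz human_g1k_v37_decoy.fasta would confirm that the graph and reference files are present before the directory is mounted into the container.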
The supplied test reads should finish alignment in approximately 15 minutes, and variant calling should subsequently take approximately 5 minutes on a machine with a 2.7 GHz CPU, 16 GB free RAM and 512 GB storage. We recommend at least 32 GB RAM for 30x human whole genome data.
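To compare a host against these memory figures, one option is to read the available memory the kernel reports. This is a rough sketch, assuming a Linux host whose /proc/meminfo includes a MemAvailable line; the function names are illustrative, and the threshold is given in GiB by the caller.

```shell
# Print MemAvailable from a meminfo-style file in whole GiB.
# /proc/meminfo reports the value in kB; the default path is Linux-specific.
available_gib() {
    local meminfo="${1:-/proc/meminfo}"
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1048576 }' "$meminfo"
}

# Succeed if at least MIN_GIB (first argument) is reported as available.
has_enough_memory() {
    local min_gib="$1" meminfo="${2:-/proc/meminfo}" avail
    avail=$(available_gib "$meminfo")
    [ -n "$avail" ] && [ "$avail" -ge "$min_gib" ]
}
```

Calling has_enough_memory 16 before processing the test reads, or has_enough_memory 32 before a 30x whole genome run, gives an approximate go or no-go signal.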
There are eight files in the download package:
The Docker settings used for both the aligner and the variant caller are:
Memory 15.0 GiB
Swap 1.0 GiB
The binaries are distributed as Docker images. Load the tar files into Docker with the commands:
docker load --input docker-bpa-0.9.1.tar
docker load --input docker-rasm-0.5.20.tar
Once the Docker images are loaded, the following command should show the IDs for all the images:
docker images
To run a Docker image, use the command:
docker run -ti -v /name_of_directory_with_data:/mountedcwd image_id
Both the aligner and the variant caller executables are located in /usr/local/bin.
The following command demonstrates how to align reads from a single paired-end NGS sample against a pan-genome graph (--vcf SBG.Graph.B37.V6.rc6.vcf.gz --reference human_g1k_v37_decoy.fasta). The paired-end read data are supplied in two complementary FASTQ files (-q fragment_1.fastq.gz -Q fragment_2.fastq.gz). Read group information stored in the BAM tags can be set using the --read_group_sample and --read_group_library options. The output BAM file name is specified using the -o option. The --threads parameter specifies the number of threads to use.
/usr/local/bin/aligner --vcf SBG.Graph.B37.V6.rc6.vcf.gz --reference human_g1k_v37_decoy.fasta -q fragment_1.fastq.gz -Q fragment_2.fastq.gz -o sample.bam --read_group_sample 'SAMPLE_READ_GROUP' --read_group_library 'lib' --threads 4
The BAM file output by the aligner must be sorted and indexed before it can be used for variant calling. We recommend samtools for this:
samtools sort sample.bam > sample.sort.bam
samtools index sample.sort.bam
reassembly_variant_caller detects SNPs, indels, and structural variants from aligned reads in a sorted BAM file (-b sample.sort.bam). The pan-genome reference graph is specified with the -g and -f parameters (-g SBG.Graph.B37.V6.rc6.vcf.gz -f human_g1k_v37_decoy.fasta). The detected variants are written to the VCF file specified with the -v parameter.
/usr/local/bin/reassembly_variant_caller -b sample.sort.bam -f human_g1k_v37_decoy.fasta -g SBG.Graph.B37.V6.rc6.vcf.gz -v results.vcf
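The alignment, sorting, indexing, and variant calling steps above can be chained in a small wrapper. The following is a sketch under one notable assumption: that the aligner, samtools, and the variant caller are all reachable from the same shell, which may not match your Docker setup. The names run_step, graph_pipeline, and the DRY_RUN variable are illustrative; only flags shown in the commands above are used, apart from the samtools sort -o option, which writes the sorted output to a named file instead of standard output.

```shell
# Print the command when DRY_RUN=1; otherwise execute it.
run_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: graph_pipeline GRAPH_VCF REFERENCE_FASTA FASTQ_1 FASTQ_2 OUTPUT_PREFIX
# Stops at the first step that fails.
graph_pipeline() {
    local graph="$1" ref="$2" fq1="$3" fq2="$4" prefix="$5"
    # Step 1: map the paired-end reads to the graph.
    run_step /usr/local/bin/aligner --vcf "$graph" --reference "$ref" \
        -q "$fq1" -Q "$fq2" -o "$prefix.bam" \
        --read_group_sample 'SAMPLE_READ_GROUP' --read_group_library 'lib' \
        --threads 4 || return 1
    # Steps 2 and 3: sort and index the alignments.
    run_step samtools sort -o "$prefix.sort.bam" "$prefix.bam" || return 1
    run_step samtools index "$prefix.sort.bam" || return 1
    # Step 4: call variants from the sorted, indexed BAM file.
    run_step /usr/local/bin/reassembly_variant_caller -b "$prefix.sort.bam" \
        -f "$ref" -g "$graph" -v "$prefix.vcf" || return 1
}
```

Running the function with DRY_RUN=1 prints the four commands without executing them, which is a convenient way to review paths before committing to a long run.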
Testing Binary Installation
There are two small FASTQ files included in the download package for verification:
example_human_illumina.pe_1.fastq & example_human_illumina.pe_2.fastq
Copy them into the same directory as the reference files. After installing the aligner and variant caller, the commands to process the test data are:
In Docker, run read alignment:
./aligner --vcf /mountedcwd/SBG.Graph.B37.V6.rc6.vcf.gz --threads 4 -o /mountedcwd/SAMPLE.bam --reference /mountedcwd/human_g1k_v37_decoy.fasta -q /mountedcwd/example_human_Illumina.pe_1.fastq -Q /mountedcwd/example_human_Illumina.pe_2.fastq --read_group_sample 'SAMPLE_READ_GROUP' --read_group_library 'lib'
In the directory holding the output BAM file, sort and index SAMPLE.bam:
samtools sort SAMPLE.bam > SAMPLE.sort.bam
samtools index SAMPLE.sort.bam
In Docker again, run variant calling:
./reassembly_variant_caller -b /mountedcwd/SAMPLE.sort.bam -f /mountedcwd/human_g1k_v37_decoy.fasta -g /mountedcwd/SBG.Graph.B37.V6.rc6.vcf.gz -v /mountedcwd/results.vcf
The output file results.vcf can now be compared against the expected results, supplied as SAMPLE.sort.vcf in the archive.
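One way to make this comparison is to ignore the VCF header lines, which begin with # and may differ between runs, and compare only the variant records. The sketch below assumes that an exact match of the records is the intended check, which the package does not state; same_vcf_records is an illustrative name. It holds both files in memory, which is reasonable for the small test data set but not for whole genome output.

```shell
# Succeed if two VCF files contain identical records once header lines
# (those starting with '#') are dropped.
same_vcf_records() {
    local a b
    a=$(grep -v '^#' "$1" || true)
    b=$(grep -v '^#' "$2" || true)
    [ "$a" = "$b" ]
}
```

For the test data, same_vcf_records results.vcf SAMPLE.sort.vcf && echo "records match" reports whether the two call sets agree.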