Simons Genome Diversity Project now available on the Platform and CGC

Written by Patrick in Cancer Genomics Cloud on April 12th, 2017

We’ve just announced the availability of the Simons Genome Diversity Project (SGDP) dataset on our Platform and Cancer Genomics Cloud. In keeping with Seven Bridges’ mission to colocalize data and analysis tools on the cloud, anyone with an internet connection can now explore SGDP’s 35 TB of data. One way to get an overview of this huge dataset is through this interactive SGDP map.

The SGDP is unique because it was designed to help address the lack of diversity in available genomic data. While population genomics studies like the 1000 Genomes Project have made big strides towards understanding genetic diversity, the bias towards European populations means that models of human genetic diversity suffer from a lack of information from many parts of the world. Huge amounts of untapped genome diversity, especially in Africa, have the potential to accelerate precision medicine.

The Simons Project approaches this issue head-on by collecting whole genomes from 279 individuals from 130 populations, including indigenous populations on every inhabited continent. You can have a look at the interactive map at the bottom of this page to get an idea of the kind of geographic and cultural variation captured in SGDP. Another great thing about SGDP is the high sequence coverage: all samples were sequenced at 34- to 83-fold coverage, with an average of 43-fold.

Non-reference sequences

Variant hunters will additionally celebrate the 5.8 million base pairs of high coverage non-reference sequence that SGDP brings to the table. Non-reference sequence can also harbor interesting and potentially clinically relevant variation.

For example, researchers at deCODE/Amgen announced last month that they identified yet another heart attack-associated variant from their collection of Icelandic biobank samples. This variant, a 766-bp insertion, lies within an intron of SREBP-1 and was missing from the GRCh38 reference genome and 1000 Genomes Project data.

If we feed this type of variation into our graph genome, which becomes more accurate with each new sequence, we can get better alignment accuracy and minimize reference bias. What other interesting non-reference sequences exist out there in the world? We look forward to seeing new insights emerge from exploration of the new sequences in SGDP.

Human origins and migration

The initial publication of SGDP in Nature in October of 2016 was picked up by news outlets worldwide due to its conclusions about human origins and migration.

Phylogenetic analysis of the sequences in SGDP based on pairwise divergence per nucleotide shows that there is greater genetic diversity within Africa than outside of it. All non-African genomes appear to descend from a single group that split from the ancestors of African hunter-gatherers around 50,000 years ago. This theory of an African origin for humanity aligns well with phylogenetic studies of mtDNA and with recent publications from the Estonian Biocentre Human Genome Diversity Panel and another reporting whole genomes from Aboriginal Australians. The study also suggests that African populations were diverging genetically as much as 200,000 years ago, which implies that linguistic and cultural differences were shaping humanity even before it expanded from Africa.
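To make "pairwise divergence per nucleotide" concrete, here is a minimal Python sketch. It is only an illustration of the general idea, not the pipeline used in the Nature paper: the function names and the toy sequences are hypothetical, and it assumes the sequences are already aligned to the same length.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def pairwise_divergence(seq_a, seq_b):
    """Fraction of comparable sites at which two aligned sequences differ.

    Sites where either sequence has a non-ACGT character (e.g. N or a gap)
    are skipped. This masking rule is an assumption made for the sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in VALID_BASES or b not in VALID_BASES:
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatches / compared


def mean_pairwise_divergence(sequences):
    """Average divergence over every unordered pair in a group."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_divergence(a, b) for a, b in pairs) / len(pairs)


# Toy sequences, invented for illustration (not SGDP data).
group_one = ["ACGTACGTAC", "ACGTTCGTAA", "TCGTACGAAC"]
group_two = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAA"]

print(mean_pairwise_divergence(group_one))
print(mean_pairwise_divergence(group_two))
```

In this toy setup the first group has the higher mean divergence, which is the kind of comparison behind the statement that diversity within Africa exceeds diversity outside of it. Real analyses work from genome-wide variant calls and handle heterozygous sites and missing data far more carefully.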

The Nature paper also attempts to tease out variants that could provide clues to the genotypic basis for the development of language. Dampening the hopes of the linguistics community, which has sought genetic factors associated with the development of language (check out the studies on the recent evolution of FOXP2 and its role in mouse vocalization), the SGDP finds no genetic smoking gun. It seems that genetics is only one of many factors that contributed to humanity’s expansion across the globe.

Finally, another paper that emerged in the wake of SGDP takes advantage of the combination of high coverage and wide geographic distribution to track unique classes of single nucleotide variation that vary in frequency from population to population. Taking this variation in mutation rate into account should lead to more accurate calibration of molecular clocks and better demographic modeling.
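As a rough illustration of what comparing classes of single nucleotide variation between populations can look like, the Python sketch below tallies substitutions by type. Grouping by the six strand-folded substitution classes is a common convention and an assumption here; the paper mentioned above may define its classes differently (for example, with flanking sequence context). The function names and input variants are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def mutation_class(ancestral, derived):
    """Label a substitution, folded so the ancestral base is C or T.

    For example, G>A on one strand is the same event as C>T on the other,
    so both are reported as "C>T".
    """
    ancestral, derived = ancestral.upper(), derived.upper()
    if ancestral not in COMPLEMENT or derived not in COMPLEMENT:
        raise ValueError("alleles must be one of A, C, G, T")
    if ancestral == derived:
        raise ValueError("ancestral and derived alleles are identical")
    if ancestral in ("A", "G"):
        ancestral, derived = COMPLEMENT[ancestral], COMPLEMENT[derived]
    return f"{ancestral}>{derived}"


def mutation_spectrum(variants):
    """Proportion of variants falling into each substitution class."""
    counts = Counter(mutation_class(a, d) for a, d in variants)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Invented (ancestral, derived) allele pairs for two hypothetical populations.
population_a = [("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
population_b = [("C", "T"), ("C", "T"), ("G", "A"), ("T", "C")]

print(mutation_spectrum(population_a))
print(mutation_spectrum(population_b))
```

Differences between the two resulting spectra are the sort of signal that, at genome scale and with proper statistics, would indicate that the mix of mutation types varies across populations.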

To learn more, check out the Seven Bridges Knowledge Center, or log in to the CGC or Seven Bridges Platform to access SGDP directly.

Patrick