What Makes TOPMed Datasets So Special?


Studies from the Trans-Omics for Precision Medicine (TOPMed) program are available for analysis on NHLBI BioData Catalyst. The TOPMed program, funded by the National Heart, Lung, and Blood Institute (NHLBI), part of the National Institutes of Health (NIH), generates data to advance research on heart, lung, blood, and sleep disorders. The program is building a rich multi-omic dataset spanning genomics, proteomics, metabolomics, and transcriptomics, and is adding that information to pre-existing studies that have characterized thousands of phenotypic variables, including biochemical, physiological, clinical, behavioral, and anatomical measures. TOPMed studies feature greater racial and ancestral diversity than most other genetic studies, which have typically focused on participants with European ancestry. Working with participants of other ancestries is important for advancing scientific discovery for the many groups that have received significantly less (or no) attention from genetic research.

TOPMed currently includes whole-genome sequencing (WGS) of over 155,000 individuals, making it one of the largest and most diverse WGS datasets available to researchers. Some TOPMed studies also offer rich imaging datasets, such as COPDGene, with 22 million CT scan images for about 10,000 study participants. The TOPMed program complements other resources, such as the All of Us Research Program, the Million Veteran Program, and the NIH Database of Genotypes and Phenotypes (dbGaP). Overall, the TOPMed program is a powerful step toward precision medicine: using a patient’s individual genetic background and environmental factors to formulate therapies and treatments tailored to that individual.

TOPMed studies on NHLBI BioData Catalyst

NHLBI BioData Catalyst powered by Seven Bridges strives to facilitate access to TOPMed studies. The following TOPMed data are hosted on NHLBI BioData Catalyst today:

  • Multi-sample VCFs for ~55,000 sequenced participants within 32 TOPMed studies included in Freeze 5b (see the table below for the full list).
  • CRAM files and single-sample VCFs for all sequenced participants. The CRAM files are not available on dbGaP.
  • Raw phenotype files for participants in TOPMed studies, providing clinical information such as BMI and lipid levels. In some cases, these data are in different dbGaP accessions.

Seven Bridges is currently working with the NHLBI BioData Catalyst consortium to onboard additional studies and data that have been released on dbGaP as part of TOPMed Freeze 8.

Working with the TOPMed studies on the cloud

You can view the available TOPMed studies in the platform's Data Browser and select studies to query. In the Data Browser, you can search by file type as well as by study consent group. The following data types are available for the hosted TOPMed studies:

Data Type property in Data Browser | Genomic data file type
Aligned Reads | CRAM files
Simple Germline Variation | Single-sample VCF files
Unharmonized Clinical Data | Raw phenotype files from dbGaP
Variant Call | Multi-sample VCF files

You can find these files by searching the “File” entity and filtering on the “Data Type” property in the Data Browser.

The phenotype data and multi-sample VCF data are currently packaged as tar bundles, one for each consent group within a TOPMed study. These tar files can be unpacked on the platform with the Seven Bridges Decompressor App, which you can find by searching for “decompressor” in the Public Gallery of Apps.
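On the platform itself, the Decompressor App handles this step. For readers who want to inspect a bundle from a notebook or script instead, the following is a minimal sketch using Python's standard tarfile module. It assumes the bundle is already accessible as a local file; the function names and any paths you pass in are illustrative and do not come from the platform.

```python
import tarfile
from pathlib import Path


def list_bundle_members(bundle_path):
    """Return the names of the regular files stored in a tar bundle."""
    # mode "r:*" lets tarfile detect plain, gzip, bzip2, or xz archives.
    with tarfile.open(bundle_path, mode="r:*") as bundle:
        return [m.name for m in bundle.getmembers() if m.isfile()]


def extract_bundle(bundle_path, dest_dir):
    """Extract a tar bundle into dest_dir and return the extracted file paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Newer Python versions support extraction filters; use the safe "data"
    # filter when it is available and fall back to the default otherwise.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(bundle_path, mode="r:*") as bundle:
        names = [m.name for m in bundle.getmembers() if m.isfile()]
        bundle.extractall(dest, **kwargs)
    return [dest / name for name in names]
```

Listing the members first is a cheap way to confirm which phenotype or VCF files a consent-group bundle contains before unpacking it.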

The Data Browser exposes only open metadata from the TOPMed studies for search, so all researchers can run the same searches and see which files exist. However, only users with appropriate dbGaP approval can add files to their project to use in an analysis. A service within NHLBI BioData Catalyst programmatically reads user permissions from dbGaP to determine whether a user can access particular files on the system. This service recognizes whether a user has a Data Access Request in dbGaP for TOPMed data or is set as a “dbGaP downloader” for a particular dataset. These are therefore the two mechanisms for getting access to TOPMed studies on NHLBI BioData Catalyst. Please note that phenotype and genotype data for some studies are in different dbGaP accessions. More information is available on the Data page of the NHLBI BioData Catalyst website.
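The access rule described above can be sketched as a small check. This is only an illustration of the logic, not the platform's actual service: the DbgapPermissions type and the can_add_to_project function are hypothetical names, and the model works at the level of whole accessions, ignoring consent groups.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class DbgapPermissions:
    """Hypothetical snapshot of one user's dbGaP permissions (not a real API type)."""

    # Accessions for which the user holds a Data Access Request.
    data_access_requests: Set[str] = field(default_factory=set)
    # Accessions for which the user is set as a "dbGaP downloader".
    downloader_for: Set[str] = field(default_factory=set)


def can_add_to_project(perms: DbgapPermissions, accession: str) -> bool:
    """Return True if either access mechanism covers the given accession."""
    return (
        accession in perms.data_access_requests
        or accession in perms.downloader_for
    )


# Example with the two ARIC accessions listed in this post. In this simplified
# model, access to the genomic accession does not cover the phenotype accession.
user = DbgapPermissions(data_access_requests={"phs001211"})
print(can_add_to_project(user, "phs001211"))  # True
print(can_add_to_project(user, "phs000280"))  # False
```

The example shows why the note about separate accessions matters: a study's genomic and phenotype files may each need to be covered.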

To learn more about how to get started working with the TOPMed studies on NHLBI BioData Catalyst powered by Seven Bridges, take a look at the Onboarding to Seven Bridges Tutorial, which describes how to create an account, set up projects, run analyses, and find the TOPMed data in the Data Browser.

Further information on the hosted datasets can also be found in the “Datasets Hub” section of the Seven Bridges Documentation.

For more information on which TOPMed studies and parent studies are offered, including their phs identification numbers used by dbGaP, please see the tables below:

Hosted TOPMed study accessions with genomic data from Freeze 5b

Study Name | Acronym | phs I.D. #
NHLBI TOPMed: Genetics of Cardiometabolic Health in the Amish | Amish | phs000956
NHLBI TOPMed: Atherosclerosis Risk in Communities | ARIC | phs001211
NHLBI TOPMed: The Genetics and Epidemiology of Asthma in Barbados | BAGS | phs001143
NHLBI TOPMed: Cleveland Clinic Atrial Fibrillation Study | CCAF | phs001189
NHLBI TOPMed: The Cleveland Family Study | CFS | phs000954
NHLBI TOPMed: Cardiovascular Health Study | CHS | phs001368
NHLBI TOPMed: Genetic Epidemiology of COPD | COPDGene | phs000951
NHLBI TOPMed: The Genetic Epidemiology of Asthma in Costa Rica | CRA | phs000988
NHLBI TOPMed: Diabetes Heart Study | DHS | phs001412
NHLBI TOPMed: Boston Early-Onset COPD Study | EOCOPD | phs000946
NHLBI TOPMed: Framingham Heart Study | FHS | phs000974
NHLBI TOPMed: Genes-Environments and Admixture in Latino Asthmatics | GALAII | phs000920
NHLBI TOPMed: Genetic Study of Atherosclerosis Risk | GeneSTAR | phs001218
NHLBI TOPMed: Genetic Epidemiology Network of Arteriopathy | GENOA | phs001345
NHLBI TOPMed: Genetic Epidemiology Network of Salt Sensitivity | GenSalt | phs001217
NHLBI TOPMed: Epigenetic Determinants of Lipid Response to Dietary Fat and Fenofibrate | GOLDN | phs001359
NHLBI TOPMed: Heart and Vascular Health Study | HVH | phs000993
NHLBI TOPMed: Genetics of Left Ventricular Hypertrophy | HyperGEN | phs001293
NHLBI TOPMed: The Jackson Heart Study | JHS | phs000964
NHLBI TOPMed: Whole Genome Sequencing of Venous Thromboembolism | Mayo_VTE | phs001402
NHLBI TOPMed: The Multi-Ethnic Study of Atherosclerosis | MESA | phs001416
NHLBI TOPMed: Massachusetts General Hospital (MGH) Atrial Fibrillation Study | MGH_AF | phs001062
NHLBI TOPMed: Partners HealthCare Biobank | Partners | phs001024
NHLBI TOPMed: San Antonio Family Heart Study | SAFS | phs001215
NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment | SAGE | phs000921
NHLBI TOPMed: African American Sarcoidosis Genetics Resource | Sarcoidosis | phs001207
NHLBI TOPMed: Genome-wide Association Study of Adiposity in Samoans | SAS | phs000972
NHLBI TOPMed: Rare Variants for Hypertension in Taiwan Chinese | THRV | phs001387
NHLBI TOPMed: The Vanderbilt Atrial Fibrillation Ablation Registry | VAFAR | phs000997
NHLBI TOPMed: The Vanderbilt Atrial Fibrillation Registry | VU_AF | phs001032
NHLBI TOPMed: The Women’s Genome Health Study | WGHS | phs001040
NHLBI TOPMed: Women’s Health Initiative | WHI | phs001237

Hosted TOPMed study accessions with phenotype data

Study Name | Acronym | phs I.D. #
Atherosclerosis Risk in Communities | ARIC | phs000280
Cleveland Clinic Atrial Fibrillation Study | CCAF | phs000820
The Cleveland Family Study | CFS | phs000284
Cardiovascular Health Study | CHS | phs000287
Genetic Epidemiology of COPD | COPDGene | phs000179
Framingham Heart Study | FHS | phs000007
Genes-Environments and Admixture in Latino Asthmatics | GALAII | phs001180
Genetic Study of Atherosclerosis Risk | GeneSTAR | phs001074
Genetic Epidemiology Network of Arteriopathy | GENOA | phs001238
Genetic Epidemiology Network of Salt Sensitivity | GenSalt | phs000784
Heart and Vascular Health Study | HVH | phs001013
The Jackson Heart Study | JHS | phs000286
The Multi-Ethnic Study of Atherosclerosis | MESA | phs000209
Massachusetts General Hospital (MGH) Atrial Fibrillation Study | MGH_AF | phs001001
Women’s Health Initiative | WHI | phs000200

 
