Festival of Genomics London — Day 2


Thursday at the Festival

Yesterday, the Seven Bridges team and other delegates were back in action at the Festival of Genomics London. On another busy day, we focused our attention on the challenges of cancer data analysis.

Nazneen Rahman from the Institute of Cancer Research described the difficulties of turning cancer variant data into a clinically useful resource. From a doctor's perspective, the aim is to translate genetic variant information into clinically implementable actions, where variants guide treatment decisions. For example, in breast cancer many novel BRCA variants can be classified as pathogenic or nonpathogenic through automated processes. Importantly, the effects of variants differ according to context (for example, between familial and nonfamilial cancers), and these contexts also need to be included in variant effect databases.

Jan Korbel from EMBL spoke about the work of the germline working group from the Pan-Cancer Analysis of Whole Genomes (PCAWG) project. PCAWG is an attempt to standardize the world’s cancer data, which has been generated and processed in many different ways worldwide. The analysis has been done using distributed computation, with subsets of the data analysed on the cloud in conjunction with industry partners (including Seven Bridges). There is now a cloud release of ~1,300 germline cancer genomes, some via The Cancer Genome Atlas (TCGA), which is accessible through our Cancer Genomics Cloud.

Nils Gehlenborg from Harvard Medical School spoke on the role of data visualization in enhancing our understanding of the cancer genome. Visualization is an important tool for pattern discovery and hypothesis generation, but is difficult to apply to increasingly large datasets. He presented StratomeX, an interface for exploring complex datasets such as TCGA. A demonstration video is available here.

Discovery in Millions of Genomes

The morning plenary session concluded with Julia Fan Li from Seven Bridges UK speaking about our experience working with very large genomic datasets.

The number of human sequences generated worldwide is increasing rapidly. Many large sequencing projects are underway, including the UK 100,000 Genomes Project and the US Million Veteran Program. One million sequenced genomes is about to become a reality.

Why do we need millions of genomes? Julia used the example of cancer, where genomics is driving research, diagnosis and treatment. Discovering genetic variants associated with cancer is still very much a numbers game. More sequenced tumor–normal pairs lead to the identification of more genes that are mutated at clinically important frequencies.

But sequencing an ever-increasing number of human genomes creates challenges for data acquisition, storage, distribution and analysis. In terms of difficulty, the challenges of genomics are at least on a par with other major big data enterprises: astronomy, YouTube and Twitter.

Several trends emerge from these challenges:

  • First, computation centers will replace data repositories;
  • Second, portable workflows will replace data transfers;
  • Third, advanced data structures will replace static flat files.

Computation centers replace data repositories

Genomic data poses major challenges for storage and distribution. It will rapidly become impractical to download genomic data to work with locally. The alternative is to store data in a cloud environment and to bring algorithms to the data. This will become the norm.

Portable workflows replace data transfers

As algorithms are brought to data, the portability of workflows becomes more important. Instead of being detailed in a paper’s methods section, the exact pipelines used for an analysis (including tool versions and parameter settings) will be recorded in simple text files that are easily shared and hyperlinked. This means that computational analyses will be transparent and reproducible.
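As a rough illustration of what "recorded in simple text files" could look like, here is a minimal Python sketch. It assumes a made-up pipeline description: the workflow name, tool names, versions and parameters are all hypothetical placeholders, and the JSON layout is not a real workflow standard or the format used by Seven Bridges. The sketch serializes the description to canonical text and derives a fingerprint that could be shared alongside a result.

```python
import hashlib
import json

# Hypothetical pipeline description: the tool names, versions and
# parameters are illustrative placeholders, not a real workflow.
workflow = {
    "name": "example-variant-calling",
    "steps": [
        {"tool": "aligner", "version": "1.0.0", "params": {"threads": 8}},
        {"tool": "variant-caller", "version": "2.3.1", "params": {"min_quality": 30}},
    ],
}


def serialize(wf: dict) -> str:
    """Render the workflow as canonical JSON text (sorted keys, fixed indent)."""
    return json.dumps(wf, sort_keys=True, indent=2)


def fingerprint(wf: dict) -> str:
    """Return the SHA-256 hex digest of the canonical text."""
    return hashlib.sha256(serialize(wf).encode("utf-8")).hexdigest()


# The text form can be saved, shared, or linked; reading it back
# yields an identical description with an identical fingerprint.
text = serialize(workflow)
restored = json.loads(text)
```

Because the text is canonical, two people holding the same fingerprint can be reasonably confident they ran the same tool versions with the same parameters. Changing any version or setting produces a different fingerprint.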

Advanced data structures replace static flat files

Current sequence and reference data is static, flat, and difficult to store and distribute. As we move towards a million sequenced genomes we must develop novel tools that better represent genetic variation and metadata. Graph genomes will capture the variation of a whole population and self-improve as more genomes become available.
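To make the idea of a graph genome a little more concrete, here is a toy Python sketch of a sequence graph. It is an assumption-laden illustration of the general data structure only: the `SequenceGraph` class, its method names and the example sequences are invented for this post and say nothing about how the Graph Aligner works internally. Nodes hold stretches of sequence, edges connect them, and each path through the graph spells out one possible haplotype.

```python
from collections import defaultdict


class SequenceGraph:
    """Toy sequence graph (hypothetical class, for illustration only)."""

    def __init__(self):
        self.seq = {}                    # node id -> sequence string
        self.edges = defaultdict(list)   # node id -> list of successor ids

    def add_node(self, node_id, sequence):
        self.seq[node_id] = sequence

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def haplotypes(self, start, end):
        """Yield the sequence spelled by every path from start to end.

        Assumes the graph is acyclic, which holds for this small example.
        """
        stack = [(start, self.seq[start])]
        while stack:
            node, spelled = stack.pop()
            if node == end:
                yield spelled
                continue
            for nxt in self.edges[node]:
                stack.append((nxt, spelled + self.seq[nxt]))


# A reference "ACG" + "T" + "GA" with one known SNP (T -> C).
graph = SequenceGraph()
graph.add_node(1, "ACG")
graph.add_node(2, "T")   # reference allele
graph.add_node(3, "C")   # alternate allele
graph.add_node(4, "GA")
for src, dst in [(1, 2), (1, 3), (2, 4), (3, 4)]:
    graph.add_edge(src, dst)
before = set(graph.haplotypes(1, 4))

# A newly sequenced genome carries a deletion of that base:
# one extra edge is enough to record it.
graph.add_edge(1, 4)
after = set(graph.haplotypes(1, 4))
```

In this sketch, `before` holds the two SNP haplotypes and `after` also includes the deletion haplotype. The existing nodes are untouched when the new genome is added, which is the sense in which such a structure can grow as more genomes become available.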

This concept proved popular with the audience.

Graph Aligner

Finally, Julia invited the audience to apply for early access to our Graph Aligner. Our London-based team has been developing this tool with support from Genomics England as part of the 100,000 Genomes Project.

That’s it from the event in London. The Festival of Genomics will be back in Boston in June 2016.