Microsatellite Instability

Microsatellite instability (MSI) is a condition in which cells (such as tumor cells) accumulate mutations in microsatellites (short, repeated sequences of DNA, also called short tandem repeats or STRs) at a higher rate than normal cells. This is believed to be due to a failure of DNA repair mechanisms. MSI profiling in the clinic is typically performed at defined loci using commercially available primers, PCR, and capillary electrophoresis.

Detecting microsatellites from NGS data is challenging because reads that do not fully span an STR carry little information about its length. Furthermore, PCR amplification of a microsatellite introduces noise, since DNA polymerase slippage produces some amplicons with false repeat lengths. We selected the tool lobSTR (Gymrek et al.), since it addresses both of these issues.

In addition to lobSTR, we implemented a tool called MSIsensor (Niu et al.), which compares the per-site length distributions of microsatellites between paired tumor and normal samples. This tool reports the percentage of microsatellite sites with somatic changes (the MSI score).
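As a rough illustration of the per-site comparison described above, the sketch below scores a sample as the percentage of sites whose tumor and normal repeat-length distributions differ. It is not MSIsensor's implementation: total variation distance with an arbitrary cutoff stands in for the statistical test the tool applies, and the function names, the `min_distance` parameter, and the toy read data are all hypothetical.

```python
from collections import Counter


def length_distribution(repeat_lengths):
    """Turn observed repeat lengths (one per read) into relative frequencies."""
    counts = Counter(repeat_lengths)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    lengths = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in lengths)


def msi_score(sites, min_distance=0.3):
    """Percentage of sites whose tumor and normal length distributions differ.

    `sites` maps a site ID to (tumor_lengths, normal_lengths).
    `min_distance` is an illustrative cutoff, not a published value.
    """
    if not sites:
        raise ValueError("at least one site is required")
    unstable = 0
    for tumor_lengths, normal_lengths in sites.values():
        distance = total_variation_distance(
            length_distribution(tumor_lengths),
            length_distribution(normal_lengths),
        )
        if distance >= min_distance:
            unstable += 1
    return 100.0 * unstable / len(sites)


# Hypothetical toy data: repeat lengths observed in reads covering each site.
toy_sites = {
    "site_a": ([10] * 10, [10] * 10),            # identical distributions
    "site_b": ([10] * 5 + [8] * 5, [10] * 10),   # half the tumor reads are contracted
    "site_c": ([12] * 9 + [11], [12] * 10),      # one stray read, likely noise
    "site_d": ([15] * 10, [15] * 8 + [14] * 2),  # small difference
}
score = msi_score(toy_sites)
```

With this toy input only `site_b` exceeds the cutoff, so one site in four is counted as unstable. A real comparison would also need a minimum read depth per site and a correction for testing many sites.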



The workflow was benchmarked by profiling 592 colorectal adenocarcinomas for which MSI status is available on the Genomics Data Commons (GDC) portal.

For MSIsensor, the MSI score was used: samples scoring above the threshold of 3.5 (proposed by Niu et al. in the MSIsensor paper) were classified as MSI, while samples scoring below it were classified as microsatellite stable (MSS). For the lobSTR output VCFs, we calculated the total number of differences from the reference genome, along with the total increase and total decrease in length (due to different numbers of microsatellite core repeats) relative to the reference genome. Based on these features, Support Vector Machine (SVM) predictions were made using leave-one-out training.
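The lobSTR-based procedure above can be sketched as follows, under several assumptions. The signed length differences are taken as already parsed from the VCFs (the parsing itself is not shown), all names and toy values are hypothetical, and a nearest-centroid rule stands in for the SVM so that the example has no dependencies. What the sketch is meant to show is the three summary features and the leave-one-out loop; an SVM from a machine learning library could be passed in as the `predict` callable instead.

```python
def summarize_calls(length_diffs):
    """Three per-sample features from STR length differences vs. the reference.

    `length_diffs` holds one signed number per STR call (0 = matches reference).
    """
    n_diff = sum(1 for d in length_diffs if d != 0)
    total_increase = sum(d for d in length_diffs if d > 0)
    total_decrease = -sum(d for d in length_diffs if d < 0)
    return (n_diff, total_increase, total_decrease)


def centroid(rows):
    """Column-wise mean of a list of equal-length feature tuples."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))


def nearest_centroid_predict(train_x, train_y, x):
    """Stand-in classifier: label of the closest class centroid (squared Euclidean)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        c = centroid([f for f, y in zip(train_x, train_y) if y == label])
        dist = sum((a - b) ** 2 for a, b in zip(x, c))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out(features, labels, predict):
    """Predict each sample from a model fitted on all the other samples."""
    predictions = []
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(predict(train_x, train_y, features[i]))
    return predictions


# Hypothetical toy samples: signed length differences per STR call, plus known status.
toy_samples = [
    ([0, 0, 1, 0], "MSS"),
    ([0, -1, 0, 0], "MSS"),
    ([0, 0, 0, 2], "MSS"),
    ([-3, -2, 4, -1, 0], "MSI"),
    ([-2, -4, 3, -2], "MSI"),
    ([-3, -3, 2, -1], "MSI"),
]
features = [summarize_calls(diffs) for diffs, _ in toy_samples]
labels = [status for _, status in toy_samples]
predictions = leave_one_out(features, labels, nearest_centroid_predict)
```

Because every prediction comes from a model that never saw the held-out sample, comparing `predictions` with `labels` gives an estimate of accuracy that is not inflated by training on the test data.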


Tumor Heterogeneity

A tumor is composed of a diverse collection of cells with different molecular signatures, which lead to different potentials to metastasize or resist treatment. This makes the identification and tracking of subclones over time an important goal in cancer research. While identifying somatic mutations in a tumor has become commonplace, tools that use these mutations to reconstruct and track tumor subclones are still in their infancy.

We built a tumor heterogeneity workflow using the tool SciClone, which was developed for detecting tumor subclones by clustering variant allele frequencies (VAFs) of somatic mutations in copy-number-neutral regions. The results of SciClone are fed directly into two additional tools, ClonEvol and Fishplot, which produce useful visualizations, e.g. phylogenetic trees showing clonal evolution.
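To make the idea of clustering VAFs concrete, here is a minimal sketch under stated assumptions. A plain one-dimensional k-means stands in for SciClone's statistical model, the number of clusters is fixed by hand, and the read counts are invented. It also assumes heterozygous mutations in diploid, copy-number-neutral regions, where a mutation present in a fraction c of cells has an expected VAF of c/2, and it takes the highest cluster as the founding clone when estimating purity.

```python
def vaf(ref_reads, alt_reads):
    """Variant allele frequency: fraction of reads supporting the variant."""
    return alt_reads / (ref_reads + alt_reads)


def assign(values, centers):
    """Index of the nearest center for each value."""
    return [min(range(len(centers)), key=lambda j: abs(v - centers[j])) for v in values]


def kmeans_1d(values, k, iterations=50):
    """Lloyd's algorithm in one dimension with evenly spaced, deterministic starting centers."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iterations):
        labels = assign(values, centers)
        new_centers = []
        for j in range(k):
            members = [v for v, l in zip(values, labels) if l == j]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(values, centers)


def cellular_proportion(cluster_vaf):
    """Fraction of cells carrying a heterozygous mutation in a diploid region (capped at 1)."""
    return min(1.0, 2.0 * cluster_vaf)


# Hypothetical (ref_reads, alt_reads) pairs for six somatic mutations.
toy_counts = [(65, 35), (66, 34), (64, 36), (90, 10), (91, 9), (89, 11)]
vafs = [vaf(ref, alt) for ref, alt in toy_counts]
centers, cluster_labels = kmeans_1d(vafs, k=2)
proportions = [cellular_proportion(c) for c in centers]
purity_estimate = max(proportions)  # assumes the top cluster is the founding clone
```

In this toy case the mutations split into a cluster near VAF 0.35 and one near 0.10, corresponding to roughly 70% and 20% of cells. Real data would require choosing the number of clusters from the data and modeling read-count noise, which is what a dedicated tool provides.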

The SciClone-based tumor heterogeneity workflow was validated against data provided by the ICGC-TCGA DREAM Somatic Mutation Calling – Tumor Heterogeneity Challenge (SMC-Het), an international effort to improve standard methods for subclonal reconstruction. Two different samples were made available (Tumor1 and Tumor2), along with their respective truth data.

The results of the SBG workflow are provided in the table below:


Sample    Purity           Number of clusters   Cluster info
          Truth   SBG      Truth   SBG          Truth             SBG
Tumor 1   70%     64.9%    4       3            1: 713 (70%)      1: 729 (65%)
                                                2: 1,483 (45%)    2: 1,923 (43%)
                                                3: 1 (13%)        3: 830 (15%)
                                                4: 1,285 (10%)
Tumor 2   90%     88.3%    2       2            1: 186 (90%)      1: 186 (88%)
                                                2: 54 (36%)       2: 54 (35%)

Table 1: Comparison of SBG results with DREAM challenge truth sets. Results include purity (or “tumor cellularity”) (DREAM sub-challenge 1A), the number of clusters of mutations (sub-challenge 1B), and mutation cluster sizes and cellular proportions (sub-challenge 1C). Cluster info is as follows: cluster ID, number of mutations in cluster, and in parentheses the estimated proportion of cells in the sample containing these mutations. Note that proportions do not necessarily add up to 100%, because smaller clones may be nested within bigger clones.

Results Integration

BMS compared MSI results obtained using wet lab methods (a combination of PCR and immunohistochemistry) against the MSI score derived from whole-exome sequencing (WES). Out of 38 samples tested, 3 false negatives were observed (samples that were not detected as MSI-high based on the WES data, but are considered MSI-high based on the clinical lab data).

The same samples were evaluated for tumor purity. The 3 samples that appeared as false negatives contain less than 20% tumor cells according to the WES tumor purity algorithm. A number of published studies and genomic centers recommend a cutoff of 20% tumor content, below which variant calling is not considered reliable. In the case we present, applying the 20% cutoff eliminates all false negatives, but it also removes 3 samples whose MSI status would have been called correctly.


As a result of this work, we have established best practices for calling MSI status from WES data: use SciClone to estimate tumor purity from the WES data and remove samples with tumor content below 20%, then run MSIsensor and call MSI status based on the recommended threshold.
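The decision rule in these best practices can be written down in a few lines. The sketch below is illustrative only: the 20% purity cutoff and the 3.5 score threshold come from the text, while the function name, the sample records, and the choice to treat a score of exactly 3.5 as MSS are assumptions made for the example.

```python
PURITY_CUTOFF = 0.20       # samples below this tumor content are removed (from the text)
MSI_SCORE_THRESHOLD = 3.5  # MSIsensor threshold cited in the text


def call_msi_status(purity, msi_score):
    """Apply the purity filter first, then the MSI score threshold.

    Returns "excluded", "MSI", or "MSS". A score of exactly 3.5 is treated
    as MSS here, which is an assumption; the text only says above and below.
    """
    if purity < PURITY_CUTOFF:
        return "excluded"
    return "MSI" if msi_score > MSI_SCORE_THRESHOLD else "MSS"


# Hypothetical samples: (name, estimated purity, MSI score, clinical lab status).
toy_samples = [
    ("A", 0.65, 12.4, "MSI"),
    ("B", 0.15, 1.2, "MSI"),  # low purity: would be a false negative without the filter
    ("C", 0.40, 0.8, "MSS"),
    ("D", 0.10, 0.5, "MSS"),  # would be called correctly, but the filter removes it
]
calls = {name: call_msi_status(purity, score) for name, purity, score, _ in toy_samples}
false_negatives = [
    name for name, _, _, truth in toy_samples
    if truth == "MSI" and calls[name] == "MSS"
]
```

Sample B shows why the filter removes false negatives, and sample D shows its cost: a correct call is discarded along with the unreliable ones.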

Currently BMS is using these guidelines to deliver a composite biomarker consisting of MSI status given a certain tumor purity for a number of clinical and preclinical studies across multiple indications.


Gymrek M, Golan D, Rosset S, Erlich Y. lobSTR: A short tandem repeat profiler for personal genomes. Genome Research. 2012;22(6):1154-1162. doi:10.1101/gr.135780.111.

Niu B, Ye K, Zhang Q, et al. MSIsensor: microsatellite instability detection using paired tumor-normal sequence data. Bioinformatics. 2014;30(7):1015-1016. doi:10.1093/bioinformatics/btt755.