Smart Variant Filtering

The variant filtering process consists of selecting highly confident variants and removing the ones that were falsely called. Secondary genomic DNA analysis is mainly focused on alignment and variant calling, on the assumption that the accuracy of these two steps has the greatest influence on overall quality. The variant filtering step has mostly been left out of deeper testing, even though it can significantly boost the precision of variant calls. After developing a graph genome alignment tool and a reassembly variant caller, Seven Bridges Genomics is now launching the Smart Variant Filtering tool for filtering germline variants.

Smart Variant Filtering (SVF) uses machine learning algorithms trained on features from the existing Genome In A Bottle (GIAB) variant-called samples (HG001-HG005) to perform variant filtering (classification).

SVF is available on GitHub and as a Public project on the Seven Bridges Platform. It is completely open source and free for any party to use (BSD 3-Clause license). The comparison results obtained during deep, three-stage testing demonstrate that it outperforms the solutions currently used in most secondary DNA analyses. Smart Variant Filtering increases the precision of called SNVs (removes false positives) by up to 0.2% while keeping the overall f-score 0.12-0.27% higher than existing solutions. Indel precision is increased by up to 7.8%, while the f-score increase ranges from 0.1% to 3.2%.

Under the hood

The entire process of obtaining the SVF algorithm consists of the following stages:

  1. Preparing the features
  2. Training a model
  3. Applying the filtering

1. Preparing the features

The purpose of this stage is to create a dataset of features that the machine learning algorithm will use to perform classification. Reads from GIAB HG001-HG005 samples with available truth sets (gold-standard variants) are processed with variant calling workflows, generating raw Variant Call Format (VCF) files.

Any DNA analysis bioinformatics workflow, such as GATK 4.0 Whole Genome Sequencing, GATK 4.0 Whole Exome Sequencing or the Graph Genome workflow, can be used to create the raw VCF files.

Raw and truth-set VCFs are processed with the VCFEval tool on high-confidence regions, creating an annotated VCF with True positive (TP) and False positive (FP) variants marked.

Features from the raw VCF and the correctness flag from the annotated VCF (TP or FP) are written into a table (a Tab-Separated Values file) suitable for model training. As the initial set of SVF features, we selected the ones recommended in the Broad Institute's Best Practices hard variant filtering for the GATK whole exome workflow: QD, MQ, FS, MQRankSum, ReadPosRankSum and SOR.

Another optional feature added to the table is whether the variant exists in a known-variants database (dbSNP). The Smart Variant Filtering – Prepare Features workflow that creates the features table is shown in Figure 1. Table 1 shows an example of the features table used for training a model.
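For illustration only, a minimal sketch of this feature-extraction step in plain Python is given below (it is not the actual Prepare Features workflow). It pulls the GATK INFO annotations listed above out of a VCF and writes one TSV row per variant; the status_by_site lookup of per-site TP/FP status and the treatment of multi-allelic records are placeholder assumptions.

```python
# Minimal sketch (not the actual Prepare Features workflow): extract the GATK
# INFO annotations used by SVF from a VCF and write one TSV row per variant.
# status_by_site is a placeholder lookup of the TP/FP status per site, taken
# from the VCFEval-annotated VCF; multi-allelic records are handled naively.
import csv

FEATURES = ["QD", "MQ", "FS", "MQRankSum", "ReadPosRankSum", "SOR"]

def parse_info(info_str):
    """Turn 'QD=16.66;MQ=60;...' into a dict of strings."""
    pairs = (item.split("=", 1) for item in info_str.split(";") if "=" in item)
    return {key: value for key, value in pairs}

def write_feature_table(vcf_path, status_by_site, tsv_path):
    with open(vcf_path) as vcf, open(tsv_path, "w", newline="") as tsv:
        writer = csv.writer(tsv, delimiter="\t")
        writer.writerow(["CHR", "POS"] + FEATURES + ["TYPE", "VALID"])
        for line in vcf:
            if line.startswith("#"):          # skip VCF header lines
                continue
            chrom, pos, _id, ref, alt, _qual, _filt, info = line.rstrip("\n").split("\t")[:8]
            fields = parse_info(info)
            var_type = "SNV" if len(ref) == 1 and len(alt) == 1 else "INDEL"
            row = [chrom, pos]
            row += [fields.get(f, "NA") for f in FEATURES]
            row += [var_type, status_by_site.get((chrom, pos), "NA")]
            writer.writerow(row)
```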

Figure 1: The workflow for extracting features from raw and annotated VCFs and creating a table.

 

| CHR | POS      | QD    | MQ | FS    | MQRankSum | ReadPosRankSum | SOR   | TYPE  | VALID |
|-----|----------|-------|----|-------|-----------|----------------|-------|-------|-------|
| 1   | 3544235  | 16.66 | 60 | 3.825 | 0         | 0.551          | 0.388 | INDEL | TP    |
| 1   | 3743433  | 15.62 | 60 | 2.427 | 0         | -0.817         | 1.205 | INDEL | TP    |
| 1   | 6239937  | 31.67 | 60 | 0     | NA        | NA             | 1.609 | INDEL | TP    |
| 1   | 6453520  | 29.53 | 60 | 0     | NA        | NA             | 2.985 | INDEL | TP    |
| 1   | 6530965  | 33.63 | 60 | 0     | NA        | NA             | 1.685 | INDEL | TP    |
| 1   | 6610743  | 7.62  | 60 | 0     | 0         | 0.253          | 0.132 | INDEL | FP    |
| 1   | 11782884 | 5.45  | 60 | 8.113 | 0         | -0.489         | 0.033 | INDEL | FP    |

Table 1: An example of the features table.

2. Training a model

The process of training a machine learning model (classifier) is performed using the Python scikit-learn library. The feature tables from all samples with a truth-set VCF available (HG001-HG005) are first merged for learning and then split into SNVs and indels.

Training of the machine learning model is performed with a highly configurable Smart Variant Filtering tool by maximizing the separation between TP and FP samples (Figure 2).

This tool wraps two Python scripts (available on GitHub) – one for training and one for applying the filter. The tool also offers the possibility to switch between the two operating modes.

In the training mode, the tables with features are provided as input and the resulting filtering model is written to an output file that can be used when applying the filter in the next stage.
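As a rough illustration of this mode, the snippet below trains an SNV classifier on a merged feature table with scikit-learn and saves it for later use. The MLP settings mirror the best SNV configuration reported later in Table 3; the file names and the handling of missing values are placeholder assumptions, not the exact logic of the SVF script.

```python
# Training sketch, assuming a merged feature table in the Table 1 layout.
import pandas as pd
import joblib
from sklearn.neural_network import MLPClassifier

table = pd.read_csv("features_all_samples.tsv", sep="\t")
snvs = table[table["TYPE"] == "SNV"]

feature_cols = ["QD", "MQ", "FS", "MQRankSum", "ReadPosRankSum", "SOR"]
X = snvs[feature_cols].apply(pd.to_numeric, errors="coerce").fillna(0.0)
y = (snvs["VALID"] == "TP").astype(int)   # 1 = true positive, 0 = false positive

clf = MLPClassifier(hidden_layer_sizes=(500,), activation="tanh",
                    solver="adam", max_iter=500, random_state=0)
clf.fit(X, y)

joblib.dump(clf, "svf_snv_model.joblib")  # reused below in filtering mode
```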

Figure 2: Smart Variant Filtering tool with inputs and outputs.

3. Applying the filtering

When in filtering mode, the Smart Variant Filtering tool receives a VCF file together with trained models for SNVs and indels as inputs and creates a filtered VCF (Figure 3). SVF reads the desired features from the INFO field of every variant in the VCF file and, using the provided models, decides whether it is a TP or an FP. If a variant is classified as FP, SVF marks it in the FILTER field of the output, filtered VCF with the label SVF_SNV or SVF_INDEL.

If some filters already exist in the input VCF, it is possible to discard all of them during filtering (discard the existing filters). By default, SVF keeps them or overwrites them with the SVF label if the classifier makes such a decision. Another option is to force keeping the variants found in the input database (dbSNP).
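A conceptual sketch of this filtering pass is shown below (it is not the actual SVF script; the model file names and missing-value handling are assumptions): each variant's features are read from the INFO field, the matching model makes a TP/FP call, and false positives are tagged in the FILTER column.

```python
# Conceptual sketch of the filtering decision for a single VCF record.
import joblib

FEATURES = ["QD", "MQ", "FS", "MQRankSum", "ReadPosRankSum", "SOR"]
snv_model = joblib.load("svf_snv_model.joblib")      # placeholder file names
indel_model = joblib.load("svf_indel_model.joblib")

def filter_record(fields):
    """fields: the first 8 mandatory VCF columns of one record, as a list."""
    info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
    x = [[float(info[f]) if info.get(f, "NA") != "NA" else 0.0 for f in FEATURES]]
    is_indel = len(fields[3]) != 1 or len(fields[4]) != 1
    model, label = (indel_model, "SVF_INDEL") if is_indel else (snv_model, "SVF_SNV")
    if model.predict(x)[0] == 0:           # classified as a false positive
        fields[6] = label                  # mark it in the FILTER column
    return fields
```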

Figure 3: Smart Variant Filtering tool with inputs and outputs.

In order to speed up the filtering process, the input VCF can be divided into smaller VCFs by chromosome contigs and a separate processing job can be performed for each part of the VCF.

Finally, all VCF parts are merged creating a filtered VCF. These operations are performed inside the Apply Smart Variant Filtering Parallel workflow.
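A rough way to emulate this scatter-and-merge locally is sketched below, assuming bcftools and a bgzipped, indexed input VCF; run_svf stands in for the filtering step sketched earlier, and the file names are placeholders. On the Platform, the workflow handles the splitting and merging itself.

```python
# Split the input VCF by chromosome, filter each part in parallel, then merge.
import subprocess
from multiprocessing import Pool

CONTIGS = [str(c) for c in range(1, 23)] + ["X", "Y"]

def filter_contig(contig):
    part = f"part.{contig}.vcf.gz"
    filtered = f"filtered.{contig}.vcf.gz"
    # extract one chromosome from the (indexed) input VCF
    subprocess.run(["bcftools", "view", "-r", contig, "-Oz", "-o", part,
                    "input.vcf.gz"], check=True)
    run_svf(part, filtered)   # placeholder for the filtering step sketched above
    return filtered

if __name__ == "__main__":
    with Pool(processes=8) as pool:
        parts = pool.map(filter_contig, CONTIGS)
    # merge the filtered parts back into a single VCF
    subprocess.run(["bcftools", "concat", "-Oz", "-o", "filtered.vcf.gz"] + parts,
                   check=True)
```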

Another operation performed by this workflow is adding information about each variant's presence in known databases such as dbSNP.

Testing the quality of Smart Variant Filtering

The testing of SVF has been done in three separate phases. The first phase was a search for the best classifier and its configuration.

The second phase tested the quality (precision, recall, f-score) using a leave-one-out sample strategy. The third phase was an independent validation. Each phase is described in the following sections.

Testing phase 1 – Classifier and parameters selection

In order to find the best possible model and configuration, several available classification algorithms have been tested (AdaBoost, K-Nearest Neighbors, Random Forest, Support Vector Machine, Quadratic Discriminant Analysis, Multilayer Perceptron) with a total of 372 different parameter configurations (hyper-parameter space testing), as listed in Table 2.

 

| Classifier | Parameter set |
|---|---|
| AdaBoost | n_estimators = [150, 200, 300, 400, 500, 1000]; learning_rate = [0.5, 1.0, 1.2, 1.5, 1.7]; algorithm = ['SAMME', 'SAMME.R'] |
| K-Nearest Neighbors | neighbors = [7, 9, 10, 13, 14, 15, 17, 19, 22, 25, 29, 30, 31, 33, 34, 35, 38, 39, 41, 43, 45, 49, 53]; algorithms = ['ball_tree', 'kd_tree']; p_distance = [1, 2] |
| SVM | C = [0.2, 0.6, 1.0, 1.5, 2.0, 4.0, 10.0, 100.0, 1000.0]; kernels = ['linear', 'rbf'] |
| Random Forest | n_estimators = [5, 10, 20, 50, 100, 150, 200, 300, 500, 1000]; criterion = ['gini', 'entropy'] |
| Quadratic Discriminant Analysis | tol = [0.00001, 0.00005, 0.0001, 0.0005, 0.001] |
| Multi Layer Perceptron | hidden_layer_sizes = [50, 100, 150, 250, 500, (10,10), (20,25), (30,50), (10,10,10), (20,30,40), (30,50,70)]; activation = ['identity', 'logistic', 'tanh', 'relu']; solver = ['lbfgs', 'sgd', 'adam'] |

Table 2: List of classifiers and corresponding parameters used for testing SVF performance. All possible parameter combinations within the same classifier were tested for both SNVs and indels.

 

An example of a configuration would be: AdaBoost Classifier{n_estimators=50, learning_rate=1, algorithm=’SAMME’}.

For every configuration from Table 2, a cross-validation with 10 folds was conducted across a total of 123,000 variants. The f-score was used as the criterion for a configuration's quality, and the mean and standard deviation were calculated over the 10 f-scores obtained for every parameter configuration. Due to the intense computational demands, the entire testing process was done on the Seven Bridges Platform with a downsampled set of whole genome samples.
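For illustration, a sweep like this could be run with scikit-learn's GridSearchCV; the sketch below shows it for the AdaBoost grid from Table 2, with X and y being the feature matrix and TP/FP labels built in stage 1. It is a sketch of the procedure, not the exact script used.

```python
# 10-fold cross-validated hyper-parameter sweep for one classifier from Table 2,
# selecting on the f-score and reporting its mean and standard deviation.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [150, 200, 300, 400, 500, 1000],
    "learning_rate": [0.5, 1.0, 1.2, 1.5, 1.7],
    "algorithm": ["SAMME", "SAMME.R"],
}

search = GridSearchCV(AdaBoostClassifier(), param_grid,
                      scoring="f1", cv=10, n_jobs=-1)
search.fit(X, y)   # X, y: features and TP/FP labels from stage 1

best = search.best_index_
print(search.best_params_,
      search.cv_results_["mean_test_score"][best],
      search.cv_results_["std_test_score"][best])
```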

The differences in mean f-score and standard deviation between the top parameter configurations are minor. The best configurations obtained from the experiment are shown in Table 3.

 

| Type   | Algorithm | Parameters       | Stddev f-score | Mean f-score |
|--------|-----------|------------------|----------------|--------------|
| Indels | MLP       | 250-logistic-sgd | 0.0018         | 0.96466      |
| SNVs   | MLP       | 500-tanh-adam    | 0.0001         | 0.99856      |

Table 3: Model configurations with the highest mean f-score.

 

Testing phase 2 – Leave-one-out testing

After obtaining the best configuration, the second testing phase of Smart Variant Filtering can be performed.

To fully automate the entire flow (training, filtering, testing), we have created the Train, filter and test workflow (Figure 4).

The train and filter parts are described in the previous sections, while the test part consists of the VCF Benchmark workflow with the VCFEval tool, which calculates precision, recall and f-score values by comparing the filtered VCF to the truth-set VCF file.

Figure 4: Smart Variant Filtering – Train, filter and test workflow.

The second phase of testing is performed using tables with features from all samples and library preparations except the one on which the filtering is performed. This process is known as leave-one-out testing (e.g. train on HG001-HG004 and test on HG005).

A total of 5 whole exome sequencing (WES) samples (254,000 variants) and 8 different library preparations for whole genome sequencing (WGS, 25 million variants) are used for this purpose.

Separate training and testing are performed for SNVs and indels using the following 6 features: QD, MQ, FS, MQRankSum, ReadPosRankSum and SOR.
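In rough pseudocode terms, the leave-one-out loop looks like the sketch below; train_model, apply_filter and benchmark are placeholders standing in for the training, filtering and VCFEval benchmarking stages described above, and the file naming is assumed.

```python
# Leave-one-out protocol: hold out one sample, train on the rest, filter the
# held-out sample's raw VCF and benchmark the result against its truth set.
import pandas as pd

samples = ["HG001", "HG002", "HG003", "HG004", "HG005"]

for held_out in samples:
    train_tables = [pd.read_csv(f"{s}.features.tsv", sep="\t")
                    for s in samples if s != held_out]
    model = train_model(pd.concat(train_tables))            # stage 2 (training)
    apply_filter(f"{held_out}.raw.vcf", model,               # stage 3 (filtering)
                 out=f"{held_out}.filtered.vcf")
    benchmark(f"{held_out}.filtered.vcf",                    # VCFEval precision/recall/f-score
              truth=f"{held_out}.truth.vcf")
```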

Precision/recall diagram for SNVs with whole genome samples is presented in Figure 5.

Figure 5: Precision and recall scores for 8 whole genome samples, marked with different markers. The color denotes the filtering method: yellow – GATK4 Variant Quality Score Recalibration (VQSR), gray – raw, unfiltered VCF, dark blue – Smart Variant Filtering (SVF). The scores for these three methods for the same sample are connected by identically colored dashed lines, creating a triangle for easier tracking inside the diagram. Both axes of the diagram are on the same scale. The results for GATK4 VQSR were obtained using the Whole Genome Sequencing GATK4 public workflow.

An ideal variant filter would increase precision to 100% without any decrease in recall – a horizontal shift to the right from the gray dots in Figure 5. In the real world, variant filtering should provide a larger increase in precision than the cost in recall. This is well captured by a single metric – the f-score. VQSR, the Broad Best Practices variant filter, does increase precision but decreases recall by 0.1-0.5% at the same time. In contrast, SVF achieves greater or equal increases in precision with minimal decreases in recall, as expected for the filtering process.

A general comparison of call quality can be seen through an f-score comparison of these methods (f-score = 2 * (precision * recall) / (precision + recall)), as shown in Figure 6.
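As a quick illustration of why this works in the filter's favor (the numbers below are made up, not taken from the figures): a 0.2 percentage point gain in precision at the cost of a 0.05 percentage point drop in recall still raises the combined f-score.

```python
# Illustrative arithmetic only (values are not taken from the figures).
def f_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f_score(0.990, 0.995))    # raw calls        -> ~0.9925
print(f_score(0.992, 0.9945))   # after filtering  -> ~0.9932
```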

Figure 6: F-scores for 8 whole genome samples marked with different methods. The color of the bar denotes the filtering method: yellow – GATK4 Variant Quality Score Recalibration (VQSR), gray – raw, unfiltered VCF, dark blue – Smart Variant Filtering (SVF).

Precision/recall diagrams for indels with whole genome samples are presented in Figure 7.

Figure 7: Precision and recall scores for 8 whole genome samples, marked with different markers. The color denotes the filtering method: yellow – GATK4 Variant Quality Score Recalibration (VQSR), gray – raw, unfiltered VCF, dark blue – Smart Variant Filtering (SVF). The scores for these three methods for the same sample are connected by identically colored dashed lines, creating a triangle for easier tracking inside the diagram. Both axes of the diagram are on the same scale. The samples are split across three diagrams for better visibility. The results for GATK4 VQSR were obtained using the Whole Genome Sequencing GATK4 public workflow.

As with SNVs, SVF for indels shows a significant increase in precision and a minor decrease in recall compared to the raw variants. The VQSR method also decreases recall and gives lower precision than SVF on these samples (Figure 7). A general comparison of call quality can be seen through an f-score comparison of these methods, as shown in Figure 8.

Figure 8: F-scores for 8 whole genome samples (indels) marked with different methods. The color of the bar denotes the filtering method: yellow – GATK4 Variant Quality Score Recalibration (VQSR), gray – raw, unfiltered VCF, dark blue – Smart Variant Filtering (SVF).

Precision/recall diagrams for SNVs and indels with whole exome samples are presented in Figure 9 and Figure 11, while f-scores are given in Figure 10 and Figure 12. Training was performed using the following 7 features: QD, MQ, FS, MQRankSum, ReadPosRankSum, SOR, dbSNPBuildID.

Figure 9: SNV precision and recall scores for 5 whole exome samples, marked with different markers. The color denotes the filtering method: yellow – GATK4 Hard Filter, gray – raw, unfiltered VCF, dark blue – Smart Variant Filtering (SVF). The scores for these three methods for the same sample are connected by identically colored dashed lines, creating a triangle for easier tracking inside the diagram. Both axes of the diagram are on the same scale. The results for GATK4 Hard Filtering were obtained using the Whole Exome Sequencing GATK4 public workflow.

Figure 11: Indel precision and recall scores for 5 whole exome samples, marked with different markers. The color denotes the filtering method: yellow – GATK4 Hard Filter, gray – raw, unfiltered VCF, dark blue – Smart Variant Filtering (SVF). The scores for these three methods for the same sample are connected by identically colored dashed lines, creating a triangle for easier tracking inside the diagram. Both axes of the diagram are on the same scale. The samples are split across two diagrams for better visibility. The results for GATK4 Hard Filtering were obtained using the Whole Exome Sequencing GATK4 public workflow.
Figure 12: Indel f-scores for 5 whole exome samples marked with different methods. The color of the bar denotes the filtering method: yellow – GATK4 Hard Filter, gray – raw, unfiltered VCF, dark blue – Smart Variant Filtering (SVF).

Testing phase 3 – independent validation

The third phase of testing is an independent validation on samples where no portion of the data was used to train the SVF filter. This creates a larger testing set and evaluates SVF generalization better than using a large portion of a sample for training and testing performance on a previously unused portion of the same sample.

Two whole genome library preparations of HG001 from one of the past FDA challenges were selected (Garvan and TSnano), as well as one new library preparation of HG001, the 50x WES NA12878-NGv3-LAB1360, for whole exome analysis.

Figures 13-16 show the f-scores for SNVs and indels, WGS and WES, obtained from the new samples using the filtering models from the leave-one-out testing. The similar trend in f-score across all previously created filtering models demonstrates that SVF generalizes across different training sets. Figures 13-16 also confirm robust and stable performance without any bias.

Figure 13: F-score for SNVs of two new WGS samples (Garvan and TSNano) obtained using filtering models from leave-one-out testing.
Figure 14: F-score for indels of two new WGS samples (Garvan and TSNano) obtained using filtering models from leave-one-out testing.
Figure 15: F-score for SNVs of new WES sample (HG001-50x) obtained using filtering models from leave-one-out testing.
Figure 16: F-score for indels of new WES sample (HG001-50x) obtained using filtering models from leave-one-out testing.

How to use Smart Variant Filtering?

The Smart Variant Filtering tool is available within Public projects in any Seven Bridges environment. After making a copy of the project, a user can utilize SVF in several different scenarios:

  1. The Smart Variant Filtering tool can be run in filtering mode by providing it with pre-trained models for SNVs and indels (available inside the project) and a VCF file to be filtered.
  2. For larger VCF files, it is recommended to use the Apply Smart Variant Filtering Scatter workflow which performs filtering by parallelizing the process per chromosome.
  3. SVF can be used inside some of the existing workflows on the Seven Bridges Platform (Graph genome or GATK4 Whole Genome/Exome Sequencing) by adding it and/or replacing the existing filtering tools.
  4. Training a model for filtering can be performed by running the Smart Variant Filtering tool in training mode and providing it with tables containing features which are available inside the project.
  5. The entire process of training a model, applying a filter and benchmarking of the obtained results can be done by running only one integral Smart Variant Filtering – Train, filter and test workflow.