Obtaining probability of all variant calls being correct

The VCF format specifies quality scores (QUAL) for each variable position (variant) in a genome. The QUAL value is the Phred quality score for the assertion that alternative bases of a variant are correct, that is, \mathrm{QUAL} = -10 \log_{10} p, where p is the probability that the alternative base calls are wrong. Using the QUAL scores, one may easily calculate the probability that all variant calls in a VCF file are correct.

Here we give an equation for that probability, a Python script that implements it and an example of its usage.

Equation

Let us have n variants which QUAL scores are q_1, \dots, q_n. Then the probability P that alternative base calls of the variants are correct is given by the following formula: P = \prod_{i=1}^n \left( 1 - 10^{-q_i/10} \right).

Python script

The probability value P specified above can be calculated using the following Python script.

Note that the script requires the following Python packages: matplotlib, numpy and cyvcf2. One may conveniently install them via the pip tool:


pip install matplotlib numpy cyvcf2

The script produces a summary on the specified VCF file. Also it provides two additional options:

  • -t or –table produces a table of three columns: the QUAL value threshold (increases from the least to the greatest QUAL value in the file), the percentage of the variants which QUAL values are greater than the threshold (tends to zero) and the probability that all that variants are correct (tends to one);
  • -p or –plot produces the plot showing the variant percentage and the probability of all of them being correct as a function of the QUAL threshold.

Example: DoGSD wolf sample

We used a sample from the Dog Genome SNP Database to demonstrate the Python script. The VCF file CHW.g.vcf.tar.gz of the wolf genomic variants was downloaded from the database; non-variable sites were removed from it using the grep command in the following way:


grep -v '\s<NON_REF>\s' CHW.g.vcf > CHW.vcf

The obtained VCF file was passed to the Python script:


./vcf_prob_estimation.py -t CHW_qual.txt -p CHW_qual.png CHW.vcf

The script produced the following summary

# variant sites: 9261448
Max QUAL value: 97725
Min QUAL value: 0
Max correct site probability: 1.00e+00
Min correct site probability: 0.00e+00
All sites being correct probability: 0.00e+00

The output plot CHW_qual.png, that is shown below, indicates that the probability of all sites being correct is close to one starting from the QUAL threshold of 70. For more precise estimation, we should use the output table file.

wolf-variant-sites-number-probability-qual

The number of the wolf variant sites and the probability of them being correct as a function of the QUAL threshold.

The output table file CHW_qual.txt, which fragment is given below, shows that the QUAL threshold of 75 corresponds to the 99% probability of all such variant calls to be correct and the percentage of such variants is 90.1%.


73    90.27    9.984220e-01
74    90.17    9.988120e-01
75    90.10    9.990282e-01
76    90.03    9.992038e-01
77    89.93    9.993972e-01

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s