The VCF format specifies quality scores (QUAL) for each variable position (variant) in a genome. The QUAL value is the Phred quality score for the assertion that alternative bases of a variant are correct, that is, , where is the probability that the alternative base calls are wrong. Using the QUAL scores, one may easily calculate the probability that all variant calls in a VCF file are correct.
Here we give an equation for that probability, a Python script that implements it and an example of its usage.
Let us have variants which QUAL scores are . Then the probability that alternative base calls of the variants are correct is given by the following formula: .
The probability value specified above can be calculated using the following Python script.
pip install matplotlib numpy cyvcf2
The script produces a summary on the specified VCF file. Also it provides two additional options:
- -t or –table produces a table of three columns: the QUAL value threshold (increases from the least to the greatest QUAL value in the file), the percentage of the variants which QUAL values are greater than the threshold (tends to zero) and the probability that all that variants are correct (tends to one);
- -p or –plot produces the plot showing the variant percentage and the probability of all of them being correct as a function of the QUAL threshold.
Example: DoGSD wolf sample
We used a sample from the Dog Genome SNP Database to demonstrate the Python script. The VCF file CHW.g.vcf.tar.gz of the wolf genomic variants was downloaded from the database; non-variable sites were removed from it using the grep command in the following way:
grep -v '\s<NON_REF>\s' CHW.g.vcf > CHW.vcf
The obtained VCF file was passed to the Python script:
./vcf_prob_estimation.py -t CHW_qual.txt -p CHW_qual.png CHW.vcf
The script produced the following summary
# variant sites: 9261448 Max QUAL value: 97725 Min QUAL value: 0 Max correct site probability: 1.00e+00 Min correct site probability: 0.00e+00 All sites being correct probability: 0.00e+00
The output plot CHW_qual.png, that is shown below, indicates that the probability of all sites being correct is close to one starting from the QUAL threshold of 70. For more precise estimation, we should use the output table file.
The output table file CHW_qual.txt, which fragment is given below, shows that the QUAL threshold of 75 corresponds to the 99% probability of all such variant calls to be correct and the percentage of such variants is 90.1%.
73 90.27 9.984220e-01 74 90.17 9.988120e-01 75 90.10 9.990282e-01 76 90.03 9.992038e-01 77 89.93 9.993972e-01