One-liner to get distribution of the alternative allele numbers in a VCF file

The VCF format allows for multiple alternative alleles in a single variant record. The alternative alleles are specified as a comma-separated list of their bases, so one may easily estimate the distribution of the alternative allele numbers in a command line using the following one-line script:

bcftools query -f '%ALT\n' input.vcf.gz | \
    tr -d 'A-Za-z0-9<>:' | sort | uniq -c

Here we use bcftools query from the bcftools package for rapid extraction of alternative alleles from a VCF file. We need to know only the number of commas in each line, so we remove all other symbols using tr. Finally, we count lines containing the particular numbers of commas.

Example: 1000 Genomes variants on chromosome 22

Let us demonstrate the script using the VCF file of 1000 Genomes variants on chromosome 22. The file contains 1,103,547 variants, including 1,060,388 SNPs and 43,230 indels.


bcftools query -f '%ALT\n' \
    ALL.chr22.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz | \
    tr -d 'A-Za-z0-9<>:' | sort | uniq -c

The script produced the following output.

1097199
6073 ,
 224 ,,
  38 ,,,
   9 ,,,,
   3 ,,,,,
   1 ,,,,,,,

According to the output, most of the variants in the file are biallelic (i.e., having a reference allele and a single alternative allele) and less than 1% of them are multiallelic. Most of the multiallelic variants are triallelic (i.e., having a reference allele and two alternative alleles) and only 275 multiallelic variants have more than two alternative alleles.

bcftools plugin to convert a VCF file to the BED format

bcftools provides a convenient way to extend its functionality using plugins. Technically,  the bcftools plugins are dynamic libraries that are executed when a user launches the bcftools plugin tool.

Here I present a simple plugin named vcf2bed that converts a VCF file to the BED format.

Continue reading

Obtaining probability of all variant calls being correct

The VCF format specifies quality scores (QUAL) for each variable position (variant) in a genome. The QUAL value is the Phred quality score for the assertion that alternative bases of a variant are correct, that is, \mathrm{QUAL} = -10 \log_{10} p, where p is the probability that the alternative base calls are wrong. Using the QUAL scores, one may easily calculate the probability that all variant calls in a VCF file are correct.

Here we give an equation for that probability, a Python script that implements it and an example of its usage.

Continue reading

Combining a large number of VCF files

The bcftools and vcftools packages provide routines for merging or concatenating multiple VCF files. However, specifying a large number of input VCF files may terminate their processing because an operating system will not be able to keep so many files opened. This problem can be overcome by iterative combining of files: first, pairs of the original VCF files are processed, then pairs of the obtained files are processed and so on until we get the resulting VCF file.

Here we describe an iterative scheme for merging or concatenating VCF files using bcftools and GNU parallel and present a Python script that implements it.

Continue reading

Sample-based format for predicted variant effects

VCF is a variant-based format, i.e., each its record (line) represents a single genomic variant: its location, reference and alternative alleles, variant calling characteristics and sample genotypes. However, sample-based datasets are more convenient for some applications, especially if each variant allele has its special meaning. For example, one may predict variant effects with snpEff or Ensembl VEP and consider only the samples having specific effects for both their alleles.

Here we introduce the BED-based format for sample-centered storing of predicted variant effects. Before describing the format, we give a sample of records in it.

Continue reading

BED-based format for genotype counts

Being a comprehensive format for storing variant calling data, VCF is superfluous for some kinds of analysis. Although it is a plain-text tabular format, it may take significant time and memory to load a large VCF file into an R or Python session. For that reason, it is usually effective to preprocess a VCF file using a stand-alone tool in order to extract the information required for further analysis.

An example of such a tool is the vcftools package that implements a number of routines for processing VCF files. In particular, vcftools can be used to obtain allele frequencies from a VCF file:

CHROM POS   N_ALLELES N_CHR {ALLELE:COUNT}
7     45524 2         414   C:414 T:0
7     45569 3         414   G:413 A:1 T:0

Note that the number of columns differs because the second variant has two alternative alleles while the first variant has only one. It may cause problems during further processing of the data, e.g., the R function read.table will fail to read such a file. Also it would be more informative to store genotype counts instead of allele counts. Here we describe the normalized table format for genotype counts and the tool to obtain it from a VCF file.

Continue reading