One-liner to get the distribution of alternative allele counts in a VCF file

The VCF format allows multiple alternative alleles in a single variant record. They are listed in the ALT column as a comma-separated list, so the number of alternative alleles in a record equals the number of commas plus one. The distribution of alternative allele counts can therefore be estimated from the command line with the following one-liner:

bcftools query -f '%ALT\n' input.vcf.gz | \
    tr -d 'A-Za-z0-9<>:' | sort | uniq -c

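Here we use bcftools query from the bcftools package to extract the ALT column from the VCF file quickly. Only the number of commas in each line matters, so tr deletes letters, digits, and the characters <, >, and :, which occur in symbolic alleles. Other characters that the ALT column may contain, such as * or the brackets of breakend notation, are not in this set and would be left in place. Finally, sort and uniq -c count how many lines contain each number of commas: an empty line corresponds to a single alternative allele, one comma to two alternative alleles, and so on.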
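The same idea can be sketched in a slightly more defensive form that keeps only commas and newlines instead of listing the characters to delete, and that prints allele counts rather than runs of commas. The function name count_alt_alleles and the sample ALT values below are made up for illustration; they are not part of the original script.

```shell
# count_alt_alleles: read one ALT field per line on stdin and print
# "<number of ALT alleles> <number of records>" pairs, sorted by allele count.
# (Hypothetical helper name, used here for illustration only.)
count_alt_alleles() {
    # tr -cd keeps only commas and newlines, whatever else ALT contains.
    tr -cd ',\n' |
        awk '{ n[length($0) + 1]++ } END { for (k in n) print k, n[k] }' |
        sort -n
}

# Typical use (requires bcftools and a real input.vcf.gz):
#   bcftools query -f '%ALT\n' input.vcf.gz | count_alt_alleles

# Self-contained demonstration on made-up ALT values.
printf '%s\n' 'A' 'C,T' '<CN0>,<CN2>' '*' 'G,T,A' | count_alt_alleles
```

For the sample input, the function reports two records with one alternative allele, two records with two, and one record with three. Note that a record whose ALT column is "." (no alternative allele) is still counted as having one allele, as in the original one-liner.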

Example: 1000 Genomes variants on chromosome 22

Let us demonstrate the script using the VCF file of 1000 Genomes variants on chromosome 22. The file contains 1,103,547 variants, including 1,060,388 SNPs and 43,230 indels.


bcftools query -f '%ALT\n' \
    ALL.chr22.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz | \
    tr -d 'A-Za-z0-9<>:' | sort | uniq -c

The script produced the following output.

1097199
   6073 ,
    224 ,,
     38 ,,,
      9 ,,,,
      3 ,,,,,
      1 ,,,,,,,

According to the output, most of the variants in the file are biallelic (i.e., having a reference allele and a single alternative allele); only 6,348 of the 1,103,547 variants (about 0.58%) are multiallelic. Most of the multiallelic variants (6,073) are triallelic (i.e., having a reference allele and two alternative alleles), and only 275 have more than two alternative alleles.
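The figures quoted above can be recomputed directly from the uniq -c output. The following is a minimal sketch; the function name summarize_counts is hypothetical, and the input lines are the counts listed in the output above.

```shell
# summarize_counts: read `uniq -c` output (a count, optionally followed by a
# run of commas) and report how many records are multiallelic.
# (Hypothetical helper name, used here for illustration only.)
summarize_counts() {
    awk '{
        total += $1
        if (length($2) >= 1) multi += $1   # two or more ALT alleles
        if (length($2) >= 2) many += $1    # three or more ALT alleles
    } END {
        printf "total=%d multiallelic=%d (%.2f%%) three_or_more_alt=%d\n",
               total, multi, 100 * multi / total, many
    }'
}

# Counts taken from the chromosome 22 output shown above.
printf '%s\n' '1097199' '6073 ,' '224 ,,' '38 ,,,' '9 ,,,,' '3 ,,,,,' '1 ,,,,,,,' |
    summarize_counts
```

In practice the function could be appended to the pipeline itself, after uniq -c, instead of being fed the counts by hand.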