BED-based format for genotype counts

Being a comprehensive format for storing variant calling data, VCF is superfluous for some kinds of analysis. Although it is a plain-text tabular format, it may take significant time and memory to load a large VCF file into an R or Python session. For that reason, it is usually effective to preprocess a VCF file using a stand-alone tool in order to extract the information required for further analysis.

An example of such a tool is the vcftools package that implements a number of routines for processing VCF files. In particular, vcftools can be used to obtain allele frequencies from a VCF file:

CHROM POS   N_ALLELES N_CHR {ALLELE:COUNT}
7     45524 2         414   C:414 T:0
7     45569 3         414   G:413 A:1 T:0

Note that the number of columns differs because the second variant has two alternative alleles while the first variant has only one. It may cause problems during further processing of the data, e.g., the R function read.table will fail to read such a file. Also it would be more informative to store genotype counts instead of allele counts. Here we describe the normalized table format for genotype counts and the tool to obtain it from a VCF file.

Let us consider two variants located on chromosome 1 at positions 935187 and 948928 (both variants were taken from 1000 Genomes). The first variant is an SNP which reference and alternative alleles are G and A, respectively. The second variant is an indel which reference allele is GGCCCACA and that has two alternative alleles G and GGCCCACAGCCCACA. For that variants, the genotype table will be the following.

1  935187  935188  A         G               0       2       2502    2
1  935187  935188  G         G               2502    0       0       1
1  948928  948929  G         GGCCCACA        0       11      2484    2
1  948928  948929  GGCCCACA  GGCCCACA        2484    0       0       1
1  948928  948929  GGCCCACA  GGCCCACAGCCCACA 2484    9       0       1

The proposed format for genotype counts is based on the BED format and shares the first three columns with it, namely, a chromosome name, start and end positions of a variant. Other columns contain information about a pair of alleles and their genotype counts.

  1. The first allele.
  2. The second allele.
  3. The number of genotypes that are homozygous for the first allele.
  4. The number of genotypes that are heterozygous.
  5. The number of genotypes that are homozygous for the second allele.
  6. An integer value indicating whether the first or the second allele is the reference one. If none of them are reference, then the value is set to 0.

Each line corresponds to a certain allele pair for each variant; the first allele always precedes the second one in the alphabetical order. The lines in a file are sorted by their chromosome and position as for a sorted BED file and by allele pairs in the alphabetical order. If there are no genotypes for a certain allele pair, then this pair is not reported in the file.

For the example given above, we have the following genotype counts for the SNP.

G/G 2502
A/G 2
A/A 0

Note that there is no line for the pair A/A in the genotype count table because there are no such genotypes.

For the same example, we have the following genotype counts for the indel.

G/GGCCCACA 11
GGCCCACA/GGCCCACA 2484
GGCCCACA/GGCCCACAGCCCACA 9

To obtain a genotype table from a VCF file, one may use the vcfgeno2bed tool from the bioformats package. The package can be conveniently installed from the Python Package Index via pip.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s