bcftools plugin to convert a VCF file to the BED format

bcftools provides a convenient way to extend its functionality using plugins. Technically, bcftools plugins are dynamic libraries that are loaded and run when a user launches the bcftools plugin tool.

Here I present a simple plugin named vcf2bed that converts a VCF file to the BED format.


Converting an AGP file to the BED format

The AGP format is used to describe the assembly structure in the NCBI Genome database. Since AGP is a plain-text tabular format that specifies the positions of smaller sequence objects on larger ones (e.g., contigs on scaffolds), AGP files can be converted to the BED format for further processing.
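As an illustration of the idea, here is a minimal Python sketch of such a conversion. It is not the tool described in the post: the function name `agp_to_bed` and the choice of BED columns (component ID as the name, a placeholder score of 0) are assumptions made for this example. It relies on the standard AGP layout, where columns 2 and 3 hold 1-based inclusive coordinates on the object and component types `N` and `U` denote gaps.

```python
# Hypothetical helper, not the tool from the post: AGP lines -> BED6 tuples.
GAP_TYPES = {"N", "U"}  # AGP component types that describe gaps


def agp_to_bed(lines):
    """Yield (chrom, start, end, name, score, strand) for each non-gap AGP line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and AGP header comments
        fields = line.split("\t")
        obj, obj_beg, obj_end, _part_no, comp_type = fields[:5]
        if comp_type in GAP_TYPES:
            continue  # gap lines have no component to report
        comp_id, orientation = fields[5], fields[8]
        # AGP also allows "?", "0" and "na"; BED uses "." for an unknown strand.
        strand = orientation if orientation in {"+", "-"} else "."
        # AGP is 1-based inclusive, BED is 0-based half-open.
        yield (obj, int(obj_beg) - 1, int(obj_end), comp_id, 0, strand)
```

Each tuple can be written out with `"\t".join(map(str, row))`. Gap lines are skipped here for simplicity; a real converter might instead report them as separate BED features.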


Sample-based format for predicted variant effects

VCF is a variant-based format, i.e., each of its records (lines) represents a single genomic variant: its location, reference and alternative alleles, variant calling characteristics and sample genotypes. However, sample-based datasets are more convenient for some applications, especially if each variant allele has its own meaning. For example, one may predict variant effects with snpEff or Ensembl VEP and consider only the samples that have specific effects for both of their alleles.
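To make the difference between the two views concrete, the following Python sketch turns variant-based VCF lines into one record per sample and variant. It is only an illustration under stated assumptions, not the format introduced in the post: the function name `vcf_to_sample_records` and the output tuple layout are hypothetical, and predicted effects are left out entirely.

```python
import re


def vcf_to_sample_records(lines):
    """Yield (sample, chrom, start, end, alleles) for every called genotype."""
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        alleles = [ref] + alt.split(",")  # index 0 is REF, 1.. are ALT alleles
        gt_index = fields[8].split(":").index("GT")
        start = int(pos) - 1  # BED-style 0-based start
        end = start + len(ref)  # the interval spans the reference allele
        for sample, call in zip(samples, fields[9:]):
            codes = re.split(r"[/|]", call.split(":")[gt_index])
            if "." in codes:
                continue  # missing genotype, nothing to report
            yield (sample, chrom, start, end, tuple(alleles[int(c)] for c in codes))
```

Each output record carries both alleles of one sample, so filtering on per-allele properties becomes a simple row filter.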

Here we introduce a BED-based format for sample-centered storage of predicted variant effects. Before describing the format, we give a sample of its records.


BED-based format for genotype counts

As a comprehensive format for storing variant calling data, VCF carries more information than some kinds of analysis need. Although it is a plain-text tabular format, loading a large VCF file into an R or Python session may take significant time and memory. For that reason, it is often more efficient to preprocess a VCF file with a stand-alone tool that extracts only the information required for further analysis.

An example of such a tool is the vcftools package, which implements a number of routines for processing VCF files. In particular, vcftools can be used to obtain per-site allele counts from a VCF file:

7     45524 2         414   C:414 T:0
7     45569 3         414   G:413 A:1 T:0

Note that the number of columns differs because the second variant has two alternative alleles while the first variant has only one. This may cause problems during further processing of the data, e.g., the R function read.table will fail to read such a file. It would also be more informative to store genotype counts instead of allele counts. Here we describe a normalized table format for genotype counts and a tool to obtain it from a VCF file.
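The ragged-column problem can be illustrated with a short Python sketch that reshapes lines like the ones above into a table with a fixed number of columns, one row per allele. This is not the format or the tool described in the post, which deals with genotype counts. The function name `normalize_allele_counts` is hypothetical, and the sketch assumes that the third and fourth columns give the number of alleles and the total number of called chromosomes, followed by one `ALLELE:COUNT` field per allele.

```python
def normalize_allele_counts(lines):
    """Yield (chrom, pos, allele, count, total) with one row per allele."""
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "CHROM":
            continue  # skip blank lines and an optional header line
        chrom, pos, _n_alleles, total = fields[:4]
        # Every remaining field is assumed to look like "ALLELE:COUNT".
        for item in fields[4:]:
            allele, count = item.rsplit(":", 1)
            yield (chrom, int(pos), allele, int(count), int(total))
```

For the two lines shown above, this yields five rows of five columns each, which a routine such as read.table can load without complaints about uneven line lengths.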
