VCF is a variant-based format: each record (line) represents a single genomic variant, including its location, reference and alternative alleles, variant calling characteristics, and sample genotypes. However, sample-based datasets are more convenient for some applications, especially when each variant allele carries its own meaning. For example, one may predict variant effects with snpEff or Ensembl VEP and consider only the samples that have specific effects for both of their alleles.
Here we introduce a BED-based format for sample-centered storage of predicted variant effects. Before describing the format, we give an example of records in it.
1 865624 865625 A G ENSG00000187634 ENST00000437963 IND1 missense_variant - 2
1 877830 877831 C C ENSG00000187634 ENST00000341065 IND1 missense_variant missense_variant 0
1 981130 981131 A G ENSG00000188157 ENST00000379370 IND1 - missense_variant 1
1 981367 981368 C T ENSG00000188157 ENST00000379370 IND1 - missense_variant 1
1 1120369 1120370 C G ENSG00000162571 ENST00000379288 IND1 - missense_variant 1
1 223722779 223722780 A A ENSG00000203697 ENST00000366872 IND1 stop_gained stop_gained 0
2 29444094 29444095 T T ENSG00000171094 ENST00000453137 IND1 stop_gained stop_gained 0
2 85549867 85549868 G G ENSG00000152291 ENST00000409015 IND1 stop_lost stop_lost 0
A file in the variant effect format is a tab-delimited plain text file. The first three columns describe the variant location and are the same as in the BED format. The remaining columns describe a specific sample and the effects of its alleles, in the following order:
- The first allele of a variant;
- The second allele of a variant;
- A gene ID the variant effect is related to;
- A gene feature ID the variant effect is related to (usually it is the ID of a gene transcript) or NA if there is no such feature;
- A sample ID;
- Effect type of the first allele (e.g., stop_gained);
- Effect type of the second allele (e.g., missense_variant);
- An integer value that indicates which allele is the reference one: 1 for the first allele, 2 for the second. If neither allele is the reference, the value is 0.
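The column layout above can be sketched as a small Python parser. This is only an illustration under stated assumptions: the names `EffectRecord` and `parse_record` and the field names are hypothetical and are not part of the bioformats package.

```python
from typing import NamedTuple


class EffectRecord(NamedTuple):
    """One line of the variant effect format (field names are illustrative)."""
    chrom: str
    start: int
    end: int
    allele1: str
    allele2: str
    gene_id: str
    feature_id: str   # "NA" if there is no such feature
    sample_id: str
    effect1: str      # "-" for a reference allele
    effect2: str
    ref_allele: int   # 1 or 2 for the reference allele, 0 if neither


def parse_record(line: str) -> EffectRecord:
    """Parse one tab-delimited line into an EffectRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 columns, got {len(fields)}")
    return EffectRecord(
        fields[0], int(fields[1]), int(fields[2]),  # BED-like location
        *fields[3:10],                              # alleles, IDs, effects
        int(fields[10]),                            # reference indicator
    )
```

A named tuple keeps the record immutable and lets later code refer to columns by name rather than by index.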
The records must also satisfy the following conditions.
- The first allele never follows the second one in lexicographic order (the two alleles may be equal).
- For a reference allele, its effect is denoted by a dash sign (-).
- The records must be sorted by the following keys, in this order:
  - the first allele;
  - the gene ID;
  - the gene feature ID.
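The per-record conditions above (allele order and the dash for a reference allele) can be checked mechanically. The following is a minimal sketch; `check_record` is a hypothetical helper, not a function of the bioformats package, and it deliberately does not verify the sort order of the file.

```python
def check_record(fields):
    """Return a list of problems found in one split record (empty if none).

    `fields` is a list of the 11 column values of a record, as strings.
    """
    if len(fields) != 11:
        return [f"expected 11 columns, got {len(fields)}"]
    allele1, allele2 = fields[3], fields[4]
    effect1, effect2 = fields[8], fields[9]
    ref = fields[10]
    problems = []
    if ref not in ("0", "1", "2"):
        problems.append(f"reference indicator must be 0, 1 or 2, got {ref!r}")
    # The first allele must not follow the second one lexicographically.
    if allele1 > allele2:
        problems.append("first allele follows the second one")
    # The effect of a reference allele is denoted by a dash.
    if ref == "1" and effect1 != "-":
        problems.append("first allele is reference but its effect is not '-'")
    if ref == "2" and effect2 != "-":
        problems.append("second allele is reference but its effect is not '-'")
    return problems
```

Returning a list of messages instead of raising on the first problem makes it easier to report every issue in a file in one pass.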
For variant effects, we recommend using Sequence Ontology (SO) terms. In principle, any notation may be used, but most variant effect prediction tools (including snpEff and Ensembl VEP) use SO terms.
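Because each allele's effect is stored as a plain term in its own column, selecting the records in which both alleles carry a given effect reduces to comparing two columns. The sketch below assumes the column order described above; `both_alleles_have` is a hypothetical helper written for illustration, and it works per record only (it does not combine different variants within a gene).

```python
def both_alleles_have(lines, effects):
    """Yield (sample ID, gene ID) for records where both alleles have
    an effect from the `effects` set.

    `lines` is an iterable of tab-delimited records in the variant
    effect format.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Columns 9 and 10 (indices 8 and 9) hold the two allele effects.
        if fields[8] in effects and fields[9] in effects:
            yield fields[7], fields[5]
```

For example, calling `both_alleles_have(open_file, {"stop_gained"})` on a file with the records shown earlier would pick out the two records where both alleles are annotated as `stop_gained`.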
To get a file in the described format from an snpEff-annotated VCF file, one may use the vcfeffect2bed tool from the bioformats package. The tool provides options for filtering effects by their impact (the -i or --impacts option) and genotype (the -g or --genotypes option).