Sample-based format for predicted variant effects

VCF is a variant-based format: each record (line) represents a single genomic variant, namely its location, reference and alternative alleles, variant calling characteristics, and sample genotypes. However, sample-based datasets are more convenient for some applications, especially when each allele of a variant carries its own meaning. For example, one may predict variant effects with snpEff or Ensembl VEP and then consider only the samples that have specific effects for both of their alleles.

Here we introduce a BED-based format for sample-centered storage of predicted variant effects. Before describing the format, we show a few example records.

1	865624	865625	A	G	ENSG00000187634	ENST00000437963	IND1	missense_variant	-	2
1	877830	877831	C	C	ENSG00000187634	ENST00000341065	IND1	missense_variant	missense_variant	0
1	981130	981131	A	G	ENSG00000188157	ENST00000379370	IND1	-	missense_variant	1
1	981367	981368	C	T	ENSG00000188157	ENST00000379370	IND1	-	missense_variant	1
1	1120369	1120370	C	G	ENSG00000162571	ENST00000379288	IND1	-	missense_variant	1
1	223722779	223722780	A	A	ENSG00000203697	ENST00000366872	IND1	stop_gained	stop_gained	0
2	29444094	29444095	T	T	ENSG00000171094	ENST00000453137	IND1	stop_gained	stop_gained	0
2	85549867	85549868	G	G	ENSG00000152291	ENST00000409015	IND1	stop_lost	stop_lost	0

Format description

A file in the variant effect format is a tab-delimited plain text file. The first three columns describe the variant location and are the same as in the BED format. The remaining eight columns, listed below in order, contain information about a specific sample and its allele effects.

  1. The first allele of a variant;
  2. The second allele of a variant;
  3. A gene ID the variant effect is related to;
  4. A gene feature ID the variant effect is related to (usually it is the ID of a gene transcript) or NA if there is no such feature;
  5. A sample ID;
  6. Effect type of the first allele (e.g., stop_gained);
  7. Effect type of the second allele (e.g., missense_variant);
  8. An integer indicating which allele is the reference one: 1 for the first allele, 2 for the second. If neither allele is the reference, the value is 0.
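
The column layout above can be turned into a small parser. The following is a minimal sketch, not part of any existing tool: the names EffectRecord and parse_record are hypothetical, and the field names are chosen here only for illustration.

```python
from typing import NamedTuple


class EffectRecord(NamedTuple):
    """One line of the variant effect format (hypothetical helper type)."""
    chrom: str
    start: int
    end: int
    allele1: str
    allele2: str
    gene_id: str
    feature_id: str   # "NA" if there is no such feature
    sample_id: str
    effect1: str      # "-" for a reference allele
    effect2: str
    ref_allele: int   # 1 or 2 = which allele is the reference, 0 = neither


def parse_record(line: str) -> EffectRecord:
    """Split one tab-delimited line into its 11 columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 tab-separated fields, got {len(fields)}")
    chrom, start, end, a1, a2, gene, feature, sample, e1, e2, ref = fields
    return EffectRecord(chrom, int(start), int(end), a1, a2,
                        gene, feature, sample, e1, e2, int(ref))
```

For instance, parsing the first example record gives a tuple whose ref_allele field is 2, meaning that the second allele (G, with effect "-") is the reference one.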

The records must also satisfy the following conditions.

  • The first allele never follows the second one in lexicographic order (the two alleles may be identical).
  • For a reference allele, its effect is denoted by a dash sign (-).
  • The records must be sorted by the following keys, in order:
    1. the first allele;
    2. the gene ID;
    3. the gene feature ID.
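
The per-record conditions can be checked mechanically. The sketch below is an illustration only: check_record is a hypothetical helper, it takes one record already split on tabs, and it assumes that identical alleles are allowed (as in the example records), so the allele comparison is non-strict. The sort order is not checked here.

```python
def check_record(fields):
    """Return a list of problems found in one split record (empty if none)."""
    problems = []
    allele1, allele2 = fields[3], fields[4]
    effect1, effect2 = fields[8], fields[9]
    ref = int(fields[10])

    # Assumption: identical alleles are allowed, so only a first allele
    # that sorts strictly after the second one is reported.
    if allele1 > allele2:
        problems.append("first allele follows the second one")

    if ref not in (0, 1, 2):
        problems.append("reference indicator must be 0, 1 or 2")

    # The effect of a reference allele must be a dash.
    if ref == 1 and effect1 != "-":
        problems.append("first allele is reference but its effect is not '-'")
    if ref == 2 and effect2 != "-":
        problems.append("second allele is reference but its effect is not '-'")

    return problems
```

Applied to any of the example records above, the function returns an empty list.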

For variant effects, it is recommended to use the Sequence Ontology (SO) terms. Any other notation may be used in principle, but most variant effect prediction tools (including snpEff and Ensembl VEP) use the SO notation.

To get a file in the described format from an snpEff-annotated VCF file, one may use the vcfeffect2bed tool from the bioformats package. The tool provides options for filtering effects by their impact (the -i or --impacts option) and genotype (the -g or --genotypes option).
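
Once a file in this format is available, the use case from the introduction (samples with specific effects for both of their alleles) reduces to a simple scan. The following is a rough sketch under stated assumptions: both_alleles_affected is a hypothetical function, not part of the bioformats package, and it accepts any iterable of lines, such as an open file.

```python
import csv


def both_alleles_affected(lines, effects):
    """Yield (sample ID, gene ID) for records where both alleles
    carry one of the given effect types."""
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 11:
            continue  # skip blank or truncated lines
        # Columns 9 and 10 of the file hold the two allele effects.
        if row[8] in effects and row[9] in effects:
            yield row[7], row[5]


# Two of the example records shown earlier.
example = [
    "1\t877830\t877831\tC\tC\tENSG00000187634\tENST00000341065\tIND1\tmissense_variant\tmissense_variant\t0",
    "1\t981130\t981131\tA\tG\tENSG00000188157\tENST00000379370\tIND1\t-\tmissense_variant\t1",
]
hits = list(both_alleles_affected(example, {"missense_variant", "stop_gained"}))
```

Only the first record is selected, because the second one has a reference allele whose effect is a dash.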
