Adding rs numbers to VCF file

The VCF format provides a fixed field for a variant ID. It is recommended to use IDs from the NCBI dbSNP database (so-called rs numbers) for variants that have already been described there. Here we describe how to add rs numbers to a custom VCF file using the bcftools package.

Step 1. Obtain dbSNP VCF file

To add rs numbers to a VCF file, we need the dbSNP VCF file that contains those numbers. The file can be downloaded from the NCBI FTP server as described here.

For example, the VCF file of all human variants from the dbSNP build 147 on the GRCh37.p13 assembly can be obtained at the following location:

Step 2. (Optional) Remove existing IDs from VCF file

You may skip this step if you would like to preserve existing IDs in your VCF file. Otherwise, the existing variant IDs can be removed from the VCF file using the bcftools annotate tool with the --remove option.

bcftools annotate --output file.noids.vcf.gz --output-type z \
  --remove ID file.vcf.gz
tabix -p vcf file.noids.vcf.gz

Note that we use the --output-type option to produce a gzipped VCF file and apply tabix to index it for the next step.
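Since bcftools annotate expects its annotation file to be compressed and indexed, it can help to confirm that an index sits next to each compressed VCF before annotating. The following is only a sketch: `has_index` is a hypothetical helper name, and it assumes the index was written beside the VCF with the default `.tbi` or `.csi` suffix.

```shell
# has_index is a hypothetical helper, not part of bcftools or tabix.
# It succeeds if FILE has a .tbi or .csi index file next to it.
has_index() {
  [ -f "$1.tbi" ] || [ -f "$1.csi" ]
}

# Usage sketch (file names as used in this post):
# for f in file.noids.vcf.gz All_20160408.vcf.gz; do
#   has_index "$f" || echo "missing index: $f" >&2
# done
```

The check only looks at file names; it does not verify that the index is up to date with the VCF.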

Step 3. Add rs numbers from dbSNP VCF file

Finally, we use bcftools annotate with the --columns option to add the rs numbers to the VCF file.

bcftools annotate --annotations All_20160408.vcf.gz --columns ID \
  --output file.rsnum.vcf.gz --output-type z file.noids.vcf.gz
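As a quick sanity check on the result, one might count how many records received an rs number. The sketch below is an illustration rather than part of the bcftools workflow: `count_ids` is a hypothetical helper that reads uncompressed VCF text on standard input and assumes tab-separated records with the ID in the third column.

```shell
# count_ids is a hypothetical helper: it reads uncompressed VCF text on stdin
# and reports how many records have an rs ID in column 3 and how many do not.
count_ids() {
  awk -F'\t' '
    /^#/ { next }                       # skip header lines
    $3 ~ /^rs[0-9]+/ { known++; next }  # ID column starts with an rs number
    { other++ }                         # "." or any non-rs ID
    END { printf "rs=%d other=%d\n", known, other }
  '
}

# Usage sketch:
# gzip -dc file.rsnum.vcf.gz | count_ids
```

If the count of rs IDs is zero, the chromosome names in the two files are worth comparing before anything else.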

9 thoughts on “Adding rs numbers to VCF file”

  1. When I try to use this command, it shows “[W::bcf_hdr_check_sanity] GL should be declared as Number=G”. How do I fix this?


    • It looks like there is something wrong with the header of your VCF files. You can extract the header using `bcftools view --header-only`, fix it in any text editor, and replace headers in the VCF files using `bcftools reheader`.
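For this particular warning, the header edit can also be scripted instead of done by hand. This is a minimal sketch under stated assumptions: `fix_gl_number` is a hypothetical helper, the file names in the usage lines are placeholders, and the header is assumed to declare GL with `Number=` directly after `ID=GL`, which is the usual field order.

```shell
# fix_gl_number is a hypothetical helper: it reads a VCF header on stdin and
# rewrites the Number field of the GL FORMAT declaration to G.
# All other header lines pass through unchanged.
fix_gl_number() {
  sed 's/^\(##FORMAT=<ID=GL,Number=\)[^,>]*/\1G/'
}

# Usage sketch (header.txt and the output names are placeholders):
# bcftools view --header-only file.vcf.gz > header.txt
# fix_gl_number < header.txt > header.fixed.txt
# bcftools reheader --header header.fixed.txt --output file.fixed.vcf.gz file.vcf.gz
```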


  2. Excuse my beginner question… I’m trying to generate a .vcf file from a list of rsnumbers. If I’m understanding this post correctly I could take any .vcf file remove the ids with step 2 code and then use step 3 to create a .vcf from my rsnumbers? In that case is file.rsnum.vcf.gz just a text file with a new line for each rsnumber?


    • Output file *file.rsnum.vcf.gz* is in the VCF format and contains the same records as file *file.noids.vcf.gz* except for its third column (variant IDs). The third column of the output file will contain either rs IDs for variants present in the provided dbSNP file or dots (.) for variants missing in dbSNP.


  3. When I annotated with dbSNP build 138, the total number of germline mutations was 156190, of which 1335265 were known and the rest were novel. But when I tried to update my IDs with the latest dbSNP file from the link you mentioned above, the total number of germline mutations stayed the same, while the known ones dropped to 0 and all 156190 became novel. How is that possible? Can you tell me what is wrong here? Thanks


    • The absence of known variants might have been caused by different naming of chromosomes in your VCF file and the dbSNP VCF file, for example, ‘chr1’ vs ‘1’. Another option is to intersect a subset of your variants with the dbSNP variants to make sure that both datasets are consistent. You may use the bedtools package for this purpose.


    • You should use the command `bedtools intersect` to obtain the variants shared by both VCF files, for example: `bedtools intersect -header -a variants.vcf.gz -b dbsnp.vcf.gz > shared_variants.vcf`.
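The chromosome-naming mismatch mentioned in the first reply can be checked directly by listing the names each file uses. The sketch below is illustrative only: `list_chroms` is a hypothetical helper that reads uncompressed VCF text on standard input, and the file names in the usage lines are placeholders.

```shell
# list_chroms is a hypothetical helper: it reads uncompressed VCF text on stdin
# and prints the distinct chromosome names (column 1), sorted.
list_chroms() {
  grep -v '^#' | cut -f1 | sort -u
}

# Usage sketch (file names are placeholders):
# gzip -dc variants.vcf.gz | list_chroms > mine.txt
# gzip -dc dbsnp.vcf.gz | list_chroms > dbsnp.txt
# comm -12 mine.txt dbsnp.txt   # names present in both files
```

An empty result from `comm -12` would suggest that the two files name their chromosomes differently, for example ‘chr1’ vs ‘1’. Note that this reads each file in full, which can take a while for the dbSNP VCF.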


