The VCF format provides a fixed field for a variant ID. It is recommended to use IDs from the NCBI dbSNP database (so-called rs numbers) for variants that have already been described in it. Here we describe how to add rs numbers to a custom VCF file using the bcftools package.
Step 1. Obtain dbSNP VCF file
To add rs numbers to a VCF file, we need the dbSNP VCF file that contains those numbers. The file can be downloaded from the NCBI FTP server as described here.
For example, the VCF file of all human variants from the dbSNP build 147 on the GRCh37.p13 assembly can be obtained at the following location: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/All_20160408.vcf.gz.
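The download can be scripted. The following is a minimal sketch, not an official NCBI procedure: the helper name fetch_dbsnp is hypothetical, and it assumes that curl is installed and that a tabix index with the .tbi suffix is published next to the VCF file. If the index is not there, it can be created locally with tabix -p vcf.

```shell
# Location and file name taken from the example URL above.
DBSNP_BASE="ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF"
DBSNP_VCF="All_20160408.vcf.gz"

# Hypothetical helper: download the dbSNP VCF file and its tabix index
# into the current directory. The .tbi location is an assumption.
fetch_dbsnp() {
    curl --fail --remote-name "${DBSNP_BASE}/${DBSNP_VCF}" &&
    curl --fail --remote-name "${DBSNP_BASE}/${DBSNP_VCF}.tbi"
}
```

Calling fetch_dbsnp starts the download. The file is large, so it is worth checking the available disk space first.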
Step 2. (Optional) Remove existing IDs from VCF file
You may skip this step if you would like to preserve existing IDs in your VCF file. Otherwise, the existing variant IDs can be removed from the VCF file using the bcftools annotate tool with the --remove option.
bcftools annotate --output file.noids.vcf.gz --output-type z \
    --remove ID file.vcf.gz
tabix -p vcf file.noids.vcf.gz
Note that we use the --output-type option to produce a compressed (bgzipped) VCF file and apply tabix to index it for the next step.
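To confirm that the IDs were removed, one can count the records whose ID column is not the missing value ".". The sketch below uses only gzip and awk rather than bcftools. The helper name count_ids is hypothetical, and it assumes a tab-separated VCF file with the ID in the third column, as the VCF format defines.

```shell
# Hypothetical helper: count data records whose ID column (column 3)
# holds something other than the missing value ".".
count_ids() {
    gzip -dc "$1" |
        awk -F'\t' '!/^#/ && $3 != "." { n++ } END { print n + 0 }'
}
```

Running count_ids file.noids.vcf.gz after this step should print 0. Running it on file.vcf.gz shows how many IDs the original file carried.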
Step 3. Add rs numbers from dbSNP VCF file
Finally, we use bcftools annotate with the --columns option to add the rs numbers to the VCF file.
bcftools annotate --annotations All_20160408.vcf.gz --columns ID \
    --output file.rsnum.vcf.gz --output-type z file.noids.vcf.gz
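After annotation it is useful to see how many variants actually received an rs number. A very low count often points to a chromosome naming mismatch between the two files (for example "chr1" versus "1"). The following sketch is one way to get that summary with gzip and awk. The helper name rs_summary is hypothetical, and it treats any ID entry that starts with "rs" followed by digits as an rs number, including entries in a semicolon-separated ID list.

```shell
# Hypothetical helper: report how many data records carry an rs number
# in the ID column (column 3), which may hold several IDs joined by ";".
rs_summary() {
    gzip -dc "$1" | awk -F'\t' '
        /^#/ { next }                      # skip header lines
        { total++ }
        $3 ~ /(^|;)rs[0-9]+/ { with_rs++ }
        END { printf "%d of %d records have an rs number\n", with_rs, total }'
}
```

For example, rs_summary file.rsnum.vcf.gz prints a single line with the number of annotated records and the total number of records.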