Combining a large number of VCF files

The bcftools and vcftools packages provide routines for merging or concatenating multiple VCF files. However, when a large number of input VCF files is specified, these tools may fail because the operating system limits how many files a process can keep open at once. This problem can be overcome by combining the files iteratively: first, pairs of the original VCF files are combined, then pairs of the resulting files are combined, and so on until a single VCF file remains.
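To get a feel for the cost of the pairwise approach, the following minimal sketch counts how many rounds are needed to reduce a given number of files to one. The function name count_rounds is hypothetical and is not part of the script described in this post. It assumes that a file left without a partner is carried over to the next round.

```python
def count_rounds(n_files):
    """Return how many pairwise rounds reduce n_files to a single file."""
    rounds = 0
    while n_files > 1:
        # Each round roughly halves the number of files;
        # with an odd count, the unpaired file is carried over.
        n_files = (n_files + 1) // 2
        rounds += 1
    return rounds


for n in (2, 5, 1000, 10000):
    print(n, count_rounds(n))
```

The number of rounds grows only logarithmically with the number of files, while each individual process reads just two input files.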

Here we describe an iterative scheme for merging or concatenating VCF files using bcftools and GNU parallel and present a Python script that implements it.

Iterative scheme for combining VCF files

We implement an iterative scheme in which each bcftools process combines only a pair of VCF files at a time, so that few files are open simultaneously. To process pairs in parallel, we use GNU parallel with the -a argument, which specifies the file its input is read from. Since the bcftools routines do not index their output VCF files, we index them with the tabix tool.
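As an illustration, one round could be launched with a command line like the one built below. This is a sketch under assumptions, not the original script: the function build_parallel_command is hypothetical, and the pair list is assumed to be a tab-separated file with three columns (first input, second input, output). The options used (-j, -a and --colsep for GNU parallel, -O z and -o for bcftools, -p vcf for tabix) are standard ones.

```python
import shlex


def build_parallel_command(pair_list, command="merge", jobs=2):
    """Build a GNU parallel command line that combines each listed pair."""
    # {1} and {2} are the two input files and {3} is the output file,
    # taken from the assumed tab-separated columns of each line in pair_list.
    per_pair = (
        f"bcftools {command} -O z -o {{3}} {{1}} {{2}} "
        "&& tabix -p vcf {3}"
    )
    return [
        "parallel",
        "-j", str(jobs),       # number of simultaneous jobs
        "-a", pair_list,       # read job arguments from this file
        "--colsep", r"\t",     # split each input line into columns
        per_pair,
    ]


cmd = build_parallel_command("tchkc_1.txt", command="merge", jobs=4)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True). Running tabix right after each bcftools call leaves every intermediate file indexed before the next round begins.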

At some stages, the number of VCF files to be combined may be odd. In this case, one file is left out of the combining procedure at the current stage and carried over to the next iteration (see the figure below).

The iterative scheme for combining multiple VCF files.
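The pairing logic, including the odd-file case, can be sketched as follows. The functions pair_files and plan_rounds and the round{N}_{K}.vcf.gz output names are hypothetical and serve only to illustrate the scheme; no files are read or written.

```python
def pair_files(files):
    """Split files into consecutive pairs; return (pairs, leftover)."""
    pairs = [(files[i], files[i + 1]) for i in range(0, len(files) - 1, 2)]
    # With an odd number of files, the last one has no partner.
    leftover = files[-1] if len(files) % 2 else None
    return pairs, leftover


def plan_rounds(files):
    """Yield (round number, [(pair, output)], leftover) for every round."""
    round_no = 0
    while len(files) > 1:
        round_no += 1
        pairs, leftover = pair_files(files)
        # Hypothetical output names, used here for illustration only.
        outputs = [f"round{round_no}_{k}.vcf.gz"
                   for k, _ in enumerate(pairs, 1)]
        yield round_no, list(zip(pairs, outputs)), leftover
        # The unpaired file joins the outputs in the next round.
        files = outputs + ([leftover] if leftover is not None else [])


for round_no, jobs, leftover in plan_rounds(["a", "b", "c", "d", "e"]):
    print(round_no, jobs, "carried over:", leftover)
```

Appending the carried-over file at the end keeps the files in their original order, which may matter when they are concatenated rather than merged.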

Python script for combining VCF files

The Python script implementing the described iterative scheme is given below. It provides the following options:

  • -c or --command specifies how input VCF files are combined (merged or concatenated);
  • -j or --jobs sets the number of bcftools processes to be launched in parallel;
  • -k or --keep prevents removal of temporary files created by the script.
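A command-line interface with these options could be declared with argparse roughly as shown below. Only the three option names come from the description above. The choice values merge and concat, the defaults, and the positional arguments output and inputs are assumptions made for this sketch.

```python
import argparse


def build_parser():
    """Create a parser for the three options listed above."""
    parser = argparse.ArgumentParser(
        description="Iteratively combine VCF files with bcftools.")
    # Choice values and defaults are assumptions, not taken from the script.
    parser.add_argument("-c", "--command", choices=("merge", "concat"),
                        default="merge",
                        help="how input VCF files are combined")
    parser.add_argument("-j", "--jobs", type=int, default=1,
                        help="number of bcftools processes run in parallel")
    parser.add_argument("-k", "--keep", action="store_true",
                        help="keep temporary files created by the script")
    # Assumed positional arguments for the output and input files.
    parser.add_argument("output", help="resulting VCF file")
    parser.add_argument("inputs", nargs="+", help="input VCF files")
    return parser


args = build_parser().parse_args(
    ["-c", "concat", "-j", "4", "out.vcf.gz", "a.vcf.gz", "b.vcf.gz"])
print(args.command, args.jobs, args.keep, args.inputs)
```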

The script creates temporary files of two types:

  1. plain-text lists of file pairs;
  2. intermediate combined VCF files.

The names of these files start with a random 5-letter prefix generated by the script, e.g., tchkc. The prefix is followed by the iteration number, starting from 1. The names of temporary VCF files also include a serial number. For example, the pair list file for the second iteration might be named tchkc_2.txt, and the corresponding VCF files might have names like tchkc_2_3.vcf.gz.
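This naming convention can be reproduced with a few helper functions. The function names below are hypothetical, and the use of lowercase ASCII letters for the prefix is an assumption based on the tchkc example.

```python
import random
import string


def random_prefix(length=5):
    """Return a random prefix of lowercase letters (assumed alphabet)."""
    return "".join(random.choice(string.ascii_lowercase)
                   for _ in range(length))


def pair_list_name(prefix, iteration):
    """Name of the plain-text pair list for the given iteration."""
    return f"{prefix}_{iteration}.txt"


def temp_vcf_name(prefix, iteration, serial):
    """Name of an intermediate VCF file within the given iteration."""
    return f"{prefix}_{iteration}_{serial}.vcf.gz"


prefix = random_prefix()
print(pair_list_name(prefix, 2), temp_vcf_name(prefix, 2, 3))
```

A shared prefix makes it easy to find and remove all temporary files belonging to one run of the script.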
