Obtaining scaffold positions on assembled chromosomes from NCBI Genome

NCBI Genome stores genomic assemblies of numerous species. Besides the assembly sequences, it also provides related auxiliary information, including AGP files that describe how large sequence objects (e.g., chromosomes) were assembled from smaller ones (e.g., scaffolds or contigs).
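To make the AGP layout concrete, here is a minimal sketch of reading one tab-separated AGP line in Python. The names AgpLine and parse_agp_line are hypothetical and chosen for illustration only. The sketch assumes the standard nine-column layout, in which the fifth column holds the component type and the values N and U mark gap lines.

```python
from typing import NamedTuple, Tuple


class AgpLine(NamedTuple):
    """One AGP record (hypothetical helper type, for illustration only)."""
    obj: str              # column 1: object being assembled, e.g. a chromosome
    obj_beg: int          # column 2: start position on the object (1-based)
    obj_end: int          # column 3: end position on the object
    part_number: int      # column 4: ordinal number of the part in the object
    component_type: str   # column 5: e.g. W for a WGS contig, N or U for a gap
    fields: Tuple[str, ...]  # columns 6-9, whose meaning depends on column 5

    @property
    def is_gap(self) -> bool:
        # Gap lines are marked by component types N (known size) and U (unknown size).
        return self.component_type in ("N", "U")


def parse_agp_line(line: str) -> AgpLine:
    """Split a tab-separated AGP line into typed fields."""
    cols = line.rstrip("\n").split("\t")
    return AgpLine(cols[0], int(cols[1]), int(cols[2]), int(cols[3]),
                   cols[4], tuple(cols[5:9]))
```

For a component line, fields holds the component ID, its start, its end, and its orientation. For a gap line, it holds the gap length, the gap type, the linkage flag, and the linkage evidence.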

For some assemblies, the chromosome-from-scaffolds AGP file may be missing even though the chromosomes were assembled from scaffolds. In that case, one can reconstruct the AGP file of scaffolds on chromosomes from the chromosome-from-components and scaffold-from-components AGP files.

Below we describe how to perform such a reconstruction and present a Python script that implements it.

Example: chimpanzee genome

As an example, consider the chimpanzee genome assembly Pan_troglodytes-2.1.4, whose files are located on the NCBI FTP server in the following directory: ftp://ftp.ncbi.nlm.nih.gov/genomes/Pan_troglodytes. In that directory, there are chromosome-from-contigs AGP files and a scaffold-from-contigs AGP file but no chromosome-from-scaffolds AGP file.

Python script to get chromosome-from-scaffolds AGP file

To obtain the chromosome-from-scaffolds AGP file, we scan the chromosome-from-contigs AGP file and replace contigs with the corresponding scaffolds. We also remove the gaps between contigs within a scaffold. Such gaps are identified by the gap_type value in the seventh column of an AGP gap line: if the value is scaffold, the gap lies between two contigs within a scaffold.
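The procedure described above can be sketched in Python as follows. This is a simplified illustration, not necessarily the script itself. The function name scaffolds_on_chromosomes is hypothetical. The sketch assumes that every contig is placed on a chromosome in full, that the contigs of a scaffold appear consecutively on one chromosome, and that the scaffold orientation can be derived by comparing a contig's orientation on the chromosome with its orientation in the scaffold.

```python
GAP_TYPES = ("N", "U")  # component types that mark gap lines


def scaffolds_on_chromosomes(chr_agp_lines, scaf_agp_lines):
    """Return chromosome-from-scaffolds AGP lines (simplified sketch)."""
    # Map contig ID -> (scaffold ID, start in scaffold, end in scaffold, orientation).
    contig_to_scaffold = {}
    for line in scaf_agp_lines:
        if line.startswith("#") or not line.strip():
            continue
        c = line.rstrip("\n").split("\t")
        if c[4] not in GAP_TYPES:
            contig_to_scaffold[c[5]] = (c[0], int(c[1]), int(c[2]), c[8])

    # Each row: [chromosome, start, end, component type, columns 6-9].
    rows = []
    for line in chr_agp_lines:
        if line.startswith("#") or not line.strip():
            continue
        c = line.rstrip("\n").split("\t")
        chrom, beg, end = c[0], int(c[1]), int(c[2])
        if c[4] in GAP_TYPES:
            if c[6] == "scaffold":
                continue  # gap between contigs within a scaffold: drop it
            rows.append([chrom, beg, end, c[4]] + c[5:9])
            continue
        if c[5] not in contig_to_scaffold:
            rows.append([chrom, beg, end, c[4]] + c[5:9])  # keep unknown contigs as is
            continue
        scaf, s_beg, s_end, s_ori = contig_to_scaffold[c[5]]
        last = rows[-1] if rows else None
        if (last and last[0] == chrom and last[3] not in GAP_TYPES
                and last[4] == scaf):
            # Next contig of the same scaffold: extend the current scaffold row.
            last[2] = end
            last[5] = min(last[5], s_beg)
            last[6] = max(last[6], s_end)
        else:
            orientation = "+" if c[8] == s_ori else "-"
            rows.append([chrom, beg, end, c[4], scaf, s_beg, s_end, orientation])

    # Renumber the parts within each chromosome and format the output lines.
    out, part = [], {}
    for r in rows:
        part[r[0]] = part.get(r[0], 0) + 1
        out.append("\t".join(map(str, r[:3] + [part[r[0]]] + r[3:])))
    return out
```

Both arguments are iterables of AGP lines, for example open file objects. Gaps of any other type, such as those between scaffolds, are kept unchanged.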

Note that the script requires a single chromosome-from-contigs AGP file, which can be obtained by merging the per-chromosome AGP files with cat.
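Where cat is not available, the same merge can be done in Python. The sketch below concatenates the files byte for byte in the given order, as cat does. The function name merge_agp_files and the file names in the usage note are hypothetical.

```python
import shutil


def merge_agp_files(paths, merged_path):
    """Concatenate the AGP files in paths into merged_path, like cat."""
    with open(merged_path, "wb") as merged:
        for path in paths:
            with open(path, "rb") as source:
                # Copy the file contents unchanged, in the order given.
                shutil.copyfileobj(source, merged)
```

For example, merge_agp_files(["chr1.agp", "chr2.agp"], "chromosomes.agp") would write both per-chromosome files into chromosomes.agp, assuming files with those names exist in the current directory.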




