Converting an AGP file to the BED format

The AGP format describes assembly structure in the NCBI Genome database. Because AGP is a plain-text tabular format that specifies the positions of smaller sequence objects on larger ones (e.g., contigs on scaffolds), AGP files can be converted to the BED format for further processing.

Two ways to convert AGP to BED

An AGP file relates two sets of sequence objects. Thus, a single AGP file can be converted to either of two BED files: one based on the larger sequence objects (e.g., scaffolds or chromosomes) and one based on the smaller sequence objects (e.g., contigs).

Let us consider the following example: the SEQ1 sequence is composed of three fragments, FRAG1, FRAG2 and FRAG3, separated by two gaps of 30 bp each. The lengths of FRAG1, FRAG2 and FRAG3 are 300, 700 and 250 bp, respectively.


An example of a large sequence object composed of three smaller ones.

The BED file based on the larger sequence object is the following:

SEQ1    0  300 FRAG1 1000 + 0 300
SEQ1  300  330   GAP 1000 + 0  30
SEQ1  330 1030 FRAG2 1000 + 0 700
SEQ1 1030 1060   GAP 1000 + 0  30
SEQ1 1060 1310 FRAG3 1000 + 0 250

This is a BED6+2 file that contains 6 columns of the BED format and 2 additional columns: the fragment start and end positions. The smaller sequence object-based BED will be of the same format:

FRAG1 0 300 SEQ1 1000 +    0  300
  GAP 0  30 SEQ1 1000 +  300  330
FRAG2 0 700 SEQ1 1000 +  330 1030
  GAP 0  30 SEQ1 1000 + 1030 1060
FRAG3 0 250 SEQ1 1000 + 1060 1310

Note that the BED file based on smaller sequence objects can be obtained from the BED file based on larger sequence objects by swapping columns 1 and 4 (sequence names) and columns 2-3 with columns 7-8 (sequence coordinates).
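The column swap can be sketched in a few lines of Python. This is a minimal illustration, not part of the conversion script itself; the function name swap_bed_view is hypothetical, and the input rows are the example table from above.

```python
def swap_bed_view(line):
    """Swap columns 1<->4 and (2,3)<->(7,8) of a BED6+2 line."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError("expected 8 columns, got %d" % len(fields))
    name_a, start_a, end_a, name_b, score, strand, start_b, end_b = fields
    return "\t".join(
        [name_b, start_b, end_b, name_a, score, strand, start_a, end_a]
    )


# The larger-object-based example from this post.
larger = [
    "SEQ1    0  300 FRAG1 1000 + 0 300",
    "SEQ1  300  330   GAP 1000 + 0  30",
    "SEQ1  330 1030 FRAG2 1000 + 0 700",
    "SEQ1 1030 1060   GAP 1000 + 0  30",
    "SEQ1 1060 1310 FRAG3 1000 + 0 250",
]

smaller = [swap_bed_view(row) for row in larger]
for row in smaller:
    print(row)
```

Applying the function twice returns the original row (tab-separated), which is a quick way to check that no column is lost in the swap.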

Python script converting AGP files to the BED format

The Python script given below implements the conversion of AGP files to the BED format as described previously. The script provides the following options:

  • --gaps: add gaps between smaller sequence objects to a resulting BED file;
  • --smaller: produce a BED file that describes positions of smaller sequence objects on larger ones.
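To make the conversion concrete, here is a minimal sketch of what such a converter might look like. It is not the script referred to above: the function name agp_to_bed and its keyword arguments gaps and smaller are assumptions chosen to mirror the two options. It relies on the AGP column layout (object, object start, object end, part number, component type, then either component ID, start, end and orientation, or gap length for component types N and U) and on the fact that AGP coordinates are 1-based and inclusive while BED coordinates are 0-based and half-open. The GAP name and the score of 1000 follow the example tables in this post; orientations other than "-" are written as "+" here, which is a simplification.

```python
GAP_TYPES = {"N", "U"}  # AGP component types that denote gaps


def agp_to_bed(lines, gaps=False, smaller=False):
    """Yield BED6+2 lines for the given AGP lines (illustrative sketch)."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and AGP comment lines
        f = line.split("\t")
        obj, obj_beg, obj_end, comp_type = f[0], int(f[1]), int(f[2]), f[4]
        if comp_type in GAP_TYPES:
            if not gaps:
                continue
            # For gap lines, column 6 holds the gap length.
            name, comp_beg, comp_end, strand = "GAP", 1, int(f[5]), "+"
        else:
            name, comp_beg, comp_end = f[5], int(f[6]), int(f[7])
            strand = "-" if f[8] == "-" else "+"
        # 1-based inclusive (AGP) -> 0-based half-open (BED).
        larger_part = (obj, obj_beg - 1, obj_end)
        smaller_part = (name, comp_beg - 1, comp_end)
        if smaller:
            first, second = smaller_part, larger_part
        else:
            first, second = larger_part, smaller_part
        row = (first[0], first[1], first[2],
               second[0], 1000, strand, second[1], second[2])
        yield "\t".join(str(x) for x in row)


# The SEQ1 example written as AGP lines (gap type and evidence are made up).
EXAMPLE_AGP = [
    "SEQ1\t1\t300\t1\tW\tFRAG1\t1\t300\t+",
    "SEQ1\t301\t330\t2\tN\t30\tscaffold\tyes\tpaired-ends",
    "SEQ1\t331\t1030\t3\tW\tFRAG2\t1\t700\t+",
    "SEQ1\t1031\t1060\t4\tN\t30\tscaffold\tyes\tpaired-ends",
    "SEQ1\t1061\t1310\t5\tW\tFRAG3\t1\t250\t+",
]

for bed_line in agp_to_bed(EXAMPLE_AGP, gaps=True):
    print(bed_line)
```

With gaps=True the sketch reproduces the larger-object-based table shown earlier; adding smaller=True yields the smaller-object-based one.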

Example: distribution of genes on genome fragments

We demonstrate the AGP-to-BED script by using it to produce the BED file of the GRCh38 human genome assembly structure and estimate the distribution of human genes on the assembly fragments.

Obtaining the BED file of human genes

The BED file of human genes was obtained from Ensembl Biomart in the following way.

  1. The GRCh38.p5 dataset from the Ensembl Genes 84 database was selected;
  2. In the Filter section, genes located on chromosomes and having UniProt/SwissProt accessions were specified;
  3. In the Attributes section, Chromosome Name, Gene Start (bp), Gene End (bp) and Ensembl Gene ID were chosen;
  4. Only unique results were exported.

The exported file was converted to the BED format in the following way:

tail -n +2 mart_export.txt | sed 's/^/chr/' | \
    sort -k1,1 -k2,2n | \
    awk 'BEGIN { OFS="\t" } { $2=$2-1; print $0 }' > hs_genes.bed
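
For readers who prefer Python, the same steps (drop the header, prefix chromosome names with chr, sort, and shift the 1-based start to a 0-based one) can be sketched as follows. This is an illustrative equivalent, not the command actually used; it assumes a tab-separated export with the four attributes in the order listed above, and the function name biomart_to_bed and the gene IDs in the example are made up.

```python
def biomart_to_bed(lines):
    """Convert BioMart export lines (with a header) to sorted BED4 lines."""
    records = []
    for i, line in enumerate(lines):
        if i == 0:
            continue  # skip the header line, like `tail -n +2`
        chrom, start, end, gene_id = line.rstrip("\n").split("\t")[:4]
        # BioMart starts are 1-based; BED starts are 0-based.
        records.append(("chr" + chrom, int(start) - 1, int(end), gene_id))
    # Sort by chromosome name, then numerically by start position.
    records.sort(key=lambda r: (r[0], r[1]))
    return ["\t".join(str(x) for x in r) for r in records]


# Made-up export rows for illustration only.
export = [
    "Chromosome Name\tGene Start (bp)\tGene End (bp)\tEnsembl Gene ID",
    "2\t500\t900\tGENE_B",
    "1\t1000\t2000\tGENE_A",
    "1\t100\t200\tGENE_C",
]

for bed_line in biomart_to_bed(export):
    print(bed_line)
```

Only the start coordinate changes: the end of a 1-based inclusive interval equals the end of the corresponding 0-based half-open interval.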

The resulting file contained 18,951 genes.

Obtaining the BED file of human assembly structure

First, we downloaded AGP files of human chromosomes from NCBI’s FTP server and merged them into a single file:

wget "*.agp.gz"
gzcat *.agp.gz | zfgrep -v '#' | fgrep chr | \
    sort -k1,1 -k2,2n > hs_ref_GRCh38.p7.agp

Next, we applied the script to convert the resulting AGP file to the BED format:

./ hs_ref_GRCh38.p7.agp hs_ref_GRCh38.p7.bed

Estimating distribution of human genes on assembly fragments

Finally, we used bedtools to count, for each gene, the number of fragments it overlaps, and summarized the counts:

bedtools intersect -c -a hs_genes.bed -b hs_ref_GRCh38.p7.bed | \
    cut -f5 | sort -k1,1n | uniq -c > genes_on_fragments.txt
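
The summarizing step (cut, sort and uniq -c) can also be sketched in Python, which is convenient if the counts are to be plotted afterwards. The sketch assumes the input lines are the output of bedtools intersect -c on a four-column gene file, so that the overlap count is in column 5; the function name count_histogram and the example rows are made up.

```python
from collections import Counter


def count_histogram(lines):
    """Map 'number of fragments per gene' to 'number of such genes'."""
    histogram = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip malformed or empty lines
        histogram[int(fields[4])] += 1
    # Return the counts ordered by the number of fragments.
    return dict(sorted(histogram.items()))


# Made-up intersect output for illustration only.
intersect_output = [
    "chr1\t99\t200\tGENE_C\t1",
    "chr1\t999\t2000\tGENE_A\t2",
    "chr2\t499\t900\tGENE_B\t1",
]

for n_fragments, n_genes in count_histogram(intersect_output).items():
    print(n_fragments, n_genes)
```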

The bar plot below shows the distribution of the genes on the fragments.


Distribution of genes on the human assembly fragments.
