Obtaining probability of all variant calls being correct

The VCF format specifies quality scores (QUAL) for each variable position (variant) in a genome. The QUAL value is the Phred quality score for the assertion that alternative bases of a variant are correct, that is, \mathrm{QUAL} = -10 \log_{10} p, where p is the probability that the alternative base calls are wrong. Using the QUAL scores, one may easily calculate the probability that all variant calls in a VCF file are correct.

Here we give an equation for that probability, a Python script that implements it and an example of its usage.

Continue reading

Converting an AGP file to the BED format

The AGP format is used to describe the assembly structure in the NCBI Genome database. Since AGP is a plain-text tabular data format that specifies positions of smaller sequence objects on larger ones (e.g., contigs on scaffolds), AGP files can be converted to the BED format for their further processing.

Continue reading

Obtaining scaffold positions on assembled chromosomes from NCBI Genome

NCBI Genome stores genomic assemblies of numerous species. Besides assembly sequences, it also contains the related auxiliary information, including AGP files that describe how large sequence objects (e.g., chromosomes) were assembled from smaller ones (e.g., scaffolds or contigs).

For some assemblies, their chromosome-from-scaffold AGP files may be missing although the chromosomes were assembled from the scaffolds. In that case, one may reconstruct the AGP file of scaffolds on chromosomes using chromosome-from-components and scaffold-from-components AGP files.

Further we describe how to perform such a reconstruction and present a Python script implementing it.

Continue reading

Filtering noise in LASTZ dot plots

LASTZ, a whole-genome alignment tool, provides an option to produce a dot plot file of the obtained pairwise alignments. Such a file can be visualized in R using its plot function or from the command line using this R script. However, LASTZ dot plots often contain noise that originates from repetitive elements even if the genomes being aligned to each other have been masked.

For example, the dot plot below shows the pairwise alignments between chromosome 1 sequences of the human genome (the GRCh38.p2 assembly) and the chimpanzee genome (the Pan_troglodytes-2.1.4 assembly). Both sequences were masked with RepeatMasker before alignment; LASTZ was launched with the following parameters.

lastz hs_ref_GRCh38.p2_chr1.mfa \
    ptr_ref_Pan_troglodytes-2.1.4_chr1.mfa \
    --nogapped --notransition --step=20 --ambiguous=iupac \
    --format=rdotplot --output=human_chimp_chr1.rdotplot
lastz-alignment-human-chimpanzee-chromosome-1

LASTZ alignments between chromosome 1 sequences of the human and chimpanzee genomes.

Continue reading

Creating GIF animations of protein molecules with PyMOL

PyMOL is an open-source molecular visualization system useful for producing high-quality figures of protein structures. Besides static figures, PyMOL can also generate animations with the mpng command that writes movie frames to separate files. However, mpng provides no options to customize the produced images. Here we describe an appoach to get a customized looped animation in PyMOL and present a Python script implementing it. The script is based on Maximilian Ebert’s solution from the PyMOL mailing list.

Continue reading