LASTZ, a whole-genome alignment tool, can write the pairwise alignments it finds to a dot plot file. Such a file can be visualized in R using its plot function or from the command line using this R script. However, LASTZ dot plots often contain noise that originates from repetitive elements, even when the genomes being aligned have been repeat-masked.
For example, the dot plot below shows the pairwise alignments between chromosome 1 sequences of the human genome (the GRCh38.p2 assembly) and the chimpanzee genome (the Pan_troglodytes-2.1.4 assembly). Both sequences were masked with RepeatMasker before alignment; LASTZ was launched with the following parameters.
lastz hs_ref_GRCh38.p2_chr1.mfa \
    ptr_ref_Pan_troglodytes-2.1.4_chr1.mfa \
    --nogapped --notransition --step=20 --ambiguous=iupac \
    --format=rdotplot --output=human_chimp_chr1.rdotplot
Most of this noise can be removed by filtering the alignments by length. The following Python script implements that kind of filtering for LASTZ dot plot files; note that the script requires NumPy.
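For readers who want to see the idea in code, here is a minimal sketch of length-based filtering. It is not the filter_dotplot.py script itself and uses only the standard library rather than NumPy. It assumes the rdotplot layout is a header row with the two sequence names, followed by groups of coordinate rows (segment start and end), each group closed by an "NA NA" row. The function name filter_rdotplot is hypothetical, and the segment length is approximated by the larger coordinate span along either axis.

```python
import sys


def filter_rdotplot(lines, min_length):
    """Keep the header and every segment whose span exceeds min_length.

    Assumed layout: one header row, then for each segment its coordinate
    rows (start, end) followed by an "NA NA" separator row.
    """
    lines = [line.rstrip("\n") for line in lines]
    if not lines:
        return []
    kept = [lines[0]]  # header row with the two sequence names
    segment = []       # (x, y, original line) rows of the current segment
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == "NA":
            # The separator closes a segment: decide whether to keep it.
            if len(segment) >= 2:
                x1, y1, _ = segment[0]
                x2, y2, _ = segment[-1]
                if max(abs(x2 - x1), abs(y2 - y1)) > min_length:
                    kept.extend(row for _, _, row in segment)
                    kept.append(line)
            segment = []
        else:
            segment.append((int(fields[0]), int(fields[1]), line))
    return kept


# Hypothetical command line: input file, minimum length, output file.
if __name__ == "__main__" and len(sys.argv) == 4:
    with open(sys.argv[1]) as handle:
        result = filter_rdotplot(handle, int(sys.argv[2]))
    with open(sys.argv[3], "w") as handle:
        handle.write("\n".join(result) + "\n")
```

The comparison here is strict (span greater than the threshold); whether the original script treats the boundary the same way is not stated, so take that as an assumption of this sketch.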
So, let’s use the filter_dotplot.py script to keep only the alignments longer than 1 kbp.
./filter_dotplot.py human_chimp_chr1.rdotplot 1000 \
    human_chimp_chr1_filtered.rdotplot
Filtering by alignment length gives the following dot plot, which contains much less noise than the one above.