Fast vector product in MATLAB

The MATLAB function cross implements the vector product for two kinds of input data: a pair of vectors and a pair of N-D arrays. Surprisingly, it works slower for successive computations of cross products for 3-D vector pairs than the following naive implementation.

function C = cross3d(A, B)

C = [A(2)*B(3) - A(3)*B(2), ...
     A(3)*B(1) - A(1)*B(3), ...
     A(1)*B(2) - A(2)*B(1)];


Further, we estimate the median running time of cross and cross3d in a reproducible way using MATLAB scripts.

Continue reading

BED-based format for genotype counts

Being a comprehensive format for storing variant calling data, VCF is superfluous for some kinds of analysis. Although it is a plain-text tabular format, it may take significant time and memory to load a large VCF file into an R or Python session. For that reason, it is usually effective to preprocess a VCF file using a stand-alone tool in order to extract the information required for further analysis.

An example of such a tool is the vcftools package that implements a number of routines for processing VCF files. In particular, vcftools can be used to obtain allele frequencies from a VCF file:

7     45524 2         414   C:414 T:0
7     45569 3         414   G:413 A:1 T:0

Note that the number of columns differs because the second variant has two alternative alleles while the first variant has only one. It may cause problems during further processing of the data, e.g., the R function read.table will fail to read such a file. Also it would be more informative to store genotype counts instead of allele counts. Here we describe the normalized table format for genotype counts and the tool to obtain it from a VCF file.

Continue reading