Optimizing probABEL runs with file vector format

This might be of interest to those of you using the probABEL software for GWA analyses.

I recently discovered you can speed up reading the data into probABEL and also use far less RAM per run by using the “file vector format” of the dosage.

This conversion needs to be done in R on a machine with a fairly large amount of RAM, but it is a one-time process. The details are on the GenABEL/probABEL website, but the site seems to be down at the moment. In summary, you will need to execute the following R code (modify accordingly):

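library(GenABEL)  # provides mach2databel(); the DatABEL package must also be installed
mach2databel("chr22.mldose", "chr22.mlinfo", "chr22")
system("rm chr22_fvtmp.fvd chr22_fvtmp.fvi")  # remove the temporary files left by the conversion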

This should produce chr22.fvi and chr22.fvd. Keep both in the same directory. Then all you have to do is pass the fvi file to --dose instead of the mldose file in the call. Everything else remains unchanged. E.g.

palinear --dose chr22.fvi --info chr22.info --pheno mypheno.txt --robust --out res
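If you have converted every chromosome separately, a small shell loop can generate one such call per chromosome. This is only a sketch: it assumes the files are named chrN.fvi and chrN.info like the chr22 example, and the per-chromosome output prefix res_chrN is my own invention. It prints the commands rather than running them, so you can check them first.

```shell
#!/bin/sh
# Sketch: print one palinear command per chromosome (dry run).
# Assumptions (not from the original example): files are named
# chrN.fvi / chrN.info, and results go to res_chrN.
build_palinear_cmd() {
    chr="$1"
    pheno="$2"
    printf 'palinear --dose chr%s.fvi --info chr%s.info --pheno %s --robust --out res_chr%s\n' \
        "$chr" "$chr" "$pheno" "$chr"
}

# Print the commands for chromosomes 1 to 22 without executing them.
chr=1
while [ "$chr" -le 22 ]; do
    build_palinear_cmd "$chr" mypheno.txt
    chr=$((chr + 1))
done
```

Once the printed commands look right, you can pipe the script's output to sh to actually run them.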

Note that a) chr22.info is the original info file and b) mypheno.txt needs to be in the same order as the dosage file.
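Point b) is easy to get wrong, so it may be worth checking before a long run. The sketch below compares the sample IDs in the two files. It rests on assumptions about your files that you should verify: the mldose file has the sample ID in its first column, possibly with a "FAMILY->" prefix, and the phenotype file has one header line with the ID in its first column. The function name check_order is made up for this example.

```shell
#!/bin/sh
# Sketch: check that the phenotype file lists samples in the same order
# as the dosage file.
# Assumptions (verify against your own files):
#   - mldose: first whitespace-separated column is the sample ID,
#     optionally prefixed with "FAMILY->"
#   - pheno: one header line, sample ID in the first column
check_order() {
    dose="$1"
    pheno="$2"
    # Keep only the first column of each dosage line, then drop any "...->" prefix.
    dose_ids=$(sed -e 's/[[:space:]].*$//' -e 's/^.*->//' "$dose")
    # Skip the header line and take the first column of the phenotype file.
    pheno_ids=$(awk 'NR > 1 { print $1 }' "$pheno")
    if [ "$dose_ids" = "$pheno_ids" ]; then
        echo "OK: sample order matches"
    else
        echo "MISMATCH: sample order differs"
        return 1
    fi
}

# Example call (file names from the post):
# check_order chr22.mldose mypheno.txt
```

Run it against the original mldose file, since the fvi/fvd pair is a binary format and cannot be read this way.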

Here is my example: 6 million SNPs (all chromosomes lumped together) and 134 individuals, which is comparable in size to 200,000 SNPs x 4,000 individuals. The conversion in R took about 45 min and a lot of memory (I can’t remember how much). I ran palinear with no covariates, and here are the results:
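The "comparable" claim is just the number of dosage values (SNPs times individuals) in each case, which is quick to verify with shell arithmetic:

```shell
#!/bin/sh
# Number of dosage values = SNPs x individuals.
mine=$((6000000 * 134))   # my data set
ref=$((200000 * 4000))    # a more typical GWAS-sized data set
echo "6,000,000 SNPs x 134 individuals = $mine dosage values"
echo "200,000 SNPs x 4,000 individuals = $ref dosage values"
```

Both come out at roughly 800 million values. The shape of the matrix differs a lot, though, so timings on your own data may not scale in exactly the same way.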

File vector format : 6 min & 1.6 GB RAM
mldose format : 9 min & 4.5 GB RAM

You might find this useful if you have lots of GWAs to run (my current job is an eQTL project where I have to run 12 million GWAs) or have memory restrictions. It is worth investigating, as the conversion is a one-time process. This might become more important as more and more groups update their GWAS data to 1000G imputation, where the number of SNPs can double or even triple compared to HapMap.


