A quick tutorial on using MatrixEQTL

Andrey has many useful examples/tutorials on his MatrixEQTL page. Here I am going to demonstrate how to use MatrixEQTL using data from the UK Brain Expression Consortium (UKBEC). I have chosen the MAPT gene which is important for various neurodegenerative disorders. The data was downloaded from the BRAINEAC website (link to come) and can be downloaded from here. This consortium collected DNA and RNA from ten brain regions in 134 neuropathologically normal individuals. Additionally, the mean expression profile for each individual across tissues (aveALL) was calculated.

Let us first load the R data and inspect the format. You may need to change working directory using the setwd() command first or supply the full path.

This particular dataset contains information for the 26 expression profiles: one representing the gene-level profile and prefixed with “t” plus 25 representing the exon-level profiles. It also contains the information for 5,287 markers (SNPs and indels) located within 1Mb of transcription start and end sites.

load("t3723687.rda")

#####################
## Expression data ##
#####################

length(expr)
 [1] 11

names(expr)
 [1] "aveALL" "CRBL" "FCTX" "HIPP" "MEDU" "OCTX" "PUTM" "SNIG" "TCTX" "THAL" "WHMT"

dim(expr[["TCTX"]])
 [1]  26 134

head(expr[["TCTX"]])[ , 1:8]
          00_38    01_37    02_06   03_28    04_05    04_19   05_10   06_02
3723688 9.63145  9.22946  9.47970 9.47775  9.24175  9.74315 9.46278 9.76640
3723690 6.03620  5.55045  5.08430 5.09810  6.67395  4.95544 5.56240 4.67415
3723691 7.41365  7.11841  7.16574 6.39494  7.58188  6.28772 6.93655 6.34788
3723707 9.96076 10.13550 10.21800 9.70254 10.54540 10.34620 9.66656 9.74161
3723710 9.31395  9.44499  9.31668 8.51235 10.07000  9.95134 9.41852 8.65756
3723712 5.60436  5.21131  5.89260 6.17088  5.09820  6.14388 5.83319 5.94182

tail(expr[["TCTX"]])[ , 1:8]
           00_38   01_37   02_06   03_28    04_05   04_19   05_10   06_02
3723750  8.42139 8.72951 8.69241 9.06205  9.37904 9.33639 8.59345 9.15941
3723751  8.63697 8.00765 8.81346 8.64289  9.52436 8.87514 8.50429 8.56228
3723752  8.93384 8.19434 8.14235 8.14987 10.02850 9.10575 8.05589 7.93080
3723753  9.06106 9.26484 9.20654 9.38277  9.73640 9.61577 9.15330 9.17392
3723754  8.63553 8.80820 9.22138 8.64040  8.73758 9.45185 8.83666 8.96825
t3723687 8.39886 8.36485 8.36244 8.29161  8.91036 8.64564 8.28501 8.19473

dim(expr.info)
[1] 26  4

head(expr.info)
          chr    start     stop     tID
3723688 chr17 43971878 43972051 3723687
3723690 chr17 43973814 43974885 3723687
3723691 chr17 43974897 43975038 3723687
3723707 chr17 44039714 44039808 3723687
3723710 chr17 44049243 44049311 3723687
3723712 chr17 44051752 44051833 3723687

#################
## Marker data ##
#################

dim(markers)
 [1] 5287  134

head(markers)[ , 1:8]
                     00_38 01_37 02_06 03_28 04_05 04_19 05_10 06_02
chr17:42972122           2 2.000     2 2.000     2 1.417 1.927     2
chr17:42972206           2 2.000     2 2.000     2 1.000 2.000     1
chr17:42972490:GGA_G     2 2.000     2 2.000     2 1.000 1.000     2
chr17:42972648           2 2.000     2 1.000     2 2.000 2.000     2
chr17:42972690           1 1.999     0 1.996     0 2.000 1.000     1
chr17:42972960           1 2.000     0 2.000     0 2.000 1.000     1

dim(markers.info)
[1] 5287    8

head(markers.info)
                     Al1 Al2   Freq1     MAF AvgCall     Rsq       rsid  type
chr17:42972122         G   A 0.88433 0.11567 0.97738 0.84068 rs75707322   SNP
chr17:42972206         G   A 0.88433 0.11567 1.00000 1.00370  rs8069039   SNP
chr17:42972490:GGA_G   R   D 0.79364 0.20636 0.99882 0.99709        indel
chr17:42972648         G   C 0.89551 0.10449 0.99997 1.00342 rs71373525   SNP
chr17:42972690         C   G 0.62428 0.37572 0.98117 0.94353  rs8069754   SNP
chr17:42972960         G   A 0.64327 0.35673 0.99849 0.99826  rs2878972   SNP

Before we proceed, first of all note that some tissue is missing for some individuals as the RNA was not available or low quality. Using temporal cortex (TCTX) as example, we see that 15 individuals are missing for this tissue.

table( w <- colMeans(is.na(expr[["TCTX"]])) )
   0   1
 119  15

length( valids <- names( which(w==0) ) )
 [1] 119

head(valids)
 [1] "00_38" "01_37" "02_06" "03_28" "04_05" "04_19" ...

Note that MatrixEQTL imputes the missing values before doing the calculations. This is not idea as our values are not missing at random. So we will need to manually exclude these individuals when we convert these datasets into MatrixEQTL format.

library(MatrixEQTL)   # version 2.0.0 - check yours with sessionInfo()

my.expr <- SlicedData$new()
my.expr$CreateFromMatrix( as.matrix( expr[["TCTX"]][ , valids ] ) )

my.markers <- SlicedData$new()
my.markers$CreateFromMatrix( as.matrix( markers[ , valids ] ) )

It is important to ensure the datasets are in the same order of subjects before we proceed to execute the MatrixEQTL

identical( colnames(my.expr), colnames(my.markers) )
 [1] TRUE

tmpf <- tempfile()

res <- Matrix_eQTL_main( my.markers, my.expr, cvrt=SlicedData$new(),
                         output_file_name=tmpf, pvOutputThreshold=1e-5,
                         useModel=modelLINEAR, errorCovariance=numeric(0) )
unlink(tmpf)

Let us now extract and format the results a bit. There are 9,412 significant cis-eQTLs (at p < 1e-5) identified within 0.07 seconds and here are the top 6.

out &lt;- res$all$eqtls
out &lt;- out[ , c("snps", "gene", "beta", "statistic", "pvalue")]
colnames(out) &lt;- c("marker", "exprID", "beta", "t-statistic", "p-value")
dim(out)
 [1] 9412    5

res$time.in.sec
 elapsed
    0.07

head(out)
           marker  exprID       beta t-statistic      p-value
 1 chr17:44061025 3723751  0.3759085    5.761778 6.865336e-08
 2 chr17:44067382 3723751  0.2699755    5.349497 4.439403e-07
 3 chr17:44060492 3723751  0.5931873    5.335221 4.729509e-07
 4 chr17:44762589 3723707 -0.1788872   -5.213412 8.085859e-07
 5 chr17:44066464 3723751  0.2720258    5.128588 1.169849e-06
 6 chr17:44059497 3723751  0.2703483    5.121740 1.205076e-06

Now let us try it the traditional approach.

ptm &lt;- proc.time()

TCTX &lt;- expr[["TCTX"]]

big.out &lt;- NULL

for(i in 1:nrow(TCTX)){

 print(i)
 my.expr &lt;- as.numeric( TCTX[i, ] )

 for(j in 1:nrow(markers)){
   fit &lt;- lm( my.expr ~ as.numeric(markers[j, ]) )
   out &lt;- coef( summary(fit) )[ 2, c("Estimate", "Std. Error", "Pr(&gt;|t|)")]

   if( out[ "Pr(&gt;|t|)" ] &lt; 1e-5 ){
     out &lt;- c( rownames(markers)[j], rownames(TCTX)[i], out )
     big.out &lt;- rbind(big.out, out)
   }

  rm(j, fit, out)
 }
 rm(i, my.expr)
}

colnames(big.out) &lt;- c("marker", "exprID", "Estimate", "SE", "pvalue")
big.out &lt;- big.out[ order(as.numeric(big.out[ , "pvalue"])), ]
rownames(big.out) &lt;- NULL

proc.time() - ptm
   user  system elapsed
 522.83    0.06  528.19

Wow, that took over 8 minutes to complete vs. 0.07 seconds for MatrixEQTL. Sure, I can optimize the code above using apply() etc but it still isn’t as fast as MatrixEQTL.

More importantly, the results the same.

dim(big.out)
 [1] 9412    5

head(big.out)
      marker           exprID    Estimate             SE                   pvalue
 [1,] "chr17:44061025" "3723751" "0.375908482121651"  "0.0652417514030518" "6.86533571244123e-08"
 [2,] "chr17:44067382" "3723751" "0.269975506772536"  "0.0504674529615548" "4.43940323821489e-07"
 [3,] "chr17:44060492" "3723751" "0.593187349664759"  "0.111183272810956"  "4.72950945585888e-07"
 [4,] "chr17:44762589" "3723707" "-0.178887231952771" "0.0343128918690251" "8.08585850844625e-07"
 [5,] "chr17:44066464" "3723751" "0.272025753138588"  "0.0530410575681518" "1.16984903179711e-06"
 [6,] "chr17:44059497" "3723751" "0.270348263447638"  "0.052784452164607"  "1.20507597159997e-06"

4 comments

  1. Hi, great post but I’m confused as to how you actually downloaded the braineac data and loaded it into R. Can you give a little more detail about that please? Thanks!

  2. Hi, Adai! Thank you very much for your nice post. I am not able to download the test data. Is it possible for you to make them available again?

  3. Pingback: Breathing Labs – MatrixEQTL: Inconsistent Results to lm()


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s