Two great posts on the R pipe operator

https://www.r-bloggers.com/getting-started-with-dplyr-in-r-using-titanic-dataset/

https://www.r-statistics.com/2014/08/simpler-r-coding-with-pipes-the-present-and-future-of-the-magrittr-package/

Advertisements

How to merge multiple datasets in R based on row names?

Here is a common everyday challenge. How does one combine several datasets using rownames as the key? This is known as recursive or repeated merging in R and a full outer join in SQL speak.

Let me generate some simple dataset first:

df1 <- data.frame( x = 1:5,        y = 20 + 1:5,   row.names=letters[1:5] )
df2 <- data.frame( x = 100 + 4:6,  z = 200 + 4:6,  row.names=letters[4:6] )
df3 <- data.frame( x = 1000 + 5:8, z = 2000 + 5:8, row.names=letters[5:8] )

df1
#   x  y
# a 1 21
# b 2 22
# c 3 23
# d 4 24
# e 5 25

df2
#     x   z
# d 104 204
# e 105 205
# f 106 206

df3
#      x    z
# e 1005 2005
# f 1006 2006
# g 1007 2007
# h 1008 2008

First option is to manually merge the dataset by hand as follows. This advantage of this method is that it is clear and easy to double check. The disadvantage is the length codes and intermediate objects which becomes difficult to track with increasing number of datasets. I seen the use of Reduce() function in some forum threads that might help automate this.

tmp12 <- merge(df1, df2, by=0, all=T)
rownames(tmp12) <- tmp12$Row.names; tmp12$Row.names <- NULL

tmp123 <- merge(tmp12, df3, by=0, all=T)
rownames(tmp123) <- tmp123$Row.names; tmp123$Row.names <- NULL
tmp123
#   x.x  y x.y z.x    x  z.y
# a   1 21  NA  NA   NA   NA
# b   2 22  NA  NA   NA   NA
# c   3 23  NA  NA   NA   NA
# d   4 24 104 204   NA   NA
# e   5 25 105 205 1005 2005
# f  NA NA 106 206 1006 2006
# g  NA NA  NA  NA 1007 2007
# h  NA NA  NA  NA 1008 2008

rm(tmp12, tmp123)

Second option is to use the join_all() function from the plyr package. I am new to this function and naively thought something like the following will do the job:

library(plyr)
mylist <- list( one=df1, two=df2, three=df3 )
join_all( mylist, type="full" )  # not what I want!
# Joining by: x
# Joining by: x, z
#       x  y    z
# 1     1 21   NA
# 2     2 22   NA
# 3     3 23   NA
# 4     4 24   NA
# 5     5 25   NA
# 6   104 NA  204
# 7   105 NA  205
# 8   106 NA  206
# 9  1005 NA 2005
# 10 1006 NA 2006
# 11 1007 NA 2007
# 12 1008 NA 2008

A few extra lines of coding to make the column names unique and adding the rownames as a column seems to solve this problem. However, I have not tested this widely.

for(i in 1:length(mylist)){
  colnames(mylist[[i]]) <- paste0( names(mylist)[i], "_", colnames(mylist[[i]]) )
  mylist[[i]]$ROWNAMES  <- rownames(mylist[[i]])
}

out <- join_all( mylist, by="ROWNAMES", type="full" )
rownames(out) <- out$ROWNAMES; out$ROWNAMES <- NULL
#   one_x one_y two_x two_z three_x three_z
# a     1    21    NA    NA      NA      NA
# b     2    22    NA    NA      NA      NA
# c     3    23    NA    NA      NA      NA
# d     4    24   104   204      NA      NA
# e     5    25   105   205    1005    2005
# f    NA    NA   106   206    1006    2006
# g    NA    NA    NA    NA    1007    2007
# h    NA    NA    NA    NA    1008    2008

rm(out, mylist)

Third option is my own custom code which I used successfully for years and is pretty fast for very large datasets. Here is a pseudo-algorithm

  1. Generate the universe of IDs across all datasets
  2. Pads each dataset with the extra rows for missing IDs (all datasets have same number of rows)
  3. Sort every dataset (all datasets have same order)
  4. Row bind the datasets
multimerge <- function (mylist) {
  ## mimics a recursive merge or full outer join

  unames <- unique(unlist(lapply(mylist, rownames)))

  n <- length(unames)

  out <- lapply(mylist, function(df) {

    tmp <- matrix(nr = n, nc = ncol(df), dimnames = list(unames,colnames(df)))
    tmp[rownames(df), ] <- as.matrix(df)
    rm(df); gc()

    return(tmp)
  })

  stopifnot( all( sapply(out, function(x) identical(rownames(x), unames)) ) )

  bigout <- do.call(cbind, out)
  colnames(bigout) <- paste(rep(names(mylist), sapply(mylist, ncol)), unlist(sapply(mylist, colnames)), sep = "_")
  return(bigout)
}

multimerge( list (one=df1, two=df2, three=df3 ) )
#   one_x one_y two_x two_z three_x three_z
# a     1    21    NA    NA      NA      NA
# b     2    22    NA    NA      NA      NA
# c     3    23    NA    NA      NA      NA
# d     4    24   104   204      NA      NA
# e     5    25   105   205    1005    2005
# f    NA    NA   106   206    1006    2006
# g    NA    NA    NA    NA    1007    2007
# h    NA    NA    NA    NA    1008    2008

As a bonus, my codes is also ignorant of the column names which can be handy if every column is unique across the datasets. The column orders are preserved.

out <- multimerge( list (df1, df2, df3 ) )
colnames(out) <- gsub("^_", "", colnames(out))
out
#    x  y   x   z    x    z
# a  1 21  NA  NA   NA   NA
# b  2 22  NA  NA   NA   NA
# c  3 23  NA  NA   NA   NA
# d  4 24 104 204   NA   NA
# e  5 25 105 205 1005 2005
# f NA NA 106 206 1006 2006
# g NA NA  NA  NA 1007 2007
# h NA NA  NA  NA 1008 2008

Download 1000 Genomes Phase3 and calculate allele frequencies

Here are some codes to download the data from the 1000 Genomes Phase 3 website into your own server and calculating the allele frequencies for the European populations. Here are some setup codes. The panel file tells you which population and super-population each sample belongs to.

FTP_SITE=ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502
wget $FTP_SITE/integrated_call_samples_v3.20130502.ALL.panel

Next we will download each chromosome (I am ignoring the Y and MT chromosomes here). Alternatively, you can download all the files on the FTP site using wget -r $FTP_SITE but I preferred to download each one separately. Note the Chromosome X is based on version 1b and not 5a like the autosomes, so has to be downloaded separately. I am renaming it for convenience later.

for CHR in `​seq 1 22`; do
   FILE=$FTP_SITE/ALL.chr$CHR.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
   wget $FILE $FILE.tbi
   sleep 60
done

FILE=$FTP_SITE/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz
wget $FILE $FILE.tbi
rename -v 's/v1b/v5a/' *

The next step is to identify the Europeans and calculate the allele frequencies. I am also ignoring any variants that have a call rate < 95% and fails Hardy-Weinberg Equilibrium at p < 10-6. You can do this with vcftools or other softwares but I am most familiar with PLINK. And if you are using PLINK, please use version 1.90 (https://www.cog-genomics.org/plink2) and above as it is much faster than previous versions.

grep EUR integrated_call_samples_v3.20130502.ALL.panel | cut -f1 > EUR.id  ## 503 Europeans

for CHR in `seq 1 23  | sed 's/23/X/'`​; do
   FILE=ALL.chr$CHR.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
   plink --vcf $FILE --keep-fam EUR.id --geno 0.05 --hwe 1e-6 --make-bed --out tmp
   plink --bfile tmp --freq --out EUR_$CHR rm EUR_$CHR.{log,nosex} tmp.*
done

awk 'FNR==1 && NR!=1{next;}{print}' *.frq | cut -f1-4 > EUR.freq
rm EUR_*.frq

You may also find pre-computed allele frequencies per population but my purpose here is how to download and run some calculations on the VCF files.

Leading edge subset from Gene Set Enrichment Analysis (GSEA) using blood transcription modules

Recently, I have been taken up very much by the modular analysis developed by Damien Chaussabel and colleagues (see original paper or their recent review paper) for analysis of whole blood data.

A recent R package called tmod from Dr. January Weiner provides a very easy and elegant implementation of this concept and data. The vignette is well written and the author is very responsive to requests and suggestions. Currently, the evidence plot is based on the receiver operating curve. The next release will contain an option to produce an evidence plot that is the equivalent to enrichment score plot of the popular GSEA software from the Broad Institute.

I tweaked this function a bit to allow for different weighting of the genes in the module and to allow it to return the indices of the leading edge subset (i.e. core enrichment genes) which can be useful to generate heatmaps etc. tmod’s evidencePlot has the advantage of not requiring the logFC as an additional parameter.

getLeadingEdgeSubset <- function(symbol, lfc, mod, alpha=1, plot=T){

  stopifnot( length(symbol) == length(lfc) )

  genes <- getGenes(mod, mset="all")$Genes
  genes <- unique(unlist(strsplit(genes, split=",")))
  inMod <- sapply(symbol, function(x) any(x %in% genes))  ## "symbol" can be a list to allow multi-mapping probes

  N     <- length(symbol)
  N.hit <- sum(inMod)

  runningES <- NA
  runningES[  inMod ] <-  abs( lfc[ inMod ]^alpha )/sum(abs(lfc[ inMod ]^alpha))
  runningES[ !inMod ] <-  -1/(  N - N.hit )
  runningES <- cumsum(runningES)

  rank.at.max <- which.max(runningES)
  leadingEdge <- which(inMod[ 1:rank.at.max ])

 if(plot){

    data(tmod)
    title.main <- paste0(tmod$MODULES[ mod, "Title" ], " (", mod, ")", collapse="; ")

    plot(1:N, runningES, type="l",
         ylab="Enrichment Score (ES)", xlab="Ordered rank of genes", main=title.main)

    mtext(side=3, at=which(inMod), text="|", line=-1, cex=0.5)

    arrows(rank.at.max, 0, rank.at.max, max(runningES), length=0, col="blue")

    mtext(side=3, text=paste0("# genes in module:", length(genes), "                                       ",
                              "# genes found:", length(unique(unlist(symbol[inMod]))), " (", N.hit, " probes)\n"), line=-1)

    legend("bottom", bty="n",
           legend=paste0("Rank at max:", rank.at.max, "\n# probes in leading edge subset:", length(leadingEdge)))
  }

  return(leadingEdge)
}

Here is a simple example taken from the tmod vignette:

library(limma)
library(tmod)

data(Egambia)
design <- cbind(Intercept=rep(1, 30), TB=rep(c(0,1), each= 15))
E <- as.matrix(Egambia[,-c(1:3)])
fit <- eBayes( lmFit(E, design))
tt <- topTable(fit, coef=2, number=Inf, genelist=Egambia[,1:3] )
rm(design, E, fit)

evidencePlot(tt$GENE_SYMBOL, "LI.M37.0")  ## from tmod version 0.24 onwards

Rplot

core <- getLeadingEdgeSubset(tt$GENE_SYMBOL, tt$logFC, "LI.M37.0", alpha=1)

Rplot04

 

I set alpha=1 as default to match Broad Institute’s default value. Setting alpha=0 will produce exactly the same graph as evidencePlot from tmod. Setting the alpha=2 will overweight the genes in the module and sometimes useful for finding smaller set of core genes.

You can then take on these indices (“w”), pull out the genes and visualize them as such

top <- Egambia[ core, -c(1:3)]
rownames(top) <- paste0( tt$GENE_SYMBOL[ core ], " (", core, ")")
status <- data.frame(status=ifelse( grepl("TB", colnames(top)), "TB", "Healthy"))

library(NMF)
library(gplots)

aheatmap(top, annCol=status,
         color=bluered(10), Rowv=NA, Colv=NA, scale="row",
         main="leading edge subset from module LI.M37.0")

 

Rplot05

How polymorphism(s) in probe sequences affect cis-eQTL results

In our paper, we show that the presence of polymorphisms (SNPs and indels) in probe sequences can induce a serious technical artefact (through reduced hybridization efficiency) when conducting expression QTL (eQTL) studies, especially if you are using array based technologies. You can find the paper here: http://nar.oxfordjournals.org/content/early/2013/02/21/nar.gkt069

But for all of you busy people, the main result is summarized in Figure 1 (also attached below).

Ramasamy_NAR2012

Quick summary:

  • About 6.1% of the probes (25-mers) in Affymetrix Human Exon 1.0 array contains polymorphisms but they account 50 – 90% of cis-eQTLs
  • About 11.7% of the probes (50-mers) in Illumina HT12 array contains polymorphisms but they account for 30 – 45% of the cis-eQTLs
  • The binding efficiency of longer probes are less affected by polymorphism (as expected) but still > 30% of cis-eQTLs are false!
  • The 1000G dataset appears to be good enough to identify probes containing polymorphisms. The only way to possibly improve on this is to look at private mutations from exome-seq etc but may not be worth the effort.
  • Increasing p-value stringency seems to make situation worse (cis-eQTLs due to polymorphisms are very strong) so many published cis-eQTLs may suffer from this (especially if the authors have not done this or used an older reference panel to identify polymorphisms)

A quick tutorial on using MatrixEQTL

Andrey has many useful examples/tutorials on his MatrixEQTL page. Here I am going to demonstrate how to use MatrixEQTL using data from the UK Brain Expression Consortium (UKBEC). I have chosen the MAPT gene which is important for various neurodegenerative disorders. The data was downloaded from the BRAINEAC website (link to come) and can be downloaded from here. This consortium collected DNA and RNA from ten brain regions in 134 neuropathologically normal individuals. Additionally, the mean expression profile for each individual across tissues (aveALL) was calculated.

Let us first load the R data and inspect the format. You may need to change working directory using the setwd() command first or supply the full path.

This particular dataset contains information for the 26 expression profiles: one representing the gene-level profile and prefixed with “t” plus 25 representing the exon-level profiles. It also contains the information for 5,287 markers (SNPs and indels) located within 1Mb of transcription start and end sites.

load("t3723687.rda")

#####################
## Expression data ##
#####################

length(expr)
 [1] 11

names(expr)
 [1] "aveALL" "CRBL" "FCTX" "HIPP" "MEDU" "OCTX" "PUTM" "SNIG" "TCTX" "THAL" "WHMT"

dim(expr[["TCTX"]])
 [1]  26 134

head(expr[["TCTX"]])[ , 1:8]
          00_38    01_37    02_06   03_28    04_05    04_19   05_10   06_02
3723688 9.63145  9.22946  9.47970 9.47775  9.24175  9.74315 9.46278 9.76640
3723690 6.03620  5.55045  5.08430 5.09810  6.67395  4.95544 5.56240 4.67415
3723691 7.41365  7.11841  7.16574 6.39494  7.58188  6.28772 6.93655 6.34788
3723707 9.96076 10.13550 10.21800 9.70254 10.54540 10.34620 9.66656 9.74161
3723710 9.31395  9.44499  9.31668 8.51235 10.07000  9.95134 9.41852 8.65756
3723712 5.60436  5.21131  5.89260 6.17088  5.09820  6.14388 5.83319 5.94182

tail(expr[["TCTX"]])[ , 1:8]
           00_38   01_37   02_06   03_28    04_05   04_19   05_10   06_02
3723750  8.42139 8.72951 8.69241 9.06205  9.37904 9.33639 8.59345 9.15941
3723751  8.63697 8.00765 8.81346 8.64289  9.52436 8.87514 8.50429 8.56228
3723752  8.93384 8.19434 8.14235 8.14987 10.02850 9.10575 8.05589 7.93080
3723753  9.06106 9.26484 9.20654 9.38277  9.73640 9.61577 9.15330 9.17392
3723754  8.63553 8.80820 9.22138 8.64040  8.73758 9.45185 8.83666 8.96825
t3723687 8.39886 8.36485 8.36244 8.29161  8.91036 8.64564 8.28501 8.19473

dim(expr.info)
[1] 26  4

head(expr.info)
          chr    start     stop     tID
3723688 chr17 43971878 43972051 3723687
3723690 chr17 43973814 43974885 3723687
3723691 chr17 43974897 43975038 3723687
3723707 chr17 44039714 44039808 3723687
3723710 chr17 44049243 44049311 3723687
3723712 chr17 44051752 44051833 3723687

#################
## Marker data ##
#################

dim(markers)
 [1] 5287  134

head(markers)[ , 1:8]
                     00_38 01_37 02_06 03_28 04_05 04_19 05_10 06_02
chr17:42972122           2 2.000     2 2.000     2 1.417 1.927     2
chr17:42972206           2 2.000     2 2.000     2 1.000 2.000     1
chr17:42972490:GGA_G     2 2.000     2 2.000     2 1.000 1.000     2
chr17:42972648           2 2.000     2 1.000     2 2.000 2.000     2
chr17:42972690           1 1.999     0 1.996     0 2.000 1.000     1
chr17:42972960           1 2.000     0 2.000     0 2.000 1.000     1

dim(markers.info)
[1] 5287    8

head(markers.info)
                     Al1 Al2   Freq1     MAF AvgCall     Rsq       rsid  type
chr17:42972122         G   A 0.88433 0.11567 0.97738 0.84068 rs75707322   SNP
chr17:42972206         G   A 0.88433 0.11567 1.00000 1.00370  rs8069039   SNP
chr17:42972490:GGA_G   R   D 0.79364 0.20636 0.99882 0.99709        indel
chr17:42972648         G   C 0.89551 0.10449 0.99997 1.00342 rs71373525   SNP
chr17:42972690         C   G 0.62428 0.37572 0.98117 0.94353  rs8069754   SNP
chr17:42972960         G   A 0.64327 0.35673 0.99849 0.99826  rs2878972   SNP

Before we proceed, first of all note that some tissue is missing for some individuals as the RNA was not available or low quality. Using temporal cortex (TCTX) as example, we see that 15 individuals are missing for this tissue.

table( w <- colMeans(is.na(expr[["TCTX"]])) )	
   0   1 
 119  15

length( valids <- names( which(w==0) ) )
 [1] 119

head(valids)
 [1] "00_38" "01_37" "02_06" "03_28" "04_05" "04_19" ...

Note that MatrixEQTL imputes the missing values before doing the calculations. This is not idea as our values are not missing at random. So we will need to manually exclude these individuals when we convert these datasets into MatrixEQTL format.

library(MatrixEQTL)   # version 2.0.0 - check yours with sessionInfo()

my.expr <- SlicedData$new()
my.expr$CreateFromMatrix( as.matrix( expr[["TCTX"]][ , valids ] ) )
 
my.markers <- SlicedData$new()
my.markers$CreateFromMatrix( as.matrix( markers[ , valids ] ) )

It is important to ensure the datasets are in the same order of subjects before we proceed to execute the MatrixEQTL

identical( colnames(my.expr), colnames(my.markers) )
 [1] TRUE

tmpf <- tempfile()
 
res <- Matrix_eQTL_main( my.markers, my.expr, cvrt=SlicedData$new(),
                         output_file_name=tmpf, pvOutputThreshold=1e-5,
                         useModel=modelLINEAR, errorCovariance=numeric(0) )
unlink(tmpf)

Let us now extract and format the results a bit. There are 9,412 significant cis-eQTLs (at p < 1e-5) identified within 0.07 seconds and here are the top 6.

out <- res$all$eqtls
out <- out[ , c("snps", "gene", "beta", "statistic", "pvalue")]
colnames(out) <- c("marker", "exprID", "beta", "t-statistic", "p-value")
dim(out)
 [1] 9412    5

res$time.in.sec
 elapsed
    0.07

head(out)
           marker  exprID       beta t-statistic      p-value
 1 chr17:44061025 3723751  0.3759085    5.761778 6.865336e-08
 2 chr17:44067382 3723751  0.2699755    5.349497 4.439403e-07
 3 chr17:44060492 3723751  0.5931873    5.335221 4.729509e-07
 4 chr17:44762589 3723707 -0.1788872   -5.213412 8.085859e-07
 5 chr17:44066464 3723751  0.2720258    5.128588 1.169849e-06
 6 chr17:44059497 3723751  0.2703483    5.121740 1.205076e-06

Now let us try it the traditional approach.

ptm <- proc.time()

TCTX <- expr[["TCTX"]]

big.out <- NULL

for(i in 1:nrow(TCTX)){

 print(i)
 my.expr <- as.numeric( TCTX[i, ] )

 for(j in 1:nrow(markers)){
   fit <- lm( my.expr ~ as.numeric(markers[j, ]) )
   out <- coef( summary(fit) )[ 2, c("Estimate", "Std. Error", "Pr(>|t|)")]

   if( out[ "Pr(>|t|)" ] < 1e-5 ){
     out <- c( rownames(markers)[j], rownames(TCTX)[i], out )
     big.out <- rbind(big.out)
   }
 
  rm(j, fit, out)
 }
 rm(i, my.expr)
}

colnames(big.out) <- c("marker", "exprID", "Estimate", "SE", "pvalue")
big.out <- big.out[ order(as.numeric(big.out[ , "pvalue"])), ]
rownames(big.out) <- NULL

proc.time() - ptm
   user  system elapsed 
 522.83    0.06  528.19 

Wow, that took over 8 minutes to complete vs. 0.07 seconds for MatrixEQTL. Sure, I can optimize the code above using apply() etc but it still isn’t as fast as MatrixEQTL.

More importantly, the results the same.

dim(big.out)
 [1] 9412    5

head(big.out)
      marker           exprID    Estimate             SE                   pvalue
 [1,] "chr17:44061025" "3723751" "0.375908482121651"  "0.0652417514030518" "6.86533571244123e-08"
 [2,] "chr17:44067382" "3723751" "0.269975506772536"  "0.0504674529615548" "4.43940323821489e-07"
 [3,] "chr17:44060492" "3723751" "0.593187349664759"  "0.111183272810956"  "4.72950945585888e-07"
 [4,] "chr17:44762589" "3723707" "-0.178887231952771" "0.0343128918690251" "8.08585850844625e-07"
 [5,] "chr17:44066464" "3723751" "0.272025753138588"  "0.0530410575681518" "1.16984903179711e-06"
 [6,] "chr17:44059497" "3723751" "0.270348263447638"  "0.052784452164607"  "1.20507597159997e-06"

SNP vs. genotype definition

Recently, a friend asked for the difference between a SNP and genotype. After some babbling on my part, here is a summary.

SNP Positions of the genome that have a mutation
Genotype Alleles of at a given SNP

Typically we only call a mutation if is seen above a certain frequency (typically > 0.1%) across the population. Otherwise, these are known as private mutations.

Confused? Here is a simple analogy. A metabolic disease called diabetes exists but not everyone has diabetes. A medical diagnosis is required to determine the diabetic status of an individual.

Similarly, genotyping is the process of determining the SNP alleles. Here is example for three individuals:

Position Individual 1 Individual 2 Individual 3
1 A/A A/A

A/A

2

C/C

C/C

C/C

3

G/G

G/G

G/G

4

C/T

T/T

C/C

Nucleotides in positions 1 – 3 are identical across all individuals and therefore not considered a SNP in this sample. The nucleotide in position 4 is a SNP. The genotype for individual 001 at position 4 is C/T.

More generally, in the human genome, there are approximately 40 million SNPs which represent about 1.25% of the genome. In other words, the genomes of two unrelated individuals are identical at around 99% of the nucleotides.