How to merge multiple datasets in R based on row names?

Here is a common everyday challenge. How does one combine several datasets using rownames as the key? This is known as recursive or repeated merging in R and a full outer join in SQL speak.

Let me generate some simple dataset first:

df1 <- data.frame( x = 1:5,        y = 20 + 1:5,   row.names=letters[1:5] )
df2 <- data.frame( x = 100 + 4:6,  z = 200 + 4:6,  row.names=letters[4:6] )
df3 <- data.frame( x = 1000 + 5:8, z = 2000 + 5:8, row.names=letters[5:8] )

df1
#   x  y
# a 1 21
# b 2 22
# c 3 23
# d 4 24
# e 5 25

df2
#     x   z
# d 104 204
# e 105 205
# f 106 206

df3
#      x    z
# e 1005 2005
# f 1006 2006
# g 1007 2007
# h 1008 2008

First option is to manually merge the dataset by hand as follows. This advantage of this method is that it is clear and easy to double check. The disadvantage is the length codes and intermediate objects which becomes difficult to track with increasing number of datasets. I seen the use of Reduce() function in some forum threads that might help automate this.

tmp12 <- merge(df1, df2, by=0, all=T)
rownames(tmp12) <- tmp12$Row.names; tmp12$Row.names <- NULL

tmp123 <- merge(tmp12, df3, by=0, all=T)
rownames(tmp123) <- tmp123$Row.names; tmp123$Row.names <- NULL
tmp123
#   x.x  y x.y z.x    x  z.y
# a   1 21  NA  NA   NA   NA
# b   2 22  NA  NA   NA   NA
# c   3 23  NA  NA   NA   NA
# d   4 24 104 204   NA   NA
# e   5 25 105 205 1005 2005
# f  NA NA 106 206 1006 2006
# g  NA NA  NA  NA 1007 2007
# h  NA NA  NA  NA 1008 2008

rm(tmp12, tmp123)

Second option is to use the join_all() function from the plyr package. I am new to this function and naively thought something like the following will do the job:

library(plyr)
mylist <- list( one=df1, two=df2, three=df3 )
join_all( mylist, type="full" )  # not what I want!
# Joining by: x
# Joining by: x, z
#       x  y    z
# 1     1 21   NA
# 2     2 22   NA
# 3     3 23   NA
# 4     4 24   NA
# 5     5 25   NA
# 6   104 NA  204
# 7   105 NA  205
# 8   106 NA  206
# 9  1005 NA 2005
# 10 1006 NA 2006
# 11 1007 NA 2007
# 12 1008 NA 2008

A few extra lines of coding to make the column names unique and adding the rownames as a column seems to solve this problem. However, I have not tested this widely.

for(i in 1:length(mylist)){
  colnames(mylist[[i]]) <- paste0( names(mylist)[i], "_", colnames(mylist[[i]]) )
  mylist[[i]]$ROWNAMES  <- rownames(mylist[[i]])
}

out <- join_all( mylist, by="ROWNAMES", type="full" )
rownames(out) <- out$ROWNAMES; out$ROWNAMES <- NULL
#   one_x one_y two_x two_z three_x three_z
# a     1    21    NA    NA      NA      NA
# b     2    22    NA    NA      NA      NA
# c     3    23    NA    NA      NA      NA
# d     4    24   104   204      NA      NA
# e     5    25   105   205    1005    2005
# f    NA    NA   106   206    1006    2006
# g    NA    NA    NA    NA    1007    2007
# h    NA    NA    NA    NA    1008    2008

rm(out, mylist)

Third option is my own custom code which I used successfully for years and is pretty fast for very large datasets. Here is a pseudo-algorithm

  1. Generate the universe of IDs across all datasets
  2. Pads each dataset with the extra rows for missing IDs (all datasets have same number of rows)
  3. Sort every dataset (all datasets have same order)
  4. Row bind the datasets
multimerge <- function (mylist) {
  ## mimics a recursive merge or full outer join

  unames <- unique(unlist(lapply(mylist, rownames)))

  n <- length(unames)

  out <- lapply(mylist, function(df) {

    tmp <- matrix(nr = n, nc = ncol(df), dimnames = list(unames,colnames(df)))
    tmp[rownames(df), ] <- as.matrix(df)
    rm(df); gc()

    return(tmp)
  })

  stopifnot( all( sapply(out, function(x) identical(rownames(x), unames)) ) )

  bigout <- do.call(cbind, out)
  colnames(bigout) <- paste(rep(names(mylist), sapply(mylist, ncol)), unlist(sapply(mylist, colnames)), sep = "_")
  return(bigout)
}

multimerge( list (one=df1, two=df2, three=df3 ) )
#   one_x one_y two_x two_z three_x three_z
# a     1    21    NA    NA      NA      NA
# b     2    22    NA    NA      NA      NA
# c     3    23    NA    NA      NA      NA
# d     4    24   104   204      NA      NA
# e     5    25   105   205    1005    2005
# f    NA    NA   106   206    1006    2006
# g    NA    NA    NA    NA    1007    2007
# h    NA    NA    NA    NA    1008    2008

As a bonus, my codes is also ignorant of the column names which can be handy if every column is unique across the datasets. The column orders are preserved.

out <- multimerge( list (df1, df2, df3 ) )
colnames(out) <- gsub("^_", "", colnames(out))
out
#    x  y   x   z    x    z
# a  1 21  NA  NA   NA   NA
# b  2 22  NA  NA   NA   NA
# c  3 23  NA  NA   NA   NA
# d  4 24 104 204   NA   NA
# e  5 25 105 205 1005 2005
# f NA NA 106 206 1006 2006
# g NA NA  NA  NA 1007 2007
# h NA NA  NA  NA 1008 2008

One comment


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s