Here is a common everyday challenge. How does one combine several datasets using rownames as the key? This is known as recursive or repeated merging in R and a full outer join in SQL speak.
Let me generate some simple dataset first:
df1 <- data.frame( x = 1:5, y = 20 + 1:5, row.names=letters[1:5] ) df2 <- data.frame( x = 100 + 4:6, z = 200 + 4:6, row.names=letters[4:6] ) df3 <- data.frame( x = 1000 + 5:8, z = 2000 + 5:8, row.names=letters[5:8] ) df1 # x y # a 1 21 # b 2 22 # c 3 23 # d 4 24 # e 5 25 df2 # x z # d 104 204 # e 105 205 # f 106 206 df3 # x z # e 1005 2005 # f 1006 2006 # g 1007 2007 # h 1008 2008
First option is to manually merge the dataset by hand as follows. This advantage of this method is that it is clear and easy to double check. The disadvantage is the length codes and intermediate objects which becomes difficult to track with increasing number of datasets. I seen the use of Reduce() function in some forum threads that might help automate this.
tmp12 <- merge(df1, df2, by=0, all=T) rownames(tmp12) <- tmp12$Row.names; tmp12$Row.names <- NULL tmp123 <- merge(tmp12, df3, by=0, all=T) rownames(tmp123) <- tmp123$Row.names; tmp123$Row.names <- NULL tmp123 # x.x y x.y z.x x z.y # a 1 21 NA NA NA NA # b 2 22 NA NA NA NA # c 3 23 NA NA NA NA # d 4 24 104 204 NA NA # e 5 25 105 205 1005 2005 # f NA NA 106 206 1006 2006 # g NA NA NA NA 1007 2007 # h NA NA NA NA 1008 2008 rm(tmp12, tmp123)
Second option is to use the join_all() function from the plyr package. I am new to this function and naively thought something like the following will do the job:
library(plyr) mylist <- list( one=df1, two=df2, three=df3 ) join_all( mylist, type="full" ) # not what I want! # Joining by: x # Joining by: x, z # x y z # 1 1 21 NA # 2 2 22 NA # 3 3 23 NA # 4 4 24 NA # 5 5 25 NA # 6 104 NA 204 # 7 105 NA 205 # 8 106 NA 206 # 9 1005 NA 2005 # 10 1006 NA 2006 # 11 1007 NA 2007 # 12 1008 NA 2008
A few extra lines of coding to make the column names unique and adding the rownames as a column seems to solve this problem. However, I have not tested this widely.
for(i in 1:length(mylist)){ colnames(mylist[[i]]) <- paste0( names(mylist)[i], "_", colnames(mylist[[i]]) ) mylist[[i]]$ROWNAMES <- rownames(mylist[[i]]) } out <- join_all( mylist, by="ROWNAMES", type="full" ) rownames(out) <- out$ROWNAMES; out$ROWNAMES <- NULL # one_x one_y two_x two_z three_x three_z # a 1 21 NA NA NA NA # b 2 22 NA NA NA NA # c 3 23 NA NA NA NA # d 4 24 104 204 NA NA # e 5 25 105 205 1005 2005 # f NA NA 106 206 1006 2006 # g NA NA NA NA 1007 2007 # h NA NA NA NA 1008 2008 rm(out, mylist)
Third option is my own custom code which I used successfully for years and is pretty fast for very large datasets. Here is a pseudo-algorithm
- Generate the universe of IDs across all datasets
- Pads each dataset with the extra rows for missing IDs (all datasets have same number of rows)
- Sort every dataset (all datasets have same order)
- Row bind the datasets
multimerge <- function (mylist) { ## mimics a recursive merge or full outer join unames <- unique(unlist(lapply(mylist, rownames))) n <- length(unames) out <- lapply(mylist, function(df) { tmp <- matrix(nr = n, nc = ncol(df), dimnames = list(unames,colnames(df))) tmp[rownames(df), ] <- as.matrix(df) rm(df); gc() return(tmp) }) stopifnot( all( sapply(out, function(x) identical(rownames(x), unames)) ) ) bigout <- do.call(cbind, out) colnames(bigout) <- paste(rep(names(mylist), sapply(mylist, ncol)), unlist(sapply(mylist, colnames)), sep = "_") return(bigout) } multimerge( list (one=df1, two=df2, three=df3 ) ) # one_x one_y two_x two_z three_x three_z # a 1 21 NA NA NA NA # b 2 22 NA NA NA NA # c 3 23 NA NA NA NA # d 4 24 104 204 NA NA # e 5 25 105 205 1005 2005 # f NA NA 106 206 1006 2006 # g NA NA NA NA 1007 2007 # h NA NA NA NA 1008 2008
As a bonus, my codes is also ignorant of the column names which can be handy if every column is unique across the datasets. The column orders are preserved.
out <- multimerge( list (df1, df2, df3 ) ) colnames(out) <- gsub("^_", "", colnames(out)) out # x y x z x z # a 1 21 NA NA NA NA # b 2 22 NA NA NA NA # c 3 23 NA NA NA NA # d 4 24 104 204 NA NA # e 5 25 105 205 1005 2005 # f NA NA 106 206 1006 2006 # g NA NA NA NA 1007 2007 # h NA NA NA NA 1008 2008
THANK YOU! That was exactely what I was looking for! You made my day 🙂