Beware when using the scale parameter in pheatmap or using scaled data in ComplexHeatmap

Using scale="row" or scale="column" in the pheatmap() function looks like an easy option, but we recently realized there is a subtle conceptual assumption that can be misleading.

TLDR: The function actually scales the data before calculating the distance metrics and performing the clustering. The problem persists whether you use correlation-based or Euclidean distance, and also when using the Heatmap() function from ComplexHeatmap. The better solution is to calculate the row and column clusterings yourself and feed them in (jump to Section 9 for the solution).

1. Load in demo data

We will be using the Wisconsin diagnostic breast cancer (WDBC) data. This dataset contains fine needle aspirates of breast lump samples from 569 patients labelled as either malignant (M, n=212) or benign (B, n=357). Let's load the data and save the Diagnosis column as character along with a colour mapping.

pacman::p_load(tidyverse, ClassDiscovery, pheatmap, ComplexHeatmap)
rm(list=ls())

data(wdbc, package = "mclust")

dim(wdbc)
# 569 32

## Diagnosis column and color mapping
Diagnosis <- wdbc %>% 
  column_to_rownames("ID") %>% 
  mutate(Diagnosis       = as.character(Diagnosis),
         Diagnosis_color = recode(Diagnosis, "M"="red", "B"="blue")) %>% 
  select(Diagnosis, Diagnosis_color)

tail(Diagnosis, 3)

#       Diagnosis Diagnosis_color
# 926954         M             red
# 927241         M             red
# 92751          B            blue

For each sample, ten different histopathological features of the cell nuclei were obtained from digitized images. The mean, SD and extreme values of each feature were recorded, but we will only use the mean values for this demonstration. I have also transposed the data as I want to view features as rows and samples as columns in the heatmap.

## Select only the mean values for the 10 histopathological factors
df <- wdbc %>% 
  column_to_rownames("ID") %>% 
  select(ends_with("_mean")) %>% 
  rename_with(~gsub("_mean", "", .)) %>%   # Optional, for shorter names
  t()

dim(df)
# 10 569


df[ , 567:569]
#                926954    927241     92751
# Radius       16.60000 2.060e+01   7.76000
# Texture      28.08000 2.933e+01  24.54000
# Perimeter   108.30000 1.401e+02  47.92000
# Area        858.10000 1.265e+03 181.00000
# Smoothness    0.08455 1.178e-01   0.05263
# Compactness   0.10230 2.770e-01   0.04362
# Concavity     0.09251 3.514e-01   0.00000
# Nconcave      0.05302 1.520e-01   0.00000
# Symmetry      0.15900 2.397e-01   0.15870
# Fractaldim    0.05648 7.016e-02   0.05884

rm(wdbc)

2. Generate the row and column clustering individually

hc.features <- df %>% t() %>%  
  distanceMatrix(metric="pearson") %>%
  hclust(method="average")               

hc.samples <- df %>% 
  distanceMatrix(metric="pearson") %>%
  hclust(method="average")               

par(mfrow=c(1, 2))

par(mar=c(3,1,1,5))
plot(as.dendrogram(hc.features), ylab="", sub="", horiz=T)

par(mar=c(5, 2, 4, 2) + 0.1)
plotColoredClusters(hc.samples, 
                    labs = colnames(df), 
                    cols = Diagnosis %>% pull(Diagnosis_color),
                    ylab="", sub="", cex=0.5)

Keep an eye on the figures above; these are our reference plots for confirming the clusterings below.

3. Using pheatmap() function without scaling

One thing to note is that a function called pheatmap() exists in both the pheatmap and ComplexHeatmap packages, so we need to be explicit about which pheatmap() we are calling.

pheatmap::pheatmap(df, 
                   scale = "none",          # the default value anyway
                   annotation_col = Diagnosis %>% select(Diagnosis),
                   show_colnames  = FALSE,
                   clustering_distance_rows = "correlation",
                   clustering_distance_cols = "correlation",
                   clustering_method = "average")

You can see that the column clustering is identical to the independently produced clustering in Section 2. Ditto for the row clustering (except that the order is reversed, which can be ignored). The colour scale within the body of the heatmap is ugly because the variables are on very different scales, e.g. Area ranges from 143 to 2501 while Fractaldim ranges from 0.05 to 0.10.
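
You can confirm the disparity quickly:

## Min and max of each feature across the 569 samples
apply(df, 1, range) %>% t()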

4. Using pheatmap() function with scaling

I assumed using the scale="row" argument in pheatmap() would make life easy but …

pheatmap::pheatmap(df, 
                   scale = "row",
                   annotation_col = Diagnosis %>% select(Diagnosis),
                   show_colnames  = FALSE,
                   clustering_distance_rows = "correlation",
                   clustering_distance_cols = "correlation",
                   clustering_method = "average")

The body of the heatmap looks fine, but the column clustering is very different from the clustering in Section 2. The row clustering is unaffected here because correlation-based distance internally scales each row anyway. However, if we were using Euclidean distance, the row clustering would also look very different (see Section 8).
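
To see why the row clustering survives, note that the Pearson correlation between two rows is unchanged when each row is linearly rescaled. A quick check:

## Pearson correlation is invariant to centring and scaling each vector
x <- as.numeric(df["Area", ])
y <- as.numeric(df["Radius", ])
all.equal( cor(x, y), cor(as.numeric(scale(x)), as.numeric(scale(y))) )
# [1] TRUE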

5. Reason for the difference

The data is scaled before the hierarchical clusterings of the rows and columns are generated. To demonstrate this, let's scale each feature (i.e. each row) to have zero mean and unit variance and then run the sample clustering on the scaled data.

scaled_by_feature <- df %>% t() %>% scale() %>% t()
dim(scaled_by_feature)
# [1]  10 569

apply(scaled_by_feature, 1, mean)  # close enough to zero
#        Radius       Texture     Perimeter          Area    Smoothness 
# -1.379776e-16  6.157678e-17 -1.190559e-16  1.219971e-16  1.618800e-16 
#   Compactness     Concavity      Nconcave      Symmetry    Fractaldim 
# -7.605173e-17  3.929080e-17 -5.337457e-17  1.679646e-16  4.810260e-16 

apply(scaled_by_feature, 1, var)
#      Radius     Texture   Perimeter       Area   Smoothness 
#           1           1           1          1            1          
# Compactness   Concavity    Nconcave   Symmetry   Fractaldim 
#           1           1           1          1            1 

scaled_by_feature %>% 
  distanceMatrix(metric="pearson") %>%
  hclust(method="average") %>% 
  plotColoredClusters(labs = colnames(df), 
                      cols = Diagnosis %>% pull(Diagnosis_color),
                      ylab="", sub="", cex=0.5)

We have reproduced the undesired clustering in Section 4.

6. Fix for pheatmap()

One way to fix this is to feed the independent clusterings into pheatmap() using the cluster_rows and cluster_cols arguments. You no longer need to set the clustering_distance_rows, clustering_distance_cols and clustering_method arguments.

pheatmap::pheatmap(df, 
                   scale = "row",
                   annotation_col = Diagnosis %>% select(Diagnosis),
                   show_colnames  = FALSE,
                   cluster_rows   = hc.features,
                   cluster_cols   = hc.samples)

7. What about ComplexHeatmap?

The Heatmap() function from the ComplexHeatmap package does not have a scale option. You have to feed in the scaled input manually (we will reuse scaled_by_feature from Section 5), and you can see that we end up with the same problem.

column_ha <- HeatmapAnnotation(Diagnosis = Diagnosis %>% pull(Diagnosis), 
                               col = list(Diagnosis = c("M" = "red", "B" = "blue")),
                               which = "column")

Heatmap(scaled_by_feature,
        top_annotation = column_ha,
        clustering_distance_rows    = "pearson",
        clustering_method_rows      = "average",
        clustering_distance_columns = "pearson",
        clustering_method_columns   = "average",
        show_column_names = FALSE)

We can apply the same fix as in Section 6. However, the argument names are slightly different, i.e. cluster_columns instead of cluster_cols.

Heatmap(scaled_by_feature,
        cluster_rows      = hc.features, 
        cluster_columns   = hc.samples,
        top_annotation    = column_ha,
        show_column_names = FALSE)

8. What about Euclidean distance?

The situation is worse here, as the row clustering will also be affected. To demonstrate this, change "pearson" and "correlation" in the code above to "euclidean".
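
If you want the Euclidean equivalents of the independent clusterings from Section 2, here is a minimal sketch using base dist(). Note that dist() measures distances between rows, whereas distanceMatrix() works on columns, so the transposes are flipped:

## Euclidean versions of hc.features and hc.samples from Section 2
hc.features.euc <- df %>% 
  dist(method="euclidean") %>% 
  hclust(method="average")

hc.samples.euc <- df %>% t() %>% 
  dist(method="euclidean") %>% 
  hclust(method="average")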

9. Summary

Calculate the row and column clusterings (see Section 2) and feed them into the pheatmap() function (see Section 6) or the Heatmap() function (see Section 7). In short:

hc.features <- df %>% t() %>%  
  distanceMatrix(metric="pearson") %>%
  hclust(method="average")               

hc.samples <- df %>% 
  distanceMatrix(metric="pearson") %>%
  hclust(method="average")

pheatmap::pheatmap(df, 
                   scale = "row",
                   annotation_col = Diagnosis %>% select(Diagnosis),
                   show_colnames  = FALSE,
                   cluster_rows   = hc.features,
                   cluster_cols   = hc.samples)            

How to set up and run a local BLAST

I do not run BLAST often, and I use the web interface when I need to. However, I recently had a request to run several hundred thousand sequences against the genome, for which running locally was the preferred choice. This also gave me a chance to test the Windows Subsystem for Linux (WSL2).

Setting up the blastn software

I followed the steps from the first few sections of https://www.exoscale.com/syslog/blast/

sudo apt install lftp
lftp -e "cd blast/executables/LATEST; dir; quit" ftp.ncbi.nlm.nih.gov | awk '{print $NF}'

wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ncbi-blast-2.13.0+-x64-linux.tar.gz
wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ncbi-blast-2.13.0+-x64-linux.tar.gz.md5
md5sum --check ncbi-blast-2.13.0+-x64-linux.tar.gz.md5

tar -xzvf ncbi-blast-2.13.0+-x64-linux.tar.gz 
rm ncbi-blast-2.13.0+-x64-linux.tar.gz 

Set up the BLAST databases

I am going to set up the database folder and download the preformatted human database, which comes in two volumes. Change the database according to your species of interest and whether you want to work at the DNA, RNA or protein level. You can also use makeblastdb or update_blastdb.pl as alternatives.

mkdir $HOME/blast_db
export BLASTDB=$HOME/blast_db

cd $BLASTDB

wget https://ftp.ncbi.nlm.nih.gov/blast/db/human_genome.00.tar.gz
wget https://ftp.ncbi.nlm.nih.gov/blast/db/human_genome.00.tar.gz.md5
md5sum --check human_genome.00.tar.gz.md5
tar -xzvf human_genome.00.tar.gz

wget https://ftp.ncbi.nlm.nih.gov/blast/db/human_genome.01.tar.gz
wget https://ftp.ncbi.nlm.nih.gov/blast/db/human_genome.01.tar.gz.md5
md5sum --check human_genome.01.tar.gz.md5
tar -xzvf human_genome.01.tar.gz

For convenience, you can optionally add the following lines to your $HOME/.bashrc

export PATH=$PATH:$HOME/ncbi-blast-2.13.0+/bin
export BLASTDB=$HOME/blast_db

(Optional) You can find the location of the FASTA, GTF and other files used to create this preformatted database from the JSON file.

wget https://ftp.ncbi.nlm.nih.gov/blast/db/human_genome-nucl-metadata.json

cat human_genome-nucl-metadata.json
{
  "version": "1.1",
  "dbname": "human_genome",
  "dbtype": "Nucleotide",
  "description": "Homo sapiens GRCh38.p13 [GCF_000001405.39] chromosomes plus unplaced and unlocalized scaffolds",
  "number-of-letters": 3272089205,
  "number-of-sequences": 639,
  "files": [
    "ftp://ftp.ncbi.nlm.nih.gov/blast/db/human_genome.01.tar.gz",
    "ftp://ftp.ncbi.nlm.nih.gov/blast/db/human_genome.00.tar.gz"
  ],
  "last-updated": "2021-06-02T14:21:05.818179",
  "bytes-total": 1157233426,
  "bytes-total-compressed": 1024270191,
  "bytes-to-cache": 818356831,
  "number-of-volumes": 2
}

The point to note is that the genome reference is "GCF_000001405.39", which means you can find the source files at https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/

Running the blastn command

Here is what the first few lines of my input file look like:

head blast_input.txt

>seq1
ACTTCTCAGATATGTTTTTTGTCTTCAGTGAACTTAAAGTCTGTTTGAAA
>seq2
CTTCTCAGCTGCATTGCTTCCTCTTCAGACCTCTCTTCTCAAAGGACCTG
>seq3
TCTCACAAGACGCACAAATTCCTAAACCAGCAATGCTGCTTGGCAAGACG

Finally, the actual command:

blastn -query blast_input.txt -db GCF_000001405.39_top_level -out blast_output.txt -outfmt 6

Here is what the output looks like:

head blast_output.txt

seq1 NC_000013.11  100.000  50  0  0  1  50   93689308   93689357  5.79e-18  93.5  
seq2 NC_000023.11  100.000  50  0  0  1  50   71618757   71618708  5.79e-18  93.5 
seq3 NC_000001.11  100.000  50  0  0  1  50  154607753  154607802  5.79e-18  93.5
seq3 NC_000001.11  100.000  49  0  0  2  50  154607824  154607872  2.08e-17  91.6
seq3 NC_000001.11  100.000  37  0  0  13 49  154607685  154607721  9.76e-11  69.4 

Unfortunately, there is no header for this file, but with a bit of digging around you will find that the columns are:

  1. query_id
  2. subject_id aka chromosome where it maps to
  3. per_identity aka percentage identity
  4. aln_length aka alignment length
  5. mismatches
  6. gap_openings
  7. q_start
  8. q_end
  9. s_start aka start position in the subject sequence
  10. s_end aka end position in the subject sequence
  11. e-value aka number of expected hits of similar quality (score) that could be found just by chance
  12. bit_score aka the normalized alignment score

The lower the e-value and the higher the bit-score, the better the BLAST result. The e-value is derived from the bit-score after adjusting for the size of the sequence database, since a larger database increases the chance of false-positive alignments. See https://www.metagenomics.wiki/tools/blast/evalue for a full explanation.
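
Since outfmt 6 has no header, one convenient pattern is to attach all twelve names when reading the file into R (a sketch; the column names are my own labels):

## Name all 12 standard outfmt-6 columns on import
blast_cols <- c("query_id", "subject_id", "per_identity", "aln_length",
                "mismatches", "gap_openings", "q_start", "q_end",
                "s_start", "s_end", "e_value", "bit_score")
blast_full <- readr::read_tsv("blast_output.txt", col_names = blast_cols)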

(Optional) Post alignment filtering and annotation

As you can see, seq3 has three possible alignments. I will do the cleanup to choose the best alignment in R.

pacman::p_load(tidyverse, janitor)
rm(list=ls())

## Read in blastn and add header for outfmt6
blast <- read_delim("blast_output.txt", col_names=F) %>% 
  select(query_id   = 1,
         subject_id = 2,
         s_start    = 9,
         s_end      = 10,
         e_value    = 11)

## Keep only the minimum e-value alignment(s) per query
blast_best <- blast %>%
  group_by(query_id) %>% 
  filter(e_value == min(e_value)) %>%
  ungroup()

## Check the e-value for the worst alignment after this filtering
blast_best %>% 
  pull(e_value) %>% 
  max()

## Check how many sequences have unique best alignment
tb <- blast_best %>% tabyl(query_id) 
tb %>% filter(n==1) %>% nrow()    # unique best alignment
tb %>% filter(n > 1) %>% nrow()   # have multiple best alignment

## Write out
write_delim(blast_best, file="blast_output_min_evalue.tsv", delim="\t")

In my case, I also needed to write this as a 6-column BED file for further processing. One challenge here was to create a unique key, as the query_id column is repeated for multiple best alignments. You can omit the header by setting col_names=FALSE in the write_delim() command.

blast_best %>% 
  mutate( uid    = paste0(query_id, "_", row_number()),
          blank  = "",
          strand = ".",
          start  = pmin(s_start, s_end),   # element-wise min/max handles negative-strand hits
          end    = pmax(s_start, s_end),
          start  = start - 1) %>%          # BED start is 0-based
  select(chr=subject_id, start, end, uid, blank, strand) %>% 
  write_delim(file="blast_output_min_evalue.bed", delim = "\t")

Note that the BED file format insists on:

  • the start value being smaller than the end value, even for negative-strand hits
  • the start column being 0-based
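
A quick sanity check on the file we just wrote (a sketch; it assumes the BED was written with col_names=FALSE):

## Re-read the headerless BED and verify both constraints
bed <- read_delim("blast_output_min_evalue.bed", delim="\t",
                  col_names=c("chr", "start", "end", "uid", "blank", "strand"))
stopifnot( all(bed$start < bed$end) )   # start < end even on the negative strand
stopifnot( all(bed$start >= 0) )        # 0-based starts are non-negative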

Next, we download and prepare the FASTA and GTF for annotation.

REF=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13

wget $REF/GCF_000001405.39_GRCh38.p13_genomic.gtf.gz
gunzip GCF_000001405.39_GRCh38.p13_genomic.gtf.gz

wget $REF/GCF_000001405.39_GRCh38.p13_genomic.fna.gz
gunzip GCF_000001405.39_GRCh38.p13_genomic.fna.gz
mv GCF_000001405.39_GRCh38.p13_genomic.fna GCF_000001405.39_GRCh38.p13_genomic.fa

Finally, the call to execute HOMER.

annotatePeaks.pl blast_output_min_evalue.bed GCF_000001405.39_GRCh38.p13_genomic.fa -gtf GCF_000001405.39_GRCh38.p13_genomic.gtf > homer_output.txt

Changing legend shape when geom_text() and colour aesthetic are used in ggplot2

When you use geom_text() with the colour aesthetic set, you get an ugly coloured "a" in the legend. Let us first create some demo data.

library(tidyverse)
data(iris)
df <- iris %>% 
  sample_frac(size = 0.1) %>% 
  rowid_to_column("ID")

Here is how the data look. Yours will be different because of the random nature of sample_frac().

#    ID Sepal.Length Sepal.Width Petal.Length Petal.Width     Species
# 1   1          5.0         3.5          1.6         0.6      setosa
# 2   2          7.1         3.0          5.9         2.1   virginica
# 3   3          6.5         3.0          5.8         2.2   virginica
# 4   4          6.3         2.7          4.9         1.8   virginica
# 5   5          6.8         2.8          4.8         1.4  versicolor
# 6   6          6.2         2.9          4.3         1.3  versicolor
# 7   7          5.7         3.8          1.7         0.3      setosa
# 8   8          5.4         3.9          1.3         0.4      setosa
# 9   9          4.8         3.1          1.6         0.2      setosa
# 10 10          5.1         2.5          3.0         1.1  versicolor
# 11 11          6.4         2.9          4.3         1.3  versicolor
# 12 12          5.7         4.4          1.5         0.4      setosa
# 13 13          5.1         3.4          1.5         0.2      setosa
# 14 14          4.4         2.9          1.4         0.2      setosa
# 15 15          5.0         3.3          1.4         0.2      setosa

Default plot

ggplot(df, aes(Sepal.Length, Sepal.Width, label=ID, col=Species)) +
  geom_text(size=6) +
  theme_bw() +
  labs(title="Default")

Notice the ugly “a” in the legend?

Alternative 1 – hide the legend

One option is to simply hide the legend by setting show.legend=FALSE in geom_text().

ggplot(df, aes(Sepal.Length, Sepal.Width, label=ID, col=Species)) +
  geom_text(show.legend=FALSE) +
  theme_bw() 

Alternative 2 – make a better legend

The other alternative is slightly more complex:

  • disable the legend in geom_text() via the show.legend parameter
  • add transparent points with geom_point() by setting alpha=0 so that no visible points are added to the plot; this geom_point() layer triggers the legend
  • control the legend for geom_point() via guides()

ggplot(df, aes(Sepal.Length, Sepal.Width, label=ID, col=Species)) +
  geom_text(show.legend=FALSE) +
  geom_point(alpha=0) +
  guides(colour = guide_legend(title="Variety", override.aes=list(alpha=1, shape=15, size=6))) + 
  theme_bw()

How to merge multiple datasets in R based on row names?

Here is a common everyday challenge: how does one combine several datasets using rownames as the key? This is known as a recursive or repeated merge in R, or a full outer join in SQL speak.

Let me generate some simple datasets first:

df1 <- data.frame( x = 1:5,        y = 20 + 1:5,   row.names=letters[1:5] )
df2 <- data.frame( x = 100 + 4:6,  z = 200 + 4:6,  row.names=letters[4:6] )
df3 <- data.frame( x = 1000 + 5:8, z = 2000 + 5:8, row.names=letters[5:8] )

df1
#   x  y
# a 1 21
# b 2 22
# c 3 23
# d 4 24
# e 5 25

df2
#     x   z
# d 104 204
# e 105 205
# f 106 206

df3
#      x    z
# e 1005 2005
# f 1006 2006
# g 1007 2007
# h 1008 2008

The first option is to merge the datasets by hand, as follows. The advantage of this method is that it is clear and easy to double-check. The disadvantage is the lengthy code and the intermediate objects, which become difficult to track as the number of datasets grows. I have seen the Reduce() function used in some forum threads to help automate this; a sketch follows the code below.

tmp12 <- merge(df1, df2, by=0, all=T)
rownames(tmp12) <- tmp12$Row.names; tmp12$Row.names <- NULL

tmp123 <- merge(tmp12, df3, by=0, all=T)
rownames(tmp123) <- tmp123$Row.names; tmp123$Row.names <- NULL
tmp123
#   x.x  y x.y z.x    x  z.y
# a   1 21  NA  NA   NA   NA
# b   2 22  NA  NA   NA   NA
# c   3 23  NA  NA   NA   NA
# d   4 24 104 204   NA   NA
# e   5 25 105 205 1005 2005
# f  NA NA 106 206 1006 2006
# g  NA NA  NA  NA 1007 2007
# h  NA NA  NA  NA 1008 2008

rm(tmp12, tmp123)
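
Here is what that Reduce() pattern might look like (a sketch; merge_by_rownames is my own helper and the .x/.y name collisions remain):

merge_by_rownames <- function(a, b) {
  m <- merge(a, b, by=0, all=TRUE)   # by=0 merges on the rownames
  rownames(m) <- m$Row.names
  m$Row.names <- NULL
  m
}

Reduce(merge_by_rownames, list(df1, df2, df3))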

The second option is to use the join_all() function from the plyr package. I was new to this function and naively thought something like the following would do the job:

library(plyr)
mylist <- list( one=df1, two=df2, three=df3 )
join_all( mylist, type="full" )  # not what I want!
# Joining by: x
# Joining by: x, z
#       x  y    z
# 1     1 21   NA
# 2     2 22   NA
# 3     3 23   NA
# 4     4 24   NA
# 5     5 25   NA
# 6   104 NA  204
# 7   105 NA  205
# 8   106 NA  206
# 9  1005 NA 2005
# 10 1006 NA 2006
# 11 1007 NA 2007
# 12 1008 NA 2008

A few extra lines of code to make the column names unique and to add the rownames as a column seem to solve the problem. However, I have not tested this widely.

for(i in 1:length(mylist)){
  colnames(mylist[[i]]) <- paste0( names(mylist)[i], "_", colnames(mylist[[i]]) )
  mylist[[i]]$ROWNAMES  <- rownames(mylist[[i]])
}

out <- join_all( mylist, by="ROWNAMES", type="full" )
rownames(out) <- out$ROWNAMES; out$ROWNAMES <- NULL
#   one_x one_y two_x two_z three_x three_z
# a     1    21    NA    NA      NA      NA
# b     2    22    NA    NA      NA      NA
# c     3    23    NA    NA      NA      NA
# d     4    24   104   204      NA      NA
# e     5    25   105   205    1005    2005
# f    NA    NA   106   206    1006    2006
# g    NA    NA    NA    NA    1007    2007
# h    NA    NA    NA    NA    1008    2008

rm(out, mylist)

The third option is my own custom code, which I have used successfully for years and which is pretty fast for very large datasets. Here is the pseudo-algorithm, implemented below:

  1. Generate the universe of IDs across all datasets
  2. Pad each dataset with extra rows for its missing IDs (so all datasets have the same number of rows)
  3. Sort every dataset (so all datasets have the same row order)
  4. Column bind the datasets

multimerge <- function (mylist) {
  ## mimics a recursive merge or full outer join

  unames <- unique(unlist(lapply(mylist, rownames)))

  n <- length(unames)

  out <- lapply(mylist, function(df) {

    ## Pad: NA-filled matrix over the full ID universe, then fill in the known rows
    tmp <- matrix(nrow = n, ncol = ncol(df), dimnames = list(unames, colnames(df)))
    tmp[rownames(df), ] <- as.matrix(df)
    rm(df); gc()

    return(tmp)
  })

  stopifnot( all( sapply(out, function(x) identical(rownames(x), unames)) ) )

  bigout <- do.call(cbind, out)
  colnames(bigout) <- paste(rep(names(mylist), sapply(mylist, ncol)), unlist(sapply(mylist, colnames)), sep = "_")
  return(bigout)
}

multimerge( list (one=df1, two=df2, three=df3 ) )
#   one_x one_y two_x two_z three_x three_z
# a     1    21    NA    NA      NA      NA
# b     2    22    NA    NA      NA      NA
# c     3    23    NA    NA      NA      NA
# d     4    24   104   204      NA      NA
# e     5    25   105   205    1005    2005
# f    NA    NA   106   206    1006    2006
# g    NA    NA    NA    NA    1007    2007
# h    NA    NA    NA    NA    1008    2008

As a bonus, my code is also agnostic to the column names, which can be handy if every column name is already unique across the datasets and needs no prefix. The column order is preserved.

out <- multimerge( list (df1, df2, df3 ) )
colnames(out) <- gsub("^_", "", colnames(out))
out
#    x  y   x   z    x    z
# a  1 21  NA  NA   NA   NA
# b  2 22  NA  NA   NA   NA
# c  3 23  NA  NA   NA   NA
# d  4 24 104 204   NA   NA
# e  5 25 105 205 1005 2005
# f NA NA 106 206 1006 2006
# g NA NA  NA  NA 1007 2007
# h NA NA  NA  NA 1008 2008
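
For completeness, here is a tidyverse sketch of the same recursive merge (my own translation; not tested as widely as multimerge()):

library(tidyverse)

list(one=df1, two=df2, three=df3) %>%
  imap(function(d, nm) {
    d %>%
      rename_with(~paste(nm, .x, sep="_")) %>%  # prefix columns with the list names
      rownames_to_column("ROWNAMES")            # expose rownames as the join key
  }) %>%
  reduce(full_join, by="ROWNAMES") %>%
  column_to_rownames("ROWNAMES")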

Download 1000 Genomes Phase 3 and calculate allele frequencies

Here is some code to download the data from the 1000 Genomes Phase 3 FTP site to your own server and calculate the allele frequencies for the European populations. We start with some setup code; the panel file tells you which population and super-population each sample belongs to.

FTP_SITE=ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502
wget $FTP_SITE/integrated_call_samples_v3.20130502.ALL.panel

Next, we will download each chromosome (I am ignoring the Y and MT chromosomes here). Alternatively, you can download all the files on the FTP site using wget -r $FTP_SITE, but I preferred to download each one separately. Note that chromosome X is based on version 1b and not v5a like the autosomes, so it has to be downloaded separately. I am renaming it for convenience later.

for CHR in `seq 1 22`; do
   FILE=$FTP_SITE/ALL.chr$CHR.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
   wget $FILE $FILE.tbi
   sleep 60
done

FILE=$FTP_SITE/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz
wget $FILE $FILE.tbi
rename -v 's/v1b/v5a/' *

The next step is to identify the Europeans and calculate the allele frequencies. I am also ignoring any variants that have a call rate < 95% or fail Hardy-Weinberg equilibrium at p < 1e-6. You can do this with vcftools or other software, but I am most familiar with PLINK. If you are using PLINK, please use version 1.90 (https://www.cog-genomics.org/plink2) or above, as it is much faster than previous versions.

grep EUR integrated_call_samples_v3.20130502.ALL.panel | cut -f1 > EUR.id  ## 503 Europeans

for CHR in `seq 1 23 | sed 's/23/X/'`; do
   FILE=ALL.chr$CHR.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
   plink --vcf $FILE --keep-fam EUR.id --geno 0.05 --hwe 1e-6 --make-bed --out tmp
   plink --bfile tmp --freq --out EUR_$CHR
   rm EUR_$CHR.{log,nosex} tmp.*
done

awk 'FNR==1 && NR!=1{next;}{print $1, $2, $3, $4, $5}' EUR_*.frq > EUR.freq  ## keep CHR, SNP, A1, A2, MAF (the .frq file is space-aligned, so cut -f will not split it)
rm EUR_*.frq

You may also find pre-computed allele frequencies per population elsewhere, but my purpose here is to show how to download and run some calculations on the VCF files.

Leading edge subset from Gene Set Enrichment Analysis (GSEA) using blood transcription modules

Recently, I have been taken up very much by the modular analysis developed by Damien Chaussabel and colleagues (see original paper or their recent review paper) for analysis of whole blood data.

A recent R package called tmod from Dr. January Weiner provides a very easy and elegant implementation of this concept and data. The vignette is well written and the author is very responsive to requests and suggestions. Currently, the evidence plot is based on the receiver operating characteristic (ROC) curve. The next release will contain an option to produce an evidence plot that is the equivalent of the enrichment score plot of the popular GSEA software from the Broad Institute.

I tweaked this function a bit to allow for different weighting of the genes in the module and to make it return the indices of the leading edge subset (i.e. the core enrichment genes), which can be useful for generating heatmaps etc. tmod's evidencePlot() has the advantage of not requiring the logFC as an additional parameter.

getLeadingEdgeSubset <- function(symbol, lfc, mod, alpha=1, plot=T){

  stopifnot( length(symbol) == length(lfc) )

  genes <- getGenes(mod, mset="all")$Genes
  genes <- unique(unlist(strsplit(genes, split=",")))
  inMod <- sapply(symbol, function(x) any(x %in% genes))  ## "symbol" can be a list to allow multi-mapping probes

  N     <- length(symbol)
  N.hit <- sum(inMod)

  runningES <- NA
  runningES[  inMod ] <-  abs( lfc[ inMod ]^alpha )/sum(abs(lfc[ inMod ]^alpha))
  runningES[ !inMod ] <-  -1/(  N - N.hit )
  runningES <- cumsum(runningES)

  rank.at.max <- which.max(runningES)
  leadingEdge <- which(inMod[ 1:rank.at.max ])

 if(plot){

    data(tmod)
    title.main <- paste0(tmod$MODULES[ mod, "Title" ], " (", mod, ")", collapse="; ")

    plot(1:N, runningES, type="l",
         ylab="Enrichment Score (ES)", xlab="Ordered rank of genes", main=title.main)

    mtext(side=3, at=which(inMod), text="|", line=-1, cex=0.5)

    arrows(rank.at.max, 0, rank.at.max, max(runningES), length=0, col="blue")

    mtext(side=3, text=paste0("# genes in module:", length(genes), "                                       ",
                              "# genes found:", length(unique(unlist(symbol[inMod]))), " (", N.hit, " probes)\n"), line=-1)

    legend("bottom", bty="n",
           legend=paste0("Rank at max:", rank.at.max, "\n# probes in leading edge subset:", length(leadingEdge)))
  }

  return(leadingEdge)
}

Here is a simple example taken from the tmod vignette:

library(limma)
library(tmod)

data(Egambia)
design <- cbind(Intercept=rep(1, 30), TB=rep(c(0,1), each= 15))
E <- as.matrix(Egambia[,-c(1:3)])
fit <- eBayes( lmFit(E, design))
tt <- topTable(fit, coef=2, number=Inf, genelist=Egambia[,1:3] )
rm(design, E, fit)

evidencePlot(tt$GENE_SYMBOL, "LI.M37.0")  ## from tmod version 0.24 onwards


core <- getLeadingEdgeSubset(tt$GENE_SYMBOL, tt$logFC, "LI.M37.0", alpha=1)


I set alpha=1 as the default to match the Broad Institute's default value. Setting alpha=0 will produce exactly the same graph as evidencePlot() from tmod. Setting alpha=2 will overweight the genes in the module, which is sometimes useful for finding a smaller set of core genes.
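
For example:

core0 <- getLeadingEdgeSubset(tt$GENE_SYMBOL, tt$logFC, "LI.M37.0", alpha=0)  # same plot as tmod's evidencePlot
core2 <- getLeadingEdgeSubset(tt$GENE_SYMBOL, tt$logFC, "LI.M37.0", alpha=2)  # overweights the module genes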

You can then take these indices (core), pull out the genes and visualize them as such:

top <- Egambia[ core, -c(1:3)]
rownames(top) <- paste0( tt$GENE_SYMBOL[ core ], " (", core, ")")
status <- data.frame(status=ifelse( grepl("TB", colnames(top)), "TB", "Healthy"))

library(NMF)
library(gplots)

aheatmap(top, annCol=status,
         color=bluered(10), Rowv=NA, Colv=NA, scale="row",
         main="leading edge subset from module LI.M37.0")

 


How polymorphism(s) in probe sequences affect cis-eQTL results

In our paper, we show that the presence of polymorphisms (SNPs and indels) in probe sequences can induce serious technical artefacts (through reduced hybridization efficiency) when conducting expression QTL (eQTL) studies, especially if you are using array-based technologies. You can find the paper here: http://nar.oxfordjournals.org/content/early/2013/02/21/nar.gkt069

But for all of you busy people, the main result is summarized in Figure 1 (also attached below).

[Figure 1 from the paper]

Quick summary:

  • About 6.1% of the probes (25-mers) in the Affymetrix Human Exon 1.0 array contain polymorphisms, but these account for 50–90% of cis-eQTLs
  • About 11.7% of the probes (50-mers) in the Illumina HT12 array contain polymorphisms, but these account for 30–45% of cis-eQTLs
  • The binding efficiency of longer probes is less affected by polymorphisms (as expected), but still >30% of cis-eQTLs are false!
  • The 1000G dataset appears to be good enough to identify probes containing polymorphisms. The only way to improve on this would be to look at private mutations from exome-seq etc., but that may not be worth the effort.
  • Increasing the p-value stringency seems to make the situation worse (cis-eQTLs due to polymorphisms are very strong), so many published cis-eQTLs may suffer from this (especially if the authors have not checked for polymorphisms or used an older reference panel to identify them)

A quick tutorial on using MatrixEQTL

Andrey has many useful examples/tutorials on his MatrixEQTL page. Here I am going to demonstrate how to use MatrixEQTL using data from the UK Brain Expression Consortium (UKBEC). I have chosen the MAPT gene, which is important for various neurodegenerative disorders. The data was downloaded from the BRAINEAC website. This consortium collected DNA and RNA from ten brain regions in 134 neuropathologically normal individuals. Additionally, the mean expression profile for each individual across tissues (aveALL) was calculated.

Let us first load the R data and inspect the format. You may need to change the working directory using the setwd() command first or supply the full path.

This particular dataset contains information for 26 expression profiles: one gene-level profile (prefixed with "t") plus 25 exon-level profiles. It also contains information for 5,287 markers (SNPs and indels) located within 1 Mb of the transcription start and end sites.

load("t3723687.rda")

#####################
## Expression data ##
#####################

length(expr)
 [1] 11

names(expr)
 [1] "aveALL" "CRBL" "FCTX" "HIPP" "MEDU" "OCTX" "PUTM" "SNIG" "TCTX" "THAL" "WHMT"

dim(expr[["TCTX"]])
 [1]  26 134

head(expr[["TCTX"]])[ , 1:8]
          00_38    01_37    02_06   03_28    04_05    04_19   05_10   06_02
3723688 9.63145  9.22946  9.47970 9.47775  9.24175  9.74315 9.46278 9.76640
3723690 6.03620  5.55045  5.08430 5.09810  6.67395  4.95544 5.56240 4.67415
3723691 7.41365  7.11841  7.16574 6.39494  7.58188  6.28772 6.93655 6.34788
3723707 9.96076 10.13550 10.21800 9.70254 10.54540 10.34620 9.66656 9.74161
3723710 9.31395  9.44499  9.31668 8.51235 10.07000  9.95134 9.41852 8.65756
3723712 5.60436  5.21131  5.89260 6.17088  5.09820  6.14388 5.83319 5.94182

tail(expr[["TCTX"]])[ , 1:8]
           00_38   01_37   02_06   03_28    04_05   04_19   05_10   06_02
3723750  8.42139 8.72951 8.69241 9.06205  9.37904 9.33639 8.59345 9.15941
3723751  8.63697 8.00765 8.81346 8.64289  9.52436 8.87514 8.50429 8.56228
3723752  8.93384 8.19434 8.14235 8.14987 10.02850 9.10575 8.05589 7.93080
3723753  9.06106 9.26484 9.20654 9.38277  9.73640 9.61577 9.15330 9.17392
3723754  8.63553 8.80820 9.22138 8.64040  8.73758 9.45185 8.83666 8.96825
t3723687 8.39886 8.36485 8.36244 8.29161  8.91036 8.64564 8.28501 8.19473

dim(expr.info)
[1] 26  4

head(expr.info)
          chr    start     stop     tID
3723688 chr17 43971878 43972051 3723687
3723690 chr17 43973814 43974885 3723687
3723691 chr17 43974897 43975038 3723687
3723707 chr17 44039714 44039808 3723687
3723710 chr17 44049243 44049311 3723687
3723712 chr17 44051752 44051833 3723687

#################
## Marker data ##
#################

dim(markers)
 [1] 5287  134

head(markers)[ , 1:8]
                     00_38 01_37 02_06 03_28 04_05 04_19 05_10 06_02
chr17:42972122           2 2.000     2 2.000     2 1.417 1.927     2
chr17:42972206           2 2.000     2 2.000     2 1.000 2.000     1
chr17:42972490:GGA_G     2 2.000     2 2.000     2 1.000 1.000     2
chr17:42972648           2 2.000     2 1.000     2 2.000 2.000     2
chr17:42972690           1 1.999     0 1.996     0 2.000 1.000     1
chr17:42972960           1 2.000     0 2.000     0 2.000 1.000     1

dim(markers.info)
[1] 5287    8

head(markers.info)
                     Al1 Al2   Freq1     MAF AvgCall     Rsq       rsid  type
chr17:42972122         G   A 0.88433 0.11567 0.97738 0.84068 rs75707322   SNP
chr17:42972206         G   A 0.88433 0.11567 1.00000 1.00370  rs8069039   SNP
chr17:42972490:GGA_G   R   D 0.79364 0.20636 0.99882 0.99709        indel
chr17:42972648         G   C 0.89551 0.10449 0.99997 1.00342 rs71373525   SNP
chr17:42972690         C   G 0.62428 0.37572 0.98117 0.94353  rs8069754   SNP
chr17:42972960         G   A 0.64327 0.35673 0.99849 0.99826  rs2878972   SNP

Before we proceed, note that some tissues are missing for some individuals, as the RNA was not available or was of low quality. Using the temporal cortex (TCTX) as an example, we see that 15 individuals are missing for this tissue.

table( w <- colMeans(is.na(expr[["TCTX"]])) )
   0   1
 119  15

length( valids <- names( which(w==0) ) )
 [1] 119

head(valids)
 [1] "00_38" "01_37" "02_06" "03_28" "04_05" "04_19" ...

Note that MatrixEQTL imputes the missing values before doing the calculations. This is not ideal, as our values are not missing at random. So we will manually exclude these individuals when we convert the datasets into the MatrixEQTL format.

library(MatrixEQTL)   # version 2.0.0 - check yours with sessionInfo()

my.expr <- SlicedData$new()
my.expr$CreateFromMatrix( as.matrix( expr[["TCTX"]][ , valids ] ) )

my.markers <- SlicedData$new()
my.markers$CreateFromMatrix( as.matrix( markers[ , valids ] ) )

It is important to ensure the subjects are in the same order in both datasets before we execute MatrixEQTL:

identical( colnames(my.expr), colnames(my.markers) )
 [1] TRUE

tmpf <- tempfile()

res <- Matrix_eQTL_main( my.markers, my.expr, cvrt=SlicedData$new(),
                         output_file_name=tmpf, pvOutputThreshold=1e-5,
                         useModel=modelLINEAR, errorCovariance=numeric(0) )
unlink(tmpf)

Let us now extract and format the results a bit. There are 9,412 significant cis-eQTLs (at p < 1e-5), identified within 0.07 seconds, and here are the top 6.

out <- res$all$eqtls
out <- out[ , c("snps", "gene", "beta", "statistic", "pvalue")]
colnames(out) <- c("marker", "exprID", "beta", "t-statistic", "p-value")
dim(out)
 [1] 9412    5

res$time.in.sec
 elapsed
    0.07

head(out)
           marker  exprID       beta t-statistic      p-value
 1 chr17:44061025 3723751  0.3759085    5.761778 6.865336e-08
 2 chr17:44067382 3723751  0.2699755    5.349497 4.439403e-07
 3 chr17:44060492 3723751  0.5931873    5.335221 4.729509e-07
 4 chr17:44762589 3723707 -0.1788872   -5.213412 8.085859e-07
 5 chr17:44066464 3723751  0.2720258    5.128588 1.169849e-06
 6 chr17:44059497 3723751  0.2703483    5.121740 1.205076e-06

Now let us try the traditional approach.

ptm <- proc.time()

TCTX <- expr[["TCTX"]]

big.out <- NULL

for(i in 1:nrow(TCTX)){

 print(i)
 my.expr <- as.numeric( TCTX[i, ] )

 for(j in 1:nrow(markers)){
   fit <- lm( my.expr ~ as.numeric(markers[j, ]) )
   out <- coef( summary(fit) )[ 2, c("Estimate", "Std. Error", "Pr(>|t|)")]

   if( out[ "Pr(>|t|)" ] < 1e-5 ){
     out <- c( rownames(markers)[j], rownames(TCTX)[i], out )
     big.out <- rbind(big.out, out)
   }

  rm(j, fit, out)
 }
 rm(i, my.expr)
}

colnames(big.out) &lt;- c("marker", "exprID", "Estimate", "SE", "pvalue")
big.out &lt;- big.out[ order(as.numeric(big.out[ , "pvalue"])), ]
rownames(big.out) &lt;- NULL

proc.time() - ptm
   user  system elapsed
 522.83    0.06  528.19

Wow, that took over 8 minutes to complete vs. 0.07 seconds for MatrixEQTL. Sure, I could optimize the code above using apply() etc., but it still would not be as fast as MatrixEQTL.
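
For instance, the inner loop for a single expression profile could be rewritten along these lines (a sketch; still nowhere near MatrixEQTL's speed):

## apply()-style rewrite of the inner loop, for the first expression profile only
my.expr <- as.numeric( TCTX[1, ] )
pvals   <- apply(markers, 1, function(g) {
  coef( summary( lm(my.expr ~ as.numeric(g)) ) )[ 2, "Pr(>|t|)" ]
})
head( sort(pvals) )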

More importantly, the results are the same.

dim(big.out)
 [1] 9412    5

head(big.out)
      marker           exprID    Estimate             SE                   pvalue
 [1,] "chr17:44061025" "3723751" "0.375908482121651"  "0.0652417514030518" "6.86533571244123e-08"
 [2,] "chr17:44067382" "3723751" "0.269975506772536"  "0.0504674529615548" "4.43940323821489e-07"
 [3,] "chr17:44060492" "3723751" "0.593187349664759"  "0.111183272810956"  "4.72950945585888e-07"
 [4,] "chr17:44762589" "3723707" "-0.178887231952771" "0.0343128918690251" "8.08585850844625e-07"
 [5,] "chr17:44066464" "3723751" "0.272025753138588"  "0.0530410575681518" "1.16984903179711e-06"
 [6,] "chr17:44059497" "3723751" "0.270348263447638"  "0.052784452164607"  "1.20507597159997e-06"

SNP vs. genotype definition

Recently, a friend asked about the difference between a SNP and a genotype. After some babbling on my part, here is a summary.

SNP: a position in the genome that carries a mutation in the population
Genotype: the pair of alleles an individual carries at a given SNP

Typically, we only call a position a SNP if the mutation is seen above a certain frequency (typically > 0.1%) across the population. Otherwise, it is known as a private mutation.

Confused? Here is a simple analogy. A metabolic disease called diabetes exists, but not everyone has diabetes; a medical diagnosis is required to determine the diabetic status of an individual.

Similarly, genotyping is the process of determining the alleles at a given SNP. Here is an example for three individuals:

Position   Individual 1   Individual 2   Individual 3
1          A/A            A/A            A/A
2          C/C            C/C            C/C
3          G/G            G/G            G/G
4          C/T            T/T            C/C

Nucleotides at positions 1 to 3 are identical across all individuals and are therefore not considered SNPs in this sample. The nucleotide at position 4 is a SNP. The genotype for Individual 1 at position 4 is C/T.

More generally, in the human genome there are approximately 40 million SNPs, which represent about 1.25% of the genome. In other words, the genomes of two unrelated individuals are identical at around 99% of the nucleotides.