Using the scale=”row” or scale=”column” in the pheatmap()
function looks like an easy option but we recently realized there is subtle conceptual assumption that can be misleading.
TLDR: The function actually scales the data before calculating the distance metrics and performing the clustering. Problem persists when using correlation-based distance or Euclidean distance. It also persists when using Heatmap() function from ComplexHeatmap. The better solution is to to calculate and feed in the row and column clustering (jump to Section 9 for solution).
1. Load in demo data
We will be using the Wisconsin diagnostic breast cancer (WDBC) data. This dataset is for fine needle aspirate of breast lump samples from 569 patients who were labelled as either malignant (M, n=212) or benign (B, n=357). Let’s load the data and save the Diagnosis column as character and color coding.
pacman::p_load(tidyverse, ClassDiscovery, pheatmap, ComplexHeatmap)
rm(list=ls())
data(wdbc, package = "mclust")
dim(wdbc)
# 569 32
## Diagnosis column and color mapping
Diagnosis <- wdbc %>%
column_to_rownames("ID") %>%
mutate(Diagnosis = as.character(Diagnosis),
Diagnosis_color = recode(Diagnosis, "M"="red", "B"="blue")) %>%
select(Diagnosis, Diagnosis_color)
tail(Diagnosis, 3)
# Diagnosis Diagnosis_color
# 926954 M red
# 927241 M red
# 92751 B blue
For each sample, ten different histopathological features of the cell nuclei was obtained from digitized images. The mean, sd and extreme values of the features were recorded but we will only use the mean values for the demonstration. I have also transposed the data as I want to view features as rows and samples as columns in the heatmap.
## Select only the mean values for the 10 histopathological factors
df <- wdbc %>%
column_to_rownames("ID") %>%
select(ends_with("_mean")) %>%
rename_with(~gsub("_mean", "", .)) %>% # Optional, for shorter names
t()
dim(df)
# 10 569
df[ , 567:569]
# 926954 927241 92751
# Radius 16.60000 2.060e+01 7.76000
# Texture 28.08000 2.933e+01 24.54000
# Perimeter 108.30000 1.401e+02 47.92000
# Area 858.10000 1.265e+03 181.00000
# Smoothness 0.08455 1.178e-01 0.05263
# Compactness 0.10230 2.770e-01 0.04362
# Concavity 0.09251 3.514e-01 0.00000
# Nconcave 0.05302 1.520e-01 0.00000
# Symmetry 0.15900 2.397e-01 0.15870
# Fractaldim 0.05648 7.016e-02 0.05884
rm(wdbc)
2. Generate the row and column clustering individually
hc.features <- df %>% t() %>%
distanceMatrix(metric="pearson") %>%
hclust(method="average")
hc.samples <- df %>%
distanceMatrix(metric="pearson") %>%
hclust(method="average")
par(mfrow=c(1, 2))
par(mar=c(3,1,1,5))
plot(as.dendrogram(hc.features), ylab="", sub="", horiz=T)
par(mar=c(5, 2, 4, 2) + 0.1)
plotColoredClusters(hc.samples,
labs = colnames(df),
cols = Diagnosis %>% pull(Diagnosis_color),
ylab="", sub="", cex=0.5)

Keep an eye on the figures above. These are our target plots for confirmation.
3. Using pheatmap() function without scaling
One thing to note is that the function pheatmap()
can be found in the pheatmap
and ComplexHeatmap
packages. So we need to be explicit on which pheatmap()
we are calling.
pheatmap::pheatmap(df,
scale = "none", # the default value anyway
annotation_col = Diagnosis %>% select(Diagnosis),
show_colnames = FALSE,
clustering_distance_rows = "correlation",
clustering_distance_cols = "correlation",
clustering_method = "average")

You can see that the column clustering is identical to independently produced clustering in Section 2. Ditto for row clustering (except order is reversed which can be ignored). The scale within the body of the heatmap is ugly because the variables are on different scales. E.g. Area ranges from 143 to 2501 while Fractaldim ranges from 0.05 to 0.10.
4. Using pheatmap() function with scaling
I assumed using the scale="row"
argument in pheatmap()
would make life easy but …
pheatmap::pheatmap(df,
scale = "none", # the default value anyway
annotation_col = Diagnosis %>% select(Diagnosis),
show_colnames = FALSE,
clustering_distance_rows = "correlation",
clustering_distance_cols = "correlation",
clustering_method = "average")

The body of the heatmap looks fine but the column clustering is very different to the clustering in Section 2. The row cluster is unaffected here because correlation does an internal scaling beforehand. However, if we are using Euclidean distance, the row cluster would also look very different (see Section 8).
5. Reason for the difference
The data is scaled before the hierarchical clustering of rows and columns are generated. To demonstrate this, let’s scale the features/columns to have zero mean and unit variance and then run the clustering on it.
scaled_by_feature <- df %>% t() %>% scale() %>% t()
dim(scaled_by_feature)
# [1] 10 569
apply(scaled_by_feature, 1, mean) # close enought to zero
# Radius Texture Perimeter Area Smoothness
# -1.379776e-16 6.157678e-17 -1.190559e-16 1.219971e-16 1.618800e-16
# Compactness Concavity Nconcave Symmetry Fractaldim
# -7.605173e-17 3.929080e-17 -5.337457e-17 1.679646e-16 4.810260e-16
apply(scaled_by_feature, 1, var)
# Radius Texture Perimeter Area Smoothness
# 1 1 1 1 1
# Compactness Concavity Nconcave Symmetry Fractaldim
# 1 1 1 1 1
scaled_by_feature %>%
distanceMatrix(metric="pearson") %>%
hclust(method="average") %>%
plotColoredClusters(labs = colnames(df),
cols = Diagnosis %>% pull(Diagnosis_color),
ylab="", sub="", cex=0.5)

We have reproduced the undesired clustering in Section 4.
6. Fix for pheatmap()
One way to fix this is to feed in the independent clusterings into pheatmap using cluster_rows and cluster_cols arguments. You no longer need to set clustering_distance_rows
, clustering_distance_cols
and clustering_method
arguments.
pheatmap::pheatmap(df,
scale = "row",
annotation_col = Diagnosis %>% select(Diagnosis),
show_colnames = FALSE,
cluster_rows = hc.features,
cluster_cols = hc.samples)

7. What about ComplexHeatmap?
The function Heatmap()
from ComplexHeatmap package does not have an option to scale. You will have to feed in the scaled input manually (will use the input from Section 5) but you can see that we end up with the same problem.
column_ha <- HeatmapAnnotation(Diagnosis = Diagnosis %>% pull(Diagnosis),
col = list(Diagnosis = c("M" = "red", "B" = "blue")),
which = "column")
Heatmap(scaled_by_feature,
top_annotation = column_ha,
clustering_distance_rows = "pearson",
clustering_method_rows = "average",
clustering_distance_columns = "pearson",
clustering_method_columns = "average",
show_column_names = FALSE)

We can apply the same fix in Section 6. However, the argument names are slightly different. i.e. cluster_columns
instead of cluster_cols
.
Heatmap(scaled_by_feature,
cluster_rows = hc.features,
cluster_columns = hc.samples,
top_annotation = column_ha,
show_column_names = FALSE)

8. What about Euclidean distance?
The situation is worse as the row clustering will also be affected. To demonstrate this, change the phrase “pearson” and “correlation” in the codes above to “euclidean”.
9. Summary
Calculate the row and column clusterings (see Section 2) and feed the clusterings into pheatmap()
function (see Section 6) or Heatmap()
function. In short:
hc.features <- df %>% t() %>%
distanceMatrix(metric="pearson") %>%
hclust(method="average")
hc.samples <- df %>%
distanceMatrix(metric="pearson") %>%
hclust(method="average")
pheatmap::pheatmap(df,
scale = "row",
annotation_col = Diagnosis %>% select(Diagnosis),
show_colnames = FALSE,
cluster_rows = hc.features,
cluster_cols = hc.samples)