Download 1000 Genomes Phase3 and calculate allele frequencies

Here is some code to download the data from the 1000 Genomes Phase 3 FTP site onto your own server and to calculate allele frequencies for the European populations. First, some setup: the panel file tells you which population and super-population each sample belongs to.

FTP_SITE=ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502
wget $FTP_SITE/integrated_call_samples_v3.20130502.ALL.panel

Next we will download each chromosome (I am ignoring the Y and MT chromosomes here). Alternatively, you can download all the files on the FTP site using wget -r $FTP_SITE, but I preferred to download each one separately. Note that chromosome X is based on version 1b and not v5a like the autosomes, so it has to be downloaded separately. I am renaming it for convenience later.

for CHR in `seq 1 22`; do
   FILE=$FTP_SITE/ALL.chr$CHR.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
   wget $FILE $FILE.tbi
   sleep 60
done

FILE=$FTP_SITE/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz
wget $FILE $FILE.tbi
rename -v 's/v1b/v5a/' *

The next step is to identify the Europeans and calculate the allele frequencies. I am also excluding any variants that have a call rate < 95% or fail Hardy-Weinberg equilibrium at p < 1e-6. You can do this with vcftools or other software, but I am most familiar with PLINK. If you are using PLINK, please use version 1.90 (https://www.cog-genomics.org/plink2) or above, as it is much faster than previous versions.

grep EUR integrated_call_samples_v3.20130502.ALL.panel | cut -f1 > EUR.id  ## 503 Europeans

for CHR in `seq 1 23 | sed 's/23/X/'`; do
   FILE=ALL.chr$CHR.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
   plink --vcf $FILE --keep-fam EUR.id --geno 0.05 --hwe 1e-6 --make-bed --out tmp
   plink --bfile tmp --freq --out EUR_$CHR
   rm EUR_$CHR.{log,nosex} tmp.*
done

awk 'FNR==1 && NR!=1 {next} {print $1, $2, $3, $4, $5}' EUR_*.frq > EUR.freq  ## keep one header; .frq files are whitespace-padded, so use awk rather than cut -f
rm EUR_*.frq

You may also find pre-computed allele frequencies per population elsewhere, but my purpose here is to show how to download and run some calculations on the VCF files.
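
A quick sanity check of the combined file in R (assuming the five-column layout produced by the awk command above):

freq <- read.table("EUR.freq", header=TRUE, as.is=TRUE)  ## columns: CHR SNP A1 A2 MAF
head(freq)
nrow(freq)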

Leading edge subset from Gene Set Enrichment Analysis (GSEA) using blood transcription modules

Recently, I have been very taken with the modular analysis developed by Damien Chaussabel and colleagues (see the original paper or their recent review paper) for the analysis of whole blood data.

A recent R package called tmod from Dr. January Weiner provides a very easy and elegant implementation of this concept and data. The vignette is well written and the author is very responsive to requests and suggestions. Currently, the evidence plot is based on the receiver operating characteristic (ROC) curve. The next release will contain an option to produce an evidence plot that is the equivalent of the enrichment score plot of the popular GSEA software from the Broad Institute.

I tweaked this function a bit to allow for different weighting of the genes in the module and to return the indices of the leading edge subset (i.e. the core enrichment genes), which can be useful for generating heatmaps etc. Note that tmod's evidencePlot has the advantage of not requiring the logFC as an additional parameter.

getLeadingEdgeSubset <- function(symbol, lfc, mod, alpha=1, plot=T){

  stopifnot( length(symbol) == length(lfc) )

  genes <- getGenes(mod, mset="all")$Genes
  genes <- unique(unlist(strsplit(genes, split=",")))
  inMod <- sapply(symbol, function(x) any(x %in% genes))  ## "symbol" can be a list to allow multi-mapping probes

  N     <- length(symbol)
  N.hit <- sum(inMod)

  ## Running ES: in-module genes add weight |logFC|^alpha (normalized to sum
  ## to 1); genes outside the module subtract 1/(N - N.hit)
  runningES <- rep(NA_real_, N)
  runningES[  inMod ] <-  abs( lfc[ inMod ] )^alpha / sum( abs( lfc[ inMod ] )^alpha )
  runningES[ !inMod ] <- -1/( N - N.hit )
  runningES <- cumsum(runningES)

  ## Leading edge = in-module genes ranked at or before the ES maximum
  rank.at.max <- which.max(runningES)
  leadingEdge <- which(inMod[ 1:rank.at.max ])

  if(plot){

    data(tmod)
    title.main <- paste0(tmod$MODULES[ mod, "Title" ], " (", mod, ")", collapse="; ")

    plot(1:N, runningES, type="l",
         ylab="Enrichment Score (ES)", xlab="Ordered rank of genes", main=title.main)

    mtext(side=3, at=which(inMod), text="|", line=-1, cex=0.5)

    arrows(rank.at.max, 0, rank.at.max, max(runningES), length=0, col="blue")

    mtext(side=3, text=paste0("# genes in module:", length(genes), "                                       ",
                              "# genes found:", length(unique(unlist(symbol[inMod]))), " (", N.hit, " probes)\n"), line=-1)

    legend("bottom", bty="n",
           legend=paste0("Rank at max:", rank.at.max, "\n# probes in leading edge subset:", length(leadingEdge)))
  }

  return(leadingEdge)
}

Here is a simple example taken from the tmod vignette:

library(limma)
library(tmod)

data(Egambia)
design <- cbind(Intercept=rep(1, 30), TB=rep(c(0,1), each= 15))
E <- as.matrix(Egambia[,-c(1:3)])
fit <- eBayes( lmFit(E, design))
tt <- topTable(fit, coef=2, number=Inf, genelist=Egambia[,1:3] )
rm(design, E, fit)

evidencePlot(tt$GENE_SYMBOL, "LI.M37.0")  ## from tmod version 0.24 onwards

[Figure: evidence plot for module LI.M37.0 produced by tmod's evidencePlot]

core <- getLeadingEdgeSubset(tt$GENE_SYMBOL, tt$logFC, "LI.M37.0", alpha=1)

[Figure: enrichment score plot for module LI.M37.0 produced by getLeadingEdgeSubset, with the leading edge indicated]

I set alpha=1 as the default to match the Broad Institute's default value. Setting alpha=0 will produce exactly the same graph as evidencePlot from tmod. Setting alpha=2 will overweight the genes in the module, which is sometimes useful for finding a smaller set of core genes.
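
For example, to see the effect of the weighting on the same ranked list:

core0 <- getLeadingEdgeSubset(tt$GENE_SYMBOL, tt$logFC, "LI.M37.0", alpha=0)  ## same plot as tmod's evidencePlot
core2 <- getLeadingEdgeSubset(tt$GENE_SYMBOL, tt$logFC, "LI.M37.0", alpha=2)  ## overweights in-module genes
length(core0); length(core2)  ## compare the sizes of the core sets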

You can then take these indices ("core"), pull out the genes and visualize them as such:

top <- Egambia[ rownames(tt)[core], -c(1:3)]  ## tt is sorted by significance, so map back to Egambia via rownames
rownames(top) <- paste0( tt$GENE_SYMBOL[ core ], " (", core, ")")
status <- data.frame(status=ifelse( grepl("TB", colnames(top)), "TB", "Healthy"))

library(NMF)
library(gplots)

aheatmap(top, annCol=status,
         color=bluered(10), Rowv=NA, Colv=NA, scale="row",
         main="leading edge subset from module LI.M37.0")

 

[Figure: heatmap of the leading edge subset from module LI.M37.0]

How polymorphism(s) in probe sequences affect cis-eQTL results

In our paper, we show that the presence of polymorphisms (SNPs and indels) in probe sequences can induce a serious technical artefact (through reduced hybridization efficiency) when conducting expression QTL (eQTL) studies, especially if you are using array-based technologies. You can find the paper here: http://nar.oxfordjournals.org/content/early/2013/02/21/nar.gkt069

But for all of you busy people, the main result is summarized in Figure 1 (also attached below).

[Figure 1 from the paper, summarizing the main results]

Quick summary:

  • About 6.1% of the probes (25-mers) in the Affymetrix Human Exon 1.0 array contain polymorphisms, but these probes account for 50–90% of cis-eQTLs
  • About 11.7% of the probes (50-mers) in the Illumina HT12 array contain polymorphisms, but these probes account for 30–45% of cis-eQTLs
  • The binding efficiency of longer probes is less affected by polymorphisms (as expected), but still > 30% of cis-eQTLs are false!
  • The 1000G dataset appears to be good enough to identify probes containing polymorphisms (see the sketch after this list). The only way to possibly improve on this is to look at private mutations from exome-seq etc., but it may not be worth the effort.
  • Increasing the p-value stringency seems to make the situation worse (cis-eQTLs due to polymorphisms are very strong), so many published cis-eQTLs may suffer from this (especially if the authors have not done this check or used an older reference panel to identify polymorphisms)
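
As a rough illustration of the fourth point, here is a minimal sketch of how one might flag probes that overlap 1000G variants. It assumes you already have the probe genomic coordinates in a BED file (probes.bed is hypothetical) and a 1000G VCF for the relevant chromosome, and that bedtools is installed:

# Hypothetical sketch: report each probe whose genomic footprint overlaps at
# least one 1000G variant; probes.bed holds chrom, start, end, probeID.
bedtools intersect -a probes.bed -b ALL.chr17.vcf.gz -u > probes_with_polymorphisms.bed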

A quick tutorial on using MatrixEQTL

Andrey Shabalin has many useful examples/tutorials on his MatrixEQTL page. Here I am going to demonstrate how to use MatrixEQTL on data from the UK Brain Expression Consortium (UKBEC). I have chosen the MAPT gene, which is important for various neurodegenerative disorders. The data was downloaded from the BRAINEAC website (link to come) and can be downloaded from here. This consortium collected DNA and RNA from ten brain regions in 134 neuropathologically normal individuals. Additionally, the mean expression profile for each individual across tissues (aveALL) was calculated.

Let us first load the R data and inspect the format. You may need to change the working directory using the setwd() command first or supply the full path.

This particular dataset contains information for 26 expression profiles: one representing the gene-level profile (prefixed with "t") plus 25 representing the exon-level profiles. It also contains information for 5,287 markers (SNPs and indels) located within 1Mb of the transcription start and end sites.

load("t3723687.rda")

#####################
## Expression data ##
#####################

length(expr)
 [1] 11

names(expr)
 [1] "aveALL" "CRBL" "FCTX" "HIPP" "MEDU" "OCTX" "PUTM" "SNIG" "TCTX" "THAL" "WHMT"

dim(expr[["TCTX"]])
 [1]  26 134

head(expr[["TCTX"]])[ , 1:8]
          00_38    01_37    02_06   03_28    04_05    04_19   05_10   06_02
3723688 9.63145  9.22946  9.47970 9.47775  9.24175  9.74315 9.46278 9.76640
3723690 6.03620  5.55045  5.08430 5.09810  6.67395  4.95544 5.56240 4.67415
3723691 7.41365  7.11841  7.16574 6.39494  7.58188  6.28772 6.93655 6.34788
3723707 9.96076 10.13550 10.21800 9.70254 10.54540 10.34620 9.66656 9.74161
3723710 9.31395  9.44499  9.31668 8.51235 10.07000  9.95134 9.41852 8.65756
3723712 5.60436  5.21131  5.89260 6.17088  5.09820  6.14388 5.83319 5.94182

tail(expr[["TCTX"]])[ , 1:8]
           00_38   01_37   02_06   03_28    04_05   04_19   05_10   06_02
3723750  8.42139 8.72951 8.69241 9.06205  9.37904 9.33639 8.59345 9.15941
3723751  8.63697 8.00765 8.81346 8.64289  9.52436 8.87514 8.50429 8.56228
3723752  8.93384 8.19434 8.14235 8.14987 10.02850 9.10575 8.05589 7.93080
3723753  9.06106 9.26484 9.20654 9.38277  9.73640 9.61577 9.15330 9.17392
3723754  8.63553 8.80820 9.22138 8.64040  8.73758 9.45185 8.83666 8.96825
t3723687 8.39886 8.36485 8.36244 8.29161  8.91036 8.64564 8.28501 8.19473

dim(expr.info)
[1] 26  4

head(expr.info)
          chr    start     stop     tID
3723688 chr17 43971878 43972051 3723687
3723690 chr17 43973814 43974885 3723687
3723691 chr17 43974897 43975038 3723687
3723707 chr17 44039714 44039808 3723687
3723710 chr17 44049243 44049311 3723687
3723712 chr17 44051752 44051833 3723687

#################
## Marker data ##
#################

dim(markers)
 [1] 5287  134

head(markers)[ , 1:8]
                     00_38 01_37 02_06 03_28 04_05 04_19 05_10 06_02
chr17:42972122           2 2.000     2 2.000     2 1.417 1.927     2
chr17:42972206           2 2.000     2 2.000     2 1.000 2.000     1
chr17:42972490:GGA_G     2 2.000     2 2.000     2 1.000 1.000     2
chr17:42972648           2 2.000     2 1.000     2 2.000 2.000     2
chr17:42972690           1 1.999     0 1.996     0 2.000 1.000     1
chr17:42972960           1 2.000     0 2.000     0 2.000 1.000     1

dim(markers.info)
[1] 5287    8

head(markers.info)
                     Al1 Al2   Freq1     MAF AvgCall     Rsq       rsid  type
chr17:42972122         G   A 0.88433 0.11567 0.97738 0.84068 rs75707322   SNP
chr17:42972206         G   A 0.88433 0.11567 1.00000 1.00370  rs8069039   SNP
chr17:42972490:GGA_G   R   D 0.79364 0.20636 0.99882 0.99709        indel
chr17:42972648         G   C 0.89551 0.10449 0.99997 1.00342 rs71373525   SNP
chr17:42972690         C   G 0.62428 0.37572 0.98117 0.94353  rs8069754   SNP
chr17:42972960         G   A 0.64327 0.35673 0.99849 0.99826  rs2878972   SNP

Before we proceed, note that some tissues are missing for some individuals because the RNA was not available or of low quality. Using the temporal cortex (TCTX) as an example, we see that 15 individuals are missing for this tissue.

table( w <- colMeans(is.na(expr[["TCTX"]])) )
   0   1 
 119  15

length( valids <- names( which(w==0) ) )
 [1] 119

head(valids)
 [1] "00_38" "01_37" "02_06" "03_28" "04_05" "04_19" ...

Note that MatrixEQTL imputes the missing values before doing the calculations. This is not ideal, as our values are not missing at random. So we will need to manually exclude these individuals when we convert these datasets into the MatrixEQTL format.

library(MatrixEQTL)   # version 2.0.0 - check yours with sessionInfo()

my.expr <- SlicedData$new()
my.expr$CreateFromMatrix( as.matrix( expr[["TCTX"]][ , valids ] ) )
 
my.markers <- SlicedData$new()
my.markers$CreateFromMatrix( as.matrix( markers[ , valids ] ) )

It is important to ensure the datasets have the subjects in the same order before we proceed to run MatrixEQTL:

identical( colnames(my.expr), colnames(my.markers) )
 [1] TRUE

tmpf <- tempfile()
 
res <- Matrix_eQTL_main( my.markers, my.expr, cvrt=SlicedData$new(),
                         output_file_name=tmpf, pvOutputThreshold=1e-5,
                         useModel=modelLINEAR, errorCovariance=numeric(0) )
unlink(tmpf)

Let us now extract and format the results a bit. There were 9,412 significant cis-eQTLs (at p < 1e-5) identified within 0.07 seconds; here are the top 6.

out <- res$all$eqtls
out <- out[ , c("snps", "gene", "beta", "statistic", "pvalue")]
colnames(out) <- c("marker", "exprID", "beta", "t-statistic", "p-value")
dim(out)
 [1] 9412    5

res$time.in.sec
 elapsed
    0.07

head(out)
           marker  exprID       beta t-statistic      p-value
 1 chr17:44061025 3723751  0.3759085    5.761778 6.865336e-08
 2 chr17:44067382 3723751  0.2699755    5.349497 4.439403e-07
 3 chr17:44060492 3723751  0.5931873    5.335221 4.729509e-07
 4 chr17:44762589 3723707 -0.1788872   -5.213412 8.085859e-07
 5 chr17:44066464 3723751  0.2720258    5.128588 1.169849e-06
 6 chr17:44059497 3723751  0.2703483    5.121740 1.205076e-06

Now let us try the traditional approach: a double loop over every expression profile and marker using lm().

ptm <- proc.time()

TCTX <- expr[["TCTX"]]

big.out <- NULL

for(i in 1:nrow(TCTX)){

 print(i)
 my.expr <- as.numeric( TCTX[i, ] )

 for(j in 1:nrow(markers)){
   fit <- lm( my.expr ~ as.numeric(markers[j, ]) )
   out <- coef( summary(fit) )[ 2, c("Estimate", "Std. Error", "Pr(>|t|)")]

   if( out[ "Pr(>|t|)" ] < 1e-5 ){
     out <- c( rownames(markers)[j], rownames(TCTX)[i], out )
     big.out <- rbind(big.out, out)
   }
 
  rm(j, fit, out)
 }
 rm(i, my.expr)
}

colnames(big.out) <- c("marker", "exprID", "Estimate", "SE", "pvalue")
big.out <- big.out[ order(as.numeric(big.out[ , "pvalue"])), ]
rownames(big.out) <- NULL

proc.time() - ptm
   user  system elapsed 
 522.83    0.06  528.19 

Wow, that took over 8 minutes to complete vs. 0.07 seconds for MatrixEQTL. Sure, I could optimize the code above using apply() etc., but it still wouldn't be as fast as MatrixEQTL.
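
For the curious, here is a minimal vectorized sketch of why this speed-up is possible: with a single predictor and no covariates, the slope t-statistic is a function of the correlation alone, so all 26 x 5,287 tests reduce to one cor() call. This sketch assumes the marker dosages are complete for the 119 retained individuals:

## All pairwise correlations between expression profiles and markers
r     <- cor( t(as.matrix(TCTX[ , valids])), t(as.matrix(markers[ , valids])) )  ## 26 x 5287
df    <- length(valids) - 2
tstat <- r * sqrt( df / (1 - r^2) )                  ## slope t-statistic of simple regression
pval  <- 2 * pt( abs(tstat), df, lower.tail=FALSE )
sum( pval < 1e-5 )                                   ## should agree with nrow(out) above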

More importantly, the results are the same.

dim(big.out)
 [1] 9412    5

head(big.out)
      marker           exprID    Estimate             SE                   pvalue
 [1,] "chr17:44061025" "3723751" "0.375908482121651"  "0.0652417514030518" "6.86533571244123e-08"
 [2,] "chr17:44067382" "3723751" "0.269975506772536"  "0.0504674529615548" "4.43940323821489e-07"
 [3,] "chr17:44060492" "3723751" "0.593187349664759"  "0.111183272810956"  "4.72950945585888e-07"
 [4,] "chr17:44762589" "3723707" "-0.178887231952771" "0.0343128918690251" "8.08585850844625e-07"
 [5,] "chr17:44066464" "3723751" "0.272025753138588"  "0.0530410575681518" "1.16984903179711e-06"
 [6,] "chr17:44059497" "3723751" "0.270348263447638"  "0.052784452164607"  "1.20507597159997e-06"

SNP vs. genotype definition

Recently, a friend asked about the difference between a SNP and a genotype. After some babbling on my part, here is a summary.

SNP: a position in the genome that varies between individuals (i.e. carries a mutation)
Genotype: the pair of alleles an individual carries at a given SNP

Typically, we only call a position a SNP if the mutation is seen above a certain frequency (typically > 0.1%) across the population. Otherwise, it is known as a private mutation.

Confused? Here is a simple analogy. A metabolic disease called diabetes exists but not everyone has diabetes. A medical diagnosis is required to determine the diabetic status of an individual.

Similarly, genotyping is the process of determining the alleles at a SNP. Here is an example for three individuals:

Position   Individual 1   Individual 2   Individual 3
1          A/A            A/A            A/A
2          C/C            C/C            C/C
3          G/G            G/G            G/G
4          C/T            T/T            C/C

Nucleotides in positions 1–3 are identical across all individuals and are therefore not considered SNPs in this sample. The nucleotide in position 4 is a SNP. The genotype for Individual 1 at position 4 is C/T.

More generally, in the human genome, there are approximately 40 million SNPs which represent about 1.25% of the genome. In other words, the genomes of two unrelated individuals are identical at around 99% of the nucleotides.
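
The arithmetic behind these figures, assuming a genome size of roughly 3.2 billion bases:

40e6 / 3.2e9  ## ~0.0125, i.e. about 1.25% of positions are known SNPs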

Compare timing with different compression levels in save()

R has a very useful save() function which allows you to save multiple objects as R objects. This means that you can save a dataframe with all the column classes intact (factors as factors etc.) instead of reading it from an ASCII file and then worrying about these. It can also save more complicated objects such as lists and arrays. These objects can be read back into R using load().

There are various options to make the output file smaller, which has been useful to me. The default values for save() are compress=TRUE and ascii=FALSE; compress=TRUE corresponds to gzip compression at level 6. Higher compression can be achieved by specifying compress="bzip2" (note that the non-standard save in the test below also sets ascii=TRUE).
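
If you want finer control, save() also exposes the level directly via the compression_level argument, which defaults to 6 for gzip and 9 for bzip2 or xz. A one-line sketch (myObject is a placeholder for any R object):

## Explicitly request the maximum gzip level; myObject is hypothetical
save( myObject, file="gzip9.rda", compress="gzip", compression_level=9 )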

I have been using the highest compression as it saves hard disk space, but until now I had not considered the time it takes to create and load objects with different compression levels. Here is my code to test this:

N <- 100
mat <- matrix( rnorm(N*N), nc=N )

system.time( save( mat, file="standard.rda" ) )
system.time( save( mat, file="nonStandard.rda", compress="bzip2", ascii=T) )

system.time( load("standard.rda") )
system.time( load("nonstandard.rda") )

Here are the timings (in seconds) for various values of N:

N       save() standard   save() non-standard   load() standard   load() non-standard
100     0.02              0.03                  ~0                0.11
1,000   1.82              3.51                  0.03              7.13
5,000   11.60             90.10                 1.45              181.00

So the penalty for higher compression becomes much more noticeable with larger datasets. I should factor this in, along with the file size generated and how often I will need to load the dataset.
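
The resulting file sizes are easy to check from within R, to weigh against the timings above:

file.info( c("standard.rda", "nonStandard.rda") )$size / 1024^2  ## sizes in MB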

How to calculate linkage disequilibrium using VCF of the latest 1000 Genomes

There are a few websites that allow you to calculate the LD between SNPs of interest.

  • The Broad Institute's SNAP pairwise LD is the easiest to use, but its latest reference dataset is the 1000 Genomes Pilot 1, which is several years old (released Feb 2009), so many of the newer rs IDs are not found.
  • Ensembl. Enter your rs ID or chr:pos (e.g. rs682767 or "13:101706629..101706629") and then click on Linkage Disequilibrium (red triangle). After a couple of minutes, you should get a table where you can choose the population and 1000G version. Click the corresponding "LD (table)" to get the R2 and D'. Good for checking the LD structure of different populations or for historical analysis of how LD structure changed over the years. Not feasible for repeated applications.
  • Haploview is another popular program, but I am uncertain how up-to-date its reference data is. Not feasible for repeated applications.
  • Our web tool LD Calculator, created by Gigaloluwa Peter Ilori, uses the latest 1000 Genomes (March 2012) haplotypes. It has no visualization and needs a bit more work (e.g. it does not notify you of SNPs not found in the database), but it is a good alternative to consider.

To enable others to work with VCFs (Variant Call Format files) from newer releases or different populations, I am providing the bare-bones code behind our website:

Step 1: Download the VCF files of reference population of interest

This is a one-time process and can take at least 30 minutes. You can get the latest VCFs from the 1000 Genomes project FTP site here, which gives data for ALL 1000G individuals (N=1,092).

However, I prefer to download from Yun Li's download page, where I can download, say, just the European subset (N=379).

wget ftp://share.sph.umich.edu/1000genomes/fullProject/2012.03.14/phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.EUR.vcf.gz.tgz

tar xvfz phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.EUR.vcf.gz.tgz

Step 2: Convert from rsids to chr:pos (optional)

This is optional, but I would recommend converting all rs IDs to chr:pos format, because rs IDs can be changed or even retired; you will gain a small amount of identifiability against the 1000 Genomes if you do this. For a small number of SNPs, you can convert using BioMart as follows (a programmatic alternative with the biomaRt package is sketched after this list):

  1. Go to http://www.ensembl.org/biomart/martview/ 
  2. Choose “Ensembl Variation 70”
  3. Choose “Homo sapiens Short Variation (SNPs and indels) (GRCh37.p10)”
  4. Then click “Filters” -> “General Variation Filters” -> “Filter by Variation Name” and enter the rsids
  5. Then click on "Results" (top left, after New and Count).
  6. Tick Unique Results and Export to TSV format file.
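
Here is a minimal sketch of the same lookup done programmatically, assuming the biomaRt Bioconductor package is installed; the mart, attribute and filter names follow the Ensembl SNP mart conventions:

library(biomaRt)
snpmart <- useMart("ENSEMBL_MART_SNP", dataset="hsapiens_snp")
## chr:pos lookup for a couple of example rs IDs
getBM(attributes=c("refsnp_id", "chr_name", "chrom_start"),
      filters="snp_filter", values=c("rs682767", "rs8069039"), mart=snpmart)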

Step 3: Extract to make a smaller VCF

I am going to assume all of the SNPs are located on a single chromosome, say chr16. First, save the positions from the BioMart step above into a file called "pos.txt". Make sure it is sorted numerically (use sort -n in Unix) and contains no redundancies.

If your list contains SNPs from several chromosomes, it’s very easy to do a for loop in bash.

BigFile=chr16.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.EUR.vcf.gz

zcat $BigFile | head -28 > top.part     # the first 28 lines of this particular file are the VCF header
zfgrep -w --file=pos.txt $BigFile > bottom.part
wc -l bottom.part

cut -f2 bottom.part | diff pos.txt -    # shows the missing SNPs (if any)

cat top.part bottom.part > mini.vcf

rm top.part bottom.part

The diff command above shows you which SNPs were not found in the 1000G. This is the main improvement to be implemented into our website in the future.

You can modify this code to find proxy SNPs by extracting all SNPs located within, say, 1Mb of the SNP of interest and calculating the LD among these.
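
For example, here is a minimal sketch of the region-extraction step; pos=12345678 is a placeholder position on this chromosome:

# Hypothetical sketch: keep the VCF header plus all variants within 1Mb of the
# SNP of interest (POS is column 2), then run Step 4 on region.vcf.
zcat $BigFile | awk -v pos=12345678 '/^#/ || ($2 >= pos-1000000 && $2 <= pos+1000000)' > region.vcf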

Step 4: Calculate the LD with vcftools

Here comes the easiest step of all, thanks to the wonderful vcftools package, which you should download and install first. Now we can calculate the LD between every pair of SNPs in the extracted VCF file.

vcftools --vcf mini.vcf --hap-r2 --out results
wc -l results.hap.ld  # number of lines should be exactly choose(Nsnps, 2) plus one header line

Or if you want only R2 values above a certain limit, use the --min-r2 argument to reduce the output size. For example:

vcftools --vcf mini.vcf --hap-r2 --min-r2 0.5 --out results

Step 5: Convert into a tabular matrix format (optional)

Sometimes it is useful to convert the LD values into an N x N matrix (N = number of SNPs). Here is the R code to do that:

long <- read.table("results.hap.ld", as.is=T, row.names=NULL, header=T)
pos <- as.character( sort(unique(c(long$POS1, long$POS2))) )

library(reshape2)
irregular <- acast(long, POS1 ~ POS2, value.var="R.2")  # wide format but irregular shape

mat.r2 <- matrix(NA, length(pos), length(pos), dimnames=list(pos, pos))
mat.r2[ rownames(irregular), colnames(irregular) ] <- irregular
mat.r2[ lower.tri(mat.r2) ] <- t(mat.r2)[ lower.tri(mat.r2) ]  # mirror the upper triangle into the lower
diag(mat.r2) <- 1
rm(long, irregular)
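
To eyeball the resulting matrix, any heatmap function will do; for example, with lattice (which ships with R):

library(lattice)
levelplot(mat.r2, xlab="Position", ylab="Position", main="Pairwise r2")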