How to calculate linkage disequilibrium using VCF of the latest 1000 Genomes

There are few websites that allow you to calculate the LD between SNPs of interest.

  • The Broad Institute’s SNAP pairwise LD is easiest to use but their latest reference dataset is the 1000 Genomes Pilot 1 which is several years old (released Feb 2009) and many of the newer rs IDs are not found.
  • Ensembl. Enter your rs ID or chr:pos (e.g. rs682767 or “13:101706629..101706629) and then click on Linkage Disequilibrium (red triangle). After a couple of minutes, you should get the table where you can choose the population and 1000G version. Click the corresponding “LD (table)” to get the R2 and D’. Good for checking the LD structure of different population or historical analysis of how LD structure definitions changed over the years. Not feasible for repeated applications.
  • Haploview is also another popular software but I am uncertain of how up-to-date their reference data is. Not feasible for repeated applications.
  • Our web tool LD Calculator created by Gigaloluwa Peter Ilori. Uses the latest 1000 Genomes (March 2012) haplotypes. No visualization and needs a bit more work (e.g. it doesn’t notify you of SNPs not found in the database) but it is a good alternative to consider.

To enable others to work with VCF (variant calling files) from newer releases or different populations, I am providing the barebone codes behind our website:

Step 1: Download the VCF files of reference population of interest

This is a one time process and can takes at least 30min to download. You can get the latest VCF from the 1000 Genomes project FTP site here which gives data for ALL 1000G individuals (N=1,092).

However I prefer to download from Yu Li’s download page where I can download, say, just the European subset (N=379).

wget ftp://share.sph.umich.edu/1000genomes/fullProject/2012.03.14/phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.EUR.vcf.gz.tgz

tar xvfz phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.EUR.vcf.gz.tgz

Step 2: Convert from rsids to chr:pos (optional)

This is optional but I would recommend converting all rsids to chr:pos format. The reason for this is that rs IDs can be changed or even retired. There will be a small gain in identifiability with the 1000 Genomes if you do this so you can automate this, For small number of SNPs, you can convert using BioMart as follows:

  1. Go to http://www.ensembl.org/biomart/martview/ 
  2. Choose “Ensembl Variation 70”
  3. Choose “Homo sapiens Short Variation (SNPs and indels) (GRCh37.p10)”
  4. Then click “Filters” -> “General Variation Filters” -> “Filter by Variation Name” and enter the rsids
  5. Then click on “Results” (top left hand after New and Count).
  6. Tick Unique Results and Export to TSV format file.

Step 3: Extract to make a smaller VCF

I am going to assume all of the SNPs are located on a single chromosome, say chr16. First, save the positions from Biomart step above into a file called “pos.txt”. Important to make sure it is sorted numerically (or use sort -n in unix) and there are no redundancies.

If your list contains SNPs from several chromosomes, it’s very easy to do a for loop in bash.

BigFile=chr16.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.EUR.vcf.gz

zcat $BigFile | head -28 > top.part
zfgrep -w --file=pos.txt $BigFile > bottom.part
wc -l bottom.part

cut -f2 bottom.part | diff pos.txt -    # shows the missing SNPs (if any)

cat top.part bottom.part > mini.vcf

rm top.part bottom.part

Line 07 shows you which SNPs were not found in the 1000G. This is the main improvement that is to be implemented into our website in the future.

You can modify these codes to find proxy SNPs by extracting all SNPs located within, say, 1Mb from the SNP of interest and calculate the LD among these.

Step 4: Extract to make a smaller VCF

Here comes the easiest step of all, thanks to the wonderful vcftools package, which you should download and install first. Now we can calculate the LD between every pair of SNP with the extracted VCF file.

vcftools --vcf mini.vcf --hap-r2 --out results
wc results.hap.ld  # should be exactly choose(Nsnps, 2)

Or if you want only R2 above certain limits, use the –min-r2 argument to reduce the output size. For example:

vcftools --vcf mini.vcf --hap-r2 --min-r2 0.5 --out results

Step 5: Convert into a tabular matrix format (optional)

Sometimes it is useful to convert the LD values into a N x N matrix (N = number of snps). Here are the R codes to do that:

long <- read.table("result.hap.ld", as.is=T, row.names=NULL, header=T)
pos <- as.character( sort(unique(c(long$POS1, long$POS2))) )

library(reshape2)
irregular <- acast(long, POS1 ~ POS2, value.var="R.2")  # wide format but irregular shape

mat.r2 <- matrix(NA, length(pos), length(pos), dimnames=list(pos, pos))
mat.r2[ rownames(irregular), colnames(irregular) ] <- irregular
mat.r2[ lower.tri(mat.r2) ] <- t(mat.r2)[ lower.tri( t(mat.r2) ) ]
diag(mat.r2) <- 1
rm(tmp)

3 comments


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s