Finding SNPs within genes (or intersecting intervals)

We recently submitted a paper which shows the presence of polymorphisms with probe sequence (which is well known to affect binding affinity) can contribute to a very large proportion (up to 90% in our datasets) of false cis-eQTLs.

This is a well known phenomena among those who do expression QTL (i.e. which genetic variants influence expression levels) but only recently knowledge about polymorphisms (e.g. SNPs and indels) has been comprehensive enough to reveal the true extent of this “polymorphism-in-probe” problem.

Let me demonstrate  let me demonstrate what I want to achieve using two sample files below.

> cat probelist.txt
chr start stop probeName
chr1 34572 34621 ILMN_1723732
chr1 41029 41078 ILMN_1677162
chr1 69476 69525 ILMN_1776601
chr1 69649 69698 ILMN_2200738
...
> cat snpfile.txt 
chr1:41053
chr1:41050
chr1:69511
chr1:568800
chr1:759970
chr1:780191
chr1:787399
chr1:803596
...

Given the inputs above, I want the following output which the first two probes are free of SNPs, the third contains two SNPs while the fourth contains one SNP.

Probe ILMN_1677162 contains SNP chr1:41053
Probe ILMN_1677162 contains SNP chr1:41050
Probe ILMN_1776601 contains SNP chr1:69511
...

The real probelist file can have anywhere between 40,000 (gene-based arrays) to over 5 million probes (exon-based arrays) while the real snplist can be in the tens of million polymorphisms. The chr also ranges from 1, …, 22, X.

My R implementation was quite slow (just over an hour). Then, I spent a good long day re-coding it in awk here and surprisingly it was even slower (estimated about six hours) – I am probably doing something silly here.

Both of these are slow because the algorithms loop through every line in the snp file even if the the SNPs are clearly out of range. So with a bit more searching, I came across the Interval Tree concept which sounded exactly like what I wanted and a C++ implementation of it in libfbi. However, I had no experience with C, so decided to ask a few friends for suggestions.

My colleague Nikolas Barkas and friend Manjula Thimma both suggested BEDtools which turned out to be very easy to use and extremely fast and intuitive. The user manual is excellent too. Anyway, here is my solution:

> tail -n+2 probelist.txt | awk '{print $1, $2-1, $3, $4}' | \
 sed 's/ /\t/g'> probes.bed 
> cat probes.bed
chr1 34571 34621 ILMN_1723732
chr1 41028 41078 ILMN_1677162
chr1 69475 69525 ILMN_1776601
chr1 69648 69698 ILMN_2200738

There are three changes here. First, we remove the header line. Next, the second column of the BED file are the the probe start positions but are 0-based, hence we need to do minus one. The third column is the probe stop positions which is are 1-based. This is explained in the manual as something to simplify computational expense but confusing at first go. Finally, we change the separators to tab-separated which is compulsory when using bedtools (unfortunately the tabs are displaying as spaces above). We similarly create the BED file for snps:

> awk -F: '{print $1, $2-1, $2, $0}' snpfile.txt | sed 's/ /\t/g' > snps.bed
> cat snps.bed
chr1 41052 41053 chr1:41053
chr1 41049 41050 chr1:41050
chr1 69510 69511 chr1:69511
chr1 568799 568800 chr1:568800
chr1 759969 759970 chr1:759970
chr1 780190 780191 chr1:780191
chr1 787398 787399 chr1:787399
chr1 803595 803596 chr1:803596

Then finally, the main intersectBed command and some cleanup of the output:

> intersectBed -a snps.bed -b probes.bed -wb
chr1 41052 41053 chr1:41053 chr1 41028 41078 ILMN_1677162
chr1 41049 41050 chr1:41050 chr1 41028 41078 ILMN_1677162
chr1 69510 69511 chr1:69511 chr1 69475 69525 ILMN_1776601
> intersectBed -a snps.bed -b probes.bed -wb | \
 awk '{print "Probe " $8 " contains SNP " $4}'
Probe ILMN_1677162 contains SNP chr1:41053
Probe ILMN_1677162 contains SNP chr1:41050
Probe ILMN_1776601 contains SNP chr1:69511

Note that BEDtools authors suggest setting the smaller file as -b (here the probes file is smaller) as this is read into memory.

How fast is this? Well, it took less than 60 seconds to complete the same task. So about a 100-fold improvement in speed compared to R! BEDtools have number of other tools also, all looks equally awesome.

One more thing. I would strongly recommend breaking the probe and snp files by chromosome to speed up the process. I am not sure why BEDtools doesn’t do this internally since chromosome is a major structural component (it’s doesn’t make sense to scan intervals across different chromosomes).

Finally, the latest version of BEDtools can work directly with the VCF format produced by the 1000 Genomes Project. This is something I will try out in the next few days and anticipate should make life a bit easier when it comes to handling indels and other structural variation.


One comment


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s