How to functionally annotate SNPs and indels in BioConductor

GWAs or eQTL studies attempt to find the variants, typically SNP or indel, that are associated with the disease or gene expression changes. Once one has identified potential variants, it is common to annotate them in relation to the genes these variants sit in or genes in the proximal region. I have been using the SeattleSeq Annotation 137 which is very simple to use and very useful.

However, it does not give the genes immediately preceding or following a chromosomal location. This is useful information in the context of an eQTL study to see if an intergenic SNP is mapping to the nearby genes or somewhere else. To solve this problem, I started looking at various softwares and with some help from Valerie Obenchain on the BioConductor mailing list, I have come up with a neat and generic solution. Here is my solution.

The first step is to load the required packages.

library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene) # for annotation
library(org.Hs.eg.db)                      # to convert from Entrez Gene ID to Symbol

If you are missing any package, you can install via

source("http://bioconductor.org/biocLite.R")
biocLite("VariantAnnotation")

The next step is to prepare the query file. I have chosen three SNPs to illustrate below.

input <- rbind.data.frame( c("rs3753344", "chr1", 1142150),
                              c("rs12191877", "chr6", 31252925),
                              c("rs881375", "chr9", 123652898) )
colnames(input) <- c("rsid", "chr", "pos")
input$pos       <- as.numeric(as.character(input$pos))
input
#         rsid  chr       pos
# 1  rs3753344 chr1   1142150
# 2 rs12191877 chr6  31252925
# 3   rs881375 chr9 123652898

target <- with(input,
                GRanges( seqnames = Rle(chr),
                         ranges   = IRanges(pos, end=pos, names=rsid),
                         strand   = Rle(strand("*")) ) )
target
# GRanges with 3 ranges and 0 metadata columns:
#              seqnames                 ranges strand
#
#    rs3753344     chr1 [  1142150,   1142150]      *
#   rs12191877     chr6 [ 31252925,  31252925]      *
#     rs881375     chr9 [123652898, 123652898]      *
#   ---
#   seqlengths:
#    chr1 chr6 chr9
#      NA   NA   NA

Alternatively you can read from a VCF file if you have one handy already and reduce all the codes above to:

vcf    <- readVcf(filename, genome="hg19")
target <- rowData(vcf)

The third step is to the actual variant annotation thanks to the very useful VariantAnnotation package.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
loc <- locateVariants(target, TxDb.Hsapiens.UCSC.hg19.knownGene, AllVariants())
names(loc) <- NULL
out <- as.data.frame(loc)
out$names <- names(target)[ out$QUERYID ]
out <- out[ , c("names", "seqnames", "start", "end", "LOCATION", "GENEID", "PRECEDEID", "FOLLOWID")]
out <- unique(out)
out
#        names seqnames     start       end   LOCATION GENEID PRECEDEID FOLLOWID
# 1  rs3753344     chr1   1142150   1142150   promoter   8784        NA       NA
# 5  rs3753344     chr1   1142150   1142150 intergenic     NA      8784     7293
# 6 rs12191877     chr6  31252925  31252925     intron   3106        NA       NA
# 7   rs881375     chr9 123652898 123652898 intergenic     NA     26147     7185

As you can see from above, rs3753344 sits in the promoter of one gene and also in between two other genes as confirmed by the Ensembl page here. rs12191877 is an intronic SNP while rs881375 is an intergenic SNP.

Finally, let’s convert from Entrez Gene IDs (which is stable) to more human readable Gene Symbols (which can have many aliases).

Symbol2id <- as.list( org.Hs.egSYMBOL2EG )
id2Symbol <- rep( names(Symbol2id), sapply(Symbol2id, length) )
names(id2Symbol) <- unlist(Symbol2id)

x <- unique( with(out, c(levels(GENEID), levels(PRECEDEID), levels(FOLLOWID))) )
table( x %in% names(id2Symbol) ) # good, all found

out$GENESYMBOL <- id2Symbol[ as.character(out$GENEID) ]
out$PRECEDESYMBOL <- id2Symbol[ as.character(out$PRECEDEID) ]
out$FOLLOWSYMBOL <- id2Symbol[ as.character(out$FOLLOWID) ]
out

And here is the final output:

names seqnames start end LOCATION GENEID PRECEDEID FOLLOWID GENESYMBOL PRECEDESYMBOL FOLLOWSYMBOL
rs3753344 chr1 1142150 1142150 promoter 8784 <NA> <NA> TNFRSF18 <NA> <NA>
rs3753344 chr1 1142150 1142150 intergenic <NA> 8784 7293 <NA> TNFRSF18 TNFRSF4
rs12191877 chr6 31252925 31252925 intron 3106 <NA> <NA> HLA-B <NA> <NA>
rs881375 chr9 123652898 123652898 intergenic <NA> 26147 7185 <NA> PHF19 TRAF1

There is the caveat that these annotations will differ slightly depending on the gene definition and genome build etc. For example, while rs12191877 is tagged as intronic in HLA-B according to UCSC. However, looking this SNP up on this ENSEMBL page shows that it sits in WASF5P and is approximately 100 kilobases from HLA-B.


9 comments

    • You are most welcome. The true credits should go to Valerie Obenchain who not only wrote the VariantAnnotation package but also tirelessly helps out fellow scientists on the bioconductor mailing list.

  1. Thanks for the useful information. I just looking around the annotation packages for SNPs and came across variantAnnotation package which seems pretty straight forward. I have one question though. What if my genome of interest is not present in R database? Is there a way to create one in R?

    • Apologies for the late reply. I am quite positive there must be a way to build a new genome though I don’t know exactly how. Alternatively, if you have the genomic coordinates for exons, introns, UTR etc, then you could also use intersectBED from the bedtools suite. Best is to ask the bioconductor mailing list – there are some great people there who can help you. Good luck.

  2. Hi, I know you posted this last year but I came across your code and have been using it but for my list of variants, I am getting multiple values in the PRECEDEID and FOLLOWID columns and symbol annotation doesn’t seem to work in those situations. Any advice?

    • Hi Claire, apologies for the delayed response. Yes, I also do get this. I think they must have changed the functionality to return all the preceding/following genes within a certain range. See the examples in help(“locateVariants”) on how they unlist this and attach distance metrics (using distance()). Hope that helps.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s