Download 1000 Genomes Phase 3 and calculate allele frequencies

Here is some code to download the 1000 Genomes Phase 3 data from the project's FTP site onto your own server and calculate the allele frequencies for the European populations. First, some setup. The panel file tells you which population and super-population each sample belongs to.

FTP_SITE=ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502
wget $FTP_SITE/integrated_call_samples_v3.20130502.ALL.panel
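
Before going further, it can help to see how many samples each super-population has. The sketch below assumes the panel file is tab-delimited with a header row and the super-population code in the third column; count_by_superpop is just a name I made up for illustration.

```bash
# Count samples per super-population in a panel file.
# Assumption: tab-delimited, one header row, super-population code in column 3.
count_by_superpop() {
    local panel=$1
    awk -F'\t' 'NR > 1 && $3 != "" { n[$3]++ }
                END { for (p in n) print p "\t" n[p] }' "$panel" | sort
}

# Usage:
#   count_by_superpop integrated_call_samples_v3.20130502.ALL.panel
```

The output is one line per super-population code followed by its sample count.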

Next we download each chromosome (I am ignoring the Y and MT chromosomes here). Alternatively, you can download everything on the FTP site with wget -r $FTP_SITE, but I preferred to download each file separately. Note that the chromosome X file is version 1b rather than 5a like the autosomes, so it has to be downloaded separately. I rename it to v5a for convenience later.

for CHR in `seq 1 22`; do
   FILE=$FTP_SITE/ALL.chr$CHR.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
   wget $FILE $FILE.tbi
   sleep 60
done

FILE=$FTP_SITE/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz
wget $FILE $FILE.tbi
rename -v 's/v1b/v5a/' *
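
After a long unattended download it may be worth checking that nothing is missing or empty. This is a minimal sketch, assuming the files sit in the current directory and chromosome X has already been renamed to v5a as above; list_missing_vcfs is a hypothetical helper, not part of any tool.

```bash
# Print every expected VCF or index file that is missing or empty
# in the current directory (chromosomes 1-22 and X, v5a naming).
list_missing_vcfs() {
    local chr f
    for chr in $(seq 1 22) X; do
        f=ALL.chr$chr.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
        [ -s "$f" ] || echo "$f"
        [ -s "$f.tbi" ] || echo "$f.tbi"
    done
}

# Usage:
#   list_missing_vcfs    # prints nothing when all 46 files are present
```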

The next step is to identify the Europeans and calculate the allele frequencies. I also drop any variant that has a call rate < 95% or that fails Hardy-Weinberg equilibrium at p < 10^-6. You can do this with vcftools or other software, but I am most familiar with PLINK. If you are using PLINK, please use version 1.90 or above (https://www.cog-genomics.org/plink2), as it is much faster than previous versions.

grep EUR integrated_call_samples_v3.20130502.ALL.panel | cut -f1 > EUR.id  ## 503 Europeans
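
The grep above matches "EUR" anywhere on the line, which appears to be fine for this panel. If you want to match the super-population column only, something like the following sketch could be used instead. It assumes a tab-delimited panel with a header row and the super-population code in column 3; select_superpop is a made-up name.

```bash
# Print the sample IDs (column 1) whose super-population (column 3)
# equals the given code exactly; the header row is skipped.
select_superpop() {
    local panel=$1 code=$2
    awk -F'\t' -v code="$code" 'NR > 1 && $3 == code { print $1 }' "$panel"
}

# Usage:
#   select_superpop integrated_call_samples_v3.20130502.ALL.panel EUR > EUR.id
```

The same function works for any other super-population code in the panel.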

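for CHR in `seq 1 23 | sed 's/23/X/'`; do
   FILE=ALL.chr$CHR.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
   ## keep Europeans, apply the call rate and HWE filters, write a temporary binary fileset
   plink --vcf $FILE --keep-fam EUR.id --geno 0.05 --hwe 1e-6 --make-bed --out tmp
   ## allele frequencies for this chromosome go to EUR_$CHR.frq
   plink --bfile tmp --freq --out EUR_$CHR
   ## clean up; only EUR_$CHR.frq is kept
   rm EUR_$CHR.{log,nosex} tmp.*
done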

awk 'FNR==1 && NR!=1{next;}{print}' EUR_*.frq | cut -f1-4 > EUR.freq
rm EUR_*.frq
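
If you would rather have a tab-separated merged file, here is a small sketch. It assumes the .frq files are space-padded with the columns CHR, SNP, A1, A2, MAF and NCHROBS, and it has to run before the per-chromosome files are deleted; merge_frq is a hypothetical helper name.

```bash
# Concatenate .frq files, keep only the first header line, and
# re-emit every line with single tabs between the columns.
merge_frq() {
    # $1 = $1 forces awk to rebuild the line using OFS (a tab)
    awk -v OFS='\t' 'FNR == 1 && NR != 1 { next } { $1 = $1; print }' "$@"
}

# Usage (before the rm above):
#   merge_frq EUR_*.frq > EUR.freq
```

With tabs in place, cut -f can then pick whichever columns you need.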

You can also find pre-computed allele frequencies per population, but my purpose here is to show how to download the VCF files and run some calculations on them.


5 comments

  1. Great post! Could you make another post on how to download the frequencies for ALL populations for only a list of SNPs? This is useful for downloading frequencies from a set of GWAS hits for example! Thanks a lot!

  2. Can you clarify if you want:

    1. the frequency of each and every population separately? If so, you can write a for loop.

    2. the frequency across all individuals in 1000G? I am not sure if this makes sense, but you can try “cut -f1 integrated_call_samples_v3.20130502.ALL.panel > all.ids” and replace EUR.id with all.ids through the rest of the code.

  3. I actually found a way to do it in R. However, I ran your code on my machine and it only partially worked. It runs over a chromosome and saves a file, which is immediately overwritten (and deleted) by the next chromosome, so I had to run the code for each chromosome separately (22 times). I now have the data I need but it was a lot of work!

    • Hi pifferdavide, can you share your R script? I'm also looking for handy ways to query millions of variant frequencies from GWAS data. Thanks

      • Hi. It’s several scripts. 1 on Plink and 3 scripts with R. However, I am gonna publish the 1000 Genomes data into .csv file so everything will be easier. So far I have posted only the 5 superpopulation files (and some population files). Email me for more details: pifferdavide@gmail.com

