Download 1000 Genomes Phase 3 and calculate allele frequencies

Here is some code to download the 1000 Genomes Phase 3 data to your own server and calculate the allele frequencies for the European populations. First, some setup ($FTP_SITE holds the address of the 1000 Genomes FTP site). The panel file tells you which population and super-population each sample belongs to.

wget $FTP_SITE/integrated_call_samples_v3.20130502.ALL.panel
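To get a feel for the panel file, a quick tally per super-population can help. The sketch below runs on a tiny made-up panel: the sample IDs are placeholders, and the tab-separated column layout (sample, pop, super_pop, gender) is my assumption about the file. Point the awk command at the real panel file instead of toy.panel.

```shell
# Toy stand-in for the panel file; the IDs are placeholders, not real samples
printf 'sample\tpop\tsuper_pop\tgender\n'  > toy.panel
printf 'ID1\tGBR\tEUR\tmale\n'            >> toy.panel
printf 'ID2\tFIN\tEUR\tfemale\n'          >> toy.panel
printf 'ID3\tYRI\tAFR\tfemale\n'          >> toy.panel

# Skip the header, then count samples per super-population (column 3)
awk -F'\t' 'NR > 1 { n[$3]++ } END { for (p in n) print p, n[p] }' toy.panel | sort > toy.counts
cat toy.counts
```

On the real file, the EUR count should agree with the number of IDs extracted for the Europeans later on.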

Next we download each chromosome (I am ignoring the Y and MT chromosomes here). Alternatively, you can download every file on the FTP site with wget -r $FTP_SITE, but I preferred to download each one separately. Note that the chromosome X file is version 1b rather than 5a like the autosomes, so it has to be downloaded separately. I rename it for convenience later.

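for CHR in `seq 1 22`; do
   # FILE must hold the address of the VCF for chromosome $CHR
   wget $FILE $FILE.tbi
   sleep 60
done

# Chromosome X: FILE must now hold the address of the v1b file
wget $FILE $FILE.tbi
rename -v 's/v1b/v5a/' *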
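The rename call above uses the Perl-style expression syntax. On systems where rename is the util-linux version, that expression will not work. A portable alternative is a small mv loop, sketched here on empty placeholder files in a scratch directory (the file names are made up for illustration):

```shell
# Work in a scratch directory so nothing real gets renamed
mkdir -p rename_demo && cd rename_demo
touch toy.chrX.v1b.vcf.gz toy.chrX.v1b.vcf.gz.tbi   # placeholder names

for f in *v1b*; do
   [ -e "$f" ] || continue                          # nothing matched the glob
   new=$(printf '%s\n' "$f" | sed 's/v1b/v5a/')     # replace the first v1b in the name
   mv -v "$f" "$new"
done
ls
cd ..
```

Restricting the glob to *v1b* instead of * also avoids touching unrelated files in the directory.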

The next step is to identify the Europeans and calculate the allele frequencies. I also drop any variant that has a call rate < 95% or fails Hardy-Weinberg equilibrium at p < 10^-6. You can do this with vcftools or other software, but I am most familiar with PLINK. If you are using PLINK, please use version 1.90 or above, as it is much faster than previous versions.

grep EUR integrated_call_samples_v3.20130502.ALL.panel | cut -f1 > EUR.ids  ## 503 Europeans

for CHR in `seq 1 23 | sed 's/23/X/'`; do
   # FILE must hold the path of the downloaded VCF for chromosome $CHR
   plink --vcf $FILE --keep-fam EUR.ids --geno 0.05 --hwe 1e-6 --make-bed --out tmp
   plink --bfile tmp --freq --out EUR_$CHR
   rm EUR_$CHR.{log,nosex} tmp.*
done

awk 'FNR==1 && NR!=1{next;}{print}' *.frq | cut -f1-4 > EUR.freq
rm EUR_*.frq
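The awk one-liner above keeps the header of the first file and skips the header of every later file: FNR restarts at 1 for each input file, while NR keeps counting across all of them. A small sketch on two made-up files (names and values are placeholders, not PLINK output):

```shell
mkdir -p merge_demo && cd merge_demo
printf 'CHR SNP A1 A2 MAF\n1 rsA A G 0.10\n1 rsB C T 0.25\n' > toy_1.frq
printf 'CHR SNP A1 A2 MAF\n2 rsC G A 0.40\n'                > toy_2.frq

# FNR==1 is the first line of each file; NR!=1 exempts the very first file
awk 'FNR==1 && NR!=1 {next} {print}' toy_*.frq > toy.freq
cat toy.freq
cd ..
```

The merged file has a different extension from the inputs, so a later re-run does not pick it up through the glob.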

You can also find pre-computed allele frequencies per population, but my purpose here is to show how to download the VCF files and run some calculations on them.


  1. Great post! Could you make another post on how to download the frequencies for ALL populations for only a list of SNPs? This is useful for downloading frequencies from a set of GWAS hits for example! Thanks a lot!

  2. Can you clarify if you want:

    1. the frequency of each and every population separately? If so, you can write a for() loop.

    2. the frequency across all individuals in 1000G? I am not sure if this makes sense, but you can try "cut -f1 integrated_call_samples_v3.20130502.ALL.panel > all.ids" and replace EUR.ids with all.ids through the rest of the code.

  3. I actually found a way to do it in R. However, I ran your code on my machine and it only partially worked. It runs over a chromosome and saves a file, which is immediately overwritten (and deleted) by the next chromosome, so I had to run the code for each chromosome separately (22 times). I now have the data I need, but it was a lot of work!

    • Hi pifferdavide, can you share your R script? I'm also looking for handy ways to query millions of variant frequencies from GWAS data. Thanks

      • Hi. It’s several scripts. 1 on Plink and 3 scripts with R. However, I am gonna publish the 1000 Genomes data into .csv file so everything will be easier. So far I have posted only the 5 superpopulation files (and some population files). Email me for more details:
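The per-population loop suggested in reply 2 might look like the sketch below. It only builds one ID list per population from a toy panel (the IDs are placeholders, and the column layout sample, pop, super_pop, gender is assumed); each list could then be passed to --keep-fam in place of EUR.ids.

```shell
mkdir -p pop_demo && cd pop_demo
printf 'sample\tpop\tsuper_pop\tgender\n'  > toy.panel
printf 'ID1\tGBR\tEUR\tmale\n'            >> toy.panel
printf 'ID2\tFIN\tEUR\tfemale\n'          >> toy.panel
printf 'ID3\tGBR\tEUR\tfemale\n'          >> toy.panel

# Unique population codes from column 2, header skipped
for POP in `tail -n +2 toy.panel | cut -f2 | sort -u`; do
   # Exact match on column 2, so a code cannot match inside another field
   awk -F'\t' -v pop="$POP" 'NR > 1 && $2 == pop { print $1 }' toy.panel > $POP.ids
   echo "$POP: `wc -l < $POP.ids` samples"
done
cd ..
```

Matching on the column with awk rather than with grep avoids picking up rows where the code happens to appear in another field.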
