Here is some code to download the 1000 Genomes Phase 3 data from the project's FTP site onto your own server and to calculate the allele frequencies for the European populations. First, some setup. The panel file tells you which population and super-population each sample belongs to.
FTP_SITE=ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502
wget $FTP_SITE/integrated_call_samples_v3.20130502.ALL.panel
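Before going further, it can help to look at what the panel file contains. The following is only a minimal sketch: it assumes the panel is tab-delimited with one header line and the super-population code in the third column (check with head on your own copy), and count_super_pops is a name made up for this example.

```shell
# Count samples per super-population in a 1000 Genomes panel file.
# Assumption: one header line, super-population code in column 3.
count_super_pops() {
    awk 'NR > 1 { n[$3]++ } END { for (p in n) print p, n[p] }' "$1" | sort
}

# Usage:
# count_super_pops integrated_call_samples_v3.20130502.ALL.panel
```

Matching on the column rather than grepping the whole line avoids accidental hits if a code ever appears in another field.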
Next we will download each chromosome (I am ignoring the Y and MT chromosomes here). Alternatively, you can download all the files on the FTP site using wget -r $FTP_SITE, but I preferred to download each one separately. Note that the chromosome X file is version 1b and not 5a like the autosomes, so it has to be downloaded separately. I rename it so that the later loop can treat it like the autosomes.
for CHR in `seq 1 22`; do
  FILE=$FTP_SITE/ALL.chr$CHR.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
  wget $FILE $FILE.tbi
  sleep 60
done

FILE=$FTP_SITE/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz
wget $FILE $FILE.tbi
rename -v 's/v1b/v5a/' *
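Since the downloads are large and a transfer can fail without you noticing, a quick check that every file arrived may save trouble later. This is a rough sketch, not part of the original workflow: list_missing_downloads is a hypothetical helper, and it assumes the chromosome X file has already been renamed to the v5a pattern as above.

```shell
# Print every expected VCF or index file that is missing or empty
# in directory $1 (default: current directory).
# Assumption: chrX has already been renamed from v1b to v5a.
list_missing_downloads() {
    dir=${1:-.}
    for chr in $(seq 1 22) X; do
        vcf="$dir/ALL.chr$chr.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
        for f in "$vcf" "$vcf.tbi"; do
            [ -s "$f" ] || echo "$f"
        done
    done
}

# Usage:
# list_missing_downloads .
```

No output means all 23 VCFs and their indexes are present and non-empty. This only checks presence, so running gzip -t on each VCF is a reasonable extra step if you suspect a truncated download.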
The next step is to identify the Europeans and calculate the allele frequencies. I am also excluding any variants that have a call rate < 95% or that fail Hardy-Weinberg equilibrium at p < 10^-6. You can do this with vcftools or other software, but I am most familiar with PLINK. If you are using PLINK, please use version 1.90 (https://www.cog-genomics.org/plink2) or above, as it is much faster than previous versions.
grep EUR integrated_call_samples_v3.20130502.ALL.panel | cut -f1 > EUR.id  ## 503 Europeans

for CHR in `seq 1 23 | sed 's/23/X/'`; do
  FILE=ALL.chr$CHR.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
  plink --vcf $FILE --keep-fam EUR.id --geno 0.05 --hwe 1e-6 --make-bed --out tmp
  plink --bfile tmp --freq --out EUR_$CHR
  rm EUR_$CHR.{log,nosex} tmp.*
done

awk 'FNR==1 && NR!=1{next;}{print}' *.frq | cut -f1-4 > EUR.freq
rm EUR_*.frq
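One thing to watch in the merge step: cut -f splits on tabs by default and passes through unchanged any line that contains no tab. As far as I know, PLINK 1.9 writes .frq files as space-padded columns (CHR, SNP, A1, A2, MAF, NCHROBS), in which case the cut has no effect. If you want to pick columns explicitly, a sketch along these lines may help. merge_frq is a made-up name, and the column order is an assumption to check against your own files.

```shell
# Merge PLINK .frq files into one tab-delimited table with a single header.
# Assumption: whitespace-separated columns CHR SNP A1 A2 MAF NCHROBS.
merge_frq() {
    awk -v OFS='\t' 'FNR == 1 && NR != 1 { next } { print $1, $2, $3, $4, $5 }' "$@"
}

# Usage (run before the per-chromosome files are deleted):
# merge_frq EUR_*.frq > EUR.freq
```

Letting awk split on whitespace sidesteps the delimiter question, and writing tabs on output makes the merged file easy to read with cut or R afterwards.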
You may also find pre-computed allele frequencies per population, but my purpose here is to show how to download the VCF files and run some calculations on them.
Great post! Could you make another post on how to download the frequencies for ALL populations for only a list of SNPs? This is useful for downloading frequencies from a set of GWAS hits for example! Thanks a lot!
Can you clarify if you want:
1. the frequency of each and every population separately? If so, you can write a for loop over the populations.
2. the frequency across all individuals in 1000G? I am not sure if this makes sense, but you can try "cut -f1 integrated_call_samples_v3.20130502.ALL.panel > all.ids" and replace EUR.id with all.ids through the rest of the code.
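For the first option, the loop could look roughly like the sketch below. It is untested against the real data and rests on several assumptions: the panel has one header line with the population in column 2 and the super-population in column 3, snps.txt is a hypothetical file with one variant ID per line, and write_pop_ids and pop_freqs are names invented here. The PLINK flags --keep-fam, --extract and --freq are the standard ones.

```shell
# Write one ID file per group into directory $2, e.g. ids/EUR.id.
# $3 selects the panel column to group by (3 = super-population, 2 = population).
write_pop_ids() {
    panel=$1; outdir=$2
    mkdir -p "$outdir"
    awk -v dir="$outdir" -v col="${3:-3}" \
        'NR > 1 { print $1 > (dir "/" $col ".id") }' "$panel"
}

# Run PLINK once per ID file in $2 on the VCF $1,
# restricted to the variant IDs listed in $3 (hypothetical SNP list).
pop_freqs() {
    vcf=$1; iddir=$2; snps=$3
    for ids in "$iddir"/*.id; do
        pop=$(basename "$ids" .id)
        plink --vcf "$vcf" --keep-fam "$ids" --extract "$snps" --freq --out "${pop}_freq"
    done
}

# Usage:
# write_pop_ids integrated_call_samples_v3.20130502.ALL.panel ids
# pop_freqs ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz ids snps.txt
```

Each run writes its own output prefix (EUR_freq.frq, AFR_freq.frq and so on), so nothing is overwritten between populations. To cover every chromosome you would wrap pop_freqs in the same chromosome loop as in the post and add the chromosome to the output name.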
I actually found a way to do it in R. However, I ran your code on my machine and it only partially worked. In fact, it runs over a chromosome and saves a file which is immediately overwritten (and deleted) by the next chromosome, so I had to run the code for each chromosome separately (22 times). I now have the data I need but it was a lot of work!
Hi pifferdavide, can you share your R script? I'm also looking for handy ways to query millions of variant frequencies from GWAS data. Thanks
Hi. It’s several scripts. 1 on Plink and 3 scripts with R. However, I am gonna publish the 1000 Genomes data into .csv file so everything will be easier. So far I have posted only the 5 superpopulation files (and some population files). Email me for more details: pifferdavide@gmail.com