Friday, July 12, 2013

PROCEDURE TO CHECK YOUR EAST ASIAN ADMIXTURE USING PASNP

This is a small procedure I made to help anyone interested in getting their 23andMe results checked against the PASNP data. For this to work, you need:

1.      Admixture

2.      Unix OS (Ubuntu, etc.)

3.      PLINK or WDIST

4.      23andMe raw data

5.      Pan-Asian SNP data

6.      A couple of Unix/Perl scripts (see procedure below)

Some notes:

1.      The final “merged” data will end up with only about 20K to 24K SNPs, so out of your ~100K SNPs only ~20% are compared. Because of this, the admixture results may be skewed, but PASNP is currently (at the time of writing) the best data out there with finer East Asian resolution.

2.      There are populations in East Asia that are not included in the PASNP data (e.g. Igorots of the Philippines, etc.).

3.      If you get your results, please share them as a comment on this blog, or you can email them to me. Many thanks!

4.      WDIST is faster than PLINK but does not have all of its functionality. The speed helps if you have a slow computer, like mine :).

5.      The most difficult part is getting the 23andMe data to merge with PASNP. Don’t get frustrated (like I did). Feel free to ask me questions.

 

PRE-ADMIXTURE PROCESSING

1.      Convert PASNP data to tped format (script here from Harappa)

./PASNP2TPED.pl Genotypes_All.txt

This will output PASNP.tped and PASNP.tfam

 

2.      Convert PASNP.tped to bed format

./wdist --tfile PASNP --out PASNP --make-bed

Or,

./plink --noweb --tfile PASNP --out PASNP --make-bed

This will output PASNP.bed, PASNP.bim and PASNP.fam

 

3.      Create SNP list from reference population

./wdist --bfile PASNP --write-snplist --out PASNP

Or,

./plink --noweb --bfile PASNP --write-snplist --out PASNP

This will output PASNP.snplist

 

4.      Convert 23nMe data to tped format (script here from Harappa)

./23nMe2TPED.pl 23nMe.txt

Use fid1, id1, pid1, mid1 for the family, individual, paternal and maternal IDs

This will output id1.tped and id1.tfam
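Before going further, it can help to know how many SNPs your raw file really contains. Here is a minimal sketch, assuming the usual 23andMe layout (comment lines starting with #, then tab-separated rsid, chromosome, position and genotype, with -- for a no-call). The function name count_calls is my own, not part of any tool.

```shell
#!/bin/sh
# Count genotyped SNPs and no-calls in a 23andMe raw data file.
# Assumes tab-separated columns: rsid, chromosome, position, genotype.
# Usage: count_calls 23nMe.txt
count_calls() {
    awk -F'\t' '
        /^#/ { next }                 # skip the comment header
        NF >= 4 {
            sub(/\r$/, "", $4)        # tolerate Windows line endings
            total++
            if ($4 == "--") nocalls++ # "--" marks a no-call
        }
        END { printf "total=%d nocalls=%d\n", total, nocalls }
    ' "$1"
}

# Example (file name as used in step 4):
# count_calls 23nMe.txt
```

Comparing the total with the SNP count you are left with after the merge shows how much of your data is actually being used.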

 

5.      Convert id1.tped to bed format

./wdist --tfile id1 --out 23nMe --make-bed

Or,

./plink --noweb --tfile id1 --out 23nMe --make-bed

This will output 23nMe.bed, 23nMe.bim and 23nMe.fam

 

6.      Filter 23nMe SNPs using the reference population SNP list (PASNP.snplist)

./wdist --bfile 23nMe --extract PASNP.snplist --make-bed --out 23nMeFil

Or,

./plink --noweb --bfile 23nMe --extract PASNP.snplist --make-bed --out 23nMeFil

 

7.      Create SNP list from 23nMeFil

./wdist --bfile 23nMeFil --write-snplist --out 23nMeFil

Or,

./plink --noweb --bfile 23nMeFil --write-snplist --out 23nMeFil

This will output 23nMeFil.snplist

 

8.      Filter PASNP using the 23nMeFil.snplist

./wdist --bfile PASNP --extract 23nMeFil.snplist --make-bed --out PASNPFil

Or,

./plink --noweb --bfile PASNP --extract 23nMeFil.snplist --make-bed --out PASNPFil
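At this point both filtered files should hold the same set of SNPs. A quick sanity check, sketched below, counts the SNP IDs (column 2 of each .bim file) that appear in both. The helper name shared_snps is hypothetical, and the check assumes SNP IDs are unique within each file.

```shell
#!/bin/sh
# Count SNP IDs (column 2 of a .bim file) that are present in both files.
# Usage: shared_snps A.bim B.bim
shared_snps() {
    # While reading the first file (NR == FNR), remember its IDs.
    # While reading the second file, count the IDs already seen.
    awk 'NR == FNR { seen[$2] = 1; next }
         ($2 in seen) { n++ }
         END { print n + 0 }' "$1" "$2"
}

# Example (assumes the outputs of steps 6 and 8 are in the current folder):
# wc -l 23nMeFil.bim PASNPFil.bim
# shared_snps 23nMeFil.bim PASNPFil.bim
```

If the two line counts and the shared count all agree, the files are ready for the merge. If they differ, one of the extract steps probably used the wrong SNP list.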

 

9.      Merge 23nMeFil.bed to PASNPFil.bed

./wdist --bfile PASNPFil --bmerge 23nMeFil.bed  23nMeFil.bim  23nMeFil.fam --make-bed --out PAnMe

Or,

./plink --noweb --bfile PASNPFil --bmerge 23nMeFil.bed  23nMeFil.bim  23nMeFil.fam --make-bed --out PAnMe

 

10.   If the merge step above fails, a likely cause is flipped strands. The merge writes a list of the SNPs with potentially flipped strands; depending on the tool and version, it is named PAnMe-merge.missnp or PAnMe.missnp. Pass whichever file was produced to the flip option:

./wdist --bfile 23nMeFil --flip PAnMe-merge.missnp --make-bed --out 23nMeFilFlip

Or,

./plink --noweb --bfile 23nMeFil --flip PAnMe-merge.missnp --make-bed --out 23nMeFilFlip
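Because the mismatch list can show up under either name (PAnMe-merge.missnp or PAnMe.missnp), a small helper can pick whichever one exists. This is only a sketch; find_missnp is a name I made up, and it simply checks for a non-empty file under both names.

```shell
#!/bin/sh
# Print the name of the mismatch list left behind by a failed merge.
# Returns non-zero if neither file exists (or both are empty).
# Usage: find_missnp PAnMe
find_missnp() {
    for f in "$1-merge.missnp" "$1.missnp"; do
        if [ -s "$f" ]; then
            echo "$f"
            return 0
        fi
    done
    return 1
}

# Example: flip only if a mismatch list was actually written.
# list=$(find_missnp PAnMe) &&
#     ./plink --noweb --bfile 23nMeFil --flip "$list" --make-bed --out 23nMeFilFlip
```

If the helper prints nothing, the merge did not report any mismatched SNPs and the flip step can be skipped.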

 

11.   Merge 23nMeFilFlip.bed to PASNPFil.bed

./wdist --bfile PASNPFil --bmerge 23nMeFilFlip.bed  23nMeFilFlip.bim  23nMeFilFlip.fam --make-bed --out PAnMe

Or,

./plink --noweb --bfile PASNPFil --bmerge 23nMeFilFlip.bed  23nMeFilFlip.bim  23nMeFilFlip.fam --make-bed --out PAnMe

 

12.   If the merge step above results in another error, it is best to exclude the SNPs causing problems. The failed merge again writes a list of the problematic SNPs (PAnMe.missnp, or PAnMe-merge.missnp depending on the tool and version). Use this list with the exclude option.

Exclude from 23nMeFil

./wdist --bfile 23nMeFil --exclude PAnMe.missnp --make-bed --out 23nMeFilEx

Or,

./plink --noweb --bfile 23nMeFil --exclude PAnMe.missnp --make-bed --out 23nMeFilEx

 

13.   Exclude from PASNPFil

./wdist --bfile PASNPFil --exclude PAnMe.missnp --make-bed --out PASNPFilEx

Or,

./plink --noweb --bfile PASNPFil --exclude PAnMe.missnp --make-bed --out PASNPFilEx

 

14.   Merge 23nMeFilEx.bed to PASNPFilEx.bed

./wdist --bfile PASNPFilEx --bmerge 23nMeFilEx.bed  23nMeFilEx.bim  23nMeFilEx.fam --make-bed --out PAnMe

Or,

./plink --noweb --bfile PASNPFilEx --bmerge 23nMeFilEx.bed  23nMeFilEx.bim  23nMeFilEx.fam --make-bed --out PAnMe

 

ADMIXTURE PROCESSING

Place the bed, bim, and fam files in the same folder as the admixture executable and run the Admixture analysis. I recommend running K=8 at first to get a feel for it, and then running K=16 when you’re comfortable. Warning! The higher the K, the longer the processing time.

./admixture PAnMe.bed 8

This will output the PAnMe.8.P and PAnMe.8.Q files. The Q file has your ancestry information. You will need a spreadsheet (Excel, Google Docs, etc.) to easily identify the population assignments. I touched a bit on this on this blog.

Here’s a bash script if you want to perform a more detailed analysis (Warning! This takes a lot of time). The -j2 option is for when you have more than one processor (core) available; in this case I’m using 2 cores (dual core).
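The Q file has no labels: each row is one individual, in the same order as the rows of PAnMe.fam. If you would rather not open a spreadsheet just to find your own row, a sketch like the one below pairs the two files. The function name label_q is hypothetical, and id1 is the individual ID chosen in step 4.

```shell
#!/bin/sh
# Prefix every row of an ADMIXTURE .Q file with the family and individual
# IDs from the matching row of the .fam file.
# Usage: label_q PAnMe.fam PAnMe.8.Q
label_q() {
    # First file: store "FID IID" by row number.
    # Second file: print the stored IDs followed by the ancestry fractions.
    awk 'NR == FNR { id[FNR] = $1 " " $2; next }
         { print id[FNR], $0 }' "$1" "$2"
}

# Example: show only your own row (individual ID id1 from step 4).
# label_q PAnMe.fam PAnMe.8.Q | awk '$2 == "id1"'
```

The columns after the two IDs are the K ancestry fractions for that individual; which column corresponds to which population still has to be worked out from the reference samples.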

#!/bin/bash

Ki=10  # This is your min K

Kf=25  # This is your max K

K=$Ki

while [ "$K" -le "$Kf" ]

      do echo "$K"

      ./admixture -j2 --cv PAnMe.bed "$K" > "Admixture$K.log"

      K=$((K + 1))

done

echo Done
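Since the script runs with --cv, each log should contain a line starting with "CV error" for its K. A lower cross-validation error generally points to a better-fitting K. The sketch below collects those lines and sorts them, lowest error first; cv_errors is a made-up name, and it assumes the log lines look like "CV error (K=10): 0.55000".

```shell
#!/bin/sh
# List the cross-validation error lines from ADMIXTURE logs,
# sorted so that the lowest error comes first.
# Usage: cv_errors Admixture*.log
cv_errors() {
    # -h drops the file name prefix; the value after the colon is the
    # sort key, compared numerically.
    grep -h '^CV error' "$@" | sort -t: -k2,2n
}

# Example (log names as written by the script above):
# cv_errors Admixture*.log | head -n 1
```

The first line printed is the K with the lowest cross-validation error among the runs that finished.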

 

REFERENCES

1.      Yang X, Xu S, The HUGO Pan-Asian SNP Consortium (2011) Identification of Close Relatives in the HUGO Pan-Asian SNP Database. PLoS ONE 6(12): e29502. doi:10.1371/journal.pone.0029502

2.      D.H. Alexander, J. Novembre, and K. Lange. Fast model-based estimation of ancestry in unrelated individuals. Genome Research, 19:1655–1664, 2009

3.      H. Zhou, D. H. Alexander, and K.  Lange. A quasi-Newton method for accelerating the convergence of iterative optimization algorithms. Statistics and Computing, 2009.

4.      Alexander D. H., Lange K. (2011). Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics 12:246. doi: 10.1186/1471-2105-12-246.

5.      Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ & Sham PC (2007) PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81.

6.      WDIST is developed, tested, and documented primarily by Christopher Chang and Laurent Tellier at the BGI Cognitive Genomics Lab and Carson Chow, James Lee, and Shashaank Vattikuti at the NIH-NIDDK's Laboratory of Biological Modeling.

7.      Thanks to Oceanboy (a 23andMe user and a distant cousin) for all the references.

8.      Razib Khan

9.      Dienekes Pontikos

10.   Harappa Ancestry Project

 
