This is a small procedure I made to help anyone interested in getting their 23andMe results checked against the PASNP data. For this to work, you need:
1. Admixture
2. Unix OS (Ubuntu, etc.)
3. WDIST or PLINK
4. 23andMe raw data
5. Pan-Asian SNP (PASNP) data
6. A couple of Unix/Perl scripts (see procedure below)
Some notes:
1. The final “merged” data will end up with only about 20K to 24K SNPs. Out of your ~100K SNPs, only ~20% are compared. Because of this, the admixture results may be skewed, but at the time of writing this is the best data available with finer East Asian resolution.
2. There are populations in East Asia that are not included in the PASNP data (e.g. Igorots of the Philippines, etc.).
3. If you get your results, please share them as a comment on this blog, or email them to me. Many thanks!
4. WDIST is faster than PLINK but does not have all of its functionality. The speed helps if you have a slow computer, like mine :).
5. The most difficult part is getting the 23andMe to merge with PASNP. Don’t get frustrated (like I did). Feel free to ask me questions.
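To put a number on note 1 for your own data, you can count the SNPs in the .bim files, since a .bim file has one line per SNP. The sketch below is only an illustration; the function names count_snps and percent_kept are made up for this example, and the file names assume you followed the steps below.

```shell
#!/bin/sh
# count_snps: print the number of SNPs in a PLINK .bim file (one SNP per line).
count_snps() {
    awk 'END { print NR }' "$1"
}

# percent_kept: print what percentage of the SNPs in the second .bim file
# are still present (by count) in the first .bim file.
percent_kept() {
    merged=$(count_snps "$1")
    original=$(count_snps "$2")
    awk -v m="$merged" -v o="$original" \
        'BEGIN { if (o > 0) printf "%.1f\n", 100 * m / o; else print "n/a" }'
}

# Example (assumes these files already exist in the current folder):
# percent_kept PAnMe.bim 23nMe.bim
```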
PRE-ADMIXTURE PROCESSING
1. Convert PASNP data to tped format (script here from Harappa)
./PASNP2TPED.pl Genotypes_All.txt
This will output PASNP.tped and PASNP.tfam
2. Convert PASNP.tped to bed format
./wdist --tfile PASNP --out PASNP --make-bed
Or,
./plink --noweb --tfile PASNP --out PASNP --make-bed
This will output PASNP.bed, PASNP.bim and PASNP.fam
3. Create SNP list from reference population
./wdist --bfile PASNP --write-snplist --out PASNP
Or,
./plink --noweb --bfile PASNP --write-snplist --out PASNP
This will output PASNP.snplist
4. Convert 23nMe data to tped format (script here from Harappa)
./23nMe2TPED.pl 23nMe.txt
Use fid1, id1, pid1, mid1
This will output id1.tped and id1.tfam
5. Convert id1.tped to bed format
./wdist --tfile id1 --out 23nMe --make-bed
Or,
./plink --noweb --tfile id1 --out 23nMe --make-bed
This will output 23nMe.bed, 23nMe.bim and 23nMe.fam
6. Filter 23nMe SNPs using the reference population SNP list (PASNP.snplist)
./wdist --bfile 23nMe --extract PASNP.snplist --make-bed --out 23nMeFil
Or,
./plink --noweb --bfile 23nMe --extract PASNP.snplist --make-bed --out 23nMeFil
7. Create SNP list from 23nMeFil
./wdist --bfile 23nMeFil --write-snplist --out 23nMeFil
Or,
./plink --noweb --bfile 23nMeFil --write-snplist --out 23nMeFil
This will output 23nMeFil.snplist
8. Filter PASNP using the 23nMeFil.snplist
./wdist --bfile PASNP --extract 23nMeFil.snplist --make-bed --out PASNPFil
Or,
./plink --noweb --bfile PASNP --extract 23nMeFil.snplist --make-bed --out PASNPFil
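Before merging, it can be worth confirming that the two filtered sets really share the same SNPs. A rough check, assuming the SNP ID is in column 2 of each .bim file (the function names here are hypothetical):

```shell
#!/bin/sh
# snp_ids: print the sorted, unique SNP IDs (column 2) of a PLINK .bim file.
snp_ids() {
    awk '{ print $2 }' "$1" | sort -u
}

# shared_snp_count: print how many SNP IDs appear in both .bim files.
shared_snp_count() {
    a=$(mktemp) && b=$(mktemp) || return 1
    snp_ids "$1" > "$a"
    snp_ids "$2" > "$b"
    # comm -12 keeps only the lines common to both sorted lists
    comm -12 "$a" "$b" | awk 'END { print NR }'
    rm -f "$a" "$b"
}

# Example (assumes steps 6 and 8 have been run):
# shared_snp_count 23nMeFil.bim PASNPFil.bim
```

If the shared count matches the line count of both .bim files, the two sets contain the same SNP IDs.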
9. Merge 23nMeFil.bed to PASNPFil.bed
./wdist --bfile PASNPFil --bmerge 23nMeFil.bed 23nMeFil.bim 23nMeFil.fam --make-bed --out PAnMe
Or,
./plink --noweb --bfile PASNPFil --bmerge 23nMeFil.bed 23nMeFil.bim 23nMeFil.fam --make-bed --out PAnMe
10. If the merge step above fails, the likely cause is flipped strands. WDIST will output a list of the SNPs with potentially flipped strands, named PAnMe.missnp or PAnMe-merge.missnp depending on the tool and version (use whichever file name your run produced). Use this list with the flip option:
./wdist --bfile 23nMeFil --flip PAnMe-merge.missnp --make-bed --out 23nMeFilFlip
Or,
./plink --noweb --bfile 23nMeFil --flip PAnMe-merge.missnp --make-bed --out 23nMeFilFlip
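For anyone wondering what "flipping" means: the same SNP can be reported from either DNA strand, so an A/G SNP in one data set may show up as T/C in the other. Flipping swaps each allele for its complement (A with T, C with G). The sketch below only illustrates this on the text of a .bim file; the function names are made up, and the real flip should still be done with the --flip option above, since that also rewrites the binary files consistently.

```shell
#!/bin/sh
# complement_allele: map each base to its complement on the opposite strand.
complement_allele() {
    printf '%s\n' "$1" | tr 'ACGT' 'TGCA'
}

# flip_bim_alleles: print a .bim file with the allele columns (5 and 6)
# complemented. Anything that is not A, C, G or T (e.g. 0) is left alone.
flip_bim_alleles() {
    awk 'BEGIN { c["A"] = "T"; c["T"] = "A"; c["C"] = "G"; c["G"] = "C" }
         { for (i = 5; i <= 6; i++) if ($i in c) $i = c[$i]; print }' "$1"
}

# Example:
# complement_allele A
# flip_bim_alleles 23nMeFil.bim | head
```

Note that an A/T or C/G SNP looks the same after flipping, so allele names alone cannot reveal a strand problem for those SNPs.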
11. Merge 23nMeFilFlip.bed to PASNPFil.bed
./wdist --bfile PASNPFil --bmerge 23nMeFilFlip.bed 23nMeFilFlip.bim 23nMeFilFlip.fam --make-bed --out PAnMe
Or,
./plink --noweb --bfile PASNPFil --bmerge 23nMeFilFlip.bed 23nMeFilFlip.bim 23nMeFilFlip.fam --make-bed --out PAnMe
12. If the merge step above fails again, it is best to exclude the SNPs causing problems. WDIST will output a file listing the problematic SNPs, called PAnMe.missnp. Use this list with the exclude option.
Exclude from 23nMeFil
./wdist --bfile 23nMeFil --exclude PAnMe.missnp --make-bed --out 23nMeFilEx
Or,
./plink --noweb --bfile 23nMeFil --exclude PAnMe.missnp --make-bed --out 23nMeFilEx
13. Exclude from PASNPFil
./wdist --bfile PASNPFil --exclude PAnMe.missnp --make-bed --out PASNPFilEx
Or,
./plink --noweb --bfile PASNPFil --exclude PAnMe.missnp --make-bed --out PASNPFilEx
14. Merge 23nMeFilEx.bed to PASNPFilEx.bed
./wdist --bfile PASNPFilEx --bmerge 23nMeFilEx.bed 23nMeFilEx.bim 23nMeFilEx.fam --make-bed --out PAnMe
Or,
./plink --noweb --bfile PASNPFilEx --bmerge 23nMeFilEx.bed 23nMeFilEx.bim 23nMeFilEx.fam --make-bed --out PAnMe
ADMIXTURE PROCESSING
Place the bed, bim, & fam files together with the admixture file in the same folder and run Admixture analysis. I recommend running K=8 at first to get a feel and then running K=16 when you’re comfortable. Warning! The higher the K the longer the processing time.
./admixture PAnMe.bed 8
This will output the PAnMe.8.P and PAnMe.8.Q files. The Q file has your ancestry information. You will need a spreadsheet (Excel, Google Docs, etc.) to easily identify population assignments. I touched a bit on this elsewhere on this blog.
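The Q file has no names in it, only one row of numbers per person, in the same order as the rows of PAnMe.fam. If you would rather not open a spreadsheet just to find your own row, something like the following may help. It is a small sketch; label_q is a made-up name, and id1 assumes you used that individual ID in step 4.

```shell
#!/bin/sh
# label_q: prefix each row of an ADMIXTURE .Q file with the family and
# individual IDs (columns 1 and 2) from the matching row of the .fam file.
label_q() {
    awk '{ print $1, $2 }' "$1" | paste -d ' ' - "$2"
}

# Example: show only your own row (assumes your individual ID is id1).
# label_q PAnMe.fam PAnMe.8.Q | awk '$2 == "id1"'
```

Each number in your row is the fraction assigned to one of the K clusters; which population a cluster corresponds to still has to be worked out from the reference individuals.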
Here’s a bash script if you want to perform a detailed analysis over a range of K values (Warning! This takes a lot of time). The -j2 option is for when you have more than one processor (core) available; in this case I’m using 2 cores (dual core).
#!/bin/bash
Ki=10 # This is your min K
Kf=25 # This is your max K
K=$Ki
while [ "$K" -le "$Kf" ]
do
    echo "$K"
    # -j2 uses two threads; --cv adds a cross-validation error to each log
    ./admixture -j2 --cv PAnMe.bed "$K" > "Admixture$K.log"
    K=$((K + 1))
done
echo Done
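Once the loop finishes, each Admixture log should contain a cross-validation line because of the --cv option. A quick way to compare them is sketched below. It assumes the line looks like "CV error (K=10): 0.52305"; the function names cv_table and best_k are made up for this example.

```shell
#!/bin/sh
# cv_table: print "K error" pairs from ADMIXTURE logs, lowest error first.
# Assumes each log has a line like: CV error (K=10): 0.52305
cv_table() {
    grep -h '^CV error' "$@" |
        sed 's/^CV error (K=\([0-9]*\)): *\(.*\)$/\1 \2/' |
        sort -k2,2n
}

# best_k: print the K with the lowest cross-validation error.
best_k() {
    cv_table "$@" | awk 'NR == 1 { print $1 }'
}

# Example (assumes the logs written by the loop above):
# cv_table Admixture*.log
# best_k Admixture*.log
```

A lower cross-validation error is usually read as a better-fitting K, but treat it as a guide and not a final answer, especially with so few SNPs.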
REFERENCES
1. Yang X, Xu S, The HUGO Pan-Asian SNP Consortium (2011) Identification of Close Relatives in the HUGO Pan-Asian SNP Database. PLoS ONE 6(12): e29502. doi:10.1371/journal.pone.0029502
2. D.H. Alexander, J. Novembre, and K. Lange. Fast model-based estimation of ancestry in unrelated individuals. Genome Research, 19:1655–1664, 2009
3. H. Zhou, D. H. Alexander, and K. Lange. A quasi-Newton method for accelerating the convergence of iterative optimization algorithms. Statistics and Computing, 2009.
4. Alexander D. H., Lange K. (2011). Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics 12:246. doi: 10.1186/1471-2105-12-246.
5. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ & Sham PC (2007) PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81.
6. WDIST is developed, tested, and documented primarily by Christopher Chang and Laurent Tellier at the BGI Cognitive Genomics Lab and Carson Chow, James Lee, and Shashaank Vattikuti at the NIH-NIDDK's Laboratory of Biological Modeling.
7. Thanks to Oceanboy (a 23andMe user and a distant cousin) for all the references.
8. Razib Khan