Pre-Processing of PASNP Data

The Pan-Asian SNP (PASNP) data set contains about 55,000 SNPs from 1,928 individuals. These individuals, mostly from East Asia, have known language, ancestry/ethnicity and location, which lets us partially uncover the unique populations in Southeast Asia.

A team from UCLA developed an efficient algorithm called Admixture, which estimates ancestry. To perform an admixture analysis, Admixture requires the number of unique populations in a given data set; this variable is called K. For example, if we provide a K value of 3 for the PASNP data set, it will likely end up grouping the data into Africans, Europeans and East Asians, because these three groups have the largest genetic distances between them. However, this study looks into a finer resolution within Southeast Asia, and this can only be attained with a larger K value.

The question then becomes: what is the exact, or at least the optimal, K value to use for the given data set?

Admixture provides a cross-validation (CV) analysis. This option cross-validates the assumed K for the given data set and reports a CV error, an estimate of how well the assumed K represents the number of unique populations (lower is better). Figure 1 plots the CV error for K from 3 to 25. In this figure, we can readily observe that as K increases, the CV error decreases, but there is a point where the error begins to increase again, meaning a much higher K value may not adequately represent the number of unique populations. The objective is to find the lowest CV error, and in the PASNP data K=19 provides it (0.47187).

Figure 1: Pan-Asian SNP Admixture Cross Validation Error
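
Picking the K with the lowest CV error can also be done programmatically. The snippet below is only a sketch: it assumes the Admixture logs contain lines of the form "CV error (K=n): value", which is what the program prints when run with its --cv option, and the log text itself is a made-up excerpt. Only the K=19 value (0.47187) comes from my actual run; the other numbers and the helper names parse_cv_errors and best_k are illustrative.

```python
import re

# Hypothetical excerpt of collected Admixture log lines. Only the K=19 value
# (0.47187) is from the post; the other numbers are invented for illustration.
LOG_TEXT = """\
CV error (K=17): 0.47310
CV error (K=18): 0.47225
CV error (K=19): 0.47187
CV error (K=20): 0.47203
CV error (K=21): 0.47260
"""

# Matches lines such as "CV error (K=19): 0.47187".
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")


def parse_cv_errors(log_text):
    """Return {K: cv_error} for every 'CV error (K=...)' line in the text."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}


def best_k(cv_errors):
    """Return the K with the smallest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)


cv_errors = parse_cv_errors(LOG_TEXT)
k = best_k(cv_errors)
print(k, cv_errors[k])
```

With real logs, LOG_TEXT would be replaced by the concatenated output of one Admixture run per K.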

PASNP Population Breakdown

Once the optimum number of unique populations is found, the next step is to identify and label each population. Using Excel's sorting and filtering features, I managed to identify the unique populations (see Table 1).

Table 1: Population Description

Population	Description
TaiKadai	Tai-Kadai speaking people
Nusantao	Austronesian speaking people
Ryukyu	Ethnic group from Japan
ST_Alt	Sino-Tibetan or Altaic speaking people (perhaps the Han people)
MonKhmer	Mon-Khmer speaking people
Dayak	Ethnic group from Kalimantan, Indonesia
Jinuo	Ethnic group from Southwest China (Indochina)
HmongMien	Hmong-Mien speaking people
Temuan	Ethnic group from Malay Peninsula
Papuan	Papuan speaking people
Indian	South Asians
Jehai_Negrito	Ethnic group from Malay Peninsula
Kensui_Negrito	Ethnic group from Malay Peninsula
Ati_Negrito	Ethnic group from Bisaya
Mamanwa_Negrito	Ethnic group from Mindanao
Mlabri	Ethnic group from Thailand (Indochina)
European	White Americans from Utah
African	Yoruba people from Nigeria
Amerindian	Native Americans from North and South America

Since the African, European, Indian and Amerindian populations have the highest genetic distance from the East Asian populations, they were really easy to identify. For the East Asian populations, I mixed and matched between language family and ethnicity, because some populations show a high percentage in samples that come from different ethnicities but share a language family, and vice versa. As an example (see Table 2), the HmongMien population shows high percentages for both the Hmong of China and the Miao of Thailand, two separately sampled groups that share the same language family.

Except for the MonKhmer, Nusantao, TaiKadai and ST_Alt, identifying the East Asian populations was simple: most populations have several samples at 90% or above for their group. As an example (see Table 3), one population showed several Mlabri samples at 100%, while the next-highest sample had only 6%; this ancestral component can obviously be assigned as Mlabri.
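
I did this sorting and filtering in Excel, but the same check can be sketched in a few lines of Python. The data below is abbreviated from Table 3; the function names dominant_groups and largest_gap are just illustrative, and the 90% cut-off is the rule of thumb mentioned above rather than any formal threshold.

```python
# Ancestry fractions for one Admixture component, abbreviated from Table 3:
# 18 Mlabri samples at 100%, followed by a few of the next-highest samples.
samples = [("ThailandMlabriMK", 1.00)] * 18 + [
    ("ThailandH'tinMK", 0.06),
    ("MalaysiaMainTemuanAn", 0.04),
    ("IndonesiaJavaJavaAn", 0.04),
    ("ThailandKarenST", 0.04),
]


def dominant_groups(samples, threshold=0.90):
    """Return the group labels with at least one sample at or above threshold."""
    return {group for group, frac in samples if frac >= threshold}


def largest_gap(samples):
    """Sort fractions in descending order; return the biggest drop between neighbours."""
    fracs = sorted((frac for _, frac in samples), reverse=True)
    return max(a - b for a, b in zip(fracs, fracs[1:]))


print(dominant_groups(samples))
print(round(largest_gap(samples), 2))
```

A single dominant group combined with a large gap is the pattern that made components like Mlabri easy to label; components such as Nusantao or ST_Alt would show no sample above the cut-off and only small gaps.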

For the Nusantao, the highest percentage was 88%, with a gradual decrease across many ethnic groups, mainly in Island Southeast Asia (ISEA); the common denominator for this population was the Austronesian-speaking people (see Table 4). I will explain later why I called them “Nusantao”. I took a similar approach for the MonKhmer and TaiKadai, where each population also had a recurring language family across different ethnic backgrounds.

The most difficult population to identify was the ST_Alt. The highest percentage for this group is 60%, and it gradually spreads throughout the Sino-Tibetan, Altaic and, to a minor extent, Mon-Khmer speaking people (see Table 4). Initially, I thought this could be the Han ancestry; if true, the Han may have spread all over Northeast Asia, with the percentage decreasing (through admixture with the local populations) as they went south. For now, I’ll keep ST_Alt.

Table 2: Hmong-Mien Multi-Ethnic Background Example

Group	HmongMien
ChinaHmongHM	100%
ChinaHmongHM	100%
ChinaHmongHM	100%
ThailandHmongMiaoHM	84%
ThailandHmongMiaoHM	80%
ThailandHmongMiaoHM	78%
ThailandHmongMiaoHM	78%
ThailandHmongMiaoHM	77%
ThailandHmongMiaoHM	76%
ThailandHmongMiaoHM	76%
ThailandHmongMiaoHM	74%
ThailandHmongMiaoHM	73%
ThailandHmongMiaoHM	70%
ThailandHmongMiaoHM	69%
ThailandHmongMiaoHM	69%
ChinaHmongHM	69%
ThailandHmongMiaoHM	69%
ThailandHmongMiaoHM	69%
ThailandHmongMiaoHM	67%
ChinaHmongHM	67%
ThailandHmongMiaoHM	67%
ChinaHmongHM	65%
ThailandHmongMiaoHM	65%
ChinaHmongHM	64%
ThailandHmongMiaoHM	63%
ChinaHmongHM	62%
ChinaHmongHM	62%

Table 3: Mlabri Percent Ancestry Example

Group	Mlabri
ThailandMlabriMK	100%
ThailandMlabriMK	100%
ThailandMlabriMK	100%
ThailandMlabriMK	100%
ThailandMlabriMK	100%
ThailandMlabriMK	100%
ThailandMlabriMK	100%
ThailandMlabriMK	100%
ThailandMlabriMK	100%
ThailandMlabriMK	100%
ThailandMlabriMK	100%
ThailandMlabriMK	100%
ThailandMlabriMK	100%
ThailandMlabriMK	100%
ThailandMlabriMK	100%
ThailandMlabriMK	100%
ThailandMlabriMK	100%
ThailandMlabriMK	100%
ThailandH'tinMK	6%
MalaysiaMainTemuanAn	4%
IndonesiaJavaJavaAn	4%
IndonesiaJavaJavaAn	4%
IndonesiaJavaSundaAn	4%
ThailandKarenST	4%
SingapuraMalayAn	4%
ThailandKarenST	4%
SingapuraMalayAn	4%
IndonesiaSumatraSMalayAn	4%
SingapuraMalayAn	4%

Table 4: MonKhmer, Nusantao, TaiKadai, & ST_Alt Percentage Pattern

Group	Nusantao	Group	MonKhmer	Group	TaiKadai	Group	ST_Alt
IndonesiaSumatraNMentawaiAn	88%	ThailandH'tinMK	80%	ChinaSHainanJiamaoTK	88%	IndiaNSpitiST	60%
IndonesiaSumatraNMentawaiAn	87%	ThailandH'tinMK	79%	ChinaSHainanJiamaoTK	87%	IndiaNSpitiST	57%
IndonesiaSumatraNMentawaiAn	87%	ThailandH'tinMK	78%	ChinaSHainanJiamaoTK	87%	IndiaNSpitiST	56%
IndonesiaSumatraNMentawaiAn	86%	ThailandH'tinMK	75%	ChinaSHainanJiamaoTK	87%	IndiaNSpitiST	55%
IndonesiaSumatraNMentawaiAn	86%	ThailandH'tinMK	75%	ChinaSHainanJiamaoTK	86%	ChinaBeijingHanST	55%
IndonesiaSumatraNMentawaiAn	84%	ThailandH'tinMK	74%	ChinaSHainanJiamaoTK	86%	KoreaKoreanAlt	55%
IndonesiaSumatraNMentawaiAn	84%	ThailandH'tinMK	73%	ChinaSHainanJiamaoTK	85%	KoreaKoreanAlt	54%
IndonesiaSumatraNMentawaiAn	83%	ThailandH'tinMK	73%	ChinaSHainanJiamaoTK	85%	ThailandPaluangMK	53%
IndonesiaSumatraNMentawaiAn	82%	ThailandH'tinMK	72%	ChinaSHainanJiamaoTK	85%	ChinaBeijingHanST	53%
IndonesiaSumatraNMentawaiAn	81%	ThailandH'tinMK	71%	ChinaSHainanJiamaoTK	85%	KoreaKoreanAlt	53%
IndonesiaSumatraNMentawaiAn	81%	ThailandH'tinMK	68%	ChinaSHainanJiamaoTK	85%	KoreaKoreanAlt	53%
IndonesiaSumatraNMentawaiAn	81%	ThailandH'tinMK	68%	ChinaSHainanJiamaoTK	85%	ChinaBeijingHanST	53%
IndonesiaSumatraNMentawaiAn	80%	ThailandH'tinMK	67%	ChinaSHainanJiamaoTK	85%	KoreaKoreanAlt	52%
IndonesiaSumatraNMentawaiAn	80%	ThailandH'tinMK	67%	ChinaSHainanJiamaoTK	84%	KoreaKoreanAlt	52%
IndonesiaSumatraNMentawaiAn	79%	ThailandH'tinMK	63%	ChinaSHainanJiamaoTK	84%	KoreaKoreanAlt	52%
PhilippinesNLusonNIlokanoAn	77%	ChinaSWWaMK	59%	ChinaSHainanJiamaoTK	84%	ChinaBeijingHanST	52%
PhilippinesNLusonNIlokanoAn	77%	ThailandLawaMK	59%	ChinaSHainanJiamaoTK	84%	ChinaBeijingHanST	52%
TaiwanAmiAn	71%	ThailandLawaMK	59%	ChinaSHainanJiamaoTK	83%	ChinaBeijingHanST	51%
PhilippinesNLusonNIlokanoAn	70%	ChinaSWWaMK	58%	ChinaSHainanJiamaoTK	83%	ChinaBeijingHanST	51%
TaiwanAtayalAn	69%	ChinaSWWaMK	58%	ChinaSHainanJiamaoTK	82%	KoreaKoreanAlt	51%
PhilippinesNLusonNIlokanoAn	69%	ChinaSWWaMK	58%	ChinaSHainanJiamaoTK	82%	ThailandPaluangMK	51%
TaiwanAtayalAn	69%	ChinaSWWaMK	57%	ChinaSHainanJiamaoTK	82%	ChinaBeijingHanST	51%
TaiwanAtayalAn	68%	ChinaSWWaMK	57%	ChinaSHainanJiamaoTK	81%	ChinaBeijingHanST	51%
PhilippinesNLusonCentralTagalogAn	68%	ChinaSWWaMK	57%	ChinaSHainanJiamaoTK	81%	KoreaKoreanAlt	51%
PhilippinesNLusonNIlokanoAn	67%	ThailandTaiLueTK	57%	ChinaSHainanJiamaoTK	81%	ChinaBeijingHanST	51%
TaiwanAtayalAn	67%	ChinaSWWaMK	56%	ChinaSHainanJiamaoTK	80%	KoreaKoreanAlt	51%
PhilippinesSMindanaoCentralItaManoboAn	66%	ThailandH'tinMK	56%	ThailandTaiLueTK	80%	KoreaKoreanAlt	51%
TaiwanAmiAn	66%	ChinaSWWaMK	56%	ThailandTaiLueTK	79%	ChinaShanghaiHanST	51%
PhilippinesSMindanaoCentralItaManoboAn	66%	ThailandLawaMK	56%	ChinaSZhuangTK	79%	KoreaKoreanAlt	50%
PhilippinesNLusonNIlokanoAn	66%	ChinaSWWaMK	55%	ChinaSHainanJiamaoTK	78%	KoreaKoreanAlt	50%
PhilippinesNLusonNIlokanoAn	66%	ChinaSWWaMK	55%	ChinaSZhuangTK	78%	KoreaKoreanAlt	50%
TaiwanAmiAn	65%	ChinaSWWaMK	55%	ChinaSHainanJiamaoTK	78%	KoreaKoreanAlt	50%
PhilippinesNLusonNIlokanoAn	65%	ThailandLawaMK	55%	ChinaSHainanJiamaoTK	77%	KoreaKoreanAlt	50%
PhilippinesNLusonNIlokanoAn	65%	ThailandLawaMK	55%	ChinaSHainanJiamaoTK	77%	KoreaKoreanAlt	50%
TaiwanAmiAn	65%	ChinaSWWaMK	53%	ThailandTaiLueTK	76%	KoreaKoreanAlt	50%
PhilippinesSMindanaoCentralItaManoboAn	65%	ThailandLawaMK	53%	ChinaSZhuangTK	75%	ThailandPaluangMK	50%
TaiwanAmiAn	65%	ChinaSWWaMK	53%	ThailandTaiLueTK	75%	ChinaBeijingHanST	50%
TaiwanAtayalAn	64%	ThailandLawaMK	53%	ChinaSHainanJiamaoTK	75%	KoreaKoreanAlt	50%
PhilippinesCentralBisayaNItaIrayaAn	64%	ThailandLawaMK	53%	ThailandTaiLueTK	74%	KoreaKoreanAlt	50%
TaiwanAtayalAn	64%	ThailandLawaMK	52%	ChinaSZhuangTK	73%	ChinaBeijingHanST	50%
PhilippinesNLusonNIlokanoAn	64%	ChinaSWWaMK	52%	ChinaSZhuangTK	72%	ChinaBeijingHanST	49%
TaiwanAtayalAn	64%	ChinaSWWaMK	52%	ChinaSZhuangTK	72%	ChinaBeijingHanST	49%
TaiwanAmiAn	64%	ThailandLawaMK	52%	ChinaSZhuangTK	72%	ChinaBeijingHanST	49%
TaiwanAmiAn	64%	ThailandH'tinMK	52%	ChinaSZhuangTK	71%	KoreaKoreanAlt	49%
PhilippinesNLusonNIlokanoAn	64%	ChinaSWWaMK	52%	ChinaSZhuangTK	71%	ThailandPaluangMK	49%
PhilippinesSMindanaoCentralItaManoboAn	63%	ChinaSWWaMK	51%	ThailandTaiYongTK	71%	ChinaBeijingHanST	49%
TaiwanAmiAn	63%	ChinaSWWaMK	51%	ThailandTaiLueTK	71%	ThailandPaluangMK	49%
PhilippinesCentralBisayaNItaIrayaAn	63%	ThailandLawaMK	51%	ChinaSZhuangTK	71%	ThailandPaluangMK	49%
PhilippinesSMindanaoCentralItaManoboAn	63%	ChinaSWWaMK	51%	ChinaSZhuangTK	70%	ChinaBeijingHanST	49%

Genetic Distances

I tabulated the Fst (genetic distance) data to show how the populations compare with each other. For ease of interpretation, I used Excel data bars and sorting options, which together produced the chart shown in Table 5. In this chart, I highlighted the top 10 percentile to mark the close populations. The closer the Fst value is to zero, the closer the two populations are. I also performed a correlation analysis of the allele frequencies, shown in Table 6. Correlation analysis shows how much two groups vary together; the closer the rho value is to 1 (unity), the closer the two populations are, and for this analysis I highlighted the top 20 percentile. Although the relationships among the populations are mostly obvious, a principal component analysis (PCA) of the allele frequencies and a dendrogram of the genetic distances provide a better visualization of the variability/separation.
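
To make the correlation step concrete, here is a minimal sketch of a pairwise correlation of allele frequencies followed by a "highlight the closest pairs" cut. I am assuming a plain Pearson correlation and reading the 20 percentile as the top 20% of pairs; the populations PopA, PopB and PopC and their frequencies are invented for illustration and are not PASNP values.

```python
from itertools import combinations
from math import sqrt

# Hypothetical allele frequencies at five SNPs for three made-up populations.
# Real input would be the per-population frequencies estimated by Admixture.
freqs = {
    "PopA": [0.10, 0.80, 0.45, 0.30, 0.95],
    "PopB": [0.12, 0.75, 0.50, 0.28, 0.90],
    "PopC": [0.70, 0.20, 0.55, 0.85, 0.15],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def pairwise_correlations(freqs):
    """Return {(pop1, pop2): correlation} for every pair of populations."""
    return {
        (p, q): pearson(freqs[p], freqs[q]) for p, q in combinations(freqs, 2)
    }


def closest_pairs(scores, fraction=0.20):
    """Return the top `fraction` of pairs ranked by correlation (at least one)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]


scores = pairwise_correlations(freqs)
print(closest_pairs(scores))
```

For Fst the same ranking applies with the order reversed, since there the smallest values mark the closest populations.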

The Tai-Kadai speakers have the lowest Fst values. The Tai-Kadai, Nusantao, Ryukyu, Sino-Tibetan, Mon-Khmer, Dayak, Jinuo, Hmong-Mien and Temuan are very close to each other, as shown in the Fst table (Table 5), the correlation table (Table 6) and the PCA plot (Figure 3). I believe these populations once lived in close proximity to each other in Southwest China (which I will try to expand on in later blogs) before later expansions. The expansion was likely triggered by rice agriculture; it is possible these populations even had more language in common than they do today. Sagart has a paper linking the Sino-Tibetan and Austronesian languages. The Ryukyu people have a unique ethnic language, which linguists now consider in danger of extinction. The Jomon people are considered to be the first group to populate Japan. The Yayoi people, who brought rice agriculture, came next. There is some evidence linking Japanese with the Austronesian languages.

Obviously, the Europeans are close to the Indians, consistent with many genetic studies and with linguistic connections (the Indo-Aryan languages belong to the Indo-European family), but that is really not within the scope of this blog (at the moment).

It looks like the Ita groups are farther apart than I expected. The Fst, correlation and PCA all show that there are at least four unique groups:

1. Orang Asli (Malaysian Ita: Kensui and Jehai)

2. Aytas (Philippine Ita: Ayta, Ati, Agta, & Iraya)

3. Mamanwa (Philippine Ita)

4. Papuan

It’s possible the large separation of the four groups is due to a very long period of isolation from each other. Based on the Callao Man findings, Ita populations may have been in ISEA at least as far back as 67,000 years BP. This was likely facilitated by lowered sea levels during glacial maxima, which exposed land bridges. The crossing to most of Luson, the Bisayas and Mindanao was perhaps made by raft, and when the sea level rose again, the separation/isolation began. This separation seems to be older than the separation of the Tai-Kadai, Nusantao, Ryukyu, Sino-Tibetan, Mon-Khmer, Dayak, Jinuo, Hmong-Mien and Temuan, given that the Fst values of the individual Ita populations are very large. Additionally, I believe that East Asian admixture with the Ita populations is part of what led experts to differentiate Northern and Southern East Asians (again, for another blog).

Table 5: Fst divergences between estimated populations

Table 6: PASNP K=19 Allele Frequency Correlation

Table 7: Ita Populations Relative Fst

Figure 2: Allele Frequency PCA Loadings (Scree Plot)

Figure 3: Allele Frequency PC1 vs PC2 Plot

Figure 4: PC1 vs PC2 Plot (Zoomed)

Figure 5: Dendrogram, K=19

Figure 6: Admixture Result, K=19

References

1. Yang X, Xu S, The HUGO Pan-Asian SNP Consortium (2011) Identification of Close Relatives in the HUGO Pan-Asian SNP Database. PLoS ONE 6(12): e29502. doi:10.1371/journal.pone.0029502

2. D.H. Alexander, J. Novembre, and K. Lange. Fast model-based estimation of ancestry in unrelated individuals. Genome Research, 19:1655–1664, 2009

3. H. Zhou, D. H. Alexander, and K. Lange. A quasi-Newton method for accelerating the convergence of iterative optimization algorithms. Statistics and Computing, 2009.

4. Alexander D. H., Lange K. (2011). Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics 12:246. doi: 10.1186/1471-2105-12-246.

5. Greenhill, S.J., Blust, R., & Gray, R.D. (2008). The Austronesian Basic Vocabulary Database: From Bioinformatics to Lexomics. Evolutionary Bioinformatics, 4:271-283.

6. Mijares, A.S.B. et al. (2010). New evidence for a 67,000-year-old human presence at Callao Cave, Luzon, Philippines. Journal of Human Evolution, 59:123-132. doi:10.1016/j.jhevol.2010.04.008.

7. Sagart, L. (2002). Sino-Tibeto-Austronesian: An Updated and Improved Argument.

8. http://www.r-project.org/

MANGLALAYAG

Friday, June 21, 2013

SOUTHEAST ASIAN POPULATIONS ACCORDING TO PAN-ASIAN SNP ADMIXTURE ANALYSIS
