Friday, June 21, 2013

SOUTHEAST ASIAN POPULATIONS ACCORDING TO PAN-ASIAN SNP ADMIXTURE ANALYSIS


Pre-Processing of PASNP Data

The Pan-Asian SNP (PASNP) data set contains about 55,000 SNPs from 1,928 individuals. These individuals, mostly from East Asia, have known language, ancestry/ethnicity and location, with which we can partially uncover the unique populations in Southeast Asia.
A team from UCLA developed an efficient algorithm called Admixture, which estimates ancestry. To perform an admixture analysis, Admixture requires the number of unique populations in a given data set; this variable is called K. For example, if we provide a K value of 3 for the PASNP data set, it will likely end up grouping the data into Africans, Europeans, & East Asians because these three groups have the largest genetic distances between them. However, this study looks into a finer resolution within Southeast Asia, and this can only be attained with a larger K value.
The question then becomes: what is the exact, or at least the optimal, K value to use for the given data set?
Admixture provides a cross-validation (CV) analysis. This option cross-validates the assumed K for the given data set and provides an estimate of how well the assumed K represents the number of unique populations. Figure 1 plots the CV error for K from 3 to 25. In this figure, we can readily observe that as K increases, the CV error decreases, but there is a point where the error begins to increase again, meaning a much higher K value may not adequately represent the number of unique populations. The objective is to find the lowest CV error, and in the PASNP data, K=19 provides the lowest CV error (0.47187).
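The search for the lowest CV error can be sketched in a few lines of Python. This is a minimal sketch, assuming the Admixture logs were saved as text and contain the "CV error (K=...)" lines that the program prints when run with its --cv option. The function names and the sample log are illustrative only: the K=19 value is the one reported above, while the K=18 and K=20 values are made-up placeholders.

```python
import re

# Admixture, when run with --cv, reports lines such as "CV error (K=5): 0.51234".
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")

def parse_cv_errors(log_text):
    """Return {K: cv_error} for every CV line found in the log text."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}

def best_k(cv_errors):
    """Return the K with the smallest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)

# Placeholder log: only the K=19 value comes from this post;
# the K=18 and K=20 values are invented for illustration.
sample_log = """\
CV error (K=18): 0.47300
CV error (K=19): 0.47187
CV error (K=20): 0.47250
"""

cv_errors = parse_cv_errors(sample_log)
print(best_k(cv_errors), cv_errors[best_k(cv_errors)])
```

In practice the log text would be the concatenation of the logs from all runs (K from 3 to 25 here), and the resulting dictionary is also what a plot like Figure 1 would be drawn from.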


Figure 1: Pan-Asian SNP Admixture Cross Validation Error

PASNP Population Breakdown

Once the optimum number of unique populations is found, the next step is to identify and name each population. Using Excel's sorting and filtering features, I have managed to identify the unique populations (see Table 1).

Table 1: Population Description
Population | Description
TaiKadai | Tai-Kadai speaking people
Nusantao | Austronesian speaking people
Ryukyu | Ethnic group from Japan
ST_Alt | Sino-Tibetan or Altaic speaking people (perhaps the Han people)
MonKhmer | Mon-Khmer speaking people
Dayak | Ethnic group from Kalimantan, Indonesia
Jinuo | Ethnic group from Southwest China (Indochina)
HmongMien | Hmong-Mien speaking people
Temuan | Ethnic group from Malay Peninsula
Papuan | Papuan speaking people
Indian | South Asians
Jehai_Negrito | Ethnic group from Malay Peninsula
Kensui_Negrito | Ethnic group from Malay Peninsula
Ati_Negrito | Ethnic group from Bisaya
Mamanwa_Negrito | Ethnic group from Mindanao
Mlabri | Ethnic group from Thailand (Indochina)
European | White Americans from Utah, Nevada
African | Africans from Yoruba
Amerindian | Native Americans from North and South America

Since the African, European, Indian and Amerindian populations have the highest genetic distance from the East Asian populations, they were really easy to identify. For the East Asian populations, I mixed and matched between language family and ethnicity, because samples from different ethnicities but the same language family can show a high percentage for a given population, and vice versa. As an example (see Table 2), the HmongMien population has high percentages for both the Hmong of China and the Hmong/Miao of Thailand, two groups that share the same language family.
Except for MonKhmer, Nusantao, TaiKadai, & ST_Alt, identifying the East Asian populations was simple; most populations have several samples at 90% or above for their group. As an example (see Table 3), one population showed several Mlabri samples at 100%, while the next-highest sample had only 6%; this population can obviously be assigned as Mlabri.
For the Nusantao, the highest percentage was 88%, followed by a gradual decrease across many ethnic groups, mainly in Island Southeast Asia (ISEA), but the common denominator for this population was the Austronesian speaking people (see Table 4). I will explain later why I called them “Nusantao”. I took a similar approach for the MonKhmer and TaiKadai, where each population also had a recurring language family across different ethnic backgrounds.
The most difficult population to identify was ST_Alt. The highest percentage for this group is 60%, and it spreads gradually through the Sino-Tibetan, Altaic and, to a minor extent, Mon-Khmer speaking people (see Table 4). Initially, I thought this could be the Han ancestry; if true, the Han may have spread all over Northeast Asia with a decreasing percentage (admixing with the local populations) as they went South. For now, I’ll keep ST_Alt.
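The sorting and filtering described above was done in Excel, but the same idea can be sketched in Python. This is a minimal sketch under stated assumptions: each sample is a (label, ancestry fractions) pair, as could be built from Admixture's .Q output joined with the sample labels, and the language family is read from the code at the end of each label. The helper names are hypothetical. The nonzero fractions below are taken from Tables 2 and 3; the zeros are placeholders, since only two of the 19 components are shown.

```python
from collections import Counter

# Language-family codes that end the sample labels used in the tables
# (for example, "ThailandMlabriMK" ends in "MK" for Mon-Khmer).
FAMILY_CODES = ("Alt", "An", "HM", "MK", "ST", "TK")

def family_of(label):
    """Return the language-family code at the end of a sample label."""
    for code in FAMILY_CODES:
        if label.endswith(code):
            return code
    return "unknown"

def top_samples(samples, component, n=5):
    """Sort samples by their fraction in one ancestry component, highest first."""
    ranked = sorted(samples, key=lambda s: s[1][component], reverse=True)
    return [(label, fractions[component]) for label, fractions in ranked[:n]]

def families_above(samples, component, threshold=0.5):
    """Count language families among samples at or above the threshold."""
    return Counter(
        family_of(label)
        for label, fractions in samples
        if fractions[component] >= threshold
    )

# Column 0 stands for the HmongMien component, column 1 for Mlabri.
# Zeros are placeholders, not measured values.
samples = [
    ("ChinaHmongHM", [1.00, 0.00]),
    ("ThailandHmongMiaoHM", [0.84, 0.00]),
    ("ThailandMlabriMK", [0.00, 1.00]),
    ("ThailandH'tinMK", [0.00, 0.06]),
]

print(top_samples(samples, component=1, n=2))
print(families_above(samples, component=0))
```

A component whose top samples all come from one ethnic group (like Mlabri) can be named after that group, while a component whose top samples span several ethnic groups but one family code suggests a language-family name instead.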

Table 2: Hmong-Mien Multi-Ethnic Background Example
Group | HmongMien
ChinaHmongHM | 100%
ChinaHmongHM | 100%
ChinaHmongHM | 100%
ThailandHmongMiaoHM | 84%
ThailandHmongMiaoHM | 80%
ThailandHmongMiaoHM | 78%
ThailandHmongMiaoHM | 78%
ThailandHmongMiaoHM | 77%
ThailandHmongMiaoHM | 76%
ThailandHmongMiaoHM | 76%
ThailandHmongMiaoHM | 74%
ThailandHmongMiaoHM | 73%
ThailandHmongMiaoHM | 70%
ThailandHmongMiaoHM | 69%
ThailandHmongMiaoHM | 69%
ChinaHmongHM | 69%
ThailandHmongMiaoHM | 69%
ThailandHmongMiaoHM | 69%
ThailandHmongMiaoHM | 67%
ChinaHmongHM | 67%
ThailandHmongMiaoHM | 67%
ChinaHmongHM | 65%
ThailandHmongMiaoHM | 65%
ChinaHmongHM | 64%
ThailandHmongMiaoHM | 63%
ChinaHmongHM | 62%
ChinaHmongHM | 62%

Table 3: Mlabri Percent Ancestry Example
Group | Mlabri
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandH'tinMK | 6%
MalaysiaMainTemuanAn | 4%
IndonesiaJavaJavaAn | 4%
IndonesiaJavaJavaAn | 4%
IndonesiaJavaSundaAn | 4%
ThailandKarenST | 4%
SingapuraMalayAn | 4%
ThailandKarenST | 4%
SingapuraMalayAn | 4%
IndonesiaSumatraSMalayAn | 4%
SingapuraMalayAn | 4%

Table 4: MonKhmer, Nusantao, TaiKadai, & ST_Alt Percentage Pattern
Group | Nusantao | Group | MonKhmer | Group | TaiKadai | Group | ST_Alt
IndonesiaSumatraNMentawaiAn | 88% | ThailandH'tinMK | 80% | ChinaSHainanJiamaoTK | 88% | IndiaNSpitiST | 60%
IndonesiaSumatraNMentawaiAn | 87% | ThailandH'tinMK | 79% | ChinaSHainanJiamaoTK | 87% | IndiaNSpitiST | 57%
IndonesiaSumatraNMentawaiAn | 87% | ThailandH'tinMK | 78% | ChinaSHainanJiamaoTK | 87% | IndiaNSpitiST | 56%
IndonesiaSumatraNMentawaiAn | 86% | ThailandH'tinMK | 75% | ChinaSHainanJiamaoTK | 87% | IndiaNSpitiST | 55%
IndonesiaSumatraNMentawaiAn | 86% | ThailandH'tinMK | 75% | ChinaSHainanJiamaoTK | 86% | ChinaBeijingHanST | 55%
IndonesiaSumatraNMentawaiAn | 84% | ThailandH'tinMK | 74% | ChinaSHainanJiamaoTK | 86% | KoreaKoreanAlt | 55%
IndonesiaSumatraNMentawaiAn | 84% | ThailandH'tinMK | 73% | ChinaSHainanJiamaoTK | 85% | KoreaKoreanAlt | 54%
IndonesiaSumatraNMentawaiAn | 83% | ThailandH'tinMK | 73% | ChinaSHainanJiamaoTK | 85% | ThailandPaluangMK | 53%
IndonesiaSumatraNMentawaiAn | 82% | ThailandH'tinMK | 72% | ChinaSHainanJiamaoTK | 85% | ChinaBeijingHanST | 53%
IndonesiaSumatraNMentawaiAn | 81% | ThailandH'tinMK | 71% | ChinaSHainanJiamaoTK | 85% | KoreaKoreanAlt | 53%
IndonesiaSumatraNMentawaiAn | 81% | ThailandH'tinMK | 68% | ChinaSHainanJiamaoTK | 85% | KoreaKoreanAlt | 53%
IndonesiaSumatraNMentawaiAn | 81% | ThailandH'tinMK | 68% | ChinaSHainanJiamaoTK | 85% | ChinaBeijingHanST | 53%
IndonesiaSumatraNMentawaiAn | 80% | ThailandH'tinMK | 67% | ChinaSHainanJiamaoTK | 85% | KoreaKoreanAlt | 52%
IndonesiaSumatraNMentawaiAn | 80% | ThailandH'tinMK | 67% | ChinaSHainanJiamaoTK | 84% | KoreaKoreanAlt | 52%
IndonesiaSumatraNMentawaiAn | 79% | ThailandH'tinMK | 63% | ChinaSHainanJiamaoTK | 84% | KoreaKoreanAlt | 52%
PhilippinesNLusonNIlokanoAn | 77% | ChinaSWWaMK | 59% | ChinaSHainanJiamaoTK | 84% | ChinaBeijingHanST | 52%
PhilippinesNLusonNIlokanoAn | 77% | ThailandLawaMK | 59% | ChinaSHainanJiamaoTK | 84% | ChinaBeijingHanST | 52%
TaiwanAmiAn | 71% | ThailandLawaMK | 59% | ChinaSHainanJiamaoTK | 83% | ChinaBeijingHanST | 51%
PhilippinesNLusonNIlokanoAn | 70% | ChinaSWWaMK | 58% | ChinaSHainanJiamaoTK | 83% | ChinaBeijingHanST | 51%
TaiwanAtayalAn | 69% | ChinaSWWaMK | 58% | ChinaSHainanJiamaoTK | 82% | KoreaKoreanAlt | 51%
PhilippinesNLusonNIlokanoAn | 69% | ChinaSWWaMK | 58% | ChinaSHainanJiamaoTK | 82% | ThailandPaluangMK | 51%
TaiwanAtayalAn | 69% | ChinaSWWaMK | 57% | ChinaSHainanJiamaoTK | 82% | ChinaBeijingHanST | 51%
TaiwanAtayalAn | 68% | ChinaSWWaMK | 57% | ChinaSHainanJiamaoTK | 81% | ChinaBeijingHanST | 51%
PhilippinesNLusonCentralTagalogAn | 68% | ChinaSWWaMK | 57% | ChinaSHainanJiamaoTK | 81% | KoreaKoreanAlt | 51%
PhilippinesNLusonNIlokanoAn | 67% | ThailandTaiLueTK | 57% | ChinaSHainanJiamaoTK | 81% | ChinaBeijingHanST | 51%
TaiwanAtayalAn | 67% | ChinaSWWaMK | 56% | ChinaSHainanJiamaoTK | 80% | KoreaKoreanAlt | 51%
PhilippinesSMindanaoCentralItaManoboAn | 66% | ThailandH'tinMK | 56% | ThailandTaiLueTK | 80% | KoreaKoreanAlt | 51%
TaiwanAmiAn | 66% | ChinaSWWaMK | 56% | ThailandTaiLueTK | 79% | ChinaShanghaiHanST | 51%
PhilippinesSMindanaoCentralItaManoboAn | 66% | ThailandLawaMK | 56% | ChinaSZhuangTK | 79% | KoreaKoreanAlt | 50%
PhilippinesNLusonNIlokanoAn | 66% | ChinaSWWaMK | 55% | ChinaSHainanJiamaoTK | 78% | KoreaKoreanAlt | 50%
PhilippinesNLusonNIlokanoAn | 66% | ChinaSWWaMK | 55% | ChinaSZhuangTK | 78% | KoreaKoreanAlt | 50%
TaiwanAmiAn | 65% | ChinaSWWaMK | 55% | ChinaSHainanJiamaoTK | 78% | KoreaKoreanAlt | 50%
PhilippinesNLusonNIlokanoAn | 65% | ThailandLawaMK | 55% | ChinaSHainanJiamaoTK | 77% | KoreaKoreanAlt | 50%
PhilippinesNLusonNIlokanoAn | 65% | ThailandLawaMK | 55% | ChinaSHainanJiamaoTK | 77% | KoreaKoreanAlt | 50%
TaiwanAmiAn | 65% | ChinaSWWaMK | 53% | ThailandTaiLueTK | 76% | KoreaKoreanAlt | 50%
PhilippinesSMindanaoCentralItaManoboAn | 65% | ThailandLawaMK | 53% | ChinaSZhuangTK | 75% | ThailandPaluangMK | 50%
TaiwanAmiAn | 65% | ChinaSWWaMK | 53% | ThailandTaiLueTK | 75% | ChinaBeijingHanST | 50%
TaiwanAtayalAn | 64% | ThailandLawaMK | 53% | ChinaSHainanJiamaoTK | 75% | KoreaKoreanAlt | 50%
PhilippinesCentralBisayaNItaIrayaAn | 64% | ThailandLawaMK | 53% | ThailandTaiLueTK | 74% | KoreaKoreanAlt | 50%
TaiwanAtayalAn | 64% | ThailandLawaMK | 52% | ChinaSZhuangTK | 73% | ChinaBeijingHanST | 50%
PhilippinesNLusonNIlokanoAn | 64% | ChinaSWWaMK | 52% | ChinaSZhuangTK | 72% | ChinaBeijingHanST | 49%
TaiwanAtayalAn | 64% | ChinaSWWaMK | 52% | ChinaSZhuangTK | 72% | ChinaBeijingHanST | 49%
TaiwanAmiAn | 64% | ThailandLawaMK | 52% | ChinaSZhuangTK | 72% | ChinaBeijingHanST | 49%
TaiwanAmiAn | 64% | ThailandH'tinMK | 52% | ChinaSZhuangTK | 71% | KoreaKoreanAlt | 49%
PhilippinesNLusonNIlokanoAn | 64% | ChinaSWWaMK | 52% | ChinaSZhuangTK | 71% | ThailandPaluangMK | 49%
PhilippinesSMindanaoCentralItaManoboAn | 63% | ChinaSWWaMK | 51% | ThailandTaiYongTK | 71% | ChinaBeijingHanST | 49%
TaiwanAmiAn | 63% | ChinaSWWaMK | 51% | ThailandTaiLueTK | 71% | ThailandPaluangMK | 49%
PhilippinesCentralBisayaNItaIrayaAn | 63% | ThailandLawaMK | 51% | ChinaSZhuangTK | 71% | ThailandPaluangMK | 49%
PhilippinesSMindanaoCentralItaManoboAn | 63% | ChinaSWWaMK | 51% | ChinaSZhuangTK | 70% | ChinaBeijingHanST | 49%

Genetic Distances

I tabulated the Fst (genetic distance) data to demonstrate how the populations compare with each other. For ease of interpretation, I used Excel data bars and sorting options, the combination of which produced the chart shown in Table 5. In this chart, I used the top 10 percentile to highlight the close populations. The closer the Fst value is to zero, the closer the two populations are. I also performed a correlation analysis of the allele frequencies, as shown in Table 6. Correlation analysis shows how much two groups vary together; the closer the Rho value is to 1 (unity), the closer the two populations are, and for this analysis I used the 20 percentile to demonstrate the closeness of the populations. Although the relationships between the populations are for the most part obvious, a principal component analysis (PCA) of the allele frequencies and a dendrogram of the genetic distances provide a better visualization of the variability/separation.
The Tai-Kadai speakers have the lowest Fst values. The Tai-Kadai, Nusantao, Ryukyu, Sino-Tibetan, Mon-Khmer, Dayak, Jinuo, Hmong-Mien & Temuan are very close to each other, as shown in the Fst values (Table 5), the correlations (Table 6) and the PCA (Figure 3). I believe these populations once lived in close proximity to each other in Southwest China (which I will try to expand on in later blogs) before later expansions. The expansion was likely triggered by rice agriculture. It is possible that these populations once spoke a more common language than they do today. Sagart has a paper linking the Sino-Tibetan and Austronesian languages. The Ryukyu people have a unique ethnic language which linguists now consider in danger of extinction. The Jomon people are considered to be the first group to populate Japan. The Yayoi people, who brought rice agriculture, came next. There is some evidence linking Japanese with the Austronesian languages.
Obviously, the Europeans are close to the Indians, based on many genetic studies and linguistic connections (Indo-Aryan language family), but that is really not in the scope of this blog (at the moment).
It looks like the Ita groups are farther apart than I expected. The Fst, correlation and PCA all show that there are at least four unique groups:
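The two closeness measures above can be illustrated with a short Python sketch. It makes several assumptions: the correlation is taken to be a plain Pearson correlation between two populations' allele-frequency vectors (the post does not say which correlation Excel computed), the percentile highlight is read as "the lowest fraction of Fst values", and the populations A, B and C with all of their numbers are invented placeholders, not PASNP results.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length frequency vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def closest_pairs(fst, fraction=0.10):
    """Return the lowest `fraction` of pairs by Fst (always at least one pair)."""
    ranked = sorted(fst.items(), key=lambda item: item[1])
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

# Invented allele frequencies at five SNPs for three placeholder populations.
freq = {
    "A": [0.10, 0.40, 0.80, 0.55, 0.30],
    "B": [0.12, 0.38, 0.82, 0.50, 0.33],
    "C": [0.90, 0.20, 0.35, 0.70, 0.05],
}

# Invented pairwise Fst values for the same placeholder populations.
fst = {("A", "B"): 0.01, ("A", "C"): 0.12, ("B", "C"): 0.11}

print(round(pearson(freq["A"], freq["B"]), 3))
print(round(pearson(freq["A"], freq["C"]), 3))
print(closest_pairs(fst))
```

Both measures should tell the same story: a pair with a low Fst (A and B here) should also have a correlation close to 1, while a distant pair shows a weak or negative correlation.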
1.      Orang Asli (Malaysian Ita: Kensui and Jehai)
2.      Aytas (Philippine Ita: Ayta, Ati, Agta, & Iraya)
3.      Mamanwa  (Philippine Ita)
4.      Papuan

It’s possible that the large separation between the four groups is due to a very long period of isolation from each other. Based on the Callao Man findings, Ita populations may have been in ISEA at least as far back as 67K BP. Their arrival was likely facilitated by receded sea levels, which exposed land bridges during glacial maxima. The crossing to most of Luson, the Bisayas and Mindanao was perhaps made by raft, and when the sea level rose again, the separation/isolation began. This separation seems to be older than the separation of the Tai-Kadai, Nusantao, Ryukyu, Sino-Tibetan, Mon-Khmer, Dayak, Jinuo, Hmong-Mien & Temuan, given that the Fst values of the individual Ita populations are very large. Additionally, I also believe that the East Asian admixture with the Ita populations is partly what caused experts to differentiate the Northern and Southern East Asians (again, for another blog).
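The idea that larger Fst values point to older separations is what a dendrogram of genetic distances makes visible. The following is a minimal sketch, assuming average-linkage (UPGMA) clustering, which may not be the method behind the dendrogram figure in this post. The labels P1 to P4 and their distances are invented placeholders, not PASNP values.

```python
from itertools import combinations

def average_linkage(labels, dist):
    """Agglomerative clustering with average linkage (UPGMA).

    `dist` maps frozenset({a, b}) to the distance between labels a and b.
    Returns the merges in order as (left_members, right_members, distance).
    """
    clusters = [(label,) for label in labels]
    merges = []

    def cluster_distance(left, right):
        # Mean of all pairwise distances between members of the two clusters.
        pairs = [dist[frozenset((a, b))] for a in left for b in right]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two clusters that are currently closest.
        left, right = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(*pair),
        )
        merges.append((left, right, cluster_distance(left, right)))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
    return merges

# Invented distances: P1 and P2 are close, P3 is further, P4 is the outlier.
labels = ["P1", "P2", "P3", "P4"]
dist = {
    frozenset(("P1", "P2")): 0.02,
    frozenset(("P1", "P3")): 0.10,
    frozenset(("P2", "P3")): 0.12,
    frozenset(("P1", "P4")): 0.30,
    frozenset(("P2", "P4")): 0.32,
    frozenset(("P3", "P4")): 0.28,
}

merges = average_linkage(labels, dist)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

Populations that join the tree last, at the greatest height, are the ones most separated from the rest; in the reasoning above, that is the position the individual Ita groups occupy relative to the tightly clustered East Asian populations.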

Table 5: Fst divergences between estimated populations


Table 6: PASNP K=19 Allele Frequency Correlation

Table 7: Ita Populations Relative Fst

Figure 2: Allele Frequency PCA Loadings (Scree Plot)


Figure 3: Allele Frequency PC1 vs PC2 Plot

Figure 4: PC1 vs PC2 Plot (Zoomed)


Figure 5: Dendrogram, K=19


Figure 6: Admixture Result, K=19


References

1.      Yang X, Xu S, The HUGO Pan-Asian SNP Consortium (2011) Identification of Close Relatives in the HUGO Pan-Asian SNP Database. PLoS ONE 6(12): e29502. doi:10.1371/journal.pone.0029502
2.      D.H. Alexander, J. Novembre, and K. Lange. Fast model-based estimation of ancestry in unrelated individuals. Genome Research, 19:1655–1664, 2009
3.      H. Zhou, D. H. Alexander, and K.  Lange. A quasi-Newton method for accelerating the convergence of iterative optimization algorithms. Statistics and Computing, 2009.
4.      Alexander D. H., Lange K. (2011). Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics 12:246. doi: 10.1186/1471-2105-12-246.
5.      Greenhill, S.J., Blust. R, & Gray, R.D. (2008). The Austronesian Basic Vocabulary Database: From Bioinformatics to Lexomics. Evolutionary Bioinformatics, 4:271-283.
6.      Mijares, A.S.B. et al. 2010. New evidence for a 67,000-year-old human presence at Callao Cave, Luzon, Philippines. Journal of Human Evolution, 59:123-132. doi:10.1016/j.jhevol.2010.04.008.
7.      Sagart, L. (2002). Sino-Tibeto-Austronesian: An Updated and Improved Argument. Paper presented at the Ninth International Conference on Austronesian Linguistics (ICAL9), Canberra.