Pre-Processing of PASNP Data
The Pan-Asian SNP (PASNP) data set contains about 55,000 SNPs from 1,928
individuals. These individuals, mostly from East Asia, have known language,
ancestry/ethnicity and location, with which we can partially uncover the
unique populations in Southeast Asia.
The question then becomes: what is the exact, or at least the optimum,
K value to use for the given data set?
Admixture provides a cross-validation (CV) analysis.
This option “cross validates” the assumed K for the given data set, which
provides an estimate of how well the assumed K represents the number of
unique populations. Figure 1 plots the CV error for K from 3 to 25. In this
figure, we can readily observe that as K increases, the CV error decreases,
but there is a point where the error begins to increase again, meaning that a
much higher K value may not adequately represent the number of unique
populations. The objective is to find the lowest CV error, and in the PASNP
data, K=19 provides the lowest CV error (0.47187).
Figure 1: Pan-Asian SNP Admixture Cross Validation Error
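To make the K selection concrete, here is a minimal Python sketch that collects the CV error for each K and picks the lowest one. It assumes the runs were made with Admixture's --cv option and that each log contains a summary line of the form "CV error (K=19): 0.47187"; the function names and the example log text are hypothetical, and only the K=19 value comes from this post.

```python
import re

# Summary line Admixture prints when run with --cv, e.g.
# "CV error (K=19): 0.47187" (line format assumed from typical logs).
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")

def parse_cv_errors(log_text):
    """Return {K: cv_error} for every CV summary line found in log_text."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}

def best_k(cv_errors):
    """Return the K with the lowest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)

# Illustrative excerpt: the K=18 and K=20 values are placeholders,
# not results from the PASNP run.
example_log = """
CV error (K=18): 0.47250
CV error (K=19): 0.47187
CV error (K=20): 0.47300
"""

errors = parse_cv_errors(example_log)
print(best_k(errors), errors[best_k(errors)])
```

In practice the log text would be the concatenation of the log files from all the runs (K=3 to K=25 here), and the resulting dictionary is also what one would plot to get a curve like Figure 1.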
PASNP Population Breakdown
Once the optimum number of unique populations is found, the
next step is to appropriately identify or assign each population. Using Excel's
sorting and filtering features, I have managed to identify the unique
populations (see Table 1).
Table 1: Population Description
Population | Description
TaiKadai | Tai-Kadai speaking people
Nusantao | Austronesian speaking people
Ryukyu | Ethnic group from Japan
ST_Alt | Sino-Tibetan or Altaic speaking people (perhaps the Han people)
MonKhmer | Mon-Khmer speaking people
Dayak | Ethnic group from Kalimantan, Indonesia
Jinuo | Ethnic group from Southwest China (Indochina)
HmongMien | Hmong-Mien speaking people
Temuan | Ethnic group from Malay Peninsula
Papuan | Papuan speaking people
Indian | South Asians
Jehai_Negrito | Ethnic group from Malay Peninsula
Kensui_Negrito | Ethnic group from Malay Peninsula
Ati_Negrito | Ethnic group from Bisaya
Mamanwa_Negrito | Ethnic group from Mindanao
Mlabri | Ethnic group from Thailand (Indochina)
European | White Americans from Utah, Nevada
African | Africans from Yoruba
Amerindian | Native Americans from North and South America
Since the African, European, Indian and Amerindian populations have the
highest genetic distance from the East Asian populations, they were
really easy to identify. For the East Asian populations, I mixed and
matched between language family and ethnicity, because samples from
different ethnicities that share a language family can show a high percentage
for a given population, and vice versa. As an example (see Table 2), the
HmongMien population shows high percentages for both the Hmong of China and
the Miao of Thailand, two groups that share the same language family.
Except for the MonKhmer, Nusantao, TaiKadai and ST_Alt,
identifying the East Asian populations was simple; most of the populations
have several samples at 90% or above for their group. As an example (see
Table 3), one population has several Mlabri samples at 100%, while the next
highest sample has only 6%; this population can obviously be assigned
as Mlabri.
For the Nusantao, the highest percentage was 88%, followed by a
gradual decrease across many ethnic groups, mainly in Island Southeast Asia
(ISEA), but the common denominator for this population was the Austronesian
speaking people (see Table 4). I will explain later why I called them
“Nusantao”. I took a similar approach for the MonKhmer and TaiKadai, where
each population also had a recurring language family across different ethnic
backgrounds.
The most difficult population to identify was the ST_Alt.
The highest percentage for this group is 60%, and it gradually spreads
throughout the Sino-Tibetan, Altaic and, to a minor extent, Mon-Khmer speaking
people (see Table 4). Initially, I thought this could be the Han ancestry; if
true, the Han may have spread all over Northeast Asia, with the percentage
decreasing (through admixing with the local populations) as they went south.
For now, I’ll keep ST_Alt.
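The sort-and-filter procedure described above was done in Excel; the following is only a rough Python equivalent, not the original workflow. It assumes each sample carries a label like those in Tables 2 to 4 and that the trailing code (An, MK, TK, ST, HM, Alt) marks the language family, which is my reading of the labels. The function names are hypothetical, and the toy numbers reuse the 100%, 6% and 4% values of Table 3, with a made-up second column.

```python
from collections import defaultdict

# Trailing codes seen in the sample labels; treating them as language
# families is an assumption based on the tables in this post.
FAMILY_CODES = ("Alt", "An", "HM", "MK", "ST", "TK")

def family_suffix(label):
    """Return the language-family code at the end of a sample label."""
    for code in FAMILY_CODES:
        if label.endswith(code):
            return code
    return "unknown"

def top_samples(labels, q_rows, component, n=5):
    """Return the n (label, fraction) pairs with the largest share of one component."""
    ranked = sorted(zip(labels, (row[component] for row in q_rows)),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

def mean_by_family(labels, q_rows, component):
    """Average one component's share within each language family."""
    shares = defaultdict(list)
    for label, row in zip(labels, q_rows):
        shares[family_suffix(label)].append(row[component])
    return {fam: sum(vals) / len(vals) for fam, vals in shares.items()}

# Toy ancestry fractions (rows sum to 1); column 0 plays the role of the
# Mlabri component, column 1 is a placeholder for everything else.
labels = ["ThailandMlabriMK", "ThailandMlabriMK",
          "ThailandH'tinMK", "ThailandKarenST"]
q_rows = [[1.00, 0.00], [1.00, 0.00], [0.06, 0.94], [0.04, 0.96]]

print(top_samples(labels, q_rows, component=0, n=3))
print(mean_by_family(labels, q_rows, component=0))
```

A sharp drop in the ranked list (100% to 6% here) points to an ethnicity-specific component, while a gradual decline that stays within one language-family code is the pattern I used for the Nusantao, MonKhmer and TaiKadai.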
Table 2: Hmong-Mien Multi-Ethnic Background Example
Group | HmongMien
ChinaHmongHM | 100%
ChinaHmongHM | 100%
ChinaHmongHM | 100%
ThailandHmongMiaoHM | 84%
ThailandHmongMiaoHM | 80%
ThailandHmongMiaoHM | 78%
ThailandHmongMiaoHM | 78%
ThailandHmongMiaoHM | 77%
ThailandHmongMiaoHM | 76%
ThailandHmongMiaoHM | 76%
ThailandHmongMiaoHM | 74%
ThailandHmongMiaoHM | 73%
ThailandHmongMiaoHM | 70%
ThailandHmongMiaoHM | 69%
ThailandHmongMiaoHM | 69%
ChinaHmongHM | 69%
ThailandHmongMiaoHM | 69%
ThailandHmongMiaoHM | 69%
ThailandHmongMiaoHM | 67%
ChinaHmongHM | 67%
ThailandHmongMiaoHM | 67%
ChinaHmongHM | 65%
ThailandHmongMiaoHM | 65%
ChinaHmongHM | 64%
ThailandHmongMiaoHM | 63%
ChinaHmongHM | 62%
ChinaHmongHM | 62%
Table 3: Mlabri Percent Ancestry Example
Group | Mlabri
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandMlabriMK | 100%
ThailandH'tinMK | 6%
MalaysiaMainTemuanAn | 4%
IndonesiaJavaJavaAn | 4%
IndonesiaJavaJavaAn | 4%
IndonesiaJavaSundaAn | 4%
ThailandKarenST | 4%
SingapuraMalayAn | 4%
ThailandKarenST | 4%
SingapuraMalayAn | 4%
IndonesiaSumatraSMalayAn | 4%
SingapuraMalayAn | 4%
Table 4: MonKhmer, Nusantao, TaiKadai, & ST_Alt Percentage Pattern
Group | Nusantao | Group | MonKhmer | Group | TaiKadai | Group | ST_Alt
IndonesiaSumatraNMentawaiAn | 88% | ThailandH'tinMK | 80% | ChinaSHainanJiamaoTK | 88% | IndiaNSpitiST | 60%
IndonesiaSumatraNMentawaiAn | 87% | ThailandH'tinMK | 79% | ChinaSHainanJiamaoTK | 87% | IndiaNSpitiST | 57%
IndonesiaSumatraNMentawaiAn | 87% | ThailandH'tinMK | 78% | ChinaSHainanJiamaoTK | 87% | IndiaNSpitiST | 56%
IndonesiaSumatraNMentawaiAn | 86% | ThailandH'tinMK | 75% | ChinaSHainanJiamaoTK | 87% | IndiaNSpitiST | 55%
IndonesiaSumatraNMentawaiAn | 86% | ThailandH'tinMK | 75% | ChinaSHainanJiamaoTK | 86% | ChinaBeijingHanST | 55%
IndonesiaSumatraNMentawaiAn | 84% | ThailandH'tinMK | 74% | ChinaSHainanJiamaoTK | 86% | KoreaKoreanAlt | 55%
IndonesiaSumatraNMentawaiAn | 84% | ThailandH'tinMK | 73% | ChinaSHainanJiamaoTK | 85% | KoreaKoreanAlt | 54%
IndonesiaSumatraNMentawaiAn | 83% | ThailandH'tinMK | 73% | ChinaSHainanJiamaoTK | 85% | ThailandPaluangMK | 53%
IndonesiaSumatraNMentawaiAn | 82% | ThailandH'tinMK | 72% | ChinaSHainanJiamaoTK | 85% | ChinaBeijingHanST | 53%
IndonesiaSumatraNMentawaiAn | 81% | ThailandH'tinMK | 71% | ChinaSHainanJiamaoTK | 85% | KoreaKoreanAlt | 53%
IndonesiaSumatraNMentawaiAn | 81% | ThailandH'tinMK | 68% | ChinaSHainanJiamaoTK | 85% | KoreaKoreanAlt | 53%
IndonesiaSumatraNMentawaiAn | 81% | ThailandH'tinMK | 68% | ChinaSHainanJiamaoTK | 85% | ChinaBeijingHanST | 53%
IndonesiaSumatraNMentawaiAn | 80% | ThailandH'tinMK | 67% | ChinaSHainanJiamaoTK | 85% | KoreaKoreanAlt | 52%
IndonesiaSumatraNMentawaiAn | 80% | ThailandH'tinMK | 67% | ChinaSHainanJiamaoTK | 84% | KoreaKoreanAlt | 52%
IndonesiaSumatraNMentawaiAn | 79% | ThailandH'tinMK | 63% | ChinaSHainanJiamaoTK | 84% | KoreaKoreanAlt | 52%
PhilippinesNLusonNIlokanoAn | 77% | ChinaSWWaMK | 59% | ChinaSHainanJiamaoTK | 84% | ChinaBeijingHanST | 52%
PhilippinesNLusonNIlokanoAn | 77% | ThailandLawaMK | 59% | ChinaSHainanJiamaoTK | 84% | ChinaBeijingHanST | 52%
TaiwanAmiAn | 71% | ThailandLawaMK | 59% | ChinaSHainanJiamaoTK | 83% | ChinaBeijingHanST | 51%
PhilippinesNLusonNIlokanoAn | 70% | ChinaSWWaMK | 58% | ChinaSHainanJiamaoTK | 83% | ChinaBeijingHanST | 51%
TaiwanAtayalAn | 69% | ChinaSWWaMK | 58% | ChinaSHainanJiamaoTK | 82% | KoreaKoreanAlt | 51%
PhilippinesNLusonNIlokanoAn | 69% | ChinaSWWaMK | 58% | ChinaSHainanJiamaoTK | 82% | ThailandPaluangMK | 51%
TaiwanAtayalAn | 69% | ChinaSWWaMK | 57% | ChinaSHainanJiamaoTK | 82% | ChinaBeijingHanST | 51%
TaiwanAtayalAn | 68% | ChinaSWWaMK | 57% | ChinaSHainanJiamaoTK | 81% | ChinaBeijingHanST | 51%
PhilippinesNLusonCentralTagalogAn | 68% | ChinaSWWaMK | 57% | ChinaSHainanJiamaoTK | 81% | KoreaKoreanAlt | 51%
PhilippinesNLusonNIlokanoAn | 67% | ThailandTaiLueTK | 57% | ChinaSHainanJiamaoTK | 81% | ChinaBeijingHanST | 51%
TaiwanAtayalAn | 67% | ChinaSWWaMK | 56% | ChinaSHainanJiamaoTK | 80% | KoreaKoreanAlt | 51%
PhilippinesSMindanaoCentralItaManoboAn | 66% | ThailandH'tinMK | 56% | ThailandTaiLueTK | 80% | KoreaKoreanAlt | 51%
TaiwanAmiAn | 66% | ChinaSWWaMK | 56% | ThailandTaiLueTK | 79% | ChinaShanghaiHanST | 51%
PhilippinesSMindanaoCentralItaManoboAn | 66% | ThailandLawaMK | 56% | ChinaSZhuangTK | 79% | KoreaKoreanAlt | 50%
PhilippinesNLusonNIlokanoAn | 66% | ChinaSWWaMK | 55% | ChinaSHainanJiamaoTK | 78% | KoreaKoreanAlt | 50%
PhilippinesNLusonNIlokanoAn | 66% | ChinaSWWaMK | 55% | ChinaSZhuangTK | 78% | KoreaKoreanAlt | 50%
TaiwanAmiAn | 65% | ChinaSWWaMK | 55% | ChinaSHainanJiamaoTK | 78% | KoreaKoreanAlt | 50%
PhilippinesNLusonNIlokanoAn | 65% | ThailandLawaMK | 55% | ChinaSHainanJiamaoTK | 77% | KoreaKoreanAlt | 50%
PhilippinesNLusonNIlokanoAn | 65% | ThailandLawaMK | 55% | ChinaSHainanJiamaoTK | 77% | KoreaKoreanAlt | 50%
TaiwanAmiAn | 65% | ChinaSWWaMK | 53% | ThailandTaiLueTK | 76% | KoreaKoreanAlt | 50%
PhilippinesSMindanaoCentralItaManoboAn | 65% | ThailandLawaMK | 53% | ChinaSZhuangTK | 75% | ThailandPaluangMK | 50%
TaiwanAmiAn | 65% | ChinaSWWaMK | 53% | ThailandTaiLueTK | 75% | ChinaBeijingHanST | 50%
TaiwanAtayalAn | 64% | ThailandLawaMK | 53% | ChinaSHainanJiamaoTK | 75% | KoreaKoreanAlt | 50%
PhilippinesCentralBisayaNItaIrayaAn | 64% | ThailandLawaMK | 53% | ThailandTaiLueTK | 74% | KoreaKoreanAlt | 50%
TaiwanAtayalAn | 64% | ThailandLawaMK | 52% | ChinaSZhuangTK | 73% | ChinaBeijingHanST | 50%
PhilippinesNLusonNIlokanoAn | 64% | ChinaSWWaMK | 52% | ChinaSZhuangTK | 72% | ChinaBeijingHanST | 49%
TaiwanAtayalAn | 64% | ChinaSWWaMK | 52% | ChinaSZhuangTK | 72% | ChinaBeijingHanST | 49%
TaiwanAmiAn | 64% | ThailandLawaMK | 52% | ChinaSZhuangTK | 72% | ChinaBeijingHanST | 49%
TaiwanAmiAn | 64% | ThailandH'tinMK | 52% | ChinaSZhuangTK | 71% | KoreaKoreanAlt | 49%
PhilippinesNLusonNIlokanoAn | 64% | ChinaSWWaMK | 52% | ChinaSZhuangTK | 71% | ThailandPaluangMK | 49%
PhilippinesSMindanaoCentralItaManoboAn | 63% | ChinaSWWaMK | 51% | ThailandTaiYongTK | 71% | ChinaBeijingHanST | 49%
TaiwanAmiAn | 63% | ChinaSWWaMK | 51% | ThailandTaiLueTK | 71% | ThailandPaluangMK | 49%
PhilippinesCentralBisayaNItaIrayaAn | 63% | ThailandLawaMK | 51% | ChinaSZhuangTK | 71% | ThailandPaluangMK | 49%
PhilippinesSMindanaoCentralItaManoboAn | 63% | ChinaSWWaMK | 51% | ChinaSZhuangTK | 70% | ChinaBeijingHanST | 49%
Genetic Distances
I tabulated the Fst (genetic distance) data to demonstrate
how the populations compare with each other. For ease of interpretation, I used
Excel data bars and sorting options, the combination of which produced the chart
shown in Table 5. In this chart, I used the top 10th percentile to highlight the
closest populations. Basically, the closer the Fst value is to zero, the closer
the two populations are. I also performed a correlation analysis of the allele
frequencies, as shown in Table 6. Correlation analysis shows how much two groups
vary together; the closer the Rho value is to 1 (unity), the closer the two
populations are. For this analysis, I used the 20th percentile to highlight the
closest populations. Although, for the most part, the grouping
of the populations is obvious, a principal component analysis (PCA) of the
allele frequencies and a dendrogram of the genetic distances can better
visualize the variability/separation.
The Tai-Kadai speakers have the lowest Fst values. The
Tai-Kadai, Nusantao, Ryukyu, Sino-Tibetan, Mon-Khmer, Dayak, Jinuo, Hmong-Mien
and Temuan are very close to each other, as shown in the Fst table (Table 5),
the correlation table (Table 6) and the PCA plot (Figure 3). I believe these
populations once lived in close proximity to each other in Southwest China
(which I will try to expand on a bit in later blogs) before later expansions.
The expansion was likely triggered by rice agriculture; it is possible these
populations even shared more of a common language than they do today. Sagart
has a paper linking the Sino-Tibetan and Austronesian languages. The Ryukyu
people have a unique ethnic language, which linguists now consider to be in
danger of extinction. The Jomon people are considered to be the first group to
populate Japan. The Yayoi people, who brought rice agriculture, came next.
There is some evidence linking Japanese with the Austronesian languages.
Obviously, the Europeans are close to the Indians, based on
many genetic studies and linguistic connections (Indo-Aryan language family),
but that is really not within the scope of this blog (at the moment).
It looks like the Ita groups are farther apart than I
expected. The Fst, correlation and PCA all show that there are at least four
unique groups:
1. Orang Asli (Malaysian Ita: Kensui and Jehai)
2. Aytas (Philippine Ita: Ayta, Ati, Agta, & Iraya)
3. Mamanwa (Philippine Ita)
4. Papuan
It’s possible the large separation of the four groups is due
to a very long period of isolation from each other. Based on the Callao
Man findings, Ita populations may have been in ISEA at least as far back as
67K BP. Their arrival was likely facilitated by lower sea levels during
glacial maxima, which exposed land bridges. The crossings to most of Luson,
Bisayas and Mindanao were perhaps made by raft, and when the sea level rose
again, the separation/isolation began. This separation seems to be
older than the separation of the Tai-Kadai, Nusantao, Ryukyu,
Sino-Tibetan, Mon-Khmer, Dayak, Jinuo, Hmong-Mien and Temuan, given that the
Fst values for the individual Ita populations are very large. Additionally, I
also believe that East Asian admixture with the Ita populations is part of
what led experts to differentiate the Northern and Southern East Asians
(again, for another blog).
Table 5: Fst divergences between estimated populations
Table 6: PASNP K=19 Allele Frequency Correlation
Figure 2: Allele Frequency PCA Loadings (Scree Plot)
Figure 3: Allele Frequency PC1 vs PC2 Plot
Figure 4: PC1 vs PC2 Plot (Zoomed)
Figure 5: Dendrogram, K=19
Figure 6: Admixture Result, K=19
References
1. Yang X, Xu S, The HUGO Pan-Asian SNP Consortium (2011). Identification of
Close Relatives in the HUGO Pan-Asian SNP Database. PLoS ONE 6(12): e29502.
doi:10.1371/journal.pone.0029502
2. Alexander D. H., Novembre J., Lange K. (2009). Fast model-based estimation
of ancestry in unrelated individuals. Genome Research 19:1655–1664.
3. Zhou H., Alexander D. H., Lange K. (2009). A quasi-Newton method for
accelerating the convergence of iterative optimization algorithms. Statistics
and Computing.
4. Alexander D. H., Lange K. (2011). Enhancements to the ADMIXTURE algorithm
for individual ancestry estimation. BMC Bioinformatics 12:246.
doi:10.1186/1471-2105-12-246
5. Greenhill S. J., Blust R., Gray R. D. (2008). The Austronesian Basic
Vocabulary Database: From Bioinformatics to Lexomics. Evolutionary
Bioinformatics 4:271–283.
6. Mijares A. S. B. et al. (2010). New evidence for a 67,000-year-old human
presence at Callao Cave, Luzon, Philippines. Journal of Human Evolution
59:123–132. doi:10.1016/j.jhevol.2010.04.008
7. Sagart L. (2002). Sino-Tibeto-Austronesian: An Updated and Improved
Argument. Paper presented at the Ninth International Conference on
Austronesian Linguistics (ICAL9), Canberra.