Corpus Pinyin Distribution Analysis — Phonetic (Phoneme)
A phoneme is the basic sound unit of a syllable in Hanyu Pinyin. Compared with last week's work, analyzing the distribution of phonemes across the entire corpus is more challenging, because a single syllable may consist of several phonemes. For example, “ca” consists of the phonemes “tsh” and “a”. This analysis shows whether the phoneme distribution in the corpus is similar to published statistics on how frequently each phoneme appears in everyday modern Chinese. The following table shows the result.
Phoneme | Initial/Final | Percentage (Corpus) | Frequency (Resource) | Percentage (Resource) | Difference (absolute, as a proportion)
N | ang,ong,iong,ing,eng,iang,uang,yang,yong | 8.16% | 621 | 6.38% | 0.02 |
a | ang,ai,uan,ao,an,uai,ia,a,iao,ian,van,iang,uang,ua,yang,yuan,yao,yan,ya,yvan | 17.89% | 1279 | 13.13% | 0.05 |
f | f, | 1.36% | 119 | 1.22% | 0.00 |
i | ei,ai,iu,uai,iong,vn,in,ia,ing,ie,iao,ian,iang,i,ui,ya, yan, yang, yao, ye, yi, yin, ying, yong, you | 24.45% | 1422 | 14.60% | 0.10 |
k | g, | 2.45% | 141 | 1.45% | 0.01 |
kh | k, | 1.09% | 93 | 0.95% | 0.00 |
l | l, | 3.03% | 223 | 2.29% | 0.01 |
m | m, | 2.10% | 143 | 1.47% | 0.01 |
n | n,uan,an,vn,in,ian,van,un,yvn,yan,yin | 8.41% | 800 | 8.21% | 0.00 |
p | b, | 2.53% | 159 | 1.63% | 0.01 |
ph | p, | 0.76% | 118 | 1.21% | 0.00 |
r | r,er | 1.41% | 58 | 0.60% | 0.01 |
s | s, | 0.71% | 305 | 3.13% | 0.02 |
t | d, | 4.22% | 165 | 1.69% | 0.03 |
th | t | 2.17% | 144 | 1.48% | 0.01 |
ts | j,z | 5.17% | 351 | 3.60% | 0.02 |
tsh | c,q | 2.47% | 223 | 2.29% | 0.00 |
u | uan,iu,ong,ao,uai,iong,iao,uo,un,u,ui,ou,uang,ua,w,yong,yao | 17.02% | 1339 | 13.75% | 0.03 |
x | h,s | 3.13% | 168 | 1.73% | 0.01 |
y | van,v,yv, yvan, yve | 2.46% | 187 | 1.92% | 0.01 |
ʂ | sh, | 3.58% | 189 | 1.94% | 0.02 |
ə | ei,ve,iu,en,e,ing,ie,er,eng,o,uo,un,ui,ou,yve,ye,ying | 18.29% | 1130 | 11.60% | 0.07 |
ʈʂ | zh, | 2.68% | 218 | 2.24% | 0.00 |
ʈʂh | ch, | 1.54% | 144 | 1.48% | 0.00 |
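The Initial/Final column above lists, for each phoneme, the pinyin units that contain it. Inverting that mapping gives a way to split a syllable into phonemes and count them. The following is a minimal sketch only: the dictionaries cover a small subset of the table, the function names are hypothetical, and the percentage is assumed to be each phoneme's share of all phoneme tokens, since the report does not state which denominator was used for the corpus column.

```python
from collections import Counter

# Illustrative subset of the table's mapping; a full analysis would need every row.
INITIAL_TO_PHONEME = {"b": "p", "m": "m", "c": "tsh", "z": "ts"}
FINAL_TO_PHONEMES = {
    "a": ["a"],
    "ai": ["a", "i"],
    "an": ["a", "n"],
    "ang": ["a", "N"],
    "ao": ["a", "u"],
    "i": ["i"],
}


def decompose(syllable):
    """Split a toneless pinyin syllable into phonemes (hypothetical helper)."""
    # Try two-letter initials first (e.g. "zh"), then one-letter initials.
    for length in (2, 1):
        initial, final = syllable[:length], syllable[length:]
        if initial in INITIAL_TO_PHONEME and final in FINAL_TO_PHONEMES:
            return [INITIAL_TO_PHONEME[initial]] + FINAL_TO_PHONEMES[final]
    # Zero-initial syllable: the whole syllable is a final.
    if syllable in FINAL_TO_PHONEMES:
        return list(FINAL_TO_PHONEMES[syllable])
    raise ValueError(f"syllable not covered by the subset mapping: {syllable!r}")


def phoneme_percentages(syllables):
    """Return each phoneme's share (in %) of all phoneme tokens (assumed denominator)."""
    counts = Counter(p for s in syllables for p in decompose(s))
    total = sum(counts.values())
    return {p: 100 * c / total for p, c in counts.items()}
```

With this subset, decompose("ca") returns ["tsh", "a"], which matches the example given in the introduction.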
The above table shows that the distribution of phonemes in our corpus is similar to published statistics on phoneme frequency in modern Chinese. The difference is 0.03 or less for every phoneme except three vowels, where it ranges from 0.05 to 0.10. The corpus should therefore be useful for helping beginners learn Mandarin Chinese.
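The last two columns of the table can be re-derived from the other values. The sketch below assumes that Percentage (Resource) is each frequency divided by the sum of the Frequency (Resource) column, and that Difference is the absolute gap between the two percentage columns, expressed as a proportion and rounded to two decimals. Both assumptions are consistent with the table rows but are not stated in the report, and the function name is hypothetical.

```python
# Frequency (Resource) column, copied from the table in row order.
RESOURCE_FREQUENCIES = [
    621, 1279, 119, 1422, 141, 93, 223, 143, 800, 159, 118, 58,
    305, 165, 144, 351, 223, 1339, 168, 187, 189, 1130, 218, 144,
]

# A few rows from the table: phoneme -> (corpus percentage, resource frequency).
SAMPLE_ROWS = {
    "a": (17.89, 1279),
    "i": (24.45, 1422),
    "s": (0.71, 305),
    "f": (1.36, 119),
}


def compare(corpus_pct, resource_freq, resource_total):
    """Return (resource percentage, absolute difference as a proportion)."""
    resource_pct = round(100 * resource_freq / resource_total, 2)
    # Assumption: the gap is taken as an absolute value and divided by 100.
    difference = round(abs(corpus_pct - resource_pct) / 100, 2)
    return resource_pct, difference


if __name__ == "__main__":
    total = sum(RESOURCE_FREQUENCIES)
    for phoneme, (corpus_pct, freq) in SAMPLE_ROWS.items():
        resource_pct, difference = compare(corpus_pct, freq, total)
        print(f"{phoneme}: resource {resource_pct:.2f}%, difference {difference:.2f}")
```

For the four sample rows this reproduces the values printed in the table, for example 13.13% and 0.05 for the phoneme a.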
Resources:
http://lingua.mtsu.edu/chinese-computing/phonology/phoneme3500.php
http://lingua.mtsu.edu/chinese-computing/phonology2004/py2phoneme.php