Chinese character frequency (simplified Chinese: 汉字字频; traditional Chinese: 漢字字頻; pinyin: hànzì zì pín) is the frequency with which characters are used in written Chinese. It is calculated on a corpus, i.e., a collection of texts representing one or more languages. The frequency of a character is the ratio of the number of its occurrences to the total number of characters in the corpus, expressed as a percentage:
$$F_i = \frac{n_i}{N} \times 100\%,$$
where $n_i$ is the number of times character $i$ appears in the corpus, and $N$ is the total number of character occurrences in the corpus.
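The frequency formula can be illustrated with a minimal Python sketch. The function name `char_frequencies`, the toy corpus, and the choice to count only code points in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) are assumptions made for this example; real studies define their own rules for what counts as a character.

```python
from collections import Counter


def char_frequencies(corpus: str) -> dict[str, float]:
    """Return the frequency (in percent) of each distinct character.

    Assumption for this sketch: only code points in the basic
    CJK Unified Ideographs block are counted, so punctuation,
    digits and Latin letters are ignored.
    """
    chars = [c for c in corpus if "\u4e00" <= c <= "\u9fff"]
    total = len(chars)  # N: total number of character occurrences
    if total == 0:
        return {}
    counts = Counter(chars)  # n_i for each distinct character i
    return {c: 100 * n / total for c, n in counts.items()}


# Hypothetical toy corpus: 7 counted characters, 的 and 书 occur twice each.
freqs = char_frequencies("我的书是他的书。")
for char, freq in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{char}: {freq:.2f}%")
```

Because every counted occurrence belongs to exactly one character, the frequencies of all distinct characters add up to 100%.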
Chinese character frequency is fundamental to the quantitative linguistics of Chinese, and serves as a reference for Chinese language teaching and information processing.
Origins
The first person to make a serious statistical study of the frequency of Chinese characters was Chen Heqin (陳鶴琴). In the 1920s, he and his assistants spent over two years manually counting and comparing the characters in a corpus of six categories of texts. The corpus contained a total of 554,478 characters in 4,261 different character forms. They then compiled a book entitled Applied Lexis of Vernacular Chinese (語體文應用字彙). The 10 most frequently-used characters in their corpus are, in descending order of frequency,
的 (of), 不 (no, not), 一 (one, a(n)), 了 (PERF), 是 (to be), 我 (I/me), 上 (on, up), 他 (he/him), 有 (to have), 人 (person).
A trans-regional diachronic survey
In 2001, the Chinese University of Hong Kong (CUHK) published a number of frequency lists on the Web, entitled "Hong Kong, Mainland China and Taiwan Chinese Frequency: a Trans-regional Diachronic Survey". The frequency data came from a large corpus made up of sub-corpora representing the Chinese language in the three regions of Hong Kong, mainland China and Taiwan, in the two time periods of the 1960s and the 1980s/1990s. Each sub-corpus consists of approximately 660,000 characters, making a total of 3,970,514 characters for the whole corpus. Each sub-corpus includes about 5,000 different characters, as shown by their frequency lists.
Several features of Chinese emerge from the data of these frequency lists:
- 的, 一 and 是 are the three most frequently-used characters across the regions and time periods of the corpora, and 的 ranks first in all the frequency lists.
- The 10 most frequently-used characters are very consistent across the three regions and two time periods. That is, a character that is frequently used in one region or period is very likely to be frequently used in another.
- The 100 most frequently-used characters in the 1980s/1990s cover (i.e., have an accumulated frequency of) 41.00% of the Hong Kong texts of that period, 41.34% of the Mainland texts, and 41.88% of the Taiwan texts. That is more than 4 out of every 10 characters in each of the three regions.
- The 1,000 most frequently-used characters in the 1980s/1990s cover 89.25% of the Hong Kong texts of that period, 90.26% of the Mainland texts, and 88.74% of the Taiwan texts.
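The coverage figures in the list above are accumulated frequencies of the top-ranked characters. A minimal sketch of that calculation follows; the function name `coverage` and the toy corpus are hypothetical, and the sketch assumes the input string already contains only the characters to be counted.

```python
from collections import Counter


def coverage(corpus: str, top_n: int) -> float:
    """Percentage of the corpus covered by its top_n most frequent characters.

    Assumption for this sketch: every character in `corpus` is counted,
    so punctuation and whitespace should be removed beforehand.
    """
    counts = Counter(corpus)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # most_common(top_n) yields the top_n (character, count) pairs.
    covered = sum(n for _, n in counts.most_common(top_n))
    return 100 * covered / total


# Hypothetical toy corpus of 10 characters: 的 x4, 一 x3, 是 x2, 不 x1.
toy_corpus = "的的的的一一一是是不"
for k in (1, 2, 3, 4):
    print(k, coverage(toy_corpus, k))
```

On a real corpus, calling the same function with `top_n=100` and `top_n=1000` would give figures comparable to those reported in the survey.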
The top 10 characters in the frequency lists for the three regions in the 1980s/1990s are:
- Hong Kong: 的, 一, 是, 不, 人, 有, 在, 了, 我, 中
- Taiwan: 的, 一, 是, 不, 人, 在, 有, 我, 了, 中
- Mainland: 的, 一, 是, 了, 不, 在, 有, 人, 我, 他
More information can be found in the English Users' Guide on the survey's home page.
Frequencies in different divisions
Most of the frequency studies described above measure the general usage of Chinese characters. Frequencies can also be computed for a particular domain, such as news reporting, literature and art, or information technology.
There are also frequency lists for linguistic divisions. Polyphonic characters may be counted separately by pronunciation, for example 的 (de), 的 (di1), 的 (di2) and 的 (di4). Polysemous characters may be counted separately by meaning, for example 里 (裡/裏, inside) and 里 (里, 0.5 km). Frequencies can also be computed separately by part of speech, for example 花 (n) and 花 (v), or by a combination of the above divisions.
Application of frequency statistics
Chinese character frequency is essential to quantitative research on Chinese characters, and has been applied to language teaching, dictionary compilation, the compilation of character lists, Chinese character information processing, etc.
Chinese character utility decline rate
The use of Chinese characters is concentrated mainly on frequently-used characters. Zhou Youguang summarized this as the Chinese character utility decline rate (汉字效用递减率; 漢字效用遞减率), based on frequency statistics from various sources. Its basic content is:
The 1,000 most frequently-used characters cover about 90% of a corpus, which means the missing rate is about 10%. Each additional 1,400 next most frequent characters reduce the missing rate to one tenth of its previous value. For example, the missing rate of the 1,000 + 1,400 = 2,400 most frequently-used characters is approximately 10% × 10% = 1% of the corpus, so the coverage rate is 99%. The missing rate of the 2,400 + 1,400 = 3,800 most frequently-used characters is about 1% × 10% = 0.1%, and the coverage rate is 99.9%. The rule is also supported by later experimental results, such as:
Most frequent characters | Cumulative occurrences | Coverage (%) |
---|---|---|
100 | 782,866 | 42.14 |
500 | 1,439,352 | 77.48 |
1,000 | 1,681,228 | 90.50 |
2,000 | 1,817,047 | 97.81 |
3,000 | 1,848,648 | 99.51 |
4,000 | 1,856,226 | 99.92 |
4,868 | 1,857,660 | 100 |
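The arithmetic of the rule can be sketched in Python and set beside the coverage figures in the table above. The function name `predicted_missing_rate` is hypothetical, and the geometric interpolation between the rule's anchor points (1,000, 2,400, 3,800 characters) is an assumption of this sketch, not part of the rule as stated.

```python
def predicted_missing_rate(n_chars: int) -> float:
    """Missing rate (%) predicted by the utility decline rule.

    The rule as stated gives 10% at 1,000 characters and a tenfold
    reduction for every further 1,400 characters. Values between those
    anchor points are interpolated geometrically (an assumption here).
    """
    if n_chars < 1000:
        raise ValueError("the rule as stated starts at 1,000 characters")
    steps = (n_chars - 1000) / 1400  # number of 1,400-character increments
    return 10.0 * 0.1 ** steps


# Coverage (%) copied from the table above.
observed_coverage = {1000: 90.50, 2000: 97.81, 3000: 99.51, 4000: 99.92}

for n, cov in observed_coverage.items():
    observed_missing = 100 - cov
    predicted = predicted_missing_rate(n)
    print(f"{n:>5}: observed {observed_missing:.2f}%, rule {predicted:.2f}%")
```

The comparison is only indicative: the rule is a rough summary, and the table comes from a single corpus.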
Decreasing rate of frequently-used character strokes
The basic content of the decreasing rate of frequently-used character strokes (simplified Chinese: 常用字笔划趋减率; traditional Chinese: 常用字筆劃趨减率) is:
The application rate of a character is inversely related to its number of strokes; that is, characters with high application rates have fewer strokes on average. This is supported by the data in the article Stroke numbers. According to the second and third tables of that article, the average number of strokes of the 3,500 frequently-used characters is 9.74, and the average number of strokes of the 7,000 commonly-used characters (a superset of the 3,500 characters) is 10.75. That is, generally speaking, frequently-used characters have fewer strokes than less frequently-used characters.
The reason is convenience of writing. If a character with many strokes is used frequently, people will try to simplify it. If there are multiple variant characters with the same function, then, other things being equal, the one with fewer strokes is more likely to be used.
Distribution rate and application rate
When determining the importance of a character, it is often necessary to consider its distribution rate in addition to its frequency of use. The distribution rate is calculated as
$$D_i = \frac{t_i}{T},$$
where $D_i$ is the distribution rate of character or word $i$, $t_i$ is the number of texts in which the character or word appears, and $T$ is the total number of texts in the corpus.
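A minimal sketch of the distribution rate follows, assuming the corpus is available as a list of separate texts. The function name `distribution_rate` and the four toy texts are hypothetical.

```python
def distribution_rate(char: str, texts: list[str]) -> float:
    """Fraction of texts in which `char` appears at least once."""
    if not texts:
        raise ValueError("the corpus must contain at least one text")
    # t_i: number of texts containing the character.
    t_i = sum(1 for text in texts if char in text)
    return t_i / len(texts)  # divide by T, the total number of texts


# Hypothetical toy corpus of four short texts.
texts = ["我的书", "他的书", "我是人", "人有书"]
print(distribution_rate("的", texts))  # appears in 2 of 4 texts
print(distribution_rate("书", texts))  # appears in 3 of 4 texts
```

A character that occurs many times but only in a few texts has a high frequency and a low distribution rate, which is why the two measures are considered together.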
The application rate combines distribution rate and frequency. A newer calculation formula is
$$U_i = \frac{F_i D_i}{\sum_{j=1}^{n} F_j D_j},$$
where $U_i$ is the application rate of character $i$, $F_i$ is the frequency of character $i$, $D_i$ is the distribution rate of character $i$, and $n$ is the total number of distinct characters. Because of the normalizing denominator, the application rates of all characters sum to 1.
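The application rate formula can be sketched as follows, again assuming the corpus is a list of separate texts that contain only the characters to be counted. The function name `application_rates` and the toy texts are hypothetical.

```python
from collections import Counter


def application_rates(texts: list[str]) -> dict[str, float]:
    """Application rate of each character: frequency times distribution
    rate, normalized by the sum of that product over all characters."""
    counts = Counter("".join(texts))
    total = sum(counts.values())
    if total == 0:
        return {}
    weights = {}
    for char, n in counts.items():
        # Frequency as a fraction; using percent instead would not change
        # the result, because the scale cancels in the normalization.
        freq = n / total
        dist = sum(1 for text in texts if char in text) / len(texts)
        weights[char] = freq * dist
    norm = sum(weights.values())
    return {char: w / norm for char, w in weights.items()}


# Hypothetical toy corpus of four short texts.
sample_texts = ["我的书", "他的书", "我是人", "人有书"]
rates = application_rates(sample_texts)
for char, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{char}: {rate:.3f}")
print("sum:", round(sum(rates.values()), 6))
```

In this toy corpus 书 gets the highest application rate, since it is both the most frequent character and the one found in the most texts.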
Application in media
Large-scale surveys conducted over the years by the Ministry of Education and the State Language Commission of the PRC have shown that the use of Chinese characters and words follows a strong distribution pattern. The number of different characters used in modern Chinese is stable at about 12,000, and the number of different words has stabilized at around 2.5 million.
The numbers of most frequently-used characters needed for coverage rates of 80%, 90%, and 99% are about 590, 940, and 2,400 respectively. The numbers of words needed for coverage rates of 80%, 90%, 95%, and 99% are about 4,900, 14,000, 32,000, and 241,000 respectively. Words whose frequency of use changes markedly from the previous year reflect the hot topics of social life and media attention of that year.
References
Citations
- Su 2014, p. 34.
- Yang 2008, p. 192.
- Su 2014, p. 35.
- Chen 1928.
- CUHK 2001.
- Su 2014, p. 42.
- Su 2014, pp. 42-45.
- Zhou 2006.
- Xing 2007, p. 15.
- Wang 1980.
- National Language Commission 2007, p. 2.
- "2016年中国语言文字事业发展状况 - 中华人民共和国教育部政府门户网站". www.moe.gov.cn. Retrieved 2024-06-18.
Works cited
- Chen, Heqin 陳鶴琴 (1928). 語體文應用字彙 [Applied Lexis of Vernacular Chinese] (in Chinese). Beijing: Shangwu (The Commercial Press).
- CUHK (2001). "Chinese Character Frequency Statistics for Hong Kong, Mainland China and Taiwan – A Trans-Regional, Diachronic Survey: 香港、大陸、台灣 – 跨地區、跨年代漢語常用字頻統計".
- National Language Commission, Ministry of Education, China (2013). 2012年中國語言生活狀况報告 [Report on Language Life in China 2012] (in Chinese). Beijing: Shangwu (The Commercial Press).
- National Language Commission, National Language Resources Testing and Research Center (國家語言資源檢測與研究中心), China (2007). 中國語言生活狀况報告(2006), 下篇 [Report on the Situation of Language Life in China (2006), Part 2] (in Chinese). Beijing: Shangwu (The Commercial Press).
- Su, Peicheng 苏培成 (2014). 现代汉字学纲要 [Essentials of Modern Chinese Characters] (in Chinese) (3rd ed.). Beijing: 商务印书馆 (The Commercial Press, Shangwu). ISBN 978-7-100-10440-1.
- Wang, Fengyang 王凤阳 (1980). 汉字频率与汉字简化, in 语文现代化 丛刊第三辑 [Chinese character frequency and Chinese character simplification, in Chinese Language Modernization Series 3] (in Chinese) (3rd ed.). Beijing: 知识出版社 (Knowledge Press). ISBN 978-7-100-10440-1.
- Xing, Hongbing 邢红兵 (2007). 现代汉字特征分析与计算研究 [Characteristic Analysis and Computational Research on Modern Chinese Characters] (in Chinese). Beijing: 商务印书馆 (The Commercial Press, Shangwu). ISBN 978-7-100-05310-5.
- Yang, Runlu 杨润陆 (2008). 现代汉字学 [Modern Chinese Characters] (in Chinese). Beijing: Beijing Normal University Press. ISBN 978-7-303-09437-0.
- Zhou, Youguang 周有光 (2006). 语言文字学的新探索 [New exploration of linguistics] (in Chinese). Beijing: 語文出版社 (Chinese Language Press). p. 139.