(Last updated: 2000-10-22)
This page describes the collection of
Chinese texts used in this study and explains how the various
frequency and digram lists, as well as their statistics, were computed.
If you are interested in viewing those texts at their
original sites or the frequency lists I generated, you need to configure your web browser
properly. Some tutorials are provided here for your reference.
1. Data collection
The corpus of this study consists of 110 megabytes of
modern Chinese texts from two types of sources: 1) Various online Chinese e-magazines,
which are written and/or adapted for the internet and published only on the internet. 2)
Chinese literature and other writings for the general public. This second part of the data
consists of the ebooks collection of
the Xin Yu Si Electronic Library. All the
Chinese texts used in the study are GB encoded.
1.2 Sources of Chinese texts
The following are links to the various web sites from
which the Chinese text data used in this study were collected:
- XIN YU SI (XYS):
Monthly e-magazine. All the issues up to 12/1998
Notes: This site provides an excellent collection of both Chinese classical and modern
texts. Two subsets of their collection are used in this study: The Xin Yu Si magazine up
to 1998/12 and its entire ebooks collection (current as of 12/23/1998).
- HUA XIA WEN ZHAI (HXWZ):
Weekly e-magazine. All the issues up to 12/1998
Notes: HXWZ is the first Chinese online magazine ever published on the internet.
- FENG HUA YUAN (FHY):
Trimonthly e-magazine. All the issues up to 12/1998
Notes: The second major Chinese e-magazine ever published on the internet.
- HUA DE TONG XUN (HDTX):
(An alternative web site can be found at the Sunrise Library: http://www.sunrisesite.org.)
Bimonthly e-magazine. All the issues up to 12/1998
Notes: It looks like their official web and ftp sites are not accessible from outside
Germany. The Sunrise site contains the complete
collection of HDTX magazine.
- COMPUTERWORLD (CW):
Daily computer news. Most of the daily news summary between: 7/18/97 - 6/29/98
Notes: Due to technical difficulties, the latest issues after 6/29/98 are unavailable to
- CHINESE SCHOLARS ABROAD (CHISA):
Weekly e-magazine. All the issues up to 12/1998
Notes: It is (perhaps) the first e-magazine published online from inside Mainland China.
1.3 An opportunistic and biased corpus
The corpus collection used in this study is opportunistic
and biased in that:
1. The texts used in this study were chosen simply
because they are in the public domain, accessible to everyone on the internet. With the
exception of the ComputerWorld daily news, all the other texts are on topics of general
interest.
2. The data used in this study are edited written texts.
No effort has been made to collect informal postings on the Internet such as those found
at various web forums and use them in this study. As such, the corpus is biased towards
formal written Chinese.
3. Selection of those Chinese texts is opportunistic.
There are many other sources of Chinese texts on the Internet which could have been used
in this study. I chose the subset of data in this study simply because 1) I have read most
of the e-magazine texts and am familiar with the materials I am dealing with; and 2) Those
texts contain few encoding errors, which is a major concern for manipulating two-byte
4. The collection of the Xin Yu Si Electronic Library
contains literary writings of only a handful of authors. It can be considered a 'best
seller' list rather than a distributed and balanced collection of Chinese literary works
5. All the texts are GB encoded. Most, if not all, of them
were written by users of simplified Chinese. This is especially true for those e-magazines
whose intended audience is overseas Chinese from mainland China. As such, this corpus
can only be considered a subset of the Chinese language in use today.
2. Data processing
2.1 The data set
As introduced in the above section, the corpus of this
study consists of 110 megabytes of modern Chinese texts from two types of sources:
various online Chinese e-magazines and modern literature and other writings for the
general public. The former includes texts from six online publications between 1991 and
1998, whereas the latter consists of the ebooks
collection of the Xin Yu Si Electronic Library.
Raw data as downloaded from the internet were used
directly in the computing. No pre-editing was done on the data. As a result, any
corrupted codes in the original file(s) are reflected in the final results.
2.2.2 Segmenting individual characters
In running Chinese texts, no white space is used to
delimit individual characters or words. Hence, the first task for any Chinese text
computing is to identify or segment individual characters in a running text. A segmenting
script written in Perl is provided here for your
reference. For performance considerations, I used a tiny C
program in the actual computation, which runs much faster than the Perl script.
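As a rough illustration of the segmenting step, here is a minimal Python sketch. It is neither the Perl script nor the C program mentioned above. It assumes the common GB convention that a byte with the high bit set starts a two-byte code and any other byte is a single ASCII character; the function name `segment_gb` is made up for this example.

```python
from typing import List


def segment_gb(data: bytes) -> List[bytes]:
    """Split raw GB-encoded bytes into one-byte ASCII units and two-byte GB units."""
    units = []
    i = 0
    n = len(data)
    while i < n:
        # Assumption: a byte >= 0x80 is the lead byte of a two-byte GB code.
        if data[i] >= 0x80 and i + 1 < n:
            units.append(data[i:i + 2])
            i += 2
        else:
            # ASCII byte, or a dangling lead byte at the very end of the data.
            units.append(data[i:i + 1])
            i += 1
    return units


sample = "中文 text".encode("gb2312")
chars = [u.decode("gb2312", errors="replace") for u in segment_gb(sample)]
```

Because the raw files were not pre-edited, a single corrupted byte can shift the pairing of all the bytes that follow it in a run; this sketch makes no attempt to resynchronize.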
2.2.3 Making n-grams
The two digram lists presented here were generated using a
modified method based on Brew and Moens's online tutorial on Making
n-grams. Compared with their brute-force method, my approach is fine-tuned as
follows:
- All running Chinese texts were first segmented into
continuous GB strings. A continuous GB string is one which contains GB characters only (but
not symbols in GB encoding). Both GB-encoded symbols and ASCII codes are considered
delimiters of continuous GB strings;
- Digrams were calculated with respect to each continuous
GB string.
Again, a C program was used to generate the lists.
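The two steps above can be sketched in Python as follows. This is an illustration only, not the C program used in the study. It assumes that "GB characters" means the hanzi area of GB2312 (lead byte 0xB0-0xF7, trail byte 0xA1-0xFE) and treats everything else as a delimiter; the names `is_hanzi`, `gb_runs` and `count_digrams` are hypothetical.

```python
from collections import Counter
from typing import List


def is_hanzi(unit: bytes) -> bool:
    # Assumption: GB2312 hanzi use lead bytes 0xB0-0xF7 and trail bytes 0xA1-0xFE.
    return len(unit) == 2 and 0xB0 <= unit[0] <= 0xF7 and 0xA1 <= unit[1] <= 0xFE


def gb_runs(data: bytes) -> List[List[bytes]]:
    """Split raw bytes into continuous strings of hanzi; all else is a delimiter."""
    runs, current = [], []
    i = 0
    while i < len(data):
        step = 2 if data[i] >= 0x80 and i + 1 < len(data) else 1
        unit = data[i:i + step]
        if is_hanzi(unit):
            current.append(unit)
        elif current:
            # A GB symbol or an ASCII byte closes the current run.
            runs.append(current)
            current = []
        i += step
    if current:
        runs.append(current)
    return runs


def count_digrams(data: bytes) -> Counter:
    """Count adjacent character pairs inside each run, never across a delimiter."""
    counts = Counter()
    for run in gb_runs(data):
        for a, b in zip(run, run[1:]):
            counts[a + b] += 1
    return counts


digrams = count_digrams("中文，中文信息a处理".encode("gb2312"))
```

In the sample string, the full-width comma and the ASCII letter both act as delimiters, so no digram is formed from the characters on either side of them.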
3.1 Individual character counts and frequency lists
Results are given on the Statistics
page. The total number of characters is more than 45 million. Note that in
the lists of distinct characters, some of the entries are unrecognizable. There are two
possible causes of this problem: 1) There are some encoding errors in the raw GB texts
used in this study. Given the amount of data, no effort was made to
manually correct such errors (if there are any); 2) Some of the characters are outside the
GB2312 encoding scheme and may be displayed improperly by a Chinese font set (such as
MS Song from Microsoft) based on the GB2312 standard.
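One way to sort the unrecognizable entries is to check where each two-byte code falls relative to the GB2312 layout. The sketch below is not part of the original processing. It assumes the standard GB2312 layout (symbols in rows 1-9, hanzi in rows 16-87, with the last five cells of row 55 unassigned), and the function name and labels are invented for illustration. The "symbol area" label is coarse, since some cells inside those rows are also unassigned.

```python
def classify_gb_code(unit: bytes) -> str:
    """Roughly place a two-byte code relative to the GB2312 code table."""
    if len(unit) != 2:
        return "not two-byte"
    lead, trail = unit
    # GB2312 proper only uses bytes 0xA1-0xFE in both positions.
    if not (0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE):
        return "outside GB2312"
    if 0xB0 <= lead <= 0xF7:
        # Row 55 (lead byte 0xD7) leaves its last five cells unassigned.
        if lead == 0xD7 and trail >= 0xFA:
            return "unassigned"
        return "hanzi"
    if lead <= 0xA9:
        return "symbol area"
    # Rows 10-15 and 88-94 are not assigned in GB2312.
    return "unassigned"


labels = [classify_gb_code(u) for u in ("中".encode("gb2312"), b"\xf8\xa1", b"\x81\x40")]
```

Codes labelled "unassigned" or "outside GB2312" are candidates for either cause named above: a corrupted byte pair, or a character from an extension that a GB2312-based font cannot display.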
3.2 Character frequency lists for sub-corpora
Individual frequency lists are provided for each
sub-corpus. In general, there is no statistical justification for dividing the corpus up this way.
However, if you are interested in sub-language studies, a comparison can be made between
the collection of technical Chinese (the ComputerWorld collection) and the rest of the materials
in this data collection.
3.3 Digram lists
Digrams from two sub-corpora (Feng Hua Yuan and ComputerWorld)
were computed and are provided here. Mutual information is the only statistical significance
measure currently available. For information about statistical
measurement of collocation, please refer to the bigram
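For readers unfamiliar with the measure, here is a minimal sketch of mutual information for digrams. This page does not state the exact formula or probability estimates used, so the sketch assumes the common pointwise form I(x, y) = log2(P(xy) / (P(x) P(y))), with probabilities estimated from raw counts; the function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def mutual_information(runs: List[str]) -> Dict[str, float]:
    """Pointwise mutual information for each digram found inside the given runs."""
    char_counts = Counter()
    digram_counts = Counter()
    for run in runs:
        char_counts.update(run)
        # Pairs are formed within a run only, never across runs.
        digram_counts.update(zip(run, run[1:]))
    n_chars = sum(char_counts.values())
    n_digrams = sum(digram_counts.values())
    scores = {}
    for (x, y), count in digram_counts.items():
        p_xy = count / n_digrams
        p_x = char_counts[x] / n_chars
        p_y = char_counts[y] / n_chars
        scores[x + y] = math.log2(p_xy / (p_x * p_y))
    return scores


scores = mutual_information(["中文中文", "信息"])
```

A high score means the two characters occur together more often than their individual frequencies would predict. Scores for very rare digrams are unreliable, since a single co-occurrence of two rare characters already yields a large value.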
3.4 A note
Be careful if you want to use the frequency information
for other purposes, since the corpus data are not randomly sampled. If you want to
make use of the frequency lists, please compare a character's frequency in each sub-corpus. A high
frequency in one list does not necessarily mean that it is high in every list. You can use
this Search Engine to find out the frequency of an individual
character in each of the sub-corpora used in this study.
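The kind of comparison suggested here can be sketched in a few lines of Python. This is not the search engine mentioned above. It assumes the sub-corpora are already decoded into Python strings, and the sample texts and the names `relative_frequencies` and `lookup` are made up for illustration.

```python
from collections import Counter
from typing import Dict


def relative_frequencies(texts_by_corpus: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Relative frequency of every non-space character, per sub-corpus."""
    table = {}
    for name, text in texts_by_corpus.items():
        counts = Counter(ch for ch in text if not ch.isspace())
        total = sum(counts.values())
        table[name] = {ch: c / total for ch, c in counts.items()}
    return table


def lookup(table: Dict[str, Dict[str, float]], ch: str) -> Dict[str, float]:
    """Frequency of one character in each sub-corpus (0.0 where it never occurs)."""
    return {name: freqs.get(ch, 0.0) for name, freqs in table.items()}


# Toy stand-ins for two sub-corpora; real input would be the full texts.
table = relative_frequencies({"FHY": "的的人", "CW": "机机机的"})
per_corpus = lookup(table, "机")
```

Relative frequencies are used rather than raw counts because the sub-corpora differ in size.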