Jun Da's WebCentral

Chinese text computing

(This is the 1998 version. An updated 2004 version is now available)

 

Jun's Chinese Text Computing - Technical Notes

Technical notes

(Last updated: 2000-10-22)

This page provides information on the collection of Chinese texts used in this study. It also explains how the various frequency and digram lists, as well as their statistics, were computed.

If you are interested in viewing those texts at their original sites or the frequency lists I generated, you need to configure your web browser properly. Some tutorials are provided here for your reference.

1. Data collection

1.1 Overview

The corpus of this study consists of 110 megabytes of modern Chinese texts from two types of sources: 1) Various online Chinese e-magazines, which are written and/or adapted for the internet and published only on the internet. 2) Chinese literature and other writings for the general public, taken from the ebooks collection of the Xin Yu Si Electronic Library. All the Chinese texts used in the study are GB encoded.

1.2 Sources of Chinese texts

The following are links to the various web sites from which the Chinese text data used in this study were collected:

  1. XIN YU SI (XYS):
    http://www.xys.org
    ftp://www.xys.org/pub/xys
    Monthly e-magazine. All the issues up to 12/1998
    Notes: This site provides an excellent collection of both Chinese classical and modern texts. Two subsets of their collection are used in this study: The Xin Yu Si magazine up to 1998/12 and its entire ebooks collection (current as of 12/23/1998).
  2. HUA XIA WEN ZHAI (HXWZ):
    http://www.cnd.org
    ftp://cnd.org/pub/hxwz
    Weekly e-magazine. All the issues up to 12/1998
    Notes: HXWZ is the first Chinese online magazine ever published on the internet.
  3. FENG HUA YUAN (FHY):
    http://www.fhy.net
    ftp://uwalpha.uwinnipeg.ca/pub/fcssc/fhy/
    ftp://ftp.fhy.net/pub/fhy
    Trimonthly e-magazine. All the issues up to 12/1998
    Notes: The second major Chinese e-magazine ever published on the internet.
  4. HUA DE TONG XUN (HDTX):
    http://cdn.unibw-hamburg.de
    (An alternative web site can be found at the Sunrise Library: http://www.sunrisesite.org.)
    ftp://tptp08.gkss.de
    Bimonthly e-magazine. All the issues up to 12/1998
    Notes: It looks like their official web and ftp sites are not accessible from outside Germany. The Sunrise site contains the complete collection of HDTX magazine.
  5. COMPUTERWORLD (CW):
    http://www.computerworld.com.cn
    Daily computer news. Most of the daily news summary between: 7/18/97 - 6/29/98
    Notes: Due to technical difficulties, the latest issues after 6/29/98 are unavailable to me.
  6. CHINESE SCHOLARS ABROAD (CHISA)
    http://www.chisa.edu.cn
    ftp://chisa.edu.cn/pub/chisa-cm
    Weekly e-magazine. All the issues up to 12/1998
    Notes: It is (perhaps) the first e-magazine published online from inside Mainland China.

1.3 An opportunistic and biased corpus

The corpus collection used in this study is opportunistic and biased in that:

1. The set of texts used in this study was chosen simply because the texts are in the public domain and accessible to everyone on the internet. With the exception of the ComputerWorld daily news, all the texts are on topics of general interest;

2. The data used in this study are edited written texts. No effort has been made to collect informal postings on the Internet such as those found at various web forums and use them in this study. As such, the corpus is biased towards formal written Chinese.

3. Selection of those Chinese texts is opportunistic. There are many other sources of Chinese texts on the Internet which could have been used in this study. I chose the subset of data in this study simply because 1) I have read most of the e-magazine texts and am familiar with the materials I am dealing with; and 2) Those texts contain few encoding errors, which is a major concern for manipulating two-byte encoded texts.

4. The collection of the Xin Yu Si Electronic Library contains literary writings of only a handful of authors. It can be considered a 'best seller' list rather than a distributed and balanced collection of Chinese literary works in general.

5. All the texts are GB encoded. Most, if not all of them, were written by users of simplified Chinese. This is especially true for those e-magazines whose intended audience is overseas Chinese from mainland China. As such, this corpus can only be considered a subset of the Chinese language in use today.

2. Data processing

2.1 The data set

As introduced in the above section, the corpus of this study consists of 110 megabytes of modern Chinese texts from two types of sources: various online Chinese e-magazines, and modern literature and other writings for the general public. The former includes texts from six online publications between 1991 and 1998, whereas the latter consists of the ebooks collection of the Xin Yu Si Electronic Library.

2.2 Procedure

2.2.1 Pre-processing

Raw data as downloaded from the internet were used directly in the computing. No pre-editing was done on the data. As a result, if there are corrupted codes in the original file(s), they will be reflected in the final results.

2.2.2 Segmenting individual characters

In running Chinese texts, no white space is used to delimit individual characters or words. Hence, the first task for any Chinese text computing is to identify or segment individual characters in a running text. A segmenting script written in Perl is provided here for your reference. For performance considerations, I used a tiny C program in the actual computation, which runs much faster than the Perl script.
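As an illustration only (this is neither the Perl script nor the C program mentioned above), the idea of segmenting a GB-encoded byte stream into individual characters can be sketched in Python. The sketch assumes the EUC-CN form of GB2312, in which each Chinese character or GB symbol occupies two bytes, both in the range 0xA1-0xFE, and everything else is a one-byte code. The function name `segment_gb` is hypothetical.

```python
def segment_gb(data: bytes) -> list:
    """Split a GB2312 (EUC-CN) byte string into one token per character.

    Assumption: a two-byte GB code has both bytes in 0xA1-0xFE; any other
    byte (ASCII, or a stray high byte from a corrupted file) is emitted
    as a one-byte token of its own.
    """
    tokens = []
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if 0xA1 <= lead <= 0xFE and i + 1 < n and 0xA1 <= data[i + 1] <= 0xFE:
            tokens.append(data[i:i + 2])   # two-byte GB character or symbol
            i += 2
        else:
            tokens.append(data[i:i + 1])   # single-byte code
            i += 1
    return tokens


if __name__ == "__main__":
    sample = "中文 text。".encode("gb2312")
    print([t.decode("gb2312") for t in segment_gb(sample)])
```

Because no pre-editing is done on the raw data, a corrupted byte is passed through as its own token rather than rejected, so it would show up in the counts just as described in 2.2.1.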

2.2.3 Making n-grams

The two digram lists presented here were generated using a modified method based on Brew and Moens's online tutorial on Making n-grams. Compared with their brute-force method, my approach is fine-tuned as follows:

  1. All running Chinese texts were first segmented into continuous GB strings. A continuous GB string is one which contains GB characters only (but not symbols in GB encoding). Both GB-encoded symbols and ASCII codes are considered delimiters of a continuous GB string;
  2. Digrams were calculated with respect to each continuous string.

Again, a C program was used to generate the lists.

3. Results

3.1 Individual character counts and frequency lists

Results are given on the Statistics page. The total number of characters turns out to be more than 45 million. Note that in the lists of distinct characters, some of the entries are unrecognizable. There are two possible causes of this problem: 1) There are some encoding errors in the raw GB texts used in this study. Due to the amount of data involved, no effort was made to correct such errors manually (if there are any); 2) Some of the characters are beyond the GB2312 encoding scheme and may be displayed improperly by a Chinese font set based on the GB2312 standard (such as MS Song from Microsoft).
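For readers who want to build a comparable list from their own texts, a ranked frequency list can be sketched in Python as below. The column layout (rank, character, raw count, cumulative percentage) and the function name `frequency_list` are assumptions made for this illustration, not a description of how the lists on the Statistics page were produced.

```python
from collections import Counter


def frequency_list(chars):
    """Return rows of (rank, character, count, cumulative percentage).

    `chars` is any iterable of already-segmented characters.
    Ties in count are broken by character order so the output is stable.
    """
    counts = Counter(chars)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    for rank, (ch, n) in enumerate(ordered, start=1):
        cumulative += n
        rows.append((rank, ch, n, 100.0 * cumulative / total))
    return rows


if __name__ == "__main__":
    for row in frequency_list("的的的是是不"):
        print(row)
```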

3.2 Character frequency lists for sub-corpora

Individual frequency lists are provided for each sub-corpus. In general, there is no statistical justification for dividing them up as such. However, if you are interested in sub-language studies, a comparison can be made between the collection of technical Chinese (the ComputerWorld collection) and the rest of the materials in this data collection.

3.3 Digram lists

Digrams from two sub-corpora (Feng Hua Yuan and ComputerWorld) were computed and are provided here. Mutual information is the only measure of statistical significance currently available. For information about statistical measurement of collocation, please refer to the bigram page.
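The exact formula used for the lists is not given on this page. A common definition of the mutual information of a character pair xy is log2( P(xy) / (P(x) P(y)) ), where the probabilities are estimated from counts. The Python sketch below assumes that definition, with P(xy) estimated as the digram count divided by the total number of digrams and P(x) as the character count divided by the total number of characters. The function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(char_counts, digram_counts):
    """Return {digram: log2(P(xy) / (P(x) * P(y)))} for each counted digram.

    Assumption: probabilities are simple relative frequencies; digram keys
    are two-character strings and character keys are single characters.
    """
    n_chars = sum(char_counts.values())
    n_digrams = sum(digram_counts.values())
    scores = {}
    for digram, n_xy in digram_counts.items():
        x, y = digram
        p_xy = n_xy / n_digrams
        p_x = char_counts[x] / n_chars
        p_y = char_counts[y] / n_chars
        scores[digram] = math.log2(p_xy / (p_x * p_y))
    return scores


if __name__ == "__main__":
    # Toy counts invented for the example; they are not results from this study.
    chars = Counter({"中": 2, "文": 2, "我": 4})
    digrams = Counter({"中文": 2, "我我": 2})
    print(mutual_information(chars, digrams))
```

A high score means the two characters occur together far more often than their individual frequencies would predict. With very low counts the estimate becomes unreliable, so a minimum-count threshold is often applied before ranking.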

3.4 A note

Be careful if you want to use the frequency information for other purposes, since the corpus data are not randomly sampled. If you want to make use of the frequency lists, please compare a character's frequency in each sub-corpus. A high frequency in one list does not necessarily mean that the character is frequent in every list. You can use this Search Engine to find out the frequency of an individual character in each of the sub-corpora used in this study.
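The kind of comparison suggested here can be sketched in Python. Since the sub-corpora differ in size, the sketch normalizes each count to occurrences per million characters before comparing. The function name `per_corpus_frequency` and the counts in the usage example are invented for illustration and are not figures from this study.

```python
from collections import Counter


def per_corpus_frequency(char, corpora):
    """Return {sub-corpus name: occurrences of `char` per million characters}.

    `corpora` maps a sub-corpus name to a Counter of its character counts.
    """
    result = {}
    for name, counts in corpora.items():
        total = sum(counts.values())
        result[name] = 1_000_000 * counts.get(char, 0) / total if total else 0.0
    return result


if __name__ == "__main__":
    # Toy counts, made up for the example only.
    corpora = {
        "FHY": Counter({"机": 1, "的": 999}),
        "CW": Counter({"机": 50, "的": 950}),
    }
    print(per_corpus_frequency("机", corpora))
```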


Copyright. 1998-2000. Jun Da. jda@mtsu.edu