Jun Da's WebCentral

Chinese text computing

(This is the 1998 version. An updated 2004 version is now available)

 

Jun's Chinese Text Computing Project - Bigram Mutual Information Page

Bigrams and statistical measures
Bigram frequency lists (两字符串频率列表)

(Last updated: 2000-02-18)

(The following lists are too large to display properly in a web browser, so they are provided here as compressed text files for downloading. You can generate smaller and more meaningful lists of bigrams using this SEARCH FORM. Please refer to the search page for more information.)

This page provides bigram lists from two subcorpora together with their mutual information statistics. Section 3 gives brief information about mutual information and how it is computed.

1. The purpose

Chinese is a monosyllabic language in that each character (roughly speaking, one character = one minimal morpheme) corresponds to one syllable. Most words in Chinese are bisyllabic, and running written Chinese text has no word delimiter. Hence one of the tough tasks in Chinese text computing is to identify bisyllabic words. The collocation study of n-grams is a meaningful step towards the automatic segmentation of words in running Chinese text.

2. The list

The following table provides four bigram lists. The four files are .Z-compressed, GB-encoded text files. After downloading a file, you need to uncompress it first. On the Windows platform, you can use, for example, WinZip. On Unix platforms, you can simply issue the command 'uncompress filename.Z' at the prompt.
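Once a list has been uncompressed, its GB-encoded bytes still have to be decoded before the characters can be processed. The following is a minimal Python sketch and is not part of the original project. The function name read_gb_text is hypothetical, and the file name in the usage comment is only what 'uncompress fhy.Z' would be expected to leave behind. The internal layout of the list files is not described on this page, so the sketch simply returns the decoded text. It assumes that the gb18030 codec, a superset of GB2312, is suitable for these files.

```python
from pathlib import Path


def read_gb_text(path, encoding="gb18030"):
    """Return the decoded text of an already-uncompressed GB-encoded file.

    Assumption: gb18030 is used because it is a superset of GB2312;
    bytes that cannot be decoded are replaced rather than raising an error.
    """
    return Path(path).read_text(encoding=encoding, errors="replace")


# Hypothetical usage, after running 'uncompress fhy.Z':
# text = read_gb_text("fhy")
```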

(Check out this simple tutorial for more information about displaying Chinese on both the Windows and Mac platforms. Please read the Technical Notes page for detailed information about data collection and computing.)

Corpus                Total characters   Total bigrams   Distinct bigrams   Raw-frequency list   List with MI scores
Feng Hua Yuan (FHY)   4,718,131          4,159,927       506,732            fhy.Z (1906K)        fhy-mi.Z (1932K; 92,623 bigrams)
ComputerWorld (CW)    1,857,538          1,705,062       145,391            cw.Z (566K)          cw-mi.Z (682K; 34,134 bigrams)

MI scores are computed only for bigrams whose raw frequency is 6 or higher.

In the above table:

In the bigram lists with mutual information statistics.

3. Computing mutual information statistics for collocation

How are mutual information scores computed?

Please refer to http://www.umiacs.umd.edu/users/resnik/nlstat_tutorial_summer1998/Lab_ngrams.html, where both the computing procedure and the formula can be found.
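For readers who want a rough picture before following the link, here is a minimal Python sketch. It assumes the standard pointwise definition MI(x, y) = log2(P(xy) / (P(x) P(y))); the page referenced above remains the authority for the exact procedure and formula. The function name bigram_mi is hypothetical. The sketch treats the text as one continuous string, so bigrams may span punctuation or line breaks, which may differ from how the lists on this page were built. The default minimum frequency of 6 follows the note in the table in Section 2.

```python
import math
from collections import Counter


def bigram_mi(text, min_freq=6):
    """Map each character bigram with frequency >= min_freq to its MI score.

    Assumption: MI(x, y) = log2(P(xy) / (P(x) * P(y))), with probabilities
    estimated as relative frequencies over the whole text.
    """
    chars = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_chars = sum(chars.values())
    n_bigrams = sum(bigrams.values())

    scores = {}
    for bigram, freq in bigrams.items():
        if freq < min_freq:
            continue  # too rare for a stable estimate
        p_xy = freq / n_bigrams
        p_x = chars[bigram[0]] / n_chars
        p_y = chars[bigram[1]] / n_chars
        scores[bigram] = math.log2(p_xy / (p_x * p_y))
    return scores
```

Raising min_freq trades coverage for more stable scores, since MI is known to overrate bigrams built from rare characters.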

How should mutual information scores be interpreted?

The following guidelines can be used:

If the score is high or medium, the collocation is strong. If MI is below 1, it is less likely that the two tokens are related. MI scores between 1 and 3 are in the gray area. My intuitive judgement is that the bigrams with an MI score larger than 2.5 appear to be bisyllabic words in Chinese, though this intuition needs to be verified.

Other statistical measures of collocation

Other statistical measures such as t-score, likelihood ratio, chi-square and Yule's Y are often used to measure collocation strength. For an introduction and comparison of those measures, please refer to, among others:

In my informal experiments with the likelihood ratio and chi-square measures, I found that these two statistics do not provide a reliable measure of collocation as far as the two bigram lists are concerned. The problem is that both measures are much less discriminative than MI. For preliminary results, please take a look at this comparison page.
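As a point of reference for the two measures mentioned above, the following Python sketch computes them from a 2x2 contingency table of bigram counts. It is not the code used in the experiments. It assumes the textbook Pearson chi-square statistic and the log-likelihood ratio in its G-squared form, and the function names are hypothetical. Here c_xy is the frequency of the bigram, c_x the number of bigrams starting with its first character, c_y the number ending with its second character, and n the total number of bigrams.

```python
import math


def contingency(c_xy, c_x, c_y, n):
    """Return the 2x2 table (o11, o12, o21, o22) for one bigram."""
    return (c_xy, c_x - c_xy, c_y - c_xy, n - c_x - c_y + c_xy)


def chi_square(o11, o12, o21, o22):
    """Pearson chi-square for a 2x2 table (0.0 if a margin is empty)."""
    n = o11 + o12 + o21 + o22
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denom == 0:
        return 0.0
    return n * (o11 * o22 - o12 * o21) ** 2 / denom


def log_likelihood_ratio(o11, o12, o21, o22):
    """G-squared: 2 * sum(O * ln(O / E)) over the non-empty cells."""
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:
                # Expected count is rows[i] * cols[j] / n.
                total += o * math.log(o * n / (rows[i] * cols[j]))
    return 2 * total
```

Both statistics grow with sample size as well as with association strength, which is one possible reason their values spread over a much wider range than MI scores.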

Viewing instructions: The comparison page is fairly large (818K) and contains GB-encoded Chinese text, so it takes time to download. If you are using a non-localized operating system, your web browser will also take time to parse the .html source. PATIENCE is recommended.

The comparison page lists a subset of bigrams from the Feng Hua Yuan corpus. Each bigram contains the character "的 (de, POSSESSIVE of)". In Chinese, there are only a few bisyllabic words which can be formed with the character "的 (of)". As you will find out from the page, most of the bigrams in the list are nonsense two-character sequences. If we use 2.5 as the cut-off score for mutual information, we can easily identify those bigrams as nonsense sequences. However, the likelihood ratio statistics have a wide range. It is difficult to use the ratio as a reliable measure to separate meaningful sequences (i.e., bisyllabic words) from nonsense sequences.
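The cut-off test described above can be sketched in a few lines of Python. The function name split_by_cutoff is hypothetical, and the scores in the example are placeholders invented for illustration; they are not values from the FHY list or the comparison page.

```python
def split_by_cutoff(mi_scores, char, cutoff=2.5):
    """Split the bigrams containing `char` into (likely words, likely nonsense).

    A bigram counts as a likely word when its MI score is larger than
    `cutoff`; 2.5 is the informal cut-off suggested on this page.
    """
    words, nonsense = [], []
    for bigram, score in mi_scores.items():
        if char not in bigram:
            continue  # only bigrams containing the chosen character
        if score > cutoff:
            words.append(bigram)
        else:
            nonsense.append(bigram)
    return words, nonsense


# Placeholder scores, invented for illustration only.
example = {"目的": 4.1, "的确": 3.2, "的人": 0.4, "了的": -0.8, "中国": 5.0}
words, nonsense = split_by_cutoff(example, "的")
```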

A more formal study is underway, and I will post my results when they become available.

 

 


Copyright. 1998-2000. Jun Da. jda@mtsu.edu