Chinese text computing

(This is the 1998 version. An updated 2004 version is now available)


Chinese Text Computing - Introduction


(Last updated: 2000-10-22)

This website provides Chinese character frequency lists generated from a 110-megabyte corpus of Chinese text. It also provides bigram lists from two subcorpora. Further work is underway on the concordance and collocation (e.g., bigrams and trigrams) of individual characters, using statistics such as mutual information, likelihood ratios and HMM probabilities. Results will be posted as they become available. All of the content presented here is related to my research on automatic phrase identification and acquisition.

1. What is already available at this web site?

  • Chinese character frequency lists based on a 45-million character corpus. They can be found at the Statistics page;
  • Bigram lists, together with their mutual information scores, based on two subsets of the Chinese text collection used in this study. They can be downloaded from the bigram page;
  • A simple search engine by which you can find information about individual characters such as frequency, encoding and Pinyin, etc.;
  • Tutorials and information pages about various aspects of Chinese computing, such as how to display Chinese on non-localized platforms and how to segment Chinese characters in running text. These documents can be found on the FAQ page.
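
As an illustration of the kind of statistics behind the frequency and bigram lists mentioned above, here is a minimal Python sketch. It is not the code used to produce the lists on this site. It assumes the text has already been decoded to Unicode, treats only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as Chinese characters, and uses simple relative frequencies to estimate pointwise mutual information; the function names are hypothetical.

```python
import math
from collections import Counter


def is_chinese(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts.
    return '\u4e00' <= ch <= '\u9fff'


def char_frequencies(text):
    """Count how often each Chinese character occurs in text."""
    return Counter(ch for ch in text if is_chinese(ch))


def bigram_mutual_information(text):
    """Return {bigram: pointwise mutual information in bits}.

    A bigram is two adjacent Chinese characters; punctuation and other
    non-Chinese characters break adjacency.
    """
    chars = [ch if is_chinese(ch) else None for ch in text]
    unigrams = Counter(c for c in chars if c is not None)
    bigrams = Counter(
        a + b
        for a, b in zip(chars, chars[1:])
        if a is not None and b is not None
    )
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for bigram, count in bigrams.items():
        p_xy = count / n_bi                  # estimated P(x, y)
        p_x = unigrams[bigram[0]] / n_uni    # estimated P(x)
        p_y = unigrams[bigram[1]] / n_uni    # estimated P(y)
        scores[bigram] = math.log2(p_xy / (p_x * p_y))
    return scores


if __name__ == "__main__":
    sample = "中国人民，中国文字"
    print(char_frequencies(sample).most_common(3))
    for bigram, mi in sorted(bigram_mutual_information(sample).items()):
        print(bigram, round(mi, 3))
```

On a corpus of realistic size, rare bigrams tend to receive inflated mutual information scores, so a minimum-count threshold is usually applied before ranking.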

2. What will be available at this web site?

  • More bigram and n-gram lists generated from the current corpus;
  • A search engine for character co-occurrence patterns.

3. What are the Chinese texts used in this frequency study?

4. What are some of the (potential) benefits of data-driven study of Chinese text in general and character frequencies in particular?

  • Language teaching;
  • Design of Chinese input methods;
  • Automatic lexical acquisition;
  • Online dictionary compilation;
  • Basic tools for information retrieval;
  • Classification and organization of texts and text collections (for better information presentation and retrieval);
  • Speech recognition and adaptive user interfaces;
  • Authoring aids and translation aids, etc.

5. What is the research question?

Running Chinese text contains no delimiters that separate words from each other. Further, as in other languages, a compound or phrase may consist of more than one character (or word). This project is an attempt to discover reliable and efficient ways of segmenting words and phrases in written Chinese text.
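
To make the segmentation problem concrete, the following minimal Python sketch shows forward maximum matching, a common dictionary-based baseline. It is not the method studied in this project and is given only for illustration. The function name, the `max_len` parameter and the tiny lexicon are hypothetical.

```python
def forward_maximum_match(text, lexicon, max_len=4):
    """Greedily segment text, preferring the longest lexicon entry.

    lexicon is a set of known words; max_len is the longest word
    length tried. A character that starts no known word is emitted
    as a one-character word.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


if __name__ == "__main__":
    # Hypothetical toy lexicon, for illustration only.
    print(forward_maximum_match("中国人民", {"中国", "人民"}))
    print(forward_maximum_match("中国人民", {"中国", "人民", "中国人"}))
```

The second call illustrates why the problem is hard: once 中国人 is in the lexicon, the greedy match consumes it and leaves 民 stranded, although 中国 followed by 人民 is the intended reading. Statistical evidence such as character co-occurrence is one way to resolve ambiguities of this kind.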

6. About me

You can find more information about my other research interests in the Curriculum Vitae section of this website.

Copyright. 1998-2000. Jun Da. jda@mtsu.edu