Chinese text computing
Jun Da at lingua.mtsu.edu
Computing mutual information statistics for collocation
How are mutual information scores computed?
Please refer to http://www.umiacs.umd.edu/users/resnik/nlstat_tutorial_summer1998/Lab_ngrams.html for the computation procedure and formula.
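As a rough illustration, the sketch below computes the commonly used pointwise mutual information for a bigram, MI(x, y) = log2( P(x, y) / (P(x) P(y)) ), from raw counts. It assumes this is the formula the tutorial describes, that probabilities are plain relative frequencies with no smoothing, and that the input is already a list of tokens (for Chinese, possibly single characters). The function name `bigram_mi` is hypothetical and is not taken from the tutorial.

```python
import math
from collections import Counter


def bigram_mi(tokens):
    """Return {(w1, w2): MI} for adjacent token pairs, using log base 2.

    Assumption: probabilities are unsmoothed relative frequencies.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)               # number of unigram positions
    n_bi = max(len(tokens) - 1, 1)    # number of bigram positions

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi           # joint probability of the bigram
        p_x = unigrams[w1] / n_uni    # marginal probability of the first token
        p_y = unigrams[w2] / n_uni    # marginal probability of the second token
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

For example, `bigram_mi(list("中国人民中国"))` treats each character as a token and returns one score per distinct adjacent character pair. Scores for bigrams seen only once or twice tend to be unreliable, so a minimum frequency cutoff is often applied before ranking.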
How to interpret mutual information scores?
The following guidelines can be used:
High or medium MI scores indicate strong collocation. If MI is below 1, the two tokens are less likely to be related. MI scores between 1 and 3 fall in a gray area. My intuitive judgement is that bigrams with an MI score larger than 2.5 appear to be bisyllabic words in Chinese, though this intuition needs to be verified.
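The guidelines above can be sketched as a small helper. The band labels, the treatment of the boundary values 1 and 3, and the function names `interpret_mi` and `candidate_words` are assumptions made for illustration. The 2.5 cutoff is the unverified intuition mentioned above, not an established threshold.

```python
def interpret_mi(score):
    """Map an MI score to the rough bands described in the guidelines."""
    if score < 1:
        return "probably unrelated"
    if score <= 3:          # assumption: 1 and 3 are counted as gray area
        return "gray area"
    return "strong collocation"


def candidate_words(scores, threshold=2.5):
    """Keep bigrams whose MI exceeds the threshold, highest score first.

    `scores` is a dict mapping (token1, token2) pairs to MI values.
    """
    kept = [(pair, mi) for pair, mi in scores.items() if mi > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

Any list produced by `candidate_words` would still need manual checking against a dictionary or a segmented corpus before the bigrams are treated as words.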
Other statistical measures of collocation
Other statistical measures such as t-score, likelihood ratio, chi-square and Yule's Y are often used to measure collocation strength. For an introduction and comparison of those measures, please refer to, among others: