Chinese text computing

Computing mutual information statistics for collocation

How are mutual information scores computed?

Please refer to http://www.umiacs.umd.edu/users/resnik/nlstat_tutorial_summer1998/Lab_ngrams.html for the computing procedure and formula.
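As a rough illustration only, the sketch below assumes the commonly used pointwise definition MI(x, y) = log2( P(x, y) / (P(x) P(y)) ), with probabilities estimated from raw unigram and bigram counts. The function name bigram_mi and the toy token list are hypothetical and are not taken from the tutorial above, whose exact procedure may differ.

```python
import math
from collections import Counter


def bigram_mi(tokens):
    """Return {(x, y): MI score} for every adjacent token pair in `tokens`.

    Assumes MI(x, y) = log2(P(x, y) / (P(x) * P(y))), with probabilities
    estimated by relative frequency. This is an illustrative sketch,
    not necessarily the procedure described in the referenced tutorial.
    """
    n_unigrams = len(tokens)
    n_bigrams = n_unigrams - 1
    if n_bigrams < 1:
        return {}

    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    scores = {}
    for (x, y), c_xy in bigram_counts.items():
        p_xy = c_xy / n_bigrams               # relative frequency of the bigram
        p_x = unigram_counts[x] / n_unigrams  # relative frequency of the first token
        p_y = unigram_counts[y] / n_unigrams  # relative frequency of the second token
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy example: each Chinese character is treated as one token (an assumption).
tokens = list("中国人民中国")
scores = bigram_mi(tokens)
for pair, mi in sorted(scores.items(), key=lambda kv: -kv[1]):
    print("".join(pair), round(mi, 3))
```

On a corpus this small the scores are not meaningful. Rare pairs in particular tend to receive inflated MI values, so a minimum frequency cutoff is often applied in practice.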

How to interpret mutual information scores?

The following guidelines can be used:

  • High (MI >= 5)
  • Medium (4 >= MI >= 3)
  • Low (MI <= 1)

If the score is high or medium, the collocation is strong. If MI is below 1, the two tokens are less likely to be related. MI scores between 1 and 3 fall in a gray area. My intuitive judgment is that bigrams with MI scores larger than 2.5 appear to be bisyllabic words in Chinese, though this intuition needs to be verified.
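The guidelines above can be written as a small lookup, sketched here under stated assumptions. The function name interpret_mi and the label strings are hypothetical. Scores strictly between 4 and 5 are not covered by the guidelines, so the sketch reports them as "unclassified" rather than guessing a band.

```python
def interpret_mi(mi):
    """Map an MI score to the guideline bands listed above.

    The label strings are illustrative choices, not part of the guidelines.
    """
    if mi >= 5:
        return "high"
    if 3 <= mi <= 4:
        return "medium"
    if mi <= 1:
        return "low"
    if 1 < mi < 3:
        return "gray area"
    # 4 < mi < 5 is not assigned to any band by the guidelines.
    return "unclassified"


for score in (5.2, 3.5, 2.0, 0.5):
    print(score, interpret_mi(score))
```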

Other statistical measures of collocation

Other statistical measures, such as the t-score, likelihood ratio, chi-square, and Yule's Y, are often used to measure collocation strength. For an introduction to and comparison of these measures, please refer to, among others:

 

 

 
Copyright. 1998-2017. Jun Da. jda@mtsu.edu. Page last updated: 2010-09-16