Jun Da's WebCentral 
Chinese text computing (This is the 1998 version. An updated 2004 version is now available.)

T-SCORE AND MUTUAL INFORMATION SCORE
FROM BIRMINGHAM CORPUS WEBSITE
The two statistical measures of significance which are used by the
collocations feature of the CobuildDirect service are explained below in
layman's terms. It is not really possible to explain the complete
statistical background to the use of Mutual Information and t-scores here.
Let us work through some example data (taken from a 20m word corpus) for
the word "post".
It co-occurs with many words, among which are "the", "office" and
"mortem".
The observable facts are that "post" has an overall corpus freq of 2579
(let's refer to this as f(post)=2579) and also
f(office) = 5237
f(the) = 1019262
f(mortem) = 51
We also observe the number of times these words co-occurred with "post"
(for shorthand I'll write j(the) = 1583 to mean that "the" occurred with
"post" 1583 times: this is the "joint" frequency). So
j(the) = 1583
j(office) = 297
j(mortem) = 51
Now if we were to list the collocates of "post" by raw frequency of
co-occurrence we would order them according to j(x), as above. Of course,
a full collocation listing of "post" in this form would have many other
words with intermediate frequencies; we are just focussing on these
three words for the moment. But the ordering shown above doesn't tell us
anything much about the strength of association between "post" and these
other words: it is simply a reflection of the basic overall frequency of
the collocating words (i.e. "the" is much more frequent than "office",
which is much more frequent than "mortem"). We just showed that in the
f(x) list! This is true in general: ordering collocates by j(x) simply
places words like "the", "a", "of", "to" at the top of every collocate
list. What we would like to know is:

IMPORTANT QUESTION: to what extent does the word "post" condition its
lexical environment by selecting particular words with which it will
co-occur?

We can compare the relative frequencies of what we observed with what we
would expect under the null hypothesis:

NULL HYPOTHESIS: the word "post" has no effect whatsoever on its lexical
environment and the frequencies of words surrounding "post" will be
exactly (give or take random fluctuation) the same whether "post" is
present or not.

That is, if "the" has an overall relative frequency of 1 in 20 (about 1m
occurrences in a 20m word corpus; see f(the) above) then we can expect
"the" to occur with the same relative frequency in a subset of the corpus
which is the 4 words either side of "post" (a span of 8 words for each
occurrence of "post"): hence under the null hypothesis we would expect
j(the) to be
(f(post) * span) * relative_freq(the)
which is
(2579 * 8) * (1 / 20) = 20632 / 20 = 1031 (dropping the fraction)
So under the null hypothesis we would expect j(the) to be 1031. We
actually observed j(the) to be 1583, which is rather higher, and we could
simply express the difference as a ratio (of observed to expected joint
frequency) thus:
1583/1031
This ratio is the basis of the Mutual Information score (MI is
conventionally reported as the base-2 logarithm of the ratio, which leaves
the ordering of collocates unchanged). It expresses the extent to which
observed frequency of co-occurrence differs from expected (where we mean
"expected under the null hypothesis"). Of course, big differences indicate
massive divergence from the null hypothesis and indicate that "post" is
exerting a strong influence over its lexical environment.
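As a rough illustration, the sums above can be sketched in Python. The function and variable names are my own and are not part of CobuildDirect. The base-2 logarithm at the end is an assumption: the worked example uses the plain ratio, while many collocation tools report MI as the log of that ratio.

```python
import math

CORPUS_SIZE = 20_000_000  # the "20m word corpus"
SPAN = 8                  # 4 words either side of the node word
F_POST = 2579             # f(post)
F_THE = 1_019_262         # f(the)
J_THE = 1583              # j(the), the observed joint frequency


def expected_joint(f_node, f_collocate, corpus_size=CORPUS_SIZE, span=SPAN):
    """Expected joint frequency under the null hypothesis."""
    return f_node * span * (f_collocate / corpus_size)


# The worked example rounds relative_freq(the) to 1/20; using f(the)
# directly gives a slightly larger expectation.
rounded_expected = F_POST * SPAN * (1 / 20)
exact_expected = expected_joint(F_POST, F_THE)

ratio = J_THE / exact_expected  # observed / expected
log_mi = math.log2(ratio)       # assumed log form of the MI score

print(f"expected (rounded rel. freq): {rounded_expected:.1f}")
print(f"expected (exact rel. freq):   {exact_expected:.1f}")
print(f"observed/expected ratio:      {ratio:.2f}")
print(f"log2 of the ratio:            {log_mi:.2f}")
```

Either way, "the" turns up near "post" only about one and a half times as often as chance would predict.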
BUT BUT BUT! There is a Big Problem with Mutual Information: suppose the
word "egregious" appears just once with "post" (not an unreasonable
event) in the corpus. And "egregious" may have a very low overall freq:
f(egregious) = 3
Now we carry out the sums to calculate the expected j(egregious) figure. I
can assure you it will be a small number! It is:
(f(post) * span) * relative_freq(egregious)
= (2579 * 8) * (3 / 20000000)
= 0.0030948
Now you'll see that even if "egregious" occurs just once in the vicinity
of "post" the observed j(egregious) will be 323 times more than the
expected joint frequency, and the mutual information value will be high.
Common sense tells us that, since words cannot appear 0.0030948 times
(they occur either zero times or once, nothing in between), claiming
that "post"+"egregious" is a significant collocation is rather
dubious.
In general, the comparison of observed j(x) and expected j(x) will be very
unreliable when values of j(x) are low; this is common sense, too. Just
because I've seen these two words together once in 20m words doesn't give
me much confidence that they are strongly associated: I'd need to see them
together several times at least before I could start to feel at all secure
in claiming that they have some sort of significant association.
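To see how fragile the ratio is at low counts, here is a small Python sketch of the "egregious" sums. The names are illustrative only; the figures are the ones given above.

```python
CORPUS_SIZE = 20_000_000  # the "20m word corpus"
SPAN = 8                  # 4 words either side of "post"
F_POST = 2579             # f(post)
F_EGREGIOUS = 3           # f(egregious)

# Expected joint frequency under the null hypothesis.
expected = F_POST * SPAN * F_EGREGIOUS / CORPUS_SIZE
print(f"expected j(egregious) = {expected:.7f}")

# Observed/expected ratio for zero, one and two sightings near "post".
ratios = {observed: observed / expected for observed in (0, 1, 2)}
for observed, ratio in ratios.items():
    print(f"observed j = {observed}: ratio = {ratio:.0f}")
```

A single sighting moves the ratio from 0 to roughly 323, and a second one doubles it again, which is why so little weight can be put on it.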
Now here comes T-score. We can calculate a second-order statistic which
is, crudely, this:

IMPORTANT QUESTION: how confident can I be that the association that I've
measured between "post" and "egregious" is true and not due to the
vagaries of chance?

T-score answers this question. It takes account of the size of j(x) and
weights its value accordingly. A high t-score says: it is safe (very
safe/pretty safe/extremely secure etc. according to value) to claim that
there is some non-random association between these two words. So t-scores
are higher when the figure j(x) is higher. In the case of "egregious" we
would get a very low t-score. In the case of "the" the t-score might be
quite high, but not huge because "the" doesn't have that strong an
association with "post". "office" gets a really high t-score because not
only is the observed j(office) way higher than expected, but we have seen a
goodly number of such co-occurrences, enough to be pretty damn sure that
this can't be due to some freak of chance.
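This note does not give the formula for t-score. A commonly quoted approximation in corpus work is (observed - expected) / sqrt(observed); the sketch below assumes that form, so treat it as an illustration and not as what CobuildDirect actually computes. It holds the observed/expected ratio fixed at the "egregious" value and varies only the number of sightings. Only the first row comes from the figures above; the other two are hypothetical.

```python
import math


def t_score(observed, expected):
    """Assumed t-score approximation: (O - E) / sqrt(O)."""
    return (observed - expected) / math.sqrt(observed)


# "egregious": seen once, expected 0.0030948 times.
EXPECTED_EGREGIOUS = 0.0030948
RATIO = 1 / EXPECTED_EGREGIOUS  # about 323

# Hypothetical collocates with the same observed/expected ratio but
# more sightings; only the first row uses figures from the note.
results = {}
for observed in (1, 10, 100):
    expected = observed / RATIO
    results[observed] = t_score(observed, expected)
    print(f"j = {observed:3d}  ratio = {RATIO:.0f}  t = {results[observed]:.2f}")
```

With the ratio held constant, the score grows with the square root of j(x): the single sighting stays below 1, while a hundred sightings score ten times higher.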
In practical terms, raw frequency or j(x) won't tell you much at all about
collocation: you'll simply discover what you already knew, that "the" is a
*very* frequent word and seems to co-occur with just about everything. MI
is the proper measure of strength of association: if the MI score is high,
then observed j(x) is massively greater than expected, BUT you've got to
watch out for the low j(x) frequencies because these are very likely to be
freaks of chance, not consistent trends. T-score is best of the lot,
because it highlights those collocations where j(x) is high enough not to
be unreliable and where the strength of association is distinctly
measurable.
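The three orderings can be compared side by side on the four example words. This is a minimal Python sketch under the same assumptions as the worked figures (20m word corpus, span of 8); the names are mine, MI is taken as the plain observed/expected ratio, and the t-score formula (O - E) / sqrt(O) is an assumed approximation that this note does not state.

```python
import math

CORPUS_SIZE = 20_000_000  # the "20m word corpus"
SPAN = 8                  # 4 words either side of "post"
F_POST = 2579             # f(post)

# word: (overall frequency f(x), joint frequency j(x))
COUNTS = {
    "the": (1_019_262, 1583),
    "office": (5237, 297),
    "mortem": (51, 51),
    "egregious": (3, 1),
}


def scores(f_collocate, j_observed):
    """Raw frequency, MI ratio and (assumed) t-score for one collocate."""
    expected = F_POST * SPAN * f_collocate / CORPUS_SIZE
    return {
        "raw": j_observed,
        "mi": j_observed / expected,
        "t": (j_observed - expected) / math.sqrt(j_observed),
    }


table = {word: scores(f, j) for word, (f, j) in COUNTS.items()}


def ranked(measure):
    """Words ordered from highest to lowest on the given measure."""
    return sorted(table, key=lambda w: table[w][measure], reverse=True)


rankings = {measure: ranked(measure) for measure in ("raw", "mi", "t")}
for measure, order in rankings.items():
    print(f"{measure:>3}: {', '.join(order)}")
```

Raw frequency puts "the" first, the MI ratio puts "mortem" and "egregious" first, and the assumed t-score puts "office" first with "egregious" last.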
Try the different measures: you'll soon see the difference. Raw freq often
picks out the obvious collocates ("post office", "side effect") but you
have no way of distinguishing these objectively from frequent
non-collocations (like "the effect", "an effect", "effect is",
"effect it" etc.).
MI will highlight the technical terms, oddities, weirdos, totally fixed
phrases, etc. ("post mortem", "Laurens van der Post", "post-menopausal",
"prepaid post"/"post prepaid", "post-grad"). T-score will get you
significant collocates which have occurred frequently ("post office",
"Washington Post", "post-war", "by post", "the post").
If a collocate appears in the top of both MI and t-score lists it is
clearly a humdinger of a collocate, rock-solid, typical, frequent,
strongly associated with its node word, recurrent, reliable, etc etc etc.
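Picking out the collocates that sit near the top of both lists is a simple set intersection. The Python sketch below uses the four example words; the cut-off N, the helper names and the t-score formula (O - E) / sqrt(O) are all assumptions made for illustration.

```python
import math

CORPUS_SIZE, SPAN, F_POST = 20_000_000, 8, 2579

# word: (overall frequency f(x), joint frequency j(x))
COUNTS = {
    "the": (1_019_262, 1583),
    "office": (5237, 297),
    "mortem": (51, 51),
    "egregious": (3, 1),
}


def expected(word):
    """Expected joint frequency with "post" under the null hypothesis."""
    return F_POST * SPAN * COUNTS[word][0] / CORPUS_SIZE


def mi_ratio(word):
    return COUNTS[word][1] / expected(word)


def t_score(word):
    j = COUNTS[word][1]
    return (j - expected(word)) / math.sqrt(j)  # assumed formula


def top_n(score, n):
    """The n words scoring highest on the given measure."""
    return set(sorted(COUNTS, key=score, reverse=True)[:n])


N = 3  # arbitrary cut-off, chosen only because the toy list has four words
in_both = top_n(mi_ratio, N) & top_n(t_score, N)
print(sorted(in_both))
```

With only four words the cut-off is artificial, but "office" and "mortem" are the two that survive in both lists.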
Jem Clear
June 1995


Copyright. 1998-2000.
Jun Da. jda@mtsu.edu
