SE(IMENTING A SENTENf,I¢ INTO MOItl)IIEM1,;S 
USING STNI'ISTIC INFOI{MATION BI,TFWEEN WORI)S 
Shiho Nobesawa 
Junya Tsutsumi, Tomoaki Nitta, Kotaro One, Sun Da Jiang, M~Lsakazu Nakanishi 
N;tkanishi L;d)oratory 
Faculty of Science and Technology, Keio University 
ABSTRACT 
This paper is on dividing non-separated language sen- 
tences (whose words are not separated from each other 
with a space or other separaters) into morphemes 
using statistical information, not grammatical infor- 
mation which is often used in NLP. In this paper 
we describe our method and experimental result on 
Japanese and Chinese se~,tences. As will be seen in 
the body of this paper, the result shows that this sys- 
tent is etlicient for most of tile sentences. 
1 INTRODUCTION AND MOTIVATION 
An English sentence has several words and those words 
are separated with a space, it is e~usy to divide an 
English sentence into words. I\[owever a a apalmse sen- 
tence needs parsing if you want to pick up the words in 
the sentence. This paper is on dividing non-separated 
language sentences into words(morphemes) without 
using any grammatical information. Instead, this sys- 
tem uses the statistic information between morphenws 
to select best ways of segmenting sentences in non- 
separated languages. 
Thinldng about segmenting a sentence into pieces, 
it is not very hard to divide a sentence using a cer- 
tain dictionary for that. The problem is how to de- 
cide which 'segmentation' the t)est answer is. For ex- 
aml)le , there must be several ways of segmenting a 
Japanese sentence written in lliragana(Jal)a,lese al- 
phabet). Maybe a lot more than 'several'. So, to 
make the segmenting system useful, we have to cot> 
sider how to pick up the right segmented sentences 
from all the possible seems-like-scgrne, nted sentences, 
This system is to use statistical inforn,ation be- 
tween morphemes to see how 'sentence-like'(how 'likely' 
to happen a.s a sentence) the se.gmented string is. To 
get the statistical association between words, mutual 
information(MI) comes to be one of the most inter- 
esting method. In this paper MI is used to calculate 
the relationship betwee.n words found ill the given sen- 
tence. A corpus of sentences is used to gain the MI. 
'Fo implement this method, we iml)lemented a sys- 
tem MSS(Morphological Segmentation using Statisti- 
cal information). What MSS does is to find the best 
way of segmenting a non-separated language, sentence 
into morphemes without depending on granamatieal 
information. We can apply this system to many lan- 
guages. 
~2 )/\[ORPHOLOGICAL ANALYSIS 
2.1 What; a Morphological Analysis Is 
A morpheme is the smallest refit of a string of char- 
acters which has a certain linguistic l/leaning itself. It 
includes both content words and flmction words, in 
this l)aper the definition of a morl)heme is a string of 
characters which is looked u I) in tile dictionary. 
Morphoh)gical analysis is to: 
l) recognize the smallest units making up tile given 
sentellce 
if the sentence is of a l|on-separated hmguage, 
divide the sentence into morphenms (auto- 
matic segmentation), and 
2) check the morlflmmes whether they are the right 
units to make up the sentence. 
2.2 Segmenting Methods 
We have some ways to segment a non-separated sen- 
tence into meaningflll morphemes. These three meth- 
ods exl)lained below are the most popular ones to seg- 
ment ,I apanese sentences. 
• The longest-sc'gment method: 
l~,ead the given sentence fi'om left to right and 
cut it with longest l)ossible segment. For exam- 
pie, if we get 'isheohl' first we look for segments 
wilich uses the/irst few lette,'s in it,'i' and 'is'. 
it is ol)vious that 'i';' is loIlger thall 'i', SO tile 
system takes 'is' as the segment. Then it tries 
the s;tllle method to find the segnlents in 'heold' 
and tinds 'he' and 'old'. 
The, least-bunsetsu segmenting m(',thod: 
Get all the possible segmentations of the input 
sentence and choose the segmentation(s) which 
has least buusetsu in it.. 'l'his method is to seg:- 
ment Japanese sentence.s, which have content 
words anti function words together in one bun- 
setsu most of the time. This method helps not to 
cut a se, ntenee into too small meaningless pieces. 
Lettm'-tyl)e, segmenting method: 
In Japanese language we have three kinds of let- 
ters called Iliragana, Katakana and Kanji. This 
227 
method divides a Japanese sentence into mean- 
ingful segments checking the type of letters. 
2.3 The Necessity of Morphological 
Analysis 
When we translate an English sentence into another 
language, the easiest way is to change the words in 
the sentence into the corresponded words in the tar- 
get language. It is not a very hard job. All we have 
to do is to look up the words in the dictionary, flow- 
ever when it comes to a non-separated language, it is 
not as simple. An non-separated language does not 
show the segments included in a sentence. For ex- 
ample, a Japanese sentence does not have any space 
between words. A Japanese-speaking person can di- 
vide a Japanese sentence into words very easily, how- 
ever, without arty knowledge in Japanese it is im- 
possible. When we want a machine to translate an 
non-separated language into another language, first 
we need to segment the given sentence into words. 
Japanese is not the only language which needs the 
morphological segmentation. For example, Chinese 
and Korean are non-separated too. We can apply this 
MSS system to those languages too, with very simple 
preparation. We do not have to change the system, 
just prepare the corpus for the purpose. 
2.4 Problems of Morphological 
Analysis 
The biggest problems through the segmentation of an 
non-separated language sentence are the ambiguity 
and unknown words. 
For example, 
niwanihaniwatorigairu. 
~: ¢-¢2 N ~: w6 
niwa niha niwatori ga iru 
A cock is in the yard. 
/E I,c t.~ -<NI £'0 ~: v, 6 o 
niwa niha niwa tori ga iru 
Two birds are in the yard. 
1~ tc *gN ~ ~ .a: N 7o 
niwa ni haniwa tori ga iru 
A clay-figure robber is in the yard. 
Those sentences are all made of same strings but the 
included morphemes are different. With dill>rent seg- 
ments a sentence can have several meanings. Japanese 
h~ three types of letters: I\[iragana, Katakana and 
Kanji. lIiragana and Katakana are both phonetic 
symbols, and each Kanji letters has its own mean- 
ings. We can put several Kanji letters to one lliragana 
word. This makes morphological analysis of Japanese 
sentence very difficult. A Japanese sentence can have 
more than one morphological segmentation and it is 
not easy to figure out which one makes sense. Even 
two or nlore seglnentation can be 'correct' lbr one sen- 
tence. 
To get the right segmentation of a sentence one 
may need not only morphological analysis but also se- 
mantic analysis or grammatical parsing. In this paper 
no grammatical information is used arid MI between 
morphemes becomes the key to solve this problem. 
rio deal with unknown words is a big problem in 
natural language processing(NLP) too. To recognize 
unknown segments in tim sentences, we have to dis- 
cuss the likelihood of tim unknown segment being a 
linguistic word. In this pal)er unknown words are not 
acceptable as a 'morpheme'. We define that 'mor- 
pheme' is a string of characters which is registered in 
the dictionary. 
3 CALCULATING TIlE SCORES OF 
SENTENCES 
3.1 Scores of Sentences 
When the system searches the ways to divide a sen- 
tence into morphemes, more than one segmentation 
come out most of the time. What we want is one 
(or more) 'correct' segmeutation and we do not need 
any other possibilities. If there arc many ways of seg- 
,nenting, we need to select the best one of them. For 
that purpose the system introduced the 'scores of sen- 
tences'. 
3.2 Mutual Information 
A mutual information(MI)\[1\]\[2\]\[3\] is tile information 
of the ~ussociation of several things. When it comes to 
NLI', M I is used I.o see the relationship between two 
(or more) certain words. 
The expression below shows the definition of the 
MI for NI, P: 
l'(wl, w2) Ml(wt 
;w2) = 1o9 l'(Wl )P(w2) (t) 
lo i : a word 
P(wi) : the probability wl appears in a corpus 
P(wl ,w,2) : the probability w~ and 'w2 comes out 
together in a corpus 
Tiffs expression means that when wl and w.2 has 
a strong association between them, P(wt)P(w~) << 
P(wt,w2) i.e. MI(wl,w2) >> 0. When wl and w~ 
do not have any special association, P(w,)P(w.a) 
P(wl,w2) i.e. Ml(wl,'w2) ~ O. And wl,en wx and 
w2 come out together very rarely, P(wl)P(w2) >> 
,'(~,,, ,,,~) i.e. M X(w,,,~,~) << 0. 
228 
3.3 Calculating the Score of a Sentence 
Using the words in the given dictionary, it is easy to 
make up a 'sentence'. llowever, it is hard to con- 
sider whether the 'sentence' is a correct one or not. 
The meaning of 'correct sentence' is a sentence which 
makes sense. For example, 'I am Tom.' can make 
sense, however, 'Green the adzabak arc the a ran 
four.' is hardly took ms a meaningful sentence. 'Fhe 
score is to show how 'sentence-like' the given string of 
morphemes is. Segmenting ~t non-sel)arated language 
sentence, we often get a lot of meaningless strings of 
morphemes. To pick up secms-likc-mea,fingfid strings 
from the segmentations, we use MI. 
Actually what we use in tim calculation is not l, he 
real MI described in section 3.2. The MI expression 
in section 3.2 introduced the bigrams. A bigram is a 
possibility of having two certain words together in a 
corpus, as you see in the expression(l). Instead of the 
bigram we use a new method named d-bigram here in 
this paper\[3\]. 
3.3.1 D-bigram 
The idea of bigrams and trigraiT~s are often used in 
the studies on NLP. A bigram is the information of 
the association between two certain words and a tri- 
gram is the information among three. We use a new 
idea named d-bigram in this paper\[3\]. A d-bigram is 
the possibility that two words wt and w2 come out 
together at a distance of d words in a corpus. For 
example, if we get 'he is Tom' as input sentence, we 
have three d-bigram data: 
('he' 'is' 1) 
('is' 'Tom' 1) 
('he' 'Tom' 2) 
('he' 'is' 1) means the information of the association 
of the two words 'tie' and 'is' appear at the distance 
of 1 word in the corpus. 
3.4 Calculation 
The expression to calculate the scores between two 
words is\[3\]: 
t'(wl, w~, d) Mid(w1, w,2, d) = 1o9~~ (2) 
lu i : ;t word 
d : distance of the two words Wl and w2 
P(wi) : the possibility the wm'd wl appears 
in the coq)us 
P(wl,w2,d) : the possibility wl and w2 eoll'le out 
d words away fl'om each other 
in the corpus 
As the value of Mid gets bigger, the more those 
words have the ,association. And the score of a sen- 
tence is calculated with these Mid data(expression(2)). 
The definition of the sentence score is\[l\]: 
ia(W)= 9 9 Mia(wi,w'+ d,d) d-' (a) 
i:0 d:l 
d : distance of the two words 
m : distance limit 
?1. : the llUllti|lel" Of Wol'ds ill tile SelttellCe 
I~ll : it selttence 
wi : The i-th morpheme in the sentence I~V 
This expression(3) calculates the scores with the 
algoritlmt below: 
1) Calculate Mld of every pair of words included in 
the given sentence. 
2) Give a certain weight accordiug to the distance, d 
to all those Mid. 
3) Sum up those 3~7~. The sum is the score of the 
sentence. 
Church and llanks said in their pN)er\[1\] that the 
information between l.wo remote wo,'ds h~s less mean- 
ing in a sentence when it comes to the semantic analy- 
sis. According to the idea we l)ut d 2 in the expression 
so that nearer pair can be more effective in calculating 
the score of the sentence. 
4 Tns SYSTSM MSS 
4.1 Overview 
M,qS takes a lliragana sentence as its input. First, 
M,qS picks Ul) the morphemes found ill the giwm sen- 
tence with checking the dictionary. The system reads 
the sentence from left to rigltt, cutting out every pos- 
sibility. Each segment of the sentence is looked up in 
the dictionary and if it is found in the dictionary the 
system recognize the segnlent as a morpheme. Those 
morphemes are replaced by its corresponded Kanji(or 
lliragana, Katakana or mixed) morpheme(s). As it 
is tohl in section 2.4, a lliragana morpheme can have 
several corresponded l(anji (or other lettered) mor- 
phemes. In that case all the segments corresponded 
to the found l|iragana morpheme, are memorized as 
morl)hemes found in the sentence,. All the found mor- 
phemes are nunfl)ered by its position in the sentence. 
After picking Illl all the n,orphenu.'s in I.he sentence 
the system tries to put them together mtd brings them 
up back to sentence(tat)h~ I). 
\[nl)ut a lliragana sentence. 
Cut out t, he morphemes. lI 
Make up sentences with the morphemes. 
tI 
Calculate the score of sentences 
using the mutual information. 
g 
Compare. the scores of all the. made-up sentences 
and get the best-marked one 
as the most 'sentence-like' sentence. 
Then the system compares those sentences made 
up with found morl)he.mes and sees which one is the 
229 
Table 1: MSS example 
0 1 2 3 4 
4 5 6 7 8 
IT. ~ "9 
8 9 10 11 12 
(('~t/~" 03)('~" 12) ('ff~" 23) (" L'23) ('~" ad) ('1:" 4 s)('~" ,ls)('~'67) 
{'')" 78)('|77-" 89) ('l,'" 910) 
("R"a" 911) ('~'~" 911)('~" 1112)) 
('~" 0a) 
1 ('~" a4) 
1 ('~" 4s) 
l failed 
1 ('$~" , e) 
1 (-~" sg) 
! 
('~,~" 9 1o) 
1 failed 
1 ('m~" o tl) 
l 
l 
~¢cepted 
I 
t ('P." ll l~} 
t accepted 
(('~/,''~" ",~" "t:" "N--a" "~') 
most 'sentence-like'. For that purpose this system cal- 
culate the score of likelihood of each sentences(section 
3.4). 
4.2 The Corpus 
A corpus is a set of sentences, These sentences are 
of target language. For example, when we apply this 
system to Japanese morphological analysis we need a 
corpus of Japanese sentences which are already seg- 
mented. 
The corpus prepared for the paper is the trans- 
lation of English textbooks for Japanese junior high 
school students. The reason why we selected junior 
high school textbooks is that the sentences in the text- 
books are simple and do not include too many words. 
This is a good environment for evaluating this system. 
4.3 The Dictionary 
The dictionary for MSS is made of two part. One is 
the heading words and the other is the morphemes 
corresponded to the headings. There may be more 
than one morphemes attached to one heading word. 
The second part which has morphemes is of type list, 
so that it can have several morphemes. 
Japanese : (" I,,~" (" ~: ~ .... ~:\[ o ")) 
heading word morphemes 
Chinese : ( "tiny" (" ~,, .... ~t")) 
heading word morpherne~ 
5 RESULTS 
Implement MSS to all input sentences and get the 
score of each segmentation. After getting the list 
of segmentations, look for the 'correct' segmented- 
sentence and see where in the list tile right one is. 
The data shows the scores the 'correct' segmentations 
got(table 2). 
Table 2: Experiment in Japanese 
corpus 
dictionary 
input 
number of 
input sentence 
distance limit 
about 630 J~tp,'tnese sentences 
(with three kinds of letters mixed) 
about 1500 heading words 
(includes morphemes 
not in tile corpus) 
lion-segmented Ja.p;~nese selltences 
using lllragana only 
about 100 e~tch 
5 
~ -V~score 
a 99% 
loo% 
7 100% 
95% 
E 80% 
2nd best T ~ 3rd best 
100% 100% 
100 % 100 % 
100% :100% 
98 % 98 % 
90 % 95 % 
the very sentences in tile corpus 
replaced one rnorllheme in the sentence 
(the buried morpheme is in the corpus) 
replaced one morpheme in the sentence 
(tile buried morpbeme is not in the corpus) 
sentences not in the corpus 
(the morphemes are all in tim corpus) 
sentences not in the corpus 
(include morphemes not; in the corpus) 
5.1 Ext)eriment in Japanese 
According to the experimental results(table 2), it is 
obvious that MSS is w.'ry useful. The table 2 shows 
that most of the sentences, no matter whether the 
sentences are in the. corpus or not, are segmented cor- 
rectly. We find the right segmentation getting the 
best score in the list of possible segmentations, c~ 
is tile data when the input sentences are in corpus. 
That is, all the 'correct' morphemes have association 
between each other. That have a strong effect in cal- 
culating the sco,'es of sentences. The condition is al- 
most same for fl and 7. Though the sentence has one 
word replaced, all other words in the sentence have 
relationship between them. Tim sentences in 7 in- 
elude one word which is not in the corpus, but still 
tile 'correct' sentence can get the best score among 
the possibilities. We can say that the data c~, fl and 
7 are very successfld. 
230 
llowever, we shouhl remember that not all the sen- 
tences in the given corpus wouht get the best score 
through the list. MSS does trot cheek the corpus itself 
when it calculate the score, it just use the Mid, the 
essential information of the corpus. That is, whether 
the input sentence is written in the corpus or not 
does not make any effect in calculating scores directly. 
Ilowever, since MSS uses Mid to calculate the. scores, 
the fact that every two morphemes in the sentence 
have connection between them raises the score higher. 
When it comes to the sentences which are not in 
corpus themselves, the ratio that the 'correct' sen- 
tence get the best score gets down (see table 2, data 
~, e). 
The sentences of 6 and g are not found in the cor- 
pus. Even some sentences which are of spoken lan- 
guage and not grammatically correct are included in 
the input sentences. It can be said that those ~ and 
e sentences arc nearer to the real worhl of Japanese 
language. For ti sentences we used only morphemes 
which are in the corpus. That means that all tim mor- 
phenres used in the 5 sentences have their own MI,I. 
And e sentences have both morphemes it( the corpus 
and the ones not in the corpus. The morphemes which 
arc not in the corpus do not have any Ml(l. Table 2 
shows that MSS gets quite good result eve(, though 
the input sentences arc not in the corpus. MSS do not 
take the necessary information directly from the co> 
pus and it uses the MIa instead. This method makes 
the information generalize.d and this is the reason why 
5 and e can get good results too. Mid comes to }>e 
the key to use the effect of the MI between morphemes 
indirectly so that wc can put the information of the 
mssoeiation between morphemes to practical use. This 
is what we expected and MSS works successfldly at 
this point. 
5.2 The Corpus 
In this paper we used the translation of English text: 
books for Japanese junior high school students. Pri- 
mary textbooks are kiud of a closed worhl which have 
limited words in it an<l the included sentences are 
mostly in some lixed styles, in good graummr. The 
corpus we used in this pal)er has about 630 sentences 
which have three types of Japanese letters all mixed. 
This corpus is too small to take ms a model of the ,'eal 
world, however, for this pal>e( it is big enough. Actu- 
ally, the results of this paper shows that this system 
works efficiently even though the corpus is small. 
The dictionary an<l the statistical information are 
got from the given corpus. So, the experimental re= 
suit totally depends on the corpus. That is, selecting 
which corpus to take to implement, we can use I.his 
system ill many purposes(section 5.5). 
5.3 Comparison with the Other 
Methods 
It is not easy to compare this system with other seg- 
,nenting methods. We coral)are with tile least-bunsetsu 
method here ill this paper. 
The least-bunselsv method segment the given sen- 
tences into morphemes and fin(l the segmentations 
with least bunselsu. This method makes all the seg- 
mentation first an(l selects the seems-like-best seg- 
mentations. This is the same way MSS does. The 
difference is that the least-bdnsetsv method checkes 
the nmnber of tile bumselsu instead of calculating the 
scores of sen(elites. 
Let us think about implementing a sentence the 
morl)hcmes are l,ot in the dictionary. That means 
that the morphemes do not have any statistical in- 
formations between them. In this situation MSS can 
not use statistical informations to get the scores. Of 
course MSS caliculate the scores of sentences accord: 
ing to tile statistical informations between given mor- 
phemes, llowe.ver, all the Ml,l say that they have no 
association I)etween t\]le (~lorpherlles. When there is no 
possibility that the two morl>hemes appears together 
ill the corpus, we give a minus score ~s tit('. Ml,t wdue, 
so, as the result, with more morphemes the score of 
the+ sentence gets lower. That is, tire segmentation 
which has less segments ill it gets better scores. Now 
compare it with the least-bunsetsu method. With us- 
ing MSS the h.'ast-morpheme segme.ntations are se- 
lected as the goo(I answer, q'hat is tile same way 
the least-bunsetsu method selects the best one. '\['his 
means that MSS and the least-bttnscts.le method have 
the same efficiency when it comes to the sentences 
which morl(hemes are not in the corpus. It is obvious 
that when the sentence has morphemes in the corpus 
the ellicie.ncy of this systern gets umch higher(table 
2). 
Now it is proved that MSS is, at least, as etli: 
cicnt as the least-b'unsets'~ nmthod, no matter what 
sentence it takes. We show a data which describes 
I.his(tabh~ 3). 
"Fable 3 is a good exanq)le of the c;use whelL the. 
input sentence has few morphemes which are in the 
corl)uS. This dal.a shows that in I.his situal.ion I.here is 
an outstanding relation between the number of mor- 
l)hemes and the scores of the segmented se.ntenees. 
This example(table 3) has an ambiguity how to seg- 
ment the sentence using the registere(l morphemes, 
and all the morphemes which causes the alnbiguity 
are not in the given (:orpus. Those umrl)hemes not 
in the corpus do not have any statistical information 
betweel, them and we have no way to select which is 
bett<.'r. So, the scores of sentences are Ul) to the length 
of the s<~gmented sentence, that is, the number how 
many morl)hemes the sentence has. '\['he segmented 
sentence which has least segments gets the best score, 
since MSS gives a minus score for unknown mssocia- 
tion between morphemes. That means that with more 
segments in the sentence the score gets lower. This sit- 
ZT/ 
Table 3: MSS and The least-bvnselsu method 
input : a non-segmented 
Japanese tliragana sentence 
not in the corpus 
all unknown morphemes in the sentence 
are registered in the (lictionary 
(some morphemes in the corpus 
are included) 
" sumomo mo nlonlo hie memo no ilCh\] 
the number of 
the morphemes 6 7 8 9 10 
the scores of 
the sentences -65,0 -79.6 -9,1.3 -108.9 -123.5 
the number of 
tile segmented 5 20 21 8 1 
sentences 
tile tcorrectl 
segmentation ~k" 
MSS O 
tile least- 
bunsetsu 0 
method 
morphemes included : " © .... ~2 " 
in the corpus : " no .... Ill(lllO " 
morphemes not included : " IAI .... ~4!. ~ " 
in the corpus : " uchi .... sunm " 
" sumomo " *' hie j~ " ~t" 
'P nlOUlO ~p 
uation is resemble to the way how the least-bunseisu 
method selects the answer. 
5.4 Experiment in Chinese 
The theme of tiffs paper is to segment non-separaLe(\] 
language sentences into morphemes. In this paper we 
described on segmentation of Japanese non-segmented 
sentences only but we are working on Chinese sen- 
tences too. This MSS is not for Japanese only. It can 
be used for other non-separated languages too. "lb 
implement for other languages, we just need to pre- 
pare the corpus for that and make up the dictionary 
from it. 
llere is the example of implementing MSS for Chi- 
nese language(table 4). The input is a string of char- 
acters which shows the pronounciations of a Chinese 
sentence. MSS changes it into Chinese character sen- 
teces, segmenting the given string. 
5.5 Changing the Corpus 
To implement tiffs MSS system, we only need a eel 
pus. The dictionary is made from the corpus. This 
Tal)le 4: Experiment in Chinese 
input : nashiyizhangditu. 
correct answer output sentences scores 
-~ )J\[~ ~: --,~ ;t~. 15.04735 
)Jl~ ~! --~ .tt~\]. -14.80836 
)JI~ {0~ --'\]~ ~1~. -14.80836 
gives MSS system a lot of usages and posibilities. 
Most of the NLP systems need grammatical i,ffof 
malleus, and it is very hard to make up a certain 
grammatical rule to use in a NLP. The corpus MSS 
needs to implement is very easy to get. As it is de- 
scribed in the previous section, a corpus is a set of real 
sentence.s. We can use IVISS in other languages or in 
other purposes just getting a certain corpus for that 
and making up a dictionary from the corpus. That is, 
MSS is available in many lmrposes with very simple, 
easy preparation. 
6 CONCLUSION 
This paper shows that this automatic segmenting sys- 
tem MSS is quite efficient for segmentation of non- 
separated language sentences. MSS do not use any 
grammatical information to divide input sentences. 
Instead, MSS uses MI l)etween morphenres included in 
the input sentence to select the best segmentation(s) 
frorn all the possibilities. According to the results 
of the experiments, MSS can segment ahnost all the 
sentences 'correctly'. This is such a remarkable result. 
When it comes to the sentences which are not in the 
corpus the ratio of selecting the right segmentation 
as the best answer get a little bit lower, however, the 
result is considerably good enough. 
The result shows that using Mid between mor- 
phemes is a very effective method of selecting 'correct' 
sentences, aml this means a lot in NLP. 
REFERENCES 
\[1\] Kenneth Church, William Gale, Patrick lhmks, 
and Donald llindle. Parsing, Word Associations 
and Typical Predlcate-Argument t{,elations. In- 
ternational Parsing Workshop, 1989. 
\[2\] Frank Smadja. Itow to compile a hilingual collo- 
cational lexicon automatically. Statislically-based 
Natural Language Programming Techniques, pages 
57--63, 1992. 
\[3\] dunya Tsutsurni, Tomoaki Nitta, Kotaro One, 
and Shlho Nobesawa. A Multi-Lingual Transla- 
tion System Based on A Statistical Model(written 
in Jal)anese). JSAI Technical report, SIG-PPAI- 
9302-2, pages 7-12, 1993. 
232 
\[4\] David M.Magerman and Mitchell P.Marcus. Pars- 
ing a Natural Language Using Mutual Information 
Statistics. AAAI, 1990. 
\[5\] It.Brown, J.Cocke, S.Della Pietra, V.Della Pietra, 
F.Jelinek, R.Mercer, and P.Roossin. A Statist, i- 
eal Approach to Language Translation. l'roc, of 
COLING-88, pages 71-76, 1989. 
233 
