Colloeational Analysis in Japanese Text Input 
Masaki YAMASHINA Fumihiko OBASHI 
NTT Electrical Communication Laboratories 
1-2356 Take Yokosuka~shi Kanagawa-ken 
238-03 JAPAN 
Abstract 
This paper proposes a new disambiguation method for 
Japanese text input. This method evaluates candidate sen- 
tences by measuring the number of Word Co-occurrence Pat- 
terns (WCP) included in the candidate sentences. An au- 
tomatic WCP extraction method is also developed. An ex- 
traction experiment using the example sentences from dic- 
tionaries confirms that WCP can be collected automaticMly 
with an accuracy of 98.7% using syntactic analysis and some 
heuristic rules to eliminate erroneous extraction. Using this 
method, about 305,000 sets of WCP are collected. A co- 
occurrence pattern matrix with semantic categories is built 
based on these WCP. Using this matrix, the mean number of 
candidate sentences in Kana.-to-Kanji translation is reduced 
to about 1/10 of those fi-om existing morphological methods. 
1.Introduction 
For keyboard input of Japanese, Kana-to-kanji translation 
method \[Kawada79\] \[Makino80\] \[Abe86\] is the most popu- 
lar technique. In this method, Kana input sentences are 
translated automatically into Kanji-Kana sentences. How- 
ever, non-segmentcd Kana input is highly ambiguous, be.. 
cause of the segmentation ambiguities of Kana input into 
morphemes, and homonym ambiguities. Some research has 
been carried out mainly to overcome homonym ambiguity 
using a word usage dictionary \[Makino80\] and by using case 
grammar \[Abe86\]. 
A new technique named collocational analysis method, is 
proposed to overcome both ambiguities. This evaluates the 
certainty of candidate sentences by measuring the number 
of co-occurrence patterns between word paix~. It is used 
in addition to the usual morphological analysis. To realize 
this, it is essential to build a dictionary which can reflect 
Word Co-occurrence Patterns (WCP). In English processing 
research, there has been an attempt \[Grishman86\] to col- 
lect semi-automatically sublanguage selectional patterns. In 
Japanese processing research, there have been attempts \[Shi- 
rai86\] \[Tanaka86\] to collect combinations of words with this 
kind of relationship, eittmr completely- or semi-automatically. 
These two attempts did not provide a dictionary for practical 
use. 
A new method is proposed for building a dictionary which 
accumulates WCP. The first feature of this method is the col- 
lection of WCP from the common combination of two words 
having a dependency relationship in a sentence, because these 
common combinations will most likely reoccur in future texts. 
In this method, it is important to identify dependency re- 
lationships between word pai~s, instead of identifying, the 
whole dependency structure of the sentence. For this pur- 
pose, Dependency Localization Analysis (DLA) is used. This 
identifies the word pairs having a definite dependency rela- 
tionship using syntactic analysis and some heuristic rules. 
This paper will first describe eoUocational analysis, a new 
concept in Kana-to-Kanji translation, then the compilation of 
WCP dictionary, next the translation Mgorithm and finMly 
translation experimental results. 
2. Concept of Colloeatlonal Analysis in Translation 
CollocationM analysis evaluates the correctness of a trans- 
lated sentence by measuring the WCP within the sentence. 
The WCP data is accmnulated in a 2-dimensional matrix, by 
information milts indicating more restricted concepts than 
the words can indica.te by themselves. 
As previously mentioned there are two kinds of ambigui- 
ties in Kana-to-Kanji translation. In Fig.i, disambiguation 
process of homonyms is illustrated. ' NA;R (a national 
anthem) and ~\[~'~- ~ (to play)' and ' NAg( (a state) aud 
~.~-~- ;5 (to build)' etc. are examples of WCP. If the simple 
Kana sequence ' ~_ -~ h~ ~- .~./~ ~ 5 ~" ;5 \[kokkaoensousuru\]' is 
input, the usual translation system will develop two possible 
candidate words ' NJN ' (a national anthem) and ' NAg( (a 
state)', for the partial Kana sequence of ' ~ ~J h~ \[kokk@ 
The system will also develop uniquely the creed(date word, 
' ~-¢ ;5 (to play) ' for' R./~ <- -) -~- ;5 \[ensousumq'. These 
candidate words are obtained by table searching and mor- 
phologicM analysis. Itowever, morphological analysis alone 
can't identify which one is correct for ' ~. o h~ \[kokka\]. 
Using eo!loeationM analysis, ~he WCP of ' NA ~.7,~(a state)' and 
' ~- ;5 (to play)' is found to be nil, while that of' NA~ (a 
national anthem)' and ' ~ ;5 (to play)' is found to be 
probable. Using WCP, ' NA ~ik ~ ~ ~ ~" -5 (to play a national 
anthem)' is selected as the final candidate sentence. If the 
Kana sequence' c o h~ ~ l:Y/b -t~ ~ ~" .5 \[kokkaokensetsusnru\]' 
is input, ' NA~-$k~: ;5 (to build a state)' is obtained in 
same manner. 
E\] ~ Homonyms for \[ ~'j~ ~ -~ ~ I 
(Japanese\] ' ,Z-)~ ' (kokka)\]j,~__~toptay)~.~ 
'l~-I~ ~' (nihon) L~ ~ V '~'~'~- ° -~ ~' 
(a national anthem) I (enaousuru) 
N 
( a state ) 
~(Canstitutional)\] ( to build ) 
q~t "~ "1o ' (houch i) 'l;t/vVO ~ ~' 
(kensetsuauru) : ~aKana sentence 
: Candid te~~ 
O ~ ~__& (to play a national anthem) 
× NL~ ~" ~ ~ ;5 (to play a state) 
Fig. 1 Concept of colloeational analysis 
770 
,:~..A ~*~CP Dh:i;h:mary 
3o1..g,*j .Automath: Compilation Method 
2)he new compilation :method extracts fl'om a sent nee two 
,ma'cl combinations whMh l'lave a dependency relationship. 
This is ilhmtmted with the sa,nple sentmme ~ i:L Ci "~ ~: ~1~ 3.~ ,e$ ~' 
~l'I :, i: (i shot ~ bird fl'ying in the sky,)'. 
;it i.e~,.,\] 
A t fir~:;t rids n,egho,l, analyz."e~ a sentem-e morpl;ological,ly. 
\]n t\[6:~ c~ample, the sentence, is s(,.g;menl;ed into live \[hmsetsu 
( Ja\[)alit;p,e grammatical, units, like t)hrase,q) and i.hc )arts oi:' 
simech of each wo~d are ,,brained. ' ~;,\], (1)', ' ,i.5 (a bird)' and 
' ,u (sky)' arc noires. ' tl~.;v (to 3y) and gld ~, #. (to 
shoot) ',re ,,erl',,~. 'a (ha)' in the first m~,,.:~;s,,, ' ~.' 0')' 
ii,. the se:cond one and in the. fl)tn'th one ace poat0o:dtional 
words. They determine tl'm dependent attributes of ha,ms in 
dependency re\]atioml'}it). 
ex. 
..!i~/_,.!~.. / ~ ~±- / .:~\]~_ZA / .... }/;t ::;.-. / Ji,!.~,~ 
( l ) (~qky) (to i?ly) (a blFd) (to $11OO1;) 
' I ~, ~ ............. ¢. ~ ........... _~ 4 . 
' t .............. .~ ........................ J I 
(ID. F, ngl'isl'l: I shot a bird flying in the sky.) 
Step.2 
Tim d,;pendeney relationship between words is a.aal,yzed 
using Japanese syntactic rules. In the extractkm step, DLA 
is used. This process first, finds out unique dependency rela. 
tkmr:l,fips. "Unique relationslhip" inca.us that a dependc'nt has 
o~.,\]y one \],ossibl,e ,,~oven or within the sentence, hi this exam. 
pie, the :colal, ionships between ' ,t',!>'% (a bird) and !1~1 ", i:: (to 
hoe~) s.:nd ' Jl~-~; (to fly) and ,t'i$ ~i (a hh'd)' a.n~ idc~.d;iiled 
as lraiqn~: 
Next, 'ambiguous r'.\[ationshipu '~ are processed. .i'!}.is re. 
lationsl'fi\[- means th.t~ a (tei)e.tt¢\[cnt has sevet'fl\] po~sil)le gov- 
ernors. In this cwm, the governor which can be identiiied as 
,n ~>;~ l'ikcly by heuristic rules is local;cal. Thi.~; rul'e wil! only 
~,:ccpt rd.~tio,~ships wlherc dependent and p;oeernor are adjw. 
eeLq:, because this rel'ationship l,nt.~ the highest possibility. 
In thi~ example, ' "}4 :~:(sky)' has two pos,.dbl'e, candidate 
governor:s, ' ~i}~.;c (to dy)' an ~ ' ~}~J .~ #. (to shoot), in tiffs 
ea.se, because, '?,~ 4- (sky) a,nd Ji~ A; (to ily)' are adjacent, it 
is identified that '3~ @ (siky)' is dependent and ' )l~g: (to 
fly)' is govcr.mu'. 
Next., ',/, ;t(I)' l,m.s also two possil)le candidate governors, 
' \]t{.;: (to fly)' :rod' fl,~ -., ~'u (to shoot)', in this case, because, 
these two governors are not adjacent to the depemlent, the 
dependency relationship between '$\].,i,~(I)' and two candidate 
governors rl,on't be identified for extraction. 
I °' t urthe~ more, some speefl:tc pa.rt-of-speeeh sequenees which 
have many sanbiguotm dependency rel'ationships are rejected 
fi)r extraction. Following is an exarnple of eonihsing part-of- 
speech sequence. In spite of similar syntactic style, ' ~li t,~ (red)' 
in ' ,~ t,~ *li a) ~g (a red car's window)' modifys adjacent 
word' _qi (a car)', while, '~, ~,,(red)'in' kl: ~' N a0 ~g (a 
red tl,ower in fal'l.)' modifys a word at end of .qent;enee ' :\]~ (a 
flower)'. '\]'has, in case.' of thiq sequence, if a dependent and a 
governor t~re adjacent, the relationsl,fip between the modify- 
ins adjeet:ive and the modified noun is not identified. 
t£'g:. 
modifying adj. etc. -t noun 't-- ' (0 '(postf,osition ) + noun 
;3, ~, qt o) ;,g ,£ v, ~2 0.~ gF, 
r ' window) (red flower in fall) (a red :at s 
3.2, Extraction Experiment 
'\]?o provide it large volume of syntacticall,y correct sen- 
"~enees, ezample sentences written in dictionaries \[Ohno82\] 
\[MasndaS3\] were employed. This is because, tl,mse example 
senLe*~ces are a rich source of data indicating typical, usage 
of each common word wit;h short sentences and they are as. 
sumed go represent eornmon usages witl'fin gn extremely large 
~4~niount of Sollrce data. 
Five hundred example sentences were used t,o examine the 
accuracy of this automatic exh'aetion method. 82% o\[" st, 
tenets eouhl be analyzed morphol'ogiea.lly. As result, 7\]~; 
sets of dependency rel'ationship were extracted from tlhese 
morphologically-.analyzed sentences with m! accuracy of 98.7% 
'.('he causes of erroneous extraction are ma.inly mi:;identifica 
thin of part-of-speech and of compound words. 'FL, e misidm> 
tifieation of dependency relationship was much l'ess fcequem.. 
Using thi~ mcghod~ 305,000 sets of WCP were collected 
from 300,000 example sentences, in these WCP, about 45% 
of them are relationships I)etweeD noan and verb or adjective 
with pmtpositional word, 21% are. relationships between noun 
and nomt with ' 00 (postpositional word)', and 26% are the 
nouns palm constructing compound words. 
3.3 Co-occurrence Pattern Matrix 
With the vim of constructing a rel,iM)le WCP dict\]ol!.a.ry~ 
the use of individual words, is impnu:t\]cal, l)ocall~,c the dic 
tionary becomes too large. Semantic categorie~ an. useful be 
cause, if word A and B are synonyms, they will have ~;imihn' 
eo-occurrence pal;terns to other words. 'J'lds allows I;he WCP 
dictionary, d¢>;cribed in scmanl:ie categories, i;~) be greatly rc 
dueed in size. Scores of semantic categories were d.evelop~xi, 
however, it was flmml' t~ihat the munl)~:r ef these categori~; 
was ~oo smMl' to aeeuraWly describe word rel'atiol~,hips, l'br 
tunately, i;hcre is a Japimes,; thesaurus IOhno82\] with \] ,000 
semantic cal, egories. Based on the 305,000 wets ofWCP (h)ota 
a. 2-dimensional matrix was devch)l,)ed which indicates co 
occu.rren.ee patterns in semantic categories \[ohno,~I2\]. 
\]?ig.2 shows an image of this matrix. In this matrix, word 
pairs which have same semantic categories lm.ve high co oc 
currence possibil,ity. The words incl'uded in the categc, ie~; 
indicating 'action' and 'mow~ment' etc. are the .p;ow.'rnor in a. 
co. occurrence re.l'ationship with various words as their depe~t 
dm~f.. 
U ~d 
1 
1 
Position l 1 
. Quantity 
1 
1 Person 
,l 
i 
1 
(~ Ovo, ruor 
Q) 
.~ 
\[111 1 
11 1 11 
1 l I1 
1111111 
111 i1 
1111111 
11 1111 
1 1 
1 
1 1 
lllllll 
I 
1 111 
iII II 
ll 1 
11t 1 
llI 
lll 
1 l 1\[11 ill 
I I 1 
1 lI \[ 1 
l 
1 
11 
1 
I1 
1 
l 11 
1 
1 
1 1 
1 1 
1 
Fig. 2 An image of WCP matrix 
-/7\] 
4. Translation Algorlthm 
Fig.3 shows the translation process outline. First, table- 
searching is done for all segmentation possibilities to get each 
part-of-speech of segment. This' is carried out referring to 
independent word •dictionary (nouns, verbs, adjectives, etc. 
\[65,000 words\]), prefix and suffix dictionary \[1085 words\], de- 
pendent word dictionary (postpositions, auxiliary verbs, etc. 
\[422 words D, Then, among the morpheme sequences con- 
structed with each segment, the grammatically possible se- 
quences are selected. 
Next, the candidate sentences with the least number of 
Bunsetsu are selected \[Yoshimura83\]. Furthermore, among 
tt~ese selected sentences, those which have the least number of 
words are selected. In this process, a heuristic rule is used to 
prevent morPheme sequence mis-selection. This rule rejects 
the combinations of nouns constructing a compound word, if 
the usage frequency of either nouns is very low. 
ex, 
Input Kana sequence: 75~/b I;~ ~ 03 ~2 ~ \[kankeinonaka\] 
x ~ (noun) ~ (noun, freq. : very low) 
(a relation) (in a field) 
O ~ (noun) o) (postposition) OO (noun, freq. : high) 
(a relation) (among) 
Secondly, the co-occurrence pattern matrix is utilized in 
order to determine the number of WCP within each candi- 
date sentence. The counting operation is carried out only on 
adjacent Bunsetsu, because , in most eases, relationships are 
between adjacent Bunsetsu and determining extended rela- 
tionships would prove to be too time-consuming. 
Finally, the cand{date sentence with the maximum WCP 
number is chosen as the prime candidate. To prevent mis- 
taken deletion of prime candidates caused by word pairs which 
rarely co-occur, following rule is used. If the usage frequency 
of either word in WCP is low, the,candidate sentences of 
which WCP number is less one than maximum number, are 
also identified as prime candidates. In following example, 
both are identified as prime candidates. 
¢X 
Input i{ana sequence: ,~/v b i ") O) C 5 ~ ~ 
\[bunshounokonsei i 
03 ~___~ (freq. : low) 
(a sentence) (p~oofreading) 
I .... ' 
WCP 
0) ~(freq. : high) 
(a senttenee) (composition) 
not WCP 
0~ 
__\]" \[ Candidate evaluation. 
Morphological 
analysis Collocations! analysis 
i ................................ i Dictionaries 
i (66,500 words) i 
Fig. 3 Translation process outline 
772 
5. Translation Experimental it~es-ifl~ 
About four hundred test sentences were used '.';o evahm.tc 
the accuracy of eollocational analysis. The mean m/robe,: ~,{; 
candidate sentences was 62.6, selected by considering !cu~. 
number of Bunsetsu. Error ratio for ehis was 1,7%. ~,;~rox 
ratio means the proportion of correct Hunsetsu mi.'~sc~"t !:.;, 
the selecting operation in each process to total nm~iber o( ~,.I} 
Bunsetsu. The mean number of candidate se~tencc.~ ~c!ee~cd 
by least number of words was 1.6.1 with a~i erJ:or r:,~i:'~ ~S 
0.8%. Finatly, the nmnber d candidate sentences selected by 
collocational analysis method was thrther reduced to 6.4 wil;b 
an error ratio of 1.6%, 
Furthermore, translation accuracy of the praci;ica( tr~,~a.~',l;~. 
tion algorithm based on the above description was c'xanfi~led 
using 10 leading articles in news papers(about 14,000 clm~!> 
acters). This practical algorithm was modified J))r proce~.,;i~.~.g 
proper nouns, numerals and symbols, a~M to sa~e memory 
It was confirmed that the translation accuracy evaluated by 
character unit of this method was higher thaxt 95%. 
6. Conclusion 
A method for disambiguation based on colloeal;ional anal 
ysis of non-segmented Kana-to-Kanji translation has be(m de- 
veloped. To realize this, an automatic WCP dictionary coi c.- 
pilation method has also been developed. In an extractio~ 
experiment using example sentences fl'om dictionm'ie,q, it wm~ 
confirmed that WCP can be collected automatically wiflt a 
98.7% accuracy using syntactic anMysis and some heu.cistie 
rules to eliminate errors. Using this method, about 305,000 
sets of WCP were collected. The co-occurrence patterrt m~ 
trix was built based on these WCP mid used in b'artslat.ion 
experiments. 
Experimental results show that tim mean umnber of can. 
didate sentences is reduced to about 1/10 of those fl:om exist- 
ing morphological methods and that a translatitm acem'~my ~i 
• 1 95% can be achieved. The collocatioual analysis met\[ ou c~.o 
also be applied to Japanese text input by speech reeog~dU~ .... 

References

Abe, M., et al.(1986), "A Kana-Kanji Translation Systeul for 
Non-Segmented Input Sentences Based on Syntactic ~:,~.cl 
Semantic Analysis", Proceeding of COLING86~ 28(I-285 

Grishman, R., et al.(1986), "Discovery Procedures for sub. 
language SelectionM Patterns", Computational Ling0is- 
tics, vo1.12, no.3,205.-215 

Kawada, t., et al.(1979), "Japanese Word Processor JW-10"~ 
Proceeding of COMPCOM'79 fall, 238-242 

Makino, H., et al. (1980), "An Automatic Translation 
system of Non-segmented Kana Senteimes i~lto Kanji~ 
Kalm sentences", Proceeding of COLING, 2954~02 

Masuda, K., et al. (1983), "Kenkynsya's New Japanese 
English Dictionary", Kenkyusya, Tokyo 

Ohno, S., et al.(1982), "New Synonyms Dict, ionary ~ (is 
Japanese), Kadokuwa-syoten, Tokyo 

Shirai, K., et a1.(1986). "Linguistic Knowledge Extra(:- 
lion from Real Language Behavior? ~ , Proceeding of 
COLING86, 253-255 

Tanaka, Y., et a1.(1986), "Acquisition of Knowle@y D~i,~ 
by Analyzing Natural Language", Proceedlug of CO-- 
LING86, 448-450 

Yoshimura, K., et al.(1983), "Morphological Am~lysi~; af 
Nonmarked-off Japanese Sentences by the lc,~t BUN-. 
SETSU's Number Method", Johoshori, Vol.24, No.l, 44.46 
