Processing Homonyms in the Kana-to-Kanji Conversion 
Masahito Takahashi
Fukuoka University
8-19-1, Nanakuma,
Jonan-ku, Fukuoka,
814-01, Japan
takahasi@helio.tl.fukuoka-u.ac.jp

Tsuyoshi Shinchu
Fukuoka University
8-19-1, Nanakuma,
Jonan-ku, Fukuoka,
814-01, Japan
shinchu@helio.tl.fukuoka-u.ac.jp

Kenji Yoshimura
Fukuoka University
8-19-1, Nanakuma,
Jonan-ku, Fukuoka,
814-01, Japan
yosimura@tlsun.tl.fukuoka-u.ac.jp

Kosho Shudo
Fukuoka University
8-19-1, Nanakuma,
Jonan-ku, Fukuoka,
814-01, Japan
shudo@tlsun.tl.fukuoka-u.ac.jp
Abstract 
This paper proposes two new methods to identify the correct meaning of Japanese homonyms in text based on the noun-verb co-occurrence in a sentence, which can be obtained easily from corpora. The first method uses the near co-occurrence data sets, which are constructed from the above co-occurrence relation, to select the most feasible word among homonyms in the scope of a sentence. The second uses the far co-occurrence data sets, which are constructed dynamically from the near co-occurrence data sets in the course of processing input sentences, to select the most feasible word among homonyms in the scope of a sequence of sentences. An experiment on kana-to-kanji (phonogram-to-ideograph) conversion has shown that the conversion is carried out at an accuracy rate of 79.6% per word by the first method. This accuracy rate of our method is 7.4% higher than that of the ordinary method based on word occurrence frequency.
1 Introduction 
Processing homonyms, i.e. identifying the correct meaning of homonyms in text, is one of the most important phases of kana-to-kanji conversion, currently the most popular method for inputting Japanese characters into a computer. Recently, several new methods for processing homonyms, based on neural networks (Kobayashi, 1992) or on the co-occurrence relation of words (Yamamoto, 1992), have been proposed. These methods apply the co-occurrence relation of words not only in a sentence but also in a sequence of sentences. It appears impracticable, however, to prepare a neural network for co-occurrence data large enough to handle 50,000 to 100,000 Japanese words.

In this paper, we propose two new methods for processing Japanese homonyms based on the co-occurrence relation between a noun and a verb in a sentence. We have defined two co-occurrence data sets. One is a set of nouns accompanied by a case marking particle, each element of which has a set of co-occurring verbs in a sentence. The other is a set of verbs accompanied by a case marking particle, each element of which has a set of co-occurring nouns in a sentence. We call these two co-occurrence data sets near co-occurrence data sets. Thereafter, we apply the data sets to the processing of homonyms. Two strategies are used to approach the problem. The first uses the near co-occurrence data sets to select the most feasible word among homonyms in the scope of a sentence. The aim is to evaluate the possible existence of a near co-occurrence relation, or co-occurrence relation between a noun and a verb within a sentence. The second evaluates the possible existence of a far co-occurrence relation, referring to a co-occurrence relation among words in different sentences. This is achieved by constructing far co-occurrence data sets from near co-occurrence data sets in the course of processing input sentences.
2 Co-occurrence data sets 
The near co-occurrence data sets are defined as follows. The first near co-occurrence data set is the set E_N,near, each element of which (n) is a triplet consisting of a noun, a case marking particle, and a set of verbs which co-occur with that noun and particle pair in a sentence:

n = (noun, particle, {(v1, k1), (v2, k2), ...})

In this description, particle is a Japanese case marking particle, such as が (nominative case), を (accusative case), or に (dative case), vi (i = 1, 2, ...) is a verb, and ki (i = 1, 2, ...) is the frequency of occurrence of the combination of noun, particle and vi, which is determined in the course of constructing E_N,near from corpora. The following are examples of the elements of E_N,near.
(雨 (rain), が (nominative case),
 {(降る (fall), 10), (止む (stop), 3), ...})
(雨 (rain), を (accusative case),
 {(警戒する (take precautions), 3), ...})
The second near co-occurrence data set is the set E_V,near, each element of which (v) is a triplet consisting of a verb, a case marking particle, and a set of nouns which co-occur with that verb and particle pair in a sentence:

v = (verb, particle, {(n1, l1), (n2, l2), ...})

In this description, particle is a Japanese case marking particle, ni (i = 1, 2, ...) is a noun, and li (i = 1, 2, ...) is the frequency of occurrence of the combination of verb, particle and ni. The following are examples of the elements of E_V,near. E_V,near can be constructed from E_N,near, and vice versa.

(降る (fall), が (nominative case),
 {(雨 (rain), 10), (雪 (snow), 8), ...})
(降る (fall), に (dative case),
 {(九州 (Kyushu), 1), ...})
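For concreteness, the following Python sketch shows one possible in-memory representation of E_N,near and E_V,near built from (noun, particle, verb, frequency) records. It is only an illustration for exposition; the function and variable names, and the sample records with English glosses in place of the kanji, are chosen here and are not the implementation used in the experiments.

    from collections import defaultdict

    def build_near_sets(records):
        """records: iterable of (noun, particle, verb, frequency) tuples."""
        e_n_near = defaultdict(dict)   # (noun, particle) -> {verb: frequency}
        e_v_near = defaultdict(dict)   # (verb, particle) -> {noun: frequency}
        for noun, particle, verb, freq in records:
            e_n_near[(noun, particle)][verb] = e_n_near[(noun, particle)].get(verb, 0) + freq
            e_v_near[(verb, particle)][noun] = e_v_near[(verb, particle)].get(noun, 0) + freq
        return e_n_near, e_v_near

    # The "rain" examples above, with English glosses standing in for the kanji:
    records = [("rain", "ga", "fall", 10), ("rain", "ga", "stop", 3),
               ("rain", "wo", "take precautions", 3)]
    E_N_near, E_V_near = build_near_sets(records)
    # E_N_near[("rain", "ga")] == {"fall": 10, "stop": 3}

Since both mappings are filled from the same records, either near co-occurrence data set can be derived from the other, as noted above.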
3 Processing homonyms in a 
simple sentence 
Using the near co-occurrence data sets, the most feasible word among possible homonyms can be selected within the scope of a sentence. Our hypothesis states that the most feasible noun or combination of nouns has the largest number of verbs with which it can co-occur in a sentence.

The structure of an input Japanese sentence written in kana characters can be simplified as follows:

N1 · P1, N2 · P2, ..., Nm · Pm, V

where Ni (i = 1, 2, ..., m) is a noun, Pi (i = 1, 2, ..., m) is a particle and V is a verb.
3.1 Procedure 
The following is the procedure for finding the most feasible combination of words for an input kana-string which has the above simplified Japanese sentence structure. This procedure can also accept an input kana-string which does not include a final position verb.

Step1 Let m = 0 and let each Ti (i = 1, 2, ...) be empty.

Step2 If the input kana-string is null, go to Step4. Otherwise read one block of kana-string, that is N · P or V, from the left side of the input kana-string, and delete that block from the left side of the input kana-string.

Step3 Find all homonymic kanji-variants Wk (k = 1, 2, ...) for the kana-string N or V which is read in Step2. Increase m by 1. For each Wk (k = 1, 2, ...):
1. If Wk is a noun, retrieve (Wk, P, Vk) from the near co-occurrence data set E_N,near and add the doublet (Wk, Vk) to Tm.
2. If Wk is a verb, add the doublet (Wk, {(Wk, 0)}) to Tm.
Go to Step2.

Step4 From Ti (i = 1, 2, ..., m), find the combination

(W1, V1), (W2, V2), ..., (Wm, Vm),  (Wi, Vi) ∈ Ti (i = 1, 2, ..., m)

which has the largest value of |∩(V1, V2, ..., Vm)|, where the function ∩(V1, V2, ..., Vm) is defined as follows:

∩(V1, V2, ..., Vm) = {(v, Σ_{i=1}^{m} ki) | (v, k1) ∈ V1 ∧ ... ∧ (v, km) ∈ Vm}

and |∩(V1, V2, ..., Vm)| is defined as:

|∩(V1, V2, ..., Vm)| = Σ_{(v, k) ∈ ∩(V1, V2, ..., Vm)} k

The sequence of words W1, W2, ..., Wm is the most feasible combination of homonymic kanji-variants for the input kana-string.
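As a concrete illustration of Step4, the following Python sketch scores every combination of homonym candidates by |∩(V1, ..., Vm)| and keeps the best one. It is a minimal sketch for exposition (the function names are chosen here), assuming each Vi is held as a dictionary mapping verbs to their frequencies.

    from itertools import product

    def intersection_score(verb_sets):
        # |intersection(V1, ..., Vm)|: sum, over the verbs common to all sets,
        # of that verb's frequency in every set.
        if not verb_sets:
            return 0
        common = set(verb_sets[0])
        for vs in verb_sets[1:]:
            common &= set(vs)
        return sum(vs[v] for v in common for vs in verb_sets)

    def select_combination(T):
        # T: the list T_1, ..., T_m, each a list of (candidate, verb_set) doublets.
        best_words, best_score = None, -1
        for combo in product(*T):
            score = intersection_score([vs for _, vs in combo])
            if score > best_score:
                best_words, best_score = [w for w, _ in combo], score
        return best_words, best_score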
3.2 An example of processing homonyms 
in a simple sentence 
Following is an example of homonym processing using the above procedure, for the input kana-string

"かわにはしを (kawa ni hashi o)"

Here, "かわ (kawa)" means a river and "はし (hashi)" means a bridge. Both "かわ (kawa)" and "はし (hashi)" have homonymic kanji-variants:

homonyms of "かわ (kawa)": 川 (river), 革 (leather)
homonyms of "はし (hashi)": 橋 (bridge), 箸 (chopsticks)

The near co-occurrence data for "川 (river)" and "革 (leather)" followed by the particle "に (dative case)", and the near co-occurrence data for "橋 (bridge)" and "箸 (chopsticks)" followed by the particle "を (accusative case)", are shown below.

(川 (river), に, {(行く (go), 8),
                  (架ける (build), 6),
                  (落とす (drop), 5)})
(革 (leather), に, {(塗る (paint), 6),
                    (触る (touch), 3)})
(橋 (bridge), を, {(渡る (walk across), 9),
                   (架ける (build), 7),
                   (落とす (drop), 4)})
(箸 (chopsticks), を, {(使う (use), 7),
                       (落とす (drop), 3)})

Following the procedure, the resultant values of |∩(V1, V2)| for the candidate combinations are:

川 (river) に + 橋 (bridge) を       → 22  {架ける (build), 落とす (drop)}
川 (river) に + 箸 (chopsticks) を   → 8   {落とす (drop)}
革 (leather) に + 橋 (bridge) を     → 0   {}
革 (leather) に + 箸 (chopsticks) を → 0   {}

Therefore, the most feasible combination of words is "川 (river) に 橋 (bridge) を".
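For reference, running the sketch given at the end of Section 3.1 on this data (with English glosses standing in for the kanji candidates) reproduces the same selection:

    T = [
        [("river",      {"go": 8, "build": 6, "drop": 5}),
         ("leather",    {"paint": 6, "touch": 3})],
        [("bridge",     {"walk across": 9, "build": 7, "drop": 4}),
         ("chopsticks", {"use": 7, "drop": 3})],
    ]
    print(select_combination(T))   # -> (['river', 'bridge'], 22)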
4 An experiment on processing 
homonyms in a simple sentence
4.1 Preparing a dictionary and a 
co-occurrence data file 
4.1.1 a noun file 
A noun file including 323 nouns, which consists of 190 nouns extracted from text concerning current topics and their 133 homonyms, was prepared.
4.1.2 a co-occurrence data file 
A co-occurrence data file was prepared. The record format of the file is specified as follows:

[noun, case marking particle, verb, the frequency of occurrence]

where the case marking particle is chosen from 8 kinds of particles, namely が, を, に, で, と, から, より, and へ. The file includes 25,665 records of co-occurrence relations (79 records per noun) for the nouns in the noun file, obtained by merging 11,294 records from the EDR Co-occurrence Dictionary (EDR, 1994) with 15,856 records from handmade simple sentences.
4.1.3 an input file and an answer file 
An input file for the experiment, which includes 1,129 simple sentences written in the kana alphabet, and an answer file, which includes the same 1,129 sentences written in kanji characters, were prepared. Here, every noun of the sentences in the files was chosen from the noun file.
4.1.4 a word dictionary 
A word dictionary, which consists of the 323 nouns in the noun file and 23,912 verbs in a Japanese dictionary for kana-to-kanji conversion (a dictionary made by AI Soft Co.), was prepared. It is used to find all homonymic kanji-variants for each noun or verb of the sentences in the input file.
4.2 Experiment results 
An experiment on processing homonyms in a simple sentence was carried out. In this experiment, kana-to-kanji conversion was applied to each of the sentences, or input kana-strings, in the above input file, and the near co-occurrence data sets were constructed from the above co-occurrence data file. Table 1 shows the results of kana-to-kanji conversion in the following two cases. In the first case, an input kana-string does not include a final position verb; that is, each verb of the kana-strings in the input file is neglected. In the second case, an input kana-string includes a final position verb. The experiment has shown that the conversion is carried out at an accuracy rate of 79.6% per word, where the conversion rate is 93.1% per word, in the first case. In the same way, the accuracy rate is 93.8% per word, where the conversion rate is 14.5% per word, in the second case. We also conducted the same experiment using the method based on word occurrence frequency, in order to compare our method with an ordinary method. It shows that the accuracy rate of that method is 72.2% per word in the first case and 77.8% per word in the second case. The accuracy rate of our method is thus 7.4% higher in the first case and 16.9% higher in the second case than that of the ordinary method. This clarifies that our method is more effective than the ordinary method based on word occurrence frequency.
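For clarity, the per-word rates in Table 1 can be read as follows (this is our reading of the table, not a formula stated elsewhere in the paper): the conversion rate is the fraction of words for which a conversion was produced, and the accuracy rate is the fraction of converted words that are correct.

    def per_word_rates(total_words, converted_words, correct_words):
        # Assumed definitions: conversion rate = converted / total,
        # accuracy rate = correct / converted (denominators as in Table 1).
        return converted_words / total_words, correct_words / converted_words

    # First case of Table 1 (sentences without a verb):
    print(per_word_rates(2315, 2155, 1716))   # -> approximately (0.931, 0.796)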
5 An approximation of the far 
co-occurrence relation 
We can approximate the far co-occurrence relation, which is the co-occurrence relation among words in a sequence of sentences, from the near co-occurrence data sets. The far co-occurrence data sets are described as follows:

E_N,far = {(n1, t1), (n2, t2), ..., (n_ln, t_ln)}
E_V,far = {(v1, u1), (v2, u2), ..., (v_lv, u_lv)}

where ni (i = 1, 2, ..., ln) is a noun, ti is the priority value of ni, vi (i = 1, 2, ..., lv) is a verb and ui is the priority value of vi.
The procedure for producing the far co-occurrence data sets is:

Step1 Clear the far co-occurrence data sets:
E_N,far = ∅, E_V,far = ∅

Step2 After each fixing of a noun N among homonyms in the process of kana-to-kanji conversion, renew the far co-occurrence data sets E_N,far and E_V,far by following these steps:

1. Change all priority values ti (i = 1, 2, ..., ln) in the set E_N,far to f(ti) (for example, f(ti) = 0.95 ti). This process is intended to decrease priority with the passage of time.
Table 1: Experiment results on processing homonyms in a simple sentence.

                               For sentences       For sentences
                               without a verb      with a verb
Conversion rate per sentence   1053/1129 (93.3%)   166/1129 (14.7%)
Accuracy rate per sentence     663/1053 (63.0%)    136/166 (81.9%)
Conversion rate per word       2155/2315 (93.1%)   500/3444 (14.5%)
Accuracy rate per word         1716/2155 (79.6%)   469/500 (93.8%)
2. Change all priority values ui (i = 1, 2, ..., lv) in the set E_V,far to f(ui) as well.

3. Let N be the noun determined in the process of kana-to-kanji conversion. Find all
(vi, ki) (i = 1, 2, ..., q)
which co-occur with the noun N followed by any particle in the near co-occurrence data set E_N,near. Add new elements
(vi, g(ki)) (i = 1, 2, ..., q)
to the set E_V,far. If an element with the same verb vi already exists in E_V,far, add the value g(ki) to the priority value of that element instead of adding the new element. Here, g(ki) is a function for converting frequency of occurrence to priority value. For example,
g(ki) = 1 - (1/ki)

4. Let vi be a verb described in the previous step. Find all
(nj, lj) (j = 1, 2, ..., q)
which co-occur with the verb vi and any particle in the near co-occurrence data set E_V,near. Add new elements
(nj, h(ki, lj)) (j = 1, 2, ..., q)
to the set E_N,far. If an element with the same noun nj already exists in E_N,far, add the value h(ki, lj) to the priority value of that element instead of adding the new element. Here, h(ki, lj) is a function for converting frequency of occurrence to priority value. For example,
h(ki, lj) = g(ki)(1 - (1/lj))
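A minimal Python sketch of one pass of this update procedure is given below. It assumes the dictionary-based representation of E_N,near and E_V,near used in the earlier sketches, holds E_N,far and E_V,far as simple word-to-priority maps, and uses the example functions f, g and h above; the names are chosen for exposition, not taken from the actual system.

    def f(t):                  # example decay function: f(t) = 0.95 t
        return 0.95 * t

    def g(k):                  # example conversion: g(k) = 1 - (1/k)
        return 1.0 - 1.0 / k

    def h(k, l):               # example conversion: h(k, l) = g(k)(1 - (1/l))
        return g(k) * (1.0 - 1.0 / l)

    def update_far_sets(N, e_n_near, e_v_near, e_n_far, e_v_far):
        # Sub-steps 1 and 2: decay every existing priority value.
        for far in (e_n_far, e_v_far):
            for w in far:
                far[w] = f(far[w])
        # Sub-step 3: verbs co-occurring with N (under any particle).
        for (noun, particle), verbs in e_n_near.items():
            if noun != N:
                continue
            for v, k in verbs.items():
                e_v_far[v] = e_v_far.get(v, 0.0) + g(k)
                # Sub-step 4: nouns co-occurring with each such verb.
                for (verb, _), nouns in e_v_near.items():
                    if verb != v:
                        continue
                    for n, l in nouns.items():
                        e_n_far[n] = e_n_far.get(n, 0.0) + h(k, l)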
6 Processing homonyms in a 
sequence of sentences 
Using the far co-occurrence data sets defined in the previous section, the most feasible word among homonyms can be selected in the scope of a sequence of sentences according to the following two cases.

Case1 The input word written in kana-characters is a noun.

Case2 The input word written in kana-characters is a verb.
6.1 Procedure for case1
Step1 Find the set Sn:
Sn = {(N1, T1), (N2, T2), ...}
where Ni (i = 1, 2, ...) is a homonymic kanji-variant for the input word written in kana-characters and Ti is the priority value for homonym Ni, which can be retrieved from the far co-occurrence data set E_N,far.

Step2 The noun Ni which has the greatest priority value Ti in Sn is the most feasible noun for the input word written in kana-characters.
6.2 Procedure for case2 
Step1 Find the set Sv:
Sv = {(V1, U1), (V2, U2), ...}
Here, Vj (j = 1, 2, ...) is a homonymic kanji-variant for the input word written in kana-characters and Uj is the priority value for homonym Vj, which can be retrieved from the far co-occurrence data set E_V,far.

Step2 The verb Vj which has the greatest priority value Uj in Sv is the most feasible verb for the input word written in kana-characters.
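Both procedures amount to taking, among the homonymic kanji-variants of the input kana word, the variant with the greatest priority value in the relevant far co-occurrence data set. A minimal sketch follows (treating candidates absent from the data set as having priority 0 is our assumption, not a rule stated above):

    def select_by_far_set(candidates, far_set):
        # candidates: homonymic kanji-variants of the input kana word
        #             (nouns for case 1, verbs for case 2).
        # far_set:    E_N,far or E_V,far as a {word: priority} map.
        s = [(w, far_set.get(w, 0.0)) for w in candidates]   # the set S_n or S_v
        return max(s, key=lambda pair: pair[1])[0]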
7 Conclusion 
We have proposed two new methods for processing Japanese homonyms based on the co-occurrence relation between a noun and a verb in a sentence, which can be obtained easily from corpora. Using these methods, we can evaluate the co-occurrence relation of words in a simple sentence by using the near co-occurrence data sets obtained from corpora. We can also evaluate the co-occurrence relation of words in different sentences by using the far co-occurrence data sets constructed from the near co-occurrence data sets in the course of processing input sentences. The far co-occurrence data sets are based on the proposition that it is more practical to maintain a relatively small amount of data on the semantic relations between words, changed dynamically in the course of processing, than to maintain a huge universal "thesaurus" database, which does not appear to have been built successfully.

An experiment on kana-to-kanji conversion by the first method for 1,129 input simple sentences has shown that the conversion is carried out for 93.1% of words and that the accuracy rate is 79.6% per word. This clarifies that the first method is more effective than the ordinary method based on word occurrence frequency.

In the next stage of our study, we intend to evaluate the second method, based on the far co-occurrence data sets, by conducting experiments.
References 
Kobayashi, T., et al. 1992. Realization of Kana-to-Kanji Conversion Using Neural Networks. Toshiba Review, Vol. 47, No. 11, pages 868-870, Japan.
Yamamoto, K., et al. 1992. Kana-to-Kanji Conversion Using Co-occurrence Groups. Proc. of the 44th Conference of IPSJ, 4p-11, pages 189-190, Japan.
EDR. 1994. Co-occurrence Dictionary Ver.2, TR-043, Japan.
