AN AUTOMATIC PROCESSING OF THE NATURAL LANGUAGE 
IN THE WORD COUNT SYSTEM 
HIROSHI NAKANO, SHIN'ICHI TSUCHIYA, AKIO TSURUOKA 
THE NATIONAL LANGUAGE RESEARCH INSTITUTE 
3-9-14, NISHIGAOKA, KITAKU, TOKYO, JAPAN 
Summary 
We succeeded in making a program having 
the following four functions: 
1. segmenting the Japanese sentence 
2. transliterating from Chinese 
characters (called Kanji in Japanese) 
to the Japanese syllabary 
(kana) or to Roman letters 
3. classifying the parts-of-speech in 
the Japanese vocabulary 
4. making a concordance 
We are using this program for the pre- 
editing of surveys of Japanese vocabu- 
lary. 
In Japanese writing we use many kinds of 
writing systems, i.e. Kanji, kana, the 
alphabet, numerals, and so on. We have 
thought of this as a disadvantage in 
language data processing. But we can turn 
this disadvantage into an advantage. That 
is, we can make good use of these many 
writing systems in our program. 
Our program needs only a small table of 
300 entries, and it is very fast. 
In our experiments we have obtained 
approximately 90% correct answers. 
Introduction 
Obtaining clean data is very important 
in language data processing. There are 
two problems here. One is how to input 
the Japanese text and the other is how 
to find errors in the data and correct 
them. The human being is suited to com- 
plicated work but not to simple work. 
The machine, on the contrary, is suited 
to simple work but not to complicated 
work. In the word count system using 
computers, the machine has simple work 
(sorting, computation, making a list), 
and the humans have complicated work 
(segmentation, transliteration from 
Kanji to kana, classification of parts 
of speech, finding errors in the data, 
discrimination of homonyms and 
homographs, etc.). 
However, in this system there is one 
major problem -- humans often make mis- 
takes. And, regrettably, we cannot pre- 
dict where they will make them. Thus we 
decided to make an automatic processing 
system. This system has to be compact, 
fast, and over 90% accurate. 
In Japanese writing we generally use 
many kinds of writing systems. 
For example, 
In this example sentence we find the 
alphabet (C, O, L, I, N, G), numerals 
(8, 0), kana (hiragana, the Japanese 
cursive syllabary, and katakana, the 
Japanese straight-lined syllabary), 
Kanji, and signs (.). And as you can see, there 
are no spaces left between words. This 
makes Japanese data processing difficult. 
Our program makes good use of these dif- 
ferent elements in the writing system. 
At present the automatic processing pro- 
gram makes more mistakes than humans do. 
But we can predict where it will make 
them and easily correct errors in the 
data. 
Objective 
Our objective is a system having the 
following functions: 
1. segmentation 
2. transliteration from Kanji to kana 
3. classification of parts of speech 
4. adding lexical information by use 
of a dictionary 
5. making a concordance 
6. making a word list 
Numbers 1, 2, and 3 are especially 
important for our program. Our report will 
mainly deal with these three functions. 
The input data is generally a text writ- 
ten in Japanese. The output is a con- 
cordance sorted in the Japanese alpha- 
betical order, giving information of the 
parts of speech, and marked with a the- 
saurus number. 
System 
Figure 1 is a flow chart of our program. 
Input is by magnetic tape, paper tape, 
or card. The input code is the NLRI 
(National Language Research Institute) 
code or some other code. Of course we 
have a code conversion program from other 
codes to the NLRI code. 
The second block of Figure 1 shows what 
we call the automatic processing of nat- 
ural language. In the supervisor square 
we check and select the results of the 
three automatic processing programs. 
Some of these programs involve several 
kinds of natural language processing. 
For example, the automatic segmentation 
program involves the classification of 
parts of speech, automatic syntactic 
analysis, automatic transliteration from 
Kanji to kana, and so on. (An example 
will be found in the next section.) 
In the adding lexical information block 
of Figure 1, we make use of the dictionary 
obtained by research into some 5 
million words at the NLRI. This diction- 
ary includes word frequencies, parts of 
speech, classes by word origin, and 
a thesaurus number. 
By using the concordance we can find and 
correct errors in the data. As our 
program is unfortunately not perfect, 
this concordance is very useful. 
In the output block of Figure 1 we can 
choose a variety of output devices: an 
alphabet line printer, a kana line 
printer, a high-speed Kanji printer, or 
a Kanji display. 
Method 
1. Automatic transliteration from Kanji 
to Roman letters 
The Chinese characters have many differ- 
ent readings in Japanese. For example, 
生 /sei/ /syo/ /um-/ /iki/ /nama/ /ai/ 
立 /tachi/ /tatsu/ /tate/ /dachi/ 
/ritsu/ /rittoru/ 
一 /ichi/ /itsu/ /kazu/ /hajime/ /hito/ 
We have to arrange the Japanese words in 
Japanese alphabetical order, so the 
program attaches a reading to each 
word for the word list. 
The method of selecting the reading is 
to choose it in accordance with the 
surroundings of the Kanji in the text. 
The possible readings for each Kanji are 
listed in a small table. The records in 
this table are of three types, Groups 1, 
2, and 3, represented by Nos. 1; 2 and 3; 
and 4, 5, and 6 respectively in Figure 2. 
The Kanji in Group 1 have one reading 
each. The program replaces the Kanji 
with this reading. In Figure 2, No. 1 
falls into this category. We have about 
700 Kanji in Group 1. 
The Kanji in Group 2 have two or more 
readings each. In Figure 2, Nos. 2 and 3 
fall into this category. 
The format for these entries is group 
number, the Kanji, the operation code (a 
numeral or capital letter), and the 
reading (up to 8 small letters). 
The appropriate reading is chosen for 
the situation of the Kanji in accordance 
with Table 1. 
situation              operation letter 
front      behind      A 1 B 2 C 3 D 4 E 5 F 6 G 7 H 8 
non-Kanji  non-Kanji   0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 
non-Kanji  Kanji       1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 
Kanji      non-Kanji   1 0 1 0 0 1 0 1 1 0 1 0 0 1 0 1 
Kanji      Kanji       1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 
0: replace the Kanji with the reading in the table 
Table 1. Operation of situation 
Figure 1. A flow chart 
Figure 2. Table of Kanji reading (six 
sample records, Nos. 1 to 6) 
     (Input)        (Output) 
(1)  校歌を歌う。    KO#UKAWO#UTA#U. 
(2)  川で泳ぐ。      KAWADE#OYOGU. 
Figure 3. Result of experimentation 
Figure 3 gives a sample of the results 
of our experiments. The Kanji /歌/ in no. 
1 here is a Group 2 Kanji. Its situation 
in the context /校歌を/ is that in front 
of it is the Kanji /校/ and behind it is 
the non-Kanji /を/. When the context is 
Kanji + non-Kanji, the program selects 
reading 1 /ka/. The situation of /歌/ in 
the context /を歌う/ is non-Kanji + 
non-Kanji, so the reading A /#uta/ is 
selected. As a result /校歌を歌う/ is 
transliterated to /ko#ukawo#uta#u/. 
Group 2 contains 1500 Chinese characters. 
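
The selection by operation code can be sketched as follows. This is only a minimal illustration, not the authors' program: the bit rows are those of Table 1, while the mini reading table, the kana romanization map, the use of code H for the single-reading Kanji, and the sample string (an assumed reconstruction consistent with the output /ko#ukawo#uta#u/) are assumptions made for the example.

```python
# Table 1: "0" means "use this reading in this context", "1" means "skip it".
# Keys are (front_is_kanji, behind_is_kanji).
CODES = "A1B2C3D4E5F6G7H8"
TABLE_1 = {
    (False, False): "0101010101010101",
    (False, True):  "1001100110011001",
    (True,  False): "1010010110100101",
    (True,  True):  "1010101001010101",
}

# Hypothetical mini reading table: Kanji -> list of (operation code, reading).
READINGS = {
    "校": [("H", "ko#u")],                # one reading; H applies in every context
    "歌": [("1", "ka"), ("A", "#uta")],   # Group 2 style entry
}
# Hypothetical kana-to-Roman map covering only the sample sentence.
KANA = {"を": "wo", "う": "#u"}


def applies(op_code, front_is_kanji, behind_is_kanji):
    """Look up one operation code in Table 1 for the given context."""
    row = TABLE_1[(front_is_kanji, behind_is_kanji)]
    return row[CODES.index(op_code)] == "0"


def transliterate(text):
    out = []
    for i, ch in enumerate(text):
        if ch in KANA:
            out.append(KANA[ch])
            continue
        if ch not in READINGS:
            out.append(ch)  # leave unknown characters untouched
            continue
        front = i > 0 and text[i - 1] in READINGS
        behind = i + 1 < len(text) and text[i + 1] in READINGS
        for op_code, reading in READINGS[ch]:
            if applies(op_code, front, behind):
                out.append(reading)
                break
        else:
            out.append(ch)  # no reading applies in this context
    return "".join(out)


print(transliterate("校歌を歌う"))
```

Note that in Table 1 each letter code is the complement of the numeral that follows it, so code 8 never applies by itself and code H always applies.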
The Kanji in Group 3 have a special 
reading in a special context in addition 
to their regular readings. In Figure 2, 
Nos. 4, 5, and 6 are in this group. In 
Figure 3, /川/ in No. 2 can be processed 
without a special reading, but in no. 3 
the special reading is needed. To obtain 
this reading, the special context after 
the sign * is applied. The format, 
as in Figure 2, no. 4, is group number 
(3), Kanji (川), reading number (1, 2), 
operation code (8, H), reading, sign (*), 
code for front or behind (M, N), context 
Kanji, and applied reading number (1, 
1). 
Field                      Length             (e.g.) 
Group number               1 letter           3 
Kanji                      1 letter           川 
Reading number             1 letter           1, 2 
Operation letter           1 letter           8, H 
Reading                    8 small letters    SENN, KAWA 
Sign                       1 letter           * 
Sign of front or behind    1 letter           M, N 
Character                  1 letter 
Applied reading number     1 letter           1, 1 
In this case reading number 1 is applied 
because the context Kanji is found in 
front of /川/. 
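
A Group 3 record can be sketched as below. This is a hedged illustration under several assumptions: the context Kanji 河 is hypothetical (the context characters of the original record are not legible here), M is taken to mean "front" and N "behind", and the fallback handles only the always-applicable code H rather than the full operation table.

```python
# Hypothetical Group 3 record modelled on the format of Figure 2, no. 4.
RECORD = {
    "kanji": "川",
    # reading number -> (operation code, reading)
    "readings": {1: ("8", "senn"), 2: ("H", "kawa")},
    # (side, context Kanji, applied reading number); "M" = front, "N" = behind
    "contexts": [("M", "河", 1)],
}


def read_group3(record, front, behind):
    """Return the reading of a Group 3 Kanji given its neighbouring characters."""
    # 1. A special context, if present, decides the reading.
    for side, context_kanji, reading_no in record["contexts"]:
        neighbour = front if side == "M" else behind
        if neighbour == context_kanji:
            return record["readings"][reading_no][1]
    # 2. Otherwise fall back to the regular reading (simplified: code H only).
    for op_code, reading in record["readings"].values():
        if op_code == "H":
            return reading
    return record["kanji"]


print(read_group3(RECORD, front="河", behind=""))   # special context
print(read_group3(RECORD, front="", behind="で"))   # regular reading
```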
The merits of this method are that the 
table is small and the process fast. If 
we had a table listing vocabulary rather 
than Kanji, it would be much larger, 
requiring at least 70,000 entries. 
One demerit is that the process does not 
completely cover all cases. The phenom- 
enon of rendaku or renjo, in particular, 
requires special contexts. There are no 
rules for this. Examples of rendaku and 
renjo are as follows: 
本箱 /hon/+/hako/-->/honbako/ bookcase 
子供 /ko/+/tomo/-->/kodomo/ child 
天皇 /ten/+/ou/-->/tennou/ emperor 
因縁 /in/+/en/-->/innen/ karma 
酒屋 /sake/+/ya/-->/sakaya/ wineshop 
2. Automatic segmentation 
We do not use spaces between words in 
Japanese, but we do use many different 
elements in our writing system. There 
are Kanji, kana (hiragana and katakana), 
the alphabet, numerals, and signs. 
Figure 4 shows the ratio of these ele- 
ments in Japanese newspapers. If we look 
at a Japanese text as a string of dif- 
ferent kinds of characters, we can 
replace the characters of a Japanese 
sentence with the abbreviations of Table 
2. 
A M . 1 0 (followed by six Japanese characters) 
4 4 6 5 5 2 3 3 2 1 2 
In Japanese composition we are taught 
the proper use of the different char- 
acters in this way: 
Kanji - to express concepts; more 
concretely, for nouns, the 
stems of verbs, etc. 
hiragana - for particles, auxiliary 
verbs*1, the endings of verbs 
and adjectives, writing 
phonetically, etc. 
katakana - for borrowed words, foreign 
personal and place names, 
onomatopoeia, etc. 
alphabet - for abbreviations 
numerals - for figures 
Therefore, if the different characters 
are used properly they suggest the type 
of word. 
Figure 4. Ratio of characters in 
newspapers (chart of Kanji, hiragana, 
katakana, Roman characters, numerals, 
and signs; legible values: 43.4, 28.0; 
running characters: 1,489,175) 
We checked the character combinations. 
For each combination, the percentage of 
occurrences at which a segmentation point 
falls is as follows. 
front \ behind    1      2      3      4      5      6 
1 Kanji           5.7   61.7   45.2   75.0  100.0   73.8 
2 Hiragana       92.1   40.8   95.7  100.0  100.0   95.1 
3 Katakana       25.4   89.5    1.0   ---    ---    33.3 
4 Alphabet        2.8  100.0  100.0   13.2    0.0   90.0 
5 Numeral         2.7  100.0   ---   100.0    0.0   75.0 
6 Sign           98.2   84.7   62.1   33.3   23.7   --- 
                                                     (%) 
Object: 15,677 characters 
Table 2. A ratio of segmental point 
We can segment at character combinations 
with a high ratio in Table 2 but not at 
those with a low ratio. 
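
The idea of cutting only at high-ratio combinations can be sketched as a simple threshold. The paper does not state the cut-off that was used; the 50% value below is an assumption, chosen because it yields the same binary pattern as the paper's Table 3. The values are read from Table 2, with missing cells ("---") written as None, and the position of the last value in the katakana row is itself an assumption.

```python
# Segmentation-point ratios (%) from Table 2; rows = front class, columns =
# behind class, in the order Kanji, hiragana, katakana, alphabet, numeral, sign.
TABLE_2 = [
    [5.7, 61.7, 45.2, 75.0, 100.0, 73.8],
    [92.1, 40.8, 95.7, 100.0, 100.0, 95.1],
    [25.4, 89.5, 1.0, None, None, 33.3],
    [2.8, 100.0, 100.0, 13.2, 0.0, 90.0],
    [2.7, 100.0, None, 100.0, 0.0, 75.0],
    [98.2, 84.7, 62.1, 33.3, 23.7, None],
]


def to_binary(table, threshold=50.0):
    """1 = segment at this combination, 0 = do not segment."""
    return [
        [1 if (ratio is not None and ratio > threshold) else 0 for ratio in row]
        for row in table
    ]


for row in to_binary(TABLE_2):
    print(row)
```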
For our program we converted Table 2 to 
the form found in Table 3. We can seg- 
ment a sentence at the places where nu- 
meral 1 is found in the table. 
front \ behind   1  2  3  4  5  6 
1 Kanji          0  1  0  1  1  1 
2 Hiragana       1  0  1  1  1  1 
3 Katakana       0  1  0  0  0  0 
4 Alphabet       0  1  1  0  0  1 
5 Numeral        0  1  0  1  0  1 
6 Sign           1  1  1  0  0  0 
Table 3. Table for segmentation by 
character combination 
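
Segmentation by character combination can be sketched as follows. The matrix is Table 3; the Unicode ranges used to recognize each character class and the sample string are assumptions made for this illustration (the original system worked on the NLRI code, not Unicode).

```python
# Table 3: TABLE_3[front - 1][behind - 1] == 1 means "segment here".
TABLE_3 = [
    [0, 1, 0, 1, 1, 1],  # 1 Kanji
    [1, 0, 1, 1, 1, 1],  # 2 Hiragana
    [0, 1, 0, 0, 0, 0],  # 3 Katakana
    [0, 1, 1, 0, 0, 1],  # 4 Alphabet
    [0, 1, 0, 1, 0, 1],  # 5 Numeral
    [1, 1, 1, 0, 0, 0],  # 6 Sign
]


def char_class(ch):
    """Map a character to the class numbers 1-6 (Unicode ranges assumed)."""
    if "\u4e00" <= ch <= "\u9fff":
        return 1  # Kanji
    if "\u3041" <= ch <= "\u309f":
        return 2  # hiragana
    if "\u30a1" <= ch <= "\u30ff":
        return 3  # katakana
    if ch.isascii() and ch.isalpha():
        return 4  # alphabet
    if ch.isascii() and ch.isdigit():
        return 5  # numeral
    return 6      # sign


def segment(text):
    """Cut the text wherever Table 3 has a 1 for the adjacent class pair."""
    if not text:
        return []
    pieces = [text[0]]
    for prev, ch in zip(text, text[1:]):
        if TABLE_3[char_class(prev) - 1][char_class(ch) - 1]:
            pieces.append(ch)
        else:
            pieces[-1] += ch
    return pieces


print(segment("COLING80が東京の"))
```

Hiragana followed by hiragana is never cut by this table, which is why a separate character string table is needed for that combination.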
Figure 5. Table for segmentation and 
classification of parts of 
speech 
The hiragana-hiragana type is the 
second most frequent combination in 
Japanese. According to Table 2, we are 
unable to segment at this combination. 
Therefore we make the following rule. 
The hiragana /を/ is used only as a 
particle and we always segment at it. The 
other hiragana characters are segmented 
according to the character string table 
found in Figure 5. The format, as in the 
second line in Figure 5, is the number 
of characters in the string (4), the 
character string (up to 10 characters), 
the length of the words (2, 1, 1), the 
parts of speech (C, E, P), 
and the conjugation (9). 
This table contains only 300 records. 
These are the particles, auxiliary verbs, 
adverbs, and character strings which 
cannot be segmented by Table 3 (e.g. the 
string in the second line of Figure 5). 
This table is applied as follows. The 
program first searches the input 
sentences for the character strings of 
the table. If a character string 
fits part of an input sentence, 
then the program segments it into 
parts by the lengths of words in the 
table and adds the information about the 
parts of speech and conjugation. As a 
result we obtain the separate words. 
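
The table lookup described above can be sketched as follows. The record layout (string, word lengths, parts of speech) follows the format given for Figure 5, but the single record shown and the left-to-right matching order are assumptions for illustration only; the conjugation field is omitted for brevity.

```python
# Hypothetical record: character string, lengths of its words, parts of speech.
# "P" is the paper's code for an auxiliary verb.
STRING_TABLE = [
    ("でした", (2, 1), ("P", "P")),
]


def apply_string_table(sentence, table):
    """Split every table string found in the sentence by its word lengths."""
    words = []
    i = 0
    while i < len(sentence):
        for string, lengths, parts in table:
            if sentence.startswith(string, i):
                j = i
                for length, part in zip(lengths, parts):
                    words.append((sentence[j:j + length], part))
                    j += length
                i = j
                break
        else:
            # No table string starts here: pass the character through untagged.
            words.append((sentence[i], None))
            i += 1
    return words


print(apply_string_table("本でした", STRING_TABLE))
```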
Figure 6 shows the results of automatic 
segmentation and automatic 
transliteration from Kanji to Roman 
letters. The operation of Table 3 has 
resulted in no segmentation within the 
string /COLING80/ and within the Kanji 
and katakana strings, as well as the 
segmentation at the sign (/./). The 
operation of the table in Figure 5 has 
resulted in the segmentation of the 
hiragana strings. 
3. Automatic classification of parts of 
speech 
In order to analyze the vocabulary we 
have to classify it by parts of speech. 
The program does this by three methods. 
The first method is by using the table 
found in Figure 5. 
The second method is by the form of the 
word, applying the rules below. The ra- 
tio of correct answers obtained is given 
in parentheses after each rule. 
1. If the last character of the word 
is in Kanji, katakana, or the 
alphabet, then the word is a noun. 
(94.4%) 
2. If the last character is /い/, then 
it is a verb in the renyo form 
(conjugation) or an adjective in 
the syushi or rentai form. (86.2%) 
3. If the last character is /く/, then 
it is a verb in the syushi or 
rentai form or an adjective in the 
renyo form. (83.4%) 
Figure 6. Result of segmentation and transliteration from Kanji 
to Roman characters 
4. If the last character is /Y/, then 
it is a verb, syushi form. (95.8%) 
5. If the last character is /れ/, then 
it is a verb, katei form, or a 
demonstrative pronoun, or an auxiliary 
verb*1. (92.9%) 
6. If the last character is /b/, then 
it is a verb, meirei form, or a noun. 
(63.3%) 
7. If the last two characters are /~/, 
then it is an adjective, mizen form, 
or a verb, renyo form. (74.2%) 
8. If the last character is /~/, then 
it is a verb, renyo form. (79.6%) 
9. If the last two characters are 
Kanji-hiragana, then it is a verb. 
(94.4%) 
If the vowel of the last hiragana 
is /a/, then its conjugation is 
the mizen or renyo form; 
if it is /i/, then it is mizen or 
renyo; 
if it is /u/, then it is syushi or 
rentai; 
if it is /e/, then it is katei or 
meirei; 
if it is /o/, then it is meirei. 
10. If the last character is a numeral, 
then it is a figure and if it is a 
sign, then it is a sign. 
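
The form-based rules can be sketched as below for rules 1, 9, and 10 only; the rules that depend on one specific final hiragana are left out. The Unicode ranges, the use of the Unicode character name to find the vowel of a hiragana, and the order in which the rules are tried are assumptions of this sketch, not details given in the paper.

```python
import unicodedata

# Rule 9: vowel of the last hiragana -> possible conjugation forms.
VOWEL_TO_CONJUGATION = {
    "A": "mizen or renyo",
    "I": "mizen or renyo",
    "U": "syushi or rentai",
    "E": "katei or meirei",
    "O": "meirei",
}


def script(ch):
    """Rough script detection by Unicode range (assumed, not from the paper)."""
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    if "\u3041" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a1" <= ch <= "\u30ff":
        return "katakana"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    if ch.isascii() and ch.isdigit():
        return "numeral"
    return "sign"


def classify(word):
    """Return (part of speech, conjugation) using rules 1, 9 and 10."""
    last = script(word[-1])
    if last in ("kanji", "katakana", "alphabet"):
        return ("noun", None)                      # rule 1
    if last == "numeral":
        return ("figure", None)                    # rule 10
    if last == "sign":
        return ("sign", None)                      # rule 10
    if len(word) >= 2 and script(word[-2]) == "kanji":
        # e.g. the name "HIRAGANA LETTER KU" ends in the vowel "U"
        vowel = unicodedata.name(word[-1])[-1]
        return ("verb", VOWEL_TO_CONJUGATION.get(vowel))  # rule 9
    return (None, None)                            # left to the other methods


print(classify("書く"))
```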
The third method is by word combinations. 
That is, in Japanese grammar word 
combination, especially of nouns or verbs 
with particles or auxiliary verbs*1, is 
not free. The formula given in Figure 7 
is made from this rule. 
Its format is as follows: 
1. the word 
2. its part of speech 
3. auxiliary verbs or particles which 
can be used in front of this word 
4. parts of speech and conjugations 
which can be used in front of this 
word 
5. if 3 and 4 do not agree then 5 
applies obligatorily. 
Figure 8 is the result of automatic 
classification of parts of speech. The 
explanation of the codes used in it is 
as follows: 
1 (noun), E (verb), M (adjective) 
P (auxiliary verb*1), R (particle) 
C (adverb), A (conjunction), 
B (interjection), Y (sign), X (figure) 
Q (auxiliary verb*1 or particle) 
8 ('mizen' form), 9 ('renyo' form) 
# ('mizen' or 'renyo' form) 
+ ('syushi' or 'rentai' form) 
Figure 7. Table for classification of parts of speech 
Figure 8. Result of classification 
of parts of speech 
(ten characters, in order of frequency) 
char.'s    word's freq. 
freq.      aux.v. & part.     other 
38404      32588 (84.9%)          2  (0.0%) 
23633          2  (0.0%)       1305  (5.5%) 
22124         64  (0.3%)      13138 (59.4%) 
18962      17037 (89.8%)          3  (0.0%) 
16383      10173 (62.1%)          0  (0.0%) 
16062      13324 (83.0%)          0  (0.0%) 
15958      10569 (66.2%)          1  (0.0%) 
15522         17  (0.1%)          0  (0.0%) 
14710      14702 (99.9%)          0  (0.0%) 
13515       8351 (61.8%)          0  (0.0%) 
Figure 9. Result of supervisor 
The steps in Figure 8 are 
1. input data 
2. the result of segmentation 
3. the result of transliteration from 
Kanji to Roman letters 
4. the automatic classification of 
the parts of speech by methods 1 
and 2 (by table and by word form) 
5. the conjugations 
6. automatic classification by method 
3, resulting in one word being changed 
from a verb to a noun (using the 
formula found in Figure 7). 
4. Supervisor 
The supervisor program checks the re- 
sults of the three automatic processing 
programs and selects the correct results 
or processes feedback. It also utilizes 
information obtained through each pro- 
gram. That is, 
1. The results of the character check 
and conversion from kana to Roman 
letters are used for each program. 
2. The information obtained in 
automatic transliteration is used in 
segmentation. 
Namely, if the special context is 
applied, then the program does not 
segment at that point because the 
character string is a word. 
3. The information obtained at the 
conversion from kana to Roman 
letters is used in segmentation. 
Namely, if the consonant of the 
Romanized Japanese is (*), (J), or 
(Q), which are used for the special 
small characters in kana, then 
the program does not segment at 
that point. 
4. The information obtained in 
segmentation is used in 
classification. 
Namely, the program obtains 
information concerning parts of speech 
and conjugation through using the 
table in Figure 5 in segmentation. 
(1) TAKUSANN NO KI WO TA BA NE RARE MASE NN DESI TA . 
    TAKUSANN NO KI WO TABANERA RE MASE NN DESI TA . 
(2) #OMOSIROKU TE #ASOBI SUGI TA . 
    #OMOSIROKU TE #ASOBISUGI TA . 
Figure 10. Result of supervisor (romanized 
rows only; for each sentence, the first 
row is before and the second row after 
the supervisor) 
Checking the results of the processing 
involves the following: 
1. Checking particle and auxiliary 
verb strings obtained by the pro- 
gram at classification. If these 
strings are impossible in Japanese, 
then the segmentation was mistaken. 
The program corrects these. 
2. There are not many words composed 
of one character in Japanese ex- 
cept for particles and auxiliary 
verbs. Figure 9 gives the frequen- 
cy of some characters and the fre- 
quency of words consisting of that 
character alone. 
Words of high frequency that are 
not particles or auxiliary verbs 
are produced by errors in segmen- 
tation. The program then corrects 
these errors, combining them into 
longer words. 
3. If a verb in the renyo form is 
followed by another verb, then it 
is a compound word and the program 
corrects the error to produce a 
longer word. 
Figure 10 shows the results of the 
supervisor program. In test sentence 1, 
the program at first segmented 
/TA/BA/NE/RARE/ as auxiliary verbs 
through the use of the table in 
Figure 5. But the supervisor program 
checks and corrects this string and 
the classification program adds the 
information of verb to /TABANERA/, 
as can be seen in Figure 10. 
In test sentence 2, the program at 
first segmented it /#ASOBI/SUGI/TA/, 
but the supervisor program 
checked this and corrected this 
string to the compound word, 
/#ASOBISUGI/, plus /TA/. 
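
The compound-verb correction of checking rule 3 can be sketched as follows. The word triples (romanized form, part of speech, conjugation) use the paper's codes E (verb), P (auxiliary verb), 9 (renyo form), and + (syushi or rentai form), but the particular codes attached to each word of the sample and the choice to keep the conjugation of the second verb are assumptions of this sketch.

```python
def merge_compound_verbs(words):
    """Join a verb in the renyo form with an immediately following verb."""
    merged = []
    for surface, pos, conj in words:
        if merged and pos == "E":
            prev_surface, prev_pos, prev_conj = merged[-1]
            if prev_pos == "E" and prev_conj == "9":
                # Compound word: keep the conjugation of the second verb.
                merged[-1] = (prev_surface + surface, "E", conj)
                continue
        merged.append((surface, pos, conj))
    return merged


# Test sentence 2, with assumed codes for each word.
before = [("#ASOBI", "E", "9"), ("SUGI", "E", "9"), ("TA", "P", "+")]
print(merge_compound_verbs(before))
```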
We can process Japanese sentences using 
these methods and obtain words and vari- 
ous information about these words. With 
this program we can obtain a rate of 
correct answers of approximately 90 
percent.*3 
We should be able to improve this pro- 
gram at the level of the supervisor and 
the tables. However, we don't think that 
it will be possible to obtain 100 
percent correct answers because this 
system uses Japanese writing and the 
Japanese writing system is not 100 percent 
standardized. In addition, if we wish to 
produce a complete program, it is 
necessary to process on the basis of syntax 
and meaning. At present, this is not the 
object of our efforts. 
5. Adding lexical information 
The National Language Research Institute 
has been investigating the vocabulary of 
modern Japanese since 1952, and has been 
using the computer in this research 
since 1966. As a result, some five mil- 
lion words are available as machine 
readable data. This data contains 
various information such as word frequency, 
part of speech, class by word origin, 
and thesaurus number. The thesaurus, 
Bunrui Goi Hyo in Japanese, was produced 
by Doctor Oki Hayashi. It contains about 
38,000 words of the Japanese 
language. 
6. Making the concordance 
We will not explain this program here 
since we have written a separate report 
about it (number 6 in the list of refer- 
ences below). Please refer to this re- 
port for further details. 
Figure 11 is the result of this process. 
Acknowledgements 
Professor Akio Tanaka developed this 
plan, made a prototype for automatic 
transliteration from Kanji to kana, and 
permitted us to use this program. 
Mr. Kiyoshi Egawa made a prototype 
for an automatic segmentation program 
and permitted us to use it. They also 
contributed to this study through our 
discussions with them. Mr. Oki Hayashi 
furnished us with the opportunity to 
study this and provided his support for 
our efforts. 
WORD     ROMANIZED     PARTS OF   THESAURUS 
NUMBER   JAPANESE      SPEECH     NUMBER 
01421    #I#E          1          1.202 
01224    #I#E          B          4.921 
00224    #I#ERU        E+ 
01769    #I#ERU        E+ 
01949    #IKANAKE      E8 
01719    #IKI          E# 
01761    #IKI          E# 
02080    #IKI          E# 
02495    #IKI          E9 
01146    #IKI          1 
00469    #IKI#O#I      1 
02070    #IKIRU        E+         2.581 
02827    #IKIRU        E+         2.581 
02524    #IKIRU        E+         2.581 
01970    #IKIRU        E+         2.581 
02128    #IKIRU        E+         2.581 
01278    #IKU          E+ 
00438    #IKU          M9 
00520    #IKU          M9 
01621    #IKO#U        1          2.382 
01667    #IKO#U        1          2.?2 
00025    #IGO          1 
00840    #ISI          1 
00258    #ISIKI        1 
00551    #ISIKI        1 
00950    #ISIKISA      E8 
00285    #ISIKINA#I    1 

Notes: 
*1 Auxiliary verb: This term means 
a bound form which conjugates. 
It is called Jodoshi in Japanese. 
*2 /TABANERARE/ is correctly segmented 
into /TABANE/ and /RARE/. This case is 
an error of the program. 
*3 The ratios of correct answers are as 
follows. 
Sample: 2500 words from a high 
school textbook 
Segmentation: 91.3% 
Transliteration from Kanji to kana: 
95.7% 
Classification of parts of speech: 
97.0% 
Figure 11. Concordance of a high school textbook (word list and 
KEYWORD IN CONTEXT lines) 

References 

Hiroshi Nakano. 1978. An Automatic 
Processing System of Natural 
Language. 
STUDIES IN COMPUTATIONAL LINGUISTICS, 
Vol. 10, pp. 17-40 

Akio Tanaka. 1969. A Program 
System of Transliteration, from 
Kanji to Kana, and from Kanji to 
Romaji. STUDIES IN COMPUTATIONAL 
LINGUISTICS, Vol. 2, pp. 107-138 

Kiyoshi Egawa. 1968. An Inquiry 
into the "Automatic Segmentation" 
of Japanese Text. MATHEMATICAL 
LINGUISTICS, Vol. 43/44, pp. 46-52 

Kiyoshi Egawa. 1969. A System of 
Automatic Segmentation for Japanese 
Text. MATHEMATICAL LINGUISTICS, 
Vol. 51, pp. 17-22 

Hiroshi Nakano. 1971. Automatical 
Classification of Parts of Speech. 
STUDIES IN COMPUTATIONAL LINGUISTICS, 
Vol. 3, pp. 98-115 

Hiroshi Nakano. 1976. A Program 
Library for Making the Verbal 
Concordance by Computer. STUDIES IN 
COMPUTATIONAL LINGUISTICS, Vol. 8, 
pp. 18-62 

The National Language Research Institute 
1970. STUDIES ON THE VOCABULARY OF 
MODERN NEWSPAPERS. The N.L.R.Inst. 
REPORT 37, 
