<?xml version="1.0" standalone="yes"?>
<Paper uid="C80-1050">
  <Title>AN AUTOMATIC PROCESSING OF THE NATURAL LANGUAGE IN THE WORD COUNT SYSTEM</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
AN AUTOMATIC PROCESSING OF THE NATURAL LANGUAGE
IN THE WORD COUNT SYSTEM
HIROSHI NAKANO, SHIN'ICHI TSUCHIYA, AKIO TSURUOKA
THE NATIONAL LANGUAGE RESEARCH INSTITUTE
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Summary
</SectionTitle>
    <Paragraph position="0"> We succeeded in making a program having the following four functions: 1. segmenting the Japanese sentence 2. transliterating from Chinese characters (called Kanji in Japanese) to the Japanese syllabary (kana) or to Roman letters 3. classifying the parts of speech in the Japanese vocabulary 4. making a concordance. We are using this program for the pre-editing of surveys of Japanese vocabulary. In Japanese writing we use many kinds of writing systems, i.e. Kanji, kana, the alphabet, numerals, and so on. We have thought of this as a demerit in language data processing. But we can change this demerit into a merit: our program makes good use of these many writing systems.</Paragraph>
    <Paragraph position="1"> Our program has only a small table containing 300 units. And it is very fast. In our experiments we have obtained approximately 90% correct answers.</Paragraph>
    <Paragraph position="2"> Introduction Obtaining clean data is very important in language data processing. There are two problems here. One is how to input the Japanese text and the other is how to find errors in the data and correct them. The human being is suited to complicated work but not to simple work. The machine, on the contrary, is suited to simple work but not to complicated work. In the word count system using computers, the machine has the simple work (sorting, computation, making a list), and the humans have the complicated work (segmentation, transliteration from Kanji to kana, classification of parts of speech, finding errors in the data, discrimination of homonyms and homographs, etc.).</Paragraph>
    <Paragraph position="3"> However, in this system there is one major problem -- humans often make mistakes. And, regrettably, we cannot predict where they will make them. Thus we decided to make an automatic processing system. This system has to be compact, fast, and over 90% accurate.</Paragraph>
    <Paragraph position="4"> In Japanese writing we generally use many kinds of writing systems.</Paragraph>
    <Paragraph position="5"> For example, in the sample sentence (garbled in the source) we find the alphabet (C, O, L, I, N, G), numerals (8, 0), kana (both hiragana, the Japanese cursive syllabary, and katakana, the Japanese straight-lined syllabary), Kanji, and signs (.). And as you can see, there are no spaces left between words. This makes Japanese data processing difficult. Our program makes good use of these different elements in the writing system. At present the automatic processing program makes more mistakes than humans do. But we can predict where it will make them and easily correct errors in the data.</Paragraph>
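Detecting which writing system each character belongs to is the basis of everything described above. A minimal sketch follows; the original system used the NLRI character code, so the Unicode ranges and Python used here are modern stand-ins, not the authors' implementation:

```python
# Sketch: classify a character by writing-system type.
# Unicode ranges are an assumption; the 1980 system used the NLRI code.
def char_type(ch):
    cp = ord(ch)
    if 0x3041 <= cp <= 0x3096:
        return "hiragana"
    if 0x30A1 <= cp <= 0x30FA or ch == "ー":   # ー is the long-vowel mark
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    if ch.isascii() and ch.isdigit():
        return "numeral"
    return "sign"
```

With such a classifier, a Japanese sentence becomes a string of type labels, which is exactly the view the segmentation method below exploits.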
    <Paragraph position="6"> Objective Our objective is a system having the following functions:  1. segmentation 2. transliteration from Kanji to kana 3. classification of parts of speech 4. adding lexical information by use of a dictionary 5. making a concordance 6. making a word list  Numbers 1, 2, and 3 are especially important for our program. Our report will mainly deal with these three functions. The input data is generally a text written in Japanese. The output is a concordance sorted in the Japanese alphabetical order, giving information on the parts of speech, and marked with a thesaurus number.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
System
</SectionTitle>
      <Paragraph position="0"> Figure 1 is a flow chart of our program.</Paragraph>
      <Paragraph position="1"> Input is by magnetic tape, paper tape, or card. The input code is the NLRI</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
(National Language Research Institute)
</SectionTitle>
      <Paragraph position="0"> code or some other code. Of course we have a code conversion program from other codes to the NLRI code.</Paragraph>
      <Paragraph position="1"> The second block of Figure 1 shows what we call the automatic processing of natural language. In the supervisor square we check and select the results of the three automatic processing programs.</Paragraph>
      <Paragraph position="2"> Some of these programs involve many kinds of processing of natural language. For example, the automatic segmentation program involves the classification of parts of speech, automatic syntactic analysis, automatic transliteration from Kanji to kana, and so on. (An example will be found in the next section.) In the adding lexical information block of Figure 1, we make use of the dictionary obtained by research into some 5 million words at the NLRI. This dictionary includes word frequencies, parts of speech, classes by word origin, and a thesaurus number.</Paragraph>
      <Paragraph position="3"> By using the concordance we can find and correct errors in the data. As our program is unfortunately not always complete, this concordance is very useful. In the output block of Figure 1 we can choose a variety of output devices -- an alphabet line printer, a kana line printer, a high-speed Kanji printer, or a Kanji display.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Method
</SectionTitle>
      <Paragraph position="0"> 1. Automatic transliteration from Kanji to Roman letters. The Chinese characters have many different readings in Japanese. For example, one Kanji (garbled in the source) has the readings /sei/, /syo/, /um-/, /iki/, /nama/, /ai/.</Paragraph>
      <Paragraph position="2"> We have to arrange the Japanese words in the Japanese alphabetical order.</Paragraph>
      <Paragraph position="3"> The program assigns the appropriate reading to each word for the word list.</Paragraph>
      <Paragraph position="4"> The method of selecting the reading is to choose it in accordance with the surroundings of the Kanji in the text.</Paragraph>
      <Paragraph position="5"> The possible readings for each Kanji are listed in a small table. The records in this table are of 3 types: Groups 1, 2, and 3, represented by Nos. 1, 2, 3 and 4, 5, 6 respectively in Figure 2.</Paragraph>
      <Paragraph position="6"> The Kanji in Group 1 have one reading each. The program replaces the Kanji with this reading. In Figure 2, No. 1 falls into this category. We have about 700 Kanji in Group 1.</Paragraph>
      <Paragraph position="7"> The Kanji in Group 2 have two or more readings each. In Figure 2, Nos. 2 and 3 fall into this category.</Paragraph>
      <Paragraph position="8"> The format for these entries is group number, the Kanji, the operation code (a numeral or capital letter), and the reading (up to 8 small letters).</Paragraph>
      <Paragraph position="9"> The appropriate reading is chosen for the situation of the Kanji in accordance with Table 1.</Paragraph>
      <Paragraph position="10"> Table 1 (garbled in the source): each operation code -- the numerals 1-8 or the capital letters A-H -- stands for one pattern of the situation of the Kanji, i.e. whether the character in front of it and the character behind it are Kanji (1) or non-Kanji (0).</Paragraph>
      <Paragraph position="12"> 0: replace the Kanji with the reading in the table</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="0" end_page="13515" type="metho">
    <SectionTitle>
INPUT
INFORMATION CONCORDANCE
(Figure 1 flow-chart labels; remainder garbled in the source)
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"> (Input) (Output) (1) [garbled Kanji]. KO#UKAWO#UTA#U. (2) [garbled Kanji]. KAWADE#OYOGU.  Figure 3 gives a sample of the results of our experiments. The Kanji /~/ in No. 1 here is a group 2 Kanji. Its situation in the context /~&lt;~/ is that in front of it is the Kanji /~/ and behind it is the non-Kanji /~/. When the context is Kanji + non-Kanji, the program selects reading 1 /ka/. The situation of /~/ in the context /~ ~%O/ is non-Kanji + non-Kanji, so the reading A /#uta/ is selected. As a result /~%P/ is transliterated to /ko#ukawo#uta#u/.</Paragraph>
    <Paragraph position="3"> Group 2 contains 1500 Chinese characters. The Kanji in Group 3 have a special reading in a special context in addition to their regular readings. In Figure 2, Nos. 4, 5, and 6 are in this group. In Figure 3, /)bl/ in No. 2 can be processed without a special reading, but in No. 3 the special reading is needed. To obtain this reading, the special context after the sign * is applied. The format, as in Figure 2, No. 4, is group number (3), Kanji ()bl), reading number (1, 2), operation code (8, H), reading, sign (*), code for front or behind (M, N), and Kanji. In this case reading number 1 is applied because /~/ is found in front of /)bl/.</Paragraph>
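The group-2 selection just described can be sketched as a context lookup. We assume the Kanji in the example is 歌 (which matches the /ka/ vs /#uta/ romanization in the text); the table entry and the code-to-context mapping below are illustrative assumptions, not the paper's actual table:

```python
# Sketch of group-2 reading selection: the reading depends on whether
# the characters in front of and behind the Kanji are Kanji or not.
# The entry for 歌 is an assumption reconstructed from the text's
# /ka/ (Kanji + non-Kanji) vs /#uta/ (non-Kanji + non-Kanji) example.
def is_kanji(ch):
    return ch is not None and 0x4E00 <= ord(ch) <= 0x9FFF

# reading keyed by (front_is_kanji, behind_is_kanji)
READINGS = {
    "歌": {(True, False): "ka",      # Kanji in front, non-Kanji behind
           (False, False): "#uta"},  # non-Kanji on both sides
}

def read_kanji(text, i):
    front = text[i - 1] if i > 0 else None
    behind = text[i + 1] if i + 1 < len(text) else None
    ctx = (is_kanji(front), is_kanji(behind))
    # fall back to the character itself when no entry fits
    return READINGS.get(text[i], {}).get(ctx, text[i])
```

Applied to 校歌を歌う, this yields /ka/ for the second character (Kanji 校 in front, particle を behind) and /#uta/ for the fourth, as in the paper's example.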
    <Paragraph position="4"> The merits of this method are that the table is small and the process fast. If we had a table listing vocabulary rather than Kanji, it would be much larger, requiring at least 70,000 entries.</Paragraph>
    <Paragraph position="5"> One demerit is that the process does not completely cover all cases. The phenomena of rendaku and renjo, in particular, require special contexts. There are no rules for these. Examples of rendaku and renjo are as follows (garbled in the source). 2. Automatic segmentation  We do not use spaces between words in Japanese, but we do use many different elements in our writing system. There are Kanji, kana (hiragana and katakana), the alphabet, numerals, and signs.</Paragraph>
    <Paragraph position="6"> Figure 4 shows the ratio of these elements in Japanese newspapers. If we look at a Japanese text as a string of different kinds of characters, we can replace the characters of a Japanese sentence with the abbreviations of Table 2.</Paragraph>
    <Paragraph position="7"> [Sample string with its Table 2 abbreviations, garbled in the source.] In Japanese composition we are taught the proper use of the different characters in this way: Kanji - to express concepts; more concretely, for nouns, the stems of verbs, etc.</Paragraph>
    <Paragraph position="8"> hiragana - for particles, auxiliary verbs, the endings of verbs and adjectives, writing phonetically, etc.</Paragraph>
    <Paragraph position="9"> katakana - for borrowed words, foreign personal and place names, onomatopoeia, etc.</Paragraph>
    <Paragraph position="10"> alphabet - for abbreviations; numerals - for figures. Therefore, if the different characters are used properly they suggest the type of the word.</Paragraph>
    <Paragraph position="12"> We checked the character combinations.</Paragraph>
    <Paragraph position="13"> The ratio of segmentation points for each character combination is as follows.</Paragraph>
    <Paragraph position="14">  We can segment at character combinations with a high ratio in Table 2 but not at those with a low ratio.</Paragraph>
    <Paragraph position="15"> For our program we converted Table 2 to the form found in Table 3. We can segment a sentence at the places where numeral 1 is found in the table.</Paragraph>
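The Table 3 lookup can be sketched as a boundary matrix over character-type pairs: segment wherever the matrix holds 1. The matrix values below are illustrative assumptions; the paper's actual Table 3 was derived from its corpus ratios and is garbled in this copy:

```python
# Sketch of segmentation at character-type boundaries (Table 3 style).
# BOUNDARY values are assumed for illustration.
def char_type(ch):
    cp = ord(ch)
    if 0x3041 <= cp <= 0x3096:
        return "hira"
    if 0x30A1 <= cp <= 0x30FC:
        return "kata"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isdigit():
        return "num"
    if ch.isascii() and ch.isalpha():
        return "alpha"
    return "sign"

BOUNDARY = {  # 1 = segment between these two types (assumed values)
    ("kanji", "hira"): 1,
    ("hira", "kanji"): 1,
    ("hira", "kata"): 1,
    ("kata", "hira"): 1,
    ("sign", "kanji"): 1,
}

def segment(text):
    words, start = [], 0
    for i in range(1, len(text)):
        if BOUNDARY.get((char_type(text[i - 1]), char_type(text[i])), 0):
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words
```

Note how a low-ratio pair such as alphabet + numeral is absent from the matrix, so a string like COLING80 stays unsegmented, exactly as the paper reports.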
      <Paragraph position="17"> Classification of parts of speech. The hiragana-hiragana type is the second most frequent combination in Japanese. According to Table 2, we are unable to segment at this combination. Therefore we make the following rule. The hiragana /~/ is used only as a particle and we always segment at it. The other hiragana characters are segmented according to the character string table found in Figure 5. The format, as in the second line in Figure 5, is the number of characters in the string (4), the character string (up to 10 characters) (C ~ L ~), the lengths of the words (2, 1, 1), the parts of speech (C, E, P), and the conjugation (9).</Paragraph>
    <Paragraph position="18"> This table contains only 300 records.</Paragraph>
      <Paragraph position="19"> These are the particles, auxiliary verbs, adverbs, and character strings which cannot be segmented by Table 3 (e.g. CJb~ in Figure 5).</Paragraph>
      <Paragraph position="20"> This table is applied as follows. The program first searches for the character strings of the table in the input sentences. If a character string (~gb~) fits part of an input sentence ( E~b~ l:I~ ), then the program segments it into parts by the lengths of the words in the table and adds the information about the parts of speech and conjugation. As a result we obtain the words (~/ b / ~ /). Figure 6 shows the results of automatic segmentation and automatic transliteration from Kanji to Roman letters. The operation of Table 3 has resulted in no segmentation for the strings (/COLING80 /), (/~/), (/~rff-~y~--$--J~/), and (/~{!~ /) as well as the segmentation at the sign (/./). The operation of the table in Figure 5 has resulted in the segmentation of the hiragana (/ ~/), (/ ~/),  (/ V /), (/~ /), and (/~ /).</Paragraph>
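The string-table application just described can be sketched as a scan that matches table strings against the sentence and splits each match by the recorded word lengths. Both entries below are hypothetical stand-ins, since the real Figure 5 table is garbled in this copy; only the rule that the particle は is always segmented comes from the text:

```python
# Sketch of the Figure 5 character-string table: each entry gives a
# string, the lengths of the words it splits into, and a POS code per
# word. Entries and POS codes are assumed for illustration.
STRING_TABLE = [
    ("でした", (2, 1), ("E", "P")),  # hypothetical: verb ending + aux.
    ("は", (1,), ("P",)),            # the particle は: always segmented
]

def apply_table(sentence):
    out, i = [], 0
    while i < len(sentence):
        for s, lengths, pos in STRING_TABLE:
            if sentence.startswith(s, i):
                # split the match by the stored word lengths
                for length, p in zip(lengths, pos):
                    out.append((sentence[i:i + length], p))
                    i += length
                break
        else:
            out.append((sentence[i], None))  # no entry: leave unresolved
            i += 1
    return out
```

Because each match also carries POS codes, segmentation and part-of-speech classification fall out of the same lookup, which is why the paper's table needs only about 300 records.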
      <Paragraph position="21"> 3. Automatic classification of parts of speech In order to analyze the vocabulary we have to classify it by parts of speech. The program does this by three methods. The first method is by using the table found in Figure 5.</Paragraph>
    <Paragraph position="22"> The second method is by the form of the word, applying the rules below. The ratio of correct answers obtained is given in parentheses after each rule.</Paragraph>
      <Paragraph position="23">  1. If the last character of the word is in Kanji, katakana, or the alphabet, then the word is a noun.</Paragraph>
    <Paragraph position="24"> (94.4%) 2. If the last character is /~/, then it is a verb in the renyo form (conjugation) or an adjective in the syushi or rentai form. (86.2%) 3. If the last character is /~/, then it is a verb in the syushi or rentai form or an adjective in the renyo form. (83.4%) [Figure 6: sample sentences with their romanized, segmented output, e.g. "C O L I N G 8 0 GA ... #IKU ." and "ZIJONN . F . KENEDE*I HA ... DA*TA ."; the Kanji originals are garbled in the source.]</Paragraph>
    <Paragraph position="26"> 4. If the last character is /Y/, then it is a verb, syushi form. (95.8%) 5. If the last character is /K/, then it is a verb, katei form, or a demonstrative pronoun, or an auxiliary verb. (92.9%) 6. If the last character is /b/, then it is a verb, meirei form, or a noun.</Paragraph>
    <Paragraph position="27"> (63.3%) 7. If the last two characters are /~/, then it is an adjective, mizen form, or a verb, renyo form. (74.2%) 8. If the last character is /~ /, then it is a verb, renyo form. (79.6%) 9. If the last two characters are</Paragraph>
    <Paragraph position="29"> If the vowel of the last hiragana is /a/, then its conjugation is the mizen form; if it is /i/, then it is the renyo form; if it is /u/, then it is the syushi or rentai form; if it is /e/, then it is the katei or meirei form; and if it is /o/, then it is the meirei form. 10. If the last character is a numeral, then it is a figure, and if it is a sign, then it is a sign.</Paragraph>
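Method 2 amounts to a dispatch on the type of the word's last character. Only the non-kana rules can be shown concretely; the kana-specific rules (2-9) key on individual characters that are garbled in this copy, so the sketch collapses them into one bucket:

```python
# Sketch of method 2: classify a word by its last character's type.
# The hiragana-ending rules (2-9 in the paper) are collapsed here
# because their trigger characters are garbled in the source.
def char_type(ch):
    cp = ord(ch)
    if 0x3041 <= cp <= 0x3096:
        return "hira"
    if 0x30A1 <= cp <= 0x30FC:
        return "kata"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isdigit():
        return "num"
    if ch.isascii() and ch.isalpha():
        return "alpha"
    return "sign"

def classify(word):
    t = char_type(word[-1])
    if t in ("kanji", "kata", "alpha"):
        return "noun"        # rule 1 (94.4% correct in the paper)
    if t == "num":
        return "figure"      # rule 10
    if t == "hira":
        return "inflected"   # rules 2-9: verb/adjective by ending kana
    return "sign"            # rule 10, sign case
```

The high accuracy of rule 1 reflects the composition conventions above: words ending in Kanji, katakana, or the alphabet are overwhelmingly nouns.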
    <Paragraph position="30"> The third method is by word combinations. That is, in Japanese grammar word combination -- especially of nouns or verbs with particles or auxiliary verbs -- is not free. The formula given in Figure 7 is made from this rule.</Paragraph>
    <Paragraph position="31"> Its format is as follows:  1. the word 2. its part of speech 3. auxiliary verbs or particles which can be used in front of this word 4. parts of speech and conjugations which can be used in front of this word 5. if 3 and 4 do not agree, then 5 applies. Figure 8 is an example of the automatic classification of parts of speech. The explanation of the codes used in it is as follows:</Paragraph>
    <Paragraph position="33"> [Figure 7 column labels, garbled in the source: the word's frequency; auxiliary verbs and particles; other.]</Paragraph>
    <Paragraph position="36"> 3, resulting in /~ ')/ being changed from a verb to a noun (using the formula for /i/ found in Figure 7).</Paragraph>
    <Paragraph position="37"> The steps in Figure 8 are  1. input data 2. the result of segmentation 3. the result of transliteration from Kanji to Roman letters 4. the automatic classification of the parts of speech by methods 1 and 2 (by table and by word form) 5. the conjugations. [Sample sentence garbled in the source.] 4. Supervisor  The supervisor program checks the results of the three automatic processing programs and selects the correct results or processes feedback. It also utilizes information obtained through each program. That is, 1. The results of the character check  and conversion from kana to Roman letters are used for each program. 2. The information obtained in automatic transliteration is used in segmentation.</Paragraph>
    <Paragraph position="38"> Namely, if the special context is applied, then the program does not segment at that point because the character string is a word.</Paragraph>
      <Paragraph position="39"> 3. The information obtained at the conversion from kana to Roman letters is used in segmentation. Namely, if the consonant of the Romanized Japanese is (*), (J), or (Q) -- these are used as special small characters in kana -- then the program does not segment at that point.</Paragraph>
      <Paragraph position="40"> 4. The information obtained in segmentation is used in classification. Namely, the program obtains information concerning parts of speech and conjugation through using the table in Figure 5 in segmentation. Checking the results of the processing involves the following: 1. Checking particle and auxiliary verb strings obtained by the program at classification. If these strings are impossible in Japanese, then the segmentation was mistaken. The program corrects these.</Paragraph>
    <Paragraph position="41"> 2. There are not many words composed of one character in Japanese except for particles and auxiliary verbs. Figure 9 gives the frequency of some characters and the frequency of words consisting of that character alone.</Paragraph>
    <Paragraph position="42"> Words of high frequency that are not particles or auxiliary verbs are produced by errors in segmentation. The program then corrects these errors, combining them into longer words.</Paragraph>
    <Paragraph position="43"> 3. If a verb in the renyo form is followed by another verb, then it is a compound word and the program corrects the error to produce a longer word.</Paragraph>
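Supervisor check 3 can be sketched as a single pass over the segmented word list that rejoins a renyo-form verb with a verb that follows it. The word representation and the POS and conjugation codes below are assumptions, not the paper's internal format:

```python
# Sketch of supervisor check 3: a renyo-form verb followed by another
# verb is rejoined into one compound verb. Words are
# (surface, pos, conjugation) triples; the codes are assumed.
def join_compounds(words):
    out = []
    for word in words:
        if (out and out[-1][1] == "verb" and out[-1][2] == "renyo"
                and word[1] == "verb"):
            prev = out.pop()
            # the compound keeps the conjugation of the second verb
            out.append((prev[0] + word[0], "verb", word[2]))
        else:
            out.append(word)
    return out
```

On the paper's test sentence 2, this pass turns the oversegmented /#ASOBI/SUGI/TA/ into the compound /#ASOBISUGI/ plus /TA/.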
      <Paragraph position="44"> Figure 10 shows the results of the supervisor program. In test sentence 1, the program at first segmented / ~ /L~/ ~ / ~ / as auxiliary verbs through the use of the table in Figure 5. But the supervisor program checks and corrects this string, and the classification program adds the information of verb to /~t~'~/, as can be seen in Figure 10.</Paragraph>
    <Paragraph position="45"> In test sentence 2, the program at first segmented it /#ASOBI/SUGI/TA /, but the supervisor program checked this and corrected this string to the compound word, /#ASOBISUGI/,plus /TA/.</Paragraph>
      <Paragraph position="46"> We can process Japanese sentences using these methods and obtain words and various information about these words. With this program we can obtain a rate of correct answers of approximately 90 percent. We should be able to improve this program at the level of the supervisor and the tables. However, we do not think that it will be possible to obtain 100 percent correct answers, because this system uses Japanese writing and the Japanese writing system is not 100 percent standardized. In addition, if we wish to produce a complete program, it is necessary to process on the basis of syntax and meaning. At present, this is not the object of our efforts.</Paragraph>
      <Paragraph position="47"> 5. Adding lexical information The National Language Research Institute has been investigating the vocabulary of modern Japanese since 1952, and has been using the computer in this research since 1966. As a result, some five million words are available as machine-readable data. This data contains various information such as word frequency, part of speech, class by word origin, and thesaurus number. The thesaurus, Bunrui goihyo in Japanese, was produced by Doctor Oki Hayashi. It contains about 38,000 words in the natural language of Japanese.</Paragraph>
  </Section>
  <Section position="4" start_page="13515" end_page="13515" type="metho">
    <SectionTitle>
6. Making the concordance
</SectionTitle>
    <Paragraph position="0"> We will not explain this program here since we have written a separate report about it (number 6 in the list of references below). Please refer to this report for further details.</Paragraph>
    <Paragraph position="1"> Figure 11 is the result of this process.</Paragraph>
  </Section>
  <Section position="5" start_page="13515" end_page="13515" type="metho">
    <SectionTitle>
Acknowledgements
</SectionTitle>
    <Paragraph position="0"> Professor Akio Tanaka developed this plan, made a prototype for automatic transliteration from Kanji to kana, and permitted us to use this program.</Paragraph>
    <Paragraph position="1"> Mr. Kiyoshi Egawa made a prototype for an automatic segmentation program and permitted us to use it. They also contributed to this study through our  discussions with them. Mr. Oki Hayashi furnished us with the opportunity to study this and provided his support for our efforts.</Paragraph>
  </Section>
  <Section position="6" start_page="13515" end_page="13515" type="metho">
    <SectionTitle>
WORD NUMBER / WORD / ROMANIZED JAPANESE / PARTS OF SPEECH / THESAURUS NUMBER
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"/>
  </Section>
</Paper>