<?xml version="1.0" standalone="yes"?> <Paper uid="E99-1024"> <Title>Detection of Japanese Homophone Errors by a Decision List Including a Written Word as a Default Evidence</Title> <Section position="3" start_page="0" end_page="183" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In this paper, we propose a method of detecting homophone errors in Japanese texts.</Paragraph> <Paragraph position="1"> Our method is based on the decision list proposed by Yarowsky (Yarowsky, 1994; Yarowsky, 1995). We improve the original decision list by using the written word as the default evidence. The improved decision list raises the F-measure of error detection. Most Japanese texts are written using Japanese word processors. To input a word composed of kanji characters, we first input the phonetic hiragana sequence for the word, and then convert it to the desired kanji sequence. However, the conversion generally produces multiple candidate kanji sequences, and we must then choose the correct one. Therefore, Japanese texts suffer from homophone errors caused by incorrect choices. Careless choice is not the only cause of homophone errors; ignorance of the difference between homophone words is also a serious cause. For example, many Japanese are not aware of the difference between '意思' and '意志', or between '直感' and '直観' (footnote 1).</Paragraph> <Paragraph position="2"> In this paper, we define the term homophone set as a set of words consisting of kanji characters that have the same phone (footnote 2). Then, we define the term homophone word as a word in a homophone set. For example, the set { 確率 (probability), 確立 (establishment) } is a homophone set because the words in the set are composed of kanji characters that have the same phone 'ka-ku-ri-tu'.</Paragraph> <Paragraph position="3"> Thus, '確率' and '確立' are homophone words. In this paper, we name the problem of choosing the correct word from the homophone set the homophone problem. 
In order to detect homophone errors, we make a list of homophone sets in advance, find a homophone word in the text, and then solve the homophone problem for the homophone word.</Paragraph> <Paragraph position="4"> Many methods of solving the homophone problem have been proposed (Tochinai et al., 1986; Ibuki et al., 1997; Oku and Matsuoka, 1997; Oku, 1994; Wakita and Kaneko, 1996). However, they are specific to the homophone problem; that is, they are heuristic methods. On the other hand, the homophone problem is equivalent to the word sense disambiguation problem if the phone of the homophone word is regarded as the word, and the homophone word as the sense. Therefore, we can solve the homophone problem by using various statistical methods proposed for the word sense disambiguation problem (Fujii, 1998).</Paragraph> <Paragraph position="5"> Footnote 1: '意思' and '意志' have the same phone 'i-shi'. '意思' means a general will, and '意志' means a strong positive will. '直感' and '直観' have the same phone 'cho-kkan'. '直感' means an intuition through a feeling, and '直観' means an intuition through latent knowledge. Footnote 2: We ignore differences of accent, stress and part of speech. That is, a homophone set is a set of words having the same expression in hiragana characters. Proceedings of EACL '99</Paragraph> <Paragraph position="6"> Take the case of context-sensitive spelling error detection (footnote 3), which is equivalent to the homophone problem.</Paragraph> <Paragraph position="7"> For that problem, some statistical methods have been applied successfully (Golding, 1995; Golding and Schabes, 1996). Hence, statistical methods are certainly valid for the homophone problem. In particular, the decision list is valid for the homophone problem (Shinnou, 1998). The decision list arranges the evidences for identifying the word sense in order of their strength in identifying the sense. 
The word sense is judged by the evidence in the context that has the highest identifying strength. Although the homophone problem is equivalent to the word sense disambiguation problem, the former differs distinctly from the latter.</Paragraph> <Paragraph position="8"> In the homophone problem, almost all of the answers are given correctly, because almost all of the expressions written in the given text are correct.</Paragraph> <Paragraph position="9"> It is difficult to decide whether 'crane' means 'crane of animal' or 'crane of tool'. However, it is almost always right that the correct expression of '運航' in a text is not '運行' but '運航'. In the homophone problem, choosing the written word yields high precision. We should use this information. However, the method of always choosing the written word is useless for error detection because it detects no errors at all. A method for the homophone problem should be evaluated by the precision and the recall of error detection. In this paper, we evaluate it by the F-measure, which combines the precision and the recall, and use the written word to raise the F-measure of the original decision list.</Paragraph> <Paragraph position="10"> We use the written word as an evidence of the decision list. The problem is how much strength to give to that evidence. If the strength is high, the precision rises but the recall drops. On the other hand, if the strength is low, the decision list is not improved. In this paper, we calculate the strength that gives the maximum F-measure on a training corpus. 
As a result, our decision list raises the F-measure of error detection.</Paragraph> <Paragraph position="11"> 2 Homophone disambiguation by a decision list In this section, we describe how to construct the decision list and how to apply it to the homophone problem.</Paragraph> <Paragraph position="12"> Footnote 3: For example, confusion between 'peace' and 'piece', or between 'quiet' and 'quite', is a context-sensitive spelling error.</Paragraph> <Section position="1" start_page="180" end_page="181" type="sub_section"> <SectionTitle> 2.1 Construction of the decision list </SectionTitle> <Paragraph position="0"> The decision list is constructed by the following steps.</Paragraph> <Paragraph position="1"> step 1 Prepare homophone sets.</Paragraph> <Paragraph position="2"> In this paper, we use the 12 homophone sets shown in Table 1, which consist of homophone words that tend to be mis-chosen.</Paragraph> <Paragraph position="4"> step 2 Set context information, i.e. evidences, to identify the homophone word (denoted H below).</Paragraph> <Paragraph position="5"> We use the following three kinds of evidence.</Paragraph> <Paragraph position="6"> * word (w) in front of H: expressed as w-. * word (w) behind H: expressed as w+. * jiritu words (footnote 4) surrounding H: we pick up the nearest three jiritu words in front of and behind H respectively, and express them as w±3.</Paragraph> <Paragraph position="7"> step 3 Derive the frequency frq(wi, ej) of the collocation between the homophone word wi in the homophone set {w1, w2, ..., wn} and the evidence ej, by using a training corpus. For example, let us consider the homophone set { 運航 (running (of a ship, etc.)), 運行 (running (of a train, etc.)) } and the following two Japanese sentences.</Paragraph> <Paragraph position="8"> Sentence 1 [Japanese text lost in extraction] (A west wind of 3 m/s did not prevent the plane from flying.) Footnote 4: [...] and adjectives are examples.</Paragraph> <Paragraph position="10"> Sentence 2 [Japanese text lost in extraction] (Running hours in the early morning and during the night were shortened.) 
From sentence 1, we can extract the following evidences for the word '運航': [evidence list lost in extraction] and from sentence 2, we can extract the following evidences for the word '運行': [Japanese evidence strings lost in extraction; one of the form w+, one of the form w-, and the rest of the form w±3]. step 4 Define the strength est(wi, ej) of estimating that the homophone word wi is correct given the evidence ej:</Paragraph> <Paragraph position="12"> where P(wi | ej) is approximately calculated by: $P(w_i \mid e_j) = \frac{frq(w_i, e_j) + \alpha}{\sum_k frq(w_k, e_j) + \alpha}$. The constant α in the above expression is included to avoid the unsatisfactory case of frq(wi, ej) = 0. In this paper, we set α = 0.15 (footnote 5). We also use the special evidence default; frq(wi, default) is defined as the frequency of wi.</Paragraph> <Paragraph position="13"> step 5 Pick the highest strength est(wk, ej) among {est(w1, ej), est(w2, ej), ..., est(wn, ej)}, and set the word wk as the answer for the evidence ej. In this case, the identifying strength is est(wk, ej).</Paragraph> <Paragraph position="14"> Footnote 5: As in this paper, the addition of a small value is an easy and effective way to avoid the unsatisfactory case, as shown in (Yarowsky, 1994).</Paragraph> <Paragraph position="15"> For example, by steps 4 and 5 we can construct the list shown in Table 2.</Paragraph> <Paragraph position="16"> step 6 Fix the answer wkj for each ej and sort the evidences in descending order of their identifying strengths est(wkj, ej), but remove from the list every evidence whose identifying strength is less than the identifying strength est(wkj, default) of the evidence default. 
This is the decision list.</Paragraph> <Paragraph position="17"> After step 6, we obtain the decision list for the homophone set { 運航, 運行 } as shown in Table 3.</Paragraph> <Paragraph position="19"/> </Section> <Section position="2" start_page="181" end_page="182" type="sub_section"> <SectionTitle> 2.2 Solving by a decision list </SectionTitle> <Paragraph position="0"> In order to solve the homophone problem by the decision list, we first find the homophone word w in the given text, and then extract the set of evidences E for the word w from the text:</Paragraph> <Paragraph position="2"> Next, picking up the evidences from the decision list for the homophone set of the homophone word w in order of rank, we check whether each evidence is in the set E. If the evidence ej is in the set E, the answer wkj for ej is judged to be the correct expression for the homophone word w. If wkj is equal to w, w is judged to be correct; if it is not, w is flagged as a possible error for wkj.</Paragraph> <Paragraph position="3"> 3 Use of the written word In this section, we describe the use of the written word in the homophone problem and how to incorporate it into the decision list.</Paragraph> </Section> <Section position="3" start_page="182" end_page="182" type="sub_section"> <SectionTitle> 3.1 Evaluation of error detection systems </SectionTitle> <Paragraph position="0"> As described in the Introduction, the written word cannot be used in the word sense disambiguation problem, but it is useful for solving homophone problems. The method used for the homophone problem is trivial if it is evaluated by the precision of distinction, defined by the following formula: $\frac{\text{number of correct discriminations}}{\text{number of all discriminations}}$. That is, if the expression is '運航' (or '運行'), then we should clearly choose the word '運航' (or the word '運行') from the homophone set { 運航, 運行 }. 
This distinction method probably has better precision than any method for the word sense disambiguation problem. However, this method is useless because it does not detect errors at all.</Paragraph> <Paragraph position="1"> The method for the homophone problem should be evaluated from the standpoint not of error discrimination but of error detection. In this paper, we use the F-measure (Eq.1) to combine the precision P and the recall R, defined as follows: $P = \frac{\text{number of real errors in detected errors}}{\text{number of detected errors}}$, $R = \frac{\text{number of real errors in detected errors}}{\text{number of errors in the text}}$.</Paragraph> <Paragraph position="3"/> </Section> <Section position="4" start_page="182" end_page="182" type="sub_section"> <SectionTitle> 3.2 Use of the identifying strength of the written word </SectionTitle> <Paragraph position="0"> The distinction method of choosing the written word is useless, but it has a very high precision of error discrimination. Thus, it is valid to use this method where it is difficult to use context to solve the homophone problem.</Paragraph> <Paragraph position="1"> The question is when to stop using the decision from context and use the written word. In this paper, we regard the written word as a kind of evidence on context, and give it an identifying strength. Consequently we can use the written word in the decision list.</Paragraph> </Section> <Section position="5" start_page="182" end_page="183" type="sub_section"> <SectionTitle> 3.3 Calculation of the identifying strength of the written word </SectionTitle> <Paragraph position="0"> First, let z be the identifying strength of the written word. We name the set of evidences with higher identifying strength than z the set α, and the set of evidences with lower identifying strength than z the set β. Let T be the number of homophone problems for a homophone set. We solve them by the original decision list DL0. 
Let G (or H) be the ratio of the number of homophone problems judged by α (or β) to T. Let g (or h) be the precision of α (or β), and p be the occurrence probability of the homophone error.</Paragraph> <Paragraph position="1"> The number of problems judged by α in which the written word is correct is as follows: $GT(1 - p)$ (2)</Paragraph> <Paragraph position="3"> and the number of problems judged by α in which the written word is incorrect is as follows: $GTp$. (3) The numbers of problems detected as errors in Eq.2 and Eq.3 are GT(1 - p)(1 - g) and GTpg respectively. Thus, the number of problems detected as errors by α is as follows: $GT\{(1 - p)(1 - g) + pg\}$ (4)</Paragraph> <Paragraph position="5"> In the same way, the number of problems detected as errors by β is as follows: $HT\{(1 - p)(1 - h) + ph\}$ (5)</Paragraph> <Paragraph position="7"> Consequently the total number of problems detected as errors is as follows: $T[G\{(1 - p)(1 - g) + pg\} + H\{(1 - p)(1 - h) + ph\}]$ (6)</Paragraph> <Paragraph position="9"> The number of correct detections in Eq.6 is Tp(Gg + Hh). Therefore the precision P0 is as follows: $P_0 = \frac{p(Gg + Hh)}{G\{(1 - p)(1 - g) + pg\} + H\{(1 - p)(1 - h) + ph\}}$ (7)</Paragraph> <Paragraph position="11"> Because the number of real errors in T is Tp, the recall R0 is Gg + Hh. By using P0 and R0, we can get the F-measure F0 of DL0 by Eq.1.</Paragraph> <Paragraph position="12"> Next, we construct the decision list incorporating the written word into DL0. We name this decision list DL1. In DL1, we use the written word to solve problems which we cannot judge by α. That is, DL1 is the decision list that attaches the written word as the default evidence to α (see Fig.1).</Paragraph> <Paragraph position="13"> Next, we calculate the precision and the recall of DL1. Because α of DL1 is the same as that of DL0, the number of problems detected as errors by α is given by Eq.4. In the case of DL1, problems judged by β of DL0 are judged by the written word. Therefore, we detect no error from these problems.</Paragraph> <Paragraph position="14"> As a result, the number of problems detected as errors by DL1 is given by Eq.4, and the number of real errors in these detections is TGpg. 
Therefore, the precision P1 of DL1 is as follows: $P_1 = \frac{pg}{(1 - p)(1 - g) + pg}$ (8)</Paragraph> <Paragraph position="16"> Because the total number of errors is Tp, the recall R1 of DL1 is Gg. By using P1 and R1, we can get the F-measure F1 of DL1 by Eq.1.</Paragraph> <Paragraph position="17"> Finally, we define the identifying strength z: z is the value that yields the maximum F1 under the condition F1 > F0. However, theoretical calculation alone cannot give z, because p is unknown, and the functions G, H, g, and h (of z) are also unknown.</Paragraph> <Paragraph position="18"> In this paper, we set p = 0.05, and get the values of G, H, g, and h by using the training corpus, which is the resource used to construct the original decision list DL0. Take the case of the homophone set { 運航, 運行 }. For this homophone set, we try to get the values of G, H, g, and h. The training corpus has 2,890 sentences which include the word '運航' or the word '運行'. These 2,890 sentences are homophone problems for that homophone set. The identifying strength of DL0 for this homophone set ranges from 0.046 to 9.453, as shown in Table 3.</Paragraph> <Paragraph position="19"> Next we give z a value. For example, we set z = 2.5. In this case, the number of problems judged by α is 1,631, and the number of correct judgments among them is 1,593. Thus, G = 1631/2890 = 0.564 and g = 1593/1631 = 0.977. In the same way, under this assumption z = 2.5, the number of problems judged by β is 1,259, and the number of correct judgments among them is 854. Thus, H = 1259/2890 = 0.436 and h = 854/1259 = 0.678.</Paragraph> <Paragraph position="21"> [...] when z varies from 0.0 to 10.0 in units of 0.1. By choosing the maximum value of F1 in Fig.4, we can get the desired z. In this homophone set, we obtain z = 3.0.</Paragraph> </Section> </Section> </Paper>