<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1050"> <Title>Improving Japanese Zero Pronoun Resolution by Global Word Sense Disambiguation</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Resources </SectionTitle> <Paragraph position="0"> We consider that verb and noun senses correspond to case frames and semantic features defined in the NTT thesaurus, respectively. This section describes the NTT thesaurus and the case frames briefly.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 NTT thesaurus </SectionTitle> <Paragraph position="0"> NTT Communication Science Laboratories constructed a semantic feature tree, whose 3,000 nodes are semantic features, and a nominal dictionary containing about 300,000 nouns, each of which is given one or more appropriate semantic features. Figure 1 shows the upper levels of the semantic feature tree.</Paragraph> <Paragraph position="1"> The similarity between two words is defined by formula (1) in Appendix A.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Automatically constructed case frames </SectionTitle> <Paragraph position="0"> We employ the automatically constructed case frames (Kawahara and Kurohashi, 2002) as the basic resource for zero pronoun resolution and word sense disambiguation. This section outlines the method of constructing the case frames.</Paragraph> <Paragraph position="1"> The biggest problem in automatic case frame construction is verb sense ambiguity. Verbs which have different meanings should have different case frames, but it is hard to disambiguate verb senses precisely. To deal with this problem, predicate-argument examples which are collected from a large corpus are distinguished by coupling a verb and its closest case component. 
That is, examples are not distinguished by verbs (e.g.</Paragraph> <Paragraph position="2"> &quot;tsumu&quot; (load/accumulate)), but by couples (e.g. &quot;nimotsu-wo tsumu&quot; (load baggage) and &quot;keiken-wo tsumu&quot; (accumulate experience)). This process, however, produces separate case frames that have almost the same meaning or usage.</Paragraph> <Paragraph position="3"> For example, &quot;nimotsu-wo tsumu&quot; (load baggage) and &quot;busshi-wo tsumu&quot; (load supplies) are similar, but have separate case frames. To cope with this problem, the case frames are clustered. Example words are collected for each case marker, such as &quot;ga&quot;, &quot;wo&quot;, &quot;ni&quot; and &quot;kara&quot;. These are case-marking postpositions in Japanese, and usually mark the nominative, accusative, dative and ablative, respectively. We call such a case marker a 'case slot' and the example words in a case slot 'case examples'.</Paragraph> <Paragraph position="4"> Case examples in a case slot are similar, but have some incorrect semantic features because of word sense ambiguity. For instance, &quot;nimotsu&quot; (baggage), &quot;busshi&quot; (supplies) and &quot;nisemono&quot; (imitation) are gathered in a case slot, and all of them are below the semantic feature <goods>. On the other hand, &quot;nisemono&quot; also belongs to <lie>. <lie> is incorrect for this case slot, and may cause errors in case analysis.</Paragraph> <Paragraph position="5"> We delete a semantic feature that is not similar to the other semantic features of its case slot.</Paragraph> <Paragraph position="6"> To sum up, the procedure for the automatic case frame construction is as follows.</Paragraph> <Paragraph position="7"> 1. A large raw corpus is parsed by the Japanese parser, KNP (Kurohashi and Nagao, 1994b), and reliable predicate-argument examples are extracted from the parse results.</Paragraph> <Paragraph position="8"> 2. 
The extracted examples are bundled according to the verb and its closest case component, making initial case frames.</Paragraph> <Paragraph position="9"> 3. The initial case frames are clustered using a similarity measure function. This similarity is calculated by formula (5) in Appendix B.</Paragraph> <Paragraph position="10"> 4. For each case slot of clustered case frames, an inappropriate semantic feature that is not similar to the other semantic features is discarded.</Paragraph> <Paragraph position="11"> We constructed two sets of case frames: one for the newspaper domain and one for the cooking domain.</Paragraph> <Paragraph position="12"> The newspaper case frames are constructed from about 21,000,000 sentences of newspaper articles spanning 20 years (9 years of Mainichi newspaper and 11 years of Nihonkeizai newspaper). They cover 23,000 verbs, and the average number of case frames for a verb is 14.5.</Paragraph> <Paragraph position="13"> The cooking case frames are constructed from about 5,000,000 sentences in the cooking domain that are collected from the WWW. They cover 5,600 verbs, and the average number of case frames for a verb is 6.8.</Paragraph> <Paragraph position="14"> In Table 1, some examples of the resulting case frames are shown. In this table, 'CS' means a case slot. <agent> in the table is a generalized case example, which is given to the case slot where half of the case examples belong to <agent>. 
<agent> is also given to a &quot;ga&quot; case slot that has no case examples, because &quot;ga&quot; case components are often omitted, but &quot;ga&quot;</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Resolution System </SectionTitle> <Paragraph position="0"> We have proposed a Japanese zero pronoun resolution system using the case frames, antecedent preference orders, and a machine learning technique (Kawahara and Kurohashi, 2004).</Paragraph> <Paragraph position="1"> Its procedure is as follows.</Paragraph> <Paragraph position="2"> 1. Parse an input sentence using the Japanese parser, KNP.</Paragraph> <Paragraph position="3"> 2. Process each verb in the sentence from left to right by the following steps.</Paragraph> <Paragraph position="4"> case frame | CS | case examples+
youritsu (1) (support) | ga | <agent>, group, party, ...
youritsu (1) (support) | wo | <agent>, candidate, applicant
youritsu (1) (support) | ni | <agent>, district, election, ...
youritsu (2) (support) | ga | <agent>
youritsu (2) (support) | wo | <agent>, member, minister, ...
youritsu (2) (support) | ni | <agent>, candidate, successor
... | ... | ...</Paragraph> <Paragraph position="5">
orosu (1) (grate) | ga | <agent>
orosu (1) (grate) | wo | radish
orosu (2) (withdraw) | ga | <agent>
orosu (2) (withdraw) | wo | money
orosu (2) (withdraw) | kara | bank, post
... | ... | ...</Paragraph> <Paragraph position="6">
itadaku (1) (have) | ga | <agent>
itadaku (1) (have) | wo | soup
itadaku (2) (be given) | ga | <agent>
itadaku (2) (be given) | wo | advice, instruction, address
itadaku (2) (be given) | kara | <agent>, president, circle, ...
... | ... | ...</Paragraph> <Paragraph position="7">
+case examples are expressed only in English for space limitation.</Paragraph> <Paragraph position="8"> 2.1. Narrow the case frames down to those corresponding to the verb and its closest case component.</Paragraph> <Paragraph position="9"> 2.2. Perform the following processes for each case frame of the target verb. i. Match each input case component with an appropriate case slot of the case frame. Regard case slots that have no correspondence as zero pronouns.</Paragraph> <Paragraph position="10"> ii. 
Estimate an antecedent of each zero pronoun.</Paragraph> <Paragraph position="11"> 2.3. Select the case frame with the highest total score, and output the analysis result for that case frame.</Paragraph> <Paragraph position="12"> The rest of this section describes the above steps (2.1), (2.2.i) and (2.2.ii) in detail.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Narrowing down case frames </SectionTitle> <Paragraph position="0"> The closest case component plays an important role in determining the usage of a verb. In particular, when the closest case is &quot;wo&quot; or &quot;ni&quot;, this trend is clear-cut. In addition, an expression whose nominative belongs to <agent> (e.g.</Paragraph> <Paragraph position="1"> &quot;<agent> has accomplished&quot;), does not provide enough clues to decide its usage, namely its case frame. Considering these aspects, we impose the following conditions on narrowing down case frames.</Paragraph> <Paragraph position="2"> not belong to the semantic marker <agent>.</Paragraph> <Paragraph position="3"> * A case frame with the closest case exists, and the similarity between the closest case component and examples in the closest case exceeds a threshold.</Paragraph> <Paragraph position="4"> We choose the case frames whose similarity is the highest. If the above conditions are not satisfied, case frames are not narrowed down, and the subsequent processes are performed for each case frame of the target verb. The similarity used here is defined as the best similarity between the closest case component and examples in the case slot. The similarity between two examples is defined by formula (1) in Appendix A.</Paragraph> <Paragraph position="5"> Let us consider &quot;youritsu&quot; (support) in the second sentence of Figure 2. &quot;youritsu&quot; has the case frames shown in Table 1. 
The input expression &quot;kouho-wo youritsu&quot; (support a candidate) satisfies the above two conditions, and the case frame &quot;youritsu (1)&quot; meets the last condition. Accordingly, this case frame is selected.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Matching input case components with case slots in the case frame </SectionTitle> <Paragraph position="0"> We match case components of the target verb with case slots in the case frame (Kurohashi and Nagao, 1994a). When a case component has a case marker, it must be assigned to the case slot with the same case marker. When a case component is a topic marked phrase or a clausal modifiee, which does not have a case marker, it can be assigned to one of the case slots in the following table.</Paragraph> <Paragraph position="1">
topic marked phrases : ga, wo, ga2
clausal modifiees : ga, wo, non-gapping
The conditions above may produce multiple matching patterns. In this case, the one that has the best score is selected. The score of a matching pattern is defined as the sum of similarities of case assignments. This similarity is calculated in the same way as described in Section 3.1. The result of case analysis tells whether zero pronouns exist. That is, vacant case slots in the case frame, which have no correspondence with the input case components, mean zero pronouns. In this paper, we concentrate on three case slots: &quot;ga&quot;, &quot;wo&quot;, and &quot;ni&quot;. In the case of &quot;youritsu&quot; (support) in Figure 2 and the selected case frame &quot;youritsu (1)&quot;, the &quot;wo&quot; case slot has a corresponding case component, but the &quot;ga&quot; and &quot;ni&quot; case slots are vacant. 
Accordingly, two zero pronouns are identified in the &quot;ga&quot; and &quot;ni&quot; cases of &quot;youritsu&quot;.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Antecedent estimation </SectionTitle> <Paragraph position="0"> The antecedents of the detected zero pronouns are estimated. Possible antecedents are examined according to the antecedent preference order (Kawahara and Kurohashi, 2004). If a possible antecedent is classified as positive by a binary classifier and its similarity to examples in its case slot exceeds a threshold, it is determined to be the antecedent.</Paragraph> <Paragraph position="1"> For example, &quot;youritsu&quot; (support) in Figure 2 has zero pronouns in the &quot;ga&quot; and &quot;ni&quot; cases. The ordered possible antecedents for &quot;ga&quot; are L7:&quot;Minsyutou&quot;, L14:&quot;Jimintou&quot;(ph ga), L14:&quot;Ishihara chiji&quot;(ph wo), .... The first candidate &quot;Minsyutou (similarity:0.73)&quot;, which is labeled as positive by the classifier, and whose similarity to the case frame examples exceeds a threshold (0.60), is determined to be the antecedent.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Global Word Sense Disambiguation </SectionTitle> <Paragraph position="0"> We integrate a global method of word sense disambiguation into the zero pronoun resolution system described in the previous section. The word sense disambiguation is applied to verbs and nouns based on the case frames. Furthermore, the word sense disambiguation results are cached and applied globally in the subsequent analyses, based on the one-sense-per-discourse heuristic. 
In the rest of this section, we describe verb and noun sense disambiguation in turn, using examples from the cooking domain.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Verb sense disambiguation </SectionTitle> <Paragraph position="0"> The case frames are specific enough to capture the diversity of verb senses, because their meanings are distinguished by the couple of the verb and its closest case component (Kawahara and Kurohashi, 2002). We regard the process of verb sense disambiguation as the case frame selection (Step (2.3) described in Section 3). In addition, the verb sense disambiguation results are cached and applied globally in the same text. In other words, the selected case frames are cached for each verb, and only the case frames that are similar to the cache are used when the same verb appears later in the same text. The similarity measure for two case frames is given in Appendix B, and the threshold is set to 0.60 empirically.</Paragraph> <Paragraph position="1"> Here is an example article that consists of three sentences.</Paragraph> <Paragraph position="2"> For &quot;oroshite&quot; (grate) in the first sentence, the case frame &quot;orosu (1)&quot; (in Table 1), which means &quot;grate radish&quot;, is selected, because the closest case component &quot;kabura&quot; (turnip) exists, and is very similar to the &quot;wo&quot; case example &quot;daikon&quot; (radish). This selected case frame is cached for the verb &quot;orosu&quot;. For &quot;oroshi&quot; (grate) in the third sentence, case frames are not narrowed down for lack of a closest case component. The previous system performs the antecedent estimation process for all the case frames of &quot;orosu&quot;, and incorrectly estimates the antecedent of the &quot;wo&quot; zero pronoun as &quot;oroshi-gane&quot; (grater)++. On the other hand, our proposed method deals only with the case frames similar to the cached &quot;orosu (1)&quot;. 
That is, the case frame &quot;orosu (2)&quot;, which means &quot;withdraw money from bank or post&quot;, is not similar to &quot;orosu (1)&quot;, and is not used. Accordingly, the system correctly estimates the antecedent of the &quot;wo&quot; zero pronoun as &quot;kabura&quot; (turnip).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Noun sense disambiguation </SectionTitle> <Paragraph position="0"> We define the process of noun sense disambiguation as selecting an appropriate semantic feature from the ones given to a noun in the NTT thesaurus. This process is performed based on the matching of the input case components and the case frame selected in step (2.3) described in Section 3. For each input case component, its semantic features are matched against those of the case examples of its corresponding case slot, and the best matched one is selected. In addition, this disambiguation result is applied globally like the verb sense disambiguation. The determined semantic feature is cached for each noun, and is given to the same noun when it appears later in the same text, instead of reconsidering all of its semantic features. Here is an example article.</Paragraph> <Paragraph position="1"> tori-mashita.</Paragraph> <Paragraph position="2"> prepare (We prepared real stock.) &quot;itadaki&quot; (have) in the first sentence has the closest case component &quot;osumashi&quot; (clear soup), and the case frame &quot;itadaku (1)&quot; (in Table 1) is selected, because its &quot;wo&quot; case example &quot;soup&quot; is very similar to &quot;osumashi&quot;. In the NTT thesaurus, &quot;osumashi&quot; (clear soup) has three semantic features: <soup>, <look> and <eccentric>. 
<eccentric> is located below <agent> in the thesaurus, and the previous system incorrectly estimates the antecedents of &quot;ga&quot; zero pronouns of the following verbs as &quot;osumashi&quot; (because almost all the case frames have <agent> in their &quot;ga&quot; case slots). In our approach, each of the semantic features is matched against the case example &quot;soup&quot;, and only the best matched semantic feature <soup> is given to &quot;osumashi&quot;.</Paragraph> <Paragraph position="3"> ++In Japanese, &quot;gane&quot; of &quot;oroshi-gane&quot; (grater) exactly matches &quot;kane&quot; (money), the &quot;wo&quot; case example of &quot;orosu (2)&quot;.</Paragraph> </Section> </Section> </Paper>
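The caching behaviour described in Sections 4.1 and 4.2 can be illustrated with a minimal Python sketch. Everything named here (CaseFrame, SenseCache, toy_similarity) is hypothetical and not taken from the paper. In particular, toy_similarity is a simple Jaccard overlap used only as a stand-in for the case frame similarity of Appendix B, which is not reproduced in this section, and keeping a list of cached frames per verb is an assumption. Only the 0.60 threshold and the example case frames (English glosses from Table 1) come from the text.

```python
from dataclasses import dataclass

THRESHOLD = 0.60  # cache similarity threshold reported in Section 4.1


@dataclass
class CaseFrame:
    """Toy case frame: a label plus example words for each case slot."""
    name: str
    case_slots: dict  # case marker -> set of case examples


def toy_similarity(a, b):
    """Hypothetical stand-in for the case frame similarity of Appendix B.

    Jaccard overlap of (case marker, case example) pairs. This is NOT the
    paper's formula, only a placeholder that makes the sketch runnable.
    """
    pairs_a = {(cs, w) for cs, words in a.case_slots.items() for w in words}
    pairs_b = {(cs, w) for cs, words in b.case_slots.items() for w in words}
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 0.0


class SenseCache:
    """Per-text cache of verb senses (case frames) and noun senses (features)."""

    def __init__(self, similarity=toy_similarity, threshold=THRESHOLD):
        self.similarity = similarity
        self.threshold = threshold
        self.verb_frames = {}    # verb -> case frames selected so far (assumption)
        self.noun_features = {}  # noun -> semantic feature selected so far

    def candidate_frames(self, verb, frames):
        """Keep frames similar to a cached one; keep all if nothing is cached."""
        cached = self.verb_frames.get(verb)
        if not cached:
            return list(frames)
        return [f for f in frames
                if any(self.similarity(f, c) >= self.threshold for c in cached)]

    def remember_frame(self, verb, frame):
        self.verb_frames.setdefault(verb, []).append(frame)

    def candidate_features(self, noun, features):
        """Reuse the cached feature instead of reconsidering all features."""
        cached = self.noun_features.get(noun)
        return [cached] if cached is not None else list(features)

    def remember_feature(self, noun, feature):
        self.noun_features.setdefault(noun, feature)


# Case frames built from the English glosses in Table 1.
OROSU_1 = CaseFrame("orosu (1)", {"ga": {"<agent>"}, "wo": {"radish"}})
OROSU_2 = CaseFrame("orosu (2)", {"ga": {"<agent>"}, "wo": {"money"},
                                  "kara": {"bank", "post"}})

cache = SenseCache()
cache.remember_frame("orosu", OROSU_1)  # frame selected in the first sentence
later = cache.candidate_frames("orosu", [OROSU_1, OROSU_2])
print([f.name for f in later])

cache.remember_feature("osumashi", "<soup>")
print(cache.candidate_features("osumashi", ["<soup>", "<look>", "<eccentric>"]))
```

Under the toy similarity, "orosu (2)" shares only the <agent> example with the cached "orosu (1)" and falls below the threshold, so only "orosu (1)" remains a candidate for the later occurrence of the verb. A real implementation would plug in the similarity of Appendix B and would need its own policy for the case where no frame passes the threshold, which this section does not specify.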