Japanese Zero Pronoun Resolution based on Ranking Rules and Machine Learning

2 Methodology

In this paper, we combine heuristic ranking rules and machine learning. First, we describe how we extract possible antecedents (candidates). Second, we describe the rule-based ranking system and the machine learning system. Finally, we describe how to combine these two methods.

Following Seki and other studies, we consider only anaphors of noun phrases. We assume that zeros have already been detected. We also assume that zeros are located at the starting point of a bunsetsu that contains a yougen (a verb, an adjective, or an auxiliary verb). From now on, we use 'verb' instead of 'yougen' for readability. A zero's bunsetsu is the bunsetsu that contains the zero. We further assume that each zero's grammatical case has already been determined by a zero detector and is represented by the corresponding particle: if a zero is the subject of a verb, its case is represented by the particle ga; if it is an object, by wo; if it is an object2, by ni. We consider only these three cases. A zero's particle refers to this case particle.

Since complex sentences are hard to analyze, each sentence is automatically split at conjunctive postpositions (setsuzoku joshi) (Okumura and Tamura, 1996; Ehara and Kim, 1996). To distinguish the original complex sentence from the simpler sentences obtained by the split, we call the former simply a 'sentence' and the latter 'post-split sentences'. When a conjunctive postposition appears in a relative clause, we do not split the sentence at that position. In the examples below, we split the first sentence at 'and' but do not split the second sentence at 'and'.

She bought the book and sold it to him.
She bought the book that he wrote and sold.

A zero's sentence is the (original) sentence that contains the zero. From now on, φ stands for a zero and c stands for a candidate of φ's antecedent. φ's particle is denoted ZP, and CP stands for c's next word, which is either c's particle or a punctuation symbol.

2.1 Enumeration of possible antecedents

Candidates (possible antecedents) are enumerated on the fly by the following method.

1. We extract a content word sequence w_1 ... w_n followed by a case marker (kaku-joshi, e.g., ga, wo), a topic marker (wa or mo), or a period.

2. If c's w_n is a verb, an adjective, an auxiliary verb, an adverb, or a relative pronoun (ChaSen's meishi-hijiritsu, e.g., koto (what he did) and toki (when she married)), c is excluded. (If w_n is a closing quotation mark, w_{n-1} is checked instead.)

3. If c's w_n is a pronoun or an adverbial noun (a noun that can also be used as an adverb, i.e., ChaSen's meishi-fukushi-kanou), c is excluded.

4. If c is dou-shi (the person), it is replaced by the latest person name. If c is dou-sha (the company), it is replaced by the latest organization name. If c is dou+suffix, it is replaced by the latest candidate that has the same suffix. For this task, we use a named entity recognizer (Isozaki and Kazawa, 2002).

The first step extracts a content word sequence from a bunsetsu. The second step excludes verb phrases, adjective phrases, and clauses; as a result, we obtain only noun phrases. The third step excludes adverbial expressions like kotoshi (this year). The fourth step resolves anaphors like definite noun phrases in English. We should also resolve pronouns, but we did not because useful pronouns are rare in newspaper articles.
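To make the filtering concrete, the following is a minimal Python sketch of steps 2 and 3. The Morpheme type and the POS label strings are hypothetical stand-ins for ChaSen output; this illustrates the filtering logic and is not the authors' implementation.

```python
# A minimal sketch of candidate filtering (steps 2-3 above), not the
# authors' code. Morpheme and the POS labels are hypothetical stand-ins
# for ChaSen output.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Morpheme:
    surface: str
    pos: str  # e.g., "noun", "verb", "meishi-hijiritsu", "meishi-fukushi-kanou"

# Step 2: if w_n has one of these POS tags, the candidate is a verb phrase,
# an adjective phrase, or a clause, and is excluded.
EXCLUDED_LAST_POS = {"verb", "adjective", "auxiliary-verb", "adverb",
                     "meishi-hijiritsu"}
# Step 3: pronouns and adverbial nouns are excluded.
EXCLUDED_NOUN_POS = {"pronoun", "meishi-fukushi-kanou"}

CLOSING_QUOTES = {"\u300d", "\u2019", '"'}  # closing quotation marks

def filter_candidate(words: List[Morpheme]) -> Optional[List[Morpheme]]:
    """Apply steps 2-3 to a content word sequence w_1 .. w_n (step 1's output).

    Returns the sequence if it survives as a noun-phrase candidate, else None.
    """
    if not words:
        return None
    wn = words[-1]
    if wn.surface in CLOSING_QUOTES:  # closing quote: check w_{n-1} instead
        if len(words) < 2:
            return None
        wn = words[-2]
    if wn.pos in EXCLUDED_LAST_POS:   # step 2
        return None
    if wn.pos in EXCLUDED_NOUN_POS:   # step 3
        return None
    return words
```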
In addition, we register a resolved zero as a new candidate. If φ's antecedent is determined to be a candidate ĉ, we register a new candidate c′ whose particle is ZP and whose location is φ's location. In the training phase of the machine learning approach, we use the correct answer as ĉ. Then, we can remove far candidates from the list.

In this way, our zero resolver creates a 'general purpose' candidate list. However, some of the candidates are inappropriate for certain zeros. A verb usually does not have the same entity in two or more cases (Murata and Nagao, 1997). Therefore, our resolver excludes candidates that already fill other cases of the verb. When a verb has two or more zeros, we resolve ga first, and its best candidate is excluded from the candidates of wo and ni.

2.2 Ranking rules

Various heuristics have been reported in the literature. Here, we use the following heuristics.

1. Forward center ranking (Walker et al., 1994): topic > empathy > subject > object2 > object > others.

2. Property-sharing (Kameyama, 1986): if a zero is the subject of a verb, its antecedent is probably a subject in the antecedent's sentence; if a zero is an object, its antecedent is probably an object.

3. Semantic constraints (Yamura-Takei et al., 2002; Yoshino, 2001): if a zero is the subject of 'eat,' its antecedent is probably a person or an animal, and so on. We use Nihongo Goi Taikei (Ikehara et al., 1997), which has 14,730 English-to-Japanese translation patterns for 6,103 verbs, to check the acceptability of a candidate. Goi Taikei also has 300,000 words in about 3,000 semantic categories. (See Appendix A for details.)

4. Demotion of candidates in a relative clause (rentai shuushoku setsu): usually, Japanese zeros do not refer to noun phrases in relative clauses (Ehara and Kim, 1996). (See Appendix B for details.) Since sentences in newspaper articles are often complex and relative clauses are sometimes nested, we refine this rule in the following way.

- A candidate's relative clause is the innermost relative clause that contains the candidate.
- A relative clause finishes at the noun modified by the clause.
- If φ appears before the finishing noun of c's relative clause, the clause is still unfinished at φ. Otherwise, the clause has already finished.
- A quoted clause (with or without quotation marks " ") indicated by the quotation marker 'to' ('that' in 'He said that she is ...') is also regarded as a relative clause.
- We demote c after c's relative clause finishes.

It is not clear how to combine the above heuristics consistently. Here, we sort the candidates in lexicographical order based on features derived from the above heuristics. For instance, we can use the lexicographically increasing order defined by the tuple (Vi, Re, Ag, Di, Sa), where:

- Vi (for violation) is 1 if the candidate violates the semantic constraint. Otherwise, Vi is 0.
- Re (for relative) is 1 if the candidate is in a relative clause that has already finished before φ. Otherwise, Re is 0.
- Ag (for agreement) is 0 if CP = ZP holds. (Since most phrases marked by wa and mo are subjects, wa and mo are regarded as ga here.) Otherwise, Ag is 1.
- Di (for distance) is a non-negative integer representing the number of post-split sentences between c and φ. If a candidate's Di is larger than maxDi, the candidate is removed from the candidate list.
- Sa (for salience) is 0 if CP is wa, 1 if CP is ga, 2 if CP is ni, 3 if CP is wo, and 4 otherwise. We did not implement empathy because it makes the program more complex, and empathy verbs are rare in newspaper articles.

For instance, (0, 0, 0, 2, 1) < (1, 0, 0, 0, 0) holds. The first-ranked (lexicographically smallest) candidate is regarded as the best candidate. We employ lexicographical ordering because it seems the simplest way to rank candidates. We put Vi in first place because Vi was often regarded as a constraint in the past literature. We put Ag before Sa because Kameyama's method performed better than Walker's in Okumura and Tamura (1996). Therefore, (Vi, ..., Ag, ..., Sa, ...) is expected to be a good ordering, and the above ordering is an instance of it.
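Because tuples compare lexicographically in most languages, this ordering reduces to an ordinary sort. The following Python sketch assumes each candidate's raw feature values (the Goi Taikei check, the relative-clause status, CP, and the sentence distance) have already been computed; the Candidate record and the MAX_DI value are illustrative assumptions, not the authors' code.

```python
# A minimal sketch of the (Vi, Re, Ag, Di, Sa) ranking rule. Feature values
# are assumed to be precomputed; the Candidate record and MAX_DI value are
# illustrative, not the authors' implementation.
from dataclasses import dataclass
from typing import List

MAX_DI = 3  # hypothetical maxDi; the text does not fix its value here

SALIENCE = {"wa": 0, "ga": 1, "ni": 2, "wo": 3}  # Sa; any other CP -> 4

@dataclass
class Candidate:
    violates_semantics: bool    # Vi: fails the Goi Taikei check
    in_finished_relative: bool  # Re: its relative clause finished before the zero
    cp: str                     # CP: the particle/punctuation following c
    distance: int               # Di: post-split sentences between c and the zero

def rank(candidates: List[Candidate], zp: str) -> List[Candidate]:
    """Sort candidates so the lexicographically smallest tuple comes first."""
    def key(c: Candidate):
        cp = "ga" if c.cp in ("wa", "mo") else c.cp  # wa/mo mostly mark subjects
        return (int(c.violates_semantics),           # Vi
                int(c.in_finished_relative),         # Re
                0 if cp == zp else 1,                # Ag
                c.distance,                          # Di
                SALIENCE.get(c.cp, 4))               # Sa (uses the raw CP)
    kept = [c for c in candidates if c.distance <= MAX_DI]  # Di > maxDi: drop
    return sorted(kept, key=key)  # kept[0] after sorting is the best candidate
```

Python's built-in tuple comparison directly reproduces the example above: (0, 0, 0, 2, 1) < (1, 0, 0, 0, 0).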
2.3 Machine Learning

Although we can consider various other features for zero pronoun resolution, it is difficult to combine these features consistently. Therefore, we use machine learning. Support Vector Machines (SVMs) have shown good performance in various Natural Language Processing tasks (Kudo and Matsumoto, 2001; Isozaki and Kazawa, 2002; Hirao et al., 2002). Yoshino (2001) and Iida et al. (2003b) also applied SVMs to Japanese zero pronoun resolution, but the usefulness of each feature was not clear. Here, we add features for complex sentences and analyze useful features by examining their weights. We use the following features of c as well as CP.

CSem: c's semantic categories. (See Appendix A.)
CPPOS: CP's part-of-speech (POS) tags (rough and detailed).
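The excerpt ends partway through the feature list, but the two features shown are both categorical, which suggests the usual preparation for an SVM: expanding each category into a binary indicator. The following is a minimal sketch under that assumption; the encoding scheme and the feature-name strings are ours, not the paper's.

```python
# A minimal sketch of one-hot encoding categorical features (e.g., CSem,
# CPPOS) for an SVM. The encoding scheme is an assumption: this excerpt
# does not specify how features are vectorized.
from typing import Dict, List

def encode(csem: List[str], cppos_rough: str, cppos_detailed: str,
           index: Dict[str, int]) -> List[float]:
    """Map a candidate's categorical features to a binary vector.

    `index` maps feature names to vector positions; it would be built
    from the training data.
    """
    vec = [0.0] * len(index)
    names = [f"CSem={s}" for s in csem]  # a word may have several categories
    names.append(f"CPPOS-rough={cppos_rough}")
    names.append(f"CPPOS-detailed={cppos_detailed}")
    for name in names:
        pos = index.get(name)
        if pos is not None:              # unseen features are simply dropped
            vec[pos] = 1.0
    return vec
```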