File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/04/c04-1074_evalu.xml
Size: 9,168 bytes
Last Modified: 2025-10-06 13:59:02
<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1074"> <Title>Optimizing Algorithms for Pronoun Resolution</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Algorithms and Evaluation </SectionTitle> <Paragraph position="0"> In this section, we consider the individual approaches in more detail; in particular, we look at their choice of factors and their strategy for modeling factor interaction. According to interaction potential, we distinguish three classes of approaches: Serialization, Weighting, and Machine Learning.</Paragraph> <Paragraph position="1"> We re-implemented some of the algorithms described in the literature and evaluated them on syntactically ideal and realistic German input. (Footnote 3: A reviewer points out that most of the algorithms were proposed for English, where they most likely perform better. However, the algorithms also incorporate a theory of saliency, which should be language-independent.) Evaluation results are listed in Table 2.</Paragraph> <Paragraph position="2"> With the ideal treebank input, we also assumed ideal input for the factors dependent on previous anaphora resolution results. With realistic parsed input, we fed the results of the actual system back into the computation of such factors.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Serialization Approaches </SectionTitle> <Paragraph position="0"> Algorithmic approaches first apply filters unconditionally; possible exceptions are deemed nonexistent or negligible. With regard to the interaction of preferences, many algorithms (Hobbs, 1978; Strube, 1998; Tetreault, 2001) subscribe to a scheme which, though completely rigid, performs surprisingly well: the chosen preferences are applied one after the other in a pre-defined order. Applying a preference consists in selecting those of the still-available antecedents that are ranked highest in the preference order.</Paragraph> <Paragraph position="1"> Hobbs's (1978) algorithm is essentially a concatenation of the preferences Sentence Recency (without cataphora), Common Path, Depth of Embedding, and left-to-right Surface Order. It also implements the binding constraints by disallowing siblings of the anaphor in a clause or NP as antecedents. As with Lappin and Leass (1994), we replaced this implementation with our own mechanism for checking binding constraints, which raised the success rate.</Paragraph> <Paragraph position="2"> The Left-Right Centering algorithm of Tetreault (1999) is similar to Hobbs's algorithm; it is composed of the preferences Sentence Recency (without cataphora), Depth of Embedding, and left-to-right Surface Order. Since it is a centering approach, it only inspects the current and the previous sentence.</Paragraph> <Paragraph position="3"> Strube's (1998) S-list algorithm is also restricted to the current and the previous sentence. Predicative complements and NPs in direct speech are excluded as antecedents. The primary ordering criterion is Information Status, followed by Sentence Recency (without cataphora) and left-to-right Surface Order.</Paragraph> <Paragraph position="4"> Since serialization provides a rather rigid frame, we conducted an experiment to find the best-performing combination of pronoun resolution factors on the treebank input and the best combination on the parsed input. For this purpose, we checked all permutations of the preferences and then removed preferences from the best-performing combinations until performance degraded (greedy descent). Greedy descent outperformed hill-climbing. The completely annotated 6.7% of the corpus was used as the development set, the rest as the test set.</Paragraph> </Section>
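The serialization scheme and the greedy-descent search can be made concrete with a short sketch. The Python code below is illustrative only: the rank-returning preference functions and the evaluate scorer (assumed to run the resolver over the development set and return its success rate) are our assumptions, not the re-implementation described above.

```python
from itertools import permutations

def apply_preference(candidates, preference):
    """Keep only the still-available candidates that the preference
    ranks highest (lower rank = more preferred)."""
    best = min(preference(c) for c in candidates)
    return [c for c in candidates if preference(c) == best]

def resolve(candidates, preferences):
    """Serialization: apply the preferences one after the other in a
    fixed, pre-defined order until one antecedent is left."""
    for pref in preferences:
        if len(candidates) <= 1:
            break
        candidates = apply_preference(candidates, pref)
    return candidates[0] if candidates else None

def greedy_descent(preferences, evaluate):
    """Start from the best-performing permutation of preferences, then
    keep removing single preferences as long as performance does not
    degrade. `evaluate` is a hypothetical scorer over the dev set that
    accepts any sequence of preferences."""
    best = list(max(permutations(preferences), key=evaluate))
    while len(best) > 1:
        trials = [best[:i] + best[i + 1:] for i in range(len(best))]
        winner = max(trials, key=evaluate)
        if evaluate(winner) < evaluate(best):
            break
        best = winner
    return best
```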
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Weighting Approaches </SectionTitle> <Paragraph position="0"> Compared with the serialization approaches, the algorithm of Lappin and Leass (1994) is more sophisticated: it uses a system of hand-selected weights to control the interaction among preferences, so that in principle the order of preference application can switch under different input data. In the actual realization, however, the weights of the factors lie so far apart that in the majority of cases the interaction boils down to serialization. The weighting scheme includes Sentence Recency, Grammatical Roles, and Role Parallelism, on the basis of the equivalence class approach described in section 3.2. The final choice of antecedents is relegated to right-to-left Surface Order. Interestingly, the Lappin and Leass algorithm outperforms even the best serialization algorithm on parsed input.</Paragraph> </Section>
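As an illustration of how such weights interact, consider the following minimal sketch. The factor names come from the text, but the weight values and the candidate interface (boolean factor tests, a surface_position attribute) are invented for the example. Note that the top weight exceeds the sum of the others, so in most cases the scheme degenerates to serialization, as observed above.

```python
# Illustrative weights; hand-selected and widely spaced, as in the text.
WEIGHTS = {
    "sentence_recency": 100,  # exceeds the sum of all other weights, so
    "grammatical_role": 50,   # the interaction mostly degenerates to
    "role_parallelism": 25,   # serialization
}

def salience(candidate, factor_tests):
    """Sum the weights of every factor whose test fires for the candidate.
    `factor_tests` maps factor names to boolean predicates (assumed)."""
    return sum(WEIGHTS[name] for name, test in factor_tests.items()
               if test(candidate))

def choose_antecedent(candidates, factor_tests):
    """Pick the candidate with maximal salience; final ties are relegated
    to right-to-left surface order, i.e. the later mention wins."""
    top = max(salience(c, factor_tests) for c in candidates)
    tied = [c for c in candidates if salience(c, factor_tests) == top]
    return max(tied, key=lambda c: c.surface_position)  # assumed attribute
```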
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Machine Learning Approaches </SectionTitle> <Paragraph position="0"> Machine Learning approaches (Ge et al., 1998; Soon et al., 2001; Ng and Cardie, 2002) do not distinguish between filters and preferences. They submit all factors as features to the learner. For every combination of feature values, the learner has the freedom to choose different factors and to assign different strengths to them.</Paragraph> <Paragraph position="1"> Thus the main problem is not the choice and interaction of factors, but rather the formulation of anaphora resolution as a classification problem.</Paragraph> <Paragraph position="2"> Two proposals emerge from the literature. (1) Given an anaphor and an antecedent, decide whether the antecedent is the correct one (Ge et al., 1998; Soon et al., 2001; Ng and Cardie, 2002). (2) Given an anaphor and two antecedents, decide which antecedent is more likely to be the correct one (Yang et al., 2003). In case (1), the lopsidedness of the distribution is problematic: there are many more negative than positive training examples. Machine Learning tools have to surpass a very high baseline: the strategy of never proposing an antecedent typically already yields an f-score of over 90%. In case (2), many more correct decisions have to be made before a correct antecedent is found. It is therefore important in this scenario that the set of antecedents be subjected to a strict filtering process in advance, so that the system only has to choose among the best candidates and errors are less dangerous.</Paragraph> <Paragraph position="3"> Ge et al.'s (1998) probabilistic approach combines three factors (aside from the agreement filter): the result of the Hobbs algorithm, Mention Count dependent on the position of the sentence in the article, and the probability of the antecedent occurring in the local context of the pronoun. In our re-implementation, we neglected the last factor (see section 3.1). Evaluation was performed using 10-fold cross-validation.</Paragraph> <Paragraph position="4"> Other Machine Learning approaches (Soon et al., 2001; Ng and Cardie, 2002; Yang et al., 2003) make use of decision tree learning; we used C4.5 (Quinlan, 1993). (Footnote 4: On our data, Maximum Entropy (Kehler et al., 2004) had problems with the high baseline, i.e., it proposed no antecedents.) To construct the training set, Soon et al. (2001) take the nearest correct antecedent in the previous context as a positive example, while all possible antecedents between this antecedent and the pronoun serve as negative examples. For testing, potential antecedents are presented to the classifier in right-to-left order; the first one classified positive is chosen. Apart from agreement, only two of Soon et al.'s (2001) features apply to pronominal anaphora: Sentence Recency and NP Form (with personal pronouns only). We used every 10th sentence in Negra for testing and all other sentences for training. On parsed input, a very simple decision tree is generated: for every personal and possessive pronoun, the nearest agreeing pronoun is chosen as antecedent; demonstrative pronouns never get an antecedent. This tree performs better than the more complicated tree generated from treebank input, where non-pronouns in previous sentences can also serve as antecedents to a personal pronoun.</Paragraph> <Paragraph position="5"> Soon et al.'s (2001) algorithm performs below its potential, so we modified it somewhat to get better results. For one, we used every possible antecedent in the training set, which improved performance on the treebank data (by 1.8%) but degraded performance on the parsed data (by 2%). Furthermore, we used additional features, viz. the grammatical roles of the antecedent and the pronoun, the NP form of the antecedent, and its information status. The latter two features were combined into a single feature with very many values, so that they were always chosen first in the decision tree. We also used fractional numbers to express intrasentential word distance in addition to Soon et al.'s (2001) sentential distance. Role Parallelism (Ng and Cardie, 2002) degraded performance (by 0.3% F-value). Introducing agreement as a feature had no effect, since the learner always determined that mismatches in agreement preclude coreference. Mention Count, Depth of Embedding, and Common Path did not affect performance either.</Paragraph>
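To make the training-set construction, the fractional distance feature, and the right-to-left decoding concrete, here is a minimal sketch. The mention attributes (sent_index, word_index), the normalizing constant, and the classifier and featurize interfaces are illustrative assumptions, not the exact re-implementation.

```python
def training_pairs(pronoun, candidates_right_to_left, gold_set):
    """Soon-et-al.-style instance construction: the nearest correct
    antecedent yields a positive example; every candidate between it
    and the pronoun yields a negative example."""
    pairs = []
    for cand in candidates_right_to_left:      # nearest candidate first
        if cand in gold_set:
            pairs.append((pronoun, cand, 1))   # positive example
            break
        pairs.append((pronoun, cand, 0))       # intervening negative
    return pairs

def distance_feature(pronoun, cand, norm=30):
    """Sentential distance plus a fractional intrasentential part, so
    that candidates within the same sentence stay distinguishable.
    The normalizer is an arbitrary illustrative constant."""
    sent_dist = pronoun.sent_index - cand.sent_index
    word_dist = abs(pronoun.word_index - cand.word_index)
    return sent_dist + word_dist / norm

def resolve(pronoun, candidates_right_to_left, classifier, featurize):
    """Decoding: present candidates right to left and choose the first
    one classified positive; propose no antecedent otherwise."""
    for cand in candidates_right_to_left:
        if classifier(featurize(pronoun, cand)) == 1:
            return cand
    return None
```
</Section> </Section> </Paper>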