<?xml version="1.0" standalone="yes"?> <Paper uid="A00-2003"> <Title>A Probabilistic Genre-Independent Model of Pronominalization</Title> <Section position="3" start_page="0" end_page="19" type="metho"> <SectionTitle> 2 Factors in Pronoun Generation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="18" type="sub_section"> <SectionTitle> 2.1 Previous Work </SectionTitle> <Paragraph position="0"> In recent years, a number of researchers have done corpus-based work on NP generation and pronoun resolution, and several studies have found differences in the frequency of both personal and demonstrative pronouns across genres. However, none of these studies compares the influence of different factors on pronoun generation across genres.</Paragraph> <Paragraph position="1"> Recently, Poesio et al. (1999) have described a corpus-based approach to statistical NP generation.</Paragraph> <Paragraph position="2"> While they ask the same question as previous researchers (e.g. Dale (1992)), their methods differ from traditional work on NP generation. Poesio et al. (1999) use two kinds of factors: (1) factors related to the NP under consideration, such as agreement information, semantic factors, and discourse factors, and (2) factors related to the antecedent, such as animacy, clause type, thematic role, proximity, etc. Poesio et al. (1999) report that they were not able to annotate many of these factors reliably. On the basis of these annotations, they constructed decision trees for predicting surface forms of referring expressions based on these factors - with good results: all 28 personal pronouns in their corpus were generated correctly. Unfortunately, they do not evaluate the contribution of each of these factors, so we do not know which ones are important.</Paragraph> <Paragraph position="3"> Corpus-based approaches to anaphora resolution are more numerous. Ge et al.
(1998) describe a supervised probabilistic pronoun resolution algorithm which is based on complete syntactic information. The factors they use include distance from last mention, syntactic function and context, agreement information, animacy of the referent, a simplified notion of selectional restrictions, and the length of the coreference chain.</Paragraph> <Paragraph position="4"> [Table 1: the factors investigated in this paper - sortal class (cf. Tab. 2); syntactic function of antecedent; &quot;F&quot; for first mention, &quot;N&quot; for deadend; form of antecedent (pers. pron., poss. pron., def. NP, indef. NP, proper name); distance to last mention in units; Dist reduced to 4 values (deadend, ...); number of competing discourse entities]</Paragraph> <Paragraph position="7"> Cardie & Wagstaff (1999) describe an unsupervised algorithm for noun phrase coreference resolution. Their factors are taken from Ge et al. (1998), with two exceptions. First, they replace complete syntactic information with information about NP bracketing. Second, they use the sortal class of the referent, which they determine on the basis of WordNet (Fellbaum, 1998).</Paragraph> <Paragraph position="8"> There has been no comparison between corpus-based approaches to anaphora resolution and more traditional algorithms based on focusing (Sidner, 1983) or centering (Grosz et al., 1995), except for Azzam et al. (1998). However, their comparison is flawed because it evaluates a syntax-based focus algorithm on the basis of insufficient syntactic information. For pronoun generation, the original centering model (Grosz et al., 1995) provides a rule which is supposed to decide whether a referring expression has to be realized as a pronoun. However, this rule applies only to the referring expression which is the backward-looking center (Cb) of the current utterance.
With respect to all other referring expressions in this utterance, centering is underspecified.</Paragraph> <Paragraph position="9"> Yeh & Mellish (1997) propose a set of hand-crafted rules for the generation of anaphora (zero and personal pronouns, full NPs) in Chinese. However, the factors which appear to be important in their evaluation are similar to the factors described by the authors mentioned above: distance, syntactic constraints on zero pronouns, discourse structure, and salience and animacy of discourse entities.</Paragraph> </Section> <Section position="2" start_page="18" end_page="19" type="sub_section"> <SectionTitle> 2.2 Our Factors </SectionTitle> <Paragraph position="0"> The factors we investigate in this paper rely only on annotations of NPs and their co-specification relations. We did not add any discourse-structural annotation, because (1) the texts are extracts from larger texts which are not available to us, and (2) we have not yet found a labelling scheme for discourse structure that has an inter-coder reliability comparable to the MUC coreference annotation scheme.</Paragraph> <Paragraph position="1"> Based on our review of the literature and relevant work in linguistics (for sortal class, mainly Fraurud (1996) and Fellbaum (1998)), we have chosen the nine factors listed in Table 1. Methodologically, we distinguish two kinds of factors: NP-level factors are independent of co-specification relations. They depend on the semantics of the discourse entity or on discourse information supplied to the NP generation algorithm by the NLG system. Typical examples are NP agreement by gender, number, person and case, the syntactic function of the NP (subject, object, PP adjunct, other), the sortal class of the discourse entity to which an NP refers, discourse structure, or topicality of the discourse entities.
In this paper, we focus on the first three factors, agreement (Agree), syntactic function (Syn), and sortal class (Class).</Paragraph> <Paragraph position="2"> Since we are using syntactically annotated data in the Penn Treebank-II format, the syntactic function of an NP was derived from these annotations.</Paragraph> <Paragraph position="3"> Agreement for gender, number, and person was labelled by hand. Since English has almost no nominal case morphemes, case was not annotated.</Paragraph> <Paragraph position="4"> Sortal classes provide information about the discourse entity that a referring expression evokes or accesses. The classes, summarized in Table 2, were derived from EuroWordNet BaseTypes (Vossen, 1998) and are defined extensionally on the basis of WordNet synsets. Their selection was motivated by two main considerations: all classes should occur in all genres, and the number of classes should be as small as possible in order to avoid problems with sparse data. Four classes, State, Event, Action, and Property, cover different types of situations, and two cover spatiotemporal characteristics of situations (Loc/Time). The four remaining classes cover the two dimensions &quot;concrete vs. abstract (Concept)&quot; and &quot;human (Pers) vs. non-human (PhysObj) vs. institutionalised groups of humans (Group)&quot;.</Paragraph> <Paragraph position="5"> Since we are only interested in the decision whether to employ pronouns rather than full NPs, and less in the form of the NP itself, and since our methodology is based on corpus annotation, we did not take into account more formal semantic categories such as kinds vs. individuals.</Paragraph> <Paragraph position="6"> [Table 2: characterizations of the synsets relevant for each sortal class - one or more human beings; institutionalized group of human beings; physical object; abstract concept; geographical location; date, time span; sth. which takes place in space and time; sth. which is done; state of affairs, feeling; characteristic or attribute of sth.]</Paragraph> <Paragraph position="7"/> <Paragraph position="8"> Co-specification-level factors depend on information about sequences of referring expressions which co-specify with each other. Such a sequence consists of all referring expressions that evoke or access the same discourse entity. In this paper, we use the following factors from the literature: distance to last mention (Dist and Dist4), ambiguity (Ambig), parallelism (Par), form of the antecedent (FormAnte), and syntactic function of the antecedent (SynAnte). We also distinguish between discourse entities that are only evoked once, deadend entities, and entities that are accessed repeatedly.</Paragraph> <Paragraph position="9"> Parallelism is defined on the basis of syntactic function: a referring expression and its antecedent are parallel if they have the same syntactic function. For calculating distance and ambiguity, we segmented the texts into major clause units (MCUs). Each MCU consists of a major clause C plus any subordinate clauses and any coordinated major clauses whose subject is the same as that of C and where that subject has been elided.</Paragraph> <Paragraph position="10"> Dist provides the number of MCUs between the current and the last previous mention of a discourse entity. When an entity is evoked for the first time, Dist is set to &quot;D&quot;. Dist4 is derived from Dist by assigning the fixed distance 2 to all referring expressions whose antecedent is more than 1 MCU away.
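The two distance factors just defined can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation (their statistics were computed in R), and the MCU indices are invented:

```python
# Illustrative sketch of the Dist and Dist4 factors described above.
# "D" marks a first mention (no antecedent in the text so far).

def dist(current_mcu, antecedent_mcu):
    """Number of MCUs between the current and the last previous mention."""
    if antecedent_mcu is None:
        return "D"  # first mention: Dist is set to "D"
    return current_mcu - antecedent_mcu

def dist4(current_mcu, antecedent_mcu):
    """Dist reduced to 4 values: "D", 0, 1, and the fixed distance 2
    for any antecedent more than 1 MCU away."""
    d = dist(current_mcu, antecedent_mcu)
    if d == "D":
        return "D"
    return d if d <= 1 else 2

assert dist(5, None) == "D"   # first mention
assert dist(7, 3) == 4        # antecedent 4 MCUs back
assert dist4(7, 3) == 2       # ... collapsed to the fixed distance 2
assert dist4(6, 6) == 0       # antecedent in the same MCU
```

Collapsing all larger distances to a single value keeps the factor's value set small, which matters later for the sparse-data concerns the authors raise about their other categorical factors.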
Ambiguity is defined as the number of all discourse entities with the same agreement features that occur in the previous unit or in the same unit before the current referring expression.</Paragraph> </Section> </Section> <Section position="4" start_page="19" end_page="19" type="metho"> <SectionTitle> 3 Data </SectionTitle> <Paragraph position="0"> Our data consisted of twelve (plus two) texts from the Brown corpus and the corresponding part-of-speech and syntactic annotations from the Penn Treebank (LDC, 1995). The texts were selected because they contained relatively little or no direct speech; segments of direct speech pose problems for both pronoun resolution and generation because of the change in point of view. Morpho-syntactic information such as markables, part-of-speech labels, grammatical role labels, and form of referring expression was automatically extracted from the existing Treebank annotations.</Paragraph> <Paragraph position="1"> The texts come from four different genres: Popular Lore (CF), Belles Lettres (CG), Fiction/General (CK), and Fiction/Mystery (CL). The choice of genres was dictated by the availability of detailed Treebank-II parses. Table 3 shows that the distribution of referring expressions differs considerably between genres.</Paragraph> <Paragraph position="2"> The texts from the two non-narrative types, CF and CG, contain far more discourse entities and far fewer pronouns than the narrative genres CK and CL. The high number of pronouns in CK and CL is partly due to the fact that in one text from each genre, we have a first person singular narrator. CK patterns with CF and CG in the average number of MCUs; the sentences in the sample from mystery fiction are shorter and arguably less complex. CL also has disproportionately few deadend referents. The high percentage of deadend referents in CK is due to the fact that two of the texts deal with the relationship between two people.
These four discourse referents account for the 4 longest coreference chains in CK (85, 96, 109, and 127 mentions). Two annotators (the authors, both trained linguists) hand-labeled the texts with co-specification information based on the specifications for the Message Understanding Coreference task (Hirschman & Chinchor (1997); for theoretical reasons, we did not mark reflexive pronouns and appositives as co-specifying). The MCUs were labelled by the second author. All referring expressions were annotated with agreement and sortal class information.</Paragraph> <Paragraph position="3"> Labels were placed using the GUI-based annotation tool REFEREE (DeCristofaro et al., 1999).</Paragraph> <Paragraph position="4"> The annotators developed the Sortal Class annotation guidelines on the basis of two training texts. Then, both labellers annotated two texts from each genre independently (eight in total). These eight texts were used to determine the reliability of the sortal class coding scheme. Since sortal class annotation is intrinsically hard, the annotators looked up in WordNet the senses of the head noun of each referring NP that was not a pronoun or a proper name. Each sense was mapped directly to one or more of the ten classes given in Table 2. The annotators then chose the appropriate sense.</Paragraph> <Paragraph position="5"> [Table 3: corpus statistics per genre - words; ref. expr.; entities; sequ.: number of sequences of co-specifying referring expressions; MCUs; % pron.: percentage of all referring expressions realized as pronouns (in brackets: perc. of first person singular, second person singular, and third person singular masculine and feminine pronouns); % deadend: percentage of discourse entities mentioned only once; med. len.: median length of sequences of co-specifying referring expressions] The reliability of the annotations was measured with Cohen's κ (Cohen, 1960; Carletta, 1996).
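Cohen's κ compares observed inter-annotator agreement with the agreement expected by chance from each annotator's marginal label frequencies. A minimal sketch (the sortal-class label sequences are invented for illustration; the authors did their calculations in R):

```python
# Minimal sketch of Cohen's kappa (Cohen, 1960) for two annotators.
# The label sequences below are invented for illustration.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: proportion of items labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from the annotators' marginal frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["Pers", "Pers", "Concept", "Action", "Pers", "Event"]
b = ["Pers", "Pers", "Concept", "State", "Pers", "Event"]
print(round(cohen_kappa(a, b), 2))  # → 0.76
```

Because κ discounts chance agreement, an annotator pair can agree on most items yet still score below the 0.80 reliability threshold discussed next, which is exactly the situation the authors report for the abstract classes.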
Cohen (1960) shows that a κ between 0.68 and 0.80 allows tentative conclusions, while κ > 0.80 indicates reliable annotations. For genres CF (κ = 0.83), CK (κ = 0.84) and CL (κ = 0.83), the sortal class annotations were indeed reliable, but not for genre CG (κ = 0.63). Nevertheless, overall, the sortal class annotations were reliable (κ = 0.80). Problems are mainly due to the abstract classes Concept, Action, Event, State, and Property. Abstract head nouns sometimes have several senses that fit the context almost equally well, but that lead to different sortal classes. Another problem is metaphorical usage.</Paragraph> <Paragraph position="6"> This explains the poor results for CG, which features many abstract discourse entities.</Paragraph> </Section> <Section position="5" start_page="19" end_page="23" type="metho"> <SectionTitle> 4 Towards a Probabilistic Genre-Independent Model </SectionTitle> <Paragraph position="0"> In this section, we investigate to what extent the factors proposed in section 2.2 influence the decision to pronominalize. For the purpose of the statistical analysis, pronominalization is modelled by a feature Pro.</Paragraph> <Paragraph position="1"> For a given referring expression, that feature has the value &quot;P&quot; if the referring expression is a personal or a possessive pronoun, else &quot;N&quot;. We model this variable with a binomial distribution.1</Paragraph> <Section position="1" start_page="19" end_page="19" type="sub_section"> <SectionTitle> 4.1 How do the Factors Affect Pronominalization? </SectionTitle> <Paragraph position="0"> First, we examine for all nine factors if there is a statistical association between these factors and Pro.</Paragraph> <Paragraph position="1"> Standard non-parametric tests show a strong association between all nine factors and Pro.
2 This holds both for all referring expressions and for those that occur in sequences of co-specifying referring expressions. [Footnote 1: For all statistical calculations and for the logistic regression analyses reported below, we used R (Ihaka & Gentleman, 1996).]</Paragraph> <Paragraph position="2"> [Footnote 2: We used the Kruskal-Wallis test for the ordinal Ambig variable and the χ2-test for the other, nominal, variables. Since first mentions and deadends are coded by the character &quot;D&quot; in both Dist and Dist4, both are treated as categorical variables by R. For more on these tests, see (Agresti, 1990).] All of the tests were significant at the p < 0.001 level, with the exception of Par: for expressions that are part of co-specification sequences, the effect of that factor is not significant.</Paragraph> <Paragraph position="3"> In the next analysis step, we determine which of the feature values are associated disproportionately often with pronouns, and which values tend to be associated with full NPs. More specifically, we test for each feature-value pair if the pronominalization probability is significantly higher or lower than that computed over (a) the complete data set, (b) all referring expressions in sequences of co-specifying referring expressions, and (c) all third person referring expressions in sequences. Almost all feature values show highly significant effects for (a) and (b), but some of these effects vanish in condition (c). Below, we report on associations which are significant at p < 0.001 under all three conditions.</Paragraph> <Paragraph position="4"> Unsurprisingly, there is a strong effect of agreement values: NPs referring to the first and second person are always pronominalized, and third person masculine or feminine NPs, which can refer to persons, are pronominalized more frequently than third person neuter and third person plural NPs. Pronouns are strongly preferred if the distance to the antecedent is 0 or 1 MCUs. Referring expressions are more likely to be pronominalized in subject position than as a PP adjunct, and referring expressions with adjuncts as antecedents are also pronominalized less often than those with antecedents in subject or object position. There is a clear preference for pronouns as possessive determiners, and referring expressions that co-specify with an antecedent possessive pronoun are highly likely to be pronominalised.</Paragraph> <Paragraph position="5"> We also notice strong genre-independent effects of parallelism. Although at first glance Ambig appears to have a significant effect as well (median ambiguity for nouns is 3, median ambiguity for pronouns 0), closer inspection reveals that this is mainly due to first and second person and third person masculine and feminine pronouns.</Paragraph> <Paragraph position="6"> The sortal classes show a number of interesting patterns (cf. Table 4). Not only do the classes differ in the percentage of deadend entities, there are also marked differences in pronominalizability. There appear to be three groups of sortal classes: Person/Group, with the lowest rate of deadend entities and the highest percentage of pronouns - not only due to the first and second person personal pronouns -, Location/PhysObj, with roughly two thirds of all entities not in sequences and a significantly lower pronominalization rate, and Concept/Action/Event/Property/State, with over 80% deadend entities. Within this group, Action, Event, and Concept are pronominalized more frequently than State and Property. Time is the least frequently pronominalized class.
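The χ2 association tests reported in this section can be sketched as follows. The contingency counts are invented for illustration, and this hand-rolled statistic stands in for the R procedures the authors actually used:

```python
# Sketch of the Pearson chi-squared statistic for testing association
# between a nominal factor and the binary Pro variable.

def chi_squared(table):
    """Chi-squared statistic for a contingency table (rows of observed
    counts); large values indicate a strong observed association."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under independence of rows and columns.
            exp = row_totals[i] * col_totals[j] / total
            stat += (obs - exp) ** 2 / exp
    return stat

# Invented counts. Rows: syntactic function (subject, PP adjunct);
# columns: realized as pronoun vs. as full NP.
table = [[120, 80], [30, 90]]
print(round(chi_squared(table), 1))  # → 36.9
```

In practice the statistic would be compared against the χ2 distribution with (rows-1)(cols-1) degrees of freedom to obtain the p-values cited in the text.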
An important reason for the difference between Loc and Time might be that Times are almost always referred back to by temporal adverbs, while locations, especially towns and countries, can also be accessed via third person neuter personal pronouns.</Paragraph> <Paragraph position="7"> Interactions between the factors and genre were examined by an analysis of deviance run on a fitted logistic regression model; significance was calculated using the F-test. All factors except for Par show strong (p < 0.001) interactions with Genre.</Paragraph> <Paragraph position="8"> In other words, the influence of all factors but parallelism on pronominalization is mediated by Genre.</Paragraph> <Paragraph position="9"> There are two main reasons for this effect: first, some genres contain far more first and second person personal pronouns, which adds to the weight of Agree, and second, texts which are about persons and the actions of persons, such as the texts in CK and CL, tend to use more pronouns than texts which are mainly argumentative or expository.</Paragraph> </Section> <Section position="2" start_page="19" end_page="23" type="sub_section"> <SectionTitle> 4.2 Which Factors are Important? </SectionTitle> <Paragraph position="0"> To separate the important from the unimportant factors, many researchers use decision and regression trees, mostly the binary CART variant (Breiman et al., 1984). We use a different kind of model here, logistic regression, which is especially well suited for categorical data analysis (cf. e.g. Agresti (1990) or Kessler et al. (1997)). In this model, the value of the binary target variable is predicted by a linear combination of the predictor variables. Variable weights indicate the importance of a variable for classification: the higher the absolute value of the weight, the more important the variable is.</Paragraph> <Paragraph position="1"> Logistic regression models are not only evaluated by their performance on training and test data.
We could easily construct a perfect model of any training data set with n variables, where n is the size of the data set. But we need models that are small, yet predict the target values well. A suitable criterion is the Akaike Information Criterion (AIC, Akaike (1974)), which penalizes both models that do not fit the data well and models that have too many parameters. The quality of a factor is judged by the amount of variation in the target variable that it explains. Note that increased prediction accuracy does not necessarily mean an increase in the amount of variation explained. As the model itself is a continuous approximation of the categorical distinctions to be modelled, it may occur that the numerical variation in the predictions decreases, but that this decrease is lost when re-translating numerical predictions into categorical ones.</Paragraph> <Paragraph position="2"> The factors for our model were selected based on the following procedure: We start with a model that always predicts the most frequent class. We then determine which factor provides the greatest reduction in the AIC, add that factor to the model, and retrain.</Paragraph> <Paragraph position="3"> This step is repeated until all factors have been used or adding another factor no longer yields any significant improvement.3 This procedure invariably yields the sequence Dist4, Agree, Class, FormAnte, Syn, SynAnte, Ambig, Par, both when training models on the complete data set and when training on a single genre. Inspection of the AIC values suggests that parallelism is the least important factor and does not improve the AIC significantly. Therefore, we will discard it from the outset. All other factors are maintained in the initial full model. This model is purely additive; it does not include interactions between factors. This approach allows us to filter out factors which only mediate the influence of other factors, but do not exert any significant influence of their own.
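The greedy forward selection just described can be sketched as follows. Here `fit_model` is a hypothetical stand-in for fitting the logistic regression (the authors used R) and returning its AIC, and the toy AIC surface below is invented purely to make the loop runnable:

```python
# Sketch of forward factor selection by AIC, as described above.
# fit_model(factors) is assumed to fit a model and return its AIC.

def forward_select(factors, fit_model, tolerance=0.0):
    """Start from the empty model (most frequent class); repeatedly add
    the factor that most reduces the AIC, stopping when no factor yields
    a significant improvement."""
    selected = []
    best_aic = fit_model(selected)
    remaining = list(factors)
    while remaining:
        aics = {f: fit_model(selected + [f]) for f in remaining}
        best_factor = min(aics, key=aics.get)
        if best_aic - aics[best_factor] <= tolerance:
            break  # no significant improvement: stop
        selected.append(best_factor)
        remaining.remove(best_factor)
        best_aic = aics[best_factor]
    return selected

# Invented AIC surface: each factor removes a fixed amount of deviance,
# and each added parameter costs 2 AIC points.
gains = {"Dist4": 400, "Agree": 250, "Class": 120, "FormAnte": 60,
         "Syn": 30, "SynAnte": 15, "Ambig": 5, "Par": 0}
fake_fit = lambda fs: 1000 - sum(gains[f] for f in fs) + 2 * len(fs)
print(forward_select(gains, fake_fit, tolerance=1.0))
# → ['Dist4', 'Agree', 'Class', 'FormAnte', 'Syn', 'SynAnte', 'Ambig']
```

On this toy surface the loop reproduces the paper's reported ordering and, like the authors, drops Par because it no longer improves the AIC.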
Note that this probabilistic model only provides a numerical description of how its factors affect pronominalization in our corpus. As such, it is not equivalent to a theoretical model, but rather provides data for further theoretical interpretation. [Footnote 3: We excluded Dist from this stepwise procedure, since the relevant information is already covered by Dist4, which furthermore has far fewer values.]</Paragraph> <Paragraph position="4"> [Table 5: evaluation of the full model on the genre-specific corpora (CF, CG, CK, CL) and the complete data set (all). % correct: correctly predicted pronominalization decisions; AIC: Akaike Information Criterion; % variation: percentage of original variation in the data (as measured by deviance) accounted for by the model]</Paragraph> <Paragraph position="5"> Results of a first evaluation of the full model are summarized in Table 5. The model can explain more than two thirds of the variation in the complete data set and can predict pronominalization quite well on the data it was fitted on. The matter becomes more interesting when we examine the genre-specific results. Although overall prediction performance remains stable, the model is obviously better suited to some genres than to others. The best results are obtained on CF, the worst on CL (mystery fiction). In the CL texts, MCUs are short, a third of all referring expressions are pronouns, there is no first person singular narrator, and most paragraphs which mention persons are about the interaction between two persons.</Paragraph> <Paragraph position="6"> The Relative Importance of Factors. All values of Dist4 have very strong weights in all models; this is clearly the most important factor. The same goes for Agree, where the first and second person are strong signs of pronominalization, and, to a lesser degree, masculine and feminine third person singular. The most important distinction provided by Class appears to be that between Persons, non-Persons, and Times.
This holds as well when the model is only trained on third person referring expressions. For singular referring expressions, Personhood information is reflected in gender, but not for plural referring expressions. Another important influence is the form of the antecedent. The syntactic function of the referring expression and of its antecedent are less important, as is ambiguity.</Paragraph> <Paragraph position="7"> In order to examine the importance of the factors in more detail, we refitted the models on the complete data set while omitting one or more of the three central features Dist4, Agree, and Class. The results are summarized in Table 6. The most interesting finding is that even if we exclude all three factors, prediction accuracy only drops by 3.2%.</Paragraph> <Paragraph position="8"> This means that the remaining 4 factors also contain most of the relevant information, but that this information is coded more &quot;efficiently&quot;, so to speak, in the first three. Speaking of these factors, the question concerning the effect of sortal class remains. Remarkably enough, when sortal class is omitted, accuracy increases by 0.7%. The increase in AIC can be explained by a decrease in the amount of explained variation. A third result is that information about the form of the antecedent can substitute for distance information, if that information is missing. Both variables code the crucial distinction between expressions that evoke entities and those that access evoked entities. Furthermore, a pronominal antecedent tends to occur at a distance of less than 2 MCUs. The contribution of syntactic function remains stable and significant, albeit comparatively unimportant.</Paragraph> <Paragraph position="9"> Predictive Power: To evaluate the predictive power of the models computed so far, we determine the percentage of correctly predicted pronouns and NPs.
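The refitting experiment described above (omitting subsets of the three central factors) can be sketched as an ablation loop. The `evaluate` function is a hypothetical stand-in for refitting and scoring the model, and the toy accuracy figures are invented:

```python
# Sketch of the ablation experiment: refit the model with each subset of
# the central factors omitted and compare prediction accuracy.
from itertools import combinations

ALL_FACTORS = ["Dist4", "Agree", "Class", "FormAnte", "Syn", "SynAnte", "Ambig"]
CENTRAL = ["Dist4", "Agree", "Class"]

def ablation_runs(evaluate):
    """Evaluate the model once per subset of omitted central factors,
    keyed by the tuple of omitted factor names."""
    results = {}
    for k in range(len(CENTRAL) + 1):
        for omitted in combinations(CENTRAL, k):
            kept = [f for f in ALL_FACTORS if f not in omitted]
            results[omitted] = evaluate(kept)
    return results

# Invented scoring: accuracy degrades slightly per omitted factor.
runs = ablation_runs(lambda kept: 0.85 - 0.011 * (len(ALL_FACTORS) - len(kept)))
# Accuracy lost when all three central factors are removed:
print(round(runs[()] - runs[("Dist4", "Agree", "Class")], 3))  # → 0.033
```

The dictionary keyed by omitted-factor tuples mirrors the rows of Table 6, one row per ablation configuration.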
The performance of the trained models was compared to two very simple algorithms: Algorithm A: Always choose the most frequent option (i.e. noun).</Paragraph> <Paragraph position="10"> Algorithm B: If the antecedent is in the same MCU, or if it is in the previous MCU and there is no ambiguity, choose a pronoun; else choose a noun.</Paragraph> <Paragraph position="11"> [Table 6: results for the refitted models with one or more of Dist4, Agree, and Class omitted (n.a.)] [Table 7: predictive power of the models and the two baseline algorithms on unseen data, in % correct prediction of whether a referring expression is to be pronominalised or not; setup for genres: model trained on three genres, tested on the remaining one] Table 7 summarises the results of the comparison. To determine the overall predictive power of the model, we used 10-fold cross-validation. Algorithm A always fares worst, while algorithm B, which is based mainly on distance, the strongest factor in the model, performs quite well. Its overall performance is 3.2% below that of the full model, and 3.6% below that of the full model without sortal class information. It even outperforms the models on CG, which has the lowest percentage of Persons (12.9% vs. 35% for CF and 43.4% and 43.5% for CL and CK). For all other genres, the statistical models outperform the simple heuristics. Excluding sortal class information can boost prediction performance on unseen data by as much as 0.4% for the complete corpus. The apparent contradiction between this finding and the results reported in the previous section can be explained if we consider that not only were some sortal classes comparatively rare in the data (Property, Event), but also that our sortal class definition may still be too fine-grained.</Paragraph> <Paragraph position="12"> We evaluated the genre-independence of the model by training on three genres and testing on the fourth. The results show that the model fares quite well for genre CF, which is also the genre where the overall fit was best (see Table 5). We therefore hypothesize that the decrease in performance is mainly due to the model itself, not to the training data. The results presented in both Table 5 and 7 show that although the model we have found is not quite as genre-independent as we would want it to be, it provides a reasonable fit to all the genres we examined.</Paragraph> </Section> </Section> </Paper>