<?xml version="1.0" standalone="yes"?> <Paper uid="P03-2012"> <Title>High-precision Identification of Discourse New and Unique Noun Phrases</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 NP Classification </SectionTitle> <Paragraph position="0"> In our study we mainly follow E. Prince's classification of NPs (Prince, 1981). Prince distinguishes between discourse and hearer givenness. The resulting taxonomy is summarised below:
- brand new NPs introduce entities which are both discourse and hearer new (&quot;a bus&quot;); a sub-class of them, brand new anchored NPs, contain an explicit link to some given discourse entity (&quot;a guy I work with&quot;),
- unused NPs introduce discourse new, but hearer old entities (&quot;Noam Chomsky&quot;),
- evoked NPs introduce entities already present in the discourse model and thus discourse and hearer old: textually evoked NPs refer to entities which have already been mentioned in the previous discourse (&quot;he&quot; in &quot;A guy I worked with says he knows your sister&quot;), whereas situationally evoked NPs are known for situational reasons (&quot;you&quot; in &quot;Would you have change of a quarter?&quot;),
- inferrables are neither discourse nor hearer old; however, the speaker assumes the hearer can infer them via logical reasoning from evoked entities or other inferrables (&quot;the driver&quot; in &quot;I got on a bus yesterday and the driver was drunk&quot;); containing inferrables make this inference link explicit (&quot;one of these eggs&quot;).
For our present study we do not need such an elaborate classification. Moreover, various experiments by Vieira and Poesio show that even humans have difficulties distinguishing, for example, between inferrables and new NPs, or trying to find an anchor for an inferrable. So, we developed a simple taxonomy following Prince's main distinction between discourse and hearer givenness.</Paragraph> <Paragraph position="1"> First, we distinguish between discourse new and discourse old entities. An entity is considered discourse old (-discourse_new) if it refers to an object or a person mentioned in the previous discourse. For example, in &quot;The Navy is considering a new ship that [..] The Navy would like to spend about $ 200 million a year on the arsenal ship..&quot; the first occurrence of &quot;The Navy&quot; and &quot;a new ship&quot; are classified as +discourse_new, whereas the second occurrence of &quot;The Navy&quot; and &quot;the arsenal ship&quot; are classified as -discourse_new. It must be noted that many researchers, in particular Bean and Riloff, would consider the second &quot;the Navy&quot; non-anaphoric, because it fully specifies its referent and does not require information on the first NP to be interpreted successfully. However, we think that a link between two instances of &quot;the Navy&quot; can be very helpful, for example, in the Information Extraction task. Therefore we treat those NPs as discourse old.</Paragraph> <Paragraph position="2"> Our -discourse_new class corresponds to Prince's textually evoked NPs.</Paragraph> <Paragraph position="3"> Second, we distinguish between uniquely and non-uniquely referring expressions.
Uniquely referring expressions (+unique) fully specify their referents and can be successfully interpreted without any local supportive context. The main part of the +unique class is constituted by entities known to the hearer (reader) already at the moment when she starts processing the text, for example &quot;The Mount Everest&quot;. In addition, an NP (unknown to the reader at the very beginning) is considered unique if it fully specifies its referent due to its own content only and thus can be added as it is (maybe for a very short time) to the reader's world knowledge base after the processing of the text, for example, &quot;John Smith, chief executive of John Smith Gmbh&quot; or &quot;the fact that John Smith is a chief executive of John Smith Gmbh&quot;. In Prince's terms, our +unique class corresponds to the unused and, partially, the new NPs. In our Navy example (cf.</Paragraph> <Paragraph position="4"> above) both occurrences of &quot;The Navy&quot; are considered +unique, whereas &quot;a new ship&quot; and &quot;the arsenal ship&quot; are classified as -unique.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Data </SectionTitle> <Paragraph position="0"> In our research we use 20 texts from the MUC-7 corpus (Hirschman and Chinchor, 1997). The texts were parsed by E. Charniak's parser (Charniak, 2000). Parsing errors were not corrected manually. After this preprocessing step we have 20 lists of noun phrases.</Paragraph> <Paragraph position="1"> There are discrepancies between our lists and the MUC-7 annotations. First, we consider only noun phrases, whereas MUC-7 takes into account more types of entities (for example, &quot;his&quot; in &quot;his position&quot; should be annotated according to the MUC-7 scheme, but is not included in our lists). Second, the MUC-7 annotation identifies only markables participating in some coreference chain. Our lists are produced automatically and thus include all the NPs.</Paragraph> <Paragraph position="2"> We automatically annotated our NPs as +/-discourse_new using the following simple rule: an NP is considered -discourse_new if and only if
- it is marked in the original MUC-7 corpus, and
- it has an antecedent in the MUC-7 corpus (even if this antecedent does not correspond to any NP in our corpus).</Paragraph>
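The following is a minimal sketch of this labelling step, assuming the MUC-7 markables have already been read into simple records with optional antecedent links; the data structures and function names are illustrative, not the original implementation.

# Hypothetical sketch: derive the +/-discourse_new label for a parsed NP from
# MUC-7 coreference markables. Exact span matching is a simplification; the
# record layout and names are assumptions, not the authors' code.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Markable:
    span: Tuple[int, int]          # character offsets of the markable
    antecedent_id: Optional[int]   # id of its antecedent, if it has one

def is_discourse_new(np_span: Tuple[int, int], markables: List[Markable]) -> bool:
    """-discourse_new iff the NP is marked in MUC-7 and has an antecedent there."""
    for m in markables:
        if m.span == np_span and m.antecedent_id is not None:
            return False           # discourse old
    return True                    # discourse new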
<Paragraph position="3"> In addition, we annotated our NPs manually as +/-unique. The following expressions were considered +unique:
- expressions fully specifying the referent without any local or global context (the chairman of Microsoft Corporation, 1998, or Washington). We do not take homonymy into account, so, for example, Washington is annotated as +unique although it can refer to many different entities: various persons, cities, counties, towns, islands, a state, the government and many others.</Paragraph> <Paragraph position="4">
- time expressions that can be interpreted uniquely once some starting time point (global context) is specified. The MUC-7 corpus consists of New York Times News Service articles.</Paragraph> <Paragraph position="5"> Obviously, they were designed to be read on some particular day. Thus, for a reader of such a text, the expressions on Thursday or tomorrow fully specify their referents. Moreover, the information on the starting time point can easily be extracted from the header of the text.</Paragraph> <Paragraph position="6">
- expressions denoting political or administrative objects (for example, &quot;the Army&quot;). Although such expressions do not fully specify their referents without an appropriate global context (many countries have armies), in a U.S. newspaper they can be interpreted uniquely.</Paragraph> <Paragraph position="7"> Overall, we have 3710 noun phrases. 2628 of them were annotated as +discourse_new and 1082 -- as -discourse_new. 2651 NPs were classified as -unique and 1059 -- as +unique. We provide these data to a machine learning system (Ripper).</Paragraph> <Paragraph position="8"> Another source of data for our experiments is the World Wide Web. To model the &quot;definite probability&quot; of a given NP, we construct various phrases, for example, &quot;the NP&quot;, and send them to the AltaVista search engine. The obtained counts (the number of pages worldwide written in English and containing the phrases) are used to calculate values for several &quot;definite probability&quot; features (see Section 4.1 below). We do not use morphological variants in this study.</Paragraph>
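As an illustration of this step, the sketch below shows how such counts could be turned into the ratio features described in Section 4.1; web_count() stands in for the search-engine hit counts (the paper queried AltaVista), and the exact set of ratios is an assumption modelled on the examples given there.

# Hypothetical sketch of the "definite probability" computation. web_count()
# abstracts the search-engine hit counts; the ratio set is an assumption
# based on the examples in Section 4.1, not the paper's exact feature list.
def definite_probability_features(np_without_det, head, web_count):
    feats = {}
    for tag, phrase in (("Y", np_without_det), ("H", head)):
        bare = web_count(phrase)               # e.g. "government"
        the = web_count("the " + phrase)       # e.g. "the government"
        a = web_count("a " + phrase)           # e.g. "a government"
        feats["the_" + tag] = the / bare if bare else 0.0
        feats["a_" + tag] = a / bare if bare else 0.0
    return feats

# With the counts reported in Section 4.1 ("government": 23197407,
# "the government": 5539661, "a government": 1109574) this yields
# the_Y = 0.239 and a_Y = 0.048.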
</Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Identifying Discourse New and Unique Expressions </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <Paragraph position="0"> In our experiments we want to learn both classifications, +/-discourse_new and +/-unique, automatically. However, not every learning algorithm would be appropriate, due to the specific requirements we have.</Paragraph> <Paragraph position="1"> First, we need an algorithm that does not always require all the features to be specified. For example, we might want to calculate the &quot;definite probability&quot; for a definite NP, but not for a pronoun. We also do not want to decide a priori which features are important and which ones are not in any particular case. This requirement rules out such approaches as Memory-based Learning, Naive Bayes, and many others. On the contrary, algorithms providing tree- or rule-based classifications (for example, C4.5 and Ripper) fulfil our first requirement ideally.</Paragraph> <Paragraph position="2"> Second, we want to control the precision-recall tradeoff, at least for the +/-discourse_new task. For these reasons we have finally chosen the Ripper learner (Cohen, 1995).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Features </SectionTitle> <Paragraph position="0"> Our feature set currently consists of 32 features.</Paragraph> <Paragraph position="1"> They can be divided into three groups: 1. Syntactic Features. We encode the part of speech of the head word and the type of the determiner. Several features contain information on the characters constituting the NP's string (digits, capital and lower case letters, special symbols). We use several heuristics for restrictive postmodification. Two types of appositions are identified: with and without commas (&quot;Rupert Murdoch, News Corp.'s chairman and chief executive officer,&quot; and &quot;News Corp.'s chairman and chief executive officer Rupert Murdoch&quot;). In the MUC-7 corpus, appositions of the latter type are usually annotated as a whole. Charniak's parser, however, analyses these constructions as two NPs (['News Corp.'s chairman and chief executive officer] [Rupert Murdoch]). Therefore those cases require special treatment. 2. Context Features. For every NP we calculate the distance (in NPs and in sentences) to the previous NP with the same head, if such an NP exists. Obtaining values for these features does not require exhaustive search when heads are stored in an appropriate data structure, for example, in a trie.</Paragraph> <Paragraph position="2"> 3. &quot;Definite probability&quot; features. Suppose X is a noun phrase, Y is the same noun phrase without a determiner, and H is its head. We obtain Internet counts for &quot;Det Y&quot; and &quot;Det H&quot;, where Det stands for &quot;the&quot;, &quot;a(n)&quot;, or the empty string. The ratios of these counts (how often Y and H occur with the definite and with the indefinite article, relative to their bare forms) are then used as features.</Paragraph> <Paragraph position="4"> We expect our NPs to behave w.r.t. the &quot;definite probability&quot; as follows: pronouns and long proper names are seldom used with any article:</Paragraph> <Paragraph position="6"> &quot;he&quot; was found on the Web 44681672 times, &quot;the he&quot; -- 134978 times (0.3%), and &quot;a he&quot; -- 154204 times (0.3%). Uniques (including short proper names) and plural non-uniques are used with the definite article much more often than with the indefinite one: &quot;government&quot; was found 23197407 times, &quot;the government&quot; -- 5539661 times (23.9%), and &quot;a government&quot; -- 1109574 times (4.8%). Singular non-unique expressions are used only slightly (if at all) more often with the definite article: &quot;retailer&quot; was found 1759272 times, &quot;the retailer&quot; -- 204551 times (11.6%), and &quot;a retailer&quot; -- 309392 times (17.6%).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Discourse New entities </SectionTitle> <Paragraph position="0"> We use Ripper to learn the +/-discourse_new classification from the feature representations described above. The experiment is designed in the following way: one text is reserved for testing (we do not want to split our texts and always process them as a whole). The remaining 19 texts are first used to optimise the Ripper parameters -- class ordering, possibility of negative tests, hypothesis simplification, and the minimal number of training examples to be covered by a rule. We perform 5-fold cross-validation on these 19 texts in order to find the settings with the best precision for the +discourse_new class. These settings are then used to train Ripper on all the 19 files and test on the reserved one. The whole procedure is repeated for all the 20 test files and the average precision and recall are calculated. The parameter &quot;Loss Ratio&quot; (the ratio of the cost of a false negative to the cost of a false positive) is adjusted separately -- we decreased it as much as possible (to 0.3) to obtain a classification with good precision and a reasonable recall.</Paragraph>
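The following sketch outlines this evaluation protocol for illustration only; a scikit-learn decision tree stands in for Ripper, per-text feature matrices are assumed as input, and the instance-level 5-fold split is a simplification of the text-level cross-validation used in the paper.

# Hypothetical sketch of the evaluation protocol of Section 4.2: reserve one
# text for testing, tune parameters by cross-validation on the remaining
# texts, retrain with the best setting, and average scores over all texts.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_score, recall_score

def leave_one_text_out(texts_X, texts_y, param_grid):
    """texts_X / texts_y: lists of per-text feature matrices and label vectors
    (label 1 = +discourse_new)."""
    precisions, recalls = [], []
    for i in range(len(texts_X)):
        X_train = np.vstack([x for j, x in enumerate(texts_X) if j != i])
        y_train = np.concatenate([y for j, y in enumerate(texts_y) if j != i])
        # choose the setting with the best precision for the +discourse_new class
        search = GridSearchCV(DecisionTreeClassifier(), param_grid,
                              scoring="precision", cv=5)
        search.fit(X_train, y_train)                      # refits on all training texts
        y_pred = search.best_estimator_.predict(texts_X[i])
        precisions.append(precision_score(texts_y[i], y_pred, zero_division=0))
        recalls.append(recall_score(texts_y[i], y_pred, zero_division=0))
    return float(np.mean(precisions)), float(np.mean(recalls))

# Example call (hypothetical parameter grid):
# leave_one_text_out(texts_X, texts_y, {"min_samples_leaf": [1, 5, 10, 20]})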
<Paragraph position="1"> The automatically induced classifier includes, for example, the following rules: R2 (applicable to such NPs as &quot;you&quot;): IF an NP is a pronoun, CLASSIFY it as discourse old.</Paragraph> <Paragraph position="2"> R14 (applicable to such NPs as &quot;Mexico&quot; or &quot;the Shuttle&quot;): IF an NP has no premodifiers, is more often used with &quot;the&quot; than with &quot;a(n)&quot; (the ratio is between 2 and 10), and an NP with the same head is found within the preceding 18 NPs, CLASSIFY it as discourse old.</Paragraph> <Paragraph position="3"> The performance is shown in Table 1.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Uniquely Referring Expressions </SectionTitle> <Paragraph position="0"> Although the &quot;definite probability&quot; features could not help us much to classify NPs as +/-discourse_new, we expect them to be useful for identifying unique expressions.</Paragraph> <Paragraph position="1"> We conducted a similar experiment trying to learn a +/-unique classifier. The only difference was in the optimisation strategy: as we did not know a priori what was more important, we looked for settings with the best precision for non-uniques, the best recall for non-uniques, and the best overall accuracy (the number of correctly classified items of both classes) separately.</Paragraph> <Paragraph position="2"> The results are summarised in Table 2.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Combining two approaches </SectionTitle> <Paragraph position="0"> Unique and non-unique NPs demonstrate different behaviour w.r.t. coreference: discourse entities are seldom introduced by vague descriptions and then referred to by fully specifying NPs. Therefore we can expect a unique NP to be discourse new if obvious checks for coreference fail. The &quot;obvious checks&quot; include in our case looking for same-head expressions and appositive constructions, both of them requiring only constant time.</Paragraph> <Paragraph position="1"> On the other hand, unique expressions always have the same or a similar form: &quot;The Navy&quot; can be either discourse new or discourse old. Non-unique NPs, on the contrary, look different when introducing entities (for example, &quot;a company&quot; or &quot;the company that . . . &quot;) and when referring to previously introduced ones (&quot;it&quot; or &quot;the company&quot; without postmodifiers). Therefore our syntactic features should be much more helpful when classifying non-uniques as +/-discourse_new. To investigate this difference we conducted another experiment. We split our data into two parts -- +unique and -unique NPs. Then we learn the +/-discourse_new classification for both parts separately, as described in Section 4.2. Finally the rules are combined, producing a classifier for all the NPs.</Paragraph>
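A minimal sketch of this combined, sequential scheme (also discussed in Section 4.5) is given below; the classifier objects and their predict() interface are assumptions for illustration, not the original Ripper rule sets.

# Hypothetical sketch of the combined scheme: decide +/-unique first, then
# apply a +/-discourse_new classifier trained only on uniques or only on
# non-uniques. predict() returning True/False is an assumed interface.
def classify_discourse_new(np_features, unique_clf, dn_unique_clf, dn_nonunique_clf):
    if unique_clf.predict(np_features):               # step 1: +/-unique decision
        return dn_unique_clf.predict(np_features)     # rules learned on uniques
    return dn_nonunique_clf.predict(np_features)      # rules learned on non-uniques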
<Paragraph position="2"> The results are summarised in Table 3 (the +/-discourse_new classification for unique and non-unique NPs separately, using all the features).</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.5 Discussion </SectionTitle> <Paragraph position="0"> As far as the +/-discourse_new task is concerned, our system performed slightly, if at all, better with the definite probability features than without them: the improvement in precision (our main criterion) is compensated by the loss in recall. However, when only definite NPs are taken into account, the improvement becomes significant. This is not surprising, as these features bring much more information for definites than for other NPs.</Paragraph> <Paragraph position="1"> For the +/-unique classification our definite probability features were more important, leading to significantly better results compared to the case when only syntactic and context features were used. Although the improvement is only about 0.5%, it must be taken into account that the overall figures are high: a 1% improvement on 90% accuracy and on 70% accuracy is not the same. We conducted a t-test to check the significance of these improvements, using weighted means and weighted standard deviations, as the texts have different sizes. Table 2 shows in bold the performance measures (precision, recall, or F-score) that improve significantly (p &lt; 0.05) when we use the definite probability features.</Paragraph> <Paragraph position="2"> As our third experiment shows, non-unique entities can be classified very reliably into the +/-discourse_new classes. Uniques, however, have shown quite poor performance, although we expected them to be resolved successfully by the heuristics for appositions and same heads. Such low performance is mainly due to the fact that many objects can be referred to by very similar, but not identical, unique NPs: &quot;Lockheed Martin Corp.&quot;, &quot;Lockheed Martin&quot;, and &quot;Lockheed&quot;, for example, introduce the same object. We hope to improve the accuracy by developing more sophisticated matching rules for unique descriptions.</Paragraph> <Paragraph position="3"> Although uniques currently perform poorly, the overall classification still benefits from the sequential processing (identify +/-unique first, then learn +/-discourse_new classifiers for uniques and non-uniques separately, and then combine them). We hope to get a better overall accuracy once our matching rules are improved.</Paragraph> </Section> </Section> </Paper>