<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0302"> <Title>Toward Opinion Summarization: Linking the Sources</Title> <Section position="5" start_page="9" end_page="10" type="metho"> <SectionTitle> 3 Data set </SectionTitle> <Paragraph position="0"> We begin our discussion by describing the data set that we use for development and evaluation.</Paragraph> <Paragraph position="1"> As noted previously, we desire methods that work with automatically identified opinions and sources. However, for the purpose of developing and evaluating our approaches we rely on a corpus of manually annotated opinions and sources. More precisely, we rely on the MPQA corpus (Wilson and Wiebe, 2003), which contains 535 manually annotated documents. Full details about the corpus and the process of corpus creation can be found in Wilson and Wiebe (2003); full details of the opinion annotation scheme can be found in Wiebe et al. (2005). For the purposes of the discussion in this paper, the following three points suffice.</Paragraph> <Paragraph position="2"> First, the corpus is suitable for the domains and genres that we target - all documents appeared in the world press over an 11-month period, between June 2001 and May 2002. Therefore, the corpus is suitable for the political and government domains as well as a substantial part of the commercial domain. However, a fair portion of the commercial domain is concerned with opinion extraction from product reviews. Work described in this paper does not target the genre of reviews, which appears to differ significantly from newspaper articles.</Paragraph> <Paragraph position="3"> Second, all documents are manually annotated with phrase-level opinion information. The annotation scheme of Wiebe et al. (2005) includes phrase-level opinions, their sources, as well as other attributes, which are not utilized by our approach. Additionally, the annotations contain information that allows coreference among source mentions to be recovered.</Paragraph> <Paragraph position="4"> Finally, the MPQA corpus contains no coreference information for general NPs (which are not sources). This might present a problem for traditional coreference resolution approaches, as discussed throughout the paper.</Paragraph> </Section> <Section position="6" start_page="10" end_page="11" type="metho"> <SectionTitle> 4 Source Coreference Resolution </SectionTitle> <Paragraph position="0"> In this section we define the problem of source coreference resolution, describe its challenges, and provide an overview of our general approach.</Paragraph> <Paragraph position="1"> We define source coreference resolution as the problem of determining which mentions of opinion sources refer to the same real-world entity.</Paragraph> <Paragraph position="2"> Source coreference resolution differs from traditional supervised NP coreference resolution in two important aspects. First, sources of opinions do not exactly correspond to the automatic extractors' notion of noun phrases (NPs). Second, due mainly to the time-consuming nature of coreference annotation, NP coreference information is incomplete in our data set: NP mentions that are not sources of opinion are not annotated with coreference information (even when they are part of a chain that contains source NPs). [Footnote: This problem is illustrated in the example of Figure 1. The underlined Stanishev is coreferent with all of the Stanishev references marked as sources, but, because it is used in an objective sentence rather than as the source of an opinion, the reference would be omitted from the Stanishev source coreference chain. Unfortunately, this proper noun might be critical in establishing coreference of the final source reference he with the other mentions of the source Stanishev.] In this paper we address the former problem via a heuristic method for mapping sources to NPs and give statistics for the accuracy of the mapping process.</Paragraph> <Paragraph position="3"> We then apply state-of-the-art coreference resolution methods to the NPs to which sources were mapped (source noun phrases).</Paragraph> <Paragraph position="4"> The latter problem of developing methods that can work with incomplete supervisory information is addressed in a subsequent effort (Stoyanov and Cardie, 2006).</Paragraph> <Paragraph position="5"> Our general approach to source coreference resolution consists of the following steps: 1. Preprocessing: We preprocess the corpus by running NLP components such as a tokenizer, sentence splitter, POS tagger, parser, and a base NP finder. Subsequently, we augment the set of base NPs found by the base NP finder with the help of a named entity finder. The preprocessing is done following the NP coreference work by Ng and Cardie (2002). From the preprocessing step, we obtain an augmented set of NPs in the text.</Paragraph> <Paragraph position="6"> 2. Source to noun phrase mapping: The problem of mapping (manually or automatically annotated) sources to NPs is not trivial. We map sources to NPs using a set of heuristics.</Paragraph> <Paragraph position="7"> 3. Coreference resolution: Finally, we restrict our attention to the source NPs identified in step 2. We extract a feature vector for every pair of source NPs from the preprocessed corpus and perform NP coreference resolution (a schematic sketch of the three-step pipeline follows). The next two sections give the details of Steps 2 and 3, respectively. We follow with the results of an evaluation of our approach in Section 7.</Paragraph>
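To make the overall flow concrete, the skeleton below is an illustrative outline written for this summary, not the system described in the paper: the NP extractor, the source-to-NP mapper of Section 5, and the pairwise clustering step of Section 6 are assumed to be supplied as callables.

def source_coreference_pipeline(document, sources, extract_nps, map_source_to_np, cluster):
    # Step 1: preprocessing yields an augmented set of NPs
    # (base NPs plus named-entity mentions).
    nps = extract_nps(document)
    # Step 2: heuristically map each opinion source to one of the extracted NPs,
    # dropping sources that could not be mapped.
    mapped = (map_source_to_np(source, nps) for source in sources)
    source_nps = [np for np in mapped if np is not None]
    # Step 3: pairwise classification and single-link clustering over the source NPs.
    return cluster(source_nps)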
<Paragraph position="8"> 5 Mapping sources to noun phrases This section describes our method for heuristically mapping sources to NPs. In the context of source coreference resolution we consider a noun phrase to correspond to (or match) a source if the source and the NP cover the exact same span of text. Unfortunately, the annotated sources did not always match exactly a single automatically extracted NP. We discovered the following problems: 1. Inexact span match. We discovered that often (in 3777 out of the 11322 source mentions) there is no noun phrase whose span matches the source exactly, although there are noun phrases that overlap the source. In most cases this is due to the way spans of sources are marked in the data. For instance, in some cases determiners are not included in the source span (e.g. &quot;Venezuelan people&quot; vs. &quot;the Venezuelan people&quot;). In other cases, differences are due to mistakes by the NP extractor (e.g. &quot;Muslims rulers&quot; was not recognized, while &quot;Muslims&quot; and &quot;rulers&quot; were recognized). Yet in other cases, manually marked sources do not match the definition of a noun phrase. This case is described in more detail next.</Paragraph> <Paragraph position="9"> 2. Multiple NP match. For 3461 of the 11322 source mentions, more than one NP overlaps the source. In roughly a quarter of these cases the multiple match is due to the presence of nested NPs (introduced by the NP augmentation process described in Section 4). In other cases the multiple match is caused by source annotations that span multiple NPs or include material other than NPs inside their span. There are three general classes of such sources. First, some of the marked sources are appositives such as &quot;the country's new president, Eduardo Duhalde&quot;. Second, some sources contain an NP followed by an attached prepositional phrase such as &quot;Latin American leaders at a summit meeting in Costa Rica&quot;. Third, some sources are conjunctions of NPs such as &quot;Britain, Canada and Australia&quot;. Treatment of the latter is still a controversial problem in the context of coreference resolution, as it is unclear whether conjunctions represent entities that are distinct from the conjuncts. For the purpose of our current work we do not attempt to address conjunctions.</Paragraph> <Paragraph position="10"> 3. No matching NP. Finally, for 50 of the 11322 sources there are no overlapping NPs. Half of those (25 to be exact) included marking of the word &quot;who&quot;, as in the sentence &quot;Carmona named new ministers, including two military officers who rebelled against Chavez&quot;. Of the other 25, 19 included markings of non-NPs such as question words, qualifiers, and adjectives (e.g. &quot;many&quot;, &quot;which&quot;, and &quot;domestically&quot;). The remaining six are rare NPs such as &quot;lash&quot; and &quot;taskforce&quot; that are mistakenly not recognized by the NP extractor.</Paragraph> <Paragraph position="11"> Counts for the different types of matches of sources to NPs are shown in Table 1. We determine the match in the problematic cases using the following set of heuristics (an illustrative sketch follows the list): 1. If a source matches any NP exactly in span, map the source to that NP; do this even if multiple NPs overlap the source - we are dealing with nested NPs.</Paragraph> <Paragraph position="12"> 2. If no NP matches the source exactly in span, then: * If a single NP overlaps the source, map the source to that NP. Most likely we are dealing with differently marked spans.</Paragraph> <Paragraph position="13"> * If multiple NPs overlap the source, determine whether the set of overlapping NPs includes any non-nested NPs. If all overlapping NPs are nested within each other, select the NP that is closest in span to the source - we are still dealing with differently marked spans, but now we also have nested NPs. If there is more than one set of nested NPs, then most likely the source spans more than a single NP. In this case we select the outermost NP of the last set of nested NPs before any preposition in the span. We prefer: the outermost NP, because longer NPs contain more information; the last NP, because it is likely to be the head NP of the phrase (this also handles the case of an explanation followed by a proper noun); and NPs before a preposition, because a preposition signals an explanatory prepositional phrase.</Paragraph> <Paragraph position="14"> 3. If no NP overlaps the source, select the last NP before the source. In half of these cases we are dealing with the word who, which typically refers to the last preceding NP.</Paragraph>
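An illustrative rendering of these heuristics is sketched below (Python, written for this summary rather than taken from the paper). Sources and NPs are represented as (start, end) token offsets, and the last-nested-set-before-a-preposition rule is simplified to choosing the outermost NP of the last nested group.

def spans_overlap(a, b):
    # Spans are (start, end) token offsets; they overlap if they share a position.
    return a[0] < b[1] and b[0] < a[1]

def spans_nested(a, b):
    # True if one span is contained in the other.
    return (b[0] <= a[0] and a[1] <= b[1]) or (a[0] <= b[0] and b[1] <= a[1])

def map_source_to_np(source, nps):
    # Heuristic 1: an exact span match wins, even when other NPs overlap (nested NPs).
    for np in nps:
        if np == source:
            return np
    overlapping = [np for np in nps if spans_overlap(np, source)]
    # Heuristic 2a: a single overlapping NP -- differently marked spans.
    if len(overlapping) == 1:
        return overlapping[0]
    if overlapping:
        if all(spans_nested(a, b) for a in overlapping for b in overlapping):
            # Heuristic 2b: all overlapping NPs nest; pick the one closest in span.
            return min(overlapping,
                       key=lambda np: abs(np[0] - source[0]) + abs(np[1] - source[1]))
        # Heuristic 2c: several nested groups; keep the outermost NP of the last group
        # (the preposition test of the original heuristic is omitted in this sketch).
        last_start = max(np[0] for np in overlapping)
        last_group = [np for np in overlapping if np[0] <= last_start < np[1]]
        return max(last_group, key=lambda np: np[1] - np[0])
    # Heuristic 3: no overlapping NP; fall back to the last NP preceding the source.
    preceding = [np for np in nps if np[1] <= source[0]]
    return max(preceding, key=lambda np: np[1]) if preceding else None

For example, map_source_to_np((1, 3), [(0, 3), (1, 3), (4, 6)]) returns (1, 3): the exact match is preferred even though a larger NP also overlaps the source.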
</Section> <Section position="7" start_page="11" end_page="11" type="metho"> <SectionTitle> 6 Source coreference resolution as coreference resolution </SectionTitle> <Paragraph position="0"> Once we isolate the source NPs, we apply coreference resolution using the standard combination of classification and single-link clustering (e.g. Soon et al. (2001) and Ng and Cardie (2002)).</Paragraph> <Paragraph position="1"> We compute a vector of 57 features for every pair of source noun phrases from the preprocessed corpus. We use the training set of pairwise instances to train a classifier to predict whether a source NP pair should be classified as positive (the NPs refer to the same entity) or negative (the NPs refer to different entities). During testing, we use the trained classifier to predict whether a source NP pair is positive and apply single-link clustering to group together sources that belong to the same entity.</Paragraph>
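The classification-plus-clustering step can be illustrated with the following sketch (again illustrative Python, not the authors' implementation; the 57-dimensional feature extractor and the trained pairwise classifier are assumed to be supplied by the caller, with the classifier treated as a callable that returns True when a pair is predicted to corefer).

from itertools import combinations

def single_link_coreference(source_nps, extract_features, is_coreferent):
    # Group source NPs into entities: every pair predicted positive links its two
    # NPs, and single-link clustering merges the clusters transitively (union-find).
    parent = {np: np for np in source_nps}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for np1, np2 in combinations(source_nps, 2):
        if is_coreferent(extract_features(np1, np2)):
            union(np1, np2)

    clusters = {}
    for np in source_nps:
        clusters.setdefault(find(np), []).append(np)
    return list(clusters.values())

Single-link clustering over pairwise decisions amounts to taking connected components of the positive links, which is why a union-find structure suffices here.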
</Section> </Paper>