<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1018"> <Title>Modeling Local Coherence: An Entity-based Approach</Title> <Section position="4" start_page="141" end_page="142" type="metho"> <SectionTitle> 3 The Coherence Model </SectionTitle> <Paragraph position="0"> In this section we introduce our entity-based representation of discourse. We describe how it can be computed and how entity transition patterns can be extracted. The latter constitute a rich feature space on which probabilistic inference is performed.</Paragraph> <Paragraph position="1"> Text Representation Each text is represented by an entity grid, a two-dimensional array that captures the distribution of discourse entities across text sentences. We follow Miltsakaki and Kukich (2000) in assuming that our unit of analysis is the traditional sentence (i.e., a main clause with accompanying subordinate and adjunct clauses). The rows of the grid correspond to sentences, while the columns correspond to discourse entities. By discourse entity we mean a class of coreferent noun phrases. For each occurrence of a discourse entity in the text, the corresponding grid cell contains information about its grammatical role in the given sentence. Each grid column thus corresponds to a string from a set of categories reflecting the entity's presence or absence in a sequence of sentences. Our set consists of four symbols: S (subject), O (object), X (neither subject nor object) and - (gap which signals the entity's absence from a given sentence).</Paragraph> <Paragraph position="2"> Table 1 illustrates a fragment of an entity grid constructed for the text in Table 2. Since the text contains six sentences, the grid columns are of length six. Consider for instance the grid column for the entity trial, [O - - - - X]. It records that trial is present in sentences 1 and 6 (as O and X respectively) but is absent from the rest of the sentences. Grid Computation The ability to identify and cluster coreferent discourse entities is an important prerequisite for computing entity grids. The same entity may appear in different linguistic forms, e.g., Microsoft Corp., Microsoft, and the company, but should still be mapped to a single entry in the grid. Table 1 exemplifies the entity grid for the text in Table 2 when coreference resolution is taken into account. To automatically compute entity classes, are represented by their head nouns.</Paragraph> </Section> <Section position="5" start_page="142" end_page="143" type="metho"> <SectionTitle> 1 [The Justice Department]S is conducting an [anti-trust </SectionTitle> <Paragraph position="0"> trial]O against [Microsoft Corp.]X with [evidence]X that [the company]S is increasingly attempting to crush [competitors]O.</Paragraph> <Paragraph position="1"> 2 [Microsoft]O is accused of trying to forcefully buy into [markets]X where [its own products]S are not competitive enough to unseat [established brands]O.</Paragraph> <Paragraph position="2"> 3 [The case]S revolves around [evidence]O of [Microsoft]S aggressively pressuring [Netscape]O into merging [browser software]O.</Paragraph> <Paragraph position="3"> 4 [Microsoft]S claims [its tactics]S are commonplace and good economically.</Paragraph> <Paragraph position="4"> 5 [The government]S may file [a civil suit]O ruling that [conspiracy]S to curb [competition]O through [collusion]X is [a violation of the Sherman Act]O. 
<Paragraph position="5"> To automatically compute entity classes, we employ a state-of-the-art noun phrase coreference resolution system (Ng and Cardie, 2002) trained on the MUC (6-7) data sets. The system decides whether two NPs are coreferent by exploiting a wealth of features that fall broadly into four categories: lexical, grammatical, semantic, and positional. Once we have identified entity classes, the next step is to fill out grid entries with the relevant syntactic information. We employ a robust statistical parser (Collins, 1997) to determine the constituent structure of each sentence, from which subjects (s), objects (o), and relations other than subject or object (x) are identified. Passive verbs are recognized using a small set of patterns, and the underlying deep grammatical role of arguments involved in the passive construction is entered in the grid (see the grid cell o for Microsoft, Sentence 2, Table 2).</Paragraph> <Paragraph position="6"> When a noun is attested more than once with different grammatical roles in the same sentence, we default to the role with the highest grammatical ranking: subjects are ranked higher than objects, which in turn are ranked higher than the rest. For example, the entity Microsoft is mentioned twice in Sentence 1 with the grammatical roles x (for Microsoft Corp.) and s (for the company), but is represented only by s in the grid (see Tables 1 and 2).</Paragraph> <Paragraph position="7"> Coherence Assessment We introduce a method for coherence assessment that is based on the grid representation. A fundamental assumption underlying our approach is that the distribution of entities in coherent texts exhibits certain regularities reflected in grid topology. Some of these regularities are formalized in Centering Theory as constraints on transitions of the local focus in adjacent sentences. Grids of coherent texts are likely to have some dense columns (i.e., columns with just a few gaps, such as Microsoft in Table 1) and many sparse columns which consist mostly of gaps (see markets, earnings in Table 1). One would further expect that entities corresponding to dense columns are more often subjects or objects. These characteristics will be less pronounced in low-coherence texts.</Paragraph> <Paragraph position="8"> Inspired by Centering Theory, our analysis revolves around patterns of local entity transitions. A local entity transition is a sequence {S, O, X, -}^n that represents entity occurrences and their syntactic roles in n adjacent sentences. Local transitions can be easily obtained from a grid as contiguous subsequences of each column. Each transition has a certain probability in a given grid. For instance, the probability of the transition [S -] in the grid from Table 1 is 0.08, computed as the ratio of its frequency (i.e., six) to the total number of transitions of length two (i.e., 75). Each text can thus be viewed as a distribution defined over transition types. We believe that considering all entity transitions may uncover new patterns relevant for coherence assessment. We further refine our analysis by taking into account the salience of discourse entities. Centering and other discourse theories conjecture that the way an entity is introduced and mentioned depends on its global role in a given discourse. Therefore, we discriminate between transitions of salient entities and the rest, collecting statistics for each group separately. We identify salient entities based on their frequency (the frequency threshold is determined on a held-out development set; see Section 5 for further discussion), following the widely accepted view that the occurrence frequency of an entity correlates with its discourse prominence (Morris and Hirst, 1991; Grosz et al., 1995).</Paragraph>
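To make the transition statistics concrete, the following sketch (again our own illustration; it assumes the grid is stored as one role sequence per column, as in the snippet above) enumerates all contiguous length-n subsequences of every column and normalizes their counts into transition probabilities, in the spirit of the [S -] example.

```python
from collections import Counter
from itertools import product

def transition_probabilities(grid, n=2):
    """grid: {entity: list of role symbols, one per sentence}.
    Returns the probability of every length-n transition type, i.e. its
    count divided by the total number of length-n column subsequences."""
    counts = Counter()
    for column in grid.values():
        for i in range(len(column) - n + 1):
            counts[tuple(column[i:i + n])] += 1
    total = sum(counts.values())
    # Report every possible transition type, including unseen ones, so that
    # all documents are mapped onto the same fixed feature set (cf. Table 3).
    return {t: (counts[t] / total if total else 0.0)
            for t in product("SOX-", repeat=n)}

# Two toy columns of length six, in the spirit of Table 1.
toy_grid = {
    "Microsoft": ["S", "O", "S", "S", "-", "S"],
    "trial":     ["O", "-", "-", "-", "-", "X"],
}
probs = transition_probabilities(toy_grid, n=2)
print(probs[("S", "O")], probs[("-", "-")])  # 0.1 and 0.3 on this toy grid
```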
<Paragraph position="10"> Table 3: Example of a feature-vector document representation using all transitions of length two given the syntactic categories S, O, X, and -.</Paragraph> <Paragraph position="12"> Ranking We view coherence assessment as a ranking learning problem. The ranker takes as input a set of alternative renderings of the same document and ranks them based on their degree of local coherence. Examples of such renderings include a set of different sentence orderings of the same text and a set of summaries produced by different systems for the same document. Ranking is more suitable than classification for our purposes since, in text generation, a system needs a scoring function to compare among alternative renderings. Furthermore, coherence assessment is clearly not a categorical decision but a graded one: there is often no single coherent rendering of a given text but many different possibilities that can be partially ordered.</Paragraph> <Paragraph position="13"> As explained previously, coherence constraints are modeled in the grid representation implicitly by entity transition sequences. To apply a machine learning algorithm to our problem, we encode transition sequences explicitly using a standard feature vector notation. Each grid rendering j of a document d_i is represented by a feature vector Phi(x_ij) = (p_1(x_ij), p_2(x_ij), ..., p_m(x_ij)), where m is the number of all predefined entity transitions and p_t(x_ij) is the probability of transition t in grid x_ij. Note that considerable latitude is available when specifying the transition types to be included in a feature vector. These can be all transitions of a given length (e.g., two or three) or the most frequent transitions within a document collection. An example of a feature space with transitions of length two is illustrated in Table 3.</Paragraph> <Paragraph position="14"> The training set consists of ordered pairs of renderings (x_ij, x_ik), where x_ij and x_ik are renderings of the same document d_i and x_ij exhibits a higher degree of coherence than x_ik. Without loss of generality, we assume j > k. The goal of the training procedure is to find a parameter vector w that yields a "ranking score" function w * Phi(x_ij) which minimizes the number of violations of the pairwise rankings provided in the training set. Thus, the ideal w would satisfy the condition w * (Phi(x_ij) - Phi(x_ik)) > 0 for all i, j, k such that j > k. The problem is typically treated as a Support Vector Machine constraint optimization problem and can be solved using the search technique described in Joachims (2002a). This approach has been shown to be highly effective in various tasks ranging from collaborative filtering (Joachims, 2002a) to parsing (Toutanova et al., 2004).</Paragraph> <Paragraph position="15"> In our ranking experiments, we use Joachims' (2002a) SVMlight package for training and testing with all parameters set to their default values.</Paragraph> </Section>
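The pairwise constraint w * (Phi(x_ij) - Phi(x_ik)) > 0 can be illustrated with a standard reduction of ranking to binary classification on difference vectors. The sketch below is purely illustrative and is not the authors' setup: a generic linear SVM from scikit-learn stands in for the SVMlight ranking package, and the feature vectors and helper names are invented for the example.

```python
import numpy as np
from sklearn.svm import LinearSVC  # stand-in for SVMlight's ranking mode

def make_pairwise_data(pairs):
    """pairs: iterable of (phi_more_coherent, phi_less_coherent) vectors.
    Each pair yields a positive and a negated difference vector, so a binary
    linear classifier learns w such that w . (phi_j - phi_k) > 0."""
    X, y = [], []
    for better, worse in pairs:
        diff = np.asarray(better, dtype=float) - np.asarray(worse, dtype=float)
        X.append(diff)
        y.append(1)
        X.append(-diff)
        y.append(-1)
    return np.array(X), np.array(y)

def pairwise_accuracy(w, pairs):
    """Fraction of pairs on which the learned scorer prefers the more
    coherent rendering; random guessing gives 0.5."""
    return float(np.mean([np.dot(w, np.asarray(b) - np.asarray(k)) > 0
                          for b, k in pairs]))

# Toy feature vectors of transition probabilities, e.g. (p(SS), p(S-), p(-S), p(--)).
train_pairs = [
    ([0.30, 0.20, 0.10, 0.40], [0.10, 0.10, 0.10, 0.70]),
    ([0.25, 0.25, 0.20, 0.30], [0.05, 0.15, 0.10, 0.70]),
]
X, y = make_pairwise_data(train_pairs)
ranker = LinearSVC(C=1.0).fit(X, y)
print(pairwise_accuracy(ranker.coef_[0], train_pairs))  # 1.0 on the toy pairs
```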
<Section position="6" start_page="143" end_page="145" type="metho"> <SectionTitle> 4 Evaluation Set-Up </SectionTitle> <Paragraph position="0"> In this section we describe two evaluation tasks that assess the merits of the coherence modeling framework introduced above. We also give details regarding our data collection and parameter estimation. Finally, we introduce the baseline method used for comparison with our approach.</Paragraph> <Section position="1" start_page="143" end_page="144" type="sub_section"> <SectionTitle> 4.1 Text Ordering </SectionTitle> <Paragraph position="0"> Text structuring algorithms (Lapata, 2003; Barzilay and Lee, 2004; Karamanis et al., 2004) are commonly evaluated by their performance at information ordering. The task concerns determining a sequence in which to present a pre-selected set of information-bearing items; this is an essential step in concept-to-text generation, multi-document summarization, and other text-synthesis problems. Since local coherence is a key property of any well-formed text, our model can be used to rank alternative sentence orderings. We do not assume that local coherence is sufficient to uniquely determine the best ordering; other constraints clearly play a role here. However, we expect the accuracy of a coherence model to be reflected in its performance on the ordering task.</Paragraph> <Paragraph position="1"> Data To acquire a large collection for training and testing, we create synthetic data, wherein the candidate set consists of a source document and permutations of its sentences. This framework for data acquisition is widely used in the evaluation of ordering algorithms as it enables large-scale automatic evaluation. The underlying assumption is that the original sentence order in the source document must be coherent, and so we should prefer models that rank it higher than other permutations. Since we do not know the relative quality of different permutations, our corpus includes only pairwise rankings that comprise the original document and one of its permutations. Given k original documents, each with n randomly generated permutations, we obtain k * n (trivially) annotated pairwise rankings for training and testing.</Paragraph> <Paragraph position="2"> Using the technique described above, we collected data in two different genres: newspaper articles and accident reports written by government officials. The first collection consists of Associated Press articles from the North American News Corpus on the topic of natural disasters. The second includes narratives from the National Transportation Safety Board's database (both collections are available from http://www.csail.mit.edu/regina/coherence/). Both sets have documents of comparable length: the average number of sentences is 10.4 and 11.5, respectively. For each set, we used 100 source articles, each with 20 randomly generated permutations, for training. The same number of pairwise rankings (i.e., 2000) was used for testing. We held out 10 documents (i.e., 200 pairwise rankings) from the training data for development purposes.</Paragraph> </Section>
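The synthetic ordering data described above is straightforward to generate. The sketch below is our own illustration (the function name and the fixed random seed are not from the paper): it pairs each source document with n random sentence permutations, each pair trivially annotated with the original order as the more coherent rendering, yielding up to k * n pairwise rankings.

```python
import random

def make_ordering_pairs(documents, n_permutations=20, seed=0):
    """documents: list of documents, each given as a list of sentences.
    Returns (original, permuted) pairs in which the original sentence order
    is always treated as the more coherent member of the pair."""
    rng = random.Random(seed)
    pairs = []
    for doc in documents:
        for _ in range(n_permutations):
            permuted = doc[:]
            rng.shuffle(permuted)
            if permuted != doc:          # discard the identity permutation
                pairs.append((doc, permuted))
    return pairs

# 100 source articles with 20 permutations each would give about 2000 rankings;
# here a toy corpus of two short documents is used instead.
toy_docs = [["s1", "s2", "s3", "s4"], ["a", "b", "c"]]
print(len(make_ordering_pairs(toy_docs, n_permutations=3)))  # at most 6
```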
<Section position="2" start_page="144" end_page="145" type="sub_section"> <SectionTitle> 4.2 Summary Evaluation </SectionTitle> <Paragraph position="0"> We further test the ability of our method to assess coherence by comparing model-induced rankings against rankings elicited by human judges. Admittedly, the information-ordering task only partially approximates degrees of coherence violation through different sentence permutations of a source document. A stricter evaluation exercise concerns the assessment of texts with naturally occurring coherence violations, as perceived by human readers. Representative examples of such texts are automatically generated summaries, which often contain sentences taken out of context and thus display problems with respect to local coherence (e.g., dangling anaphors, thematically unrelated sentences). A model that exhibits high agreement with human judges not only accurately captures the coherence properties of the summaries in question, but ultimately holds promise for the automatic evaluation of machine-generated texts. Existing automatic evaluation measures such as BLEU (Papineni et al., 2002) and ROUGE (Lin and Hovy, 2003) are not designed for the coherence assessment task, since they focus on content similarity between system output and reference texts.</Paragraph> <Paragraph position="3"> Data Our evaluation was based on materials from the Document Understanding Conference (DUC, 2003), which include multi-document summaries produced by human writers and by automatic summarization systems. In order to learn a ranking, we require a set of summaries, each of which has been rated in terms of coherence. We therefore elicited judgments from human subjects. We randomly selected 16 input document clusters and five systems that had produced summaries for these sets, along with summaries composed by several humans.</Paragraph> <Paragraph position="4"> To ensure that we do not tune the model to a particular system, we used the output summaries of distinct systems for training and testing. Our set of training materials contained 4 * 16 summaries (average length 4.8 sentences), yielding (4 choose 2) * 16 = 96 pairwise rankings. In a similar fashion, we obtained 32 pairwise rankings for the test set. Six documents from the training data were used as a development set.</Paragraph> <Paragraph position="5"> Coherence ratings were obtained in an elicitation study with 177 unpaid volunteers, all native speakers of English. The study was conducted remotely over the Internet. Participants first saw a set of instructions that explained the task and defined the notion of coherence using multiple examples.</Paragraph> <Paragraph position="6"> The summaries were randomized in lists following a Latin square design, ensuring that no two summaries in a given list were generated from the same document cluster. Participants were asked to use a seven-point scale to rate how coherent the summaries were, without having seen the source texts. The ratings (approximately 23 per summary) given by our subjects were averaged to provide a rating between 1 and 7 for each summary.</Paragraph> <Paragraph position="7"> The reliability of the collected judgments is crucial for our analysis; we therefore performed several tests to validate the quality of the annotations. First, we measured how well humans agree in their coherence assessment. We employed leave-one-out resampling (Weiss and Kulikowski, 1991), correlating the ratings obtained from each participant with the mean coherence ratings obtained from all other participants. The inter-subject agreement was r = .768. (Kappa is not appropriate for measuring agreement here since it is designed for nominal scales, whereas our summaries are rated on an ordinal scale.) Second, we examined the effect of different types of summaries (human- vs. machine-generated). An ANOVA revealed a reliable effect of summary type (F(1,15) = 20.38, p < 0.01), indicating that human summaries are perceived as significantly more coherent than system-generated ones. Finally, the judgments of our participants exhibit a significant correlation with DUC evaluations (r = .41, p < 0.01).</Paragraph> </Section>
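For completeness, here is a minimal sketch of the leave-one-out resampling estimate of inter-subject agreement (our own illustration; it assumes a complete participants-by-summaries rating matrix, whereas in the actual study each participant rated only a subset of the summaries): each participant's ratings are correlated with the mean ratings of the remaining participants, and the resulting correlations are averaged.

```python
import numpy as np

def leave_one_out_agreement(ratings):
    """ratings: array of shape (n_participants, n_summaries), e.g. 1-7 scores.
    Correlates each participant with the mean of all other participants and
    returns the average Pearson r."""
    ratings = np.asarray(ratings, dtype=float)
    correlations = []
    for i in range(ratings.shape[0]):
        others_mean = np.delete(ratings, i, axis=0).mean(axis=0)
        correlations.append(np.corrcoef(ratings[i], others_mean)[0, 1])
    return float(np.mean(correlations))

# Toy ratings: 3 participants, 4 summaries, seven-point scale.
toy_ratings = [[6, 2, 5, 3],
               [7, 1, 4, 3],
               [5, 2, 6, 4]]
print(round(leave_one_out_agreement(toy_ratings), 2))
```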
<Section position="3" start_page="145" end_page="145" type="sub_section"> <SectionTitle> 4.3 Parameter Estimation </SectionTitle> <Paragraph position="0"> Our model has two free parameters: the frequency threshold used to identify salient entities and the length of the transition sequence. These parameters were tuned separately for each data set on the corresponding held-out development set. For our ordering and summarization experiments, optimal salience-based models were obtained for entities with frequency >= 2. The optimal transition length was <= 3 for ordering and <= 2 for summarization.</Paragraph> </Section> <Section position="4" start_page="145" end_page="145" type="sub_section"> <SectionTitle> 4.4 Baseline </SectionTitle> <Paragraph position="0"> We compare our algorithm against the coherence model proposed by Foltz et al. (1998), which measures coherence as a function of semantic relatedness between adjacent sentences. Semantic relatedness is computed automatically using Latent Semantic Analysis (LSA, Landauer and Dumais 1997) from raw text, without employing syntactic or other annotations. This model is a good point of comparison for several reasons: (a) it is fully automatic; (b) it is not a straw-man baseline, as it correlates reliably with human judgments and has been used to analyze discourse structure; and (c) it models an aspect of coherence which is orthogonal to ours (their model is lexicalized).</Paragraph> <Paragraph position="1"> Following Foltz et al. (1998), we constructed vector-based representations for individual words from a lemmatized version of the North American News Text Corpus (350 million words) using a term-document matrix; we selected this corpus because of its similarity to the DUC corpus, which primarily consists of news stories. We used singular value decomposition to reduce the semantic space to 100 dimensions, thus obtaining a space similar to LSA. We represented the meaning of a sentence as a vector by taking the mean of the vectors of its words. The similarity between two sentences was determined by measuring the cosine of their means. An overall text coherence measure was obtained by averaging the cosines over all pairs of adjacent sentences.</Paragraph> <Paragraph position="2"> In sum, each text was represented by a single feature, its sentence-to-sentence semantic similarity. During training, the ranker learns an appropriate threshold value for this feature.</Paragraph> </Section> <Section position="5" start_page="145" end_page="145" type="sub_section"> <SectionTitle> 4.5 Evaluation Metric </SectionTitle> <Paragraph position="0"> Model performance was assessed in the same way for information ordering and summary evaluation. Given a set of pairwise rankings, we measure accuracy as the ratio of correct predictions made by the model over the size of the test set. In this setup, random prediction results in an accuracy of 50%.</Paragraph> </Section> </Section> </Paper>