<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2019"> <Title>Constraint-based Sentence Compression: An Integer Programming Approach</Title> <Section position="5" start_page="145" end_page="148" type="metho"> <SectionTitle> 3 Problem Formulation </SectionTitle> <Paragraph position="0"> Our work models sentence compression explicitly as an optimisation problem. There are $2^n$ possible compressions for each sentence and while many of these will be unreasonable (Knight and Marcu 2002), it is unlikely that only one compression will be satisfactory. Ideally, we require a function that captures the operations (or rules) that can be performed on a sentence to create a compression while at the same time factoring in how desirable each operation makes the resulting compression. We can then perform a search over all possible compressions and select the best one, as determined by how desirable it is.</Paragraph> <Paragraph position="1"> Our formulation consists of two basic components: a language model (scoring function) and a small number of constraints ensuring that the resulting compressions are structurally and semantically valid. Our task is to find a globally optimal compression in the presence of these constraints. We solve this inference problem using Integer Programming without resorting to heuristics or approximations during the decoding process. Integer programming has recently been applied to several classification tasks, including relation extraction (Roth and Yih 2004), semantic role labelling (Punyakanok et al. 2004), and the generation of route directions (Marciniak and Strube 2005).</Paragraph> <Paragraph position="2"> Before describing our model in detail, we introduce some of the concepts and terms used in Linear Programming and Integer Programming (see Winston and Venkataramanan 2003 for an introduction). Linear Programming (LP) is a tool for solving optimisation problems in which the aim is to maximise (or minimise) a given function with respect to a set of constraints. The function to be maximised (or minimised) is referred to as the objective function. Both the objective function and the constraints must be linear. A number of decision variables are under our control and exert influence on the objective function. Specifically, they have to be optimised in order to maximise (or minimise) the objective function. Finally, a set of constraints restricts the values that the decision variables can take. Integer Programming is an extension of linear programming in which all decision variables must take integer values.</Paragraph> <Section position="1" start_page="145" end_page="146" type="sub_section"> <SectionTitle> 3.1 Language Model </SectionTitle> <Paragraph position="0"> Assume we have a sentence $W = w_1, w_2, \ldots, w_n$ for which we wish to generate a compression. We introduce a decision variable for each word in the original sentence and constrain it to be binary; a value of 0 represents a word being dropped, whereas a value of 1 includes the word in the compression. Let:
$$ y_i = \begin{cases} 1 & \text{if } w_i \text{ is in the compression} \\ 0 & \text{otherwise} \end{cases} \qquad \forall i \in [1 \ldots n] $$</Paragraph> <Paragraph position="1"> If we were using a unigram language model, our objective function would maximise the overall sum of the decision variables (i.e., words) multiplied by their unigram probabilities (all probabilities throughout this paper are log-transformed):
$$ \max z = \sum_{i=1}^{n} y_i \cdot P(w_i) $$
Thus if a word is selected, its corresponding $y_i$ is given a value of 1, and its probability $P(w_i)$ according to the language model will be counted in our total score, $z$.</Paragraph>
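<Paragraph position="2"> The following minimal sketch (not the authors' implementation) shows how this unigram objective can be written with the open-source PuLP modelling library in Python. The word list and probabilities are made-up placeholders, and a toy minimum-length constraint is added only so that the maximiser does not trivially drop every word, since all log-probabilities are negative:
    import math
    from pulp import LpProblem, LpVariable, LpMaximize, lpSum

    words = ["he", "became", "a", "power", "player"]                 # hypothetical input sentence
    unigram = dict(zip(words, [0.020, 0.010, 0.050, 0.001, 0.004]))  # toy probabilities

    prob = LpProblem("unigram_compression", LpMaximize)
    # y[i] = 1 if word i is kept in the compression, 0 if it is dropped
    y = LpVariable.dicts("y", range(len(words)), cat="Binary")
    # objective z: sum of log-probabilities of the retained words
    prob += lpSum(y[i] * math.log(unigram[w]) for i, w in enumerate(words))
    # toy length constraint (cf. Equation (22)); without it the empty
    # compression would be optimal because every log-probability is negative
    prob += lpSum(y.values()) >= 3
    prob.solve()
    print([w for i, w in enumerate(words) if y[i].varValue == 1])
</Paragraph>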
<Paragraph position="3"> A unigram language model will probably generate many ungrammatical compressions. We therefore use a more context-aware model in our objective function, namely a trigram model. Formulating a trigram model in terms of an integer program becomes a more involved task since we now must make decisions based on word sequences rather than isolated words. We first create some extra decision variables, where $w_0$ denotes the 'start' token:
$$ p_i = \begin{cases} 1 & \text{if } w_i \text{ starts the compression} \\ 0 & \text{otherwise} \end{cases} \qquad \forall i \in [1 \ldots n] $$
$$ q_{ij} = \begin{cases} 1 & \text{if the sequence } w_i, w_j \text{ ends the compression} \\ 0 & \text{otherwise} \end{cases} \qquad \forall i \in [0 \ldots n-1],\; \forall j \in [i+1 \ldots n] $$
$$ x_{ijk} = \begin{cases} 1 & \text{if the sequence } w_i, w_j, w_k \text{ is in the compression} \\ 0 & \text{otherwise} \end{cases} \qquad \forall i \in [0 \ldots n-2],\; \forall j \in [i+1 \ldots n-1],\; \forall k \in [j+1 \ldots n] $$</Paragraph> <Paragraph position="4"> Our objective function is given in Equation (1). This is the sum of all possible trigrams that can occur in all compressions of the original sentence, where $w_0$ represents the 'start' token and $w_i$ is the $i$th word in sentence $W$. Equation (2) constrains the decision variables to be binary.
$$ \max z = \sum_{i=1}^{n} p_i \cdot P(w_i \mid \text{start}) + \sum_{i=0}^{n-2} \sum_{j=i+1}^{n-1} \sum_{k=j+1}^{n} x_{ijk} \cdot P(w_k \mid w_i, w_j) + \sum_{i=0}^{n-1} \sum_{j=i+1}^{n} q_{ij} \cdot P(\text{end} \mid w_i, w_j) \quad (1) $$
$$ \text{subject to } y_i, p_i, x_{ijk}, q_{ij} \in \{0, 1\} \qquad \forall i, j, k \quad (2) $$</Paragraph> <Paragraph position="5"> The objective function in (1) allows any combination of trigrams to be selected. This means that invalid trigram sequences (e.g., two or more trigrams containing the symbol 'end') could appear in the output compression. We avoid this situation by introducing sequential constraints (on the decision variables $y_i$, $x_{ijk}$, $p_i$, and $q_{ij}$) that restrict the set of allowable trigram combinations.</Paragraph> <Paragraph position="6"> Constraint 1 Exactly one word can begin a sentence.
$$ \sum_{i=1}^{n} p_i = 1 \quad (3) $$</Paragraph> <Paragraph position="7"> Constraint 2 If a word is included in the sentence it must either start the sentence or be preceded by two other words or by one other word and the 'start' token $w_0$.
$$ y_k - p_k - \sum_{i=0}^{k-2} \sum_{j=i+1}^{k-1} x_{ijk} = 0 \qquad \forall k \in [1 \ldots n] \quad (4) $$</Paragraph> <Paragraph position="8"> Constraint 3 If a word is included in the sentence it must either be preceded by one word and followed by another, or it must be preceded by one word and end the sentence.
$$ y_j - \sum_{i=0}^{j-1} \sum_{k=j+1}^{n} x_{ijk} - \sum_{i=0}^{j-1} q_{ij} = 0 \qquad \forall j \in [1 \ldots n] \quad (5) $$</Paragraph> <Paragraph position="9"> Constraint 4 If a word is in the sentence it must be followed by two words, or followed by one word and then the end of the sentence, or it must be preceded by one word and end the sentence.
$$ y_i - \sum_{j=i+1}^{n-1} \sum_{k=j+1}^{n} x_{ijk} - \sum_{j=i+1}^{n} q_{ij} - \sum_{h=0}^{i-1} q_{hi} = 0 \qquad \forall i \in [1 \ldots n] \quad (6) $$</Paragraph> <Paragraph position="10"> Constraint 5 Exactly one word pair can end the sentence.
$$ \sum_{i=0}^{n-1} \sum_{j=i+1}^{n} q_{ij} = 1 \quad (7) $$</Paragraph>
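<Paragraph position="11"> As an illustration only, the sketch below builds these variables and the sequential constraints (3)-(7) for a toy sentence using PuLP (rather than the lp_solve setup of Section 4), with a dummy scoring function standing in for a real trigram language model; index 0 stands for the 'start' token:
    from itertools import combinations
    from pulp import LpProblem, LpVariable, LpMaximize, lpSum

    words = ["he", "became", "a", "power", "player"]     # hypothetical sentence w1..wn
    n = len(words)
    tokens = ["START"] + words                           # tokens[0] plays the role of w0

    def logp(u, v, w):
        """Placeholder for log P(w | u, v); plug in a real trigram model here."""
        return -1.0                                      # dummy constant score

    prob = LpProblem("trigram_compression", LpMaximize)
    y = LpVariable.dicts("y", range(1, n + 1), cat="Binary")   # word i is kept
    p = LpVariable.dicts("p", range(1, n + 1), cat="Binary")   # word i starts the compression
    x_idx = list(combinations(range(n + 1), 3))                # strictly increasing triples over 0..n
    x = LpVariable.dicts("x", x_idx, cat="Binary")             # trigram wi wj wk is used
    q_idx = list(combinations(range(n + 1), 2))                # strictly increasing pairs over 0..n
    q = LpVariable.dicts("q", q_idx, cat="Binary")             # pair wi wj ends the compression

    # Objective (1): start terms + internal trigrams + sentence-final terms
    prob += (lpSum(p[i] * logp("START", "START", tokens[i]) for i in range(1, n + 1))
             + lpSum(x[t] * logp(tokens[t[0]], tokens[t[1]], tokens[t[2]]) for t in x_idx)
             + lpSum(q[t] * logp(tokens[t[0]], tokens[t[1]], "END") for t in q_idx))

    # Constraint 1, Eq. (3): exactly one word begins the compression
    prob += lpSum(p[i] for i in range(1, n + 1)) == 1
    for m in range(1, n + 1):
        # Constraint 2, Eq. (4): a kept word starts the sentence or is the third element of a used trigram
        prob += y[m] - p[m] - lpSum(x[t] for t in x_idx if t[2] == m) == 0
        # Constraint 3, Eq. (5): a kept word is the middle of a trigram or the last word of the compression
        prob += (y[m] - lpSum(x[t] for t in x_idx if t[1] == m)
                 - lpSum(q[t] for t in q_idx if t[1] == m) == 0)
        # Constraint 4, Eq. (6): a kept word is followed by two words, or by one word and the end,
        # or is preceded by one word and ends the compression
        prob += (y[m] - lpSum(x[t] for t in x_idx if t[0] == m)
                 - lpSum(q[t] for t in q_idx if t[0] == m)
                 - lpSum(q[t] for t in q_idx if t[1] == m) == 0)
    # Constraint 5, Eq. (7): exactly one word pair ends the compression
    prob += lpSum(q[t] for t in q_idx) == 1

    prob.solve()
    print([tokens[i] for i in range(1, n + 1) if y[i].varValue == 1])
Because every index triple is strictly increasing, the selected trigrams necessarily chain into a single left-to-right path from the unique start word to the unique end pair, which is exactly what the sequential constraints are meant to enforce.
</Paragraph>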
<Paragraph position="12"> Example compressions using the trigram model just described are given in Table 1.
O: He became a power player in Greek Politics in 1974, when he founded the socialist Pasok Party.
LM: He became a player in the Pasok.
Mod: He became a player in the Pasok Party.
Sen: He became a player in politics.
Sig: He became a player in politics when he founded the Pasok Party.
O: Finally, AppleShare Printer Server, formerly a separate package, is now bundled with AppleShare File Server.
LM: Finally, AppleShare, a separate, AppleShare.
Mod: Finally, AppleShare Server, is bundled.
Sen: Finally, AppleShare Server, is bundled with Server.
Sig: AppleShare Printer Server package is now bundled with AppleShare File Server.
Table 1: Compression examples (O: original sentence, LM: compression with the trigram model, Mod: compression with LM and modifier constraints, Sen: compression with LM, Mod and sentential constraints, Sig: compression with LM, Mod, Sen, and significance score).</Paragraph> <Paragraph position="13"> In its current state the model does a reasonable job of modelling local word dependencies, but is unable to capture syntactic dependencies that could potentially allow more meaningful compressions. For example, it does not know that Pasok Party is the object of founded or that AppleShare modifies Printer Server.</Paragraph> </Section> <Section position="2" start_page="146" end_page="147" type="sub_section"> <SectionTitle> 3.2 Linguistic Constraints </SectionTitle> <Paragraph position="0"> In this section we propose a set of global constraints that extend the basic language model presented in Equations (1)-(7). Our aim is to bring some syntactic knowledge into the compression model and to preserve the meaning of the original sentence as much as possible. Our constraints are linguistically and semantically motivated in a similar fashion to the grammar checking component of Jing (2000). Importantly, we do not require any additional knowledge sources (such as a lexicon) beyond the parse and grammatical relations of the original sentence. This is provided in our experiments by the Robust Accurate Statistical Parsing (RASP) toolkit (Briscoe and Carroll 2002). However, there is nothing inherent in our formulation that restricts us to RASP; any other parser with similar output could serve our purposes.</Paragraph> <Paragraph position="1"> Modifier Constraints Modifier constraints ensure that relationships between head words and their modifiers remain grammatical in the compression:
$$ y_i - y_j \ge 0 \qquad \forall i, j : w_j \in w_i\text{'s ncmods} \quad (8) $$
$$ y_i - y_j \ge 0 \qquad \forall i, j : w_j \in w_i\text{'s detmods} \quad (9) $$
Equation (8) guarantees that if we include a non-clausal modifier (ncmod) in the compression then the head of the modifier must also be included; this is repeated for determiners (detmod) in (9).</Paragraph> <Paragraph position="2"> We also want to ensure that the meaning of the original sentence is preserved in the compression, particularly in the face of negation. Equation (10) implements this by forcing not into the compression when its head is included. A similar constraint is added for possessive modifiers (e.g., his, our), as shown in Equation (11). Genitives (e.g., John's gift) are treated separately, mainly because they are encoded as different relations in the parser (see Equation (12)).
$$ y_i - y_j = 0 \qquad \forall i, j : w_j \in w_i\text{'s ncmods} \wedge w_j = \textit{not} \quad (10) $$
$$ y_i - y_j = 0 \qquad \forall i, j : w_j \in w_i\text{'s possessive ncmods} \quad (11) $$
$$ y_i - y_j = 0 \qquad \forall i, j : w_i, w_j \text{ in a genitive relation} \quad (12) $$</Paragraph> <Paragraph position="3"> Compression examples with the addition of the modifier constraints are shown in Table 1. Although the compressions are grammatical (see the inclusion of Party due to the modifier Pasok and Server due to AppleShare), they are not entirely meaning preserving.</Paragraph>
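<Paragraph position="4"> A sketch of how such constraints can be added, continuing the trigram PuLP example above (so `prob`, `y` and `words` are assumed to be in scope); the relation tuples and their format are invented for illustration and do not reproduce RASP's actual output:
    # continues the PuLP sketch above (prob, y, words in scope)
    # Hypothetical grammatical relations, each as (type, head_index, modifier_index)
    # with 1-based word indices into the toy sentence.
    relations = [("ncmod", 5, 4), ("detmod", 3, 2)]          # made-up parse output
    negations_and_possessives = {"not", "his", "our", "her", "their"}

    for rel, head, mod in relations:
        if rel in ("ncmod", "detmod"):
            # Equations (8)-(9): keeping the modifier forces keeping its head
            prob += y[head] - y[mod] >= 0
        if rel == "ncmod" and words[mod - 1] in negations_and_possessives:
            # Equations (10)-(11): negation and possessive modifiers stand or
            # fall together with their head
            prob += y[head] - y[mod] == 0
</Paragraph>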
<Paragraph position="5"> Sentential Constraints We also define a few intuitive constraints that take the overall sentence structure into account. The first constraint (Equation (13)) ensures that if a verb is present in the compression then so are its arguments, and if any of the arguments are included in the compression then the verb must also be included. We thus force the program to make the same decision on the verb, its subject, and object.
$$ y_i - y_j = 0 \qquad \forall i, j : w_j \in \text{subject/object of verb } w_i \quad (13) $$</Paragraph> <Paragraph position="6"> Our second constraint forces the compression to contain at least one verb provided the original sentence contains one as well:
$$ \sum_{i : w_i \in \text{verbs}} y_i \ge 1 \quad (14) $$</Paragraph> <Paragraph position="7"> Other sentential constraints include Equations (15) and (16), which apply to prepositional phrases, wh-phrases and complements. These constraints force the introducing term (i.e., the preposition, complement or wh-word) to be included in the compression if any word from within the syntactic constituent is also included. The reverse is also true, i.e., if the introducing term is included, at least one other word from the syntactic constituent should also be included.
$$ y_i - y_j \ge 0 \qquad \forall i, j : w_j \in \text{constituent introduced by } w_i \quad (15) $$
$$ \sum_{j : w_j \in \text{constituent introduced by } w_i} y_j \ge y_i \qquad \forall i \quad (16) $$</Paragraph> <Paragraph position="8"> We also wish to handle coordination. If two head words are conjoined in the original sentence, then if they are both included in the compression the coordinating conjunction must also be included:
$$ y_j + y_k - y_i \le 1 \qquad \forall i, j, k : w_j, w_k \text{ conjoined by } w_i \quad (17) $$</Paragraph> <Paragraph position="9"> Table 1 illustrates the compression output when sentential constraints are added to the model. We see that politics is forced into the compression due to the presence of in; furthermore, since bundled is in the compression, its object with Server is included too.</Paragraph> <Paragraph position="10"> Finally, we impose some hard constraints on the compression output. First, Equation (20) disallows anything within brackets in the original sentence from being included in the compression. This is a somewhat superficial attempt at excluding parenthetical and potentially unimportant material from the compression. Second, Equation (21) forces personal pronouns to be included in the compression. The constraint is important for generating coherent document compressions as opposed to sentence compressions.
$$ \sum_{i : w_i \in \text{brackets}} y_i = 0 \quad (20) $$
$$ y_i = 1 \qquad \forall i : w_i \in \text{personal pronouns} \quad (21) $$</Paragraph> <Paragraph position="11"> It is also possible to influence the length of the compressed sentence. For example, Equation (22) forces the compression to contain at least $b$ tokens. Alternatively, we could force the compression to be exactly $b$ tokens (by substituting $\ge$ with $=$ in (22)) or at most $b$ tokens (by replacing $\ge$ with $\le$).
$$ \sum_{i=1}^{n} y_i \ge b \quad (22) $$</Paragraph>
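<Paragraph position="12"> Continuing the illustrative PuLP sketch, the sentential and hard constraints translate directly into linear (in)equalities; the part-of-speech indices, argument pairs and bracket spans below are invented placeholders rather than real parser output:
    # continues the PuLP sketch above (prob, y, words, lpSum in scope)
    n = len(words)
    verb_indices = [2]                     # hypothetical: word 2 ("became") is a verb
    subj_obj = [(2, 1), (2, 5)]            # (verb_index, argument_index) pairs, made up
    bracketed = []                         # indices of words inside brackets, if any
    pronouns = [1]                         # hypothetical personal pronoun indices
    b = max(5, int(0.4 * n))               # minimum length as used in Section 4

    for v, arg in subj_obj:
        prob += y[v] - y[arg] == 0         # Eq. (13): verb and its arguments stand or fall together
    if verb_indices:
        prob += lpSum(y[v] for v in verb_indices) >= 1   # Eq. (14): keep at least one verb
    for i in bracketed:
        prob += y[i] == 0                  # Eq. (20): drop bracketed material
    for i in pronouns:
        prob += y[i] == 1                  # Eq. (21): keep personal pronouns
    prob += lpSum(y[i] for i in range(1, n + 1)) >= b    # Eq. (22): minimum compression length
</Paragraph>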
</Section> <Section position="3" start_page="147" end_page="148" type="sub_section"> <SectionTitle> 3.3 Significance Score </SectionTitle> <Paragraph position="0"> While the constraint-based language model produces more grammatical output than a regular language model, the sentences are typically not great compressions. The language model has no notion of which content words to include in the compression and thus prefers words it has seen before. But words or constituents will be of different relative importance in different documents or even sentences.</Paragraph> <Paragraph position="1"> Inspired by Hori and Furui (2004), we add to our objective function (see Equation (1)) a significance score designed to highlight important content words. Specifically, we modify Hori and Furui's significance score to give more weight to content words that appear in the deepest level of embedding in the syntactic tree, which usually contains the gist of the original sentence:
$$ I(w_i) = \frac{l}{N} \, f_i \log \frac{F_a}{F_i} \quad (23) $$
The significance score above is computed using a large corpus, where $w_i$ is a topic word (i.e., a noun or verb), $f_i$ and $F_i$ are the frequency of $w_i$ in the document and the corpus respectively, and $F_a$ is the total frequency of all topic words in the corpus. $l$ is the number of clause constituents above $w_i$, and $N$ is the deepest level of embedding. The modified objective function is given below:
$$ \max z = \sum_{i=1}^{n} y_i \cdot I(w_i) + \sum_{i=1}^{n} p_i \cdot P(w_i \mid \text{start}) + \sum_{i=0}^{n-2} \sum_{j=i+1}^{n-1} \sum_{k=j+1}^{n} x_{ijk} \cdot P(w_k \mid w_i, w_j) + \sum_{i=0}^{n-1} \sum_{j=i+1}^{n} q_{ij} \cdot P(\text{end} \mid w_i, w_j) \quad (24) $$</Paragraph> <Paragraph position="2"> A weighting factor could also be added to the objective function to counterbalance the importance of the language model and the significance score.</Paragraph>
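<Paragraph position="3"> A direct transcription of Equation (23) in Python, useful for checking the arithmetic; the corpus statistics passed in are assumed to be precomputed elsewhere, and the example counts are invented:
    import math

    def significance(f_doc, f_corpus, f_all_topics, depth, max_depth):
        """Equation (23): I(w) = (l / N) * f_i * log(F_a / F_i).

        f_doc        -- frequency of the topic word in the current document (f_i)
        f_corpus     -- frequency of the topic word in the large corpus (F_i)
        f_all_topics -- total frequency of all topic words in the corpus (F_a)
        depth        -- number of clause constituents above the word (l)
        max_depth    -- deepest level of embedding in the sentence (N)
        """
        return (depth / max_depth) * f_doc * math.log(f_all_topics / f_corpus)

    # Toy usage: a word seen 3 times in the document, 1,200 times in the corpus,
    # 25 million topic-word tokens overall, sitting 2 clause levels down in a
    # sentence whose deepest level is 4.
    print(significance(3, 1200, 25_000_000, 2, 4))
In the sketches above, these scores would simply be added to the objective as an extra lpSum of $y_i \cdot I(w_i)$ terms, mirroring Equation (24).
</Paragraph>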
</Section> </Section> <Section position="6" start_page="148" end_page="149" type="metho"> <SectionTitle> 4 Evaluation Set-up </SectionTitle> <Paragraph position="0"> We evaluated the approach presented in the previous sections against Knight and Marcu's (2002) decision-tree model. This model is a good basis for comparison as it operates on parse trees and is therefore aware of syntactic structure (as our models are) but requires a large parallel corpus for training, whereas our models do not; and it yields comparable performance to the noisy-channel model.2 The decision-tree model was compared against two variants of our IP model. Both variants employed the constraints described in Section 3.2 but differed in that one variant included the significance score in its objective function (see (24)), whereas the other one did not (see (1)). In both cases the sequential constraints from Section 3.1 were applied to ensure that the language model was well-formed. We give details below on the corpora we used and explain how the different model parameters were estimated. We also discuss how evaluation was carried out using human judgements.</Paragraph> <Paragraph position="1"> 2 Turner and Charniak (2005) argue that the noisy-channel model is not an appropriate compression model since it uses a source model trained on uncompressed sentences and as a result tends to consider compressed sentences less likely than uncompressed ones.</Paragraph> <Paragraph position="2"> Corpora We evaluate our systems on two different corpora. The first is the compression corpus of Knight and Marcu (2002), derived automatically from document-abstract pairs of the Ziff-Davis corpus. This corpus has been used in most previous compression work. We also created a compression corpus from the HUB-4 1996 English Broadcast News corpus (provided by the LDC). We asked annotators to produce compressions for 50 broadcast news stories (1,370 sentences).3 The Ziff-Davis corpus is partitioned into a training set (1,035 sentences) and a test set (32 sentences). We held out 50 sentences from the training set for development purposes. We also split the Broadcast News corpus into a training and test set (1,237/133 sentences). Forty sentences were randomly selected for evaluation purposes, 20 from the test portion of the Ziff-Davis corpus and 20 from the Broadcast News corpus test set.</Paragraph> <Paragraph position="3"> Parameter Estimation The decision-tree model was trained, using the same feature set as Knight and Marcu (2002), on the Ziff-Davis corpus and used to obtain compressions for both test corpora.4 For our IP models, we used a language model trained on 25 million tokens from the North American News corpus using the CMU-Cambridge Language Modeling Toolkit (Clarkson and Rosenfeld 1997) with a vocabulary size of 50,000 tokens and Good-Turing discounting. The significance score used in our second model was calculated using 25 million tokens from the Broadcast News Corpus (for the spoken data) and 25 million tokens from the American News Text Corpus (for the written data). Finally, the model that includes the significance score was optimised against a loss function similar to McDonald (2006) to bring the language model and the score into harmony. We used Powell's method (Press et al. 1992) and 50 sentences (randomly selected from the training set).</Paragraph> <Paragraph position="4"> 4 The decision-tree model did not produce meaningful compressions when trained on the Broadcast News corpus (in most cases it recreated the original sentence). Thus we used the decision model trained on Ziff-Davis to generate Broadcast News compressions.</Paragraph> <Paragraph position="5"> We also set a minimum compression length (using the constraint in Equation (22)) in both our models to avoid overly short compressions. The length was set at 40% of the original sentence length or five tokens, whichever was larger. Sentences under five tokens were not compressed.</Paragraph> <Paragraph position="6"> In our modelling framework, we generate and solve an IP for every sentence we wish to compress. We employed lp_solve for this purpose, an efficient Mixed Integer Programming solver.5 Sentences typically take less than a few seconds to compress on a 2 GHz Pentium IV machine.</Paragraph> <Paragraph position="7"> Human Evaluation As mentioned earlier, the output of our models is evaluated on 40 examples. Although the size of our test set is comparable to previous studies (which are typically assessed on 32 sentences from the Ziff-Davis corpus), the sample is too small to conduct significance testing. To counteract this, human judgements are often collected on compression output; however, these evaluations are limited to small subject pools (often four judges; Knight and Marcu 2002; Turner and Charniak 2005; McDonald 2006), which makes it difficult to apply inferential statistics to the data. We overcome this problem by conducting our evaluation using a larger sample of subjects.</Paragraph> <Paragraph position="8"> Specifically, we elicited human judgements from 56 unpaid volunteers, all self-reported native English speakers. The elicitation study was conducted over the Internet. Participants were presented with a set of instructions that explained the sentence compression task with examples. They were asked to judge 160 compressions in total. These included the output of the three automatic systems on the 40 test sentences, paired with their gold standard compressions. Participants were asked to read the original sentence and then reveal its compression by pressing a button. They were told that all compressions were generated automatically. A Latin square design ensured that subjects did not see two different compressions of the same sentence. The order of the sentences was randomised. Participants rated each compression on a five-point scale based on the information retained and its grammaticality. Examples of our experimental items are given in Table 2.</Paragraph> </Section> </Paper>