File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/e06-1038_metho.xml
Size: 21,679 bytes
Last Modified: 2025-10-06 14:10:04
<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1038"> <Title>Discriminative Sentence Compression with Soft Syntactic Evidence</Title> <Section position="4" start_page="298" end_page="301" type="metho"> <SectionTitle> 3 Discriminative Sentence Compression </SectionTitle>
<Paragraph position="0"> For the rest of the paper we use x = x_1 ... x_n to indicate an uncompressed sentence and y = y_1 ... y_m a compressed version of x, i.e., each y_j indicates the position in x of the jth word in the compression. We always pad the sentence with dummy start and end words, x_1 = -START- and x_n = -END-, which are always included in the compressed version (i.e., y_1 = x_1 and y_m = x_n).</Paragraph>
<Paragraph position="1"> In this section we describe a discriminative online learning approach to sentence compression, the core of which is a decoding algorithm that searches the entire space of compressions. Let the score of a compression y for a sentence x be s(x,y). In particular, we factor this score using a first-order Markov assumption on the words in the compressed sentence</Paragraph>
<Paragraph position="2"> s(x, y) = \sum_{j=2}^{|y|} s(x, I(y_{j-1}), I(y_j)) </Paragraph>
<Paragraph position="3"> Finally, we define the score function to be the dot product between a high dimensional feature representation and a corresponding weight vector</Paragraph>
<Paragraph position="4"> s(x, y) = \sum_{j=2}^{|y|} w \cdot f(x, I(y_{j-1}), I(y_j)) </Paragraph>
<Paragraph position="5"> Note that this factorization allows us to define features over two adjacent words in the compression as well as the words in between that were dropped from the original sentence to create the compression. We will show in Section 3.2 how this factorization also allows us to include features on dropped phrases and subtrees from both a dependency and a phrase-structure parse of the original sentence. Note that these features are meant to capture the same information as both the source and channel models of Knight and Marcu (2000).</Paragraph>
<Paragraph position="6"> However, here they are merely treated as evidence for the discriminative learner, which will set the weight of each feature relative to the other (possibly overlapping) features to optimize the model's accuracy on the observed data.</Paragraph>
<Section position="1" start_page="298" end_page="299" type="sub_section"> <SectionTitle> 3.1 Decoding </SectionTitle>
<Paragraph position="0"> We define a dynamic programming table C[i], which represents the highest score for any compression of sentence x that ends at word x_i. We define the recurrence as follows</Paragraph>
<Paragraph position="1"> C[1] = 0.0, and C[i] = \max_{j < i} ( C[j] + s(x, j, i) ) for i > 1 </Paragraph>
<Paragraph position="2"> It is easy to show that C[n] represents the score of the best compression for sentence x (whose length is n) under the first-order score factorization we made. We can show this by induction. If we assume that C[j] is the highest scoring compression that ends at word x_j, for all j < i, then C[i] must also be the highest scoring compression ending at word x_i, since it represents the best combination over all high scoring shorter compressions plus the score of extending the compression to the current word. Thus, since x_n is by definition in every compressed version of x (see above), it must be the case that C[n] stores the score of the best compression. This table can be filled in O(n^2).</Paragraph>
<Paragraph position="3"> This algorithm is really an extension of Viterbi to the case when scores factor over dynamic substrings of the text (Sarawagi and Cohen, 2004; McDonald et al., 2005a). As such, we can use back-pointers to reconstruct the highest scoring compression, as well as k-best decoding algorithms.</Paragraph>
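To make the decoding step concrete, the following is a minimal Python sketch of the dynamic program and back-pointer recovery described above. The `score(x, j, i)` callable stands in for w · f(x, I(y_{j-1}), I(y_j)); its name, the token-list interface, and the toy scorer in the usage line are assumptions for illustration, not the paper's implementation.

```python
def decode(x, score):
    """Find the highest-scoring compression of sentence x.

    x     : list of tokens with x[0] == '-START-' and x[-1] == '-END-'.
    score : score(x, j, i) -> float, a stand-in for w . f(x, I(y_{j-1}), I(y_j)),
            the score of keeping word i right after word j and dropping
            everything in between (an assumed interface).
    Returns the indices of the retained words, always including 0 and n-1.
    """
    n = len(x)
    C = [float('-inf')] * n   # C[i]: best score of any compression ending at word i
    back = [0] * n            # back-pointer to the previous retained word
    C[0] = 0.0                # every compression starts at -START-
    for i in range(1, n):
        for j in range(i):    # try every possible previous retained word
            cand = C[j] + score(x, j, i)
            if cand > C[i]:
                C[i], back[i] = cand, j
    # C[n-1] holds the best overall score; recover the compression via back-pointers.
    y = [n - 1]
    while y[-1] != 0:
        y.append(back[y[-1]])
    return list(reversed(y))


# Toy usage with an arbitrary stand-in scorer that prefers keeping adjacent words.
sentence = ['-START-', 'Mary', 'saw', 'Ralph', 'on', 'Tuesday', '-END-']
print(decode(sentence, lambda x, j, i: 1.0 if i == j + 1 else -0.5))
```

The nested loops make the O(n^2) cost of filling the table explicit; the fixed-rate variant discussed next simply adds a third dimension r for the number of retained words.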
<Paragraph position="4"> This decoding algorithm is dynamic with respect to compression rate. That is, the algorithm will return the highest scoring compression regardless of length. This may seem problematic, since longer compressions might contribute more to the score (they contain more bigrams) and thus be preferred. However, in Section 3.2 we define a rich feature set, including features on words dropped from the compression, that helps disfavor compressions that drop very few words, since this is rarely seen in the training data. In fact, it turns out that our learned compressions have a compression rate very similar to the gold standard.</Paragraph>
<Paragraph position="5"> That said, there are some instances when a static compression rate is preferred. A user may specifically want a 25% compression rate for all sentences. This is not a problem for our decoding algorithm. We simply augment the dynamic programming table and calculate C[i][r], which is the score of the best compression of length r that ends at word x_i. This table can be filled in as follows</Paragraph>
<Paragraph position="6"> C[1][1] = 0.0, and C[i][r] = \max_{j < i} ( C[j][r-1] + s(x, j, i) ) </Paragraph>
<Paragraph position="7"> Thus, if we require a specific compression rate, we simply determine the number of words r that satisfies this rate and calculate C[n][r]. The new complexity is O(n^2 r).</Paragraph> </Section>
<Section position="2" start_page="299" end_page="301" type="sub_section"> <SectionTitle> 3.2 Features </SectionTitle>
<Paragraph position="0"> So far we have defined the score of a compression as well as a decoding algorithm that searches the entire space of compressions to find the one with the highest score. This all relies on a score factorization over adjacent words in the compression,</Paragraph>
<Paragraph position="1"> s(x, y) = \sum_{j=2}^{|y|} w \cdot f(x, I(y_{j-1}), I(y_j)) </Paragraph>
<Paragraph position="2"> In Section 3.3 we describe an online large-margin method for learning w. Here we present the feature representation f(x, I(y_{j-1}), I(y_j)) for a pair of adjacent words in the compression. These features were tuned on a development data set.</Paragraph>
<Paragraph position="3"> The first set of features is over the adjacent words y_{j-1} and y_j in the compression. These include the part-of-speech (POS) bigram for the pair, the POS of each word individually, and the POS context (bigram and trigram) of the most recent word being added to the compression, y_j. These features are meant to indicate likely words to include in the compression as well as some level of grammaticality, e.g., the adjacent POS feature "JJ&VB" would get a low weight since we rarely see an adjective followed by a verb. We also add a feature indicating whether y_{j-1} and y_j were actually adjacent in the original sentence or not, and we conjoin this feature with the above POS features. Note that we have not included any lexical features. We found during experiments on the development data that lexical information was too sparse and led to overfitting, so we rarely include such features. Instead we rely on the accuracy of POS tags to provide enough evidence.</Paragraph>
<Paragraph position="4"> Next we add features over every dropped word in the original sentence between y_{j-1} and y_j, if there are any. These include the POS of each dropped word and the POS of the dropped word conjoined with the POS of y_{j-1} and y_j. If the dropped word is a verb, we add a feature indicating the actual verb (this is for common verbs like "is", which are typically in compressions). Finally, we add the POS context (bigram and trigram) of each dropped word.</Paragraph>
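To make the preceding two paragraphs concrete, here is a sketch of how the adjacent-word and dropped-word indicators could be generated. The string encoding of features and the helper's signature are assumptions for illustration; in the actual system these indicators would be mapped to dimensions of the weight vector w.

```python
def bigram_and_drop_features(pos, j, i):
    """Sketch of the indicator features for making words j and i adjacent.

    pos : POS tags of the original sentence x (one tag per word).
    j, i: indices of the two words kept adjacent in the compression;
          words j+1 .. i-1 are dropped.
    Returns a set of feature names (an assumed string encoding).
    """
    feats = set()

    # Features over the two words kept adjacent in the compression.
    feats.add(f'kept_bigram={pos[j]}&{pos[i]}')
    feats.add(f'kept_left={pos[j]}')
    feats.add(f'kept_right={pos[i]}')

    # Were the two kept words already adjacent in the original sentence?
    adjacent = (i == j + 1)
    feats.add(f'orig_adjacent={adjacent}')
    feats.add(f'orig_adjacent={adjacent}&bigram={pos[j]}&{pos[i]}')

    # Features over every dropped word in between, if any.
    for k in range(j + 1, i):
        feats.add(f'drop_pos={pos[k]}')
        feats.add(f'drop_pos={pos[k]}&left={pos[j]}&right={pos[i]}')
        # POS context (bigram and trigram) of the dropped word.
        feats.add(f'drop_ctx2={pos[k - 1]}&{pos[k]}')
        feats.add(f'drop_ctx3={pos[k - 1]}&{pos[k]}&{pos[k + 1]}')
        # The real system also adds the identity of common dropped verbs
        # such as "is"; omitted here to keep the sketch POS-only.
    return feats
```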
<Paragraph position="5"> These features represent common characteristics of words that can or should be dropped from the original sentence in the compressed version (e.g., adjectives and adverbs). We also add a feature indicating whether the dropped word is a negation (e.g., not, never, etc.).</Paragraph>
<Paragraph position="6"> We also have a set of features to represent brackets in the text, which are common in the data set. The first measures whether all the dropped words between y_{j-1} and y_j have a mismatched or inconsistent bracketing. The second measures whether the left-most and right-most dropped words are themselves both brackets. These features come in handy for examples like The Associated Press ( AP ) reported the story, where the compressed version is The Associated Press reported the story. Information within brackets is often redundant.</Paragraph>
<Paragraph position="7"> The previous set of features is meant to encode POS contexts that are commonly retained or dropped from the original sentence during compression. However, they do so without a larger picture of the function of each word in the sentence. For instance, dropping verbs is not that uncommon - a relative clause, for instance, may be dropped during compression. However, dropping the main verb of the sentence is uncommon, since that verb and its arguments typically encode most of the information being conveyed.</Paragraph>
<Paragraph position="8"> An obvious solution to this problem is to include features over a deep syntactic analysis of the sentence. To do this we parse every sentence twice, once with a dependency parser (McDonald et al., 2005b) and once with a phrase-structure parser (Charniak, 2000). These parsers have been trained out-of-domain on the Penn WSJ Treebank and as a result contain noise. However, we are merely going to use them as an additional source of features. We call this soft syntactic evidence, since the deep trees are not used as a strict gold standard in our model but just as more evidence for or against particular compressions. The learning algorithm will set the feature weights accordingly, depending on each feature's discriminative power.</Paragraph>
[Figure 2: ... tree from the Charniak (2000) parser. In this example we want to add features from the trees for the case when Ralph and after become adjacent in the compression, i.e., we are dropping the phrase on Tuesday.]
<Paragraph position="9"> It is not unique to use soft syntactic features in this way, as it has been done for many problems in language processing. However, we stress this aspect of our model due to the history of compression systems using syntax to provide hard structural constraints on the output.</Paragraph>
<Paragraph position="10"> Let's consider the sentence x = Mary saw Ralph on Tuesday after lunch, with corresponding parses given in Figure 2. In particular, let's consider the feature representation f(x,3,6), that is, the feature representation of making Ralph and after adjacent in the compression and dropping the prepositional phrase on Tuesday. The first set of features we consider is over dependency trees. For every dropped word we add a feature indicating the POS of the word's parent in the tree. For example, if the dropped word's parent is the root, then it is typically the main verb of the sentence and unlikely to be dropped. We also add a conjunction feature of the POS tag of the word being dropped and the POS of its parent, as well as a feature indicating, for each word being dropped, whether it is a leaf node in the tree. We also add the same features for the two adjacent words, but indicating that they are part of the compression.</Paragraph>
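A minimal sketch of the dependency-based evidence just described, assuming the parse is given as a simple head-index array (an encoding chosen here for illustration; the paper does not specify one):

```python
def dependency_drop_features(pos, heads, j, i):
    """Soft syntactic features from a (noisy) dependency parse for bigram (j, i).

    pos   : POS tags of the original sentence.
    heads : heads[k] is the index of word k's parent in the dependency tree,
            or -1 if word k is the root (an assumed encoding).
    Words j+1 .. i-1 are dropped; words j and i are kept.
    """
    feats = set()
    has_child = {h for h in heads if h >= 0}   # words that head at least one other word

    for k in range(j + 1, i):
        parent_pos = 'ROOT' if heads[k] < 0 else pos[heads[k]]
        feats.add(f'drop_parent_pos={parent_pos}')               # e.g. dropping a child of the main verb
        feats.add(f'drop_pos={pos[k]}&parent_pos={parent_pos}')
        feats.add(f'drop_is_leaf={k not in has_child}')          # leaves are easier to drop

    # The same evidence for the two kept words, marked as part of the compression.
    for k in (j, i):
        parent_pos = 'ROOT' if heads[k] < 0 else pos[heads[k]]
        feats.add(f'keep_pos={pos[k]}&parent_pos={parent_pos}')
        feats.add(f'keep_is_leaf={k not in has_child}')
    return feats
```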
<Paragraph position="11"> For the phrase-structure features we find every node in the tree that subsumes a piece of dropped text and is not itself the child of such a node; in this case, the PP governing on Tuesday. We then add features indicating the context from which this node was dropped (see the sketch at the end of this subsection). For example, we add a feature specifying that a PP was dropped which was the child of a VP. We also add a feature indicating that a PP was dropped which was the left sibling of another PP, etc. Ideally, for each production in the tree we would like to add a feature indicating every node that was dropped, e.g., "VP -> VBD NP PP PP becomes VP -> VBD NP PP". However, we cannot necessarily calculate this feature, since the extent of the production might be well beyond the local context of the first-order feature factorization. Furthermore, since the training set is so small, these features are likely to be observed very few times.</Paragraph>
<Paragraph position="12"> In this section we have described a rich feature set over adjacent words in the compressed sentence, dropped words and phrases from the original sentence, and properties of deep syntactic trees of the original sentence. Note that these features in many ways mimic the information already present in the noisy-channel and decision-tree models of Knight and Marcu (2000). Our bigram features encode properties that indicate both good and bad words to be adjacent in the compressed sentence.</Paragraph>
<Paragraph position="13"> This is similar in purpose to the source model from the noisy-channel system. However, in that system the source model is trained on uncompressed sentences and thus is not as representative of likely bigram features for compressed sentences, which is really what we desire.</Paragraph>
<Paragraph position="14"> Our feature set also encodes dropped words and phrases through the properties of the words themselves and through properties of their syntactic relation to the rest of the sentence in a parse tree. These features represent likely phrases to be dropped in the compression and are thus similar in nature to the channel model in the noisy-channel system, as well as to the features in the tree-to-tree decision tree system. However, we use these syntactic constraints as soft evidence in our model. That is, they represent just another layer of evidence to be considered during training when setting parameters. Thus, if the parses have too much noise, the learning algorithm can lower the weight of the parse features, since they are unlikely to be useful discriminators on the training data. This differs from the models of Knight and Marcu (2000), which treat the noisy parses as a gold standard when calculating probability estimates.</Paragraph>
<Paragraph position="15"> An important distinction we should make is the notion of supported versus unsupported features (Sha and Pereira, 2003). Supported features are those that are on for the gold standard compressions in the training data. For instance, the bigram feature "NN&VB" will be supported, since there is most likely a compression that contains an adjacent noun and verb. However, the feature "JJ&VB" will not be supported, since an adjacent adjective and verb will most likely not be observed in any valid compression. Our model includes all features, including those that are unsupported. The advantage of this is that the model can learn negative weights for features that are indicative of bad compressions. This is not difficult to do, since most features are POS based and the feature set size, even with all these features, is only 78,923.</Paragraph>
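The phrase-structure features above require locating the maximal constituents that cover only dropped words and recording the context they were dropped from. The nested-tuple tree encoding and helper below are assumptions made to keep the sketch self-contained; a real system would read the output of the Charniak (2000) parser instead.

```python
def leaf_indices(tree):
    """Word indices covered by a (sub-)tree in the assumed encoding below."""
    label, children = tree
    if not isinstance(children, list):      # leaf: (POS tag, word index)
        return {children}
    covered = set()
    for child in children:
        covered |= leaf_indices(child)
    return covered


def dropped_constituent_features(tree, dropped, parent_label=None, siblings=()):
    """Features for maximal phrasal nodes that cover only dropped words.

    tree    : (label, children); children is a list of sub-trees, or a word
              index for a pre-terminal leaf -- an assumed tree encoding.
    dropped : set of word indices dropped from the compression.
    Single dropped words are left to the word-level features shown earlier.
    """
    label, children = tree
    if leaf_indices(tree) <= dropped:
        # Maximal dropped node: record what was dropped and from what context.
        feats = {f'drop_node={label}',
                 f'drop_node={label}&parent={parent_label}'}
        for sib in siblings:
            if sib is not tree:
                feats.add(f'drop_node={label}&sibling={sib[0]}')
        return feats
    feats = set()
    for child in children:
        if isinstance(child[1], list):      # recurse into phrasal children only
            feats |= dropped_constituent_features(child, dropped, label, children)
    return feats


# For the Figure 2 example, dropping the indices of "on Tuesday" from a VP that
# also dominates the verb and the PP "after lunch" would fire features such as
# drop_node=PP&parent=VP and drop_node=PP&sibling=PP, mirroring the text above.
```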
</Section>
<Section position="3" start_page="301" end_page="301" type="sub_section"> <SectionTitle> 3.3 Learning </SectionTitle>
<Paragraph position="0"> Having defined a feature encoding and a decoding algorithm, the last step is to learn the feature weights w. We do this using the Margin Infused Relaxed Algorithm (MIRA), a discriminative large-margin online learning technique shown in Figure 3 (Crammer and Singer, 2003); a simplified sketch of the update appears at the end of this subsection. On each iteration, MIRA considers a single instance from the training set, (x_t, y_t), and updates the weights so that the score of the correct compression, y_t, is greater than the score of all other compressions by a margin proportional to their loss. Many weight vectors will satisfy these constraints, so we pick the one with minimum change from the previous setting. We define the loss to be the number of words falsely retained or dropped in the incorrect compression relative to the correct one. For instance, if the correct compression of the sentence in Figure 2 is Mary saw Ralph, then the compression Mary saw after lunch would have a loss of 3, since it incorrectly left out one word and included two others.</Paragraph>
<Paragraph position="1"> Of course, for a sentence there are exponentially many possible compressions, which means that this optimization will have exponentially many constraints. We follow the method of McDonald et al. (2005b) and create constraints only on the k compressions that currently have the highest score, best_k(x; w). This can easily be calculated by extending the decoding algorithm with standard Viterbi k-best techniques. On the development data, we found that k = 10 provided the best performance, though varying k did not have a major impact overall. Furthermore, we found that after only 3-5 training epochs performance on the development data was maximized.</Paragraph>
[Figure 3: The MIRA learning algorithm. Training data: T = {(x_t, y_t)}_{t=1}^{T}; 1. w^(0) = 0; v = 0; i = 0; 2. for n : 1..N; 3. for t : 1..T; 4. min ...]
<Paragraph position="2"> The final weight vector is the average of all weight vectors throughout training. Averaging has been shown to reduce overfitting (Collins, 2002) as well as reliance on the order of the examples during training. We found it to be particularly important for this data set.</Paragraph>
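Below is a simplified sketch of the learning loop. Instead of solving the quadratic program over all k constraints jointly as MIRA does, it applies the closed-form single-constraint (passive-aggressive) update to each of the k-best compressions in turn, an approximation chosen to keep the sketch short; the helper names `decode_kbest`, `features`, and `loss` are assumptions.

```python
from collections import defaultdict


def mira_train(train, decode_kbest, features, loss, epochs=5, k=10):
    """Simplified sketch of the online large-margin learner (not exact MIRA).

    train        : list of (x, y_gold) pairs.
    decode_kbest : decode_kbest(x, w, k) -> the k highest-scoring compressions
                   of x under weights w (assumed helper; see Section 3.1).
    features     : features(x, y) -> dict mapping feature names to counts.
    loss         : loss(y_gold, y) -> number of words falsely kept or dropped.
    Returns the averaged weight vector, as recommended in the text.
    """
    w = defaultdict(float)        # current weights
    total = defaultdict(float)    # running sum of weight vectors for averaging
    seen = 0

    def score(x, y):
        return sum(w[f] * c for f, c in features(x, y).items())

    for _ in range(epochs):
        for x, y_gold in train:
            for y_hat in decode_kbest(x, w, k):
                margin = score(x, y_gold) - score(x, y_hat)
                target = loss(y_gold, y_hat)
                if margin < target:                        # margin constraint violated
                    diff = defaultdict(float, features(x, y_gold))
                    for f, c in features(x, y_hat).items():
                        diff[f] -= c
                    sq_norm = sum(c * c for c in diff.values())
                    if sq_norm > 0.0:
                        tau = (target - margin) / sq_norm  # smallest satisfying step
                        for f, c in diff.items():
                            w[f] += tau * c
            for f, c in w.items():                         # accumulate for averaging
                total[f] += c
            seen += 1

    return {f: c / seen for f, c in total.items()}
```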
</Section> </Section> <Section position="5" start_page="301" end_page="302" type="metho"> <SectionTitle> 4 Experiments </SectionTitle>
<Paragraph position="0"> We use the same experimental methodology as Knight and Marcu (2000). We provide every compression to four judges and ask them to evaluate each one for grammaticality and importance on a scale from 1 to 5. For each of the 32 sentences in our test set we ask the judges to evaluate three systems: human annotated, the decision tree model of Knight and Marcu (2000), and our system. The judges were told that all three compressions were automatically generated, and the order in which they were presented was randomly chosen for each sentence. We compared our system to the decision tree model of Knight and Marcu instead of the noisy-channel model since both performed nearly as well in their evaluation, and the compression rate of the decision tree model is nearer to our system (around 57-58%). The noisy-channel model typically returned longer compressions.</Paragraph>
<Paragraph position="1"> Results are shown in Table 1. We present the average score over all judges as well as the standard deviation. The evaluation for the decision tree system of Knight and Marcu is strikingly similar to the original evaluation in their work. This provides strong evidence that the evaluation criteria in both cases were very similar.</Paragraph>
<Paragraph position="2"> Table 1 shows that all models had similar compression rates, with humans preferring to compress a little more aggressively. Not surprisingly, the human compressions are practically all grammatical. A quick scan of the evaluations shows that the few ungrammatical human compressions were for sentences that were not really grammatical in the first place. Of greater interest is that the compressions of our system are typically more grammatical than those of the decision tree model of Knight and Marcu.</Paragraph>
<Paragraph position="3"> When looking at importance, we see that our system actually does the best - even better than humans. The most likely reason for this is that our model returns longer sentences and is thus less likely to prune away important information. For example, consider the sentence The chemical etching process used for glare protection is effective and will help if your office has the fluorescent-light overkill that's typical in offices. The human compression was Glare protection is effective, whereas our model compressed the sentence to The chemical etching process used for glare protection is effective.</Paragraph>
<Paragraph position="4"> A primary reason that our model does better than the decision tree model of Knight and Marcu is that on a handful of sentences, the decision tree compressions were a single word or noun phrase.</Paragraph>
<Paragraph position="5"> For such sentences the evaluators typically rated the compression a 1 for both grammaticality and importance. In contrast, our model never failed in such drastic ways and always output something reasonable. This is quantified in the standard deviation of the two systems.</Paragraph>
<Paragraph position="6"> Though these results are promising, more large-scale experiments are required to really ascertain the significance of the performance increase. Ideally we could sample multiple training/testing splits and use all sentences in the data set to evaluate the systems. However, since these systems require human evaluation, we did not have the time or the resources to conduct these experiments.</Paragraph>
<Section position="1" start_page="302" end_page="302" type="sub_section"> <SectionTitle> 4.1 Some Examples </SectionTitle>
<Paragraph position="0"> Here we aim to give the reader a flavor of some common outputs from the different models. Three examples are given in Table 4.1. The first shows two properties. First of all, the decision tree model completely breaks and just returns a single noun phrase. Our system performs well, however it leaves out the complementizer of the relative clause. This actually occurred in a few examples and appears to be the most common problem of our model. A post-processing rule should eliminate this.</Paragraph>
<Paragraph position="1"> The second example displays a case in which both our compression and the human compression are grammatical, but the removal of a prepositional phrase hurts the resulting meaning of the sentence. In fact, without the knowledge that the sentence is referring to broadband, the compressions are meaningless.
This appears to be a harder problem, namely determining which prepositional phrases can be dropped and which cannot.</Paragraph>
<Paragraph position="2"> The final, and more interesting, example presents two very different compressions by the human and our automatic system. Here, the human kept the relative clause relating what languages the source code is available in, but dropped the main verb phrase of the sentence. Our model preferred to retain the main verb phrase and drop the relative clause. This is most likely due to the fact that dropping the main verb phrase of a sentence is much less likely in the training data than dropping a relative clause. Two out of four evaluators preferred the compression returned by our system, and the other two rated them equal.</Paragraph>
</Section> </Section> </Paper>