File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/p06-2019_intro.xml
Size: 5,821 bytes
Last Modified: 2025-10-06 14:03:40
<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2019"> <Title>Constraint-based Sentence Compression: An Integer Programming Approach</Title> <Section position="4" start_page="144" end_page="145" type="intro"> <SectionTitle> 2 Previous Work </SectionTitle> <Paragraph position="0"> Jing (2000) was perhaps the first to tackle the sentence compression problem. Her approach uses multiple knowledge sources to determine which phrases in a sentence to remove. Central to her system is a grammar checking module that specifies which sentential constituents are grammatically obligatory and should therefore be present in the compression. This is achieved using simple rules and a large-scale lexicon. Other knowledge sources include WordNet and corpus evidence gathered from a parallel corpus of original-compressed sentence pairs. A phrase is removed only if it is not grammatically obligatory, not the focus of the local context, and has a reasonable deletion probability (estimated from the parallel corpus).</Paragraph> <Paragraph position="1"> In contrast to Jing (2000), the bulk of the research on sentence compression relies exclusively on corpus data for modelling the compression process without recourse to extensive knowledge sources (e.g., WordNet). Approaches based on the noisy-channel model (Knight and Marcu 2002; Turner and Charniak 2005) consist of a source model P(s) (whose role is to guarantee that the generated compression is grammatical), a channel model P(l|s) (capturing the probability that the long sentence l is an expansion of the compressed sentence s), and a decoder (which searches for the compression s that maximises P(s)P(l|s)).</Paragraph> <Paragraph position="2"> The channel model is typically estimated using a parallel corpus, although Turner and Charniak (2005) also present semi-supervised and unsupervised variants of the channel model that estimate P(l|s) without parallel data.</Paragraph> <Paragraph position="3"> Discriminative formulations of the compression task include decision-tree learning (Knight and Marcu 2002), maximum entropy (Riezler et al. 2003), support vector machines (Nguyen et al. 2004), and large-margin learning (McDonald 2006). We describe the decision-tree model in more detail since we will use it as a basis for comparison when evaluating our own models (see Section 4). According to this model, compression is performed through a tree rewriting process inspired by the shift-reduce parsing paradigm. A sequence of shift-reduce-drop actions is performed on a long parse tree, l, to create a smaller tree, s. The compression process begins with an input list generated from the leaves of the original sentence's parse tree and an empty stack. 'Shift' operations move leaves from the input list to the stack, while 'drop' operations delete leaves from the input list.</Paragraph> <Paragraph position="4"> 'Reduce' operations are used to build trees from the leaves on the stack. A decision-tree is trained on a set of automatically generated learning cases from a parallel corpus. Each learning case has a target action associated with it and is decomposed into a set of indicative features. The decision-tree learns which action to perform given this set of features.</Paragraph> <Paragraph position="5"> The final model is applied in a deterministic fashion: the features for the current state are extracted and the decision-tree is queried. This is repeated until the input list is empty, and the final compression is recovered by traversing the leaves of the resulting tree on the stack.</Paragraph>
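As a rough illustration of the rewriting loop just described, the Python sketch below walks the input list and stack deterministically. The helpers predict_action and extract_features stand in for the trained decision-tree and its feature extractor; they are expository assumptions, not Knight and Marcu's (2002) implementation.

    def compress(leaves, predict_action, extract_features):
        """Rewrite the leaves of a long parse tree into a smaller tree on the stack."""
        input_list = list(leaves)  # leaves of the original sentence's parse tree
        stack = []                 # partial trees of the compression being built
        while input_list:          # query the model until the input list is empty
            action = predict_action(extract_features(input_list, stack))
            if action == "SHIFT":                # move the next leaf onto the stack
                stack.append(input_list.pop(0))
            elif action == "DROP":               # delete the next leaf from the input list
                input_list.pop(0)
            elif action.startswith("REDUCE"):    # e.g. "REDUCE-2-NP": build a subtree
                _, arity, label = action.split("-")
                children = [stack.pop() for _ in range(int(arity))][::-1]
                stack.append((label, children))
        return stack  # the compression is read off the leaves of the tree(s) left here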
<Paragraph position="6"> While most compression models operate over constituents, Hori and Furui (2004) propose a model which generates compressions through word deletion. The model does not utilise parallel data or syntactic information in any form. Given a prespecified compression rate, it searches for the compression with the highest score according to a function measuring the importance of each word and the linguistic likelihood of the resulting compressions (language model probability). The score is maximised through a dynamic programming algorithm. Although sentence compression has not been explicitly formulated as an optimisation problem, previous approaches have treated it in these terms.</Paragraph> <Paragraph position="7"> The decoding process in the noisy-channel model searches for the best compression given the source and channel models. However, the compression found is usually sub-optimal, as heuristics are used to reduce the search space, or only locally optimal due to the search method employed. The decoding process used in Turner and Charniak's (2005) model first searches for the best combination of rules to apply. As they traverse their list of compression rules, they remove sentences outside the 100 best compressions (according to their channel model). This list is eventually truncated to 25 compressions.</Paragraph> <Paragraph position="8"> In other models (Hori and Furui 2004; McDonald 2006) the compression score is maximised using dynamic programming. The latter guarantees that we will find the global optimum provided the principle of optimality holds. This principle states that, given the current state, the optimal decision for each of the remaining stages does not depend on previously reached stages or previously made decisions (Winston and Venkataramanan 2003).</Paragraph> <Paragraph position="9"> However, we know this to be false in the case of sentence compression. For example, if we have included modifiers to the left of a head noun in the compression, then it makes sense that we must include the head as well. With a dynamic programming approach we cannot easily guarantee that such constraints hold.</Paragraph> </Section> </Paper>
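To make the dynamic programming formulation concrete, the sketch below decodes a compression of exactly m words by maximising a sum of word-importance and bigram language-model scores, broadly in the spirit of Hori and Furui (2004). The functions importance and bigram_logprob and the weight lam are illustrative placeholders rather than their actual model. Note that the DP state records only the previously kept word; this locality is exactly why non-local constraints of the kind discussed above (e.g., keep the head noun whenever its modifiers are kept) cannot easily be enforced.

    def dp_compress(words, m, importance, bigram_logprob, lam=1.0):
        """Keep exactly m of the input words, maximising importance + lam * LM score.
        Assumes 1 <= m <= len(words); importance(w) and bigram_logprob(prev, w) are
        user-supplied scoring functions (illustrative, not Hori and Furui's model)."""
        n = len(words)
        NEG = float("-inf")
        # best[k][j]: best score of a compression keeping k words and ending at word j
        best = [[NEG] * n for _ in range(m + 1)]
        back = [[None] * n for _ in range(m + 1)]
        for j in range(n):
            best[1][j] = importance(words[j]) + lam * bigram_logprob("<s>", words[j])
        for k in range(2, m + 1):
            for j in range(n):
                for i in range(j):  # i indexes the previously kept word
                    if best[k - 1][i] == NEG:
                        continue
                    score = (best[k - 1][i] + importance(words[j])
                             + lam * bigram_logprob(words[i], words[j]))
                    if score > best[k][j]:
                        best[k][j], back[k][j] = score, i
        # recover the best length-m compression by following back-pointers
        j = max(range(n), key=lambda t: best[m][t])
        kept, k = [], m
        while j is not None:
            kept.append(words[j])
            j, k = back[k][j], k - 1
        return kept[::-1]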