<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1107"> <Title>Probabilistic Sentence Reduction Using Support Vector Machines</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Support Vector Machine </SectionTitle> <Paragraph position="0"> Support vector machine (SVM)(Vapnik 95) is a technique of machine learning based on statistical learning theory. Suppose that we are given l training examples (xi;yi), (1 * i * l), where xi is a feature vector in n dimensional feature space, yi is the class label f-1, +1 g of xi. SVM flnds a hyperplane w:x + b = 0 which correctly separates the training examples and has a maximum margin which is the distance between two hyperplanes w:x+b , 1 and w:x+b *!1: The optimal hyperplane with maximum margin can be obtained by solving the following quadratic programming.</Paragraph> <Paragraph position="2"> where C0 is the constant and >>i is a slack variable for the non-separable case. In SVM, the optimal hyperplane is formulated as follows:</Paragraph> <Paragraph position="4"> (2) where fii is the Lagrange multiple, and K(x0;x00) is a kernel function, the SVM calculates similarity between two arguments x0 and x00. For instance, the Polynomial kernel function is formulated as follow:</Paragraph> <Paragraph position="6"> SVMs estimate the label of an unknown example x whether the sign of f(x) is positive or not.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Deterministic Sentence Reduction </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Using SVMs 3.1 Problem Description </SectionTitle> <Paragraph position="0"> In the corpus-based decision tree approach, a given input sentence is parsed into a syntax tree and the syntax tree is then transformed into a small tree to obtain a reduced sentence.</Paragraph> <Paragraph position="1"> Let t and s be syntax trees of the original sentence and a reduced sentence, respectively. The process of transforming syntax tree t to small tree s is called \rewriting process&quot; (Knight and Marcu 02), (Nguyen and Horiguchi 03). To transform the syntax tree t to the syntax tree s; some terms and flve rewriting actions are deflned. null An Input list consists of a sequence of words subsumed by the tree t where each word in the Input list is labelled with the name of all syntactic constituents in t. Let CSTACK be a stack that consists of sub trees in order to rewrite a small tree. Let RSTACK be a stack that consists of sub trees which are removed from the Input list in the rewriting process.</Paragraph> <Paragraph position="2"> + SHIFT action transfers the flrst word from the Input list into CSTACK. It is written mathematically and given the label SHIFT.</Paragraph> <Paragraph position="3"> + REDUCE(lk;X) action pops the lk syntactic trees located at the top of CSTACK and combines them in a new tree, where lk is an integer and X is a grammar symbol.</Paragraph> <Paragraph position="4"> + DROP X action moves subsequences of words that correspond to syntactic constituents from the Input list to RSTACK.</Paragraph> <Paragraph position="5"> + ASSIGN TYPE X action changes the label of trees at the top of the CSTACK. 
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Deterministic Sentence Reduction Using SVMs </SectionTitle> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Problem Description </SectionTitle> <Paragraph position="0"> In the corpus-based decision tree approach, a given input sentence is parsed into a syntax tree, and the syntax tree is then transformed into a smaller tree to obtain a reduced sentence.</Paragraph> <Paragraph position="1"> Let $t$ and $s$ be the syntax trees of the original sentence and a reduced sentence, respectively. The process of transforming the syntax tree $t$ into the small tree $s$ is called the "rewriting process" (Knight and Marcu 02; Nguyen and Horiguchi 03). To transform the syntax tree $t$ into the syntax tree $s$, some terms and five rewriting actions are defined. An Input list consists of the sequence of words subsumed by the tree $t$, where each word in the Input list is labelled with the names of all syntactic constituents of $t$ that contain it. Let CSTACK be a stack of subtrees from which the small tree is built, and let RSTACK be a stack of subtrees that are removed from the Input list during the rewriting process.</Paragraph> <Paragraph position="2"> + SHIFT action transfers the first word from the Input list onto CSTACK.</Paragraph> <Paragraph position="3"> + REDUCE($l_k$, X) action pops the $l_k$ syntactic trees located at the top of CSTACK and combines them into a new tree, where $l_k$ is an integer and X is a grammar symbol.</Paragraph> <Paragraph position="4"> + DROP X action moves subsequences of words that correspond to syntactic constituents from the Input list to RSTACK.</Paragraph> <Paragraph position="5"> + ASSIGN TYPE X action changes the label of the tree at the top of CSTACK. The new POS tags may differ from the POS tags in the original sentence.</Paragraph> <Paragraph position="6"> + RESTORE X action takes the X element from RSTACK and moves it back into the Input list, where X is a subtree.</Paragraph> <Paragraph position="7"> For convenience, we call the status of the Input list, CSTACK and RSTACK a configuration, and we call the important information in a configuration the current context. This important information is defined as a vector of features using heuristic methods, as in (Knight and Marcu 02; Nguyen and Horiguchi 03). A schematic sketch of these data structures and actions is given after the example in Section 3.2.</Paragraph> <Paragraph position="8"> The main idea behind deterministic sentence reduction is to use a rule matching the current context of the initial configuration to select a distinct action for rewriting an input sentence into a reduced sentence. The current context then changes to a new context, and the rewriting process is repeated, selecting the action that corresponds to each new context.</Paragraph> <Paragraph position="9"> The rewriting process finishes when it meets a termination condition. Here, a rule corresponds to a function that maps a current context to a rewriting action. These rules are learned automatically from a corpus of long sentences and their reduced counterparts (Knight and Marcu 02; Nguyen and Horiguchi 03).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Example </SectionTitle> <Paragraph position="0"> Figure 1 shows an example of applying a sequence of actions to rewrite the input sentence (a, b, c, d, e), where each character is a word. It illustrates the structure of the Input list, the two stacks, and the intermediate steps of a rewriting process based on the actions mentioned above. For example, in the first row, DROP H deletes the subtree rooted at node H from the Input list and stores it in RSTACK. The reduced tree $s$ is obtained after applying the following sequence of actions: DROP H; SHIFT; ASSIGN TYPE K; DROP B; SHIFT; ASSIGN TYPE H; REDUCE 2 F; RESTORE H; SHIFT; ASSIGN TYPE D; REDUCE 2 G. In this example, the reduced sentence is the word sequence yielded by the reduced tree $s$.</Paragraph> </Section>
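The configuration and the five rewriting actions above can be pictured with the following schematic sketch. The (label, children) tree encoding and the method names are illustrative assumptions, not the paper's implementation.

```python
class Configuration:
    """Schematic rewriting state: the Input list plus CSTACK and RSTACK.
    Subtrees are (label, children) pairs; the encoding is illustrative only."""
    def __init__(self, input_list):
        self.input_list = list(input_list)  # subtrees still to be consumed
        self.cstack = []                    # partially built reduced trees
        self.rstack = []                    # subtrees removed from the input

    def shift(self):
        # SHIFT: move the first element of the Input list onto CSTACK
        self.cstack.append(self.input_list.pop(0))

    def reduce(self, lk, X):
        # REDUCE(lk, X): pop lk subtrees and combine them under a new root X
        kids, self.cstack = self.cstack[-lk:], self.cstack[:-lk]
        self.cstack.append((X, kids))

    def drop(self, i=0):
        # DROP: move a constituent from the Input list to RSTACK
        self.rstack.append(self.input_list.pop(i))

    def assign_type(self, X):
        # ASSIGN TYPE X: relabel the subtree on top of CSTACK
        _, kids = self.cstack.pop()
        self.cstack.append((X, kids))

    def restore(self, i=-1):
        # RESTORE: move a subtree from RSTACK back into the Input list
        self.input_list.insert(0, self.rstack.pop(i))
```

With this sketch, the derivation of Figure 1 would be replayed as a sequence of method calls such as cfg.drop(); cfg.shift(); cfg.assign_type("K"); and so on, ending with a single reduced tree on CSTACK.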
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Learning Reduction Rules Using SVMs </SectionTitle> <Paragraph position="0"> As mentioned above, the action for each configuration can be decided by a learned rule that maps a context to an action. To obtain such rules, each configuration is represented by a high-dimensional feature vector. We then train several support vector machines on these examples to deal with the multi-class classification problem in sentence reduction.</Paragraph> <Paragraph position="1"> One important task in applying SVMs to text summarization is to define features. Here, we describe the features used in our sentence reduction models. The features are extracted from the current context. As can be seen in Figure 2, a context includes the status of the Input list and the status of CSTACK and RSTACK. We define a set of features for a current context as described below; a toy sketch of such a feature map follows this subsection.</Paragraph> <Paragraph position="2"> Operation features: The set of features described in (Nguyen and Horiguchi 03) is used in our sentence reduction models.</Paragraph> <Paragraph position="3"> Original tree features: These features denote the syntactic constituents that start with the first unit in the Input list. For example, in Figure 2 the syntactic constituents are the labels of the current element in the Input list, from "VP" down to the verb "convince".</Paragraph> <Paragraph position="4"> Semantic features: The following features are used in our model as semantic information.</Paragraph> <Paragraph position="5"> + Semantic information about the current words within the Input list; these semantic types are obtained from named entities such as Location, Person, Organization and Time within the input sentence. To identify these named entities, we use the method described in (Borthwick 99).</Paragraph> <Paragraph position="6"> + Semantic information about whether or not a word in the Input list is a head word.</Paragraph> <Paragraph position="7"> + Word relations, such as whether or not a word has a relationship with other words in the subcategorization table. These relations and the subcategorization table are obtained from the Comlex database (Macleod 95).</Paragraph> <Paragraph position="8"> Using this semantic information, we can avoid deleting important segments of the given input sentence. For instance, the main verb, the subject and the object are essential; for a noun phrase, the head noun is essential, but an adjective modifier of the head noun is not. As an example, consider the entry for the verb "convince" in the Comlex database. This entry indicates that the verb "convince" can be followed by a noun phrase and a prepositional phrase starting with the preposition "of". It can also be followed by a noun phrase and a to-infinitive phrase. This information shows that we cannot delete an "of" prepositional phrase or a to-infinitive that is part of the verb phrase.</Paragraph> </Section>
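As a toy illustration of turning a current context into a feature vector for the SVMs, consider the sketch below. It reuses the Configuration sketch above, and every feature name is invented for illustration; the actual feature set follows (Knight and Marcu 02; Nguyen and Horiguchi 03) and is far richer.

```python
def context_features(cfg, word, entity_type=None, is_head=False,
                     subcat_match=False):
    """Toy feature map for one configuration; all feature names are invented."""
    feats = {}
    if cfg.input_list:
        # original-tree feature: constituent label starting the Input list
        feats["input_first_label=" + cfg.input_list[0][0]] = 1.0
    if cfg.cstack:
        # operation feature: label of the subtree on top of CSTACK
        feats["cstack_top_label=" + cfg.cstack[-1][0]] = 1.0
    feats["word=" + word] = 1.0
    if entity_type:
        # semantic feature: named-entity type, tagged externally (Borthwick 99)
        feats["entity=" + entity_type] = 1.0
    feats["is_head_word"] = 1.0 if is_head else 0.0   # semantic feature
    feats["subcat_match"] = 1.0 if subcat_match else 0.0  # Comlex relation flag
    return feats
```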
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3.2 Two-stage SVM Learning Using Pairwise Coupling </SectionTitle> <Paragraph position="0"> Using these features, we can extract training data for the SVMs. A sample in our training data consists of a pair of a feature vector and an action. The algorithm that extracts training data from the training corpus is a modified version of the algorithm described in our previous work (Nguyen and Horiguchi 03).</Paragraph> <Paragraph position="1"> Since the original support vector machine is a binary classifier while the sentence reduction problem is formulated as multi-class classification, we have to find a way to adapt support vector machines to this problem. For multi-class SVMs, one can use strategies such as one-vs-all, pairwise comparison, or a DAG graph (Hsu 02). In this paper, we use the pairwise strategy, which constructs a rule for discriminating between each pair of classes and then selects the class with the most wins among the two-class decisions.</Paragraph> <Paragraph position="2"> To reduce the training time and to boost sentence reduction performance, we propose the two-stage SVM described below. Suppose that the examples in the training data are divided into five groups $m_1, m_2, \ldots, m_5$ according to their actions. Let Svmc be a multi-class SVM over the groups, and let Svmc-$i$ be a multi-class SVM for group $m_i$. We use the Svmc classifier to identify the group to which a given context $e$ should belong. Assume that $e$ belongs to the group $m_i$; the classifier Svmc-$i$ is then used to recognize a specific action for the context $e$. The five classifiers Svmc-1, Svmc-2, ..., Svmc-5 are trained on those examples whose actions belong to SHIFT, REDUCE, DROP, ASSIGN TYPE and RESTORE, respectively. Table 1 shows the distribution of examples among the five data groups.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Disadvantages of Deterministic Sentence Reduction </SectionTitle> <Paragraph position="0"> The idea of the deterministic algorithm is to use the rule for each current context to select the next action, and so on; the process terminates when a stop condition is met. If the early steps of this algorithm fail to select the best actions, the possibility of obtaining a wrong reduced output becomes high.</Paragraph> <Paragraph position="1"> One way to solve this problem is to select multiple actions corresponding to the context at each step of the rewriting process. However, the question that emerges here is what criteria to use in selecting multiple actions for a context. If this problem can be solved, then multiple best reduced outputs can be obtained for each input sentence, and the best one can be selected using the whole text document.</Paragraph> <Paragraph position="2"> In the next section we propose a model that selects multiple actions for a context in sentence reduction, which we call probabilistic sentence reduction, and present a variant of it.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Probabilistic Sentence Reduction Using SVM </SectionTitle> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 The Probabilistic SVM Models </SectionTitle> <Paragraph position="0"> Let $A$ be a set of $k$ actions, $A = \{a_1, a_2, \ldots, a_i, \ldots, a_k\}$, and let $C$ be a set of $n$ contexts, $C = \{c_1, c_2, \ldots, c_i, \ldots, c_n\}$. A probabilistic model $\alpha$ for sentence reduction selects an action $a \in A$ for the context $c$ with probability $p_\alpha(a|c)$. The probability $p_\alpha(a|c)$ can be used to score an action $a$ among the possible actions $A$, depending on the context $c$ available at the time of the decision. There are several methods for estimating such scores; we call these "probabilistic sentence reduction methods". The conditional probability $p_\alpha(a|c)$ is estimated using a variant of probabilistic support vector machines, which is described in the following sections.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Pairwise Coupling </SectionTitle> <Paragraph position="0"> For convenience, we denote $u_{ij} = p(a = a_i \mid a = a_i \vee a = a_j, c)$. Given a context $c$ and an action $a$, we assume that estimated pairwise class probabilities $r_{ij}$ of $u_{ij}$ are available. Here $r_{ij}$ can be estimated by binary classifiers; for instance, we could estimate $r_{ij}$ using SVM binary posterior probabilities, as described in (Platt 2000). The goal is then to estimate $\{p_i\}_{i=1}^{k}$, where $p_i = p(a = a_i \mid c)$, $i = 1, 2, \ldots, k$. For this purpose, a simple estimate of these probabilities can be derived using the following voting method: $$p_i = \frac{2}{k(k-1)} \sum_{j \ne i} I[r_{ij} > r_{ji}], \qquad (5)$$ where $I$ is an indicator function and $k(k-1)$ counts the ordered pairs of classes.</Paragraph>
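The following is a direct transcription of the voting estimate in equation (5), assuming the pairwise estimates $r_{ij}$ are supplied as a matrix (e.g., from Platt-scaled binary SVMs); it is a sketch, not the paper's code.

```python
import numpy as np

def vote_probabilities(r):
    """Voting method of equation (5): p_i = (2 / (k(k-1))) * sum_{j != i} I[r_ij > r_ji],
    where r[i, j] approximates p(a = a_i | a = a_i or a = a_j, c)."""
    k = r.shape[0]
    wins = np.array([sum(1.0 for j in range(k) if j != i and r[i, j] > r[j, i])
                     for i in range(k)])
    return 2.0 * wins / (k * (k - 1))
```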
<Paragraph position="1"> However, this model is too simple; we can obtain a better one with the following method. Assume that the $u_{ij}$ are the pairwise probabilities of the model, subject to the condition $u_{ij} = p_i/(p_i + p_j)$. In (Hastie 98), the authors propose minimizing the Kullback-Leibler (KL) distance between the $r_{ij}$ and the $u_{ij}$: $$l(p) = \sum_{i < j} n_{ij} \left( r_{ij} \log \frac{r_{ij}}{u_{ij}} + (1 - r_{ij}) \log \frac{1 - r_{ij}}{1 - u_{ij}} \right), \qquad (6)$$ where $r_{ij}$ and $u_{ij}$ are the probabilities of the pair $(a_i, a_j)$ in the estimated model and in our model, respectively, and $n_{ij}$ is the number of training examples whose classes are $a_i$ or $a_j$. To find the minimizer of equation (6), they solve the associated stationary conditions $$\sum_{j \ne i} n_{ij} u_{ij} = \sum_{j \ne i} n_{ij} r_{ij} \quad \text{subject to} \quad \sum_{i=1}^{k} p_i = 1, \qquad (7)$$ where $i = 1, 2, \ldots, k$ and $p_i > 0$. Such a point can be obtained with the iterative algorithm described in (Hastie 98). We applied it to obtain a probabilistic SVM model for sentence reduction using a simple method, as follows. Assume that our class labels belong to $l$ groups $M = \{m_1, m_2, \ldots, m_i, \ldots, m_l\}$, where $l$ is the number of groups and $m_i$ is a group, e.g., SHIFT, REDUCE, ..., ASSIGN TYPE. Then the probability $p(a|c)$ of an action $a$ for a given context $c$ can be estimated as $$p(a|c) = p(m_i \mid c) \times p(a \mid c, m_i), \qquad (8)$$ where $m_i$ is a group and $a \in m_i$. Here, $p(m_i \mid c)$ and $p(a \mid c, m_i)$ are estimated by the method in (Hastie 98).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Probabilistic Sentence Reduction Algorithm </SectionTitle> <Paragraph position="0"> After obtaining a probabilistic model $p$, we use this model to define a score function by which the search procedure ranks derivations of incomplete and complete reduced sentences. Let $d(s) = \{a_1, a_2, \ldots, a_d\}$ be the derivation of a small tree $s$, where each action $a_i$ belongs to the set of possible actions. The score of $s$ is the product of the conditional probabilities of the individual actions in its derivation: $$\mathrm{score}(s) = \prod_{a_i \in d(s)} p(a_i \mid c_i), \qquad (9)$$ where $c_i$ is the context in which $a_i$ was decided. The search heuristic tries to find the best reduced tree $s^{*}$ as follows: $$s^{*} = \arg\max_{s \in \mathrm{tree}(t)} \mathrm{score}(s), \qquad (10)$$ where $\mathrm{tree}(t)$ is the set of all complete reduced trees derived from the tree $t$ of the given long sentence.</Paragraph> <Paragraph position="1"> Assume that for each configuration the actions $\{a_1, a_2, \ldots, a_n\}$ are sorted in decreasing order of $p(a_i \mid c_i)$, where $c_i$ is the context of that configuration. Algorithm 1 shows probabilistic sentence reduction using a top-$K$ BFS search. This algorithm uses a breadth-first search that does not expand the entire frontier, but instead expands at most the top $K$ scoring incomplete configurations in the frontier; it terminates when it has found $M$ complete reduced sentences (CL is a list of reduced trees), or when all hypotheses have been exhausted. A configuration is complete if and only if the Input list is empty and there is exactly one tree in CSTACK. Note that the function get-context($h_i$, $j$) obtains the current context of the $j$-th configuration in $h_i$, where $h_i$ is a heap at step $i$. The function Insert($s$, $h$) keeps the heap $h$ sorted according to the score of each element in $h$. Essentially, in the implementation we can use a dictionary of contexts and actions observed in the training data to reduce the number of actions explored for a current context.</Paragraph> </Section> </Section>
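A compact sketch of the top-$K$ breadth-first search of Algorithm 1 follows, under the assumption that helpers for enumerating scored actions, applying an action, and testing completeness are supplied by the caller; none of the names below come from the paper.

```python
import heapq

def probabilistic_reduce(initial_config, action_probs, apply_action,
                         is_complete, K=10, M=5):
    """Sketch of the top-K BFS of Algorithm 1. action_probs(cfg) yields
    (action, p(a|c)) pairs sorted by probability; apply_action(cfg, a)
    returns the successor configuration; is_complete(cfg) tests whether
    the Input list is empty with one tree on CSTACK. All are assumed helpers."""
    frontier = [(1.0, initial_config)]
    completed = []  # the CL list of complete reduced trees with their scores
    while frontier and len(completed) < M:
        # expand at most the top-K scoring incomplete configurations
        frontier = heapq.nlargest(K, frontier, key=lambda sc: sc[0])
        next_frontier = []
        for score, cfg in frontier:
            for action, prob in action_probs(cfg):
                succ = apply_action(cfg, action)
                s = score * prob  # equation (9): product of p(a_i | c_i)
                (completed if is_complete(succ) else next_frontier).append((s, succ))
        frontier = next_frontier
    completed.sort(key=lambda sc: sc[0], reverse=True)
    return completed[:M]
```
</Paper>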