<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0621">
  <Title>A Learning Approach to Shallow Parsing*</Title>
  <Section position="4" start_page="168" end_page="169" type="metho">
    <SectionTitle>
2 SNoW
</SectionTitle>
    <Paragraph position="0"> The SNoW (Sparse Network of Winnows1) learning architecture is a sparse network of linear units over a common pre-defined or incrementally learned feature space. Nodes in the input layer of the network represent simple relations over the input sentence and are used as the input features. Each linear unit is called a target node and represents relations which are of interest over the input sentence. 1To winnow: to separate chaff from grain.</Paragraph>
    <Paragraph position="1"> In the current application, target nodes may represent a potential prediction with respect to a word in the input sentence, e.g., inside a phrase, outside a phrase, at the beginning of a phrase, etc. An input sentence, along with a designated word of interest in it, is mapped into a set of features which are active in it; this representation is presented to the input layer of SNoW and propagates to the target nodes. Target nodes are linked via weighted edges to (some of the) input features. Let A_t = {i_1, ..., i_m} be the set of features that are active in an example and are linked to the target node t. Then the linear unit is active iff sum_{i in A_t} w_i^t &gt; theta_t, where w_i^t is the weight on the edge connecting the ith feature to the target node t, and theta_t is the threshold for the target node t.</Paragraph>
    <Paragraph position="2"> Each SNoW unit may include a collection of subnetworks, one for each of the target relations. A given example is treated autonomously by each target subnetwork; an example labeled t may be treated as a positive example by the subnetwork for t and as a negative example by the rest of the target nodes.</Paragraph>
    <Paragraph position="3"> The learning policy is on-line and mistake-driven; several update rules can be used within SNoW. The most successful update rule, and the only one used in this work, is a variant of Littlestone's (1988) Winnow update rule, a multiplicative update rule tailored to the situation in which the set of input features is not known a priori, as in the infinite attribute model (Blum, 1992). This mechanism is implemented via the sparse architecture of SNoW. That is, (1) input features are allocated in a data-driven way - an input node for the feature i is allocated only if the feature i was active in any input sentence - and (2) a link (i.e., a non-zero weight) exists between a target node t and a feature i if and only if i was active in an example labeled t.</Paragraph>
    <Paragraph position="4"> The Winnow update rule has, in addition to the threshold theta_t at the target t, two update parameters: a promotion parameter alpha &gt; 1 and a demotion parameter 0 &lt; beta &lt; 1. These are used to update the current representation of the target t (the set of weights w_i^t) only when a mistake in prediction is made. Let A_t = {i_1, ..., i_m} be the set of active features that are linked to the target node t. If the algorithm predicts 0 (that is, sum_{i in A_t} w_i^t &lt; theta_t) and the received label is 1, the active weights in the current example are promoted in a multiplicative fashion: for all i in A_t, w_i^t &lt;- alpha * w_i^t. If the algorithm predicts 1 (sum_{i in A_t} w_i^t &gt; theta_t) and the received label is 0, the active weights in the current example are demoted: for all i in A_t, w_i^t &lt;- beta * w_i^t. All other weights are unchanged.</Paragraph>
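The mistake-driven multiplicative update can be sketched in a few lines. This is an illustrative re-implementation under our own naming (the class `WinnowTarget` and its interface are not from the paper); the defaults mirror the experimental settings reported in Section 4.2 (initial weight 1, threshold 5, alpha = 1.5, beta = 0.7).

```python
class WinnowTarget:
    """Sketch of one SNoW target node with a Winnow update rule.

    Illustrative only; defaults mirror the paper's experimental settings
    (initial weight 1, threshold 5, promotion 1.5, demotion 0.7).
    """

    def __init__(self, alpha=1.5, beta=0.7, theta=5.0, w0=1.0):
        self.alpha, self.beta, self.theta, self.w0 = alpha, beta, theta, w0
        self.w = {}  # sparse: links exist only for features seen in a positive example

    def activation(self, features):
        # sum of weights of the active features linked to this target
        return sum(self.w[f] for f in features if f in self.w)

    def predict(self, features):
        return self.activation(features) > self.theta

    def update(self, features, label):
        """Mistake-driven multiplicative update of the active, linked weights."""
        if label:
            for f in features:  # data-driven allocation of input nodes
                self.w.setdefault(f, self.w0)
        if self.predict(features) == bool(label):
            return  # no mistake: leave all weights unchanged
        factor = self.alpha if label else self.beta
        for f in features:
            if f in self.w:
                self.w[f] *= factor
```

Promotions multiply the active weights by alpha until the activation clears the threshold; demotions shrink them by beta; all other weights stay untouched, which is what keeps the updates cheap in the sparse architecture.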
    <Paragraph position="5"> The key feature of the Winnow update rule is that the number of examples required to learn a linear function grows linearly with the number of relevant features and only logarithmically with the total number of features. This property seems crucial in domains in which the number of potential features is vast, but a relatively small number of them is relevant. Winnow is known to learn efficiently any linear threshold function and to be robust in the presence of various kinds of noise and in cases where no linear-threshold function can make perfect classifications, while still maintaining its above-mentioned dependence on the number of total and relevant attributes (Littlestone, 1991; Kivinen and Warmuth, 1995).</Paragraph>
    <Paragraph position="6"> Once target subnetworks have been learned and the network is being evaluated, a decision support mechanism is employed, which selects the dominant active target node in the SNoW unit via a winner-take-all mechanism to produce a final prediction. The decision support mechanism may also be cached and processed along with the output of other SNoW units to produce a coherent output.</Paragraph>
  </Section>
  <Section position="5" start_page="169" end_page="173" type="metho">
    <SectionTitle>
3 Modeling Shallow Parsing
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="169" end_page="169" type="sub_section">
      <SectionTitle>
3.1 Task Definition
</SectionTitle>
      <Paragraph position="0"> This section describes how we model the shallow parsing tasks studied here as learning problems.</Paragraph>
      <Paragraph position="1"> The goal is to detect NPs and SV phrases. Of the several slightly different definitions of a base NP in the literature, we use for the purposes of this work the definition presented in (Ramshaw and Marcus, 1995) and used also by (Argamon et al., 1998) and others. That is, a base NP is a non-recursive NP that includes determiners but excludes post-modifying prepositional phrases or clauses. For example: ...presented \[last year\] in \[Illinois\] in front of ...</Paragraph>
      <Paragraph position="2"> SV phrases, following the definition suggested in (Argamon et al., 1998), are word phrases starting with the subject of the sentence and ending with the first verb, excluding modal verbs2. For example, the SV phrases are bracketed in the following: ...presented \[a theory that claims\] that \[the algorithm runs\] and performs...</Paragraph>
      <Paragraph position="3"> Both tasks can be viewed as sequence recognition problems. This can be modeled as a collection of prediction problems that interact in a specific way. For example, one may predict the first and last word in a target sequence.</Paragraph>
      <Paragraph position="4"> Moreover, it seems plausible that information produced by one predictor (e.g., predicting the beginning of the sequence) may contribute to others (e.g., predicting the end of the sequence). Therefore, our computational paradigm suggests using SNoW predictors that learn separately to perform each of the basic predictions, and chaining the resulting predictors at evaluation time. Chaining here means that the predictions produced by one of the predictors may be used as (a part of the) input to others 3.</Paragraph>
      <Paragraph position="5"> Two instantiations of this paradigm - each of which models the problems using a different set of predictors - are described below.</Paragraph>
    </Section>
    <Section position="2" start_page="169" end_page="171" type="sub_section">
      <SectionTitle>
3.2 Inside/Outside Predictors
</SectionTitle>
      <Paragraph position="0"> The predictors in this case are used to decide, for each word, whether it belongs to the interior of a phrase or not; this information is then used to group the words into phrases. Since annotating words only with Inside/Outside information is ambiguous in cases of two consecutive phrases, an additional predictor is used. Specifically, each word in the sentence may be annotated using one of the following labels: O - the current word is outside the pattern; I - the current word is inside the pattern; B - the current word marks the beginning of a pattern that immediately follows another pattern4.</Paragraph>
      <Paragraph position="1"> 2Notice that according to this definition the identified verb may not correspond to the subject, but this phrase still contains meaningful information; in any case, the learning method presented is independent of the specific definition used. 3The data presented here consists of part-of-speech tagged data. In the demo of the system (available from http://l2r.cs.uiuc.edu/~cogcomp/eoh/index.html), an additional layer of chaining is used. Raw sentences are supplied as input and are processed using a SNoW based POS tagger (Roth and Zelenko, 1998) first.</Paragraph>
      <Paragraph position="2"> 4There are other ways to define the B annotation, e.g., as always marking the beginning of a phrase; the modeling used, however, turns out best experimentally. For example, the sentence I went to California last May would be marked for base NPs as: I/I went/O to/O California/I last/B May/I,</Paragraph>
      <Paragraph position="4"> indicating that the NPs are I, California and last May. This approach has been studied in (Ramshaw and Marcus, 1995).</Paragraph>
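The OIB encoding can be made concrete with a small conversion helper; the function name and the (start, end) span representation (end exclusive) are our own illustrative choices, not part of the paper.

```python
def to_oib(words, phrases):
    """Convert phrase spans to per-word O/I/B tags (illustrative helper).

    `phrases` is a list of (start, end) word-index spans, end exclusive.
    B is used only where a phrase immediately follows another phrase,
    which is exactly the ambiguity the extra label resolves.
    """
    tags = ['O'] * len(words)
    prev_end = None
    for start, end in sorted(phrases):
        tags[start] = 'B' if start == prev_end else 'I'
        for i in range(start + 1, end):
            tags[i] = 'I'
        prev_end = end
    return tags
```

On the example sentence, the NP spans for "I", "California", and "last May" yield the tags I O O I B I; "last" receives B because "last May" immediately follows "California".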
      <Paragraph position="5">  SNoW is used in order to learn the OIB annotations both for NPs and SV phrases. In each case, two predictors are learned, which differ in the type of information they receive in their input. A first predictor takes as input a sentence along with the corresponding part-of-speech (POS) tags. The features extracted from this input represent the local context of each word in terms of POS tags (with the possible addition of lexical information), as described in Sec 3.4. The SNoW predictor in this case consists of three targets - O, I and B. Figure 1 depicts the feature extraction module which extracts the local features and generates an example for each word in the sentence. Each example is labeled with one of O, I or B.</Paragraph>
      <Paragraph position="6"> The second predictor takes as input a sentence along with the corresponding POS tags as well as the Inside/Outside information.</Paragraph>
      <Paragraph position="7"> The hope is that representing the local context of a word using the Inside/Outside information for its neighboring words, in addition to the POS and lexical information, will enhance the  performance of the predictor. While this information is available during training, since the data is annotated with the OIB information, it is not available in the input sentence at evaluation time. Therefore, at evaluation time, given a sentence (represented as a sequence of POS tags), we first need to evaluate the first predictor on it, generate an Inside/Outside representation of the sentence, and then use this to generate new features that feed into the second predictor.</Paragraph>
    </Section>
    <Section position="3" start_page="171" end_page="172" type="sub_section">
      <SectionTitle>
3.3 Open/Close Predictors
</SectionTitle>
      <Paragraph position="0"> The predictors in this case are used to decide, for each word, whether it is the first in a phrase, the last in a phrase, both of these, or none of these. In this way, the phrase boundaries are determined; this is annotated by placing an open bracket (\[) before the first word and a close bracket (\]) after the last word of each phrase. Our earlier example would be marked for base NPs as: \[I\] went to \[California\] \[last May\]. This approach has been studied in (Church, 1988; Argamon et al., 1998).</Paragraph>
      <Paragraph position="1">  The architecture used for the Open/Close predictors is shown in Figure 2. Two SNoW predictors are used, one to predict if the word currently in consideration is the first in the phrase (an open bracket), and the other to predict if it is the last (a close bracket). Each of the two predictors is a SNoW network with two competing target nodes: one predicts if the current position is an open (close) bracket and the other predicts if it is not. In this case, the actual activation value (sum of weights of the active features for a given target) of the SNoW predictors is used to compute a confidence in the prediction. Let t_Y be the activation value for the yes-bracket target and t_N for the no-bracket target. Normally, the network would predict the target corresponding to the higher activation value. In this case, we prefer to cache the system preferences for each of the open (close) bracket predictors so that several bracket pairings can be considered when all the information is available. The confidence, gamma, of a candidate is defined by gamma = t_Y / (t_Y + t_N). Normally, SNoW would predict that there is a bracket if gamma &gt;= 0.5, but this system employs a threshold tau: we consider any bracket that has gamma &gt;= tau as a candidate. The lower tau is, the more candidates will be considered.</Paragraph>
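The confidence computation and the thresholding by tau can be sketched as follows; the function and variable names are ours, not the paper's.

```python
def confidence(t_yes, t_no):
    """gamma = t_Y / (t_Y + t_N), assuming both activations are positive."""
    return t_yes / (t_yes + t_no)

def bracket_candidates(activations, tau):
    """Keep every position whose confidence clears the threshold tau.

    `activations` maps a word position to its (t_Y, t_N) activation pair;
    lowering tau admits more candidates for the later pairing stage.
    """
    return {pos: confidence(ty, tn)
            for pos, (ty, tn) in activations.items()
            if confidence(ty, tn) >= tau}
```

With tau = 0.5 this reduces to the usual winner-take-all choice between the two targets; smaller tau defers more of the decision to the combinator.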
      <Paragraph position="2"> The input to the open bracket predictor is a sentence and the POS tags associated with each word in the sentence. For each position in the sentence, the open bracket predictor decides if it is a candidate for an open bracket. For each open bracket candidate, features that correspond to this information are generated; the close bracket predictor can (potentially) receive this information in addition to the sentence and the POS information, and use it in its decision on whether a given position in the sentence is to be a candidate for a close bracket (to be paired with the open bracket candidate).</Paragraph>
      <Paragraph position="3">  Finding the final phrases by pairing the open and close bracket candidates is crucial to the performance of the system; even given good prediction performance, choosing an inadequate pairing would severely lower the overall performance. We use a graph-based method that uses the confidence of the SNoW predictors to generate the consistent pairings, at only a linear time complexity.</Paragraph>
      <Paragraph position="4"> We call p = (o, c) a pair, where o is an open bracket and c is any close bracket that was predicted with respect to o. The position of a bracket at the ith word is defined to be i if it is an open bracket and i + 1 if it is a close bracket. Clearly, a pair (o, c) is possible only when pos(o) &lt; pos(c). The confidence of a bracket t is its weight gamma(t). The value of a pair p = (o, c) is defined to be v(p) = gamma(o) * gamma(c).</Paragraph>
      <Paragraph position="5"> The pair p1 occurs before the pair p2 if pos(c1) &lt;= pos(o2). p1 and p2 are compatible if either p1 occurs before p2 or p2 occurs before p1. A pairing is a set of pairs P = {p1, p2, ..., pn} such that p_i is compatible with p_j for all i and j where i != j. The value of the pairing is the sum of all of the values of the pairs within the pairing.</Paragraph>
      <Paragraph position="6"> Our combinator finds the pairing with the maximum value. Note that while there may be exponentially many pairings, by modeling the problem of finding the maximum-valued pairing as a shortest path problem on a directed acyclic graph, we provide a linear time solution. Figure 3 gives an example of pairing bracket candidates of the sentence S = s1 s2 s3 s4 s5 s6, where the confidence of each candidate is written in the subscript.</Paragraph>
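The combinator's selection problem can be sketched as weighted interval scheduling over the candidate pairs: compatibility is a pure ordering constraint on positions, so a simple dynamic program finds the maximum-valued pairing. The paper formulates this as a shortest path on a DAG with a linear-time solution; the O(n log n) sketch below, with our own (open_pos, close_pos, value) pair encoding, computes the same maximum.

```python
import bisect

def best_pairing(pairs):
    """Pick a maximum-value set of mutually compatible bracket pairs.

    Each pair is (open_pos, close_pos, value) with open_pos < close_pos;
    two pairs are compatible when one closes at or before the other opens.
    Classic weighted interval scheduling via dynamic programming.
    """
    pairs = sorted(pairs, key=lambda p: p[1])   # order by close position
    closes = [p[1] for p in pairs]
    best = [0.0] * (len(pairs) + 1)             # best[i]: best value using first i pairs
    choice = [None] * (len(pairs) + 1)
    for i, (o, c, v) in enumerate(pairs, 1):
        # rightmost earlier pair whose close position is at or before o
        j = bisect.bisect_right(closes, o)
        take = best[j] + v
        if take > best[i - 1]:
            best[i], choice[i] = take, (j, (o, c, v))
        else:
            best[i] = best[i - 1]
    chosen, i = [], len(pairs)
    while i > 0:                                # recover the chosen pairs
        if choice[i] is None:
            i -= 1
        else:
            j, p = choice[i]
            chosen.append(p)
            i = j
    return best[-1], chosen[::-1]
```

The function returns the total value of the best pairing together with the chosen pairs in sentence order.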
    </Section>
    <Section position="4" start_page="172" end_page="173" type="sub_section">
      <SectionTitle>
3.4 Features
</SectionTitle>
      <Paragraph position="0"> The features used in our system are relational features over the sentence and the POS information, which can be defined by a pair of numbers, k and w. Specifically, features are either word conjunctions or POS tag conjunctions. All conjunctions of size up to k and within a symmetric window that includes the w words before and after the designated word are generated.</Paragraph>
      <Paragraph position="1"> An example is shown in Figure 4, where (w, k) = (3, 4) for POS tags, and (w, k) = (1, 2) for words. In this example the word "how" is the designated word with POS tag "WRB". "0" marks the position of the current word (tag) if it is not part of the feature, and "(how)" or "(WRB)" marks the position of the current word (tag) if it is part of the current feature.</Paragraph>
      <Paragraph position="2"> The distance of a conjunction from the current word (tag) can be induced by the placement of the special character "0" in the feature. We do not consider mixed features between words and POS tags as in (Ramshaw and Marcus, 1995); that is, a single feature consists of either words or tags.</Paragraph>
      <Paragraph position="3"> Additionally, in the Inside/Outside model, the second predictor incorporates as features the OIB status of the w words before and after the designated word, and the conjunctions of size 2 of the words surrounding it.</Paragraph>
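The feature generation described above can be sketched as follows. The encoding of each feature as an (offset, token-tuple) pair is our own illustrative choice, and conjunctions are taken as contiguous runs of tokens within the window; the paper's literal feature format (with the "0" placeholder) differs in surface form.

```python
def conjunction_features(tokens, pos, w, k):
    """Conjunctions of size up to k within a symmetric window of w
    tokens around the designated position `pos`.

    Each feature is (offset, token_tuple), where offset is the start of
    the conjunction relative to the designated word, so the distance of
    the conjunction from the current word is preserved.
    """
    lo = max(0, pos - w)
    hi = min(len(tokens), pos + w + 1)
    feats = []
    for start in range(lo, hi):
        for size in range(1, k + 1):
            end = start + size
            if end > hi:
                break  # conjunction would leave the window
            feats.append((start - pos, tuple(tokens[start:end])))
    return feats
```

Running this once over the POS sequence and once over the words (with the two different (w, k) settings) gives the two disjoint feature families, never mixing words and tags in one feature.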
      <Paragraph position="5"/>
    </Section>
  </Section>
  <Section position="6" start_page="173" end_page="174" type="metho">
    <SectionTitle>
4 Methodology
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="173" end_page="173" type="sub_section">
      <SectionTitle>
4.1 Data
</SectionTitle>
      <Paragraph position="0"> In order to be able to compare our results with the results obtained by other researchers, we worked with the same data sets already used by (Ramshaw and Marcus, 1995; Argamon et al., 1998) for NP and SV detection. These data sets were based on the Wall Street Journal corpus in the Penn Treebank (Marcus et al., 1993). For NP, the training and test corpus was prepared from sections 15 to 18 and section 20, respectively; the SV corpus was prepared from sections 1 to 9 for training and section 0 for testing. Instead of using the NP bracketing information present in the tagged Treebank data, Ramshaw and Marcus modified the data so as to include bracketing information related only to the non-recursive, base NPs present in each sentence while the subject verb phrases were taken as is. The data sets include POS tag information generated by Ramshaw and Marcus using Brill's transformational part-of-speech tagger (Brill, 1995).</Paragraph>
      <Paragraph position="1"> The sizes of the training and test data are summarized in Table 1 and Table 2.</Paragraph>
    </Section>
    <Section position="2" start_page="173" end_page="174" type="sub_section">
      <SectionTitle>
4.2 Parameters
</SectionTitle>
      <Paragraph position="0"> The Open/Close system has two adjustable parameters, tau\[ and tau\], the thresholds for the open and close bracket predictors, respectively. For all experiments, the system is first trained on 90% of the training data and then tested on the remaining 10%. The tau\[ and tau\] that provide the  best performance are used on the real test file. After the best parameters are found, the system is trained on the whole training data set. Results are reported in terms of recall, precision, and F_beta. F_beta is always used as the single value to compare the performance.</Paragraph>
      <Paragraph position="1"> For all the experiments, we use 1 as the initial weight, 5 as the threshold, 1.5 as alpha, and 0.7 as beta to train SNoW, and it is always trained for 2 cycles.</Paragraph>
    </Section>
    <Section position="3" start_page="174" end_page="174" type="sub_section">
      <SectionTitle>
4.3 Evaluation Technique
</SectionTitle>
      <Paragraph position="0"> To evaluate the results, we use the following metrics: Recall = (Number of correct proposed patterns) / (Number of correct patterns)</Paragraph>
      <Paragraph position="1"> Precision = (Number of correct proposed patterns) / (Number of proposed patterns); F_beta = ((beta^2 + 1) * Recall * Precision) / (beta^2 * Precision + Recall)</Paragraph>
      <Paragraph position="2"> Accuracy = (Number of words labeled correctly) / (Total number of words). We use beta = 1. Note that, for the Open/Close system, we must measure the accuracy for the open predictor and the close predictor separately since each word can be labeled as "Open" or "Not Open" and, at the same time, "Close" or "Not Close".</Paragraph>
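The phrase-level metrics can be computed directly from the counts; the argument names below are illustrative.

```python
def evaluate(correct_proposed, proposed, correct, beta=1.0):
    """Recall, precision, and F_beta from pattern counts (beta = 1 in the paper).

    correct_proposed: number of correctly proposed patterns
    proposed:         total number of proposed patterns
    correct:          number of correct (gold) patterns
    """
    recall = correct_proposed / correct
    precision = correct_proposed / proposed
    f = ((beta ** 2 + 1) * recall * precision) / (beta ** 2 * precision + recall)
    return recall, precision, f
```

With beta = 1 this reduces to the familiar harmonic mean of recall and precision.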
    </Section>
  </Section>
</Paper>