<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2098">
  <Title>Exact Decoding for Jointly Labeling and Chunking Sequences</Title>
  <Section position="4" start_page="763" end_page="764" type="metho">
    <SectionTitle>
2 Model for Joint Labeling and Chunking
</SectionTitle>
    <Paragraph position="0"> Consider the task of finding noun chunks. The noun chunk extends from the beginning of a noun phrase to the head noun, excluding postmodifiers (which are difficult to attach correctly). Table 1 shows a sentence labeled with POS tags and segmented into noun chunks. B marks the first word of a noun chunk, I the other words in a noun chunk, and O the words that are not in a noun chunk. Note that we collapsed the 45 different POS labels into 5 labels, following (McCallum et al, 2003). All different types of adjectives are labeled as JADJ.</Paragraph>
    <Paragraph position="1"> Each word carries two tags. Given the first layer, our aim is to present a model that can predict the second and third layers of tags at the same time.</Paragraph>
    <Paragraph position="2"> Assume we have n training samples, {(xi,yi)}ni=1, where xi is a sequence of input tokens and yi is a label-chunk structure for xi. In this example, the first column contains the tokens xi and the second and third columns together represent the label-chunk structures yi. We will present an efficient exact decoding for this structure.</Paragraph>
    <Paragraph position="3"> The label-chunk structure, shown in Table 2, is a representation of the two layers of tags. The tuples in Table 2 are called parts. If the token at index r carries a POS tag P and a chunk tag C, the first layer includes part &lt;C,P,r&gt; . This part is called a node.</Paragraph>
    <Paragraph position="4"> If the tokens at index r [?] 1 and r are in the same chunk, and C is the label of that chunk, the first layer also includes part &lt;C,P0,P,r[?]1,r&gt; (where P0 and P are the POS tags of the tokens at r [?] 1 and r Token First Layer (POS) Second Layer (NP) U.K. &lt;I, JADJ,0&gt;</Paragraph>
    <Paragraph position="6"> respectively). This part is called a transition. If a chunk tagged C extends from the token at q to the token at r inclusive, the second layer includes part &lt;C,q,r&gt; . This part is a chunk node. And if the token at q[?]1 is the last token in a chunk tagged C0, while the token at q is the first token of a chunk tagged C, the second layer includes part &lt;C0,C,q[?]1,q&gt; . This part is a chunk transition.</Paragraph>
    <Paragraph position="7"> In this paper we use the common method of factoring the score of the label-chunk structure as the sum of the scores of all the parts. Each part in a label-chunk structure can be lexicalized, and gives rise to several features. For each feature, we have a corresponding weight. If we sum up the weights for these features, we have the score for the part, and if we sum up the scores of the parts, we have the score for the label-chunk structure.</Paragraph>
    <Paragraph position="8"> Suppose we would like to score a pair (xi,yi) in the training set, and it happens to be the one shown in Table 2. To begin, let's say we would like to find the features for the part &lt;I,NOUN,7&gt; of POS node type (1st Layer). This is the NOUN tag on the seventh token &amp;quot;level&amp;quot; in Table 2. By default, the POS node type generates the following binary feature.</Paragraph>
    <Paragraph position="9"> * Is there a token labeled with &amp;quot;NOUN&amp;quot; in a chunk labeled with &amp;quot;I&amp;quot;?  Now, to have more features, we can lexicalize POS node type. Suppose we use xr to lexicalize POS node &lt;C,P,r&gt; , then we have the following binary feature, as it is &lt;I,NOUN,7&gt; and xi7 = &amp;quot;level&amp;quot;. * Is there a token &amp;quot;level&amp;quot; labeled with &amp;quot;NOUN&amp;quot; in a chunk labeled with &amp;quot;I&amp;quot;? We can also use xr[?]1 and xr to lexicalize the parts of POS node type.</Paragraph>
    <Paragraph position="10"> * Is there a token &amp;quot;level&amp;quot; labeled with &amp;quot;NOUN&amp;quot; in a chunk labeled with &amp;quot;I&amp;quot; that's preceded by &amp;quot;highest&amp;quot;? This way, we have a complete specification of the feature set given the part type, lexicalization for each part type and the training set. Let us define f a boolean feature vector function such that each dimension of f(xi,yi) contains 1 if the pair (xi,yi) has the feature, 0 otherwise. Now define a real-valued weight vector w with the same dimension as f. To represent the score of the pair (xi,yi), we write s(xi,yi) = w[?]f(xi,yi) We could also have w[?]f(xi,{p}) where p just a single part, in which case we just write s(p).</Paragraph>
    <Paragraph position="11"> Assuming an appropriate feature representation as well as a weight vector w, we would like to find the highest scoring label-chunk structure y = argmaxy'(w[?]f(x,y')) given an input sentence x.</Paragraph>
    <Paragraph position="12"> In the upcoming section, we present a decoding algorithm for the label-chunk structures, and later we give a method for learning the weight vector used in the decoding.</Paragraph>
  </Section>
  <Section position="5" start_page="764" end_page="764" type="metho">
    <SectionTitle>
3 Decoding
</SectionTitle>
    <Paragraph position="0"> The decoding algorithm is shown in Figure 1. The idea is to use two tables for dynamic programming: label table and chunk table.</Paragraph>
    <Paragraph position="1"> Suppose we are examining the current position r, and would like to consider extending the chunk [q,r[?]1] to [q,r]. We need to know the chunk tag C for [q,r[?]1] and the last POS tag P0 at index r[?]1. The array entry label table[q][r [?] 1] keeps track of this information.</Paragraph>
    <Paragraph position="2"> Then we examine how the current chunk is connected with the previous chunk. The array entry chunk table[q][C0] keeps track of the score of the best label-chunk structure from 0 up to the index q that has the ending chunk tag C0. Now checking the chunk transition from C0 to C at the index q is simple, and we can record the score of this chunk to chunk table[r][C], so that the next chunk starting at r can use this information.</Paragraph>
    <Paragraph position="3"> In short, we are executing two Viterbi algorithms on the first and second layer at the same time. One extends [q,r [?] 1] to [q,r], considering the node indexed by r (first layer). The other extends [0,q] to [0,r], considering the node indexed by [q,r] (second layer). The dynamic programming table for the first layer is kept in the label table (r [?] 1 and P0 are used in the Viterbi algorithm for this layer) and that for the second layer in the chunk table (q and C0 used). The algorithm returns the best score of the label-chunk structure.</Paragraph>
    <Paragraph position="4"> To recover the structure, we simply need to maintain back pointers to the items that gave rise to the each item in the dynamic programming table. This is just like maintaining back pointers in the Viterbi algorithm for sequences, or the CKY algorithm for parsing.</Paragraph>
    <Paragraph position="5"> The pseudo-code shows that the run-time complexity of the decoding algorithm is O(n2) unlike that of CFG parsing, O(n3). Thus the algorithm performs better on long sentences. On the other hand, the constant is c2p2 where c is the number of chunk tags and p is the number of POS tags.</Paragraph>
  </Section>
  <Section position="6" start_page="764" end_page="767" type="metho">
    <SectionTitle>
4 Learning
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="764" end_page="766" type="sub_section">
      <SectionTitle>
4.1 Voted Perceptron
</SectionTitle>
      <Paragraph position="0"> In the CKY and Viterbi decoders, we use the forward-backward or inside-outside algorithm to find the marginal probabilities. Since we don't yet have the inference algorithm to find the marginal probabilities of the parts of a label-chunk structure, we use an online learning algorithm to train the model. Despite this restriction, the voted perceptron is known for its performance (Sha and Pereira, 2003).</Paragraph>
      <Paragraph position="1"> The voted perceptron we use is the adaptation of (Freund and Schapire, 1999) to the structured setting. Algorithm 4.1 shows the pseudo code for the training, and the function update(wk,xi,yi,y') returns wk [?]f(xi,y') + f(xi,yi) .</Paragraph>
      <Paragraph position="2"> Given a training set {(xiyi)}ni=1 and the epoch number T, Algorithm 4.1 will return a list of  Algorithm 3.1: DECODE(the scoring function s(p))</Paragraph>
      <Paragraph position="4"> #Add the score of the chunk node at [q,r-1]. (2nd Layer, NP) score := score + s(&lt;C,q,r [?] 1&gt; ); if (index start &lt; q) #Add the score of the chunk transition from q-1 to q. (2nd Layer, NP) score := score + s(&lt;C0,C,q [?] 1,q&gt; ) + chunk table[q][C0]; if (score &gt;= chunk table[r][C]) chunk table[r][C] := score; end for end for end for end for end for end for</Paragraph>
      <Paragraph position="6"> Note: Since the scoring function s(p) is defined as w[?]f(xi,{p}), the input sequence xi and the weight vector w are also the inputs to the algorithm.</Paragraph>
      <Paragraph position="7">  weighted perceptrons {(w1,c1),..(wk,ck)}. The final model V uses the weight vector</Paragraph>
      <Paragraph position="9"/>
    </Section>
    <Section position="2" start_page="766" end_page="767" type="sub_section">
      <SectionTitle>
4.2 Max Margin
</SectionTitle>
      <Paragraph position="0"> A max margin method minimizes the regularized empirical risk function with the hard (penalized)</Paragraph>
      <Paragraph position="2"> li finds the loss for y with respect to yi, and it is assumed that the function is decomposable just as y is decomposable to the parts. This equation is equivalent to</Paragraph>
      <Paragraph position="4"> After taking the Lagrange dual formation, we have</Paragraph>
      <Paragraph position="6"> This quadratic program can be optimized by bicoordinate descent, known as Sequential Minimum Optimization. Given an example i and two label-chunk structures y' and y'',</Paragraph>
      <Paragraph position="8"> Using the equation (1), any increase in a can be translated to w. For a naive SMO, this update is executed for each training sample i, for all pairs of possible parses y' and y'' for xi. See (Taskar and Klein, 2005; Zhang, 2001; Jaakkola et al, 2000).</Paragraph>
      <Paragraph position="9"> Here is where we differ from (Taskar et al, 2004).</Paragraph>
      <Paragraph position="10"> We choose y'' to be the correct parse yi, and y' to be the best runner-up. After setting the initial weights using yi, we also set ai(yi) = 1 and ai(y') = 0. Although these alphas are not correct, as optimization nears the end, the margin is wider; ai(yi) and ai(y') gets closer to 1 and 0 respectively. Given this approximation, we can compute d. Then, the function update(wk,xi,yi,y') will return wk[?]df(xi,y')+df(xi,yi) and we have reduced the SMO to the perceptron weight update.</Paragraph>
      <Paragraph position="11">  We can think of maximizing the margin in terms of extending the Margin Infused Relaxed Algorithm (MIRA) (Crammer and Singer, 2003; Crammer et al, 2003) to learning with structured outputs. (Mc-Donald et al, 2005) presents this approach for dependency parsing.</Paragraph>
      <Paragraph position="12"> In particuler, Single-best MIRA (McDonald et al, 2005) uses only the single margin constraint for the runner up y' with the highest score. The resulting online update would be wk+1 with the following  condition: minbardblwk+1 [?] wkbardbl such that s(xi,yi) [?] s(xi,y') [?] li(y') where y' = argmaxys(xi,y).</Paragraph>
      <Paragraph position="13"> Incidentally, the equation (2) for d above when ai(yi) = 1 and ai(y') = 0 solves this minimization problem as well, and the weight update is the same as the SMO case.</Paragraph>
      <Paragraph position="14">  Instead of minimizing the regularized empirical risk function with the hard (penalized) margin, conditional random fields try to minimize the same with the negative log loss: minw 12bardblwbardbl2 [?]summationdisplay</Paragraph>
      <Paragraph position="16"> Usually, CRFs use marginal probabilities of parts to do the optimization. Since we have not yet come up with the algorithm to compute marginals for a label-chunk structure, the training methods for CRFs is not applicable to our purpose. However, on sequence labeling tasks CRFs have shown very good performance (Lafferty et al, 2001; Sha and Pereira, 2003), and we will use them for the baseline comparison. null</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="767" end_page="768" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="767" end_page="767" type="sub_section">
      <SectionTitle>
5.1 Task: Base Noun Phrase Chunking
</SectionTitle>
      <Paragraph position="0"> The data for the training and evaluation comes from the CoNLL 2000 shared task (Tjong Kim Sang and Buchholz, 2000), which is a portion of the Wall Street Journal.</Paragraph>
      <Paragraph position="1"> We consider each sentence to be a training instance xi, with single words as tokens.</Paragraph>
      <Paragraph position="2"> The shared task data have a standard training set of 8936 sentences and a test set of 2012 sentences. For the training, we used the first 447 sentences from the standard training set, and our evaluation was done on the standard test set of the 2012 sentences. Let us define the set D to be the first 447 samples from the standard training set .</Paragraph>
      <Paragraph position="3"> There are 45 different POS labels, and the three NP labels: begin-phrase, inside-phrase, and other. (Ramshaw and Marcus, 1995) To reduce the inference time, following (McCallum et al, 2003), we collapsed the 45 different POS labels contained in the original data. The rules for collapsing the POS labels are listed in the Table 3.</Paragraph>
    </Section>
    <Section position="2" start_page="767" end_page="767" type="sub_section">
      <SectionTitle>
Table 3: Rules for Collapsing POS Labels (Original → Collapsed)
</SectionTitle>
      <Paragraph position="0"> all different types of nouns NOUN all different types of verbs VERB all different types of adjectives JADJ all different types of adverbs RBP the remaining POS labels OTHER  and after collapsing the labels. We present two experiments: one comparing our label-chunk model with a cascaded linear-chain model and a simple linear-chain model, and one comparing different learning algorithms. The cascaded linear-chain model uses one linear-chain model to predict POS tags, and another linear-chain model to predict NP labels, using the POS tags predicted by the first model as a feature.</Paragraph>
      <Paragraph position="1"> More specifically, we trained a POS-tagger using the training set D. We then used the learned model and replaced the POS labels of the test set with the labels predicted by the learned model. The linear-chain NP chunker was again trained on D and evaluated on this new test set with POS supplied by the earlier processing. Note that the new test set has exactly the same word tokens and noun chunks as the original test set.</Paragraph>
    </Section>
    <Section position="3" start_page="767" end_page="768" type="sub_section">
      <SectionTitle>
5.2 Systems
5.2.1 POS Tagger and NP Chunker
</SectionTitle>
      <Paragraph position="0"> There are three versions of POS taggers and NP chunkers: CRF, VP, MMVP. For CRF, L-BFGS, a quasi-Newton optimization method was used for the training, and the implementation we used is CRF++ (Kudo, 2005). VP uses voted perceptron, and MMVP uses max margin update for the voted perceptron. For the voted perceptron, we used aver- null if xq matches then tq is  aging of the weights suggested by (Collins, 2002). The features are exactly the same for all three systems. null  For each CRF, VP, MMVP, the output of a POS tagger was used as a feature for the NP chunker. The feeds always consist of a POS tagger and NP chunker of the same kind, thus we have CRF+CRF, VP+VP, and MMVP+MMVP.</Paragraph>
      <Paragraph position="1">  Since CRF requires the computation of marginals for each part, we were not able to use the learning method. VP and MMVP were used to train the label-chunk structures with the features explained in the following section.</Paragraph>
    </Section>
    <Section position="4" start_page="768" end_page="768" type="sub_section">
      <SectionTitle>
5.3 Features
</SectionTitle>
      <Paragraph position="0"> First, as a preprocessing step, for each word token xq, feature tq was created with the rule in Table 5, and included in the input files. This feature is included in x along with the word tokens. The feature tells us whether the token is capitalized, and whether digits occur in the token. No outside resources such as a list of names or a gazetteer were used.</Paragraph>
      <Paragraph position="1"> Table 6 shows the lexicalized features for the joint labeling and chunking. For the first iteration of training, the weights for the lexicalized features were not  updated. The intention is to have more weights on the unlexicalized features, so that when lexical feature is not found, unlexicalized features could provide useful information and avoid overfitting, much as back-off probabilities do.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="768" end_page="768" type="metho">
    <SectionTitle>
6 Result
</SectionTitle>
    <Paragraph position="0"> We evaluated the performance of the systems using three measures: POS accuracy, NP accuracy, and F1 measure on NP. These figures show how errors accumulate as the systems are chained together. For the statistical significance testing, we have used pairsamples t test, and for the joint labeling and chunking task, everything was found to be statistically significant except for CRF + CRF vs VP Joint.</Paragraph>
    <Paragraph position="1"> One can see that the systems with joint labeling and chunking models perform much better than the cascaded models. Surprisingly, the perceptron update motivated by the max margin principle performed significantly worse than the simple perceptron update for linear-chain models but performed better on joint labeling and chunking.</Paragraph>
    <Paragraph position="2"> Although joint labeling and chunking model takes longer time per sample because of the time complexity of decoding, the number of iteration needed to achieve the best result is very low compared to other systems. The CPU time required to run 10 iterations of MMVP is 112 minutes.</Paragraph>
  </Section>
class="xml-element"></Paper>