<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3233"> <Title>NP Bracketing by Maximum Entropy Tagging and SVM Reranking</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Hypothesis Reranking </SectionTitle> <Paragraph position="0"> In the previous section, we described a tagging model for NP Bracketing that can produce n-best lists. In this section, we describe a machine learning method for reranking these lists in an attempt to choose a hypothesis that is superior to the first-best output of the decoder. Reranking of n-best lists has recently become popular in several natural language problems, including parsing (Collins, 2003), machine translation (Och and Ney, 2002) and web search (Joachims, 2002). Each of these researchers takes a different approach to reranking: Collins (2003) uses both Markov Random Fields and boosting, Och and Ney (2002) use a maximum entropy ranking scheme, and Joachims (2002) uses a support vector approach. Since SVMs tend to exhibit fewer problems with over-fitting than competing approaches in noisy scenarios, we also adopt the support vector approach.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Support Vector Reranking </SectionTitle> <Paragraph position="0"> A support vector classifier is a binary classifier with a linear decision boundary. The decision boundary is a hyperplane chosen so that the distance between it and the nearest data points is maximized. Slack variables are commonly introduced when the problem is not linearly separable, leading to soft margins.</Paragraph> <Paragraph position="1"> For reranking, we assume that instead of having binary classes for the y_i, we have real values which specify the relative ordering (higher values come first). For this task, we get the following optimization problem (Joachims, 2002): minimize (1/2)||w||^2 + C Σ_{i,j} ξ_{i,j} subject to w · x_i ≥ w · x_j + 1 − ξ_{i,j} and ξ_{i,j} ≥ 0 (7)</Paragraph> <Paragraph position="3"> where the pairs (i, j) are drawn from comparable data points with y_i > y_j, and C is a regularization parameter that specifies how great the cost of mis-ordering is. As noted by Joachims, the condition in Equation 7 can be reduced to the standard SVM model by subtracting w · x_j from both sides.</Paragraph> </Section>
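The reduction just described can be made concrete with a short sketch. The following Python fragment is an illustrative reconstruction, not the authors' implementation: the toy feature vectors, the per-hypothesis target scores and the use of scikit-learn's LinearSVC are our own assumptions. It builds difference vectors for every mis-ordered pair and trains a standard linear SVM on them; at test time, the hypothesis with the largest w · x in an n-best list is returned.

import numpy as np
from sklearn.svm import LinearSVC

def pairwise_examples(feats, scores):
    """Turn one n-best list into binary SVM examples on difference vectors.
    feats: array of shape (n, d), one feature vector per hypothesis.
    scores: length-n targets (e.g., the bracketing f-score of each hypothesis)."""
    X, y = [], []
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if scores[i] > scores[j]:
                # ranking constraint w.x_i >= w.x_j + 1 - slack becomes a
                # positive example on (x_i - x_j), plus a mirrored negative
                X.append(feats[i] - feats[j]); y.append(+1)
                X.append(feats[j] - feats[i]); y.append(-1)
    return np.array(X), np.array(y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(20, 50))      # hypothetical hypothesis features
    scores = rng.uniform(size=20)          # hypothetical per-hypothesis f-scores
    X, y = pairwise_examples(feats, scores)
    model = LinearSVC(C=1.0).fit(X, y)     # C plays the role of the mis-ordering cost
    w = model.coef_[0]
    best = int(np.argmax(feats @ w))       # rerank: keep the highest-scoring hypothesis
    print("reranked best hypothesis:", best)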
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Reranking Feature Functions </SectionTitle> <Paragraph position="0"> Since our problem is closely related to that of Collins (2003), we use many of the same feature functions he does, though we also introduce many of our own (features copied from Collins are marked with an asterisk). We view the hypothesized bracketing as a tree in a context-free grammar and include features based on each rule used to generate the given tree. For concreteness, we will use the CFG rule NP → DT JJ NP (where the NP is selected as the head) as an example.</Paragraph> <Paragraph position="1"> Rules*: the full CFG rule; in this case, the active rule would be NP → DT JJ NP.</Paragraph> <Paragraph position="2"> Markov 2 Rules: CFG rules to which 2-level Markovization has been applied. That is, we look at the rule for generating the first two tags, then the next two (given the previous one), and so on. A start-of-branch tag ([S]) and an end-of-branch tag ([/S]) are added to the beginning and end of the children lists. In this case, the rules that fire are: NP→ → [S] DT, NP→[S] → DT JJ, NP→DT → JJ NP and NP→JJ → NP [/S]. The notation is X→Y → A B, where X is the true parent, Y was the previous child in the Markovization, and A B are the two children.</Paragraph> <Paragraph position="3"> Lex-Rules*: full CFG rules, where terminal POS tags are replaced with lexical items.</Paragraph> <Paragraph position="4"> Markov 2 Lex-Rules: Markov 2-style rules in which terminal POS tags are replaced with lexical items.</Paragraph> <Paragraph position="5"> Bigrams*: pairs of adjacent tags in the CFG rule; in our example, the active pairs are ([S],DT), (DT,JJ), (JJ,NP) and (NP,[/S]).</Paragraph> <Paragraph position="6"> Lex-Bigrams*: same as BIGRAMS, but with lexical heads instead of POS tags.</Paragraph> <Paragraph position="7"> Head Pairs*: pairs of internal node tags with the head type; in the example, (DT, NP), (JJ, NP) and (NP, NP).</Paragraph> <Paragraph position="8"> Sizes: the child count, conditioned on the internal tag; e.g., NP → 3.</Paragraph> <Paragraph position="9"> Word Count: the SIZES value paired with the total number of words under this constituent.</Paragraph> <Paragraph position="10"> Boundary Heads: pairs of the first and last head in the constituent.</Paragraph> <Paragraph position="11"> POS-Counts: a schema of features that counts the number of children whose part of speech tag matches a given predicate. There are six of these: (1) children whose tag begins with N, (2) children whose tag begins with N but is not NP, (3) children which are DTs, (4) children whose tag begins with V, (5) children which are commas, (6) children whose tag is CC. In this case, we get a count of 1 for rules (2) and (3), and 2 for rule (1).</Paragraph> <Paragraph position="12"> Lex-Tag/Head Pairs: same as HEAD PAIRS, but where lexical items are used instead of POS tags. Special Tag Pairs: count of the lexical heads to the left and right of leaves tagged with each of POS, CC, IN and TO.</Paragraph> <Paragraph position="13"> Tag-Counts: another schema of features that replicates some of the features used in the maximum entropy tagger. This schema includes all the original maximum entropy tags, as well as a feature for each maximum entropy tag at position i, paired with (a) the part of speech tag at positions i, i - 1 and i + 1, (b) the word at positions i, i - 1 and i + 1, (c) the part of speech + word pair at those positions, (d) the maximum entropy tag at that position.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 SVM Training </SectionTitle> <Paragraph position="0"> We develop three reranking systems, differentiated by the amount of training data used. The first, RR1, is trained on the validation part of the training set (20% of sections 15-18). The second, RR2, is trained on the entire training set through cross-validation (all of sections 15-18). The final, RR3, is trained on the entire Penn Treebank corpus, except section 20.</Paragraph> <Paragraph position="2"> Training the reranking system only on the validation data (RR1) results in only a marginal gain in overall f-score, due primarily to the fact that most of the features use lexical information to prefer one bracketing over another. The validation data from sections 15-18 gives rise to 2,012 training instances and 362,415 features. In order to train the reranking system on all of the training data (RR2), we built five decoders, each with a different 20% of the training data held out. Each decoder is then used to tag the held-out 20%; this is done so that the tagger does not do &quot;too well&quot; on its own training data (see the sketch at the end of this section). This leads to 8,935 sentences for training, with a total of 1.1 million features. Training on all the WSJ data except section 20 (RR3) gives rise to 39,953 training instances and a total of just over 2.1 million features.</Paragraph> <Paragraph position="3"> These examples give 1,462,568 rank constraints.</Paragraph> </Section> </Section>
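The jackknifed construction of the RR2 reranking data can be sketched as follows. This is a minimal illustration under our own assumptions: train_tagger and nbest are hypothetical stand-ins for the maximum entropy tagger's training and n-best decoding routines, and the fold assignment is simplified. The held-out n-best lists produced this way are then turned into pairwise rank constraints in the same manner as the sketch in Section 3.1.

import random

def jackknife_nbest(sentences, train_tagger, nbest, k=5, seed=0):
    """Produce n-best lists for every training sentence using a tagger that
    was never trained on that sentence (k-fold jackknifing)."""
    order = list(range(len(sentences)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    reranker_data = []
    for i, held_out in enumerate(folds):
        # train on the other k-1 folds ...
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        tagger = train_tagger([sentences[j] for j in train_idx])
        # ... and decode only the held-out fold
        for j in held_out:
            reranker_data.append((sentences[j], nbest(tagger, sentences[j])))
    return reranker_data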
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Results </SectionTitle> <Paragraph position="0"> We compare our system against those reported in the literature. In all, the evaluation is over 2,012 sentences of test data. In Table 1, we display the results of state-of-the-art systems and of the system described in this paper (both with and without reranking). The upper part of the table displays results from systems which are trained only on sections 15-18 of the WSJ. The lower part displays results based on systems trained on more data; the systems in the lower half are not directly comparable, since they were either trained or tested on different data.</Paragraph> <Paragraph position="1"> In the table, TKS99 and TKS02 are the systems of Tjong Kim Sang (1999; 2002). KD00 is the system of Krymolowski and Dagan (2000). All the COL03 systems are results obtained by restricting the output of Collins' (2003) parser. In particular, the two comparable numbers coming from Collins' parser are COL03NP and COL03Full.</Paragraph> <Paragraph position="2"> The difference between these two systems is that the NP system is trained on parse trees with all non-NP nodes removed. The FULL system is trained on full parse trees, and the output is then reduced to include just NPs. COL03All is trained on sections 2-21 of the WSJ and tested on section 23, and is thus an upper bound, since these numbers come from testing on training data.3 Our RR3 system had the reranking component (but not the tagging component) trained on all of the WSJ except for section 20.</Paragraph> <Paragraph position="3"> (Footnote 3) Collins independently reports a recall of 91.2 and precision of 90.3 for NPs (Collins, 2003); however, these numbers are based on training on all the data and testing on section 0. Moreover, it is possible that his evaluation of NP bracketing is not identical to our own. The results in row COL03Full are therefore perhaps more relevant.</Paragraph> <Paragraph position="4"> The CHUNK row in the results table is the performance of an optimally performing NP chunker. That is, this is the performance attainable given a chunker that identifies base NPs perfectly (at 100% precision). However, since this hypothetical system only chunks base NPs, it misses all non-base NPs and thus achieves a recall of only 73.0, yielding an overall f-score below our system's performance. Note also that no chunker will perform this well: current systems attain approximately 94% precision and recall on the chunking task (Sha and Pereira, 2002; Kudo and Matsumoto, 2001), so the actual performance for a real system would be substantially lower.</Paragraph> <Paragraph position="5"> The four criteria these systems are evaluated on are bracketing recall (BR), bracketing precision (BP), bracketing f-score (BF) and average crossing brackets (CB). Some systems do not report their crossing bracket rate. All of these metrics are calculated only on NP* and WHNP* brackets.</Paragraph>
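As a small illustration of these metrics, the following fragment computes bracketing precision, recall, f-score and an average crossing-brackets count over NP*/WHNP* spans. It is our own sketch, assuming brackets are given as (label, start, end) triples; it is not the scorer actually used for Table 1.

def is_np(label):
    # restrict the evaluation to NP* and WHNP* constituents
    return label.startswith("NP") or label.startswith("WHNP")

def bracket_scores(gold_sents, guess_sents):
    """gold_sents/guess_sents: one list of (label, start, end) brackets per sentence."""
    matched = gold_total = guess_total = crossing = 0
    for gold, guess in zip(gold_sents, guess_sents):
        g = [(s, e) for lab, s, e in gold if is_np(lab)]
        h = [(s, e) for lab, s, e in guess if is_np(lab)]
        gold_total += len(g)
        guess_total += len(h)
        matched += len(set(g) & set(h))
        # a guessed bracket crosses a gold bracket if the two overlap
        # without either span containing the other
        for hs, he in h:
            if any(hs < gs < he < ge or gs < hs < ge < he for gs, ge in g):
                crossing += 1
    precision = 100.0 * matched / guess_total if guess_total else 0.0
    recall = 100.0 * matched / gold_total if gold_total else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score, crossing / len(guess_sents)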
</Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Comparison of Performance </SectionTitle> <Paragraph position="0"> The results depicted in Table 1 show that, when comparing our system directly to Collins' parser, his system tends to achieve significantly higher levels of recall, while maintaining a slight advantage in terms of precision. This table, however, does not tell the full story. As is typically observed in these sorts of applications, it is not the case that Collins' parser is &quot;winning&quot; by a little on all the data, but rather that Collins' parser wins on some of the data and our bracketer wins on some of the data. In this section, we analyze the differences.</Paragraph> <Paragraph position="1"> Overall, there are 2,012 sentences in the test data. In 558 cases, both the bracketing system and Collins' parser achieve perfect precision. In 505 cases, both achieve perfect recall. For the remainder of the discussion in this section, when discussing precision, we will only consider the cases in which not both achieved perfect scores, and similarly for recall.</Paragraph> <Paragraph position="2"> In Figure 4, we depict (excluding the mutually perfect sentences) the percentage of sentences on which each system is better than the other by a distance of at least δ. Along the X-axes, the value of δ ranges from 0 to 20. At a given value of δ, the segmentation along the Y-axes depicts (a) along the top (in yellow where available), the proportion of sentences for which the bracketer's precision (for the left-hand image) was at least δ higher than that of Collins'; (b) in the middle (in red), the proportion of sentences for which Collins' was at least δ better; and (c) along the bottom (in blue), the proportion of sentences where the two systems performed within δ of each other.</Paragraph> <Paragraph position="3"> As should be expected, as δ increases, the &quot;Equal&quot; region also increases. However, it is worth noticing that even at a δ of 20 precision points, there are still roughly 11% of the sentences for which one system's performance is noticeably different from the other's (and, furthermore, these cases are split about evenly). As can be immediately seen from the right-hand graph, Collins' parser consistently outperforms the bracketer in terms of recall. However, in contrast to the precision graph, for the first 10 or so values of δ, these proportions remain roughly the same (in fact, for a short stretch, Collins' parser actually loses ground). This suggests that there is a relatively large proportion of sentences for which our system performs abominably (with a difference of more than 10 recall points) in comparison to Collins'.</Paragraph> <Paragraph position="4"> However, once a critical mass of δ > 10 is reached, the relative differences become less pronounced.</Paragraph> <Paragraph position="5"> Since neither system wins in all cases, in an effort to better understand the conditions in which one system will outperform the other, we inspect the sentences for which there was a difference in performance of at least 10 points (for precision and recall separately). To perform this investigation, we look at the distribution of tags in the true, full parse trees for those sentences. These percentages, for the 7 most common tags, are summarized in Table 2 (for example, the relative frequency of the NP tag in sentences where the RR2 system achieved higher precision was 21.4, while for the sentences in which COL03 achieved higher precision it was 19.8).</Paragraph>
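The tag-distribution comparison behind Table 2 can be sketched roughly as follows. This is our own illustration with assumed inputs (the gold constituent labels of each sentence and a boolean mask marking the sentences one system wins by at least 10 points); it is not the analysis code used for the paper.

from collections import Counter

def tag_distribution(gold_labels, selected, top_k=7):
    """gold_labels: per-sentence lists of constituent labels from the gold,
    full parse trees; selected: parallel booleans marking the sentence group
    of interest (e.g., sentences where RR2's precision is at least 10 points
    higher than COL03's).  Returns relative frequencies (in %) of the most
    common tags within that group."""
    counts = Counter()
    for labels, keep in zip(gold_labels, selected):
        if keep:
            counts.update(labels)
    total = sum(counts.values())
    if not total:
        return {}
    return {tag: 100.0 * c / total for tag, c in counts.most_common(top_k)}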
<Paragraph position="6"> The first thing worth noticing in this table is that, in general, when one system achieves higher precision, the other system achieves higher recall, which is not surprising. However, in the last row, corresponding to proper nouns, the RR2 system outperforms COL03 (this is the &quot;Full&quot; implementation) in both precision and recall, suggesting that our system is better able to capture the phrasing of proper nouns. We attribute this to the fact that our model is specialized to identify noun phrases, of which proper nouns comprise a large part. Similarly, the largest gains in recall for COL03 over RR2 are in sentences with many PPs. This coincides with our intuition that the syntactic parser is better able to capture long, embedded noun phrases.</Paragraph> </Section> </Paper>