<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1012">
  <Title>Using LTAG Based Features in Parse Reranking</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> As described above, our reranking experiments use the SVM-based voting algorithm of (Shen and Joshi, 2003), with preference kernels and pair-wise parse trees in our reranking models.</Paragraph>
    <Paragraph position="1"> We use the same data set as described in (Collins, 2000). Sections 2-21 of the Penn WSJ Treebank are used as training data, and section 23 is used for the final test. The training data contain around 40,000 sentences, each of which has 27 distinct parses on average. Of the 40,000 training sentences, the first 36,000 are used to train SVMs. The remaining 4,000 sentences are used as development data.</Paragraph>
    <Paragraph position="2"> Due to the computational complexity of SVM training, we have to divide the training data into slices to speed up training. Each slice contains two pairs of parses from every sentence. Specifically, slice i contains positive samples ((p~k, pki), +1) and negative samples ((pki, p~k), -1), where p~k is the best parse for sentence k, and pki is the parse with the i-th highest log-likelihood among all the parses for sentence k that is not the best parse (Shen and Joshi, 2003). There are about 60,000 samples in each slice on average.</Paragraph>
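The slicing scheme above can be sketched as follows. This is a minimal illustration under our own assumptions; the function name `make_slices` and the data layout are ours, not from the paper.

```python
# Sketch of the slice construction: slice i pairs each sentence's best
# parse with its i-th ranked non-best parse, in both orders.
# (Illustrative only; names and data layout are our own.)

def make_slices(parses_by_sentence, num_slices):
    """parses_by_sentence: list of (best_parse, other_parses) pairs, where
    other_parses is sorted by descending log-likelihood and excludes the
    best parse.  Returns num_slices lists of labeled preference pairs."""
    slices = []
    for i in range(num_slices):
        samples = []
        for best, others in parses_by_sentence:
            if i < len(others):
                cand = others[i]
                samples.append(((best, cand), +1))  # preferred order: positive
                samples.append(((cand, best), -1))  # reversed order: negative
        slices.append(samples)
    return slices
```

With 36,000 training sentences contributing two samples each, a slice holds up to 72,000 samples; sentences with fewer than i parses contribute none, which is consistent with the roughly 60,000 samples per slice reported above.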
    <Paragraph position="3"> For the tree kernel SVMs of Model 1, we take 3 slices as a chunk and train an SVM for each chunk. Due to the limitation of computing resources, we have only trained on 3 chunks. The results of the tree kernel SVMs are combined with simple combination. The outcome is then combined with the result of the linear kernel SVMs trained on features extracted from the derived trees, as reported in (Shen and Joshi, 2003). For each parse, the number of brackets in it and the log-likelihood given by Collins' parser Model 2 are also used in the computation of the score of a parse. For each parse p, its score Sco(p) is defined as follows:</Paragraph>
    <Paragraph position="4"> Sco(p) = MT(p) + a1 ML(p) + a2 l(p) + a3 b(p) </Paragraph>
    <Paragraph position="5"> where MT(p) is the output of the tree kernel SVMs, ML(p) is the output of the linear kernel SVMs, l(p) is the log-likelihood of parse p, and b(p) is the number of brackets in parse p. We noticed that the SVM systems prefer to give higher scores to parses with fewer brackets. As a result, the system has high precision but low recall. Therefore, we take the number of brackets, b(p), as a feature to balance recall and precision. The three weight parameters are tuned on the development data.</Paragraph>
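The score combination can be sketched in a few lines. The weight names a1, a2, a3 are ours (the paper says only that there are three weight parameters), and `rerank` is our own illustrative helper for picking the best-scoring candidate.

```python
# Sketch of the parse score: the tree kernel output, linear kernel
# output, log-likelihood, and bracket count are combined linearly.
# Weight names a1, a2, a3 are our own labels for the three tuned weights.

def sco(m_t, m_l, loglik, brackets, a1, a2, a3):
    """Combine the tree kernel output m_t, the linear kernel output m_l,
    the parser log-likelihood, and the bracket count into one score."""
    return m_t + a1 * m_l + a2 * loglik + a3 * brackets

def rerank(candidates, a1, a2, a3):
    """candidates: list of (m_t, m_l, loglik, brackets) tuples.
    Returns the index of the highest-scoring parse."""
    scores = [sco(*c, a1, a2, a3) for c in candidates]
    return scores.index(max(scores))
```

The bracket-count term acts as the recall/precision balance described above: a positive a3 offsets the SVMs' preference for parses with fewer brackets.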
    <Paragraph position="6"> The results are shown in Table 1. With Model 1, we achieve an LR/LP of 89.7%/90.0% on sentences with at most 100 words. [Table 1 caption: results on section 23 of the WSJ Treebank. LR/LP = labeled recall/precision. CBs = average number of crossing brackets per sentence. 0 CBs, 2 CBs = percentage of sentences with 0 or at most 2 crossing brackets, respectively. CO99 = (Collins, 1999) Model 2. CO00 = (Collins, 2000). CD02 = (Collins and Duffy, 2002). SJ03 = linear kernel of (Shen and Joshi, 2003). M1 = Model 1. M2 = Model 2.]</Paragraph>
    <Paragraph position="9"> Our results show a 17% relative difference in f-score improvement over the use of a linear kernel without LTAG based features (Shen and Joshi, 2003). In addition, we also obtain a non-trivial improvement in the number of crossing brackets. These results verify the benefit of using LTAG based features and confirm the hypothesis that LTAG based features provide a novel set of abstract features that complement the hand-selected features of (Collins, 2000). Our results with Model 1 show a 1% error reduction over the previous best reranking result on the dataset reported in (Collins, 2000). Also, Model 1 provides a 10% error reduction over (Collins and Duffy, 2002), where the tree kernel features were over arbitrary sub-trees.</Paragraph>
    <Paragraph position="10"> For Model 2, we first train 22 SVMs on 22 distinct slices. Then we combine the results of the individual SVMs with simple combination. However, the overall performance does not improve. But we notice that the use of LTAG based features gives rise to improvement on most of the single SVMs, as shown in Fig. 8. [Figure 8 caption: f-scores of the SVMs in Model 2, with and without LTAG based features. The X-axis stands for the ID of the slices on which the SVMs are trained; the Y-axis stands for the f-score.]</Paragraph>
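The "simple combination" step can be sketched as unweighted averaging of the per-SVM scores for each candidate parse; that reading is our assumption, since the paper names the method without defining it here.

```python
# Sketch of "simple combination": average each candidate's score across
# all trained SVMs, with every SVM weighted equally.
# (Our reading of the term; the paper does not spell it out here.)

def simple_combination(svm_outputs):
    """svm_outputs: one list of candidate scores per SVM, all the same
    length.  Returns the averaged score for each candidate parse."""
    n = len(svm_outputs)
    return [sum(scores) / n for scores in zip(*svm_outputs)]
```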
    <Paragraph position="11"> We think there are several reasons why our Model 2 does not work as well on the full task as Model 1. Firstly, each training slice is not large enough, and local optimization on each slice does not result in global optimization (as seen in Fig. 8). Secondly, the LTAG based features that we have used in the linear kernel in Model 2 are not as useful as the tree kernel in Model 1. The last reason is that we do not set the importance of the LTAG based features. One shortcoming of kernel methods is that the coefficient of each feature must be set before training (Herbrich, 2002). In our case, we did not tune the coefficients for the LTAG based features in Model 2.</Paragraph>
  </Section>
</Paper>