<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2902">
  <Title>Porting Statistical Parsers with Data-Defined Kernels</Title>
  <Section position="7" start_page="9" end_page="11" type="evalu">
    <SectionTitle>
5 Experimental Results
</SectionTitle>
    <Paragraph position="0"> We used the Penn Treebank WSJ corpus and the Brown corpus to evaluate our approach. We used the standard division of the WSJ corpus into training, validation, and testing sets. In the Brown corpus we ran separate experiments for sections F (informative prose: popular lore), K (imaginative prose: general fiction), N (imaginative prose: adventure and western fiction), and P (imaginative prose: romance and love story). These sections were selected because they are sufficiently large, and because they appeared to be maximally different from each other and from WSJ text. In each Brown corpus section, we selected every third sentence for testing. From the remaining sentences, we used 1 sentence out of 20 for the validation set, and the remainder for training. The resulting dataset sizes are presented in table 1.</Paragraph>
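The splitting scheme above (every third sentence for testing; of the remainder, one sentence in twenty for validation) can be sketched as follows; the function and variable names are illustrative, not taken from the paper's code.

```python
def split_brown_section(sentences):
    """Split a Brown corpus section as described in the paper:
    every third sentence goes to the test set; of the remaining
    sentences, every twentieth goes to validation, the rest to
    training."""
    test = [s for i, s in enumerate(sentences) if i % 3 == 0]
    rest = [s for i, s in enumerate(sentences) if i % 3 != 0]
    valid = [s for i, s in enumerate(rest) if i % 20 == 0]
    train = [s for i, s in enumerate(rest) if i % 20 != 0]
    return train, valid, test
```

On a 60-sentence section this yields 20 test, 2 validation, and 38 training sentences, matching the roughly 1/3 and 1/20 proportions described above.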
    <Paragraph position="1"> For the large margin classifier, we used the SVM-Struct (Tsochantaridis et al., 2004) implementation of SVMs, which rescales the margin with the F1 measure of bracketed constituents (see (Tsochantaridis et al., 2004) for details). A linear slack penalty was employed.2</Paragraph>
    <Section position="1" start_page="9" end_page="10" type="sub_section">
      <SectionTitle>
5.1 Experiments on Transferring across
Domains
</SectionTitle>
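The SVM-Struct objective mentioned above rescales the margin by the F1 measure of bracketed constituents. A minimal sketch of that F1 computation, representing each parse as a set of labeled spans (this representation is an assumption for illustration, not taken from SVM-Struct):

```python
def bracket_f1(gold, guess):
    """F1 over bracketed constituents, each represented as a
    (label, start, end) tuple. Returns 1.0 when both parses
    contain no constituents."""
    gold, guess = set(gold), set(guess)
    if not gold and not guess:
        return 1.0
    correct = len(gold.intersection(guess))
    if correct == 0:
        return 0.0
    precision = correct / len(guess)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```

In margin rescaling, the required margin between the correct parse and a candidate grows with 1 minus this score, so parses that are nearly correct need only a small margin.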
      <Paragraph position="0"> To evaluate the pure porting scenario (transferring), described in section 3.1, we trained the SSN parsing model on the WSJ corpus. For each tag, there is an unknown-word vocabulary item which is used for all those words that are not sufficiently frequent with that tag to be included individually in the vocabulary. (Footnote 2: Training of the SVM takes about 3 hours on a standard desktop PC. Running the SVM is very fast, once the probabilistic model has finished computing the probabilities needed to select the candidate parses.)</Paragraph>
      <Paragraph position="1"> [Table 1: dataset sizes.]</Paragraph>
      <Paragraph position="2"> In the vocabulary of the parser, we included the unknown-word items and the words which occurred in the training set at least 20 times. This led to a vocabulary of 4,215 tag-word pairs.</Paragraph>
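The frequency cutoff described above can be sketched as a count-and-threshold pass over (tag, word) pairs, with rarer words falling back to per-tag unknown-word items; the helper name and the "UNKNOWN" placeholder are illustrative assumptions.

```python
from collections import Counter

def build_vocabulary(tagged_sentences, min_count=20):
    """Keep (tag, word) pairs seen at least min_count times;
    all other words are covered by a per-tag unknown-word item."""
    counts = Counter(
        (tag, word)
        for sentence in tagged_sentences
        for word, tag in sentence
    )
    vocab = {pair for pair, c in counts.items() if c >= min_count}
    unknown = {(tag, "UNKNOWN") for tag, _ in counts}
    return vocab | unknown
```

The same routine with min_count=2 over a Brown section's training set would give the smaller, domain-specific vocabularies used for reparameterization in the next paragraph.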
      <Paragraph position="3"> We derived the kernel from the trained model for each target section (F, K, N, P) using the reparameterization discussed in section 3.1: we included in the vocabulary all the words which occurred at least twice in the training set of the corresponding section. This approach led to a vocabulary that is smaller than that of the initial parser but specifically tied to the target domain (3,613, 2,789, 2,820 and 2,553 tag-word pairs for sections F, K, N and P, respectively). There is no point in including words from the WSJ which do not appear in the Brown section training set, because the classifier cannot learn the corresponding components of its decision vector. The results for the original probabilistic model (SSN-WSJ) and for the kernel method (TOP-Transfer) on the testing set of each section are presented in table 2.3 To evaluate the relative contribution of our porting technique versus the use of the TOP kernel alone, we also used this TOP kernel to train an SVM on the WSJ corpus. We trained the SVM on data from the development set and section 0, so that the size of this dataset (3,267 sentences) was about the same as for each Brown section.4 (Footnote 4: [...] a fair test of the contribution of the TOP kernel alone. It would also not be computationally tractable to train an SVM on the full WSJ dataset without using different training techniques, which would then compromise the comparison.)</Paragraph>
      <Paragraph position="4"> This gave us a &amp;quot;TOP-WSJ&amp;quot; model, which we tested on each of the four Brown sections. In each case, the TOP-WSJ model did worse than the original SSN-WSJ model, as shown in table 2. This makes it clear that we are getting no improvement from simply using a TOP kernel alone or simply using more data; all of our improvement comes from the proposed porting method.</Paragraph>
    </Section>
    <Section position="2" start_page="10" end_page="10" type="sub_section">
      <SectionTitle>
5.2 Experiments on Focusing on a Subdomain
</SectionTitle>
      <Paragraph position="0"> To perform the experiments on the approach suggested in section 3.2 (focusing), we trained the SSN parser on the WSJ training set joined with the training set of the corresponding section. We included in the vocabulary only words which appeared in the joint training set at least 20 times. The resulting vocabularies comprised 4,386, 4,365, 4,367 and 4,348 tag-word pairs for sections F, K, N and P, respectively.5 Experiments were done in the same way as for the parser transferring approach, but without reparameterization. Standard measures of accuracy for the original probabilistic model (SSN-WSJ+Br) and the kernel method (TOP-Focus) are also shown in table 2.</Paragraph>
      <Paragraph position="1"> For the sake of comparison, we also trained the SSN parser on only training data from one of the Brown corpus sections (section P), producing a &amp;quot;SSN-Brown&amp;quot; model. This model achieved an F1 measure of only 81.0% for the P section testing set, which is worse than all the other models and is 3% lower than our best results on this testing set (TOP-Focus). This result underlines the need to port parsers from domains in which there are large annotated datasets.</Paragraph>
    </Section>
    <Section position="3" start_page="10" end_page="10" type="sub_section">
      <SectionTitle>
5.3 Experiments Comparing Vocabulary to
Structure
</SectionTitle>
      <Paragraph position="0"> We conducted the same set of experiments with the kernel using only vocabulary features (TOP-Voc-Transfer and TOP-Voc-Focus) and with the kernel using only structural features (TOP-Str-Transfer and TOP-Str-Focus). Average results for classifiers with these kernels, as well as for the original kernel and the baseline, are presented in table 3.</Paragraph>
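Comparing vocabulary-only and structure-only kernels amounts to restricting the data-defined feature vector to one block of its components before taking inner products. A hedged sketch under simplifying assumptions: the linear kernel form and the explicit index lists are illustrative, since the actual TOP kernel features are derived from the trained SSN.

```python
def restricted_linear_kernel(x, y, indices):
    """Linear kernel computed only over the selected components,
    e.g. the vocabulary block or the structural block of a
    data-defined feature vector."""
    return sum(x[i] * y[i] for i in indices)
```

A vocabulary-only kernel would call this with the indices of the vocabulary block, and a structure-only kernel with the indices of the structural block; the full kernel uses all indices.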
      <Paragraph position="1"> 5We would expect some improvement if we used a smaller threshold on the target domain, but preliminary results suggest that this improvement would be small.</Paragraph>
    </Section>
    <Section position="4" start_page="10" end_page="11" type="sub_section">
      <SectionTitle>
5.4 Discussion of Results
</SectionTitle>
      <Paragraph position="0"> For the experiments which directly test the usefulness of our proposed porting technique (SSN-WSJ versus TOP-Transfer), our technique demonstrated improvement for each of the Brown sections (table 2), and this improvement was significant for three out of four of the sections (K, N, and P).6 This demonstrates that data-defined kernels are an effective way to port parsers to a new domain.</Paragraph>
      <Paragraph position="1"> For the experiments which combine training a new probability model with our porting technique (SSN-WSJ+Br versus TOP-Focus), our technique still demonstrated improvement over training alone.</Paragraph>
      <Paragraph position="2"> There was improvement for each of the Brown sections, and this improvement was significant for two out of four of the sections (F and K).6 (Footnote 6: We measured significance in F1 measure at the 5% level with the randomized significance test of (Yeh, 2000). We think that the reason the improvement on section F was only significant at the 10% level is that the baseline model (SSN-WSJ) was particularly lucky, as indicated by the fact that it did even better than the model trained on the combination of datasets (SSN-WSJ+Br).)</Paragraph>
      <Paragraph position="3"> [Table: results for sections F, K, N and P of the Brown corpus.]</Paragraph>
      <Paragraph position="4"> This demonstrates that, even when the probability model is well suited to the target domain, there is still room for improvement from using data-defined kernels to optimize the parser specifically to the target domain without losing information about the source domain.</Paragraph>
      <Paragraph position="5"> One potential criticism of these conclusions is that the improvement could be the result of reranking with the TOP kernel, and have nothing to do with porting. The lack of an improvement in the TOP-WSJ results discussed in section 5.1 clearly shows that this cannot be the explanation. The opposite criticism is that the improvement could be the result of optimizing to the target domain alone. The poor performance of the SSN-Brown model discussed in section 5.2 makes it clear that this also cannot be the explanation. Therefore, reranking with data-defined kernels must be both effective at preserving information about the source domain and effective at specializing to the target domain.</Paragraph>
      <Paragraph position="6"> The experiments which test the hypothesis that differences in vocabulary distributions are more important than differences in syntactic structure distributions confirm this belief. Results for the classifier which uses the kernel with only vocabulary features are better than those for structural features in each of the four sections, with both the Transfer and Focus scenarios. In addition, comparing the results of TOP-Transfer with TOP-Voc-Transfer and TOP-Focus with TOP-Voc-Focus, we can see that adding structural features in TOP-Focus and TOP-Transfer leads to virtually no improvement. This suggests that differences in vocabulary distributions are the only issue we need to address, although this result could also be an indication that our method did not sufficiently exploit structural differences.</Paragraph>
      <Paragraph position="7"> In this paper we concentrate on the situation where a parser is needed for a restricted target domain, for which only a small amount of data is available. We believe that this is the task which is of greatest practical interest. For this reason we do not run experiments on the task considered in (Gildea, 2001) and (Roark and Bacchiani, 2003), where they are porting from the restricted domain of the WSJ corpus to the more varied domain of the Brown corpus as a whole. However, to help emphasize the success of our proposed porting method, it is relevant to show that even our baseline models are performing better than this previous work on parser portability. We trained and tested the SSN parser in their &amp;quot;de-focusing&amp;quot; scenario using the same datasets as (Roark and Bacchiani, 2003). When trained only on the WSJ data (analogously to the SSN-WSJ baseline for TOP-Transfer) it achieves results of 82.9%/83.4% LR/LP and 83.2% F1, and when trained on data from both domains (analogously to the SSN-WSJ+Br baselines for TOP-Focus) it achieves results of 86.3%/87.6% LR/LP and 87.0% F1. These results represent a 2.2% and 1.3% increase in F1 over the best previous results, respectively (see the discussion of (Roark and Bacchiani, 2003) below).</Paragraph>
    </Section>
  </Section>
</Paper>