<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1043"> <Title>Reranking and Self-Training for Parser Adaptation</Title> <Section position="7" start_page="340" end_page="342" type="evalu"> <SectionTitle> 5 Analysis </SectionTitle> <Paragraph position="0"> We perform several types of analysis to measure some of the differences and similarities between the BROWN-trained and WSJ-trained reranking parsers. While the two parsers agree on a large number of parse brackets (Section 5.2), there are categorical differences between them (as seen in on the SWITCHBOARD development corpus. In this case, WSJ+NANC is a model created from WSJ and 1,750k sentences from NANC.</Paragraph> <Paragraph position="1"> duced by baseline WSJ parser, a combined WSJ and NANC parser, and a baseline BROWN parser.</Paragraph> <Paragraph position="2"> Section 5.3).</Paragraph> <Section position="1" start_page="340" end_page="340" type="sub_section"> <SectionTitle> 5.1 Oracle Scores </SectionTitle> <Paragraph position="0"> Table 6 shows the f-scores of an &quot;oracle reranker&quot; -- i.e. one which would always choose the parse with the highest f-score in the n-best list. While the WSJ parser has relatively low f-scores, adding NANC data results in a parser with comparable oracle scores as the parser trained from BROWN training. Thus, the WSJ+NANC model has better oracle rates than the WSJ model (McClosky et al., 2006) for both the WSJ and BROWN domains.</Paragraph> </Section> <Section position="2" start_page="340" end_page="340" type="sub_section"> <SectionTitle> 5.2 Parser Agreement </SectionTitle> <Paragraph position="0"> In this section, we compare the output of the WSJ+NANC-trained and BROWN-trained reranking parsers. We use evalb to calculate how similar the two sets of output are on a bracket level. Table 7 shows various statistics. The two parsers achieved an 88.0% f-score between them. Additionally, the two parsers agreed on all brackets almost half the time. The part of speech tagging agreement is fairly high as well. Considering they were created from different corpora, this seems like a high level of agreement.</Paragraph> </Section> <Section position="3" start_page="340" end_page="342" type="sub_section"> <SectionTitle> 5.3 Statistical Analysis </SectionTitle> <Paragraph position="0"> We conducted randomization tests for the significance of the difference in corpus f-score, based on the randomization version of the paired sample t-test described by Cohen (1995). The null hypothesis is that the two parsers being compared are in fact behaving identically, so permuting or swapping the parse trees produced by the parsers for of NANC sentences added under four test conditions. &quot;BROWN tuned&quot; indicates that BROWN training data was used to tune the parameters (since the normal held-out section was being used for testing). For &quot;WSJ tuned,&quot; we tuned the parameters from section 24 of WSJ. Tuning on BROWN helps the parser, but not for the reranking parser.</Paragraph> <Paragraph position="1"> ment. The reranking parser used the WSJ-trained reranker model. The BROWN parsing model is naturally better than the WSJ model for this task, but combining the two training corpora results in a better model (as in Gildea (2001)). Adding small amounts of NANC further improves the models. test. The WSJ+NANC parser with the WSJ reranker comes close to the BROWN-trained reranking parser. 
<Paragraph position="1"> [Figure/table caption fragment: ... of NANC sentences added under four test conditions. &quot;BROWN tuned&quot; indicates that BROWN training data was used to tune the parameters (since the normal held-out section was being used for testing). For &quot;WSJ tuned,&quot; we tuned the parameters on section 24 of WSJ. Tuning on BROWN helps the parser, but not the reranking parser.]</Paragraph>
<Paragraph position="2"> [Table caption fragment: ... The reranking parser used the WSJ-trained reranker model. The BROWN parsing model is naturally better than the WSJ model for this task, but combining the two training corpora results in a better model (as in Gildea (2001)). Adding small amounts of NANC further improves the models.]</Paragraph>
<Paragraph position="3"> [Table caption fragment: ... test. The WSJ+NANC parser with the WSJ reranker comes close to the BROWN-trained reranking parser. The BROWN reranker provides only a small improvement over its WSJ counterpart, which is not statistically significant.]</Paragraph>
<Paragraph position="4"> [Table caption fragment: ... parser with the WSJ reranker and the BROWN parser with the BROWN reranker. Complete match is how often the two reranking parsers returned the exact same parse.]</Paragraph>
<Paragraph position="5"> In order to better understand the difference between the reranking parser trained on Brown and the WSJ+NANC/WSJ reranking parser (a reranking parser with the first stage trained on WSJ+NANC and the second stage trained on WSJ) on Brown data, we constructed a logistic regression model of the difference between the two parsers' f-scores on the development data using the R statistical package. Of the 2,078 sentences in the development data, 29 sentences were discarded because evalb failed to evaluate at least one of the parses. A Wilcoxon signed rank test on the remaining 2,049 paired sentence-level f-scores was significant at p = 0.0003. Of these 2,049 sentences, there were 983 parse pairs with the same sentence-level f-score. Of the 1,066 sentences for which the parsers produced parses with different f-scores, there were 580 sentences for which the BROWN/BROWN parser produced a parse with a higher sentence-level f-score and 486 sentences for which the WSJ+NANC/WSJ parser produced a parse with a higher f-score.</Paragraph>
<Paragraph position="6"> We constructed a generalized linear model with a binomial link, with BROWN/BROWN f-score &gt; WSJ+NANC/WSJ f-score as the predicted variable, and sentence length, the number of prepositions (IN), the number of conjunctions (CC), and subcorpus ID as explanatory variables. Model selection (using the &quot;step&quot; procedure) discarded all but the IN and Brown ID explanatory variables. The final estimated model is shown in Table 9. It shows that the WSJ+NANC/WSJ parser becomes more likely to have a higher f-score than the BROWN/BROWN parser as the number of prepositions in the sentence increases, and that the BROWN/BROWN parser is more likely to have a higher f-score on Brown sections K, N, P, G, and L (these are the general fiction, adventure and western fiction, romance and love story, letters and memories, and mystery sections of the Brown corpus, respectively). The three sections of BROWN not in this list are F, M, and R (popular lore, science fiction, and humor).</Paragraph>
<Paragraph position="7"> [Table 9 caption fragment: ... Brown f-score &gt; WSJ+NANC/WSJ f-score identified by model selection. The feature IN is the number of prepositions in the sentence, while ID identifies the Brown subcorpus that the sentence comes from. Stars indicate significance level.]</Paragraph>
</Section>
</Section>
</Paper>
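The logistic-regression analysis in Section 5.3 can be reproduced along the following lines. The paper fit the model in R, with the &quot;step&quot; procedure for model selection; this sketch instead uses Python's statsmodels, and the input file name, the column names, and the explicit AIC comparison standing in for automatic stepwise selection are illustrative assumptions.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per development sentence on which the two parsers' f-scores differ:
#   brown_wins: 1 if the BROWN/BROWN parse has the higher sentence-level f-score, else 0
#   length, IN, CC: sentence length, number of prepositions, number of conjunctions
#   ID: Brown subcorpus identifier (treated as a categorical variable)
df = pd.read_csv("brown_dev_fscore_pairs.csv")  # hypothetical input file

full = smf.glm("brown_wins ~ length + IN + CC + C(ID)",
               data=df, family=sm.families.Binomial()).fit()
reduced = smf.glm("brown_wins ~ IN + C(ID)",
                  data=df, family=sm.families.Binomial()).fit()

# R's step() automates AIC-based selection; here we simply compare the full
# model against the reduced model that keeps only IN and the subcorpus ID,
# the two predictors retained in the paper's final model.
print("AIC full:", full.aic, "AIC reduced:", reduced.aic)
print(reduced.summary())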