Reranking and Self-Training for Parser Adaptation

3 Corpora

We primarily use three corpora in this paper. Self-training requires labeled and unlabeled data. We assume that these sets of data must be in similar domains (e.g. news articles), though the effectiveness of self-training across domains is currently an open question. Thus, we have labeled (WSJ) and unlabeled (NANC) out-of-domain data and labeled in-domain data (BROWN). Unfortunately, lacking a corpus corresponding to NANC for BROWN, we cannot perform the opposite scenario and adapt BROWN to WSJ.

3.1 Brown

The BROWN corpus (Francis and Kučera, 1979) consists of many different genres of text, intended to approximate a "balanced" corpus. While the full corpus consists of fiction and nonfiction domains, the sections that have been annotated in Treebank II bracketing are primarily those containing fiction. Examples of the annotated sections include science fiction, humor, romance, mystery, adventure, and "popular lore." We use the same divisions as Bacchiani et al. (2006), who base their divisions on Gildea (2001). Each division of the corpus consists of sentences from all available genres. The training division consists of approximately 80% of the data, while the held-out development and testing divisions each make up 10% of the data. The treebanked sections contain approximately 25,000 sentences (458,000 words).

3.2 Wall Street Journal

Our out-of-domain data is the Wall Street Journal (WSJ) portion of the Penn Treebank (Marcus et al., 1993), which consists of about 40,000 sentences (one million words) annotated with syntactic information. We use the standard divisions: sections 2 through 21 are used for training, section 24 for held-out development, and section 23 for final testing.

3.3 North American News Corpus

In addition to labeled news data, we make use of a large quantity of unlabeled news data. The unlabeled data is the North American News Corpus, NANC (Graff, 1995), which contains approximately 24 million unlabeled sentences from various news sources. NANC contains no syntactic information, and sentence boundaries are induced by a simple discriminative model. We also perform some basic cleanups on NANC to ease parsing. NANC contains news articles from various news sources, including the Wall Street Journal, though for this paper we only use articles from the LA Times portion.

To use the data from NANC, we use self-training (McClosky et al., 2006). First, we take a WSJ-trained reranking parser (i.e. both the parser and reranker are built from WSJ training data) and parse the sentences from NANC with the 50-best (Charniak and Johnson, 2005) parser. Next, the 50-best parses are reordered by the reranker. Finally, the 1-best parses after reranking are combined with the WSJ training set to retrain the first-stage parser.1 McClosky et al. (2006) find that the self-trained models help considerably when parsing WSJ.
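To make the three steps of the pipeline concrete, here is a minimal sketch in Python. The callables parse_50_best, rerank, and train_parser are hypothetical stand-ins for the Charniak-Johnson components, not their actual interfaces.

```python
def self_train(wsj_trees, nanc_sentences, parse_50_best, rerank, train_parser):
    """Self-training as described above: parse unlabeled NANC with the
    WSJ reranking parser, keep the reranker's 1-best parse for each
    sentence, and retrain the first-stage parser on the combined data."""
    auto_parsed = []
    for sentence in nanc_sentences:
        candidates = parse_50_best(sentence)  # 50-best list from the first-stage parser
        best = rerank(candidates)             # reranker reorders; keep its top parse
        auto_parsed.append(best)
    # Combine gold WSJ trees with automatically parsed NANC trees and retrain.
    return train_parser(wsj_trees + auto_parsed)
```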
4 Experiments

We use the Charniak and Johnson (2005) reranking parser in our experiments. Unless mentioned otherwise, we use the WSJ-trained reranker (as opposed to a BROWN-trained reranker). To evaluate, we report bracketing f-scores.2 Parser f-scores reported are for sentences up to 100 words long, while reranking parser f-scores are over all sentences. For simplicity and ease of comparison, most of our evaluations are performed on the development section of BROWN.

4.1 Adapting self-training

Our first experiment examines the performance of the self-trained parsers. While the parsers are created entirely from labeled WSJ data and unlabeled NANC data, they perform extremely well on BROWN development (Table 2). The trends are the same as in McClosky et al. (2006): adding NANC data improves parsing performance on BROWN development considerably, improving the f-score from 83.9% to 86.4%. As more NANC data is added, the f-score appears to approach an asymptote. The NANC data appears to help reduce data sparsity and fill in some of the gaps in the WSJ model. Additionally, the reranker provides further benefit and adds an absolute 1-2% to the f-score. The improvements appear to be orthogonal, as our best performance is reached when we use the reranker and add 2,500k self-trained sentences.

(Table 2: results for the parser with and without the WSJ reranker when evaluating on BROWN development; the WSJ-trained reranker is used for this experiment.)

The results are even more surprising when we compare against a parser3 trained on the labeled training section of the BROWN corpus, with parameters tuned against its held-out section. Despite seeing no in-domain data, the WSJ-based parser is able to match the BROWN-based parser.

For the remainder of this paper, we will refer to the model trained on WSJ+2,500k sentences of NANC as our "best WSJ+NANC" model. We also note that this "best" parser is different from the "best" parser for parsing WSJ, which was trained on WSJ with a relative weight4 of 5 and 1,750k sentences from NANC. For parsing BROWN, though, the difference between these two parsers is not large.

Increasing the relative weight of WSJ sentences versus NANC sentences when testing on BROWN development does not appear to have a significant effect. While McClosky et al. (2006) showed that this technique was effective when testing on WSJ, in that setting the true distribution was closer to WSJ, so it made sense to emphasize it.

4.2 Incorporating In-Domain Data

Up to this point, we have only considered the situation where we have no in-domain data. We now explore different ways of making use of labeled and unlabeled in-domain data.

3 In this case, only the parser is trained on BROWN. In section 4.3, we compare against a fully BROWN-trained reranking parser as well.

4 A relative weight of n is equivalent to using n copies of the corpus, i.e. an event that occurred x times in the corpus would occur x × n times in the weighted corpus. Thus, larger corpora will tend to dominate smaller corpora of the same relative weight in terms of event counts.
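As a small illustration of the relative weighting described in footnote 4, the sketch below simply multiplies event counts by a corpus-level weight. The rule names and counts are invented for the example; they are not figures from the paper.

```python
from collections import Counter

def weighted_counts(corpora):
    """corpora: iterable of (event_counts, relative_weight) pairs."""
    total = Counter()
    for counts, weight in corpora:
        for event, x in counts.items():
            total[event] += x * weight  # an event seen x times contributes x * n
    return total

# Toy counts: even at relative weight 5, the smaller corpus is dominated by
# the much larger one, as footnote 4 points out.
small = Counter({"NP -> DT NN": 1_000, "VP -> VBD NP": 400})
large = Counter({"NP -> DT NN": 20_000, "VP -> VBD NP": 9_000})
print(weighted_counts([(small, 5), (large, 1)]))
```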
Bacchiani et al. (2006) apply self-training to parser adaptation to utilize unlabeled in-domain data. The authors find that it helps quite a bit when adapting from BROWN to WSJ. They use a parser trained on the BROWN train set to parse WSJ and add the parsed WSJ sentences to their training set.

We perform a similar experiment, using our WSJ-trained reranking parser to parse BROWN train and testing on BROWN development. We achieved a boost from 84.8% to 85.6% when we added the parsed BROWN sentences to our training. Adding 1,000k sentences from NANC as well, we saw a further increase to 86.3%. However, the technique does not seem as effective in our case. While the self-trained BROWN data helps the parser, it adversely affects the performance of the reranking parser. When self-trained BROWN data is added to WSJ training, the reranking parser's performance drops from 86.6% to 86.1%. We see a similar degradation as NANC data is added to the training set as well. We are not yet able to explain this unusual behavior.

We now turn to the scenario where we have some labeled in-domain data. The most obvious way to incorporate labeled in-domain data is to combine it with the labeled out-of-domain data. We have already seen the results that Gildea (2001) and Bacchiani et al. (2006) achieve in Table 1. We explore various combinations of the BROWN, WSJ, and NANC corpora. Because we are mainly interested in exploring techniques with self-trained models rather than optimizing performance, we only consider weighting each corpus with a relative weight of one for this paper. The models generated are tuned on section 24 from WSJ. The results are summarized in Table 3.

While both the WSJ and BROWN models benefit from a small amount of NANC data, adding more than 250k NANC sentences to the BROWN or combined models causes their performance to drop. This is not surprising, though, since adding "too much" NANC overwhelms the more accurate BROWN or WSJ counts. By weighting the counts from each corpus appropriately, this problem can be avoided.

Another way to incorporate labeled data is to tune the parser back-off parameters on it. Bacchiani et al. (2006) report that tuning on held-out BROWN data gives a large improvement over tuning on WSJ data. The improvement is mostly (but not entirely) in precision. We do not see the same improvement (Figure 1), but this is likely due to differences in the parsers. However, we do see a similar improvement for parsing accuracy once NANC data has been added. The reranking parser generally sees an improvement, but it does not appear to be significant.
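For concreteness, the corpus combinations behind Table 3 could be enumerated along the lines of the sketch below: each labeled base (WSJ, BROWN, or both) paired with increasing amounts of self-trained NANC, every corpus at relative weight one. The corpus identifiers, NANC increments, and train_and_eval callable are illustrative placeholders, not the paper's actual experimental harness.

```python
from itertools import product

labeled_bases = {
    "WSJ": ["wsj-train"],
    "BROWN": ["brown-train"],
    "WSJ+BROWN": ["wsj-train", "brown-train"],
}
nanc_increments = [0, 250_000, 1_000_000, 2_500_000]  # illustrative sentence counts

def run_grid(train_and_eval):
    """train_and_eval: callable mapping (labeled corpora, NANC sentence count)
    to a bracketing f-score on BROWN development."""
    results = {}
    for (base_name, corpora), n_nanc in product(labeled_bases.items(), nanc_increments):
        results[(base_name, n_nanc)] = train_and_eval(corpora, n_nanc)
    return results
```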
4.3 Reranker Portability

We have shown that the WSJ-trained reranker is actually quite portable to the BROWN fiction domain. This is surprising given the large number of features (over a million in the case of the WSJ reranker) tuned to adjust for errors made in the 50-best lists by the first-stage parser. It would seem that the corrections memorized by the reranker are not as domain-specific as we might expect.

As further evidence, we present the results of applying the WSJ model to the Switchboard corpus -- a domain much less similar to WSJ than BROWN is. In Table 4, we see that while the parser's performance is low, self-training and reranking provide orthogonal benefits. The improvements represent a 12% error reduction with no additional in-domain data. Naturally, in-domain data and speech-specific handling (e.g. disfluency modeling) would probably help dramatically as well.

Finally, to compare against a model fully trained on BROWN data, we created a BROWN reranker. We parsed the BROWN training set with 20-fold cross-validation, selected features that occurred 5 times or more in the training set, and fed the 50-best lists from the parser to a numerical optimizer to estimate feature weights. The resulting reranker model had approximately 700,000 features, about half as many as the WSJ-trained reranker. This may be due to the smaller size of the BROWN training set or because the feature schemas for the reranker were developed on WSJ data. As seen in Table 5, the BROWN reranker is not a significant improvement over the WSJ reranker for parsing BROWN data.
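The cross-validated construction of the BROWN reranker's training data can be sketched roughly as follows. The callables train_parser, parse_50_best_with, and extract_features, as well as the tree.sentence attribute, are hypothetical stand-ins, and the numerical optimizer that estimates feature weights is omitted.

```python
from collections import Counter

def make_reranker_training(brown_train, train_parser, parse_50_best_with,
                           extract_features, n_folds=20, min_count=5):
    # Split the treebank into folds; each fold is parsed by a model trained
    # on the other folds, so its 50-best lists reflect realistic parser errors.
    folds = [brown_train[i::n_folds] for i in range(n_folds)]
    fifty_best_lists = []
    for i, held_out in enumerate(folds):
        rest = [tree for j, fold in enumerate(folds) if j != i for tree in fold]
        model = train_parser(rest)
        fifty_best_lists += [parse_50_best_with(model, tree.sentence) for tree in held_out]

    # Keep only features occurring at least min_count times across all candidates.
    counts = Counter(feat for candidates in fifty_best_lists
                     for parse in candidates for feat in extract_features(parse))
    kept = {feat for feat, count in counts.items() if count >= min_count}
    # The pruned 50-best lists would then be handed to a numerical optimizer
    # to estimate feature weights (not shown here).
    return fifty_best_lists, kept
```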