<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1031"> <Title>Example Selection for Bootstrapping Statistical Parsers</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Selecting Training Examples </SectionTitle>
<Paragraph position="0"> In each iteration, selection is performed in two steps.</Paragraph>
<Paragraph position="1"> First, each parser uses some scoring function, f, to assess the parses it generated for the sentences in the cache (in our experiments, both parsers use the same scoring function). Second, the central control uses some selection method, S, to choose a subset of these labeled sentences (based on the scores assigned by f) to add to the parsers' training data. The focus of this paper is on the selection phase, but to investigate the effect of different selection methods more fully, we also consider two possible scoring functions.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Scoring functions </SectionTitle>
<Paragraph position="0"> The scoring function attempts to quantify the correctness of the parses produced by each parser. An ideal scoring function would give the true accuracy rates (e.g., F-score, the combined labeled precision and recall rates).</Paragraph>
<Paragraph position="1"> In practice, accuracy is approximated by some notion of confidence. For example, one easy-to-compute scoring function measures the conditional probability of the (most likely) parse. If a high probability is assigned, the parser is said to be confident in the label it produced.</Paragraph>
<Paragraph position="2"> In our experimental studies, we considered the selection methods' interaction with two scoring functions: an oracle scoring function fF-score that returns the F-score of the parse as measured against a gold standard, and a practical scoring function fprob that returns the conditional probability of the parse. (A nice property of using the conditional probability, Pr(parse|sentence), as the scoring function is that it normalizes for sentence length.)</Paragraph>
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Selection methods </SectionTitle>
<Paragraph position="0"> Based on the scores assigned by the scoring function, the selection method chooses a subset of the parser-labeled sentences that best satisfies some selection criteria.</Paragraph>
<Paragraph position="1"> One such criterion is the accuracy of the labeled examples, which may be estimated by the teacher parser's confidence in its labels. However, the examples that the teacher correctly labeled may not be those that the student needs. We hypothesize that the training utility of the examples for the student parser is another important criterion.</Paragraph>
<Paragraph position="2"> Training utility measures the improvement a parser would make if a sentence were correctly labeled and added to the training set. Like accuracy, the utility of an unlabeled sentence is difficult to quantify; therefore, we approximate it with values that can be computed from features of the sentence. For example, sentences containing many unknown words may have high training utility; so might sentences that a parser has trouble parsing. Under the co-training framework, we estimate the training utility of a sentence for the student by comparing the score the student assigned to its parse (according to its scoring function) against the score the teacher assigned to its own parse.</Paragraph>
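As a concrete illustration of the two scoring functions, the following is a minimal sketch (not the authors' implementation); the constituent representation and the parser interface are hypothetical assumptions.

# Sketch of the two scoring functions described in Section 3.1.
# Constituents are assumed to be (label, start, end) tuples; the parser
# object and its methods are hypothetical stand-ins.

def f_f_score(parse_constituents, gold_constituents):
    """Oracle score: Parseval F-score of a parse against the gold standard."""
    if not parse_constituents or not gold_constituents:
        return 0.0
    matched = len(set(parse_constituents) & set(gold_constituents))
    lp = matched / len(parse_constituents)   # labeled precision
    lr = matched / len(gold_constituents)    # labeled recall
    return 0.0 if matched == 0 else 2 * lp * lr / (lp + lr)

def f_prob(parser, sentence):
    """Practical score: conditional probability Pr(parse | sentence) of the
    most likely parse; this normalizes for sentence length."""
    best_parse = parser.parse(sentence)                 # hypothetical API
    return parser.conditional_probability(best_parse)   # hypothetical API

For example, with LP = 0.80 and LR = 0.85, the F-score is 2 x 0.80 x 0.85 / (0.80 + 0.85), or about 0.824.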
<Paragraph position="3"> To investigate how the selection criteria of utility and accuracy affect the co-training process, we considered a number of selection methods that satisfy the requirements of accuracy and training utility to varying degrees. The different selection methods are shown below. For each method, a sentence (as labeled by the teacher parser) is selected if: * above-n (Sabove-n): the score of the teacher's parse (using its scoring function) ≥ n.</Paragraph>
<Paragraph position="4"> * difference (Sdiff-n): the score of the teacher's parse is greater than the score of the student's parse by some threshold n.</Paragraph>
<Paragraph position="5"> * intersection (Sint-n): the score of the teacher's parse is in the set of the teacher's n percent highest-scoring labeled sentences, and the score of the student's parse for the same sentence is in the set of the student's n percent lowest-scoring labeled sentences.</Paragraph>
<Paragraph position="6"> Each selection method has a control parameter, n, that determines the number of labeled sentences to add at each co-training iteration. It also serves as an indirect control on the number of errors added to the training set. For example, the Sabove-n method would allow more sentences to be selected if n were set to a low value (with respect to the scoring function); however, this is likely to reduce the accuracy rate of the training set.</Paragraph>
<Paragraph position="7"> The above-n method attempts to maximize the accuracy of the data (assuming that parses with higher scores are more accurate). The difference method attempts to maximize training utility: as long as the teacher's labeling is more accurate than that of the student, it is chosen, even if its absolute accuracy rate is low. The intersection method attempts to maximize both: the selected sentences are accurately labeled by the teacher and incorrectly labeled by the student.</Paragraph>
</Section> </Section>
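The following is a schematic sketch of the three selection methods (hypothetical data structures; not the authors' implementation). It assumes each cached sentence carries the teacher's score for its own parse and the student's score for its parse of the same sentence.

# Sketch of the three selection methods over a scored cache.
from dataclasses import dataclass
from typing import List

@dataclass
class ScoredSentence:
    sentence: str
    teacher_parse: object      # the teacher's labeling of the sentence
    teacher_score: float       # score of the teacher's parse (teacher's f)
    student_score: float       # score of the student's parse (student's f)

def select_above_n(cache: List[ScoredSentence], n: float):
    """S_above-n: the teacher's score is at least n."""
    return [s for s in cache if s.teacher_score >= n]

def select_difference(cache: List[ScoredSentence], n: float):
    """S_diff-n: the teacher's score exceeds the student's score by at least n."""
    return [s for s in cache if s.teacher_score - s.student_score >= n]

def select_intersection(cache: List[ScoredSentence], n_percent: float):
    """S_int-n: the teacher's parse is in the teacher's top n percent and the
    student's parse for the same sentence is in the student's bottom n percent."""
    k = max(1, int(len(cache) * n_percent / 100))
    by_teacher = sorted(range(len(cache)),
                        key=lambda i: cache[i].teacher_score, reverse=True)
    by_student = sorted(range(len(cache)),
                        key=lambda i: cache[i].student_score)
    chosen = set(by_teacher[:k]) & set(by_student[:k])
    return [cache[i] for i in sorted(chosen)]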
<Section position="5" start_page="0" end_page="80" type="metho"> <SectionTitle> 4 Experiments </SectionTitle>
<Paragraph position="0"> Experiments were performed to compare the effect of the selection methods on co-training and corrected co-training. We consider a selection method, S1, superior to another, S2, if, once a large unlabeled pool of sentences has been exhausted, the examples selected by S1 (as labeled by the machine, and possibly corrected by the human) improve the parser more than those selected by S2. All experiments shared the same general setup, as described below.</Paragraph>
<Section position="1" start_page="0" end_page="80" type="sub_section"> <SectionTitle> 4.1 Experimental Setup </SectionTitle>
<Paragraph position="0"> For two parsers to co-train, they should generate comparable output but use independent statistical models.</Paragraph>
<Paragraph position="1"> In our experiments, we used a lexicalized context-free grammar parser developed by Collins (1999) and a lexicalized Tree Adjoining Grammar parser developed by Sarkar (2002). Both parsers were initialized with some seed data. Since the goal is to minimize the amount of human-annotated data, the size of the seed set should be small. In this paper we used a seed set of 1,000 sentences, taken from section 2 of the Wall Street Journal (WSJ) Penn Treebank. The total pool of unlabeled sentences was the remainder of sections 2-21 (stripped of their annotations), consisting of about 38,000 sentences. The cache size was set at 500 sentences. We have explored different settings for the seed set size elsewhere (Steedman et al., 2003).</Paragraph>
<Paragraph position="2"> The parsers were evaluated on unseen test sentences (section 23 of the WSJ corpus). Section 0 was used as a development set for determining parameters. The evaluation metric is the Parseval F-score over labeled constituents: F-score = (2 x LP x LR) / (LP + LR), where LP and LR are the labeled precision and recall rates, respectively. Both parsers were evaluated, but for brevity, all results reported here are for the Collins parser, which received higher Parseval scores.</Paragraph>
<Paragraph position="3"> [Figure 2 caption (figure not shown): ... quality of the training data. (a) The average accuracy rates are about 85%. (b) The average accuracy rates (except for those selected by Sdiff-10%) are about 95%.]</Paragraph>
</Section> <Section position="2" start_page="80" end_page="80" type="sub_section"> <SectionTitle> 4.2 Experiment 1: Selection Methods and Co-Training </SectionTitle>
<Paragraph position="0"> We first examine the effect of the three selection methods on co-training without correction (i.e., the chosen machine-labeled training examples may contain errors).</Paragraph>
<Paragraph position="1"> Because the selection decisions are based on the scores that the parsers assign to their outputs, the reliability of the scoring function has a significant impact on the performance of the selection methods. We evaluate the effectiveness of the selection methods using two scoring functions. In Section 4.2.1, each parser assesses its output with an oracle scoring function that returns the Parseval F-score of the output (as compared with the human-annotated gold standard). This is an idealized condition that gives us direct control over the error rate of the labeled training data. By keeping the error rates constant, our goal is to determine which selection method is more successful in finding sentences with high training utility.</Paragraph>
<Paragraph position="2"> In Section 4.2.2 we replace the oracle scoring function with fprob, which returns the conditional probability of the best parse as the score. We compare how the selection methods' performance degrades under the realistic condition of basing selection decisions on unreliable assessments of parser output.</Paragraph>
<Paragraph position="3"> 4.2.1 Using the oracle scoring function, fF-score The goal of this experiment is to evaluate the selection methods using a reliable scoring function. We therefore use an oracle scoring function, fF-score, which guarantees a perfect assessment of the parser's output. This, however, may be too powerful. In practice, we expect even a reliable scoring function to sometimes assign high scores to inaccurate parses. We account for this effect by adjusting the selection method's control parameter, which affects two factors: the accuracy rate of the newly labeled training data, and the number of labeled sentences added at each training iteration. A relaxed parameter setting adds more parses to the training data, but also reduces the accuracy of the training data.</Paragraph>
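To make the interaction between the cache, the scoring functions, and the selection methods explicit, here is a minimal sketch of one co-training iteration under the setup above (a hypothetical parser interface, not the authors' implementation); it reuses the ScoredSentence type and the selection functions from the sketch in Section 3.2.

# One co-training iteration, viewed from the student's side. The parser
# methods (parse, score, train) are hypothetical stand-ins; in the actual
# experiments, each parser takes both the teacher and the student role.

CACHE_SIZE = 500  # cache size used in the experimental setup

def cotrain_iteration(teacher, student, unlabeled_pool, select, n):
    cache, rest = unlabeled_pool[:CACHE_SIZE], unlabeled_pool[CACHE_SIZE:]
    scored = []
    for sentence in cache:
        teacher_parse = teacher.parse(sentence)
        student_parse = student.parse(sentence)
        scored.append(ScoredSentence(sentence, teacher_parse,
                                     teacher.score(teacher_parse),
                                     student.score(student_parse)))
    selected = select(scored, n)   # e.g. select_difference(scored, 10)
    student.train([(s.sentence, s.teacher_parse) for s in selected])
    return rest                    # remaining unlabeled pool for later iterations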
<Paragraph position="4"> Figure 2 compares the effect of the three selection methods on co-training for the relaxed (left graph) and the strict (right graph) parameter settings. Each curve in the two graphs charts the improvement in the parser's accuracy on the test sentences (y-axis) as it is trained on more data chosen by its selection method (x-axis).</Paragraph>
<Paragraph position="5"> The curves have different endpoints because the selection methods chose different numbers of sentences from the same 38K unlabeled pool. For reference, we also plotted the improvement of a fully supervised parser (i.e., trained on human-annotated data, with no selection).</Paragraph>
<Paragraph position="6"> For the more relaxed setting, the parameters are chosen so that the newly labeled training data have an average accuracy rate of about 85%: * Sabove-70% requires the labels to have an F-score ≥ 70%. It adds about 330 labeled sentences (out of the 500-sentence cache) with an average accuracy rate of 85% to the training data per iteration.</Paragraph>
<Paragraph position="7"> * Sdiff-10% requires the score difference between the teacher's labeling and the student's labeling to be at least 10%. It adds about 50 labeled sentences with an average accuracy rate of 80%.</Paragraph>
<Paragraph position="8"> * Sint-60% requires the teacher's parse to be in the top 60% of its output and the student's parse for the same sentence to be in its bottom 60%. It adds about 150 labeled sentences with an average accuracy rate of 85%.</Paragraph>
<Paragraph position="9"> Although none rivals the parser trained on human-annotated data, the selection method that improves the parser the most is Sdiff-10%. One interpretation is that the training utility of the examples chosen by Sdiff-10% outweighs the cost of the errors introduced into the training data. Another interpretation is that the other two selection methods let in too many sentences containing errors. In the right graph, we compare the same Sdiff-10% with the other two selection methods under stricter control, such that the average accuracy rate for these methods is now about 95%: * Sabove-90% now requires the parses to be at least 90% correct. It adds about 150 labeled sentences per iteration.</Paragraph>
<Paragraph position="10"> * Sint-30% now requires the teacher's parse to be in the top 30% of its output and the student's parse for the same sentence to be in its bottom 30%. It adds about 15 labeled sentences.</Paragraph>
<Paragraph position="11"> The stricter control on Sabove-90% improved the parser's performance, but not enough to overtake Sdiff-10% after all the sentences in the unlabeled pool had been considered, even though the training data of Sdiff-10% contained many more errors. Sint-30% shows a faster initial improvement, closely tracking the progress of the fully supervised parser. However, the stringent requirement exhausted the unlabeled data pool before training the parser to convergence. Sint-30% might continue to help the parser improve if it had access to more unlabeled data, which is easier to acquire than annotated data.</Paragraph>
<Paragraph position="12"> Comparing the three selection methods under both strict and relaxed control settings, the results suggest that training utility is an important criterion in selecting training examples, even at the cost of reduced accuracy.
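For concreteness, the relaxed and strict parameter settings described above can be expressed as configurations of the selection sketches from Section 3.2 (illustrative only; the score scale for Sabove and Sdiff is assumed to be an F-score in [0, 100]).

# Parameter settings reported for Experiment 1 (oracle scoring function).
relaxed_setting = [
    (select_above_n,      70),   # S_above-70%: teacher F-score >= 70
    (select_difference,   10),   # S_diff-10%: teacher beats student by >= 10 points
    (select_intersection, 60),   # S_int-60%: top/bottom 60% of the cache
]
strict_setting = [
    (select_above_n,      90),   # S_above-90%
    (select_difference,   10),   # S_diff-10% (unchanged across settings)
    (select_intersection, 30),   # S_int-30%
]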
4.2.2 Using the fprob scoring function To determine the effect of unreliable scores on the selection methods, we replace the oracle scoring function, fF-score, with fprob, which approximates the accuracy of a parse by its conditional probability. Although this is a poor estimate of accuracy (especially when computed from a partially trained parser), it is very easy to compute. The unreliable scores also reduce the correlation between the selection control parameters and the level of errors in the training data. In this experiment, we set the parameters for all three selection methods so that the average accuracy rate of the training data selected by Sabove-70% was about 85%, and the rate for Sdiff-30% and Sint-30% was about 75%.</Paragraph>
<Paragraph position="13"> As expected, the parser performance of all three selection methods using fprob (shown in Figure 3) is lower than with fF-score (see Figure 2). However, Sdiff-30% and Sint-30% helped the co-training parsers to improve with a 5% error reduction (1% absolute difference) over the parser trained only on the initial seed data. In contrast, despite an initial improvement, using Sabove-70% did not help to improve the parser. In their experiments on NP identifiers, Pierce and Cardie (2001) observed a similar effect. They hypothesize that co-training does not scale well for natural language learning tasks that require a huge amount of training data because too many errors accrue over time. Our experimental results suggest that the use of training utility in the selection process can make co-training parsers more tolerant of these accumulated errors.</Paragraph>
</Section> <Section position="3" start_page="80" end_page="80" type="sub_section"> <SectionTitle> 4.3 Experiment 2: Selection Methods and Corrected Co-training </SectionTitle>
<Paragraph position="0"> To address the problem of the training data accumulating too many errors over time, Pierce and Cardie proposed a semi-supervised variant of co-training called corrected co-training, which allows a human annotator to review and correct the output of the parsers before adding it to the training data. The main selection criterion in their co-training system is accuracy (approximated by confidence). They argue that selecting examples with nearly correct labels would require few manual interventions from the annotator.</Paragraph>
<Paragraph position="1"> We hypothesize that it may be beneficial to consider the training utility criterion in this framework as well.</Paragraph>
<Paragraph position="2"> We perform experiments to determine whether selecting fewer (and possibly less accurately labeled) examples of high training utility can reduce the amount of checking and correction required from the annotator. In our experiments, we simulated the interactive sample selection process by revealing the gold standard. As before, we compare the three selection methods using both fF-score and fprob as scoring functions, with the same parameter settings as in the previous set of experiments, i.e., the strict setting (Figure 2(b)) for fF-score. 4.3.1 Using the oracle scoring function, fF-score Figure 4 shows the effect of the three selection methods (using the strict parameter setting) on corrected co-training. As a point of reference, we plot the improvement rate for a fully supervised parser (the same as in Figure 2). In addition to charting the parser's performance in terms of the number of labeled training sentences (left graph), we also chart it in terms of the number of constituents the machine mislabeled (right graph). The pair of graphs indicates the amount of human effort required: the left graph shows the number of sentences the human has to check, and the right graph shows the number of constituents the human has to correct.</Paragraph>
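One plausible way to account for the simulated correction step and the two effort measures just described is sketched below (an illustration under stated assumptions, not the authors' code); it treats a parse as a set of labeled constituents and reuses the ScoredSentence type from the earlier sketch.

# Simulated corrected co-training: each selected sentence counts as one
# sentence checked; each constituent in the machine parse that is not in the
# gold parse counts as one constituent the human would have to correct.

def corrected_cotraining_step(selected, gold_parses):
    training_examples = []
    sentences_checked = 0
    constituents_corrected = 0
    for item in selected:
        gold = gold_parses[item.sentence]        # gold standard revealed here
        machine = set(item.teacher_parse)        # parse as labeled constituents
        sentences_checked += 1
        constituents_corrected += len(machine - set(gold))
        training_examples.append((item.sentence, gold))  # train on the corrected label
    return training_examples, sentences_checked, constituents_corrected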
<Paragraph position="3"> Comparing Sabove-90% and Sdiff-10%, we see that Sdiff-10% trains a better parser than Sabove-90% when all the unlabeled sentences have been considered. It also improves the parser using a smaller set of training examples. Thus, for the same parsing performance, it requires the human to check fewer sentences than Sabove-90% and the reference case of no selection (Figure 4(a)). On the other hand, because the labeled sentences selected by Sdiff-10% contain more errors, it requires more corrections than Sabove-90% for the same level of parsing performance, though both require fewer corrections than the reference case of no selection (Figure 4(b)). Because the amount of effort spent by the annotator depends on the number of sentences checked as well as the number of corrections made, whether Sdiff-10% or Sabove-90% is more effort-reducing may be a matter of the annotator's preference.</Paragraph>
<Paragraph position="5"> The selection method that improves the parser at the fastest rate is Sint-30%. For the same parser performance level, it selects the fewest sentences for a human to check and requires the human to make the fewest corrections. However, as we saw in the earlier experiment, very few sentences in the unlabeled pool satisfy its stringent criteria, so it ran out of data before the parser was trained to convergence. At this point we cannot determine whether Sint-30% would continue to improve the parser if we used a larger set of unlabeled data.</Paragraph>
<Paragraph position="6"> 4.3.2 Using the fprob scoring function We also consider the effect of unreliable scores in the corrected co-training framework. A comparison of the selection methods using fprob is reported in Figure 5. The left graph charts parser performance in terms of the number of sentences the human must check; the right charts it in terms of the number of constituents the human must correct. As expected, the unreliable scoring function degrades the effectiveness of the selection methods; however, compared with its unsupervised counterpart (Figure 3), the degradation is not as severe. In fact, Sdiff-30% and Sint-30% still require less training data than the reference parser. Moreover, consistent with the other experiments, the selection methods that attempt to maximize training utility achieve better parsing performance than Sabove-70%. Finally, in terms of reducing human effort, the three selection methods require the human to correct a comparable number of parser errors for the same level of parsing performance, but for Sdiff-30% and Sint-30%, fewer sentences need to be checked.</Paragraph>
<Paragraph position="7"> Corrected co-training can be seen as a form of active learning, whose goal is to identify the smallest set of unlabeled data with high training utility for the human to label. Active learning can be applied to a single learner (Lewis and Catlett, 1994) and to multiple learners (Freund et al., 1997; Engelson and Dagan, 1996; Ngai and Yarowsky, 2000). In the context of parsing, all previous work (Thompson et al., 1999; Hwa, 2000; Tang et al., 2002) has focussed on single learners.
Corrected co-training is the first application of active learning for multiple parsers. We are currently investigating comparisons to the single-learner approaches.</Paragraph>
<Paragraph position="8"> Our approach is similar to co-testing (Muslea et al., 2002), an active learning technique that uses two classifiers to find contentious examples (i.e., data for which the classifiers' labels disagree) for a human to label. There is a subtle but significant difference, however, in that their goal is to reduce the total number of labeled training examples, whereas we also wish to reduce the number of corrections made by the human. Therefore, our selection methods must take into account the quality of the parse produced by the teacher in addition to how different it is from the parse produced by the student. The intersection method aims precisely at selecting sentences that satisfy both requirements. Exploring different selection methods is part of our ongoing research effort.</Paragraph>
</Section> </Section> </Paper>