<?xml version="1.0" standalone="yes"?> <Paper uid="J04-3001"> <Title>Sample Selection for Statistical Parsing</Title> <Section position="3" start_page="254" end_page="255" type="metho"> <SectionTitle> 2. Learning with Sample Selection </SectionTitle> <Paragraph position="0"> Unlike traditional learning systems that receive training examples indiscriminately, a sample selection learning system actively influences its own progress by choosing new examples to incorporate into its training set. There are two types of selection algorithms: committee-based and single-learner. A committee-based selection algorithm works with multiple learners, each maintaining a different hypothesis (perhaps pertaining to different aspects of the problem). The candidate examples that lead to the most disagreement among the different learners are considered to have the highest training utility value (TUV) (Cohn, Atlas, and Ladner 1994; Freund et al. 1997). For computationally intensive problems, such as parsing, keeping multiple learners may be impractical.</Paragraph> <Paragraph position="1"> In this work, we focus on sample selection using a single learner that keeps one working hypothesis. Without access to multiple hypotheses, the selection algorithm can nonetheless estimate the TUV of a candidate. We identify the following three classes of predictive criteria: 1. Problem space: Knowledge about the problem space may provide information about the type of candidates that are particularly plentiful or difficult to learn. This criterion focuses on the general attributes of the learning problem, such as the distribution of the input data and properties of the learning algorithm, but it ignores the current state of the hypothesis.</Paragraph> <Paragraph position="2"> 2. Performance of the hypothesis: Testing the candidates on the current working hypothesis shows the type of input data on which the hypothesis may perform weakly. That is, if the current hypothesis is unable to label a candidate or is uncertain about it, then the candidate might be a good training example (Lewis and Catlett 1994). The underlying assumption is that an uncertain output is likely to be wrong.</Paragraph> <Paragraph position="3"> 3. Parameters of the hypothesis: Estimating the potential impact that the candidates will have on the parameters of the current working hypothesis locates those examples that will change the current hypothesis the most.</Paragraph> <Paragraph position="4"> U is a set of unlabeled candidates.</Paragraph> <Paragraph position="5"> L is a set of labeled training examples.</Paragraph> <Paragraph position="6"> C is the current hypothesis.</Paragraph> <Paragraph position="7"> Initialize: C ← Train(L).</Paragraph> <Paragraph position="8"> Repeat: N ← Select(n, U, C, f); U ← U − N; L ← L ∪ Label(N); C ← Train(L).</Paragraph> <Paragraph position="9"> Until (C is good enough) or (U = ∅) or (cutoff).</Paragraph> <Paragraph position="10"> Figure 1: Pseudocode for the sample selection learning algorithm. Figure 1 outlines the single-learner sample selection training loop in pseudocode. Initially, the training set, L, consists of a small number of labeled examples, based on which the learner proposes its first hypothesis of the target concept, C. Also available to the learner is a large pool of unlabeled training candidates, U. In each training iteration, the selection algorithm, Select(n, U, C, f), ranks the candidates of U according to their expected TUVs and returns the n candidates with the highest values. The algorithm predicts the TUV of each candidate, u ∈
U, with an evaluation function, f(u, C). This function may rely on the hypothesis concept C to estimate the utility of a candidate u. The n chosen candidates are then labeled by human experts and added to the existing training set. Running the learning algorithm, Train(L), on the updated training set, the system proposes a new hypothesis of the target concept that is the most compatible with the examples seen thus far. The loop continues until one of three stopping conditions is met: the hypothesis is considered to perform well enough, all candidates are labeled, or an absolute cutoff point is reached (e.g., no more resources).</Paragraph> </Section> <Section position="4" start_page="255" end_page="256" type="metho"> <SectionTitle> 3. Sample Selection for Prepositional-Phrase Attachment </SectionTitle> <Paragraph position="0"> One common source of structural ambiguity arises from syntactic constructs in which a prepositional phrase might be equally likely to modify the verb or the noun preceding it. Researchers have proposed many computational models for resolving PP-attachment ambiguities. Some well-known approaches include rule-based models (Brill and Resnik 1994), backed-off models (Collins and Brooks 1995), and a maximum-entropy model (Ratnaparkhi 1998). Following the tradition of using PP-attachment learning as a way to gain insight into the parsing problem, we first apply sample selection to reduce the amount of annotation used in training a PP-attachment model.</Paragraph> <Paragraph position="1"> We use the Collins-Brooks model as the basic learning algorithm and experiment with several evaluation functions based on the types of predictive criteria described earlier.</Paragraph> <Paragraph position="2"> Our experiments show that the best evaluation function can reduce the number of labeled examples by nearly half without loss of accuracy.</Paragraph> <Section position="1" start_page="255" end_page="256" type="sub_section"> <SectionTitle> 3.1 A Summary of the Collins-Brooks Model </SectionTitle> <Paragraph position="0"> The Collins-Brooks model takes prepositional phrases and their attachment classifications as training examples: each is represented as a quintuple of the form (v, n, p, n2, a), where v, n, p, and n2 are the head words of the verb phrase, the object noun phrase, the preposition, and the prepositional noun phrase, respectively, and a specifies the attachment classification. For example, (wrote a book in three days, attach-verb) would be annotated as (wrote, book, in, days, verb). The head words can be automatically extracted using a heuristic table lookup in the manner described by Magerman (1994). For this learning problem, the supervision is the one-bit information of whether p should attach to v or to n. In order to learn the attachment preferences of prepositional phrases, the system builds attachment statistics for each characteristic tuple of all training examples. A characteristic tuple is some subset of the four head words in the example, with the condition that one of the elements must be the preposition. Each training example forms eight characteristic tuples: (v, n, p, n2), (v, n, p), (v, p, n2), (n, p, n2), (v, p), (n, p), (p, n2), (p).
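The eight characteristic tuples are easy to enumerate mechanically. The following is a minimal Python sketch, not the paper's code; the function name and the grouping of the tuples by back-off level are choices made here for illustration.

def characteristic_tuples(v, n, p, n2):
    # The eight characteristic tuples of a PP example, grouped by back-off
    # level: all four head words, the three-word tuples, the two-word tuples,
    # and the bare preposition.  Every tuple contains the preposition p.
    return [[(v, n, p, n2)],
            [(v, n, p), (v, p, n2), (n, p, n2)],
            [(v, p), (n, p), (p, n2)],
            [(p,)]]

# Example from the text: "wrote a book in three days"
print(characteristic_tuples("wrote", "book", "in", "days"))

Grouping the tuples by level mirrors the order in which the back-off classifier consults them.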
The attachment statistics are a collection of the occurrence frequencies for all the characteristic tuples in the training set and the occurrence frequencies for the characteristic tuples of those examples determined to attach to nouns. For some characteristic tuple t, Count(t) denotes the former and Count_NP(t) denotes the latter.</Paragraph> </Section> </Section> <Section position="5" start_page="256" end_page="267" type="metho"> <Paragraph position="0"> In terms of the sample selection algorithm, the collection of counts represents the learner's current hypothesis (C in Figure 1). Figure 2 (the Collins-Brooks PP-attachment classification algorithm) provides the pseudocode for the Train routine.</Paragraph> <Paragraph position="1"> Once trained, the system can be used to classify test cases based on the statistics of the most similar training examples, backing off as necessary. For instance, to determine the PP-attachment of a test case, the classifier first considers the ratio of the two frequency counts for the four-word characteristic tuple of the test case. If that tuple never occurred in the training data, the classifier then backs off to the test case's three three-word characteristic tuples, and it continues to back off further if necessary. If the model has no information on any of the characteristic tuples of the test case, it classifies the test case, by default, as an instance of noun attachment. Figure 3 illustrates the back-off scheme on a test case. The Test routine in Figure 2 describes the model's classification procedure for each back-off level.</Paragraph> <Paragraph position="2"> Figure 3: In this example, the classification of the test case preposition is backed off to the two-word-tuple level. In the diagram, each circle represents a characteristic tuple. A filled circle denotes that the tuple has occurred in the training set. The dashed rectangular box indicates the back-off level on which the classification is made.</Paragraph> <Section position="1" start_page="257" end_page="261" type="sub_section"> <SectionTitle> 3.2 Evaluation Functions </SectionTitle> <Paragraph position="0"> Based on the three classes of predictive criteria discussed in Section 2, we propose several evaluation functions for the Collins-Brooks model.</Paragraph> <Paragraph position="1"> The first class of evaluation functions draws on prior knowledge about the PP-attachment model and properties of English prepositional phrases. For instance, we know that the most problematic test cases for the PP-attachment model are those for which it has no statistics at all. Therefore, data that the system has not yet encountered might be good candidates. The first evaluation function we define, f novel (u, C), equates the TUV of a candidate u with its degree of novelty, the number of its characteristic tuples that currently have zero counts:</Paragraph> <Paragraph position="2"> f_novel(u, C) = |{t ∈ tuples(u) : Count(t) = 0}|</Paragraph> <Paragraph position="3"> This evaluation function has some blatant defects. It may distort the data distribution so much that the system will not be able to build up a reliable collection of statistics.</Paragraph> <Paragraph position="4"> The function does not take into account the intuition that data that rarely occur, no matter how novel, probably have low overall training utility.
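Before turning to a further limitation of this scoring scheme, the counting model, the back-off classifier, and f novel can be made concrete. The sketch below is illustrative Python, not the paper's implementation; the class name, the dictionary-based counts, and the tie-breaking at 0.5 are assumptions.

from collections import defaultdict

def characteristic_tuples(v, n, p, n2):
    # Eight characteristic tuples grouped by back-off level (see Figure 2).
    return [[(v, n, p, n2)],
            [(v, n, p), (v, p, n2), (n, p, n2)],
            [(v, p), (n, p), (p, n2)],
            [(p,)]]

class CollinsBrooks:
    def __init__(self):
        self.count = defaultdict(int)      # Count(t)
        self.count_np = defaultdict(int)   # Count_NP(t)

    def train(self, examples):
        # examples are quintuples (v, n, p, n2, a) with a in {"noun", "verb"}.
        for v, n, p, n2, a in examples:
            for level in characteristic_tuples(v, n, p, n2):
                for t in level:
                    self.count[t] += 1
                    if a == "noun":
                        self.count_np[t] += 1

    def p_noun(self, v, n, p, n2):
        # Back off from the most specific level to the bare preposition,
        # pooling the counts of the tuples at each level.
        for level in characteristic_tuples(v, n, p, n2):
            denom = sum(self.count[t] for t in level)
            if denom > 0:
                return sum(self.count_np[t] for t in level) / denom
        return 1.0                         # no statistics at all: default to noun

    def classify(self, v, n, p, n2):
        return "noun" if self.p_noun(v, n, p, n2) >= 0.5 else "verb"

def f_novel(candidate, model):
    # TUV estimate: how many of the candidate's tuples are still unseen.
    v, n, p, n2 = candidate
    return sum(1 for level in characteristic_tuples(v, n, p, n2)
                 for t in level if model.count.get(t, 0) == 0)

The pooled ratio at each level follows the back-off order described above; how ties at 0.5 are resolved is a detail the text leaves open, so it is treated here as an assumption.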
Moreover, the scoring scheme does not make any distinction between the characteristic tuples of a candidate.</Paragraph> <Paragraph position="5"> 1 Note that the current hypothesis C is ignored in evaluation functions of this class because they depend only on the knowledge about the problem space.</Paragraph> <Paragraph position="6"> is selected, a total of 22 tuples can be ignored. The dashed rectangles show the classification level before training, and the solid rectangles show the classification level after the statistics of u have been taken. The obviated tuples are represented by the filled black circles.</Paragraph> <Paragraph position="7"> We know, however, that the PP-attachment classifier is a back-off model that makes its decision based first on statistics of the characteristic tuple with the most words. A more sophisticated sampling of the data domain should consider not only the novelty of the data, but also the frequency of its occurrence, as well as the quality of its characteristic tuples. We define a back-off-model-based evaluation function, f backoff (u, C), that scores a candidate u by counting the number of characteristic tuples that would be obviated in all candidates if u were included in the training set. For example, suppose we have a small pool of five candidates, and we are about to pick the first training would be the best choice. By selecting either as the first training example, we could ignore all but the four-word characteristic tuple each have three words in common with the first two candidates, they would no longer depend on their lower four tuples; and although we would also improve the statistics for one of u 's tuples (on), nothing could be pruned from u were chosen as the first example, u would lose all its utility, because we could not prune any extra characteristic tuples by using u . That is, in the next round of selection, f would be the best second example because it would now have the most tuples to prune (7 tuples). The evaluation function f backoff improves upon f novel in two ways. First, novel candidates that occur frequently are favored over those that rarely come up. As we have seen in the above example, a candidate that is similar to other candidates can eliminate more characteristic tuples all at once. Second, the evaluation strategy follows the working principle of the back-off model and discounts lower-level characteristic tuples that do not affect the classification process, even if they were &quot;novel.&quot; For instance, as the first training example, we would no longer care about the two-word tuples of u such as (wrote, on), even though we have no statistics for them. A potential problem with f backoff is that after all the obvious candidates have been selected, the function is not very good at differentiating between the remaining candidates that have about the same level of novelty and occur infrequently. previous section score candidates based on prior knowledge alone, independent of the current state of the learner's hypothesis and the annotation of the selected training examples. To attune the selection of training examples to the learner's progress, an evaluation function might factor in its current hypothesis in predicting a candidate's TUV.</Paragraph> <Paragraph position="8"> One way to incorporate the current hypothesis into the evaluation function is to score each candidate using the current model, assuming its hypothesis is right. 
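Before moving on to the performance-based scores, the pruning intuition behind f backoff can also be sketched. The code below reflects one reading of the definition above (count, over the whole candidate pool, the characteristic tuples that would no longer need to be consulted once u is added); it is illustrative Python with assumed helper names, and the pool is taken to include u itself.

def levels(v, n, p, n2):
    # Characteristic tuples grouped by back-off level, as in the earlier sketch.
    return [[(v, n, p, n2)],
            [(v, n, p), (v, p, n2), (n, p, n2)],
            [(v, p), (n, p), (p, n2)],
            [(p,)]]

def obviated(cand, counts):
    # Tuples of cand that lie strictly below the level at which the back-off
    # classifier would make its decision, i.e. tuples it never needs to consult.
    lv = levels(*cand)
    for i, level in enumerate(lv):
        if any(counts.get(t, 0) > 0 for t in level):
            return sum(len(deeper) for deeper in lv[i + 1:])
    return 0   # no statistics at any level: nothing can be pruned yet

def f_backoff(u, pool, counts):
    # Score u by how many additional tuples, summed over the pool (which may
    # include u itself), become unnecessary if u joins the training set.
    with_u = dict(counts)
    for level in levels(*u):
        for t in level:
            with_u[t] = with_u.get(t, 0) + 1
    return sum(obviated(x, with_u) - obviated(x, counts) for x in pool)

Under this reading, a candidate whose tuples are already covered by earlier selections gains nothing further, which matches the spirit of the example discussed above.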
An error-driven evaluation function, f err , equates the TUV of a candidate with the hypothesis' estimate of its likelihood to misclassify that candidate (i.e., one minus the probability of the most-likely class). If the hypothesis predicts that the likelihood of a prepositional phrase to attach to the noun is 80%, and if the hypothesis is accurate, then there is a 20% chance that it has misclassified.</Paragraph> <Paragraph position="9"> A related evaluation function is one that measures the hypothesis's uncertainty across all classes, rather than focusing on only the most likely class. Intuitively, if the hypothesis classifies a candidate as equally likely to attach to the verb as to the noun, it is the most uncertain of its answer. If the hypothesis assigns a candidate to a class with a probability of one, then it is the most certain of its answer. For the binary-class case, the uncertainty-based evaluation function, f unc , can be expressed in the same way as the error-driven function, as a function that is symmetric about 0.5 and monotonically decreases if the hypothesis prefers one class over another:</Paragraph> <Paragraph position="11"> In the general case of choosing between multiple classes, f err and f unc are different from one another. We shall return to this point in Section 4.1.2 when we consider training parsers.</Paragraph> <Paragraph position="12"> The potential drawback of the performance-based evaluation functions is that they assume that the hypothesis is correct. Selecting training examples based on a poor hypothesis is prone to pitfalls. On the one hand, the hypothesis may be overly confident about the certainty of its decisions. For example, the hypothesis may assign noun to a candidate with a probability of one based on parameter estimates computed from a single previous observation in which a similar example was labeled as noun. Despite the unreliable statistics, this candidate would not be selected, since the hypothesis considers this a known case. Conversely, the hypothesis may also direct the selection algorithm to chase after undecidable cases. For example, consider prepositional phrases (PPs) with in as the head. These PPs occur frequently, and about half of them should attach to the object noun. Even though training on more labeled in examples 2 As long as it adheres to these criteria, the specific form of the function is irrelevant, since the selection is not determined by the absolute scores of the candidates, but by their scores relative to each other. Computational Linguistics Volume 30, Number 3 does not improve the model's performance on future in PPs, the selection algorithm will keep on requesting more in training examples because the hypothesis remains uncertain about this preposition.</Paragraph> <Paragraph position="13"> With an unlucky starting hypothesis, these evaluation functions may select uninformative candidates initially.</Paragraph> <Paragraph position="14"> based evaluation function stems from their trust in the model's diagnosis of its own progress. Another way to incorporate the current hypothesis is to determine how good it is and what type of examples will improve it the most. 
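Before moving to the confidence-based criterion, the two performance-based scores can be written down for the binary PP-attachment case. This is a minimal sketch; p_noun() is assumed to come from a trained model such as the one sketched earlier, and binary entropy is used for f unc merely as one function satisfying the stated criteria (footnote 2 notes that the exact form does not matter).

import math

def f_err(candidate, model):
    # One minus the probability of the most likely class.
    p = model.p_noun(*candidate)            # estimated P(attach = noun)
    return 1.0 - max(p, 1.0 - p)

def f_unc(candidate, model):
    # Symmetric about 0.5, maximal at p = 0.5, decreasing as the model
    # prefers one class; binary entropy is one such function.
    p = model.p_noun(*candidate)
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

In the binary case the two scores rank candidates identically, which is the point made above; they diverge only when there are more than two classes.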
In this section we propose an evaluation function that scores candidates based on their utility in increasing confidence in the parameters of the hypothesis (i.e., the collection of statistics over the characteristic tuples of the training examples).</Paragraph> <Paragraph position="15"> Training the parameters of the PP-attachment model is similar to empirically determining the bias of a coin. We measure the coin's bias by repeatedly tossing it and keeping track of the percentage of times it lands on heads. The more trials we perform, the more confident we become about our estimate of the bias. Similarly, in estimating p, the likelihood of a PP's attaching to its object noun, we are more confident about a classification decision based on statistics with higher counts than about one based on statistics with lower counts. A quantitative measurement of our confidence in a statistic is the confidence interval, a region around the measured statistic bounding the area within which the true statistic is likely to lie. More specifically, the confidence interval for p, a binomial parameter, is defined as</Paragraph> <Paragraph position="16"> conf_int(p̄, n) = ( p̄ + t²/(2n) ± t·√( p̄(1 − p̄)/n + t²/(4n²) ) ) / ( 1 + t²/n ),</Paragraph> <Paragraph position="17"> where p̄ is the expected value of p based on n trials, and t is a threshold value that depends on the number of trials and the level of confidence we desire. For instance, if we want to be 90% confident that the true statistic p lies within the interval, and p̄ is based on n = 30 trials, then we set t to be 1.697.</Paragraph> <Paragraph position="18"> Applying the confidence interval concept to evaluating candidates for the back-off PP-attachment model, we define a function f conf that scores a candidate by taking the average of the lengths of the confidence intervals at each back-off level. That is,</Paragraph> <Paragraph position="19"> f_conf(u, C) = (1/4) Σ_{l=1..4} length(conf_int(p̄_l, n_l)), where p̄_l and n_l are the noun-attachment estimate and the total tuple count for candidate u at back-off level l.</Paragraph> <Paragraph position="20"> The confidence-based evaluation function has several potential problems. One of its flaws is similar to that of f novel. In the early stage, f conf picks the same examples as f novel, because we have no confidence in the statistics of novel examples. Therefore, f conf is also prone to chase after examples that rarely occur in order to build up the confidence of some unimportant parameters. A second problem is that f conf ignores the output of the model. Thus, if candidate A has a confidence interval around [0.6, 1] and candidate B has a confidence interval around [0.4, 0.7], then f conf will prefer candidate A, even though training on A will not change the hypothesis's performance, since the entire confidence interval is already in the noun zone.</Paragraph> <Paragraph position="21"> Footnote 3: This phenomenon is particularly acute in the early stages of refining the hypothesis because most decisions are based on statistics of the head preposition alone; in the later stages, the hypothesis can usually rely on higher-ordered characteristic tuples that tend to be better classifiers.</Paragraph> <Paragraph position="22"> Footnote 4: For n < 120, the values of t can be found in standard statistics textbooks; for n ≥ 120, t = 1.6576. Because the derivation of the confidence interval equation makes a normality assumption, the equation does not hold for small values of n (cf. Larsen and Marx [1986], pp. 277-278). When n is large, the contributions from the terms t²/(2n), t²/(4n²), and t²/n are negligible. Dropping these terms, we have the statistic for large n, p̄ ± t·√(p̄(1 − p̄)/n).</Paragraph> <Paragraph position="23"> 3.2.4 Hybrid Function. The three categories of predictive criteria discussed above are complementary, each focusing on a different aspect of the learner's weakness.
Therefore, it may be beneficial to combine these criteria into one evaluation function. For instance, the deficiency of the confidence-based evaluation function described in the previous section can be avoided if the confidence interval covering the region around the uncertainty boundary (candidate B in the example just discussed) is weighed more heavily than one around the end points (candidate A). In this section, we introduce a new function that tries to factor in both the uncertainty of the model performance and the confidence of the model parameters. First, we define a function, called area(-p, n), that computes the area under a Gaussian function N(x,u,s) with a mean of 0.5 and a standard deviation of 0.1 that is bounded by the confidence interval as computed by conf int(-p, n) (see Figure 5).</Paragraph> </Section> <Section position="2" start_page="261" end_page="265" type="sub_section"> <SectionTitle> 3.3 Experimental Comparison </SectionTitle> <Paragraph position="0"> To determine the relative merits of the proposed evaluation functions, we compare the learning curve of training with sample selection according to each function against a baseline of random selection in an empirical study. The corpus for this comparison is a collection of phrases extracted from the Wall Street Journal (WSJ) Treebank. We use Section 00 as the development set and Sections 2-23 as the training and test sets. We perform 10-fold cross-validation to ensure the statistical significance of the results. For each fold, the training candidate pool contains about 21,000 phrases, and the test set contained about 2,000 phrases.</Paragraph> <Paragraph position="1"> As shown in Figure 1, the learner generates an initial hypothesis based on a small set of training examples, L. These examples are randomly selected from the pool of unlabeled candidates and annotated by a human. Random sampling ensures that the initial trained set reflects the distribution of the candidate pool and thus that the initial hypothesis is unbiased. Starting with an unbiased hypothesis is important for those evaluation functions whose scoring metrics are affected by the accuracy of the hypothesis. In these experiments, L initially contains 500 randomly selected examples.</Paragraph> <Paragraph position="2"> In each selection iteration, all the candidates are scored by the evaluation function, and n examples with the highest TUVs are picked out from U to be labeled and added Figure 5 An example: Suppose that the candidate has a likelihood of 0.4 for noun attachment and a confidence interval of width 0.1. Then area computes the area bounded by the confidence interval and the Gaussian curve.</Paragraph> <Paragraph position="3"> to L. Ideally, we would like to have n = 1 for each iteration. In practice, however, it is often more convenient for the human annotator to label data in larger batches rather than one at a time. In these experiments, we use a batch size of n = 500 examples.</Paragraph> <Paragraph position="4"> We make note of one caveat to this kind of n-best batch selection. Under a hypothesis-dependent evaluation function, identical examples will receive identical scores. Because identical (or very similar) examples tend to address the same deficiency in the hypothesis, adding n very similar examples to the training set is unlikely to lead to big improvements in the hypothesis. 
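Returning briefly to the scores defined in Section 3.2 before the batch-selection details, the confidence interval and the Gaussian-area computation behind the hybrid function can be sketched as follows. This is illustrative Python following the interval formula in Section 3.2.3; the two-value t lookup is a simplification of the t-table mentioned there, and the function names are assumptions.

import math

def t_value(n):
    # Simplified threshold: the text gives t = 1.697 for n = 30 at 90% confidence
    # and t = 1.6576 for n >= 120; a real implementation would consult a t-table.
    return 1.697 if n < 120 else 1.6576

def conf_int(p_bar, n):
    # Confidence interval for a binomial parameter (Section 3.2.3); (low, high).
    if n == 0:
        return (0.0, 1.0)                   # no observations: no confidence at all
    t = t_value(n)
    center = (p_bar + t * t / (2 * n)) / (1 + t * t / n)
    half = t * math.sqrt(p_bar * (1 - p_bar) / n + t * t / (4 * n * n)) / (1 + t * t / n)
    return (max(0.0, center - half), min(1.0, center + half))

def gaussian_cdf(x, mu=0.5, sigma=0.1):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def area(p_bar, n):
    # Mass of the N(0.5, 0.1) curve inside the confidence interval: wide intervals
    # that straddle the 0.5 decision boundary score highest (candidate B in the
    # example above), while confident or one-sided intervals score low.
    low, high = conf_int(p_bar, n)
    return gaussian_cdf(high) - gaussian_cdf(low)

# Roughly the situation in Figure 5: an estimate of 0.4 with a fairly narrow interval.
print(area(0.4, 200))

Only the area computation itself is sketched here; the full f area score builds on it as described in the text.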
To diversify the examples in each batch, we simulate single-example selection (whenever possible) by reestimating the scores of the candidates after each selection. Suppose we have just chosen to add candidate x to the batch. Then, before selecting the next candidate, we estimate the potential decrease in scores of candidates similar to x once it belongs to the annotated training set. The estimation is based entirely on the knowledge that x is chosen, but not on the classification of x. Thus, only certain types of evaluation functions are amenable to the reestimation process. For example, if scores have been assigned by f conf , then we know that the confidence intervals of the candidates similar to x must decrease slightly after learning x. On the other hand, if scores have been assigned by f unc , then we cannot perceive any changes in the scores of similar candidates without knowing the true classification of x.</Paragraph> <Paragraph position="5"> 3.3.1 Results and Discussion. This section presents the empirical measurements of the model's performances using training examples selected by different evaluation functions. We compare each proposed function with the baseline of random selection (f rand ). The results are graphically depicted from two perspectives. One (e.g., Figure 6(a)-6(c)) plots the learning curves of the functions, showing the relationship between the number of training examples (x-axis) and the performance of the model on test data (y-axis). We deem one evaluation function to be better than another if its learning curve envelopes the other's. An alternative way to interpret the results is to focus on the reduction in training size offered by one evaluation function over another for some particular performance level. Figure 6(d) is a bar graph comparing all the evaluation , and the baseline; (d) compares all the functions for the number of training examples selected at the final performance level (83.8%).</Paragraph> <Paragraph position="6"> functions at the highest performance level. The graph shows that in order to train a model that attaches PPs with an accuracy rate of 83.8%, sample selection with f at least as fast that for as f novel . However, the differences between these two functions become smaller for higher performance levels. This outcome validates our predictions. Scoring candidates by a combination of their novelty, occurrence frequencies, and the qualities of their characteristic tuples, f backoff selects helpful early (the first 4,000 or so) training examples. Then, just as in f novel , the learning rate remains stagnant for the next 2,000 poorly selected examples. Finally, when the remaining candidates all have similar novelty values and contain mostly characteristic tuples that occur infrequently, the selection becomes random.</Paragraph> <Paragraph position="7"> Figure 6(b) compares the two evaluation functions that score candidates based on the current state of the hypothesis. Although both functions suffer a slow start, they are more effective than f backoff at reducing the training set when learning high-quality models. Initially, because all the unknown statistics are initialized to 0.5, selection based on f unc is essentially random sampling. Only after the hypothesis becomes sufficiently accurate (after training on about 5,000 annotated examples) does it begin to make Computational Linguistics Volume 30, Number 3 informed selections. 
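As an aside on the batch-diversification scheme described at the start of this discussion (the results comparison resumes below), greedy n-best selection with score reestimation can be sketched as follows. The sketch assumes a score function that, like the confidence-based scores, depends only on the counts and not on the unknown labels; all names are illustrative, and tuples_fn is meant to be a helper like characteristic_tuples from the earlier sketches.

def select_batch(pool, counts, score_fn, n, tuples_fn):
    # Simulate single-example selection within one annotation batch: after each
    # pick, credit the picked candidate's characteristic tuples to a working
    # copy of the counts (its label is still unknown, so only the totals grow),
    # so that near-duplicates of an already-picked candidate drop in score.
    working = dict(counts)
    remaining = list(pool)
    batch = []
    for _ in range(min(n, len(remaining))):
        best = max(remaining, key=lambda c: score_fn(c, working))
        batch.append(best)
        remaining.remove(best)
        for level in tuples_fn(*best):
            for t in level:
                working[t] = working.get(t, 0) + 1
    return batch

With f conf-style scoring, the growing counts shrink the confidence intervals of candidates that share tuples with the picked example, which is the effect described above; with f unc the scores cannot be updated this way, because the classification of the picked example is still unknown.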
Following a similar but more exaggerated pattern, the confidence-based function, f conf , also improves slowly at the beginning before finally overtaking the baseline. As we noted earlier, because the hypothesis is not confident about novel candidates, f conf and f novel tend to select the same early examples. Therefore, the early learning rate of f conf is as poor as that of f novel . In the later stage, while f novel continues to flounder, f conf can select better candidates based on a more reliable hypothesis. Finally, the best-performing evaluation function is the hybrid approach. Figure 6(c) shows that the learning curve of f area combines the earlier success of f unc and the later success of f conf to always outperform the other functions. As shown in Figure 6(d), it requires the least number of examples to achieve the highest performance level of 83.8%. Compared to the baseline, f area requires 47% fewer examples to achieve this performance level. From these comparison studies, we conclude that involving the hypothesis in the selection process is a key factor in reducing the size of the training set.</Paragraph> <Paragraph position="8"> 4. Sample Selecting for Statistical Parsing In applying sample selection to training a PP-attachment model, we have observed that all effective evaluation functions make use of the model's current hypothesis in estimating the training utility of the candidates. Although knowledge about the problem space seems to help sharpening the learning curve initially, overall, it is not a good predictor. In this section, we investigate whether these observations hold true for training statistical parsing models as well. Moreover, in order to determine whether the performances of the predictive criteria are consistent across different learning models within the same domain, we have performed the study on two parsing models: one based on a context-free variant of tree-adjoining grammars (Joshi, Levy, and Takahashi 1975), the Probabilistic Lexicalized Tree Insertion Grammar (PLTIG) formalism (Schabes and Waters 1993; Hwa 1998), and Collins's Model 2 parser (1997). Although both models are lexicalized, statistical parsers, their learning algorithms are different. The Collins Parser is a fully supervised, history-based learner that models the parameters of the parser by taking statistics directly from the training data. In contrast, PLTIG's expectation-maximization-based induction algorithm is partially supervised; the model's parameters are estimated indirectly from the training data. As a superset of the PP-attachment task, parsing is a more challenging learning problem. Whereas a trained PP-attachment model is a binary classifier, a parser must identify the correct syntactic analysis out of all possible parses for a sentence. This classification task is more difficult than PP-attachment, since the number of possible parses for a sentence grows exponentially with respect to its length. Consequently, the annotator's task is more complex. Whereas the person labeling the training data for PP-attachment reveals one unit of information (always choosing between noun or verb), the annotation needed for parser training is usually greater than one unit, and the type of labels varies from sentence to sentence. 
Because the annotation complexity differs from sentence to sentence, the evaluation functions must strike a balance between maximizing potential informational gain and minimizing the expected amount 7 We consider each pair of brackets in the training sentence to be one unit of supervised information, assuming that the number of brackets correlates linearly with the amount of effort spent by the human annotator. This correlation is an approximation, however; in real life, adding one pair of brackets to a longer sentence may require more effort than adding a pair of brackets to a shorter one. To capture bracketing interdependencies at this level, we would need to develop a model of the annotation decision process and incorporate it as an additional factor in the evaluation functions. Hwa Sample Selection for Statistical Parsing of annotation exerted. We propose a set of evaluation functions similar in spirit to those for the PP-attachment learner, but extended to accommodate the parsing domain.</Paragraph> </Section> <Section position="3" start_page="265" end_page="267" type="sub_section"> <SectionTitle> 4.1 Evaluation Functions </SectionTitle> <Paragraph position="0"> frequencies of its characteristic tuples, we define an evaluation function, f lex (w, G) that scores a sentence candidate, w, based on the novelty and frequencies of word pair co-occurrences:</Paragraph> <Paragraph position="2"> where w is the unlabeled sentence candidate, G is the current parsing model (which is ignored by problem-space-based evaluation functions), new(w</Paragraph> <Paragraph position="4"> ) is an indicator function that returns one if we have not yet selected any sentence in which w</Paragraph> <Paragraph position="6"> in the candidate pool. We expect these evaluation functions to be less relevant for the parsing domain than for the PP-attachment domain for two reasons.</Paragraph> <Paragraph position="7"> First, because we do not have the actual parses, the extraction of lexical relationships is based on co-occurrence statistics, not syntactic relationships. Second, because the distribution of words that form lexical relationships is wider and more uniform than that of words that form PP characteristic tuples, most word pairs will be novel and appear only once.</Paragraph> <Paragraph position="8"> Another simple evaluation function based on the problem space is one that estimates the TUV of a candidate from its sentence length:</Paragraph> <Paragraph position="10"> The intuition behind this function is based on the general observation that longer sentences tend to have complex structures and introduce more opportunities for ambiguous parses. Although these evaluation functions may seem simplistic, they have one major advantage: They are easy to compute and require little processing time.</Paragraph> <Paragraph position="11"> Because inducing parsing models demands significantly more time than inducing PP-attachment models, it becomes more important that the evaluation functions for parsing models be as efficient as possible.</Paragraph> <Paragraph position="12"> based evaluation functions: f err , the model's estimate of the likelihood that is has made a classification error, and f unc , the model's estimate of its uncertainty in making the classification. We have shown the two functions to have similar performance for the PP-attachment task. This is not the case for statistical parsing, however, because the number of possible classes (parse trees) differs from sentence to sentence. 
For example, suppose we wish to compare one candidate for which the current parsing model generated four equally likely parses with another candidate for which the model generated one parse with a probability of 0.02 and 99 other parses each with a probability of roughly 0.01 (such that they sum to 0.98). The error-driven function, f err , would score the latter candidate higher because its most likely parse has a lower probability than that of the most likely parse of the former candidate; the uncertainty-based function, f unc , would score the two candidates about equally, because in neither case does the model express a strong preference for one parse over any other. In this section, we provide a formal definition for both functions.</Paragraph> <Paragraph position="13"> Suppose that a parsing model G generates a candidate sentence w with probability P(w | G), and that the set V contains all possible parses that G generated for w. Then, we denote the probability of G's generating a single parse, v ∈ V, as P(v | G) such that</Paragraph> <Paragraph position="14"> Σ_{v ∈ V} P(v | G) = P(w | G).</Paragraph> <Paragraph position="15"> Note that P(v | G) reflects the probability of one particular parse tree, v, out of all possible parse trees for all possible sentences that G can generate. To compute the likelihood of a parse's being the correct parse out of the possible parses of w according to G, denoted as P(v | w, G), we need to normalize the tree probability by the sentence probability:</Paragraph> <Paragraph position="16"> P(v | w, G) = P(v | G) / P(w | G). (2)</Paragraph> <Paragraph position="17"> So, according to G, the likelihood that v_max, the most likely parse, is the correct parse for w is P(v_max | w, G) = max_{v ∈ V} P(v | G) / P(w | G).</Paragraph> <Paragraph position="19"> Therefore, the error-driven evaluation function is defined as f_err(w, G) = 1 − P(v_max | w, G).</Paragraph> <Paragraph position="20"> Unlike the error-driven function, which focuses on the most likely parse, the uncertainty-based function takes the probability distribution of all parses into account.</Paragraph> <Paragraph position="21"> To quantitatively characterize this distribution, we compute its entropy. That is, for a random variable V with density p(v),</Paragraph> <Paragraph position="22"> H(V) = − Σ_v p(v) log p(v). (3)</Paragraph> <Paragraph position="23"> Further details can be found in textbooks on information theory (e.g., Cover and Thomas 1991). Determining the parse tree for a sentence from a set of possible parses can be viewed as assigning a value to a random variable. Thus, a direct application of the entropy definition to the probability distribution of the parses for sentence w in G computes its tree entropy, TE(w, G), the expected number of bits needed to encode the distribution of possible parses for w. However, we may not wish to compare two sentences with different numbers of parses by their entropy directly. If the parse probability distributions for both sentences are uniform, the sentence with more parses will have a higher entropy. Because longer sentences typically have more parses, using entropy directly would result in a bias toward selecting long sentences. To normalize for the number of parses, the uncertainty-based evaluation function, f unc , is defined as a measurement of similarity between the actual probability distribution of the parses and a hypothetical uniform distribution for that set of parses. In particular, we divide the tree entropy by the log of the number of parses:</Paragraph> <Paragraph position="25"> f_unc(w, G) = TE(w, G) / log |V|.</Paragraph> <Paragraph position="26"> Footnote 9: Note that P(w | v, G) = 1 for any v ∈ V, where V is the set of all possible parses for w, because v exists only when w is observed.</Paragraph> <Paragraph position="27"> We now derive the expression for TE(w, G).
Recall from equation (2) that if G produces a set of parses, V, for sentence w, the set of probabilities P(v |w, G) (for all v [?]V) defines the distribution of parsing likelihoods for sentence w:</Paragraph> <Paragraph position="28"> Note that P(v |w, G) can be viewed as a density function p(v) (i.e., the probability of assigning v to a random variable V). Mapping it back into the entropy definition from equation (3), we derive the tree entropy of w as follows:</Paragraph> <Paragraph position="30"> Using the bottom-up, dynamic programming technique (see the appendix for details) of computing inside probabilities (Lari and Young 1990), we can efficiently compute the probability of the sentence, P(w |G). Similarly, the algorithm can be modified to compute the quantity</Paragraph> <Paragraph position="32"> gives good TUV estimates to candidates for training PP-attachment models, it is not clear how a similar technique can be applied to training parsers. Whereas binary classification tasks can be described by binomial distributions, for which the confidence interval is well defined, a parsing model is made up of many multinomial classification decisions. We therefore need a way to characterize the confidence for each decision as well as a way to combine them into an overall confidence. Another difficulty is that the complexity of the induction algorithm deters us from reestimating the TUVs of the remaining candidates after selecting each new candidate. As we</Paragraph> </Section> </Section> <Section position="6" start_page="267" end_page="270" type="metho"> <SectionTitle> 10 When f </SectionTitle> <Paragraph position="0"> unc (w, G)=1, the parser is considered to be the most uncertain about a particular sentence. Instead of dividing tree entropies, one could have computed the Kullback-Leibler distance between the two distributions (in which case a score of zero would indicate the highest level of uncertainty). Because the selection is based on relative scores, as long as the function is monotonic, the exact form of the function should not have much impact on the outcome.</Paragraph> <Paragraph position="1"> Computational Linguistics Volume 30, Number 3 discussed in Section 3.3, reestimation is important for batched annotation. Without some means of updating the TUVs after each selection, the learner will not realize that it has already selected a candidate to train some parameter with low confidence until the retraining phase, which occurs only at the end of the batch selection; therefore, it may continue to select very similar candidates to train the same parameter. Even if we assume that the statistics can be updated, reestimating the TUVs is a computationally expensive operation. Essentially, all the remaining candidates that share some parameters with the selected candidate will need to be re-parsed. For these practical reasons, we do not include an evaluation function measuring confidence for the parsing experiment.</Paragraph> <Section position="1" start_page="268" end_page="270" type="sub_section"> <SectionTitle> 4.2 Experiments and Results </SectionTitle> <Paragraph position="0"> We compare the effectiveness of sample selection using the proposed evaluation functions against a baseline of random selection (f rand (w, G)=rand()). Similarly to previous experimental designs, the learner is given a small set of annotated seed data from the WSJ Treebank and a large set of unlabeled data (also from the WSJ Treebank but with the labels removed) from which to select new training examples. 
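Before the experimental details, the sentence-level scores of Section 4.1.2 can be collected into one short sketch. It assumes the per-parse probabilities P(v|G) for a candidate sentence have already been computed (in practice via the inside-probability algorithm mentioned above); the function name is illustrative.

import math

def parse_scores(parse_probs):
    # parse_probs holds P(v|G) for every parse v of a sentence w.
    # P(w|G) is their sum; normalizing gives P(v|w,G), whose entropy is the
    # tree entropy TE(w,G).  f_unc divides TE by log|V| so that sentences
    # with different numbers of parses can be compared.
    p_w = sum(parse_probs)                       # P(w|G)
    dist = [p / p_w for p in parse_probs]        # P(v|w,G)
    f_err = 1.0 - max(dist)
    tree_entropy = -sum(p * math.log2(p) for p in dist if p > 0.0)
    f_unc = tree_entropy / math.log2(len(dist)) if len(dist) > 1 else 0.0
    return f_err, f_unc

# The contrast from Section 4.1.2: four equally likely parses versus one parse
# at 0.02 with 99 parses near 0.01.  f_err separates them; f_unc barely does.
print(parse_scores([0.25] * 4))            # (0.75, 1.0)
print(parse_scores([0.02] + [0.0099] * 99))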
All training data are from Sections 2-21 of the treebank. We monitor the learning progress of the parser by testing it on unseen test sentences. We use Section 00 for development and Section 23 for testing. This study is repeated for two different models, the PLTIG parser and Collins's Model 2 parser.</Paragraph> <Paragraph position="1"> an induction algorithm (Hwa 2001a) based on the expectation-maximization (EM) principle that induces parsers for PLTIGs. The algorithm performs heuristic search through an iterative reestimation procedure to find local optima: sets of values for the grammar parameters that maximizes the grammar's likelihood of generating the training data. In principle, the algorithm supports unsupervised learning; however, because the search space has too many local optima, the algorithm tends to converge on a model that is unsuitable for parsing. Here, we consider a partially supervised variant in which we assume that the learner is given the phrasal boundaries of the training sentences but not the label of the constituent units. For example, the sentence Several fund managers expect a rough market this morning before prices stabilize. would be labeled as &quot;((Several fund managers) (expect ((a rough market) (this morning)) (before (prices stabilize))).)&quot; Our algorithm is similar to the approach taken by Pereira and Schabes (1992) for inducing PCFG parsers.</Paragraph> <Paragraph position="2"> Because the EM algorithm itself is an iterative procedure, performing sample selection on top of an EM-based learner is an extremely computational-intensive process. Here, we restrict the experiments for the PLTIG parsers to a smaller-scale study in the following two aspects. First, the lexical anchors of the grammar rules are backed off to part-of-speech tags; this restricts the size of the grammar vocabulary to 48. Second, the unlabeled candidate pool is set to contain 3,600 sentences, which is sufficiently large for inducing a grammar of this size. The initial model is trained on 500 labeled seed sentences. For each selection iteration, an additional 100 sentences are moved from the unlabeled pool to be labeled and added to the training set. After training, the updated parser is then tested on unseen sentences (backed off to their part-of-speech tags) and compared to the gold standard. Because the induced PLTIG produces binary-branching parse trees, which have more layers than the gold standard, we measure parsing accuracy in terms of the crossing-bracket metric. The study is repeated for 10 trials, each using a different portion of the full training set, to ensure statistical significance (using pairwise t-test at 95% confidence).</Paragraph> <Paragraph position="3"> PLTIG parser: (a) A comparison of the evaluation functions' learning curves. (b) A comparison of the evaluation functions for a test performance score of 80%.</Paragraph> <Paragraph position="4"> The results of the experiment are graphically shown in Figure 7. As with the PP-attachment studies, Figure 7(a) compares the learning curves of the proposed evaluation functions to that of the baseline. Note that even though these functions select examples in terms of entire sentences, the amount of annotation is measured in the graphs (x-axis) in terms of the number of brackets rather than sentences. Unlike in the PP-attachment case, the amount of effort from the annotators varies significantly from example to example. A short and simple sentence takes much less time to annotate than a long and complex sentence. 
We address this effect by approximating the amount of effort as the number of brackets the annotator needs to label. Thus, we deem one evaluation function more effective than another if, for the desired level of performance, the smallest set of sentences selected by the function contains fewer brackets than that of the other function. Figure 7(b) compares the evaluation functions at the final test performance level of 80%.</Paragraph> <Paragraph position="5"> Computational Linguistics Volume 30, Number 3 Qualitatively comparing the learning curves in the figure, we see that with the appropriate evaluation function, sample selection does reduce the amount of annotation. Similarly to our findings in the PP-attachment study, the simple problem-space-based evaluation function, f len , offers only little savings; its performance is nearly indistinguishable from that of the baseline, for the most part.</Paragraph> <Paragraph position="6"> The evaluation functions based on hypothesis performances, on the other hand, do reduce the amount of annotation in the training data. Of the two that we proposed for this category, the tree entropy evaluation function, f unc , has a slight edge over the error-driven evaluation function, f err .</Paragraph> <Paragraph position="7"> For a quantitative comparison, let us consider the set of grammars that achieve an average parsing accuracy of 80% on the test sentences. We consider a grammar to be comparable to that of the baseline if its mean test score is at least as high as that of the baseline and if the difference between the means is not statistically significant. The baseline case requires an average of about 38,000 brackets in the training data. In contrast, to induce a grammar that reaches the same 80% parsing accuracy with the examples selected by f unc , the learner requires, on average, 19,000 training brackets.</Paragraph> <Paragraph position="8"> Although the learning rate of f err is slower than that of f unc overall, it seems to have caught up in the end; it needs 21,000 training brackets, slightly more than f unc . While the simplistic sentence length evaluation function, f len , is less helpful, its learning rate still improves slightly faster than that of the baseline. A grammar of comparable quality can be induced from a set of training examples selected by f len containing an average of 28,000 brackets.</Paragraph> <Paragraph position="9"> is Collins's (1997) Model 2 parser, which uses a history-based learning algorithm that takes statistics directly over the treebank. As a fully supervised algorithm, it does not have to iteratively reestimate its parameters and is computationally efficient enough for us to carry out a large-scale experiment. For this set of studies, the unlabeled candidate pool consists of around 39,000 sentences. The initial model is trained on 500 labeled seed sentences, and at each selection iteration, an additional 100 sentences are moved from the unlabeled pool into the training set. The parsing performance on the test sentences is measured in terms of the parser's F-score, the harmonic average of the labeled precision and labeled recall rates over the constituents (Van Rijsbergen 1979).</Paragraph> <Paragraph position="10"> We plot the comparisons between different evaluation functions and the baseline for the history-based parser in Figure 8. The examples selected by the problem-space-based functions do not seem to be helpful. Their learning curves are, for the most part, slightly worse than the baseline. 
In contrast, the parsers trained on data selected by the error-driven and uncertainty-based functions learn faster than the baseline; and as before, f unc performs slightly better than f err .</Paragraph> <Paragraph position="11"> For the final parsing performance of 88%, the parser requires a baseline training set of 30,500 sentences annotated with about 695,000 constituents. The same performance can be achieved with a training set of 20,500 sentences selected by f err , which contains about 577,000 annotated constituents; or with a training set of 17,500 sentences selected by f unc , which contains about 505,000 annotated constituents, reducing the number of annotated constituents by 27%. Comparing the outcome of this experiment with that of Model 2 parser: (a) A comparison of the learning curves of the evaluation functions. (b) A comparison of all the evaluation functions at the test performance level of 88%. the experiment involving the EM-based learner, we see that the training data reduction rates are less dramatic than before. This may be because both f unc and f err ignore lexical items and chase after sentences containing words that rarely occur. Recent work by Tang, Luo, and Roukos (2002) suggests that a hybrid approach that combines features of the problem space and the uncertainty of the parser may result in better performance for lexicalized parsers.</Paragraph> </Section> </Section> class="xml-element"></Paper>