<?xml version="1.0" standalone="yes"?>
<Paper uid="J04-3001">
  <Title>Sample Selection for Statistical Parsing</Title>
  <Section position="8" start_page="272" end_page="274" type="concl">
    <SectionTitle>
6. Conclusion
</SectionTitle>
    <Paragraph position="0"> In this article, we have argued that sample selection is a powerful learning technique for reducing the amount of human-labeled training data. Our empirical studies suggest that sample selection is helpful not only for binary classification tasks such as PPattachment, but also for applications that generate complex outputs such as syntactic parsing.</Paragraph>
    <Paragraph position="1"> We have proposed several criteria for predicting the training utility of the unlabeled candidates and developed evaluation functions to rank them. We have conducted experiments to compare the functions' ability to select the most helpful training examples. We have found that the uncertainty criterion is a good predictor that consistently finds helpful examples. In our experiments, evaluation functions that factor in the uncertainty criterion consistently outperform the baseline of random selection across different tasks and learning algorithms. For learning a PP-attachment model, the most helpful evaluation function is a hybrid that factors in the prediction performance of the hypothesis and the confidence for the values of the parameters of the hypothesis. For training a parser, we found that uncertainty-based evaluation functions that use tree entropy were the most helpful for both the EM-based learner and the history-based learner.</Paragraph>
    <Paragraph position="2"> The current work points us in several future directions. First, we shall continue to develop alternative formulations of evaluation functions to improve the learning rates of parsers. Under the current framework, we did not experiment with any hypothesis-parameter-based evaluation functions for the parser induction task; how- null Hwa Sample Selection for Statistical Parsing ever, hypothesis-parameter-based functions may be feasible under a multilearner setting, using parallel machines. Second, while in this work we focused on selecting entire sentences as training examples, we believe that further reduction in the amount of annotated training data might be possible if the system could ask the annotators more-specific questions. For example, if the learner is unsure only of a local decision within a sentence (such as a PP-attachment ambiguity), the annotator should not have to label the entire sentence.</Paragraph>
    <Paragraph position="3"> In order to allow for finer-grained interactions between the system and the annotators, we have to address some new challenges. To begin with, we must weigh in other factors in addition to the amount of annotations. For instance, the learner may ask about multiple substrings in one sentence. Even if the total number of labels were fewer, the same sentence would still need to be mentally processed by the annotators multiple times. This situation is particularly problematic when there are very few annotators, as it becomes much more likely that a person will encounter the same sentence many times. Moreover, we must ensure that the questions asked by the learner are well-formed. If the learner were simply to present the annotator with some substring that it could not process, the substring might not form a proper linguistic constituent for the annotator to label. Additionally, we are interested in exploring the interaction between sample selection and other semisupervised approaches such as boosting, reranking, and cotraining. Finally, based on our experience with parsing, we believe that active-learning techniques may be applicable to other tasks that produce complex outputs such as machine translation.</Paragraph>
    <Paragraph position="4"> Appendix: Efficient Computation of Tree Entropy As discussed in Section 4.1.2, for learning tasks such as parsing, the number of possible classifications is so large that it may not be computationally efficient to calculate the degree of uncertainty using the tree entropy definition. In the equation for the tree entropy of w (TE(w, G)) presented in Section 4.1.2, the computation requires summing over all possible parses, but the number of possible parses for a sentence grows exponentially with respect to the sentence length. In this appendix, we show that tree entropy can be efficiently computed using dynamic programming.</Paragraph>
    <Paragraph position="5"> For illustrative purposes, we describe the computation process using a PCFG expressed in Chomsky normal form.</Paragraph>
    <Paragraph position="6">  The basic idea is to compose the tree entropy of the entire sentence from the tree entropy of the subtrees. The process is similar to that for computing the probability of the entire sentence from the probabilities of sub-strings (called Inside Probabilities). We follow the notation convention of Lari and Young (1990).</Paragraph>
    <Paragraph position="7"> The inside probability of a nonterminal X generating the substring w</Paragraph>
    <Paragraph position="9"> denoted as e(X, i, j); it is the sum of the probabilities of all possible subtrees that have X as the root and w</Paragraph>
    <Paragraph position="11"> as the leaf nodes. We define a new function h(X, i, j) to represent the corresponding entropy for the substring: h(X, i, j)=[?]  Computational Linguistics Volume 30, Number 3 Analogously to the computation of inside probabilities, we compute h(X, i, j) recursively. The base case is when the nonterminal X generates a single token substring</Paragraph>
    <Paragraph position="13"> Therefore, the tree entropy is h(X, i, i)=e(X, i, i) lg(e(X, i, i)) For the general case, h(X, i, j), we must find all rules of the form X - YZ, where Y and Z are nonterminals, that have contributed toward X  let x represent the parse step of X - YZ. Then, there are a total of bardblYbardblxbardblZbardbl parses, and the probability of each parse is P(x)P(y)P(z), where y [?]Yand z [?]Z. To compute  = [?]P(x) lg(P(x))e(Y, i, k)e(Z, k + 1, j)+P(x)h(Y, i, k)e(Z, k + 1, j) +P(x)e(Y, i, k)h(Z, k + 1, j) Thus, the tree entropy of the entire sentence can be recursively computed from the entropy values of the substrings.</Paragraph>
  </Section>
class="xml-element"></Paper>