<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1306">
  <Title>Sample Selection for Statistical Grammar Induction</Title>
  <Section position="7" start_page="48" end_page="49" type="evalu">
    <SectionTitle>
6 Results
</SectionTitle>
    <Paragraph position="0"> The results of the two experiments are graphically depicted in Figure 2. We plot learning rates of the induction processes using training sentences selected by the three evaluation functions. The learning rate relates the quality of the induced grammars to the amount of supervised training data available. In order for the induced grammar to parse test sentences with higher accuracy (x-axis), more supervision (y-axis) is needed. The amount of supervision is measured in terms of the number of brackets rather than sentences because it more accurately quantifies the effort spent by the human annotator. Longer sentences tend to require more brackets than short ones, and thus take more time to analyze. We deem one evaluation function more effective than another if the smallest set of sentences it selected can train a grammar that performs at least as well as the grammar trained under the other function and if the selected data contains considerably fewer brackets than that of the other function.</Paragraph>
    <Paragraph position="1"> Figure 2(a) presents the outcomes of the first experiment, in which the evaluation functions select training examples out of a large candidate-pool. We see that overall, sample selection has a positive effect on the learning IThe unsupervised induction algorithm induces grammars that generate binary branching trees so that the number of proposed brackets in a sentence is always one fewer than the length of the sentence. The WSJ corpus, on the other hand, favors a more fiattened tree structure with considerably fewer brackets per sentence. The consistent bracketing metric does not unfairly penalize a proposed parse tree for being binary branching.</Paragraph>
    <Paragraph position="2"> 2We generate different candidate-pools by moving a fixed-size window across WSJ sections 02 through 05, advancing 400 sentences for each trial. Sec~n 23 is always used for testing.</Paragraph>
    <Paragraph position="3">  induced with the baseline (after 26 selection iterations) to the sets of grammars induced under the proposed evaluation functions (ften after 17 iterations, fte after 14 iterations). rate of the induction process. For the base-line case, the induction process uses frand, in which training sentences are randomly selected. The resulting grammars achieves an average parsing accuracy of 80.3% on the test sentences after seeing an average of 33355 brackets in the training data. The learning rate of the tree entropy evaluation function, fte, progresses much faster than the baseline.</Paragraph>
    <Paragraph position="4"> To induce a grammar that reaches the same 80.3% parsing accuracy with the examples selected by fte, the learner requires, on average, 21236 training brackets, reducing the amount of annotation by 36% comparing to the baseline. While the simplistic sentence length evaluation function, f~en, is less helpful, its learning rate still improves slightly faster than the baseline. A grammar of comparable quality can be induced from a set of training examples selected by fzen containing an average of 30288 brackets. This provides a small reduction of 9% from the baseline 3. We consider a set of grammars to be comparable to the base3In terms of the number of sentences, the baseline f~d used 2600 randomly chosen training sentences; .fze,~ selected the 1700 longest sentences as training data; and fte selected 1400 sentences.</Paragraph>
    <Paragraph position="5"> line if its mean test score is at least as high as that of the baseline and if the difference of the means is not statistically significant (using pair-wise t-test at 95% confidence). Table 1 summarizes the statistical significance of comparing the best set of baseline grammars with those of of f~en and ffte.</Paragraph>
    <Paragraph position="6"> Figure 2(b) presents the results of the second experiment, in which the evaluation functions only have access to a small candidate pool. Similar to the previous experiment, grammars induced from training examples selected by fte require significantly less annotations than the baseline. Under the baseline, frand, to train grammars with 78.5% parsing accuracy on test data, an average of 11699 brackets (in 900 sentences) is required. In contrast, fte can induce a comparable grammar with an average of 8559 brackets (in 600 sentences), providing a saving of 27% in the number of training brackets. The simpler evaluation function f~n out:performs the baseline as well; the 600 sentences it selected have an average of 9935 brackets. Table 2 shows the statistical significance of these comParisons.</Paragraph>
    <Paragraph position="7"> A somewhat surprising outcome of the second study is that the grammars induced from  induced with the baseline (after 9 selection iterations) to the sets of grammars induced under the proposed evaluation functions (ften after 6 iterations, fte after 6 and 8 iterations). the three methods did not parse with the same accuracy when all the sentences from the unlabeled pool have been added to the training set. Presenting the training examples in different orders changes the search path of the induction process. Trained on data selected by fte, the induced grammar parses the test sentences with 79.1% accuracy, a small but statistically significant improvement over the baseline. This suggests that, when faced with a dearth of training candidates, fte can make good use of the available data to induce grammars that are comparable to those directly induced from more data.</Paragraph>
  </Section>
class="xml-element"></Paper>