<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1010"> <Title>Supervised Grammar Induction using Training Data with Limited Constituent Information *</Title> <Section position="2" start_page="0" end_page="73" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The availability of large hand-parsed corpora such as the Penn Treebank has made high-quality statistical parsers possible. However, the parsers risk becoming so tailored to the labeled training data that they cannot reliably process sentences from an arbitrary domain. Thus, while a parser trained on the Wall Street Journal corpus can fairly accurately parse a new Wall Street Journal article, it may not perform as well on a New Yorker article.</Paragraph> <Paragraph position="1"> To parse sentences from a new domain, one would normally directly induce a new grammar * This material is based upon work supported by the National Science Foundation under Grant No. IRI 9712068. We thank Stuart Shieber for his guidance, and Lillian Lee, Ric Crabbe, and the three anonymous reviewers for their comments on the paper.</Paragraph> <Paragraph position="2"> from that domain, a training process that would require hand-parsed sentences from the new domain. Because parsing a large corpus by hand is a labor-intensive task, it would be beneficial to minimize the number of labels needed to induce the new grammar.</Paragraph> <Paragraph position="3"> We propose instead to adapt a grammar already trained on an old domain to the new domain.</Paragraph> <Paragraph position="4"> Adaptation can exploit the structural similarity between the two domains, so that less labeled data might be needed to update the grammar to reflect the structure of the new domain.</Paragraph> <Paragraph position="5"> This paper presents a quantitative study comparing direct induction and adaptation under different training conditions. 
Our goal is to understand the effect of the amount and type of labeled data on the training process for both induction strategies. For example, how much training data needs to be hand-labeled? Must the parse trees for each sentence be fully specified? Are some linguistic constituents in the parse more informative than others? To answer these questions, we have performed experiments that compare the parsing quality of grammars induced under different training conditions using both adaptation and direct induction. We vary both the number of labeled brackets and the linguistic classes of the labeled brackets. The study is conducted on both a simple Air Travel Information System (ATIS) corpus (Hemphill et al., 1990) and the more complex Wall Street Journal (WSJ) corpus (Marcus et al., 1993).</Paragraph> <Paragraph position="6"> Our results show that the training examples do not need to be fully parsed for either strategy, but adaptation produces better grammars than direct induction when the training data are minimally labeled. For instance, the most informative brackets, which label constituents higher up in the parse trees, typically identifying complex noun phrases and sentential clauses, account for only 17% of all constituents in ATIS and 21% in WSJ. Trained on this type of label, the adapted grammars parse better than the directly induced grammars and almost as well as those trained on fully labeled data. When trained on ATIS sentences labeled only with higher-level constituent brackets, a directly induced grammar parses test sentences with 66% accuracy, whereas an adapted grammar parses with 91% accuracy, only 2% lower than the score of a grammar induced from fully labeled training data. 
When trained on WSJ sentences labeled only with higher-level constituent brackets, a directly induced grammar parses with 70% accuracy, whereas an adapted grammar parses with 72% accuracy, 6% lower than the score of a grammar induced from fully labeled training data.</Paragraph> <Paragraph position="7"> That the most informative brackets are higher-level constituents making up only about one-fifth of all labels in the corpus has two implications. First, it points to a potential reduction of labor for human annotators.</Paragraph> <Paragraph position="8"> Although the annotator must still process an entire sentence mentally, identifying higher-level structures such as sentential clauses and complex noun phrases should be less tedious than fully specifying the complete parse tree for each sentence. Second, one might speculate about replacing human supervision altogether with a partial parser that locates constituent chunks within a sentence. However, since our results indicate that the most informative constituents are higher-level phrases, such a parser would have to identify sentential clauses and complex noun phrases rather than low-level base noun phrases.</Paragraph> </Section> </Paper>