<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1010"> <Title>Supervised Grammar Induction using Training Data with Limited Constituent Information *</Title> <Section position="8" start_page="78" end_page="78" type="concl"> <SectionTitle> 6 Conclusion and Future Work </SectionTitle> <Paragraph position="0"> In this study, we have shown that the structure of a grammar can be reliably learned without fully specified constituent information in the training sentences, and that the most informative constituents of a sentence are higher-level phrases, which make up only a small percentage of the total number of constituents.</Paragraph> <Paragraph position="1"> Moreover, we observe that grammar adaptation works particularly well with this type of sparse but informative training data. An adapted grammar consistently outperforms a directly induced grammar, even when adapting from a simpler corpus to a more complex one.</Paragraph> <Paragraph position="2"> These results point us to three future directions. First, the finding that labels for some constituents are more informative than others implies that sentences containing more of these informative constituents make better training examples. It may therefore be beneficial to estimate the informational content of potential (unmarked) training sentences and to include in the training set only those sentences predicted to have high information values. Filtering out unhelpful sentences from the training set reduces unnecessary work for the human annotators. Second, although our experiments show that a sparsely labeled training set is more of an obstacle for the direct induction approach than for the grammar adaptation approach, the direct induction strategy might also benefit from a two-stage learning process similar to that used for grammar adaptation.
Instead of training on a different corpus in each stage, the grammar can be trained on a small but fully labeled portion of the corpus in the first stage and on the sparsely labeled portion in the second stage. Finally, higher-level constituents have proved to be the most informative linguistic units. To free human annotators from labeling any training data, we should consider using partial parsers that can automatically detect complex nouns and sentential clauses.</Paragraph> </Section> </Paper>