<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1058">
<Title>Alternative Approaches for Generating Bodies of Grammar Rules</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle>
2 Overview
</SectionTitle>
<Paragraph position="0"> We want to build grammars using different algorithms for inducing their rules. Our main question is how different algorithms for inducing regular languages affect parsing performance with the resulting grammars. A second issue we want to explore is how the grammars perform when the quality of the training material is improved, that is, when the training material is separated into part-of-speech (POS) categories before the regular-language learning algorithms are run.</Paragraph>
<Paragraph position="1"> We first transform the PTB into projective dependency structures following (Collins, 1996). From the resulting treebank we delete all lexical information except POS tags. Every POS in a tree in the treebank has associated with it two different, possibly empty, sequences of left and right dependents. We extract these sequences from all trees, producing two sets: one containing the sequences of left dependents and one containing the sequences of right dependents.</Paragraph>
<Paragraph position="2"> These two sets form the training material used for building four different grammars. The four grammars differ along two dimensions: the number of automata used to build them and the algorithm used to induce the automata. As to the latter dimension, in Section 4 we use two algorithms: the Minimum Discriminative Information (MDI) algorithm and a bigram-based algorithm. As to the former dimension, two of the grammars are built using only two automata, each induced from one of the two sample sets generated from the PTB. The other two grammars are built using two automata per POS, exploiting a split of the training material into multiple sample sets, two per POS, each containing only those sequences in which the POS appears as the head.</Paragraph>
<Paragraph position="3"> The grammars built from the induced automata are so-called PCW-grammars (see Section 3), a formalism based on probabilistic context-free grammars (PCFGs); as we will see in Section 3, inferring them from automata is almost immediate.</Paragraph>
</Section>
</Paper>
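
The sequence-extraction step described in the second paragraph above can be pictured with a short sketch. The following Python fragment is a minimal illustration and is not taken from the paper: the toy tree encoding, the function name extract_sequences, and the example sentence are assumptions made only to show how left- and right-dependent POS sequences, both pooled and grouped per head POS, could be collected.

    # Minimal sketch (assumed encoding): each node is a tuple
    # (head POS, list of left-dependent nodes, list of right-dependent nodes).
    from collections import defaultdict

    def extract_sequences(node, left_seqs, right_seqs):
        """Recursively record, per head POS, the (possibly empty) POS sequences
        of its left and right dependents."""
        pos, left_children, right_children = node
        left_seqs[pos].append([child[0] for child in left_children])
        right_seqs[pos].append([child[0] for child in right_children])
        for child in left_children + right_children:
            extract_sequences(child, left_seqs, right_seqs)

    if __name__ == "__main__":
        # Toy tree for "the old man eats apples":
        # eats/VBZ has left dependent man/NN (with left dependents the/DT, old/JJ)
        # and right dependent apples/NNS.
        tree = ("VBZ",
                [("NN", [("DT", [], []), ("JJ", [], [])], [])],
                [("NNS", [], [])])

        left_seqs, right_seqs = defaultdict(list), defaultdict(list)
        extract_sequences(tree, left_seqs, right_seqs)

        # Pooled training sets (as used for the first pair of grammars) ...
        all_left = [seq for seqs in left_seqs.values() for seq in seqs]
        all_right = [seq for seqs in right_seqs.values() for seq in seqs]
        print(all_left)           # [['NN'], ['DT', 'JJ'], [], [], []]
        # ... versus per-POS training sets (as used for the second pair of grammars).
        print(right_seqs["VBZ"])  # [['NNS']]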