<?xml version="1.0" standalone="yes"?> <Paper uid="J98-4004"> <Title>PCFG Models of Linguistic Tree Representations</Title> <Section position="2" start_page="0" end_page="614" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Probabilistic context-free grammars (PCFGs) provide simple statistical models of natural languages. The relative frequency estimator provides a straightforward way of inducing these grammars from treebank corpora, and a broad-coverage parsing system can be obtained by using a parser to find a maximum-likelihood parse tree for the input string with respect to such a treebank grammar. PCFG parsing systems often perform as well as other simple broad-coverage parsing systems for predicting tree structure from part-of-speech (POS) tag sequences (Charniak 1996). While PCFG models do not perform as well as models that are sensitive to a wider range of dependencies (Collins 1996), their simplicity makes them straightforward to analyze both theoretically and empirically. Moreover, since more sophisticated systems can be viewed as refinements of the basic PCFG model (Charniak 1997), it seems reasonable to first attempt to better understand the properties of PCFG models themselves.</Paragraph> <Paragraph position="1"> It is well known that natural language exhibits dependencies that context-free grammars (CFGs) cannot describe (Culy 1985; Shieber 1985). But the statistical independence assumptions embodied in a particular PCFG description of a particular natural language construction are in general much stronger than the requirement that the construction be generated by a CFG. 
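As a minimal sketch of the relative frequency estimator just described (not the paper's own code; the tree encoding and function names are hypothetical), a PCFG can be induced from a treebank by counting local trees and normalizing by parent category:

```python
from collections import Counter

def extract_productions(tree):
    """Yield (parent_label, tuple_of_child_labels) for each local tree.
    A tree is a (label, children) pair; a leaf is a bare string (a POS tag)."""
    label, children = tree
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, child_labels)
    for c in children:
        if not isinstance(c, str):
            yield from extract_productions(c)

def relative_frequency_pcfg(treebank):
    """Estimate P(A -> beta) = count(A -> beta) / count(A)."""
    rule_counts = Counter()
    parent_counts = Counter()
    for tree in treebank:
        for parent, rhs in extract_productions(tree):
            rule_counts[(parent, rhs)] += 1
            parent_counts[parent] += 1
    return {rule: n / parent_counts[rule[0]] for rule, n in rule_counts.items()}

# Toy treebank of two trees whose leaves are POS tags
treebank = [
    ("ROOT", [("NP", ["DT", "NN"]), ("VP", ["VB"])]),
    ("ROOT", [("NP", ["NN"]), ("VP", ["VB"])]),
]
pcfg = relative_frequency_pcfg(treebank)
print(pcfg[("NP", ("DT", "NN"))])  # 0.5
```

A parser that maximizes the product of these rule probabilities over a derivation then yields the maximum-likelihood parse with respect to the treebank grammar.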
We show below that the PCFG extension of what seems to be an adequate CFG description of PP attachment constructions performs no better than PCFG models estimated from non-CFG accounts of the same constructions.</Paragraph> <Paragraph position="2"> More specifically, this paper studies the effect of varying the tree structure representation of PP modification from both a theoretical and an empirical point of view. It compares PCFG models induced from treebanks using several different tree representations, including the representation used in the Penn II treebank corpora (Marcus, Santorini, and Marcinkiewicz 1993) and the &quot;Chomsky adjunction&quot; representation now standardly assumed in generative linguistics. [* Department of Cognitive and Linguistic Sciences, Box 1978, Providence, RI 02912. (c) 1998 Association for Computational Linguistics. Computational Linguistics, Volume 24, Number 4.]</Paragraph> <Paragraph position="3"> One of the weaknesses of a PCFG model is that it is insensitive to nonlocal relationships between nodes. If these relationships are significant then a PCFG will be a poor language model. Indeed, the sense in which the set of trees generated by a CFG is &quot;context free&quot; is precisely that the label on a node completely characterizes the relationships between the subtree dominated by the node and the nodes that properly dominate this subtree.</Paragraph> <Paragraph position="4"> Roughly speaking, the more nodes in the trees of the training corpus, the stronger the independence assumptions in the PCFG language model induced from those trees. For example, a PCFG induced from a corpus of completely flat trees (i.e., consisting of the root node immediately dominating a string of terminals) generates precisely the strings of the training corpus with likelihoods equal to their relative frequencies in that corpus. 
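The flat-tree observation above can be verified directly: each completely flat tree contributes exactly one production, ROOT immediately rewriting as the whole tag string, so the relative frequency estimate of that production is the string's relative frequency, and the string's one-rule derivation has exactly that likelihood. A toy illustration (the corpus is invented):

```python
from collections import Counter

# Toy corpus of POS-tag strings, each analyzed as a completely flat tree
corpus = [("DT", "NN", "VB"), ("DT", "NN", "VB"), ("NN", "VB")]

# Each flat tree yields the single production ROOT -> tag sequence, so the
# relative frequency estimator assigns it probability count / corpus size.
rule_probs = {rhs: n / len(corpus) for rhs, n in Counter(corpus).items()}

# A string's likelihood is the probability of its one-rule derivation,
# i.e., exactly its relative frequency in the training corpus.
print(rule_probs[("DT", "NN", "VB")])  # 2/3
```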
Thus the location and labeling of the nonroot nonterminal nodes determine how a PCFG induced from a treebank generalizes from that training data. Generally, one might expect that the fewer the nodes in the training corpus trees, the weaker the independence assumptions in the induced language model. For this reason, a &quot;flat&quot; tree representation of PP modification is investigated here as well.</Paragraph> <Paragraph position="5"> A second method of relaxing the independence assumptions implicit in a PCFG is to encode more information in each node's label. Here the intuition is that the label on a node is a &quot;communication channel&quot; that conveys information between the subtree dominated by the node and the part of the tree not dominated by this node, so all other things being equal, appending to the node's label additional information about the context in which the node appears should make the independence assumptions implicit in the PCFG model weaker. The effect of adding a particularly simple kind of contextual information--the category of the node's parent--is also studied in this paper.</Paragraph> <Paragraph position="6"> Whether either of these two PCFG models outperforms a PCFG induced from the original treebank is a separate question. We face a classical &quot;bias versus variance&quot; dilemma here (Geman, Bienenstock, and Doursat 1992): as the independence assumptions implicit in the PCFG model are weakened, the number of parameters that must be estimated (i.e., the number of productions) increases. 
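The parent-annotation transform just mentioned can be sketched as follows; this is a hypothetical minimal implementation (the "label^parent" notation and function name are assumptions, not the paper's), using the same (label, children) tree encoding as before:

```python
def annotate_parents(tree, parent=None):
    """Append the parent's category to each nonroot nonterminal label,
    e.g. NP becomes NP^S under an S node. Leaves (POS tags) are untouched."""
    label, children = tree
    new_label = label if parent is None else label + "^" + parent
    new_children = [c if isinstance(c, str) else annotate_parents(c, label)
                    for c in children]
    return (new_label, new_children)

tree = ("S", [("NP", ["DT", "NN"]), ("VP", ["VB", ("NP", ["NN"])])])
print(annotate_parents(tree))
# ('S', [('NP^S', ['DT', 'NN']), ('VP^S', ['VB', ('NP^VP', ['NN'])])])
```

Note how the two NP nodes, identical in the original tree, receive distinct labels NP^S and NP^VP, so the induced PCFG can give subject and object NPs different expansion distributions, at the cost of a larger grammar.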
Thus while moving to a class of models with weaker independence assumptions permits us to more accurately describe a wider class of distributions (i.e., it reduces the bias implicit in the estimator), in general our estimate of these parameters will be less accurate simply because there are more of them to estimate from the same data (i.e., the variance in the estimator increases).</Paragraph> <Paragraph position="7"> This paper studies the effects of these differing tree representations of PP modification theoretically by considering their effect on very simple corpora, and empirically by means of a tree transformation/detransformation methodology introduced below.</Paragraph> <Paragraph position="8"> The corpus used as the source for the empirical study is version II of the Wall Street Journal (WSJ) corpus constructed at the University of Pennsylvania, modified as described in Charniak (1996), in that: * root nodes (labeled ROOT) were inserted, * the terminal or lexical items were deleted (i.e., the terminal items in the trees were POS tags), * node labels consisted solely of syntactic category information (e.g., grammatical function and coindexation information was removed),</Paragraph> </Section></Paper>