<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1206">
  <Title>The effect of alternative tree representations on tree bank grammars</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Probabilistic Context Free
Grammars
</SectionTitle>
    <Paragraph position="0"> A PCFG is a CFG in which each production A → α in the grammar's set of productions R is associated with an emission probability P(A → α) that satisfies a normalization constraint</Paragraph>
    <Paragraph position="1"> Σ_{α : A→α ∈ R} P(A → α) = 1 for each nonterminal A,</Paragraph>
    <Paragraph position="2"> and a consistency or tightness constraint not discussed here.</Paragraph>
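The normalization constraint can be illustrated with a short sketch that checks, for each parent nonterminal, that its rule probabilities sum to one. The toy grammar, its rule set, and all probabilities below are invented for the example, not taken from the paper.

```python
from collections import defaultdict

# Hypothetical toy PCFG: (parent, children) -> emission probability.
# Rules and probabilities are illustrative only.
rules = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.7,
    ("NP", ("NNP",)): 0.3,
    ("VP", ("VBD", "NP")): 0.6,
    ("VP", ("VBD",)): 0.4,
}

def is_normalized(rules, tol=1e-9):
    """Check Σ_α P(A → α) = 1 for every parent nonterminal A."""
    totals = defaultdict(float)
    for (parent, _), p in rules.items():
        totals[parent] += p
    return all(abs(total - 1.0) < tol for total in totals.values())
```

Note that this checks only normalization; the consistency (tightness) constraint mentioned above is a separate condition on the grammar as a whole.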
    <Paragraph position="3"> A PCFG defines a probability distribution over the (finite) parse trees generated by the grammar, where the probability of a tree τ is given by</Paragraph>
    <Paragraph position="4"> P(τ) = Π_{A→α ∈ R} P(A → α)^{C_τ(A→α)},</Paragraph>
    <Paragraph position="5"> where C_τ(A → α) is the 'count' of the local tree consisting of a parent node labelled A with a sequence of immediate children nodes labelled α in τ, or equivalently, the number of times the production A → α is used in the derivation τ.</Paragraph>
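The product-of-rule-probabilities definition of P(τ) can be sketched as follows. The nested-tuple tree encoding, the rule probabilities, and the helper names are assumptions made for illustration.

```python
from collections import Counter

# A parse tree as nested tuples: (label, child, ...); leaves are word strings.
# Tree and probabilities are illustrative.
tree = ("S", ("NP", ("NNP", "Mary")), ("VP", ("VBD", "slept")))

rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("NNP",)): 0.3,
    ("VP", ("VBD",)): 0.4,
    ("NNP", ("Mary",)): 0.1,
    ("VBD", ("slept",)): 0.2,
}

def local_tree_counts(tree):
    """C_tau(A -> alpha): how often each production occurs in the tree."""
    counts = Counter()
    def walk(node):
        if isinstance(node, tuple):
            children = tuple(c[0] if isinstance(c, tuple) else c for c in node[1:])
            counts[(node[0], children)] += 1
            for c in node[1:]:
                walk(c)
    walk(tree)
    return counts

def tree_probability(tree, rule_prob):
    """P(tau) = product over rules of P(A -> alpha) ** C_tau(A -> alpha)."""
    p = 1.0
    for rule, count in local_tree_counts(tree).items():
        p *= rule_prob[rule] ** count
    return p
```

Here every production in the toy tree occurs once, so P(τ) is simply the product of the five rule probabilities.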
    <Paragraph position="6"> The PCFG which assigns maximum likelihood to a tree bank corpus 𝒯 is given by the relative frequency estimator</Paragraph>
    <Paragraph position="7"> P(A → α) = C_𝒯(A → α) / Σ_{α'} C_𝒯(A → α').</Paragraph>
    <Paragraph position="8"> Here C_𝒯(A → α) refers to the 'count' of the local tree in the tree bank, or equivalently, the number of times the production A → α would be used in derivations of exactly the trees in 𝒯.</Paragraph>
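A minimal sketch of the relative frequency estimator, assuming the same nested-tuple tree encoding as above; the tiny two-tree "tree bank" is invented for illustration.

```python
from collections import Counter, defaultdict

# Illustrative tree bank of two toy trees (nested tuples, leaves are words).
treebank = [
    ("S", ("NP", ("NNP", "Mary")), ("VP", ("VBD", "slept"))),
    ("S", ("NP", ("NNP", "John")), ("VP", ("VBD", "ran"))),
]

def count_rules(trees):
    """C_T(A -> alpha): production counts over the whole tree bank."""
    counts = Counter()
    def walk(node):
        if isinstance(node, tuple):
            children = tuple(c[0] if isinstance(c, tuple) else c for c in node[1:])
            counts[(node[0], children)] += 1
            for c in node[1:]:
                walk(c)
    for t in trees:
        walk(t)
    return counts

def relative_frequency_pcfg(trees):
    """P(A -> alpha) = C_T(A -> alpha) / sum over alpha' of C_T(A -> alpha')."""
    counts = count_rules(trees)
    parent_totals = defaultdict(int)
    for (parent, _), c in counts.items():
        parent_totals[parent] += c
    return {rule: c / parent_totals[rule[0]] for rule, c in counts.items()}

pcfg = relative_frequency_pcfg(treebank)
```

In this toy corpus NNP rewrites as "Mary" once and as "John" once, so each gets probability 0.5, while S → NP VP occurs in every tree and gets probability 1.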
    <Paragraph position="9"> It is practical to induce PCFGs from tree bank corpora and find maximum likelihood parses for such PCFGs using relatively modest computing equipment. All the experiments reported here used the Penn II Wall Street Journal (WSJ) corpus, modified as described by Charniak (Charniak, 1996), i.e., empty nodes were deleted, and all components of node labels other than the syntactic category were removed.</Paragraph>
    <Paragraph position="10"> Grammar induction or training used the 39,832 trees in sections 2-21 of the Penn II WSJ corpus, and testing was performed on the 1,576 sentences of length 40 or less in section 22 of the corpus. Parsing was performed using an exhaustive CKY parser that returned a maximum likelihood parse. Ties between equally likely parses were broken randomly; on the tree bank grammar this leads to fluctuations in labelled precision and recall with a standard deviation of approximately 0.07%.</Paragraph>
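The maximum-likelihood parsing step can be illustrated with a compact probabilistic CKY for a grammar in Chomsky normal form. The grammar, sentence, and all names below are hypothetical, and unlike the parser used in the experiments, this sketch does not break ties randomly (it keeps the first-found best analysis).

```python
from collections import defaultdict

# Illustrative CNF grammar: binary rules A -> B C and lexical rules A -> word.
binary = {("NP", "VP"): [("S", 1.0)], ("DT", "NN"): [("NP", 1.0)]}
lexical = {"the": [("DT", 1.0)], "dog": [("NN", 0.5)], "barks": [("VP", 1.0)]}

def cky(words):
    """Exhaustive probabilistic CKY: best[i][j][A] is the max probability
    of A spanning words[i:j]; back stores backpointers for tree recovery."""
    n = len(words)
    best = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    back = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for a, p in lexical.get(w, []):
            best[i][i + 1][a] = p
            back[i][i + 1][a] = w
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for b, pb in best[i][k].items():
                    for c, pc in best[k][j].items():
                        for a, pr in binary.get((b, c), []):
                            p = pr * pb * pc
                            if p > best[i][j][a]:
                                best[i][j][a] = p
                                back[i][j][a] = (k, b, c)
    return best, back

def build(back, i, j, a):
    """Recover the maximum likelihood parse tree from the backpointers."""
    bp = back[i][j][a]
    if isinstance(bp, str):
        return (a, bp)
    k, b, c = bp
    return (a, build(back, i, k, b), build(back, k, j, c))

words = ["the", "dog", "barks"]
best, back = cky(words)
parse = build(back, 0, len(words), "S")
```

The chart fills all O(n²) spans and considers every split point, so the parse returned is the global maximum likelihood analysis under the toy grammar.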
  </Section>
</Paper>