File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/01/p01-1010_intro.xml

Size: 2,111 bytes

Last Modified: 2025-10-06 14:01:13

<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1010">
  <Title>What is the Minimal Set of Fragments that Achieves Maximal Parse Accuracy?</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> One of the goals in statistical natural language parsing is to find the minimal set of statistical dependencies (between words and syntactic structures) that achieves maximal parse accuracy.</Paragraph>
    <Paragraph position="1"> Many stochastic parsing models use linguistic intuitions to find this minimal set, for example by restricting the statistical dependencies to the locality of headwords of constituents (Collins 1997, 1999; Eisner 1997), leaving it as an open question whether there exist important statistical dependencies that go beyond linguistically motivated dependencies. The Data Oriented Parsing (DOP) model, on the other hand, takes a rather extreme view on this issue: given an annotated corpus, all fragments (i.e. subtrees) seen in that corpus, regardless of size and lexicalization, are in principle taken to form a grammar (see Bod 1993, 1998; Goodman 1998; Sima'an 1999). The set of subtrees that is used is thus very large and extremely redundant. Both from a theoretical and from a computational perspective we may wonder whether it is possible to impose constraints on the subtrees that are used, in such a way that the accuracy of the model does not deteriorate or perhaps even improves. That is the main question addressed in this paper. We report on experiments carried out with the Penn Wall Street Journal (WSJ) treebank to investigate several strategies for constraining the set of subtrees. We found that the only constraints that do not decrease the parse accuracy consist in an upper bound of the number of words in the subtree frontiers and an upper bound on the depth of unlexicalized subtrees. We also found that counts of subtrees with several nonheadwords are important, resulting in improved parse accuracy over previous parsers tested on the WSJ.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML