<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1022">
  <Title>A Probabilistic Corpus-Driven Model for Lexical-Functional Analysis</Title>
  <Section position="2" start_page="0" end_page="145" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Data-Oriented Parsing (DOP) models of natural language embody the assumption that human language perception and production work with representations of past language experiences, rather than with abstract grammar rules (cf. Bod 1992, 95; Scha 1992; Sima'an 1995; Rajman 1995). DOP models therefore maintain large corpora of linguistic representations of previously occurring utterances.</Paragraph>
    <Paragraph position="1"> New utterances are analyzed by combining (arbitrarily large) fragments from the corpus; the occurrence-frequencies of the fragments are used to determine which analysis is the most probable one. In accordance with the general DOP architecture outlined by Bod (1995), a particular DOP model is described by specifying settings for the following four parameters:
* a formal definition of a well-formed representation for utterance analyses,
* a set of decomposition operations that divide a given utterance analysis into a set of fragments,
* a set of composition operations by which such fragments may be recombined to derive an analysis of a new utterance, and
* a definition of a probability model that indicates how the probability of a new utterance analysis is computed on the basis of the probabilities of the fragments that combine to make it up.</Paragraph>
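    <Paragraph> The four parameters above can be pictured as one pluggable container type, so that Tree-DOP, LFG-DOP, and other instantiations differ only in the callables supplied. This is a sketch under our own naming, not the paper's notation:

```python
# A sketch (field names are mine, not the paper's) of the four DOP
# parameters as a single container of pluggable callables, so that
# different DOP models are different instantiations of one type.

from dataclasses import dataclass
from typing import Any, Callable, Iterable

@dataclass
class DOPModel:
    is_well_formed: Callable[[Any], bool]        # representation definition
    decompose: Callable[[Any], Iterable[Any]]    # analysis to fragments
    compose: Callable[[Any, Any], Any]           # fragment + fragment to analysis
    probability: Callable[[Any], float]          # analysis to probability
```

A concrete model then fills each slot, e.g. with the Root/Frontier decomposition and node-substitution composition of Tree-DOP.</Paragraph>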
    <Paragraph position="2"> Previous instantiations of the DOP architecture were based on utterance-analyses represented as surface phrase-structure trees (&amp;quot;Tree-DOP&amp;quot;; e.g. Bod 1993; Rajman 1995; Sima'an 1995; Goodman 1996; Bonnema et al. 1997). Tree-DOP uses two decomposition operations that produce connected subtrees of utterance representations: (1) the Root operation selects any node of a tree to be the root of the new subtree and erases all nodes except the selected node and the nodes it dominates; (2) the Frontier operation then chooses a set (possibly empty) of nodes in the new subtree different from its root and erases all subtrees dominated by the chosen nodes. The only composition operation used by Tree-DOP is a node-substitution operation that replaces the left-most nonterminal frontier node in a subtree with a fragment whose root category matches the category of the frontier node. Thus Tree-DOP provides tree representations for new utterances by combining fragments from a corpus of phrase structure trees.</Paragraph>
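    <Paragraph> A minimal sketch (not from the paper) of these decomposition and composition operations: trees are Node objects; a node whose children field is None is an open substitution site, and a node with an empty child list is a terminal leaf. All names are illustrative.

```python
# Sketch of Tree-DOP's Root/Frontier decomposition and node-substitution
# composition, under the toy tree representation described above.

import copy

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children  # None marks an open (frontier) nonterminal

def root_op(node):
    """Root operation: make the selected node the root of a new subtree,
    keeping only the nodes it dominates."""
    return copy.deepcopy(node)

def frontier_op(subtree, chosen_labels):
    """Frontier operation: erase the subtrees dominated by the chosen
    nodes (identified by label here, for simplicity), leaving them as
    open substitution sites. The root itself is never chosen."""
    t = copy.deepcopy(subtree)
    def walk(n, is_root):
        if (not is_root) and n.label in chosen_labels:
            n.children = None
        elif n.children:
            for c in n.children:
                walk(c, False)
    walk(t, True)
    return t

def substitute(tree, fragment):
    """Composition: replace the leftmost open nonterminal whose category
    matches the fragment's root category."""
    t = copy.deepcopy(tree)
    def leftmost_open(n):
        if n.children is None:
            return n
        for c in n.children:
            hit = leftmost_open(c)
            if hit is not None:
                return hit
        return None
    site = leftmost_open(t)
    assert site is not None and site.label == fragment.label
    site.children = copy.deepcopy(fragment).children
    return t
```

Tree-DOP's requirement that substitution targets the left-most nonterminal frontier node is enforced by the depth-first, left-to-right search in leftmost_open.</Paragraph>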
    <Paragraph position="3"> A Tree-DOP representation R can typically be derived in many different ways. If each derivation D has a probability P(D), then the probability of deriving R is the sum of the individual derivation probabilities: P(R) = Σ_{D derives R} P(D)</Paragraph>
    <Paragraph position="5"> A Tree-DOP derivation D = &lt;t1, t2, ..., tk&gt; is produced by a stochastic branching process. It starts by randomly choosing a fragment t1 labeled with the initial category (e.g. S). At each subsequent step, a next fragment is chosen at random from among the set of competitors for composition into the current subtree. The process stops when a tree results with no nonterminal leaves. Let CP(t | CS) denote the probability of choosing a tree t from a competition set CS containing t. Then the probability of a derivation is P(&lt;t1, t2, ..., tk&gt;) = ∏_i CP(ti | CSi), where the competition probability CP(t | CS) is given by CP(t | CS) = P(t) / Σ_{t' ∈ CS} P(t'). Here, P(t) is the fragment probability for t in a given corpus. Let Ti-1 = t1 ∘ t2 ∘ ... ∘ ti-1 be the subanalysis just before the ith step of the process, let LNC(Ti-1) denote the category of the leftmost nonterminal of Ti-1, and let r(t) denote the root category of a fragment t. Then the competition set at the ith step is CSi = { t : r(t) = LNC(Ti-1) }. That is, the competition sets for Tree-DOP are determined by the category of the leftmost nonterminal of the current subanalysis. This is not the only possible definition of competition set. As Manning and Carpenter (1997) have shown, the competition sets can be made dependent on the composition operation. Their left-corner language model would also apply to Tree-DOP, yielding a different definition for the competition sets. But the properties of such Tree-DOP models have not been investigated.</Paragraph>
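    <Paragraph> A toy numerical sketch of this probability model; the fragment table and counts are invented for illustration, and a fragment is reduced to a (root category, id) pair standing in for a real subtree:

```python
# Toy sketch of the Tree-DOP probability model: fragment probabilities
# by relative frequency, competition probabilities, derivation
# probabilities, and the sum over derivations. All data is invented.

from collections import defaultdict

counts = {
    ("S", "s1"): 2, ("S", "s2"): 1,
    ("NP", "np1"): 3, ("NP", "np2"): 1,
}

# P(t): relative frequency among fragments sharing t's root category
total_per_root = defaultdict(int)
for (root, _), c in counts.items():
    total_per_root[root] += c

def fragment_prob(frag):
    return counts[frag] / total_per_root[frag[0]]

def competition_prob(frag, competition_set):
    # CP(t | CS) = P(t) / sum of P(t') for t' in CS
    z = sum(fragment_prob(t) for t in competition_set)
    return fragment_prob(frag) / z

def derivation_prob(fragments, competition_sets):
    # P(t1 ... tk) = product over i of CP(ti | CSi)
    p = 1.0
    for frag, cs in zip(fragments, competition_sets):
        p *= competition_prob(frag, cs)
    return p

def representation_prob(derivations):
    # P(R) = sum of P(D) over all derivations D that yield R
    return sum(derivation_prob(d, cs) for d, cs in derivations)
```

Because each competition set here contains exactly the fragments whose root matches the leftmost nonterminal, and P(t) is already normalized per root category, CP(t | CS) reduces to P(t); the more general CP formulation matters once competition sets are defined differently, as in the left-corner variant mentioned above.</Paragraph>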
    <Paragraph position="6"> Experiments with Tree-DOP on the Penn Treebank and the OVIS corpus show a consistent increase in parse accuracy when larger and more complex subtrees are taken into account (cf. Bod 1993, 95, 98; Bonnema et al. 1997; Sekine &amp; Grishman 1995; Sima'an 1995). However, Tree-DOP is limited in that it cannot account for underlying syntactic (and semantic) dependencies that are not reflected directly in a surface tree. All modern linguistic theories propose more articulated representations and mechanisms in order to characterize such linguistic phenomena. DOP models for a number of richer representations have been explored (van den Berg et al. 1994; Tugwell 1995), but these approaches have remained context-free in their generative power. In contrast, Lexical-Functional Grammar (Kaplan &amp; Bresnan 1982; Kaplan 1989), which assigns representations consisting of a surface constituent tree enriched with a corresponding functional structure, is known to be beyond context-free. In the current work, we develop a DOP model based on representations defined by LFG theory (&amp;quot;LFG-DOP&amp;quot;). That is, we provide a new instantiation for the four parameters of the DOP architecture. We will see that this basic LFG-DOP model triggers a new, corpus-based notion of grammaticality, and that it leads to a different class of probability models that exhibit interesting properties with respect to specificity and the interpretation of ill-formed strings.</Paragraph>
  </Section>
</Paper>