<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1018">
  <Title>A simple pattern-matching algorithm for recovering empty nodes and their antecedents</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> One of the main motivations for research on parsing is that syntactic structure provides important information for semantic interpretation; hence syntactic parsing is an important first step in a variety of useful tasks. I would like to thank my colleagues in the Brown Laboratory for Linguistic Information Processing (BLLIP) as well as Michael Collins for their advice. This research was supported by NSF awards DMS 0074276 and ITR IIS 0085940.</Paragraph>
    <Paragraph position="1"> Broad-coverage syntactic parsers with good performance have recently become available (Charniak, 2000; Collins, 2000), but these typically produce as output a parse tree that encodes only local syntactic information, i.e., a tree that does not include any empty nodes. (Collins (1997) discusses the recovery of one kind of empty node, viz., WH-traces.) This paper describes a simple pattern-matching algorithm for post-processing the output of such parsers to add a wide variety of empty nodes to their parse trees.</Paragraph>
    <Paragraph position="2"> Empty nodes encode additional information about non-local dependencies between words and phrases, which is important for the interpretation of constructions such as WH-questions, relative clauses, etc.1 For example, in the noun phrase "the man Sam likes", the fact that "the man" is interpreted as the direct object of the verb "likes" is indicated in Penn treebank notation by empty nodes and coindexation, as shown in Figure 1 (see the next section for an explanation of why "likes" is tagged VBZ t rather than the standard VBZ).</Paragraph>
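To make the coindexation scheme concrete, here is a rough sketch in bracketed form. The tree below is an assumed rendering of the Figure 1 annotation following standard Penn Treebank conventions (the exact labels are illustrative, not copied from the figure); the helper simply lists the -NONE- leaves and their parent labels:

```python
import re

# A bracketed rendering of "the man Sam likes" in (assumed) Penn Treebank
# style: the relative clause has an empty complementizer (WHNP-1 (-NONE- 0))
# coindexed with the trace (NP (-NONE- *T*-1)) in object position of "likes".
TREE = """
(NP (NP (DT the) (NN man))
    (SBAR (WHNP-1 (-NONE- 0))
          (S (NP (NNP Sam))
             (VP (VBZ likes)
                 (NP (-NONE- *T*-1))))))
"""

def empty_nodes(bracketed):
    """Return (parent label, filler) pairs for every -NONE- leaf."""
    # A -NONE- leaf looks like "(LABEL (-NONE- filler))".
    return re.findall(r"\((\S+) \(-NONE- ([^)]+)\)\)", bracketed)

print(empty_nodes(TREE))  # → [('WHNP-1', '0'), ('NP', '*T*-1')]
```

Reading off the output: the empty complementizer under WHNP-1 is coindexed (via the -1 suffix) with the *T*-1 trace in the object position of "likes", which is exactly the non-local dependency that a parser producing only local structure omits.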
    <Paragraph position="3"> The broad-coverage statistical parsers just mentioned produce a simpler tree structure for such a relative clause, one that contains neither of the empty nodes just described. Rather, they produce trees of the kind shown in Figure 2. Unlike the tree depicted in Figure 1, this type of tree does not explicitly represent the relationship between "likes" and "the man".</Paragraph>
    <Paragraph position="4"> This paper presents an algorithm that takes as its input a tree without empty nodes of the kind shown in Figure 2. 1There are other ways to represent this information that do not require empty nodes; however, information about non-local dependencies must be represented somehow in order to interpret these constructions.</Paragraph>
    <Paragraph position="5"> Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 136-143. The standard measures for evaluating parse accuracy do not measure the accuracy of empty node and antecedent recovery, but there is a fairly straightforward extension of them that can evaluate empty node and antecedent recovery, as described in section 3. The rest of this section provides a brief introduction to empty nodes, especially as they are used in the Penn Treebank.</Paragraph>
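Section 3 of the paper defines the actual evaluation; as a hedged sketch of the general idea, suppose each recovered empty node (or empty node plus antecedent) is reduced to a hashable tuple, say (category, string position) — the tuple shape here is an assumption for illustration. Standard precision/recall/F-score over sets of such tuples then extends Parseval-style scoring to empty-node recovery:

```python
def prf(gold, guessed):
    """Precision, recall, and F-score over two collections of hashable
    items, e.g. (category, string-position) tuples for empty nodes."""
    gold, guessed = set(gold), set(guessed)
    correct = len(gold & guessed)
    p = correct / len(guessed) if guessed else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical gold annotation vs. system output for one sentence:
gold    = {("NP *T*", 5), ("0", 3)}
guessed = {("NP *T*", 5), ("NP *", 7)}
print(prf(gold, guessed))  # → (0.5, 0.5, 0.5)
```

Because scoring is over sets, an empty node counts as correct only if both its category and its position match the gold annotation; richer representations (e.g. including the antecedent's span) just change what goes into the tuple.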
    <Paragraph position="6"> Non-local dependencies and displacement phenomena, such as Passive and WH-movement, have been a central topic of generative linguistics since its inception half a century ago. However, current linguistic research focuses on explaining the possible non-local dependencies, and has little to say about how likely different kinds of dependencies are. Many current linguistic theories of non-local dependencies are extremely complex, and would be difficult to apply with the kind of broad coverage described here. Psycholinguists have also investigated certain kinds of non-local dependencies, and their theories of parsing preferences might serve as the basis for specialized algorithms for recovering certain kinds of non-local dependencies, such as WH dependencies. All of these approaches require considerably more specialized linguistic knowledge than the pattern-matching algorithm described here. This algorithm is both simple and general, and can serve as a benchmark against which more complex approaches can be evaluated.</Paragraph>
    <Paragraph position="7"> (Figure 2 caption: a tree produced by a broad-coverage statistical parser, lacking empty nodes.)</Paragraph>
    <Paragraph position="8"> The pattern-matching approach is not tied to any particular linguistic theory, but it does require a treebank training corpus from which the algorithm extracts its patterns. We used sections 2–21 of the Penn Treebank as the training corpus; section 24 was used as the development corpus for experimentation and tuning, while the test corpus (section 23) was used exactly once (to obtain the results in section 3). Chapter 4 of the Penn Treebank tagging guidelines (Bies et al., 1995) contains an extensive description of the kinds of empty nodes and the use of co-indexation in the Penn Treebank. Table 1 contains summary statistics on the distribution of empty nodes in the Penn Treebank. The entry with POS SBAR and no label refers to a compound type of empty structure labelled SBAR, consisting of an empty complementizer and an empty (moved) S (thus SBAR is really a nonterminal label rather than a part of speech); a typical example is shown in Figure 3. As might be expected, the distribution is highly skewed, with most of the empty node tokens belonging to just a few types. Because of this, a system can provide good average performance on all empty nodes if it performs well on the most frequent types of empty nodes; conversely, a system will perform poorly on average if it does not perform at least moderately well on the most common types of empty nodes, irrespective of how well it performs on more esoteric constructions.</Paragraph>
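A tally like Table 1 can be approximated directly from bracketed trees. The sketch below (with illustrative tree strings and an assumed suffix-stripping convention, not the paper's exact bookkeeping) counts empty-node types with a Counter, which is enough to expose the skew the paragraph describes:

```python
import re
from collections import Counter

def empty_node_counts(bracketed_trees):
    """Tally empty-node types, keyed by (parent label, trace symbol),
    across Penn-Treebank-style bracketed trees."""
    counts = Counter()
    for tree in bracketed_trees:
        for label, filler in re.findall(r"\((\S+) \(-NONE- ([^)]+)\)\)", tree):
            # Strip coindexation suffixes like -1 so NP-1 and NP-2 pool together.
            counts[(re.sub(r"-\d+$", "", label),
                    re.sub(r"-\d+$", "", filler))] += 1
    return counts

# Two toy trees: a PRO-like subject gap and a WH relative clause.
trees = ["(S (NP-SBJ (-NONE- *-1)) (VP (TO to) (VP (VB go))))",
         "(SBAR (WHNP-2 (-NONE- 0)) (S (NP (-NONE- *T*-2))))"]
print(empty_node_counts(trees).most_common())
```

On a full treebank the resulting counts concentrate mass on a handful of (label, symbol) types, which is precisely why good performance on the frequent types dominates average performance.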
  </Section>
</Paper>