<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1082"> <Title>Using linguistic principles to recover empty categories</Title> <Section position="2" start_page="0" end_page="1" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Many recent approaches to parsing (e.g. Charniak, 2000) have focused on labeled bracketing of the input string, ignoring aspects of structure that are not reflected in the string, such as phonetically null elements and long-distance dependencies, many of which provide important semantic information such as predicate-argument structure. In the Penn Treebank (Marcus et al., 1993), null elements, or empty categories, are used to indicate non-local dependencies, discontinuous constituents, and certain missing elements. Empty categories are coindexed with their antecedents in the same sentence. In addition, if a node has a particular grammatical function (such as subject) or semantic role (such as location), it has a function tag indicating that role; empty categories may also have function tags. Thus in the sentence below, who is coindexed with the empty category *T* in the embedded S; the function tag SBJ indicates that this empty category is the subject of that S:</Paragraph> <Paragraph position="2"> Empty categories, with coindexation and function tags, allow a transparent reconstruction of predicate-argument structure not available from a simple bracketed string.</Paragraph> <Paragraph position="3"> In addition to bracketing the input string, a fully adequate syntactic analyzer should also locate empty categories in the parse tree, identify their antecedents, if any, and assign them appropriate function tags. State-of-the-art statistical parsers (e.g. Charniak, 2000) typically provide a labeled bracketing only; i.e., a parse tree without empty categories. 
This paper describes an algorithm for inserting empty categories in such impoverished trees, coindexing them with their antecedents, and assigning them function tags. This is the first approach to include function tag assignment as part of the more general task of empty category recovery.</Paragraph> <Paragraph position="4"> Previous approaches to the problem (Collins, 1997; Johnson, 2002; Dienes and Dubey, 2003a,b; Higgins, 2003) have all been learning-based; the primary difference between the present algorithm and earlier ones is that it is not learned, but explicitly incorporates principles of Government-Binding theory (Chomsky, 1981), since that theory underlies the annotation. The absence of rule-based approaches up until now is not motivated by the failure of such approaches in this domain; on the contrary, no one seems to have tried a rule-based approach to this problem. Instead, it appears that there is an understandable predisposition against rule-based approaches, given that data-driven, especially machine-learning, approaches have worked so much better in many other domains. Both Collins (1997: 19) and Higgins (2003: 100) are explicit about this predisposition.</Paragraph> <Paragraph position="5"> Empty categories, however, seem different in that, for the most part, their location and existence are determined not by observable data but by explicitly constructed linguistic principles, which were consciously used in the annotation; i.e., unlike overt words and phrases, which correspond to actual strings in the data, empty categories are in the data only because linguists doing the annotation put them there.</Paragraph> <Paragraph position="6">
This paper therefore explores a rule-based approach to empty category recovery, with two purposes in mind: first, to explore the limits of such an approach; and second, to establish a more realistic baseline for future (possibly data-driven or hybrid) approaches.</Paragraph> <Paragraph position="7"> Although it does not seem likely that any application trying to glean semantic information from a parse tree would care about the exact string position of an empty category, the algorithm described here does try to insert empty categories in the correct position in the string. The main reason for this is to facilitate comparison with previous approaches to the problem, which evaluate accuracy by including such information.</Paragraph> <Paragraph position="8"> In Section 5, however, a revised evaluation metric is proposed that does not depend on string position per se.</Paragraph> <Paragraph position="9"> Before proceeding, a note on terminology is in order. I use the term detection (of empty categories) for the insertion of a labeled empty category into the tree (and/or string), and the term resolution for the coindexation of the empty category with its antecedent(s), if any. 
The term recovery refers to the complete package: detection, resolution, and assignment of function tags to empty categories.</Paragraph> <SectionTitle> 2 Empty nodes in the Penn Treebank </SectionTitle> <Paragraph position="10"> The major types of empty category in the Penn Treebank (PTB) are shown in Table 1, along with their distribution in section 24 of the Wall Street Journal portion of the PTB.</Paragraph> <Paragraph position="11"> A detailed description of the categories and their uses in the treebank is provided in Chapter 4 of the annotation guidelines (Bies et al., 1995).</Paragraph> <Paragraph position="12"> Following Johnson (2002) and Dienes and Dubey (2003a), the compound empty SBAR consisting of an empty complementizer followed by *T* of category S is treated as a single item for purposes of evaluation. This compound category is labeled SBAR in Table 1.</Paragraph> <Paragraph position="13"> The PTB annotation in general, but especially the annotation of empty categories, follows a modified version of Government-Binding (GB) theory (Chomsky, 1981). In GB, the existence and location of empty categories are determined by the interaction of linguistic principles. In addition, the type of a given empty category is determined by its syntactic context, with the result that the various types of empty category are in complementary distribution. For example, the GB categories NP-trace and PRO (which are conflated into a single category in the PTB) occur only in argument positions in which an overt NP could not occur, namely as the object of a passive verb or as the subject of certain kinds of infinitive.</Paragraph> </Section></Paper>