<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2003">
  <Title>A Robust and Hybrid Deep-Linguistic Theory Applied to Large-Scale Parsing</Title>
  <Section position="2" start_page="0" end_page="8" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Robustness in Computational Linguistics has recently been recognized as a central issue for the design of reliable, large-scale Natural Language Processing (NLP) systems. While the highest possible linguistic coverage is desirable, speed and robustness are equally important in practical applications.</Paragraph>
    <Paragraph position="1"> Formal Grammar parsers have carefully crafted grammars written by professional linguists. In addition to expressing local relations, i.e. relations between a mother and a direct daughter node, a number of non-local relations, i.e. relations involving more than two generations, are also modeled. An example of a non-local relation is the subject control relation in the sentence John wants to leave, where John is not only the explicit subject of want, but equally the implicit subject of leave. A parser that fails to recognize control subjects misses important information, quantitatively about 3% of all subjects. But unrestricted real-world texts still pose a problem to NLP systems that are based on Formal Grammars. Few hand-crafted, deep-linguistic grammars achieve the coverage and robustness needed to parse large corpora (see (Riezler et al., 2002) for an exception, and (Burke et al., 2004; Hockenmaier and Steedman, 2002) for approaches extracting formal grammars from the Treebank), and speed remains a serious challenge. The typical problems can be grouped as follows.</Paragraph>
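The subject control relation described above can be made concrete with a small sketch. This is not the paper's implementation: the dependency-triple format and the list of control verbs are illustrative assumptions, showing only how an implicit subject can be propagated from a control verb to its infinitival complement.

```python
# Illustrative sketch (not the paper's grammar): copy the subject of a
# subject-control verb down to its infinitival complement, so that in
# "John wants to leave" John also becomes the subject of "leave".
# SUBJECT_CONTROL_VERBS is a hypothetical, hand-picked lexical list.

SUBJECT_CONTROL_VERBS = {"want", "try", "hope"}

def add_control_subjects(deps):
    """deps: list of (head, relation, dependent) triples.
    Returns the triples plus implicit subjects licensed by control verbs."""
    out = list(deps)
    subj = {h: d for h, r, d in deps if r == "subj"}
    for h, r, d in deps:
        # an infinitival complement (xcomp) of a control verb inherits
        # that verb's explicit subject as its own implicit subject
        if r == "xcomp" and h in SUBJECT_CONTROL_VERBS and h in subj:
            out.append((d, "subj", subj[h]))
    return out

deps = [("want", "subj", "John"), ("want", "xcomp", "leave")]
print(add_control_subjects(deps))
# the result additionally contains ("leave", "subj", "John")
```

A real grammar would of course read control properties from lexical entries rather than a fixed set, but the propagation step is the same in spirit.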
    <Paragraph position="2"> Grammar complexity Fully comprehensive grammars are difficult to maintain and considerably increase parsing complexity. Note that statistical parsers can equally suffer from this problem, see e.g. (Kaplan et al., 2004).</Paragraph>
    <Paragraph position="3"> Parsing complexity Typical formal grammar parser complexity is much higher than the O(n³) of CFG parsing (Eisner, 1997). The complexity of some formal grammars is still unknown. For Tree-Adjoining Grammars (TAG) it is O(n⁷) or O(n⁸), depending on the implementation (Eisner, 2000). (Sarkar et al., 2000) state that the theoretical worst-case time complexity of Head-Driven Phrase Structure Grammar (HPSG) parsing is exponential. Parsing algorithms able to treat completely unrestricted long-distance dependencies are NP-complete (Neuhaus and Bröker, 1997).
Ranking Returning all syntactically possible analyses for a sentence is not really what is expected of a syntactic analyzer if it is to be of practical use, since for a human there is usually only one "correct" interpretation. A clear indication of preference, by ranking the analyses in a preference order, is needed.</Paragraph>
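The O(n³) bound for CFG parsing mentioned above comes directly from the structure of chart parsers such as CYK: three nested loops over span length, start position, and split point. A minimal recognizer makes this visible (the toy grammar and lexicon are assumptions for illustration, not from the paper):

```python
# Illustrative CYK recognizer for a grammar in Chomsky Normal Form.
# The three nested loops (span length x start position x split point)
# are exactly the source of the O(n^3) complexity cited for CFGs.

def cyk(words, lexicon, rules, start="S"):
    """lexicon: (nonterminal, word) pairs; rules: (A, B, C) for A -> B C."""
    n = len(words)
    # chart[i][j] = set of nonterminals deriving words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = {A for A, x in lexicon if x == w}
    for length in range(2, n + 1):          # O(n) span lengths
        for i in range(n - length + 1):     # O(n) start positions
            j = i + length
            for k in range(i + 1, j):       # O(n) split points
                for A, B, C in rules:       # O(|G|) grammar rules
                    if B in chart[i][k] and C in chart[k][j]:
                        chart[i][j].add(A)
    return start in chart[0][n]

# toy grammar, purely for demonstration
lexicon = [("NP", "John"), ("VP", "sleeps")]
rules = [("S", "NP", "VP")]
print(cyk(["John", "sleeps"], lexicon, rules))  # True
```

Formalisms such as TAG or HPSG need strictly richer chart items than the (nonterminal, span) pairs used here, which is where their higher (or exponential) bounds come from.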
    <Paragraph position="4"> Pruning In order to keep search spaces manageable, it is necessary to discard unconvincing alternatives already during the parsing process. In a statistical parser, the ranking of intermediate structures occurs naturally, while a rule-based system has to rely on ad hoc heuristics. With beam search as a parse-time pruning strategy, which means that the total number of alternatives kept is constant from a certain search complexity onwards, real-world parsing time can be reduced to near-linear. If one assumes a constantly full beam, or uses an oracle (Nivre, 2004), it is linear in practice. A number of robust statistical parsers that offer solutions to these problems have now become available (Charniak, 2000; Collins, 1999; Henderson, 2003), but they typically produce CFG constituency data as output: trees that do not express long-distance dependencies.</Paragraph>
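The beam pruning idea is simple to state in code: at each pruning point, keep only a constant number of the best-scored partial analyses, regardless of how many the grammar licenses. A minimal sketch under that assumption (the scores and analyses here are placeholders):

```python
import heapq

# Sketch of parse-time beam pruning: among all competing partial analyses,
# only the `beam` highest-scoring ones survive, so the number of live
# alternatives is bounded by a constant once the search is saturated.

def prune(items, beam=4):
    """items: list of (score, analysis) pairs. Keep the top `beam` by score,
    best first."""
    return heapq.nlargest(beam, items, key=lambda x: x[0])

cands = [(0.9, "a"), (0.1, "b"), (0.5, "c"), (0.7, "d"), (0.3, "e")]
print(prune(cands, beam=3))
# -> [(0.9, 'a'), (0.7, 'd'), (0.5, 'c')]
```

Because the work per pruning point is bounded, total parsing time grows roughly linearly with sentence length in practice, which is the near-linear behaviour the paragraph refers to.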
    <Paragraph position="5"> Although grammatical function and empty node annotation expressing long-distance dependencies is provided in Treebanks such as the Penn Treebank (Marcus et al., 1993), most statistical Treebank-trained parsers fully or largely ignore it, which entails two problems: first, the training cannot profit from valuable annotation data. Second, the extraction of long-distance dependencies (LDD) and the mapping to shallow semantic representations is not always possible from the output of these parsers. This limitation is aggravated by a lack of co-indexation information and by parsing errors across an LDD. In fact, some syntactic relations cannot be recovered on configurational grounds alone. For these reasons, (Johnson, 2002) provocatively refers to such parsers as "half-grammars". The paper is organized as follows. We first explore a deep-linguistic grammar theory for English that is inherently designed to be robust: it extends the low processing complexity and the robustness of statistical approaches to a more deep-linguistic level by making careful use of underspecification and grammar compression techniques, and by using a grammar that directly delivers simple predicate-argument structures.</Paragraph>
    <Paragraph position="6"> This allows us to use a context-free grammar at parse-time while successfully treating long-distance dependencies with low-complexity approaches before and after parsing. Our approach is to use finite-state approximations of long-distance dependencies, as described in (Schneider, 2003a) for Dependency Grammar (DG) and (Cahill et al., 2004) for Lexical Functional Grammar (LFG). (Dienes and Dubey, 2003) show that finite-state pre-processing modules can successfully deal with LDDs; our approach is similar in also amounting to a pre-processing recognition of LDDs. Then we show that the implementation (Pro3Gres) profits from its hybridness and is fast and robust enough to do large-scale parsing of totally unrestricted texts, and we give an overview of its applications. (As an exception among statistical parsers, (Collins, 1999) Model 2 uses some of the functional labels, and Model 3 some long-distance dependencies.) To conclude, two evaluations are given.</Paragraph>
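The idea of a finite-state approximation of an LDD can be illustrated with one pattern: a fronted WH-object, as in "What did John see?", where the WH-word is the object of the clause-final verb. The sketch below is a loose illustration under stated assumptions, not a rule of Pro3Gres or of the cited systems; the tag inventory and the single regular expression are hypothetical.

```python
import re

# Hedged sketch: recognizing one long-distance dependency type with a
# finite-state pattern over a POS-tag sequence, before full parsing.
# Pattern (an illustrative assumption): WH-pronoun, auxiliary, subject
# noun, bare verb -- as in "What did John see".

WH_OBJ = re.compile(r"^WP AUX NNP VB$")

def mark_wh_object(tags, words):
    """If the tag sequence matches the WH-object pattern, return a
    (verb, 'obj', wh-word) dependency triple; otherwise None."""
    if WH_OBJ.match(" ".join(tags)):
        return (words[-1], "obj", words[0])
    return None

print(mark_wh_object(["WP", "AUX", "NNP", "VB"],
                     ["What", "did", "John", "see"]))
# -> ('see', 'obj', 'What')
```

A realistic module would use a cascade of such patterns (or a learned finite-state model) and hand the recovered triples to the context-free parse-time grammar, which is the division of labour the paragraph describes.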
  </Section>
</Paper>