File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/w04-2003_metho.xml
Size: 21,798 bytes
Last Modified: 2025-10-06 14:09:16
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2003"> <Title>A Robust and Hybrid Deep-Linguistic Theory Applied to Large-Scale Parsing</Title> <Section position="3" start_page="8" end_page="8" type="metho"> <SectionTitle> 2 A Robust Deep-Linguistic Theory </SectionTitle> <Paragraph position="0"> Generally, a linguistic analysis model aims at complete and correct analysis, which means that the mapping between the text data and its syntactic and semantic analysis is sound (the model extracts correct readings) and complete (the model deals with all language phenomena).</Paragraph> <Paragraph position="1"> In practice, however, both goals cannot be totally reached. The main obstacle for soundness is the all-pervasive characteristic of natural language to be ambiguous, where ambiguities can often only be resolved with world knowledge.</Paragraph> <Paragraph position="2"> Statistical disambiguation approaches such as (Collins and Brooks, 1995) for PP-attachment or (Collins, 1997; Charniak, 2000) for generative parsing greatly improve disambiguation, but as they model by imitation instead of by understanding, complete soundness has to remain elusive. As for completeness, early &quot;naive&quot; statistical approaches have already shown that the problem of grammar size is not solved but even aggravated by a naive probabilistic parser implementation in which e.g. all CFG rules permitted in the Penn Treebank are extracted. From his 300,000-word training portion of the Penn Treebank, (Charniak, 1996) obtains more than 10,000 CFG rules, of which only about 3,000 occur more than once. It is therefore necessary to either discard infrequent rules, do manual editing, use a different rule format such as individual dependencies (Collins, 1996), or gain full linguistic control and insight by using a hand-written grammar - each of which sacrifices total completeness.</Paragraph> <Section position="1" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 2.1 Near-full Parsing </SectionTitle> <Paragraph position="0"> The approach we have chosen is to use a manually-developed wide-coverage tag sequence grammar (Abney, 1995; Briscoe and Carroll, 2002), and to exclude or restrict rare, marked and error-prone phenomena. For example, while it is generally possible for nouns to be modified by more than one PP, only nouns seen in the Treebank with several PPs are allowed to have several PPs. Or, while it is generally possible for a subject to occur to the immediate right of a verb (said she), this is only allowed for verbs seen with a subject to the right in the training corpus, typically verbs of utterance, and only in a comma-delimited or sentence-final context. This entails that the parser profits from a lean grammar, finds a complete structure spanning the entire sentence in the majority of real-world sentences, and needs to resort to collecting partial parses only in the remaining minority.</Paragraph> <Paragraph position="1"> Starting from the most probable longest span, recursively the most probable longest span to the left and to the right is searched.</Paragraph>
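As an illustration of the partial-parse fallback just described, the following sketch (not Pro3Gres code; the span representation, tie-breaking and all names are assumptions) recursively selects the most probable longest span and then repeats the search to its left and right:

```python
# Illustrative sketch (not Pro3Gres code): collect partial parses when no
# analysis spans the whole sentence.  A span is (start, end, prob, tree);
# this representation and the tie-breaking are assumptions.

def collect_partial_parses(spans, lo, hi):
    """Greedily cover [lo, hi) with the longest, then most probable, spans."""
    candidates = [s for s in spans if lo <= s[0] and s[1] <= hi]
    if not candidates:
        return []
    best = max(candidates, key=lambda s: (s[1] - s[0], s[2]))
    start, end, _, _ = best
    return (collect_partial_parses(spans, lo, start)
            + [best]
            + collect_partial_parses(spans, end, hi))

# Example: a 10-token sentence for which the chart holds no spanning analysis
chart = [(0, 6, 0.4, "S1"), (0, 4, 0.7, "S2"), (4, 6, 0.3, "S3"), (6, 10, 0.5, "S4")]
print([t for (_, _, _, t) in collect_partial_parses(chart, 0, 10)])  # ['S1', 'S4']
```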
<Paragraph position="2"> Near-full parsing only leads to a very small loss. If an analysis consists of two partial parses, on the dependency relation level only the single, usually high-level, relation between the heads of the two partial parses remains unexpressed.</Paragraph> <Paragraph position="3"> The risk of returning &quot;garden path&quot; analyses, locally correct but globally wrong, diminishes with increasing span length.</Paragraph> </Section> <Section position="2" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 2.2 Functional Dependency Grammar </SectionTitle> <Paragraph position="0"> We follow the broad architecture suggested by (Abney, 1995), which naturally integrates chunking and dependency parsing and has proven to be practical, fast and robust (Collins, 1996; Basili and Zanzotto, 2002). Tagging and chunking are very robust, finite-state approaches; parsing then only occurs between heads of chunks. (Practical experiments using a toy NP and verb-group grammar have shown that parsing between heads of chunks only is about four times faster than parsing between every word, i.e. without chunking.)</Paragraph> <Paragraph position="1"> The perspicuous rules of a hand-written dependency grammar build up the possible syntactic structures, which are ranked and pruned by calculating lexical attachment probabilities for the majority of the dependency relations used in the grammar. The grammar contains around 1000 rules specifying the dependent's and the head's tag, the direction of the dependency, lexical information for closed class words, and context restrictions. (The number of rules is high because of tag combinatorics leading to many almost identical rules; a subject relation is e.g. possible between the 6 verb tags and the 4 noun tags.) Context restrictions express e.g. that only a verb which has an object in its context is allowed to attach a secondary object.</Paragraph> <Paragraph position="2"> Our approach can be seen as an extension of (Collins and Brooks, 1995) from PP-attachment to most dependency relations. Training data is a partial mapping of the Penn Treebank to deep-linguistic dependency structures, similar to (Basili et al., 1998).</Paragraph> <Paragraph position="3"> Robustness also depends on the grammar formalism. While many formalisms fail to project when subcategorized arguments cannot be found, in a grammar like DG, in which maximal projections and terminal nodes are isomorphic, projection can never fail.</Paragraph> <Paragraph position="4"> In classical DG, only content words can be heads, and there is no distinction between syntactic and semantic dependency - semantic dependency is used as far as possible. These assumptions entail that there are no functional and no empty nodes, which means that low-complexity O(n^3) algorithms such as CYK, which is used here, can be employed.</Paragraph> <Paragraph position="5"> The classical dependency grammar distinction between ordre linéaire and ordre structural, basically an immediate dominance / linear precedence (ID/LP) distinction, also has the advantage that a number of phenomena classically assumed to involve long-distance dependencies, fronted or inversed constituents, can be treated locally. They only need rules that allow an inversion of the &quot;canonical&quot; dependency direction under well-defined conditions. As for fronted elements, since DG does not distinguish between external and internal arguments, front positions are always locally available to the verb.</Paragraph> </Section>
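The rule format and the lexical attachment probabilities described in this section could be encoded roughly as follows. This is a minimal illustrative sketch, not the actual Pro3Gres grammar or probability model; all field names, back-off levels and counts are invented:

```python
# Illustrative sketch (not the actual Pro3Gres grammar or statistics): a possible
# encoding of the rule format described above, and a backed-off lexical
# attachment probability in the spirit of (Collins and Brooks, 1995).

from dataclasses import dataclass
from typing import Optional

@dataclass
class DepRule:
    relation: str                    # e.g. "subj", "obj", "obj2", "pobj"
    head_tag: str                    # tag of the head, e.g. "VBZ"
    dep_tag: str                     # tag of the dependent, e.g. "NN"
    direction: str                   # "left": dependent precedes the head
    head_word: Optional[str] = None  # lexical restriction for closed-class words
    context: Optional[str] = None    # e.g. "needs_object" for secondary objects

def attach_prob(counts, relation, head, dep):
    """MLE with back-off: word pair -> head word only -> uninformed prior."""
    if counts.get(("pair_total", head, dep)):
        return counts.get(("pair", relation, head, dep), 0) / counts[("pair_total", head, dep)]
    if counts.get(("head_total", head)):
        return counts.get(("head", relation, head), 0) / counts[("head_total", head)]
    return 0.5

rules = [DepRule("subj", "VBZ", "NN", "left"),
         DepRule("obj2", "VBD", "NN", "right", context="needs_object")]
counts = {("pair", "obj", "see", "result"): 3, ("pair_total", "see", "result"): 4}
print(attach_prob(counts, "obj", "see", "result"))  # 0.75
```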
<Section position="3" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 2.3 Underspecification and Disambiguation </SectionTitle> <Paragraph position="0"> The cheapest approach to dealing with the all-pervasive NL ambiguity is to underspecify everything, which leads to a sound and complete mapping, but one that is content-free and absurd. In a few, carefully selected areas, however, where distinctions do not matter for the task at hand, where the disambiguation task is particularly unreliable, or where inter-annotator agreement is very low, underspecification can serve as a tool to greatly facilitate linguistic analysis. For example, intra-base-NP ambiguities, such as quantifier scope ambiguities, do not matter for a parser like ours, which aims at predicate-argument structure, and no attempt is made to analyze them. There is one part-of-speech distinction where inter-annotator agreement is quite low and the performance of taggers generally very poor: the distinction between verbal particles and prepositions. We currently leave the distinction underspecified, but a statistical disambiguator is being developed.</Paragraph> <Paragraph position="1"> Conversely, the Penn Treebank annotation is sometimes not specific enough. The parser distinguishes between the reading of the tag IN as a complementizer or as a preposition, and disambiguates commas as far as it can, between apposition, subordination and conjunction.</Paragraph> <Paragraph position="2"> Some typical tagging errors can be robustly corrected by the hand-written grammar. For example, the distinction between verb past tense VBD and participle VBN is unreliable, but can usually be disambiguated in the parsing process by leaving this tag distinction underspecified for a number of constructions.</Paragraph> </Section> <Section position="4" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 2.4 Long-distance Dependencies </SectionTitle> <Paragraph position="0"> Long-distance dependencies exponentially increase parsing complexity (Neuhaus and Bröker, 1997). We therefore use an approach that preprocesses, post-processes and partly underspecifies them, allowing us to use a context-free grammar at parse time.</Paragraph> <Paragraph position="1"> In detail, (1) before parsing we model dedicated patterns across several levels of constituency subtrees, partly leading to dedicated, compressed and fully local dependency relations, (2) we use statistical lexicalized postprocessing, and (3) we rely on traditional Dependency Grammar assumptions (section 2.2).</Paragraph> <Paragraph position="2"> (Johnson, 2002) presents a pattern-matching algorithm for post-processing the output of statistical parsers to add empty nodes to their parse trees. While encouraging results are reported for perfect parses, performance drops considerably when using trees produced by a statistical parser. &quot;If the parser makes a single parsing error anywhere in the tree fragment matched by the pattern, the pattern will no longer match. This is not unlikely since the statistical model used by the parser does not model these larger tree fragments. It suggests that one might improve performance by integrating parsing, empty node recovery and antecedent finding in a single system ... &quot; (Johnson, 2002).
We have applied structural patterns to the Penn Treebank, where, as in perfect parses, precision and recall are high, and where in addition functional labels and empty nodes are available, so that patterns similar to Johnson's but - like (Jijkoun, 2003) - relying on functional labels and empty nodes reach precision close to 100%. Unlike in Johnson, patterns for local dependencies are also used; non-local patterns simply stretch across more subtree levels. We use the extracted lexical counts as lexical frequency training material. Every dependency relation has a group of structural extraction patterns associated with it. This amounts to a partial mapping of the Penn Treebank to Functional DG (Hajič, 1998; Tapanainen and Järvinen, 1997). Table 1 gives an overview of the most important dependencies.</Paragraph> <Paragraph position="3"> Table 1: The most important dependency relations.
Relation               Label    Example
verb-subject           subj     he sleeps
verb-first object      obj      sees it
verb-second object     obj2     gave (her) kisses
verb-adjunct           adj      ate yesterday
verb-subord. clause    sentobj  saw (they) came
verb-prep. phrase      pobj     slept in bed
noun-prep. phrase      modpp    draft of paper
noun-participle        modpart  report written
verb-complementizer    compl    to eat apples
noun-preposition       prep     to the house</Paragraph> <Paragraph position="4"> The subj relation, for example, has the head of an arbitrarily nested NP with the functional tag SBJ as dependent, and the head of an arbitrarily nested VP as head for all active verbs. In passive verbs, however, a movement involving an empty constituent is assumed, which corresponds to the extraction pattern in figure 1, where VP@ is an arbitrarily nested VP, NP-SBJ-X@ the arbitrarily nested surface subject, and X the co-indexed, moved element. Movements are generally supposed to be of arbitrary length, but a closer investigation reveals that this type of movement is fixed.</Paragraph> <Paragraph position="5"> The same argument can be made for other relations, for example control structures, which have the extraction pattern shown in figure 1. Grammatical role labels, empty node labels and tree configurations spanning several local subtrees are used as an integral part of some of the patterns. This leads to much flatter trees, as is typical for DG, which has the advantages that (1) it helps to alleviate sparse data by mapping nested structures that express the same dependency relation, (2) the costly overhead for dealing with unbounded dependencies can be largely avoided, and (3) it is ensured that the lexical information that matters is available in one central place, allowing the parser to take one well-informed decision instead of several brittle decisions plagued by sparseness, which greatly reduces complexity and the risk of errors (Johnson, 2002). Collapsing deeply nested structures into a single dependency relation is less complex but has the same effect as carefully selecting what goes into the parse history in history-based approaches. &quot;Much of the interesting work is determining what goes into [the history] H(c)&quot; (Charniak, 2000).</Paragraph>
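A minimal sketch of what a structural extraction pattern for the subj relation might look like is given below. The (label, children) tree encoding, head heuristics and tag sets are assumptions for illustration only and do not reproduce the actual patterns, which also use functional labels, empty nodes and configurations spanning several subtree levels as described above:

```python
# Illustrative sketch: extracting a flat subj relation from a Treebank-style
# tree by structural pattern.  The tree encoding and head heuristics are
# assumptions, not the actual extraction patterns.

def head_word(node, tags):
    """Rightmost terminal below node whose tag is in tags (None if absent)."""
    label, children = node
    if isinstance(children, str):          # terminal node: (tag, word)
        return children if label in tags else None
    best = None
    for child in children:
        word = head_word(child, tags)
        if word is not None:
            best = word
    return best

def extract_subj(node, relations):
    """Collect (subj, verb, noun) for every S that has NP-SBJ and VP children."""
    label, children = node
    if isinstance(children, str):
        return
    if label == "S":
        np_sbj = next((c for c in children if c[0] == "NP-SBJ"), None)
        vp = next((c for c in children if c[0] == "VP"), None)
        if np_sbj and vp:
            noun = head_word(np_sbj, {"NN", "NNS", "NNP", "PRP"})
            verb = head_word(vp, {"VB", "VBD", "VBZ", "VBP", "VBN", "VBG"})
            if noun and verb:
                relations.append(("subj", verb, noun))
    for child in children:
        extract_subj(child, relations)

tree = ("S", [("NP-SBJ", [("PRP", "he")]), ("VP", [("VBZ", "sleeps")])])
relations = []
extract_subj(tree, relations)
print(relations)  # [('subj', 'sleeps', 'he')]
```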
<Paragraph position="6"> (Schneider, 2003a) shows that the vast majority of LDDs can be treated in this way, essentially compressing non-local subtrees into dedicated relations even before grammar writing starts. The compressed trees correspond to a simple LFG f-structure. The trees obtained from parsing can be decompressed into traditional constituency trees including empty nodes and co-indexation, or into shallow semantic structures such as Minimal Logical Forms (MLF) (Rinaldi et al., 2004b; Schneider et al., 2000; Schwitter et al., 1999). This approach leaves LDDs underspecified, but recoverable, and makes no claims as to whether empty nodes at an autonomous syntactic level exist or not.</Paragraph> <Paragraph position="7"> After parsing, shared constituents can be extracted again. The parser explicitly does this for control, raising and semi-auxiliary relations, because the grammar does not distinguish between subordinating clauses with and without control. A probability model based on the verb semantics is invoked if a subordinate clause without an overt subject is seen, in order to decide whether the matrix clause subject or object is shared.</Paragraph> <Paragraph position="8"> Among the 10 most frequent types of empty nodes, which cover more than 60,000 of the 64,000 empty nodes in the Penn Treebank, there are only two problematic LDD types: WH traces and indexed gerunds.</Paragraph> <Paragraph position="9"> WH traces: Only 113 of the 10,659 WHNP antecedents in the Penn Treebank are actually question pronouns. The vast majority, over 9,000, are relative pronouns. For them, an inversion of the direction of the relation they have to the verb is allowed if the relative pronoun precedes the subject. This method succeeds in most cases, but non-standard linguistic assumptions need to be made for stranded prepositions. Only non-subject WH-question pronouns and support verbs need to be treated as &quot;real&quot; non-local dependencies. In question sentences, before the main parsing is started, the support verb is attached to any lonely participle chunk in the sentence, and the WH-pronoun pre-parses with any verb.</Paragraph> <Paragraph position="10"> Indexed gerunds: Unlike in control, raising and semi-auxiliary constructions, the antecedent of an indexed gerund cannot be established easily. The fact that almost half of the gerunds are non-indexed in the Penn Treebank indicates that information about the unexpressed participant is semantic rather than syntactic in nature, much like in pronoun resolution. Currently, the parser neither tries to decide whether the target gerund is an indexed or a non-indexed gerund nor tries to find the identity of the missing participant in the latter case. This is an important reason why recall values for the subject and object relations are lower than the precision values.</Paragraph> </Section> </Section> <Section position="5" start_page="8" end_page="8" type="metho"> <SectionTitle> 3 Robustness &quot;in the small&quot; </SectionTitle> <Paragraph position="0"> In addition to a robust deep-linguistic design (robustness &quot;in the large&quot;, section 2), the implemented parser, Pro3Gres, uses a number of practical robust approaches &quot;in the small&quot; at each processing level, such as relying on finite-state tagging and chunking, collecting partial parses if no complete analysis can be found, or using incrementally more aggressive pruning techniques in very long sentences. During the parsing process, only a certain number of alternatives for each possible span are kept. Experiments have shown that using a fixed number or a number dependent on the parsing complexity in terms of global chart entries leads to very similar results. Using reasonable beam sizes increases parsing speed by an order of magnitude while hardly affecting parser performance. For the fixed number model, performance starts to collapse only when fewer than 4 alternatives per span are kept.</Paragraph>
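A per-span beam of this kind can be sketched as follows; this is illustrative only, the chart layout and function names are assumptions, and 4 is the beam size below which the text reports performance collapse:

```python
# Illustrative sketch: keep only the N best alternatives per chart span.
# The chart layout (span -> list of (prob, analysis)) is an assumption.

BEAM_SIZE = 4  # the text reports collapse only below 4 alternatives per span

def add_to_chart(chart, span, prob, analysis, beam_size=BEAM_SIZE):
    """Insert an analysis for a span, retaining the beam_size most probable."""
    entries = chart.setdefault(span, [])
    entries.append((prob, analysis))
    entries.sort(key=lambda entry: entry[0], reverse=True)
    del entries[beam_size:]

chart = {}
for prob, analysis in [(0.1, "A"), (0.5, "B"), (0.3, "C"), (0.05, "D"), (0.4, "E")]:
    add_to_chart(chart, (2, 7), prob, analysis)
print(chart[(2, 7)])  # the four most probable analyses for span (2, 7)
```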
<Paragraph position="1"> When a certain complexity has been reached (currently 1000 chart entries), only reductions above a certain probability threshold are permissible. The threshold starts very low, but is a function of the total number of chart entries. This entails that even sentences with hundreds of words can be parsed quickly; the aim is not to find complete parses for them, but rather a graceful degradation of performance (Menzel, 1995).</Paragraph> </Section> <Section position="6" start_page="8" end_page="8" type="metho"> <SectionTitle> 4 A hybrid approach on many levels </SectionTitle> <Paragraph position="0"> Pro3Gres profits from being hybrid on many levels. Hybridness means that the most robust approach can be chosen for each task and each processing level.</Paragraph> <Paragraph position="1"> statistical vs. rule-based: this is the most obvious way in which Pro3Gres is a hybrid (Schneider, 2003b). Unlike formal grammars to which post-hoc statistical disambiguators can be added, Pro3Gres has been designed to be hybrid, carefully distinguishing between tasks that can best be solved by finite-state methods, rule-based methods and statistical methods. While e.g. grammar writing is easy for a linguist, and a naive Treebank grammar suffers from similar complexity problems as a comprehensive formal grammar, the scope of application and the amount of ambiguity a rule creates is often beyond our imagination and best handled by a statistical system.</Paragraph> <Paragraph position="2"> shallow vs. deep: the design philosophy for Pro3Gres has been to stay as shallow as possible while obtaining reliable results at each level.</Paragraph> <Paragraph position="3"> Treebank constituency vs. DG: the observation that a DG that expresses grammatical relations is not only more informative but also more intuitive for a non-expert to interpret, and that Functional DG can avoid a number of LDD types, has made DG the formalism of our choice.</Paragraph> <Paragraph position="4"> For lexicalizing the grammar, a partial mapping from the largest manually annotated corpus available, the Penn Treebank, was necessary, exhibiting a number of mapping challenges.</Paragraph> <Paragraph position="5"> history-based vs. mapping-based: Pro3Gres is not a parse-history-based approach. Instead of manually selecting what goes into the history, as is usually done (see (Henderson, 2003) for an exception), we manually select how to linguistically meaningfully map Treebank structures onto dependency relations by the use of mapping patterns adapted from (Johnson, 2002).</Paragraph> <Paragraph position="6"> probabilistic vs. statistical: Pro3Gres is not a probabilistic system in the sense of a PCFG. From a practical viewpoint, knowing the probability of a certain rule expansion per se is of little interest. Pro3Gres models decision probabilities; the probability of a parse is understood to be the product of all the decision probabilities taken during the derivation.</Paragraph>
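The decision-based parse score can be illustrated as follows; this is a sketch under the stated assumption that a parse's probability is the product of its attachment decision probabilities, accumulated in log space for numerical stability, with invented example values:

```python
# Illustrative sketch: a parse is scored by the product of the decision
# probabilities taken during its derivation, accumulated in log space.
# The decision probabilities in the example are invented.

import math

def derivation_logprob(decision_probs):
    """log P(parse) = sum of log p(decision) over the derivation's decisions."""
    return sum(math.log(p) for p in decision_probs)

print(math.exp(derivation_logprob([0.8, 0.6, 0.9])))  # approx. 0.432 = 0.8 * 0.6 * 0.9
```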
<Paragraph position="7"> local subtrees vs. DOP: psycholinguistic experiments and Data-Oriented Parsing (DOP) (Bod et al., 2003) suggest that people store subtrees of various sizes, from two-word fragments to entire sentences. (Goodman, 2003), however, suggests that the large number of subtrees can be reduced to a compact grammar that makes DOP parsing computationally tractable.</Paragraph> <Paragraph position="8"> In Pro3Gres, a subset of non-local fragments that are, based on linguistic intuition, especially important is used.</Paragraph> <Paragraph position="9"> generative vs. structure-generating: DG generally, although generative in the sense that connected complete structures are generated, is not generative in the sense that it is always guaranteed to terminate if used for random generation of language. Since a complete or partial hierarchical structure that follows CFG assumptions due to the employed grammar is built up for each sentence, Pro3Gres' constraint to allow each complement dependency type only once per verb can be seen as a way of rendering it generative in practice.</Paragraph> <Paragraph position="10"> syntax vs. semantics: instead of using a back-off to tags (Collins, 1999), semantic classes, WordNet for nouns and Levin classes for verbs, are used, in the hope that they manage better to express selectional restrictions than tags. Practical experiments have shown, however, that, in accordance with (Gildea, 2001) on head-lexicalisation, there is almost no increase in performance.</Paragraph> </Section> </Paper>