<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1015">
  <Title>Combining Deep and Shallow Approaches in Parsing German</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Parser Evaluation
</SectionTitle>
    <Paragraph position="0"> The simplest method to evaluate a parser is to count the parse trees it gets correct. This measure is, however, not very informative since most applications do not require one hundred percent correct parse trees.</Paragraph>
    <Paragraph position="1"> Thus, an important question in parser evaluation is how to break down parsing results.</Paragraph>
    <Paragraph position="2"> In the PARSEVAL evaluation scheme (Black et al., 1991), partially correct parses are gauged by the number of nodes they produce and have in common with the gold standard (measured in precision and recall). Another figure (crossing brackets) only counts those incorrect nodes that change the partial order induced by the tree. A problematic aspect of the PARSEVAL approach is that the weight given to particular constructions is again grammar-specific, since some grammars may need more nodes to describe them than others. Further, the approach does not pay sufficient heed to the fact that parsing decisions are often intricately twisted: One wrong decision may produce a whole series of other wrong decisions.</Paragraph>
    <Paragraph position="3"> Both these problems are circumvented when parsing results are evaluated on a more abstract level, viz. dependency structure (Lin, 1995).</Paragraph>
    <Paragraph position="4"> Dependency structure generally follows predicate-argument structure, but departs from it in that the basic building blocks are words rather than predicates. In terms of parser evaluation, the first property guarantees independence of decisions (every link is relevant also for the interpretation level), while the second property makes for a better empirical justification. for evaluation units. Dependency structure can be modelled by a directed acylic graph, with word tokens at the nodes. In labelled dependency structure, the links are furthermore classified into a certain set of grammatical roles.</Paragraph>
    <Paragraph position="5"> Dependency can be easily determined from constituent structure if in every phrase structure rule a constituent is singled out as the head (Gaifman, 1965). To derive a labelled dependency structure, all non-head constituents in a rule must be labelled with the grammatical role that links their head tokens to the head token of the head constituent.</Paragraph>
    <Paragraph position="6"> There are two cases where the divergence between predicates and word tokens makes trouble: (1) predicates expressed by more than one token, and (2) predicates expressed by no token (as they occur in ellipsis). Case 1 frequently occurs within the verb complex (of both English and German). The solution proposed in the literature (Black et al., 1991; Lin, 1995; Carroll et al., 1998; Kubler and Telljohann, 2002) is to define a normal form for dependency structure, where every adjunct or argument attaches to some distinguished part of the verb complex. The underlying assumption is that those cases where scope decisions in the verb complex are semantically relevant (e.g. with modal verbs) are not resolvable in syntax anyway. There is no generally accepted solution for case 2 (ellipsis). Most authors in the evaluation literature neglect it, perhaps due to its infrequency (in the NEGRA corpus, ellipsis only occurs in 1.2% of all dependency relations).</Paragraph>
    <Paragraph position="7"> Robinson (1970, 280) proposes to promote one of the dependents (preferably an obligatory one) (1a) or even all dependents (1b) to head status.</Paragraph>
    <Paragraph position="8"> (1) a. the very brave b. John likes tea and Harry coffee.</Paragraph>
    <Paragraph position="9"> A more sweeping solution to these problems is to abandon dependency structure at all and directly go for predicate-argument structure (Carroll et al., 1998). But as we argued above, moving to a more theoretical level is detrimental to comparability across grammatical frameworks.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 A Direct Approach: Learning
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Dependency Structure
</SectionTitle>
      <Paragraph position="0"> According to the dependency structure approach to evaluation, the task of the parser is to find the correct dependency structure for a string, i.e. to associate every word token with pairs of head token and grammatical role or else to designate it as independent. To make the learning task easier, the number of classes should be reduced as much as possible. For one, the task could be simplified by focusing on unlabelled dependency structure (measured in &amp;quot;unlabelled&amp;quot; precision and recall (Eisner, 1996; Lin, 1995)), which is, however, in general not sufficient for further semantic processing.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Tree Property
</SectionTitle>
      <Paragraph position="0"> Another possibility for reduction is to associate every word with at most one pair of head token and grammatical role, i.e. to only look at dependency trees rather than graphs. There is one case where the tree property cannot easily be maintained: coordination. Conceptually, all the conjuncts are head constituents in coordination, since the conjunction could be missing, and selectional restrictions work on the individual conjuncts (2).</Paragraph>
      <Paragraph position="1"> (2) John ate (fish and chips|*wish and ships).</Paragraph>
      <Paragraph position="2"> But if another word depends on the conjoined heads (see (4a)), the tree property is violated. A way out of the dilemma is to select a specific conjunct as modification site (Lin, 1995; Kubler and Telljohann, 2002). But unless care is taken, semantically vital information is lost in the process: Example (4) shows two readings which should be distinguished in dependency structure. A comparison of the two readings shows that if either the first conjunct or the last conjunct is unconditionally selected certain readings become undistinguishable. Rather, in order to distinguish a maximum number of readings, pre-modifiers must attach to the last conjunct and post-modifiers and coordinating conjunctions to the first conjunct2. The fact that the modifier refers to a conjunction rather than to the conjunct is recorded in the grammatical role (by adding c to it).</Paragraph>
      <Paragraph position="3"> (4) a. the [fans and supporters] of Arsenal b. [the fans] and [supporters of Arsenal] Other constructions contradicting the tree property are arguably better treated in the lexicon anyway (e.g. control verbs (Carroll et al., 1998)) or could be solved by enriching the repertory of grammatical roles (e.g. relative clauses with null relative pronouns could be treated by adding the dependency relation between head verb and missing element to the one between head verb and modified noun).</Paragraph>
      <Paragraph position="4"> In a number of linguistic phenomena, dependency theorists disagree on which constituent should be chosen as the head. A case in point are PPs. Few grammars distinguish between adjunct and subcategorized PPs at the level of prepositions. In predicate-argument structure, however, the embedded NP is in one case related to the preposition, in the other to the subcategorizing verb. Accordingly, some approaches take the preposition to be the head of a PP (Robinson, 1970; Lin, 1995), others the NP (Kubler and Telljohann, 2002). Still other approaches (Tesniere, 1959; Carroll et al., 1998) conflate verb, preposition and head noun into a triple, and thus only count content words in the evaluation. For learning, the matter can be resolved empirically: 2Even in this setting some readings cannot be distinguished (see e.g. (3) where a conjunction of three modifiers would  be retrieved). Nevertheless, the proposed scheme fails in only 0.0017% of all dependency tuples.</Paragraph>
      <Paragraph position="5"> (3) In New York, we never meet, but in Boston.</Paragraph>
      <Paragraph position="6">  Note that by this move we favor interpretability over projectivity, but example (4a) is non-projective from the start. Taking prepositions as the head somewhat improves performance, so we took PPs to be headed by prepositions. null</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Encoding Head Tokens
</SectionTitle>
      <Paragraph position="0"> Another question is how to encode the head token. The simplest method, encoding the word by its string position, generates a large space of classes. A more efficient approach uses the distance in string position between dependent and head token. Finally, Lin (1995) proposes a third type of representation: In his work, a head is described by its word type, an indication of the direction from the dependent (left or right) and the number of tokens of the same type that lie between head and dependent. An illustrative representation would be&gt;&gt;paperwhich refers to the second nearest token paper to the right of the current token. Obviously there are far too many word tokens, but we can use Part-Of-Speech tags instead.</Paragraph>
      <Paragraph position="1"> Furthermore information on inflection and type of noun (proper versus common nouns) is irrelevant, which cuts down the size even more. We will call this approach nth-tag. A further refinement of the nth-tag approach makes use of the fact that dependency structures are acylic. Hence, only those words with the same POS tag as the head between dependent and head must be counted that do not depend directly or indirectly on the dependent. We will call this approach covered-nth-tag.</Paragraph>
      <Paragraph position="2"> pos dist nth-tag cover labelled 1,924 1,349 982 921 unlabelled 97 119 162 157  ual approaches generate on the NEGRA Treebank.</Paragraph>
      <Paragraph position="3"> Note that the longest sentence has 115 tokens (with punctuation marks) but that punctuation marks do not enter dependency structure. The original tree-bank exhibits 31 non-head syntactic3 grammatical roles. We added three roles for marker complements (CMP), specifiers (SPR), and floating quantifiers (NK+), and subtracted the roles for conjunction markers (CP) and coreference with expletive (RE).</Paragraph>
      <Paragraph position="4"> 3i.e. grammatical roles not merely used for tokenization 22 roles were copied to mark reference to conjunction. Thus, all in all there was a stock of 54 grammatical roles.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Experiments
</SectionTitle>
      <Paragraph position="0"> We used a0 -grams (3-grams and 5-grams) of POS tags as context and C4.5 (Quinlan, 1993) for machine learning. All results were subjected to 10-fold cross validation.</Paragraph>
      <Paragraph position="1"> The learning algorithm always returns a result. We counted a result as not assigned, however, if it referred to a head token outside the sentence. See Figure 2 for results4 of the learner. The left column shows performance with POS tags from the treebank (ideal tags, I-tags), the right column values obtained with POS tags as generated automatically by a tagger with an accuracy of 95% (tagger tags, T-tags). I-tags T-tags F-val prec rec F-val prec rec dist, 3 .6071 .6222 .5928 .5902 .6045 .5765 dist, 5 .6798 .6973 .6632 .6587 .6758 .6426 nth-tag, 3 .7235 .7645 .6866 .6965 .7364 .6607 nth-tag, 5 .7716 .7961 .7486 .7440 .7682 .7213 cover, 3 .7271 .7679 .6905 .7009 .7406 .6652 cover, 5 .7753 .7992 .7528 .7487 .7724 .7264  The nth-tag head representation outperforms the distance representation by 10%. Considering acyclicity (cover) slightly improves performance, but the gain is not statistically significant (t-test with 99%). The results are quite impressive as they stand, in particular the nth-tag 5-gram version seems to achieve quite good results. It should, however, be stressed that most of the dependencies correctly determined by the n-gram methods extend over no more than 3 tokens. With the distance method, such 'short' dependencies make up 98.90% of all dependencies correctly found, with the nth-tag method still 82%, but only 79.63% with the finite-state parser (see section 4) and 78.91% in the treebank. 4If the learner was given a chance to correct its errors, i.e. if it could train on its training results in a second round, there was a statistically significant gain in F-value with recall rising and precision falling (e.g. F-value .7314, precision .7397, recall .7232 for nth-tag trigrams, and F-value .7763, precision .7826, recall .7700 for nth-tag 5-grams).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Cascaded Finite-State Parser
</SectionTitle>
    <Paragraph position="0"> In addition to the learning approach, we used a cascaded finite-state parser (Schiehlen, 2003), to extract dependency structures from the text. The layout of this parser is similar to Abney's parser (Abney, 1991): First, a series of transducers extracts noun chunks on the basis of tokenized and POS-tagged text. Since center-embedding is frequent in German noun phrases, the same transducer is used several times over. It also has access to inflectional information which is vital for checking agreement and determining case for subsequent phases (see (Schiehlen, 2002) for a more thorough description). Second, a series of transducers extracts verb-final, verb-first, and verb-second clauses. In contrast to Abney, these are full clauses, not just simplex clause chunks, so that again recursion can occur. Third, the resulting parse tree is refined and decorated with grammatical roles, using non-deterministic 'interpretation' transducers (the same technique is used by Abney (1991)). Fourth, verb complexes are examined to find the head verb and auxiliary passive or raising verbs. Only then subcategorization frames can be checked on the clause elements via a non-deterministic transducer, giving them more specific grammatical roles if successful. Fifth, dependency tuples are extracted from the parse tree.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Underspecification
</SectionTitle>
      <Paragraph position="0"> Some parsing decisions are known to be not resolvable by grammar. Such decisions are best handed over to subsequent modules equipped with the relevant knowledge. Thus, in chart parsing, an under-specified representation is constructed, from which all possible analyses can be easily and efficiently read off. Elworthy et al. (2001) describe a cascaded parser which underspecifies PP attachment by allowing modifiers to be linked to several heads in a dependency tree. Example (5) illustrates this scheme.</Paragraph>
      <Paragraph position="1"> (5) I saw a man in a car on the hill.</Paragraph>
      <Paragraph position="2"> The main drawback of this scheme is its overgeneration. In fact, it allows six readings for example (5), which only has five readings (the speaker could not have been in the car, if the man was asserted to be on the hill). A similar clause with 10 PPs at the end would receive 39,916,800 readings rather than 58,786. So a more elaborate scheme is called for, but one that is just as easy to generate.</Paragraph>
      <Paragraph position="3"> A device that often comes in handy for under-specification are context variables (Maxwell III and Kaplan, 1989; Dorre, 1997). First let us give every sequence of prepositional phrases in every clause a specific name (e.g. 1B for the second sequence in the first clause). Now we generate the ambiguous dependency relations (like (Elworthy et al., 2001)) but label them with context variables. Such context variables consist of the sequence name a0 , a number a1 designating the dependent in left-to-right order (e.g. 0 for in, 1 for on in example (5)), and a number a2 designating the head in left-to-right (e.g.</Paragraph>
      <Paragraph position="4"> 0 for saw, 1 for man, 2 for hill in (5)). If the links are stored with the dependents, the number a1 can be left implicit. Generation of such a representation is straightforward and, in particular, does not lead to a higher class of complexity of the full system. Example (6) shows a tuple representation for the two prepositions of sentence (5).</Paragraph>
      <Paragraph position="5"> (6) in [1A00] saw ADJ, [1A01] man ADJ on [1A10] saw ADJ, [1A11] man ADJ, [1A12] car ADJ In general, a dependent a1 can modify a1a4a3a6a5 heads, viz. the heads numbered a7a9a8a11a10a11a10a11a10a12a8a13a1a14a3a15a5 . Now we put the following constraint on resolution: A dependent a1a17a16 can only modify a head a2a18a16 if no previous dependent a1a20a19 which could have attached to a2a21a16 (i.e. a2 a16a23a22 a1 a19 a3a24a5 ) chose some head a2 a19 to the left of a2 a16 rather than a2a20a16 . The condition is formally expressed in (7). In example (6) there are only two dependents (a1a25a19a27a26 in, a1a28a16a29a26 on). If in attaches to saw, on cannot attach to a head between saw and in; conversely if on attaches to man, in cannot attach to a head before man. Nothing follows if on attaches to car.</Paragraph>
      <Paragraph position="7"> The cascaded parser described adopts this under-specification scheme for right modification. Left modification (see (8)) is usually not stacked so the simpler scheme of Elworthy et al. (2001) suffices.</Paragraph>
      <Paragraph position="8"> (8) They are usually competent people.</Paragraph>
      <Paragraph position="9"> German is a free word order language, so that sub-categorization can be ambiguous. Such ambiguities should also be underspecified. Again we introduce a context variable a0 for every ambiguous subcategorization frame (e.g. 1 in (9)) and count the individual readings a49 (with letters a,b in (9)).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Evaluation of the Underspecified
Representation
</SectionTitle>
      <Paragraph position="0"> In evaluating underspecified representations, Riezler et al. (2002) distinguish upper and lower bound, standing for optimal performance in disambiguation and average performance, respectively. In  of the parser without underspecification, i.e. always favoring maximal attachment and word order without scrambling (direct). Interestingly this method performs significantly better than average, an effect mainly due to the preference for high attachment.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Combining the Parsers
</SectionTitle>
    <Paragraph position="0"> We considered several strategies to combine the results of the diverse parsing approaches: simple voting, weighted voting, Bayesian learning, Maximum Entropy, and greedy optimization of F-value.</Paragraph>
    <Paragraph position="1"> Simple Voting. The result predicted by the majority of base classifiers is chosen. The finite-state parser, which may give more than one result, distributes its vote evenly on the possible readings.</Paragraph>
    <Paragraph position="2"> Weighted Voting. In weighted voting, the result which gets the most votes is chosen, where the number of votes given to a base classifier is correlated with its performance on a training set.</Paragraph>
    <Paragraph position="3"> Bayesian Learning. The Bayesian approach of Xu et al. (1992) chooses the most probable prediction. The probability of a prediction a0 is computed by the product a1a3a2a5a4 a31a6a0a8a7a0 a2 a47 of the probability of a0 given the predictions a0 a2 made by the individual base classifiers a9 . The probability a4 a31a6a0 a19a10a7a0a18a16a11a47 of a correct prediction a0 a19 given a learned prediction a0 a16 is approximated by relative frequency in a training set.</Paragraph>
    <Paragraph position="4"> Maximum Entropy. Combining the results can also be seen as a classification task, with base predictions added to the original set of features. We used the Maximum Entropy approach5 (Berger et al., 1996) as a machine learner for this task. Underspecified features were assigned multiple values.</Paragraph>
    <Paragraph position="5"> Greedy Optimization of F-value. Another method uses a decision list of prediction-classifier pairs to choose a prediction by a classifier. The list is obtained by greedy optimization: In each step, the prediction-classifier pair whose addition results in the highest gain in F-value for the combined model on the training set is appended to the list.</Paragraph>
    <Paragraph position="6"> The algorithm terminates when F-value cannot be improved by any of the remaining candidates. A finer distinction is possible if the decision is made dependent on the POS tag as well. For greedy optimization, the predictions of the finite-state parser were classified only in grammatical roles, not head positions. We used 10-fold cross validation to determine the decision lists.</Paragraph>
    <Paragraph position="7">  We tested the various combination strategies for the combination Finite-State parser (lower bound) and C4.5 5-gram nth-tag on ideal tags (results in Figure 4). Both simple and weighted voting degrade the results of the base classifiers. Greedy optimization outperforms all other strategies. Indeed it comes near the best possible choice which would give an F-score of .9089 for 5-gram nth-tag and finite-state parser (upper bound) (cf. Figure 5).</Paragraph>
    <Paragraph position="8"> without POS tag with POS tag  Figure 5 shows results for some combinations with the greedy optimization strategy on ideal tags.</Paragraph>
    <Paragraph position="9"> All combinations listed yield an improvement of more than 1% in F-value over the base classifiers.</Paragraph>
    <Paragraph position="10"> It is striking that combination with a shallow parser does not help the Finite-State parser much in coverage (upper bound), but that it helps both in disambiguation (pushing up the lower bound to almost the level of upper bound) and robustness (remedying at least some of the errors). The benefit of underspecification is visible when lower bound and direct are compared. The nth-tag 5-gram method was the best method to combine the finite-state parser with. Even on T-tags, this combination achieved an F-score of .8520 (lower, upper: .8579, direct: .8329) without POS tag and an F-score of .8563 (lower, upper: .8642, direct: .8535) with POS tags.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 In-Depth Evaluation
</SectionTitle>
    <Paragraph position="0"> Figure 6 gives a survey of the performance of the parsing approaches relative to grammatical role.</Paragraph>
    <Paragraph position="1"> These figures are more informative than overall F-score (Preiss, 2003). The first column gives the name of the grammatical role, as explained below.</Paragraph>
    <Paragraph position="2"> The second column shows corpus frequency in percent. The third column gives the standard deviation of distance between dependent and head. The three last columns give the performance (recall) of C4.5 with distance representation and 5-grams, C4.5 with nth-tag representation and 5-grams, and the cascaded finite-state parser, respectively. For the finite-state parser, the number shows performance with optimal disambiguation (upper bound) and, if the grammatical role allows underspecification, the number for average disambiguation (lower bound) in parentheses.</Paragraph>
    <Paragraph position="3"> Relations between function words and content words (e.g. specifier (SPR), marker complement (CMP), infinitival zu marker (PM)) are frequent and easy for all approaches. The cascaded parser has an edge over the learners with arguments (subject (SB), clausal (OC), accusative (OA), second accusative (OA2), genitive (OG), dative object (DA)). For all these argument roles a slight amount of ambiguity persists (as can be seen from the divergence between upper and lower bound), which is due to free word order. No ambiguity is found with reported speech (RS). The cascaded parser also performs quite well where verb complexes are concerned (separable verb prefix (SVP), governed verbs (OC), and predicative complements (PD, SP)). Another clearly discernible complex are adjuncts (modifier (MO), negation (NG), passive subject (SBP); one-place coordination (JUnctor) and discourse markers (DM); finally postnominal modifier (MNR), genitive (GR), or von-phrase (PG)), which all exhibit attachment ambiguities. No attachment ambiguities are attested for prenominal genitives (GL). Some types of adjunction have not yet been implemented in the cascaded parser, so that it performs badly on them (e.g. relative clauses (RC), which are usually extraposed to the right (average distance is 11.6) and thus quite difficult also for the learners; comparative constructions (CC, CM), measure phrases (AMS), floating quantifiers (NK+)). Attachment ambiguities also occur with appositions (APP, NK6). Notoriously difficult is coordination (attachrole freq dev dist nth-t FS-parser  ment of conjunction to conjuncts (CD), and dependency on multiple heads (a10a11a10a11a10c)). Vocatives (VO) are not treated in the cascaded parser. AC is the relation between parts of a circumposition.</Paragraph>
    <Paragraph position="4"> 6Other relations classified as NK in the original tree-bank have been reclassified: prenominal determiners to SPR, prenominal adjective phrases to MO.</Paragraph>
  </Section>
class="xml-element"></Paper>