<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0308">
<Title>Incrementality in Deterministic Dependency Parsing</Title>
<Section position="3" start_page="0" end_page="0" type="metho">
<SectionTitle> 2 Dependency Parsing </SectionTitle>
<Paragraph position="0"> In a dependency structure, every word token is dependent on at most one other word token, usually called its head or regent, which means that the structure can be represented as a directed graph, with nodes representing word tokens and arcs representing dependency relations. In addition, arcs may be labeled with specific dependency types. Figure 1 shows a labeled dependency graph for a simple Swedish sentence, where each word of the sentence is labeled with its part of speech and each arc labeled with a grammatical function.</Paragraph>
<Paragraph position="1"> In the following, we will restrict our attention to unlabeled dependency graphs, i.e. graphs without labeled arcs, but the results will apply to labeled dependency graphs as well. We will also restrict ourselves to projective dependency graphs (Mel'cuk, 1988). Formally, we define these structures in the following way:
1. A dependency graph for a string of words W = w1···wn is a directed graph D = (W, A), where
(a) W is the set of nodes, i.e. word tokens in the input string,
(b) A is a set of arcs (wi, wj) (wi, wj ∈ W).
We write wi < wj to express that wi precedes wj in the string W (i.e., i < j); we write wi → wj to say that there is an arc from wi to wj; we use →* to denote the reflexive and transitive closure of the arc relation; and we use ↔ and ↔* for the corresponding undirected relations, i.e. wi ↔ wj iff wi → wj or wj → wi.
2. A dependency graph D = (W, A) is well-formed iff the five conditions given in Figure 2 are satisfied.</Paragraph>
<Paragraph position="2"> The task of mapping a string W = w1···wn to a dependency graph satisfying these conditions is what we call dependency parsing. For a more detailed discussion of dependency graphs and well-formedness conditions, the reader is referred to Nivre (2003).</Paragraph>
</Section>
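Figure 2 itself is not reproduced in this extraction; judging from the surrounding text, the conditions for unlabeled graphs amount to a single head per token, acyclicity, connectedness, and projectivity (the fifth condition, Unique label, concerns labeled graphs only). The following sketch is ours rather than the paper's: it checks these conditions over token positions 0..n-1, using the no-crossing-arcs formulation of projectivity, which is equivalent to the dominance-based definition for connected graphs.

```python
from collections import deque

def is_well_formed(n, arcs):
    """Check single-head, acyclicity, connectedness and projectivity for
    an unlabeled dependency graph over tokens 0..n-1, where `arcs` is a
    set of (head, dependent) pairs."""
    heads = {}
    for h, d in arcs:
        if d in heads:                  # Single head: at most one head per token
            return False
        heads[d] = h
    for start in range(n):              # Acyclicity: every head chain terminates
        seen, cur = set(), start
        while cur in heads:
            if cur in seen:
                return False
            seen.add(cur)
            cur = heads[cur]
    adj = {i: [] for i in range(n)}     # Connectedness, ignoring arc direction
    for h, d in arcs:
        adj[h].append(d)
        adj[d].append(h)
    reached, queue = {0}, deque([0])
    while queue:
        for nxt in adj[queue.popleft()]:
            if nxt not in reached:
                reached.add(nxt)
                queue.append(nxt)
    if len(reached) != n:
        return False
    spans = [tuple(sorted(a)) for a in arcs]
    for lo1, hi1 in spans:              # Projectivity: no two arcs may cross
        for lo2, hi2 in spans:
            if lo1 < lo2 < hi1 < hi2:
                return False
    return True
```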
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Incrementality in Dependency Parsing </SectionTitle>
<Paragraph position="0"> Having defined dependency graphs, we may now consider to what extent it is possible to construct these graphs incrementally. In the strictest sense, we take incrementality to mean that, at any point during the parsing process, there is a single connected structure representing the analysis of the input consumed so far. In terms of our dependency graphs, this would mean that the graph being built during parsing is connected at all times. We will try to make this more precise in a minute, but first we want to discuss the relation between incrementality and determinism.</Paragraph>
<Paragraph position="1"> It seems that incrementality does not by itself imply determinism, at least not in the sense of never undoing previously made decisions. Thus, a parsing method that involves backtracking can be incremental, provided that the backtracking is implemented in such a way that we can always maintain a single structure representing the input processed up to the point of backtracking. In the context of dependency parsing, a case in point is the parsing method proposed by Kromann (2002), which combines heuristic search with different repair mechanisms. In this paper, we will nevertheless restrict our attention to deterministic methods for dependency parsing, because we think it is easier to pinpoint the essential constraints within a more restrictive framework.</Paragraph>
<Paragraph position="2"> We will formalize deterministic dependency parsing in a way that is inspired by traditional shift-reduce parsing for context-free grammars, using a buffer of input tokens and a stack for storing previously processed input. However, since there are no nonterminal symbols involved in dependency parsing, we also need to maintain a representation of the dependency graph being constructed during processing. We will represent parser configurations by triples ⟨S, I, A⟩, where S is the stack (represented as a list), I is the list of (remaining) input tokens, and A is the (current) arc relation for the dependency graph. (Since the nodes of the dependency graph are given by the input string, only the arc relation needs to be represented explicitly.) Given an input string W, the parser is initialized to ⟨nil, W, ∅⟩ and terminates when it reaches a configuration ⟨S, nil, A⟩ (for any list S and set of arcs A). The input string W is accepted if the dependency graph D = (W, A) given at termination is well-formed; otherwise W is rejected.</Paragraph>
<Paragraph position="3"> In order to understand the constraints on incrementality in dependency parsing, we will begin by considering the most straightforward parsing strategy, i.e. left-to-right bottom-up parsing, which in this case is essentially equivalent to shift-reduce parsing with a context-free grammar in Chomsky normal form. The parser is defined in the form of a transition system, represented in Figure 3 (where wi and wj are arbitrary word tokens):
1. The transition Left-Reduce combines the two topmost tokens on the stack, wi and wj, by a left-directed arc wj → wi and reduces them to the head wj.
2. The transition Right-Reduce combines the two topmost tokens on the stack, wi and wj, by a right-directed arc wi → wj and reduces them to the head wi.
3. The transition Shift pushes the next input token wi onto the stack.</Paragraph>
<Paragraph position="4"> The transitions Left-Reduce and Right-Reduce are subject to conditions that ensure that the Single head condition is satisfied. For Shift, the only condition is that the input list is non-empty.</Paragraph>
<Paragraph position="5"> As it stands, this transition system is nondeterministic, since several transitions can often be applied to the same configuration. Thus, in order to get a deterministic parser, we need to introduce a mechanism for resolving transition conflicts. Regardless of which mechanism is used, the parser is guaranteed to terminate after at most 2n transitions, given an input string of length n. Moreover, the parser is guaranteed to produce a dependency graph that is acyclic and projective (and satisfies the single-head constraint). This means that the dependency graph given at termination is well-formed if and only if it is connected.</Paragraph>
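As a concrete illustration, here is a minimal sketch (ours, not the paper's) of this transition system. The conflict-resolution mechanism the paper leaves open is abstracted into an assumed callback `choose`; since the analysis of structure (1) below requires reducing after the last token has been shifted, we let reductions remain available once the input is exhausted and let `choose` return None to halt in a configuration ⟨S, nil, A⟩. Each transition either consumes an input token or shrinks the stack, so the 2n bound stated above holds.

```python
def bottom_up_parse(tokens, choose):
    """Left-to-right bottom-up parsing with an assumed callback
    choose(stack, buffer, arcs, legal) -> transition name or None."""
    stack, buffer, arcs = [], list(tokens), set()
    while buffer or len(stack) > 1:
        legal = []
        if len(stack) >= 2:
            # Tokens on the stack never have a head yet, so both
            # reductions automatically respect the single-head constraint.
            legal += ["left-reduce", "right-reduce"]
        if buffer:
            legal.append("shift")
        action = choose(stack, buffer, arcs, legal)
        if action is None:                      # halt at <S, nil, A>
            break
        if action == "left-reduce":
            wj, wi = stack.pop(), stack.pop()   # wj was on top, wi below
            arcs.add((wj, wi))                  # left-directed arc wj -> wi
            stack.append(wj)                    # reduce to the head wj
        elif action == "right-reduce":
            wj, wi = stack.pop(), stack.pop()
            arcs.add((wi, wj))                  # right-directed arc wi -> wj
            stack.append(wi)                    # reduce to the head wi
        else:
            stack.append(buffer.pop(0))         # shift the next input token
    return arcs  # accept iff (W, arcs) is well-formed, cf. is_well_formed
```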
<Paragraph position="6"> We can now define what it means for the parsing to be incremental in this framework. Ideally, we would like to require that the graph (W − I, A) is connected at all times. However, given the definition of Left-Reduce and Right-Reduce, it is impossible to connect a new word without shifting it to the stack first, so it seems that a more reasonable condition is that the size of the stack should never exceed 2. In this way, we require every word to be attached somewhere in the dependency graph as soon as it has been shifted onto the stack.</Paragraph>
<Paragraph position="7"> We may now ask whether it is possible to achieve incrementality with a left-to-right bottom-up dependency parser, and the answer turns out to be no in the general case. This can be demonstrated by considering all the possible projective dependency graphs containing only three nodes and checking which of these can be parsed incrementally. Figure 4 shows the relevant structures, of which there are seven altogether.</Paragraph>
<Paragraph position="8"> We begin by noting that trees (2-5) can all be constructed incrementally by shifting the first two tokens onto the stack, then reducing, with Right-Reduce in (2-3) and Left-Reduce in (4-5), and then shifting and reducing again. By contrast, the three remaining trees all require that three tokens be shifted onto the stack before the first reduction. However, the reason why we cannot parse the structure incrementally is different in (1) compared to (6-7).</Paragraph>
<Paragraph position="9"> In (6-7) the problem is that the first two tokens are not connected by a single arc in the final dependency graph. In (6) they are sisters, both being dependents of the third token; in (7) the first is the grandparent of the second. And in pure dependency parsing without nonterminal symbols, every reduction requires that one of the tokens reduced is the head of the other(s). This holds necessarily, regardless of the algorithm used, and is the reason why it is impossible to achieve strict incrementality in dependency parsing as defined here. However, it is worth noting that (2-3), which are the mirror images of (6-7), can be parsed incrementally, even though they contain adjacent tokens that are not linked by a single arc. The reason is that in (2-3) the reduction of the first two tokens makes the third token adjacent to the first. Thus, the defining characteristic of the problematic structures is that precisely the leftmost tokens are not linked directly.</Paragraph>
<Paragraph position="10"> The case of (1) is different in that here the problem is caused by the strict bottom-up strategy, which requires each token to have found all its dependents before it is combined with its head. For left-dependents this is not a problem, as can be seen in (5), which can be processed by alternating Shift and Left-Reduce. But in (1) the sequence of reductions has to be performed from right to left, as it were, which rules out strict incrementality. However, whereas the structures exemplified in (6-7) can never be processed incrementally within the present framework, the structure in (1) can be handled by modifying the parsing strategy, as we shall see in the next section.</Paragraph>
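This case analysis can also be verified mechanically. The sketch below is ours; since Figure 4 is not reproduced in this extraction, the numbering of the seven trees is inferred from the prose (the chains (1) and (5), the structures (6-7), their mirror images (2-3), and (4) as the remaining tree with the middle token as root). It searches all transition sequences whose stack never exceeds size 2 and confirms that exactly (2-5) can be built under the incrementality condition.

```python
def incrementally_parsable(target):
    """Search all transition sequences over tokens [1, 2, 3] whose stack
    never exceeds size 2; report whether the arc set `target` (pairs of
    (head, dependent)) can be produced."""
    def search(stack, buf, arcs):
        if not buf and len(stack) <= 1:
            return arcs == target
        ok = False
        if len(stack) == 2:
            wi, wj = stack                                   # wj on top
            ok = search([wj], buf, arcs | {(wj, wi)})        # Left-Reduce
            ok = ok or search([wi], buf, arcs | {(wi, wj)})  # Right-Reduce
        if not ok and buf and len(stack) < 2:                # Shift, capped at 2
            ok = search(stack + [buf[0]], buf[1:], arcs)
        return ok
    return search([], [1, 2, 3], frozenset())

# Our reading of Figure 4: (1) is the chain w1 -> w2 -> w3; (6) has w3
# heading w1 and w2, (7) is w1 -> w3 -> w2; (2) and (3) are their mirror
# images; (4) has w2 as root; (5) is the left chain w3 -> w2 -> w1.
trees = {
    "(1)": {(1, 2), (2, 3)},
    "(2)": {(1, 2), (1, 3)},
    "(3)": {(3, 1), (1, 2)},
    "(4)": {(2, 1), (2, 3)},
    "(5)": {(3, 2), (2, 1)},
    "(6)": {(3, 1), (3, 2)},
    "(7)": {(1, 3), (3, 2)},
}
for name, arcs in trees.items():
    print(name, incrementally_parsable(arcs))   # True only for (2)-(5)
```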
<Paragraph position="11"> It is instructive at this point to make a comparison with incremental parsing based on extended categorial grammar, where the structures in (6-7) would normally be handled by some kind of concatenation (or product), which does not correspond to any real semantic combination of the constituents (Steedman, 2000; Morrill, 2000). By contrast, the structure in (1) would typically be handled by function composition, which corresponds to a well-defined compositional semantic operation. Hence, it might be argued that the treatment of (6-7) is only pseudo-incremental even in other frameworks.</Paragraph>
<Paragraph position="12"> Before we leave the strict bottom-up approach, it can be noted that the algorithm described in this section is essentially the algorithm used by Yamada and Matsumoto (2003) in combination with support vector machines, except that they allow parsing to be performed in multiple passes, where the graph produced in one pass is given as input to the next pass. (A purely terminological, but potentially confusing, difference is that Yamada and Matsumoto (2003) use the term Right for what we call Left-Reduce and the term Left for Right-Reduce, thus focusing on the position of the head instead of the position of the dependent.) The main motivation they give for parsing in multiple passes is precisely the fact that the bottom-up strategy requires each token to have found all its dependents before it is combined with its head, which is also what prevents the incremental parsing of structures like (1).</Paragraph>
</Section>
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 4 Arc-Eager Dependency Parsing </SectionTitle>
<Paragraph position="0"> In order to increase the incrementality of deterministic dependency parsing, we need to combine bottom-up and top-down processing. More precisely, we need to process left-dependents bottom-up and right-dependents top-down. In this way, arcs will be added to the dependency graph as soon as the respective head and dependent are available, even if the dependent is not complete with respect to its own dependents. Following Abney and Johnson (1991), we will call this arc-eager parsing, to distinguish it from the standard bottom-up strategy discussed in the previous section.</Paragraph>
<Paragraph position="1"> Using the same representation of parser configurations as before, the arc-eager algorithm can be defined by the transitions given in Figure 5, where wi and wj are arbitrary word tokens (Nivre, 2003):
1. The transition Left-Arc adds an arc wj →r wi from the next input token wj to the token wi on top of the stack and pops the stack.
2. The transition Right-Arc adds an arc wi →r wj from the token wi on top of the stack to the next input token wj, and pushes wj onto the stack.
3. The transition Reduce pops the stack.
4. The transition Shift (SH) pushes the next input token wi onto the stack.</Paragraph>
<Paragraph position="2"> The transitions Left-Arc and Right-Arc, like their counterparts Left-Reduce and Right-Reduce, are subject to conditions that ensure that the Single head constraint is satisfied, while the Reduce transition can only be applied if the token on top of the stack already has a head. The Shift transition is the same as before and can be applied as long as the input list is non-empty.</Paragraph>
<Paragraph position="3"> Comparing the two algorithms, we see that the Left-Arc transition of the arc-eager algorithm corresponds directly to the Left-Reduce transition of the standard bottom-up algorithm. The only difference is that, for reasons of symmetry, the former applies to the token on top of the stack and the next input token instead of the two topmost tokens on the stack. If we compare Right-Arc to Right-Reduce, however, we see that the former performs no reduction but simply shifts the newly attached right-dependent onto the stack, thus making it possible for this dependent to have right-dependents of its own. But in order to allow multiple right-dependents, we must also have a mechanism for popping right-dependents off the stack, and this is the function of the Reduce transition. Thus, we can say that the action performed by the Right-Reduce transition in the standard bottom-up algorithm is performed by a Right-Arc transition in combination with a subsequent Reduce transition in the arc-eager algorithm. And since the Right-Arc and the Reduce can be separated by an arbitrary number of transitions, this permits the incremental parsing of arbitrarily long right-dependent chains.</Paragraph>
<Paragraph position="4"> Defining incrementality is less straightforward for the arc-eager algorithm than for the standard bottom-up algorithm. Simply considering the size of the stack will no longer do, since the stack may now contain sequences of tokens that form connected components of the dependency graph. On the other hand, since it is no longer necessary to shift both tokens to be combined onto the stack, and since any tokens that are popped off the stack are connected to some token on the stack, we can require that the graph (S, AS) should be connected at all times, where AS is the restriction of A to S, i.e. AS = {(wi, wj) ∈ A | wi, wj ∈ S}.</Paragraph>
<Paragraph position="5"> Given this definition of incrementality, it is easy to show that structures (2-5) in Figure 4 can be parsed incrementally with the arc-eager algorithm as well as with the standard bottom-up algorithm. However, with the new algorithm we can also parse structure (1) incrementally, by shifting the first token and then applying Right-Arc twice, so that each token is attached as soon as it is pushed onto the stack. We conclude that the arc-eager algorithm is optimal with respect to incrementality in dependency parsing, even though it still holds true that the structures (6-7) in Figure 4 cannot be parsed incrementally. This raises the question of how frequently these structures are found in practical parsing, which is equivalent to asking how often the arc-eager algorithm deviates from strictly incremental processing. Although the answer obviously depends on which language and which theoretical framework we consider, we will attempt to give at least a partial answer to this question in the next section.</Paragraph>
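For comparison with the earlier sketch, here is a minimal unlabeled version of the arc-eager system, again ours rather than the paper's, with the same assumed `choose` callback. For structure (1), the Shift, Right-Arc, Right-Arc sequence just described builds the chain w1 → w2 → w3 while the graph (S, AS) stays connected after every transition.

```python
def arc_eager_parse(tokens, choose):
    """Arc-eager parsing with an assumed callback
    choose(stack, buffer, arcs, legal) -> transition name.
    Buffer tokens never have a head yet, so Right-Arc automatically
    respects the single-head constraint."""
    stack, buffer, arcs = [], list(tokens), set()
    has_head = set()
    while buffer:
        legal = ["shift"]
        if stack:
            legal.append("right-arc")
            if stack[-1] in has_head:
                legal.append("reduce")      # only attached tokens are popped
            else:
                legal.append("left-arc")    # only headless tokens get a head
        action = choose(stack, buffer, arcs, legal)
        if action == "left-arc":
            wi, wj = stack.pop(), buffer[0]
            arcs.add((wj, wi))              # arc wj -> wi; wi is popped
            has_head.add(wi)
        elif action == "right-arc":
            wi, wj = stack[-1], buffer.pop(0)
            arcs.add((wi, wj))              # arc wi -> wj; wj is pushed
            has_head.add(wj)
            stack.append(wj)
        elif action == "reduce":
            stack.pop()
        else:
            stack.append(buffer.pop(0))     # shift
    return arcs
```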
<Paragraph position="6"> Before that, however, we want to relate our results to some previous work on context-free parsing. First of all, it should be observed that the terms top-down and bottom-up take on a slightly different meaning in the context of dependency parsing, as compared to their standard use in context-free parsing. Since there are no nonterminal nodes in a dependency graph, top-down construction means that a head is attached to a dependent before the dependent is attached to (some of) its dependents, whereas bottom-up construction means that a dependent is attached to its head before the head is attached to its own head. However, top-down construction of dependency graphs does not involve the prediction of lower nodes from higher nodes, since all nodes are given by the input string. Hence, in terms of what drives the parsing process, all the algorithms discussed here correspond to bottom-up algorithms in context-free parsing. It is interesting to note that if we recast the problem of dependency parsing as context-free parsing with a CNF grammar, then the problematic structures (1) and (6-7) in Figure 4 all correspond to right-branching structures, and it is well known that bottom-up parsers may require an unbounded amount of memory in order to process right-branching structures (Miller and Chomsky, 1963; Abney and Johnson, 1991).</Paragraph>
<Paragraph position="7"> Moreover, if we analyze the two algorithms discussed here in the framework of Abney and Johnson (1991), they do not differ at all as to the order in which nodes are enumerated, but only with respect to the order in which arcs are enumerated: the first algorithm is arc-standard while the second is arc-eager. One of the observations made by Abney and Johnson (1991) is that arc-eager strategies for context-free parsing may sometimes require less space than arc-standard strategies, although they may lead to an increase in local ambiguities. It seems that the advantage of the arc-eager strategy for dependency parsing with respect to structure (1) in Figure 4 can be explained along the same lines, although the lack of nonterminal nodes in dependency graphs means that there is no corresponding increase in local ambiguities. Although a detailed discussion of the relation between context-free parsing and dependency parsing is beyond the scope of this paper, we conjecture that this may be a genuine advantage of dependency representations in parsing.</Paragraph>
</Section>
</Paper>