<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1010"> <Title>Deterministic Dependency Parsing of English Text</Title>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Deterministic Dependency Parsing </SectionTitle>
<Paragraph position="0"> In dependency parsing the goal of the parsing process is to construct a labeled dependency graph of the kind depicted in Figure 1. In formal terms, we define dependency graphs as follows: 1. Let R = {r1,...,rm} be the set of permissible dependency types (arc labels).</Paragraph>
<Paragraph position="1"> 2. A dependency graph for a string of words W = w1...wn is a labeled directed graph D = (W,A), where (a) W is the set of nodes, i.e. word tokens in the input string, (b) A is a set of labeled arcs (wi,r,wj) (wi,wj ∈ W, r ∈ R), (c) for every wj ∈ W, there is at most one arc (wi,r,wj) ∈ A.</Paragraph>
<Paragraph position="2"> 1The attachment score only considers whether a word is assigned the correct head; the labeled accuracy score in addition requires that it is assigned the correct dependency type; cf. section 4.</Paragraph>
<Paragraph position="3"> 3. A graph D = (W,A) is well-formed iff it is acyclic, projective and connected.</Paragraph>
<Paragraph position="4"> For a more detailed discussion of dependency graphs and well-formedness conditions, the reader is referred to Nivre (2003).</Paragraph>
<Paragraph position="5"> The parsing algorithm used here was first defined for unlabeled dependency parsing in Nivre (2003) and subsequently extended to labeled graphs in Nivre et al. (2004). Parser configurations are represented by triples ⟨S,I,A⟩, where S is the stack (represented as a list), I is the list of (remaining) input tokens, and A is the (current) arc relation for the dependency graph. (Since in a dependency graph the set of nodes is given by the input tokens, only the arcs need to be represented explicitly.) Given an input string W, the parser is initialized to ⟨nil,W,∅⟩2 and terminates when it reaches a configuration ⟨S,nil,A⟩ (for any list S and set of arcs A). The input string W is accepted if the dependency graph D = (W,A) given at termination is well-formed; otherwise W is rejected. Given an arbitrary configuration of the parser, there are four possible transitions to the next configuration (where t is the token on top of the stack, n is the next input token, w is any word, and r, r′ ∈ R): 1. Left-Arc: In a configuration ⟨t|S,n|I,A⟩, if there is no arc (w,r,t) ∈ A, extend A with (n,r′,t) and pop the stack, giving the configuration ⟨S,n|I,A∪{(n,r′,t)}⟩.</Paragraph>
<Paragraph position="6"> 2. Right-Arc: In a configuration ⟨t|S,n|I,A⟩, if there is no arc (w,r,n) ∈ A, extend A with (t,r′,n) and push n onto the stack, giving the configuration ⟨n|t|S,I,A∪{(t,r′,n)}⟩.</Paragraph>
<Paragraph position="7"> 3. Reduce: In a configuration ⟨t|S,I,A⟩, if there is an arc (w,r,t) ∈ A, pop the stack, giving the configuration ⟨S,I,A⟩.</Paragraph>
<Paragraph position="8"> 4. Shift: In a configuration ⟨S,n|I,A⟩, push n onto the stack, giving the configuration ⟨n|S,I,A⟩.</Paragraph>
<Paragraph position="9"> 2We use nil to denote the empty list and a|A to denote a list with head a and tail A.</Paragraph>
<Paragraph position="11"> After initialization, the parser is guaranteed to terminate after at most 2n transitions, given an input string of length n (Nivre, 2003). Moreover, the parser always constructs a dependency graph that is acyclic and projective. This means that the dependency graph given at termination is well-formed if and only if it is connected (Nivre, 2003). Otherwise, it is a set of connected components, each of which is a well-formed dependency graph for a substring of the original input.</Paragraph>
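To make the transition system concrete, the following is a minimal Python sketch of the deterministic parsing loop. The guide function predict stands in for the classifier described in Section 3, and the index-based token representation, the label strings, and the use of Shift as a fallback whenever a predicted transition is not permissible are assumptions of the sketch, not prescriptions of the paper.

def parse(words, predict):
    """Return the arc set A as (head, label, dependent) triples over token indices."""
    stack = []                          # S, with the top of the stack last
    inp = list(range(len(words)))       # I, indices of the remaining input tokens
    arcs = set()                        # A, the labeled arcs built so far

    def has_head(i):
        return any(d == i for (_, _, d) in arcs)

    while inp:                          # stop in a configuration ⟨S, nil, A⟩
        n = inp[0]
        action, label = predict(stack, inp, arcs, words)
        if action == "LEFT-ARC" and stack and not has_head(stack[-1]):
            arcs.add((n, label, stack.pop()))      # add (n, r', t) and pop t
        elif action == "RIGHT-ARC" and stack and not has_head(n):
            arcs.add((stack[-1], label, n))        # add (t, r', n) ...
            stack.append(inp.pop(0))               # ... and push n
        elif action == "REDUCE" and stack and has_head(stack[-1]):
            stack.pop()
        else:                                      # SHIFT, also used as fallback
            stack.append(inp.pop(0))
    return arcs

Any word left without an incoming arc remains a root, so the graph returned by the loop is well-formed exactly when it is connected, as noted above.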
<Paragraph position="12"> The transition system defined above is nondeterministic in itself, since several transitions can often be applied in a given configuration. To construct deterministic parsers based on this system, we use classifiers trained on treebank data in order to predict the next transition (and dependency type) given the current configuration of the parser.</Paragraph>
<Paragraph position="13"> In this way, our approach can be seen as a form of history-based parsing (Black et al., 1992; Magerman, 1995). In the experiments reported here, we use memory-based learning to train our classifiers.</Paragraph>
</Section>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Memory-Based Learning </SectionTitle>
<Paragraph position="0"> Memory-based learning and problem solving are based on two fundamental principles: learning is the simple storage of experiences in memory, and solving a new problem is achieved by reusing solutions from similar previously solved problems (Daelemans, 1999). It is inspired by the nearest neighbor approach in statistical pattern recognition and artificial intelligence (Fix and Hodges, 1952), as well as the analogical modeling approach in linguistics (Skousen, 1989; Skousen, 1992). In machine learning terms, it can be characterized as a lazy learning method, since it defers processing of input until needed and processes input by combining stored data (Aha, 1997).</Paragraph>
<Paragraph position="1"> Memory-based learning has been successfully applied to a number of problems in natural language processing, such as grapheme-to-phoneme conversion, part-of-speech tagging, prepositional-phrase attachment, and base noun phrase chunking (Daelemans et al., 2002). Previous work on memory-based learning for deterministic parsing includes Veenstra and Daelemans (2000) and Nivre et al. (2004).</Paragraph>
<Paragraph position="2"> For the experiments reported in this paper, we have used the software package TiMBL (Tilburg Memory Based Learner), which provides a variety of metrics, algorithms, and extra functions on top of the classical k nearest neighbor classification kernel, such as value distance metrics and distance weighted class voting (Daelemans et al., 2003).</Paragraph>
<Paragraph position="3"> The function we want to approximate is a mapping f from configurations to parser actions, where each action consists of a transition and (except for Shift and Reduce) a dependency type:</Paragraph>
<Paragraph position="4"> f : Config → ({Left-Arc, Right-Arc} × R) ∪ {Reduce, Shift}</Paragraph>
<Paragraph position="5"> Here Config is the set of all configurations and R is the set of dependency types. In order to make the problem tractable, we approximate f with a function f̂ whose domain is a finite space of parser states, which are abstractions over configurations.</Paragraph>
<Paragraph position="6"> For this purpose we define a number of features that can be used to define different models of parser state.</Paragraph>
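As a rough illustration of the idea, the sketch below classifies a parser state by plain k-nearest-neighbor search over stored (state, action) pairs, using the simple overlap metric. It is only a simplified stand-in for the TiMBL classifier used in the experiments, which additionally offers value difference metrics and distance-weighted voting; the list-based instance memory is an assumption of the example.

from collections import Counter

def overlap_distance(x, y):
    # Number of feature positions on which the two instances disagree.
    return sum(a != b for a, b in zip(x, y))

def knn_predict(state, memory, k=5):
    # memory: list of (feature_vector, action) pairs collected from treebank parses.
    nearest = sorted(memory, key=lambda m: overlap_distance(state, m[0]))[:k]
    votes = Counter(action for _, action in nearest)
    return votes.most_common(1)[0][0]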
<Paragraph position="7"> Figure 2 illustrates the features that are used to define parser states in the present study. The two central elements in any configuration are the token on top of the stack (T) and the next input token (N), the tokens which may be connected by a dependency arc in the next configuration. For these tokens, we consider both the word form (T.LEX, N.LEX) and the part-of-speech (T.POS, N.POS), as assigned by an automatic part-of-speech tagger in a preprocessing phase. Next, we consider a selection of dependencies that may be present in the current arc relation, namely those linking T to its head (TH) and its leftmost and rightmost dependents (TL, TR), and that linking N to its leftmost dependent (NL), considering both the dependency type (arc label) and the part-of-speech of the head or dependent. Finally, we use a lookahead of three tokens, considering only their parts-of-speech.</Paragraph>
<Paragraph position="8"> We have experimented with two different state models, one that incorporates all the features depicted in Figure 2 (Model 1), and one that excludes the parts-of-speech of TH, TL, TR, NL (Model 2). Models similar to Model 2 have been found to work well for datasets with a rich annotation of dependency types, such as the Swedish dependency treebank derived from Einarsson (1976), where the extra part-of-speech features are largely redundant (Nivre et al., 2004). Model 1 can be expected to work better for datasets with less informative dependency annotation, such as dependency trees extracted from the Penn Treebank, where the extra part-of-speech features may compensate for the lack of information in arc labels.</Paragraph>
<Paragraph position="9"> The learning algorithm used is the IB1 algorithm (Aha et al., 1991) with k = 5, i.e. classification based on 5 nearest neighbors. (In TiMBL, k in fact refers to the k nearest distances rather than the k nearest neighbors, which means that, even with k = 1, the nearest neighbor set can contain several instances that are equally distant to the test instance; this is different from the original IB1 algorithm, as described in Aha et al. (1991).) Distances are measured using the modified value difference metric (MVDM) (Stanfill and Waltz, 1986; Cost and Salzberg, 1993) for instances with a frequency of at least 3 (and the simple overlap metric otherwise), and classification is based on distance weighted class voting with inverse distance weighting (Dudani, 1976). These settings are the result of extensive experiments partially reported in Nivre et al. (2004). For more information about the different parameters and settings, see Daelemans et al. (2003).</Paragraph>
</Section>
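To illustrate how a configuration is reduced to a parser state, the sketch below extracts a feature vector roughly corresponding to Model 2: the word form and part-of-speech of T and N, the dependency types of TH, TL, TR and NL, and the parts-of-speech of the next three lookahead tokens. The index-based token representation, the field order, and the "NONE" padding value are assumptions made for the example, not the exact encoding used with TiMBL.

def parser_state(stack, inp, arcs, words, tags):
    # stack and inp hold token indices; arcs holds (head, label, dependent) triples.
    t = stack[-1] if stack else None          # token on top of the stack (T)
    n = inp[0] if inp else None               # next input token (N)

    def head_label(tok):
        # Dependency type of the arc linking tok to its head, if any.
        return next((r for (h, r, d) in arcs if d == tok), "NONE")

    def dep_label(tok, leftmost):
        # Dependency type of tok's leftmost or rightmost dependent, if any.
        deps = sorted((d, r) for (h, r, d) in arcs if h == tok)
        if not deps:
            return "NONE"
        return deps[0][1] if leftmost else deps[-1][1]

    def lex(tok):
        return words[tok] if tok is not None else "NONE"

    def pos(tok):
        return tags[tok] if tok is not None else "NONE"

    lookahead = [tags[i] for i in inp[1:4]] + ["NONE"] * 3   # pad short sentences
    return (
        lex(t), pos(t), lex(n), pos(n),                             # T.LEX, T.POS, N.LEX, N.POS
        head_label(t) if t is not None else "NONE",                 # TH.DEP
        dep_label(t, leftmost=True) if t is not None else "NONE",   # TL.DEP
        dep_label(t, leftmost=False) if t is not None else "NONE",  # TR.DEP
        dep_label(n, leftmost=True) if n is not None else "NONE",   # NL.DEP
        lookahead[0], lookahead[1], lookahead[2],                   # lookahead parts-of-speech
    )

Model 1 would additionally include the parts-of-speech of the TH, TL, TR and NL tokens.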
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle>
<Paragraph position="0"> The data set used for experimental evaluation is the standard data set from the Wall Street Journal section of the Penn Treebank, with sections 2-21 used for training and section 23 for testing (Collins, 1999; Charniak, 2000). The data has been converted to dependency trees using head rules (Magerman, 1995; Collins, 1996). We are grateful to Yamada and Matsumoto for letting us use their rule set, which is a slight modification of the rules used by Collins (1999). This permits us to make exact comparisons with the parser of Yamada and Matsumoto (2003), but also with the parsers of Collins (1997) and Charniak (2000), which are evaluated on the same data set in Yamada and Matsumoto (2003).</Paragraph>
<Paragraph position="2"> One problem that we had to face is that the standard conversion of phrase structure trees to dependency trees gives unlabeled dependency trees, whereas our parser requires labeled trees. Since the annotation scheme of the Penn Treebank does not include dependency types, there is no straightforward way to derive such labels. We have therefore experimented with two different sets of labels, neither of which corresponds to dependency types in a strict sense. The first set consists of the function tags for grammatical roles according to the Penn II annotation guidelines (Bies et al., 1995); we call this set G. The second set consists of the ordinary bracket labels (S, NP, VP, etc.), combined with function tags for grammatical roles, giving composite labels such as NP-SBJ; we call this set B. We assign labels to arcs by letting each (non-root) word that heads a phrase P in the original phrase structure have its incoming edge labeled with the label of P (modulo the set of labels used). In both sets, we also include a default label DEP for arcs that would not otherwise get a label. This gives a total of 7 labels in the G set and 50 labels in the B set. Figure 1 shows a converted dependency tree using the B labels; in the corresponding tree with G labels, NP-SBJ would be replaced by SBJ, and ADVP and VP by DEP.</Paragraph>
<Paragraph position="3"> We use the following metrics for evaluation: 1. Unlabeled attachment score (UAS): The proportion of words that are assigned the correct head (or no head if the word is a root) (Eisner, 1996; Collins et al., 1999).</Paragraph>
<Paragraph position="4"> 2. Labeled attachment score (LAS): The proportion of words that are assigned the correct head and dependency type (or no head if the word is a root) (Nivre et al., 2004).</Paragraph>
<Paragraph position="5"> 3. Dependency accuracy (DA): The proportion of non-root words that are assigned the correct head (Yamada and Matsumoto, 2003).</Paragraph>
<Paragraph position="6"> 4. Root accuracy (RA): The proportion of root words that are analyzed as such (Yamada and Matsumoto, 2003).</Paragraph>
<Paragraph position="7"> 5. Complete match (CM): The proportion of sentences whose unlabeled dependency structure is completely correct (Yamada and Matsumoto, 2003).</Paragraph>
<Paragraph position="8"> All metrics except CM are calculated as mean scores per word, and punctuation tokens are consistently excluded.</Paragraph>
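As a worked illustration of these definitions, the sketch below computes all five scores from per-word tuples (gold_head, gold_label, pred_head, pred_label, is_punct), with the head set to None for root words; this data layout and the punctuation flag are assumptions made for the example.

def evaluate(sentences):
    uas = las = da = ra = cm = 0
    words = nonroot = roots = 0
    for sent in sentences:
        scoring = [w for w in sent if not w[4]]            # punctuation tokens excluded
        for gh, gl, ph, pl, _ in scoring:
            words += 1
            if gh == ph:
                uas += 1                                   # correct head (or correctly a root)
                if gh is None or gl == pl:
                    las += 1                               # correct head and dependency type
            if gh is None:                                 # gold root word
                roots += 1
                ra += (ph is None)
            else:                                          # gold non-root word
                nonroot += 1
                da += (gh == ph)
        cm += all(gh == ph for gh, _, ph, _, _ in scoring)
    return {"UAS": uas / words, "LAS": las / words, "DA": da / nonroot,
            "RA": ra / roots, "CM": cm / len(sentences)}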
<Paragraph position="9"> Table 1 shows the attachment score, both unlabeled and labeled, for the two different state models with the two different label sets. First of all, we see that Model 1 gives better accuracy than Model 2 with the smaller label set G, which confirms our expectations that the added part-of-speech features are helpful when the dependency labels are less informative. Conversely, we see that Model 2 outperforms Model 1 with the larger label set B, which is consistent with the hypothesis that part-of-speech features become redundant as dependency labels get more informative. It is interesting to note that this effect holds even in the case where the dependency labels are mostly derived from phrase structure categories. We can also see that the unlabeled attachment score improves, for both models, when the set of dependency labels is extended. On the other hand, the labeled attachment score drops, but it must be remembered that these scores are not really comparable, since the number of classes in the classification problem increases from 7 to 50 as we move from the G set to the B set. Therefore, we have also included the labeled attachment score restricted to the G set for the parser using the B set (BG), and we see then that the attachment score improves, especially for Model 2. (All differences are significant beyond the .01 level; McNemar's test.)</Paragraph>
<Paragraph position="10"> Table 2 shows the dependency accuracy, root accuracy and complete match scores for our best parser (Model 2 with label set B) in comparison with Collins (1997) (Model 3), Charniak (2000), and Yamada and Matsumoto (2003); the information in the first three rows is taken directly from Yamada and Matsumoto (2003). It is clear that, with respect to unlabeled accuracy, our parser does not quite reach state-of-the-art performance, even if we limit the comparison to deterministic methods such as that of Yamada and Matsumoto (2003). We believe that there are three main reasons for this. First of all, the part-of-speech tagger used for preprocessing in our experiments has a lower accuracy than the one used by Yamada and Matsumoto (2003) (96.1% vs. 97.1%). Although this is not a very interesting explanation, it undoubtedly accounts for part of the difference. Secondly, since our parser makes crucial use of dependency type information in predicting the next action of the parser, it is very likely that it suffers from the lack of real dependency labels in the converted treebank. Indirect support for this assumption can be gained from previous experiments with Swedish data, where almost the same accuracy (85% unlabeled attachment score) has been achieved with a treebank which is much smaller but which contains proper dependency annotation (Nivre et al., 2004).</Paragraph>
<Paragraph position="11"> A third important factor is the relatively low root accuracy of our parser, which may reflect a weakness in the one-pass parsing strategy with respect to the global structure of complex sentences. It is noteworthy that our parser has lower root accuracy than dependency accuracy, whereas the inverse holds for all the other parsers. The problem becomes even more visible when we consider the dependency and root accuracy for sentences of different lengths, as shown in Table 3. Here we see that for very short sentences (up to 10 words) root accuracy is indeed higher than dependency accuracy, but while dependency accuracy degrades gracefully with sentence length, the root accuracy drops more drastically (which also very clearly affects the complete match score). This may be taken to suggest that some kind of preprocessing in the form of clausing may help to improve overall accuracy.</Paragraph>
<Paragraph position="12"> Turning finally to the assessment of labeled dependency accuracy, we are not aware of any strictly comparable results for the given data set, but Buchholz (2002) reports a labeled accuracy of 72.6% for the assignment of grammatical relations using a cascade of memory-based processors. This can be compared with a labeled attachment score of 84.4% for Model 2 with our B set, which is of about the same size as the set used by Buchholz, although the labels are not the same. In another study, Blaheta and Charniak (2000) report an F-measure of 98.9% for the assignment of Penn Treebank grammatical role labels (our G set) to phrases that were correctly parsed by the parser described in Charniak (2000).</Paragraph>
<Paragraph position="13"> If null labels (corresponding to our DEP labels) are excluded, the F-score drops to 95.7%. The corresponding F-measures for our best parser (Model 2, BG) are 99.0% and 94.7%. For the larger B set, our best parser achieves an F-measure of 96.9% (DEP labels included), which can be compared with 97.0% for a similar (but larger) set of labels in Collins (1999). Although none of the previous results on labeling accuracy is strictly comparable to ours, it nevertheless seems fair to conclude that the labeling accuracy of the present parser is close to the state of the art, even if its capacity to derive correct structures is not.</Paragraph>
</Section>
</Paper>