<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2936">
  <Title>Maximum Spanning Tree Algorithm for Non-projective Labeled Dependency Parsing</Title>
  <Section position="4" start_page="0" end_page="236" type="metho">
    <SectionTitle>
2 Non-Projective Dependency Parsing
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="236" type="sub_section">
      <SectionTitle>
2.1 Dependency Structure
</SectionTitle>
      <Paragraph position="0"> Let us define x to be a generic sequence of input tokens together with their POS tags and other morphological features, and y to be a generic dependency structure, that is, a set of edges for x. We use the terminology in (Taskar et al., 2004) for a generic structured output prediction, and define a part.</Paragraph>
      <Paragraph position="1"> A part represents an edge together with its label.</Paragraph>
      <Paragraph position="2"> A part is a tuple &lt;DEPREL,i,j&gt; where i is the start point of the edge, j is the end point, and DEPREL is the label of the edge. The token at i is the head of the token at j.</Paragraph>
      <Paragraph position="3"> Table 1 shows our formulation of building a non-projective dependency tree as a prediction problem. The task is to predict y, the set of parts (column 3, Table 1), given x, the input tokens and their features (column 1 and 2, Table 1).</Paragraph>
      <Paragraph position="4"> In this paper we use the common method of factoring the score of the dependency structure as the sum of the scores of all the parts.</Paragraph>
      <Paragraph position="5"> A dependency structure is characterized by its features, and for each feature, we have a correspond- null ing weight. The score of a dependency structure is the sum of these weights. Now, the dependency structures are factored by the parts, so that each feature is some type of a specialization of a part. Each part in a dependency structure maps to several features. If we sum up the weights for these features, we have the score for the part, and if we sum up the scores of the parts, we have the score for the dependency structure.</Paragraph>
      <Paragraph position="6"> For example, let us say we would like to find the score of the part &lt;OBJ,2,4&gt; . This is the edge going to the 4th token &amp;quot;dog&amp;quot; in Table 1. Suppose there are two features for this part.</Paragraph>
      <Paragraph position="7"> * There is an edge labeled with &amp;quot;OBJ&amp;quot; that points to the right. ( = DEPREL, dir(i,j) ) * There is an edge labeled with &amp;quot;OBJ&amp;quot; starting at the token &amp;quot;saw&amp;quot; which points to the right. ( = DEPREL, dir(i,j), wordi ) If a statement is never true during the training, the weight for it will be 0. Otherwise there will be a positive weight value. The score will be the sum of all the weights of the features given by the part.</Paragraph>
      <Paragraph position="8"> In the upcoming section, we explain a decoding algorithm for the dependency structures, and later we give a method for learning the weight vector used in the decoding.</Paragraph>
    </Section>
    <Section position="2" start_page="236" end_page="236" type="sub_section">
      <SectionTitle>
2.2 Maximum Spanning Tree Algorithm
</SectionTitle>
      <Paragraph position="0"> As in (McDonald et al., 2005), the decoding algorithm we used is the Chu-Liu-Edmonds (CLE) algorithm (Chu and Liu, 1965; Edmonds, 1967) for finding the Maximum Spanning Tree in a directed graph. The following is a nice summary by (Mc-Donald et al., 2005).</Paragraph>
      <Paragraph position="1"> Informally, the algorithm has each vertex in the graph greedily select the incoming edge with highest weight.</Paragraph>
      <Paragraph position="2"> Note that the edge is coming from the parent to the child. This means that given a child node wordj, we are finding the parent, or the head wordi such that the edge (i,j) has the highest weight among all i, i negationslash= j.</Paragraph>
      <Paragraph position="3"> If a tree results, then this must be the maximum spanning tree. If not, there must be a cycle. The procedure identifies a cycle and contracts it into a single vertex and recalculates edge weights going into and out of the cycle. It can be shown that a maximum spanning tree on the contracted graph is equivalent to a maximum spanning tree in the original graph (Leonidas, 2003). Hence the algorithm can recursively call itself on the new graph.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="236" end_page="236" type="metho">
    <SectionTitle>
3 Online Learning
</SectionTitle>
    <Paragraph position="0"> Again following (McDonald et al., 2005), we have used the single best MIRA (Crammer and Singer, 2003), which is a variant of the voted perceptron (Collins, 2002; Collins and Roark, 2004) for structured prediction. In short, the update is executed when the decoder fails to predict the correct parse, and we compare the correct parse yt and the incorrect parse y' suggested by the decoding algorithm.</Paragraph>
    <Paragraph position="1"> The weights of the features in y' will be lowered, and the weights of the features in yt will be increased accordingly. null</Paragraph>
  </Section>
  <Section position="6" start_page="236" end_page="237" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> Our experiments were conducted on CoNLL-X shared task, with various datasets (HajiVc et al., 2004; Simov et al., 2005; Simov and Osenova, 2003; Chen et al., 2003; B&amp;quot;ohmov'a et al., 2003; Kromann, 2003; van der Beek et al., 2002; Brants et al., 2002; Kawata and Bartels, 2000; Afonso et al., 2002; DVzeroski et al., 2006; Civit Torruella and Mart'i Anton'in, 2002; Nilsson et al., 2005; Oflazer et al., 2003; Atalay et al., 2003) .</Paragraph>
    <Section position="1" start_page="236" end_page="237" type="sub_section">
      <SectionTitle>
4.1 Dependency Relation
</SectionTitle>
      <Paragraph position="0"> The CLE algorithm works on a directed graph with unlabeled edges. Since the CoNLL-X shared task  Given a part &lt;DEPREL,i,j&gt; DEPREL, dir(i,j) DEPREL, dir(i,j), wordi DEPREL, dir(i,j), posi DEPREL, dir(i,j), wordj DEPREL, dir(i,j), posj  requires the labeling of edges, as a preprocessing stage, we created a directed complete graph without multi-edges, that is, given two distinct nodes i and j, exactly two edges exist between them, one from i to j, and the other from j to i. There is no self-pointing edge. Then we labeled each edge with the highest scoring dependency relation. This complete graph was given to the CLE algorithm and the edge labels were never altered in the course of finding the maximum spanning tree. The result is the non-projective dependency tree with labeled edges.</Paragraph>
    </Section>
    <Section position="2" start_page="237" end_page="237" type="sub_section">
      <SectionTitle>
4.2 Features
</SectionTitle>
      <Paragraph position="0"> The features we used to score each part (edge) &lt;DEPREL,i,j&gt; are shown in Table 2. The index i is the position of the parent and j is that of the child. wordj = the word token at the position j.</Paragraph>
      <Paragraph position="1"> posj = the coarse part-of-speech at j.</Paragraph>
      <Paragraph position="2"> dir(i,j) = R if i &lt; j, and L otherwise.</Paragraph>
      <Paragraph position="3"> No other features were used beyond the combinations of the CPOS tag and the word token in Table 2. We have evaluated our parser on Arabic, Danish, Slovene, Spanish, Turkish and Swedish, and used the &amp;quot;additional features&amp;quot; listed in Table 2 for all languages except for Danish and Swedish. The reason for this is simply that the model with the additional features did not fit in the 4 GB of memory used in the training.</Paragraph>
      <Paragraph position="4"> Although we could do batch learning by running the online algorithm multiple times, we run the on-line algorithm just once. The hardware used is an Intel Pentinum D at 3.0 Ghz with 4 GB of memory, and the software was written in C++. The training time required was Arabic 204 min, Slovene 87 min, Spanish 413 min, Swedish 1192 min, Turkish 410 min, Danish 381 min.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="237" end_page="238" type="metho">
    <SectionTitle>
5 Results
</SectionTitle>
    <Paragraph position="0"> The results are shown in Table 3. Although our feature set is very simple, the results were around the averages. We will do error analysis of three notable languages: Arabic, Swedish and Turkish.</Paragraph>
    <Section position="1" start_page="237" end_page="237" type="sub_section">
      <SectionTitle>
5.1 Arabic
</SectionTitle>
      <Paragraph position="0"> Of 4990 words in the test set, 800 are prepositions.</Paragraph>
      <Paragraph position="1"> The prepositions are the most frequently found tokens after nouns in this set. On the other hand, our head attachment error was 44% for prepositions.</Paragraph>
      <Paragraph position="2"> Given the relatively large number of prepositions found in the test set, it is important to get the preposition attachment right to achieve a higher mark in this language. The obvious solution is to have a feature that connects the head of a preposition to the child of the preposition. However, such a feature effects the edge based factoring and the decoding algorithm, and we will be forced to modify the MST algorithm in some ways.</Paragraph>
    </Section>
    <Section position="2" start_page="237" end_page="237" type="sub_section">
      <SectionTitle>
5.2 Swedish
</SectionTitle>
      <Paragraph position="0"> Due to the memory constraint on the computer, we did not use the additional features for Swedish and our feature heavily relied on the CPOS tag. At the same time, we have noticed that relatively higher performance of our parser compared to the average coincides with the bigger tag set for CPOS for this corpus. This suggests that we should be using more fine grained POS in other languages.</Paragraph>
    </Section>
    <Section position="3" start_page="237" end_page="238" type="sub_section">
      <SectionTitle>
5.3 Turkish
</SectionTitle>
      <Paragraph position="0"> The difficulty with parsing Turkish stems from the large unlabeled attachment error rate on the nouns  (39%). Since the nouns are the most frequently occurring words in the test set (2209 out of 5021 total), this seems to make Turkish the most challenging language for any system in the shared task. On the average, there are 1.8 or so verbs per sentence, and nouns have a difficult time attaching to the correct verb or postposition. This, we think, indicates that there are morphological features or word ordering features that we really need in order to disambiguate them.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="238" end_page="238" type="metho">
    <SectionTitle>
6 Future Work
</SectionTitle>
    <Paragraph position="0"> As well as making use of fine-grained POS tags and other morphological features, given the error analysis on Arabic, we would like to add features that are dependent on two or more edges.</Paragraph>
    <Section position="1" start_page="238" end_page="238" type="sub_section">
      <SectionTitle>
6.1 Bottom-Up Non-Projective Parsing
</SectionTitle>
      <Paragraph position="0"> In order to incorporate features which depend on other edges, we propose Bottom-Up Non-Projective Parsing. It is often the case that dependency relations can be ordered by how close one relation is to the root of dependency tree. For example, the dependency relation between a determiner and a noun should be decided before that between a preposition and a noun, and that of a verb and a preposition, and so on. We can use this information to do bottom-up parsing.</Paragraph>
      <Paragraph position="1"> Suppose all words have a POS tag assigned to them, and every edge labeled with a dependency relation is attached to a specific POS tag at the end point. Also assume that there is an ordering of POS tags such that the edge going to the POS tag needs be decided before other edges. For example, (1) determiner, (2) noun, (3) preposition, (4) verb would be one such ordering. We propose the following algorithm: null * Assume we have tokens as nodes in a graph and no edges are present at first. For example, we have tokens &amp;quot;I&amp;quot;, &amp;quot;ate&amp;quot;, &amp;quot;with&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;spoon&amp;quot;, and no edges between them. * Take the POS tag that needs to be decided next. Find all edges that go to each token labeled with this POS tag, and put them in the graph. For example, if the POS is noun, put edges from &amp;quot;ate&amp;quot; to &amp;quot;I&amp;quot;, from &amp;quot;ate&amp;quot; to &amp;quot;spoon&amp;quot;, from &amp;quot;with&amp;quot; to &amp;quot;I&amp;quot;, from &amp;quot;with&amp;quot; to &amp;quot;spoon&amp;quot;, from &amp;quot;I&amp;quot; to &amp;quot;spoon&amp;quot;, and from &amp;quot;spoon&amp;quot; to &amp;quot;I&amp;quot;. * Run the CLE algorithm on this graph. This selects the highest incoming edge to each token with the POS tag we are looking at, and remove cycles if any are present.</Paragraph>
      <Paragraph position="2"> * Take the resulting forests and for each edge, bring the information on the child node to the parent node. For example, if this time POS was noun, and there is an edge to a preposition &amp;quot;with&amp;quot; from a noun &amp;quot;spoon&amp;quot;, then &amp;quot;spoon&amp;quot; is absorbed by &amp;quot;with&amp;quot;. Note that since no remaining dependency relation will attach to &amp;quot;spoon&amp;quot;, we can safely ignore &amp;quot;spoon&amp;quot; from now on.</Paragraph>
      <Paragraph position="3"> * Go back and repeat until no POS is remaining and we have a dependency tree. Now in the next round, when deciding the score of the edge from &amp;quot;ate&amp;quot; to &amp;quot;with&amp;quot;, we can use the all information at the token &amp;quot;with&amp;quot;, including &amp;quot;spoon&amp;quot;.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>