<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1072">
  <Title>TüSBL: A Similarity-Based Chunk Parser for Robust Syntactic Processing</Title>
  <Section position="4" start_page="0" end_page="1" type="metho">
    <SectionTitle>
2. THE TüSBL ARCHITECTURE
</SectionTitle>
    <Paragraph position="0"> To ensure a robust and efficient architecture, TüSBL, a similarity-based chunk parser, is organized into three levels, with the output of each level serving as input for the next higher level. The first level is part-of-speech (POS) tagging of the input string with the bigram tagger LIKELY [10].</Paragraph>
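The three-level organization described above can be sketched as a simple pipeline in which each level's output feeds the next. The components below are illustrative stand-ins for LIKELY, the adapted scol chunker, and the tree-construction module; the function names and the toy lexicon are assumptions, not the actual implementation.

```python
# Minimal sketch of the three-level pipeline (illustrative only).

def pos_tag(tokens):
    """Stand-in bigram tagger: maps each token to a POS tag."""
    lexicon = {"dann": "ADV", "wuerde": "VAFIN", "ich": "PPER"}
    return [(tok, lexicon.get(tok, "NN")) for tok in tokens]

def chunk(tagged):
    """Stand-in chunker: groups tagged tokens into flat chunks."""
    return [tagged]  # here: a single chunk covering the whole input

def construct_trees(chunks, instance_base):
    """Stand-in tree construction: look each chunk up in an instance base
    keyed by its word sequence."""
    return [instance_base.get(tuple(t for t, _ in c)) for c in chunks]

# Level 1 (tagging) feeds level 2 (chunking) feeds level 3 (trees):
tokens = "dann wuerde ich".split()
trees = construct_trees(chunk(pos_tag(tokens)),
                        instance_base={("dann", "wuerde", "ich"): "(S ...)"})
```

The point of the sketch is only the data flow: a string of tokens becomes a tagged sequence, the tagged sequence becomes pre-terminals for chunking, and the chunks are remodeled into trees.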
    <Paragraph position="1">  The parts of speech serve as pre-terminal elements for the next step, i.e. the chunk analysis. Chunk parsing is carried out by an adapted version of Abney's [2] scol parser, which is realized as a cascade of finite-state transducers. The chunks, which extend if possible to the simplex clause level, are then remodeled into complete trees in the tree construction level.</Paragraph>
    <Paragraph position="2"> The tree construction is similar to the DOP approach [3, 4] in that it uses complete tree structures instead of rules. Unlike Bod, however, we use neither probabilities nor tree cuts; instead we rely only on complete trees and minimal tree modifications. Thus the number of possible combinations of partial trees is strictly controlled. The resulting parser is highly efficient (3770 English sentences took 106.5 seconds to parse on an Ultra Sparc 10).</Paragraph>
  </Section>
  <Section position="5" start_page="1" end_page="2" type="metho">
    <SectionTitle>
3. CHUNK PARSING AND TREE CONSTRUC-
TION
</SectionTitle>
    <Paragraph position="0"> The division of labor between the chunking and tree construction modules can best be illustrated by an example.</Paragraph>
    <Paragraph position="1">  For complex sentences such as the German input dann würde ich vielleicht noch vorschlagen Donnerstag den elften und Freitag den zwölften August (then I would maybe also suggest Thursday the eleventh and Friday the twelfth of August), the chunker produces a structure in which some constituents remain unattached or only partially annotated, in keeping with the chunk-parsing strategy of factoring out recursion and resolving only unambiguous attachments, as shown in Fig. 1.</Paragraph>
    <Paragraph position="2"> In the case at hand, the subconstituents of the extraposed coordinated noun phrase are not attached to the simplex clause, which ends with the non-finite verb that typically occupies clause-final position in declarative main clauses of German. Moreover, each conjunct of the coordinated noun phrase forms a completely flat structure. TüSBL's tree construction module enriches the chunk output as shown in Fig. 2. Here the internally recursive NP conjuncts have been coordinated and integrated correctly into the clause as a whole. In addition, function labels such as mod (for: modifier), hd (for: head), on (for: subject), oa (for: direct object), and ov (for: verbal object) have been added that encode the function-argument structure of the sentence.</Paragraph>
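The function labels listed above lend themselves to a small illustration. The clause below is a schematic stand-in, not the actual tree of Fig. 2: it only shows how daughters annotated with mod, hd, on, and ov jointly encode the function-argument structure of a clause.

```python
# Schematic encoding of an enriched clause with NEGRA-style function
# labels (illustrative data, not the tree from Fig. 2).

FUNCTION_LABELS = {
    "mod": "modifier",
    "hd":  "head",
    "on":  "subject",
    "oa":  "direct object",
    "ov":  "verbal object",
}

clause = ("SIMPX", [
    ("dann", "mod"),        # sentence-initial modifier
    ("wuerde", "hd"),       # finite verbal head
    ("ich", "on"),          # subject
    ("vorschlagen", "ov"),  # non-finite verbal object
])

# Recover the function-argument structure: which word fills which slot.
args = {label: word for word, label in clause[1]}
```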
  </Section>
  <Section position="6" start_page="2" end_page="3" type="metho">
    <SectionTitle>
4. SIMILARITY-BASED TREE CONSTRUC-
TION
</SectionTitle>
    <Paragraph position="0"> The tree construction algorithm is based on the machine learning paradigm of memory-based learning [12].</Paragraph>
    <Paragraph position="1">  Memory-based learning assumes that the classification of a given input should be based on the similarity to previously seen instances of the same type that have been stored in memory. This paradigm is an instance of lazy learning in the sense that these previously encountered instances are stored &amp;quot;as is&amp;quot; and are crucially not abstracted over, as is typically the case in rule-based systems or other learning approaches. Past applications of memory-based learning to NLP tasks consist of classification problems in which the set of classes to be learnt is simple in the sense that the class items do not have any internal structure and the number of distinct items is small.</Paragraph>
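The memory-based idea described above can be made concrete with a minimal sketch: training instances are stored verbatim, no abstraction takes place, and all the work happens lazily at classification time via a similarity comparison against the stored instances. The class name, feature vectors, and labels below are illustrative assumptions.

```python
# A minimal memory-based ("lazy") classifier: instances are stored
# "as is" and classification returns the label of the most similar
# stored instance under a simple overlap count.

def overlap(a, b):
    """Count positions at which two feature sequences agree."""
    return sum(1 for x, y in zip(a, b) if x == y)

class MemoryBasedClassifier:
    def __init__(self):
        self.instances = []          # (features, label) pairs, unabstracted

    def train(self, features, label):
        # "training" is pure storage; nothing is generalized here
        self.instances.append((tuple(features), label))

    def classify(self, features):
        # lazy learning: all comparison work happens at query time
        return max(self.instances,
                   key=lambda inst: overlap(inst[0], features))[1]

mbc = MemoryBasedClassifier()
mbc.train(["the", "DET"], "NP-start")
mbc.train(["runs", "VB"], "VP-start")
label = mbc.classify(["the", "DET"])   # nearest stored instance wins
```

Note how this matches the "simple classes" remark above: the labels here are atomic symbols with no internal structure, which is exactly what makes the standard setup insufficient for whole parse trees.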
    <Paragraph position="2"> The use of a memory-based approach for parsing implies that parsing needs to be redefined as a classification task. There are two fundamentally different possible approaches. The first is to split parsing into separate subtasks, i.e. to use one classifier for each functional category and for each level in a recursive structure. Since the classifiers for the functional categories, as well as their individual decisions, are independent, multiple or no candidates for a specific grammatical function, or constituents with several possible functions, may be found, so that an additional classifier is needed to select the most appropriate assignment (cf. [6]).</Paragraph>
    <Paragraph position="3"> The second approach, which we have chosen, is to regard the complete parse trees as classes so that the task is defined as the selection of the most similar tree from the instance base. Since in</Paragraph>
  </Section>
  <Section position="7" start_page="3" end_page="3" type="metho">
    <SectionTitle>
BE
</SectionTitle>
    <Paragraph position="0"> All trees in this contribution follow the data format for trees defined by the NEGRA project of the Sonderforschungsbereich 378 at the University of the Saarland, Saarbrücken. They were printed by the NEGRA annotation tool [5].</Paragraph>
  </Section>
  <Section position="8" start_page="3" end_page="4" type="metho">
    <SectionTitle>
BF
</SectionTitle>
    <Paragraph position="0"> Memory-based learning has recently been applied to a variety of NLP classification tasks, including part-of-speech tagging, noun phrase chunking, grapheme-phoneme conversion, word sense disambiguation, and PP attachment (see [9], [14], [15] for details).</Paragraph>
    <Paragraph position="1"> construct tree(chunk list, treebank):
  while (chunk list is not empty) do
    remove first chunk from chunk list
    process chunk(chunk, treebank)</Paragraph>
    <Paragraph position="3"> if (tree is not empty)                      direct hit,
  then output(tree)                             i.e. complete chunk found in treebank
  else tree := partial match(words, treebank)
    if (tree is not empty)
    then if (tree = postfix of chunk)</Paragraph>
    <Paragraph position="5"> if ((chunk - tree) is not empty)            chunk might consist of both chunks (s.a.)
  then process chunk(chunk - tree, treebank)    i.e. process remaining chunk
  else back off to POS sequence</Paragraph>
    <Paragraph position="7"> else back off to subchunks
  while (chunk is not empty) do
    remove first subchunk c1 from chunk
    process chunk(c1, treebank)

this case, the internal structure of the item to be classified (i.e. the input sentence) and of the class item (i.e. the most similar tree in the instance base) need to be considered, the classification task is much more complex, and the standard memory-based approach needs to be adapted to the requirements of the parsing task. The features TüSBL uses for classification are the sequence of words in the input sentence, their respective POS tags, and (to a lesser degree) the labels in the chunk parse. Since word order is important for choosing the most similar tree, the algorithm was modified to rely on sequential information rather than on a bag-of-words approach.</Paragraph>
    <Paragraph position="8"> Another modification was necessitated by the need to generalize from the limited number of trees in the instance base. The classification is simple only in those cases where a direct hit is found, i.e. where a complete match of the input with a stored instance exists. In all other cases, the most similar tree from the instance base needs to be modified to match the chunked input.</Paragraph>
    <Paragraph position="9"> If these strategies for matching complete trees fail, T&amp;quot;uSBL attempts to match smaller subchunks in order to preserve the quality of the annotations rather than attempt to pursue only complete parses.</Paragraph>
    <Paragraph position="10"> The algorithm used for tree construction is presented in a slightly simplified form in Figs. 3-6. For readability's sake, we assume here that chunks and complete trees share the same data structure so that subroutines like string yield can operate on both of them indiscriminately.</Paragraph>
    <Paragraph position="11"> The main routine construct tree in Fig. 3 traverses the list of input chunks and passes each one to the subroutine process chunk in Fig. 4, where the chunk is then turned into one or more (partial) trees. process chunk first checks whether a complete match with an instance from the instance base is possible.</Paragraph>
    <Paragraph position="12">  If this is not the case, a partial match on the lexical level is attempted. If a partial tree is found, attach next chunk in Fig. 5 and extend tree in Fig. 6 are used to extend the tree by either attaching one more chunk or by resorting to a comparison of the missing parts of the chunk with tree extensions on the POS level. attach next chunk is necessary to ensure that the best possible tree is found even in the rare case that the original segmentation into chunks contains mistakes. If no partial tree is found, the tree construction backs off to finding a complete match in the POS level or to starting the subroutine for processing a chunk recursively with all the subchunks of the present chunk. The application of memory-based techniques is implemented in the two subroutines complete match and partial match. The presentation of the two cases as two separate subroutines is for expository purposes only. In the actual implementation, the search is carried out only once. The two subroutines exist because of</Paragraph>
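The control flow just described can be rendered as a runnable sketch. Here complete match and partial match are modeled as exact and longest-prefix lookups in a toy instance base keyed by word or POS sequences, and subchunks are modeled as single words; the real system matches, shortens, and extends stored trees via attach next chunk and extend tree, so everything below is a simplification under those assumptions.

```python
# Simplified rendering of the tree-construction backoff cascade:
# complete match on words, partial match, POS backoff, subchunk backoff.
# A "chunk" is a list of (word, pos) pairs; the treebank is a dict
# mapping word or POS tuples to (placeholder) trees.

def string_yield(chunk):
    return tuple(word for word, pos in chunk)

def pos_yield(chunk):
    return tuple(pos for word, pos in chunk)

def process_chunk(chunk, treebank, output):
    words = string_yield(chunk)
    if words in treebank:                      # direct hit: complete match
        output.append(treebank[words])
        return
    # partial match: longest stored prefix of the word sequence
    for n in range(len(words) - 1, 0, -1):
        if words[:n] in treebank:
            output.append(treebank[words[:n]])
            process_chunk(chunk[n:], treebank, output)  # remaining chunk
            return
    if pos_yield(chunk) in treebank:           # back off to POS sequence
        output.append(treebank[pos_yield(chunk)])
        return
    if len(chunk) == 1:                        # nothing matched: emit as-is
        output.append(chunk[0])
        return
    for sub in chunk:                          # back off to subchunks
        process_chunk([sub], treebank, output)

def construct_tree(chunk_list, treebank):
    output = []
    for chunk in chunk_list:
        process_chunk(chunk, treebank, output)
    return output

demo = construct_tree([[("ist", "VAFIN"), ("das", "PPER")]],
                      {("ist", "das"): "(S ist das)"})
```

The ordering of the branches is the point of the sketch: each fallback is tried only when the more reliable strategy above it fails, which is what lets the system prefer analyses with a higher likelihood of being correct.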
  </Section>
  <Section position="9" start_page="4" end_page="4" type="metho">
    <SectionTitle>
BG
</SectionTitle>
    <Paragraph position="0"> string yield returns the sequence of words included in the input structure, pos yield the sequence of POS tags.</Paragraph>
    <Paragraph position="1"> attach next chunk(tree, treebank):            attempts to attach the next chunk to the tree
  take first chunk chunk2 from chunk list</Paragraph>
    <Paragraph position="3"/>
    <Paragraph position="5"> if ((tree2 is not empty) and (subtree(tree, tree2)))

the postprocessing of the chosen tree, which is necessary for partial matches and which also deviates from standard memory-based applications. Postprocessing mainly consists of shortening the tree from the instance base so that it covers only those parts of the chunk that could be matched. If, however, the match is done on the lexical level, tagging errors can be corrected when there is enough evidence in the instance base. TüSBL currently uses an overlap metric, the most basic metric for instances with symbolic features, as its similarity metric. This overlap metric is based on either lexical or POS features. Instead of applying a more sophisticated metric such as the weighted overlap metric, TüSBL uses a backing-off approach that heavily favors similarity of the input to pre-stored instances on the basis of substring identity. Splitting the classification and adaptation process into different stages allows TüSBL to prefer analyses with a higher likelihood of being correct. This strategy enables correction of tagging and segmentation errors that may occur in the chunked input.</Paragraph>
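The backing-off similarity just described can be sketched as follows: candidate instances are ranked first by lexical substring identity, then by POS substring identity, with the plain overlap count as the base metric. The exact ranking scheme and the data layout are assumptions for illustration, not TüSBL's actual implementation.

```python
# Sketch of a backing-off similarity: lexical substring identity
# outranks POS substring identity, which outranks raw overlap.
# An instance base entry is a (words, pos_tags, tree) triple.

def overlap(a, b):
    """Basic overlap metric for symbolic feature sequences."""
    return sum(1 for x, y in zip(a, b) if x == y)

def shared_prefix_len(a, b):
    """Length of the common prefix of two sequences (substring identity)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def most_similar(input_words, input_pos, instance_base):
    def key(inst):
        words, pos, _ = inst
        # tuple comparison realizes the backoff ordering
        return (shared_prefix_len(words, input_words),
                shared_prefix_len(pos, input_pos),
                overlap(words, input_words))
    return max(instance_base, key=key)[2]

base = [
    (("ist", "das"), ("VAFIN", "PPER"), "tree_ist_das"),
    (("hatten", "die", "gesagt"), ("VAFIN", "PPER", "VVPP"), "tree_pos"),
]
best = most_similar(("ist", "das", "vereinbart"),
                    ("VAFIN", "PPER", "VVPP"), base)
```

In the demo, the second stored instance has the longer POS overlap, but the first wins because lexical substring identity is favored, mirroring the preference for word-level matches described above.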
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
4.1 Example
</SectionTitle>
      <Paragraph position="0"> dann w&amp;quot;urde ich sagen and ist das vereinbart. A look-up in the instance base finds a direct hit for the first clause. Therefore, the correct tree can be output directly. For the second clause, only a partial match on the level of words can be found. The system finds the tree for the subsequence of words ist das, as shown in Fig. 8.</Paragraph>
      <Paragraph position="1"> By backing off to a comparison on the POS level, it finds a tree for the sentence hatten die gesagt (they had said) with the same POS sequence and the same structure for the first two words. Thus the original tree that covers only two words is extended via the newly</Paragraph>
    </Section>
  </Section>
</Paper>