<?xml version="1.0" standalone="yes"?> <Paper uid="P06-3004"> <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. Annotation Schemes and their Influence on Parsing Results</Title> <Section position="4" start_page="19" end_page="19" type="metho"> <SectionTitle> 5 presents a conclusion and plans for future work. 2 The Treebanks: TüBa-D/Z and NeGra </SectionTitle> <Paragraph position="0"> With respect to treebanks, German is in a privileged position. Various treebanks are available, among them two similar ones: NeGra (Skut et al., 1997), from Saarland University at Saarbrücken, and TüBa-D/Z (Telljohann et al., 2003), from the University of Tübingen. NeGra contains about 20,000 sentences and TüBa-D/Z about 15,000; both consist of newspaper text. In both treebanks, predicate-argument structure is annotated, and the core principle of the annotation is theory independence. Terminal nodes are labeled with part-of-speech tags and morphological labels, non-terminal nodes with phrase labels. All edges are labeled with grammatical functions. Both treebanks were annotated semi-automatically with the same software tools.</Paragraph> <Paragraph position="1"> The main difference between the treebanks is rooted in the partially free word order of German sentences: the positions of complements and adjuncts are highly variable. This leads to a high number of discontinuous constituents, even in short sentences, and an annotation scheme for German must account for them. NeGra allows crossing branches, thereby giving up the context-free backbone of the annotation. With crossing branches, discontinuous constituents are no longer a problem: all children of every constituent, discontinuous or not, can always be grouped under the same node. The drawback of this method is that the crossing branches must be resolved before the treebank can be used with a (PCFG) parser. 
However, this can be accomplished easily by reattaching children of discontinuous constituents to higher nodes.</Paragraph> <Paragraph position="2"> TüBa-D/Z uses another mechanism to account for the free word order. Above the phrase level, an additional layer of annotation is introduced, consisting of topological fields (Drach, 1937; Höhle, 1986). The concept of topological fields is widely accepted among German grammarians. It reflects the empirical observation that German has three possible sentence configurations with respect to the position of the finite verb. The model has five fields (initial field, left sentence bracket, middle field, right sentence bracket, final field): verbal material generally resides in the two sentence brackets, while the initial field and the middle field contain all other elements. The final field contains mostly extraposed material. Since word order variations generally do not cross field boundaries, the model of topological fields accounts for the free word order of German in a natural way.</Paragraph> <Paragraph position="3"> On the phrase level, the treebanks also differ greatly. NeGra does not allow any intermediate (&quot;bar&quot;) phrasal projections. Additionally, no unary productions are allowed. This results in very flat phrases: pre- and postmodifiers are attached directly to the phrase, nominal subjects are attached directly to the sentence, nominal material within PPs does not project to NPs, and complex (non-coordinated) NPs remain flat. TüBa-D/Z, by contrast, allows for &quot;deep&quot; annotation. 
Intermediate productions and unary productions are allowed and extensively used.</Paragraph> <Paragraph position="4"> To illustrate the annotation principles, Figures 1 and 2 show the annotation of the sentences</Paragraph> </Section> <Section position="5" start_page="19" end_page="19" type="metho"> <SectionTitle> 3 Treebanks, Parsing, and Comparisons </SectionTitle> <Paragraph position="0"> Our goal is to determine how the individual components of the annotation schemes of TüBa-D/Z and NeGra influence parsing results. A direct comparison of the parsing results shows that the TüBa-D/Z annotation scheme is more appropriate for PCFG parsing than NeGra's (see Tables 2 and 3). However, this does not tell us anything about the role of the subparts of the annotation schemes.</Paragraph> <Paragraph position="1"> A first idea for a more detailed comparison could be to compare the results for different phrase types. However, this would not give meaningful results. NeGra noun phrases, for example, cover a different set of constituents than TüBa-D/Z noun phrases, due to NeGra's flat annotation and its avoidance of unary NPs. Furthermore, each annotation scheme contains categories that the other lacks. For example, no categories in NeGra correspond to TüBa-D/Z's field categories, while TüBa-D/Z has no categories equivalent to NeGra's categories for coordinated phrases or verb phrases.</Paragraph> <Paragraph position="2"> We therefore pursue another approach. We use a method introduced by Kübler (2005) to investigate the usefulness of different annotation components for parsing. We gradually modify the treebank annotations so that the annotation styles of the two treebanks approximate one another. This is accomplished by removing or inserting certain components of the annotation. 
For our treebanks, this generally results in reduced structures for TüBa-D/Z and augmented structures for NeGra. Table 1 presents three measures that capture the changes introduced by each modification. The average number of child nodes of non-terminal nodes shows the degree of flatness of the annotation on the phrase level. As expected, the unmodified NeGra shows the highest values here.</Paragraph> <Paragraph position="3"> The average tree height relates directly to the number of hierarchy levels in the annotation. Here, the unmodified TüBa-D/Z has the highest values.</Paragraph> </Section> </Paper>