<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-3004">
  <Title>Annotation Schemes and their Influence on Parsing Results</Title>
  <Section position="6" start_page="19" end_page="23" type="evalu">
    <SectionTitle>
4 Experimental Setup
</SectionTitle>
    <Paragraph position="0"> For our experiments, we use lopar (Schmid, 2000), a standard PCFG parser. We read the grammar and the lexicon directly off the trees together with their frequencies. The parser is given the gold POS tagging to avoid parsing errors that are caused by wrong POS tags. Only sentences up to a length of 40 words are considered due to memory limitations.</Paragraph>
    <Paragraph position="1"> Traditionally, most of the work on WSJ uses the same section of the treebank for testing. However, for our aims, this method has a shortcoming: since both treebanks consist of text created by different authors, linguistic phenomena are not evenly distributed over the treebank. When using a whole section as test set, some phenomena may only occur there and thus not occur in the grammar. To reduce data sparseness, we use another test/training-set split for the treebanks and their variations. Each 10th sentence is put into the test set, all other sentences go into the training set.</Paragraph>
    <Section position="1" start_page="19" end_page="21" type="sub_section">
      <SectionTitle>
4.1 Preprocessing the Treebanks
</SectionTitle>
      <Paragraph position="0"> Since we want to read the grammars for our parser directly off the treebanks, preprocessing of the treebanks is necessary due to the non-context-free nature of the original annotation. In both treebanks, punctuation is not included in the trees, furthermore, sentence splitting in both treebanks does not always coincide with the linguistic notion of a sentence. This leads to sentences consisting of several unconnected trees. All nodes in a sentence, i.e. the roots and the punctation, are grouped by a virtual root node, which may cause crossing branches. Furthermore, the NeGra annotation scheme allows for crossing branches for linguistic reasons, as described in section 2. All of the crossing branches have to be removed before parsing.</Paragraph>
      <Paragraph position="1"> The crossing branches caused by the NeGra annotation scheme are removed with a small program by Thorsten Brants. It attaches some of the children of discontinuous constituents to higher nodes. The virtual root node is made continuous by attaching all punctuation to the highest possible location in the tree. Pairs of parenthesis and quotation marks are preferably attached to  the same node, to avoid low-frequent productions in the grammar that only differ by the position of parenthesis marks on their right hand side.</Paragraph>
    </Section>
    <Section position="2" start_page="21" end_page="23" type="sub_section">
      <SectionTitle>
4.2 Results of the Comparison
</SectionTitle>
      <Paragraph position="0"> We use the standard parseval measures for the evaluation of parser output. They measure the percentage of correctly parsed constituents, in terms of precision, recall, and F-Measure. The parser output of each modified treebank version is evaluated against the correspondingly modified test set. Unparsed sentences are fully included in the evaluation. null NeGra. Along with the unmodified treebank, two modifications of NeGra are tested. Both of them introduce annotation components present in T&amp;quot;uBa-D/Z but not in NeGra. In the first one, NE fi, we add an annotation layer of topological fields2, as existing in T&amp;quot;uBa-D/Z. The precision value benefits the most from this modification.</Paragraph>
      <Paragraph position="1"> When parsing without grammatical functions, it increases about 6,5%. When parsing with grammatical functions, it increases about 14%. Thus, the additional rules provided by a topological field level that groups phrases below the clausal level are favorable for parsing. The average number of crossing brackets per sentence increases, which is due to the fact that there are simply more brackets to create.</Paragraph>
      <Paragraph position="2"> A detailed evaluation of the results for node categories shows that the new field categories are easy to recognize (e.g. LF gets 97.79 F-Measure).</Paragraph>
      <Paragraph position="3"> Nearly all categories have a better precision value. However, the F-Measure for VPs is low (only 26.70 while 59.41 in the unmodified treebank), while verb phrases in the unmodified T&amp;quot;uBa-D/Z (see below) are recognized with nearly 100 points F-Measure. The problem here is the following. In the original NeGra annotation, a verb and its complements are grouped under the same VP. To pre1explanation: N/T = node/token ratio, u D/N = average number of daughters of non-terminal nodes, uH(T) = average tree height 2We are grateful to the DFKI Saarbr&amp;quot;ucken for providing us with the topological field annotation.</Paragraph>
      <Paragraph position="4"> serve as much of the annotation as possible, the topological fields are inserted below the VP (complements are grouped by a middle field node, the verb complex by the right sentence bracket). Since this way, the phrase node VP resides above the field level, it becomes difficult to recognize.</Paragraph>
      <Paragraph position="5"> In the second modification, NE NP, we approximate NeGra's PPs to T&amp;quot;uBa-D/Z's by grouping all nominal material below the PPs to separate NPs. This modification gives us a small benefit in terms of precision and recall (about 2-3%). Although there are more brackets to place, the number of crossing parents increases only slightly, which can be attributed to the fact that below PPs, there is no room to get brackets wrong.</Paragraph>
      <Paragraph position="6"> We finally parse a version of NeGra where for each node movement during the resolution of crossing edges, a trace label was created in the corresponding edge (NE tr). Although this brings the treebank closer to the format of T&amp;quot;uBa-D/Z, the results get even worse than in the version without traces. However, the high number of unparsed sentences indicates that the result is not reliable due to data sparseness.</Paragraph>
      <Paragraph position="7"> NeGra NE fi. NE NP NE tr.</Paragraph>
      <Paragraph position="8"> without grammatical functions cross. br. 1.10 1.67 1.14 -lab. prec. 68.14% 74.96% 70.43% -lab. rec. 69.98% 70.37% 72.81% -lab. F1 69.05 72.59 71.60 -not parsed 1.00% 0.10% 0.15% -with grammatical functions cross. br. 1.10 1.21 1.27 1.05 lab. prec. 52.67% 67.90% 59.77% 51.81% lab. rec. 52.17% 65.18% 60.36% 49.19% lab. F1 52.42 66.51 60.06 50.47 not parsed 12.90% 1.66% 9.88% 16.01%  T&amp;quot;uBa-D/Z. Apart from the original treebank, we test six modifications of T&amp;quot;uBa-D/Z. In each of the modifications, annotation material is removed in order to obtain NeGra-like structures. Since they are equally absent in NeGra, we delete the annotation of topological fields in the first modification, T&amp;quot;u NF. This results in small losses.  A closer look at category results shows that losses are mainly due to categories on the clausal level; structures within fields do not deteriorate. Field categories are thus especially helpful for the clausal level.</Paragraph>
      <Paragraph position="9"> In the second modification of T&amp;quot;uBa-D/Z, T&amp;quot;u NU, unary nodes are collapsed with the goal to get structures comparable to NeGra's. As the figures show, the unary nodes are very helpful, the F-Measure drops about 6 points without them. The number of crossing brackets also drops, along with the total number of nodes. When parsing with grammatical functions, taking out unary productions has a detrimental effect, F-Measure drops about 13 points. A plausible explanation could be data sparseness. 32.78% of the rules that the parser needs to produce a correct parse don't occur in the training set.</Paragraph>
      <Paragraph position="10"> An evaluation of the results for the different categories shows that all major phrase categories loose both in precision and recall. Since field nodes are mostly unary, many of them disappear, but most of the middle field nodes stay because they generally contain more than one element.</Paragraph>
      <Paragraph position="11"> However, their recall drops about 10%. Supposedly it is more difficult for the parser to annotate the middle field &amp;quot;alone&amp;quot; without the other field categories. null We also test a version of T&amp;quot;uBa-D/Z with flattened phrases that mimic NeGra's flat phrases, T&amp;quot;u flat. With this treebank version, we get results very similar to those of the unmodified treebank. The F-Measure values are slightly higher and the parser produces less crossing brackets. A single category benefits the most from this treebank modification: EN-ADD, its F-Measure rising about 45 points. It was originally introduced as a marker for named entities, which means that it has no specific syntactic function. In the T&amp;quot;uBa-D/Z version with flattened phrases, many of the nominal nodes below EN-ADD are taken out, bringing EN-ADD closer to the lexical level. This way, the category has more meaningful context and therefore produces better results.</Paragraph>
      <Paragraph position="12"> Furthermore, we test combinations of the modifications. Apart from the average tree height, the dimensions of T&amp;quot;uBa-D/Z with flattened phrases and without unary productions (T&amp;quot;u f NU) resemble those of the unmodified NeGra treebank, which indicates their similarity. Nevertheless, parser results are worse on NeGra. This indicates that T&amp;quot;uBa-D/Z still benefits from the remaining field nodes. The number of crossing branches is the lowest in this treebank version.</Paragraph>
      <Paragraph position="13"> In the last modification that combines all modifications made before (T &amp;quot;U f NU NF), as expected, all values drop dramatically. F-Measure is about 5 points worse than with the unmodified NeGra treebank.</Paragraph>
      <Paragraph position="14"> POS tagging. In a second round, we investigate the benefits that gold POS tags have when making them available in the parser input. We repeat all experiments without giving the parser the perfect tagging.</Paragraph>
      <Paragraph position="15"> This leads to higher time and space requirements during parsing, caused by the additional tagging step. With T&amp;quot;uBa-D/Z, NeGra, and all their modifications, the F-Measure results are about 35 points worse when parsing with grammatical functions. When parsing without them, they drop 3-6 points. We can determine two exceptions: T&amp;quot;uBa-D/Z with flattened phrases, where the F-Score drops more than 9 points when parsing with grammatical functions, and the T&amp;quot;uBa-D/Z version with all modifications combined, where F-Score drops only a little less than 2 points. The behavior  of the flattened T&amp;quot;uBa-D/Z relates directly to the fact that the categories that loose the most without gold POS tags are phrase categories (particularly infinite VPs and APs). They are directly conditioned on the POS tagging and thus behave accordingly to its quality. For the T&amp;quot;uBa-D/Z version with all modifications combined, one could argue that the results are not reliable because of data sparseness, which is confirmed by the high number of unparsed sentences in this treebank version. However, in all cases, less crossing brackets are produced.</Paragraph>
      <Paragraph position="16"> To sum up, obviously, it is more difficult for the parser to build a parse tree onto an already existing layer of POS-tagging. This explains the bigger number of unparsed sentences. Nevertheless, in terms of F-Score, the parsing results profit visibly from the gold POS tagging.</Paragraph>
    </Section>
  </Section>
</Paper>