<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1041">
  <Title>High Precision Treebanking Blazing Useful Trees Using POS Information</Title>
  <Section position="4" start_page="330" end_page="331" type="metho">
    <SectionTitle>
2 kāten 'curtain'
</SectionTitle>
    <Paragraph position="0"> The semantic view shows some ambiguity has been resolved that is not visible in the purely syntactic view. In Japanese, relative clauses can have gapped and non-gapped readings. In the gapped reading (selected here), a17 mono thing is the sub-ject of a18a20a19 kakusu hide . In the non-gapped reading there is some unspeci ed relation between the thing and the verb phrase. This is similar to the difference in the two readings of the day he knew in English: the day that he knew about (gapped) vs the day on which he knew (something) (non-gapped).</Paragraph>
    <Paragraph position="1">  Such semantic ambiguity is resolved by selecting the correct derivation tree that includes the applied rules in building the tree, as shown in Figure 3. In the next phase of the Hinoki project, we are concentrating on acquiring an ontology from these semantic representations and using it to improve the parse selection (Bond et al., 2004).</Paragraph>
  </Section>
  <Section position="5" start_page="331" end_page="332" type="metho">
    <SectionTitle>
3 Treebanking Using Discriminants
</SectionTitle>
    <Paragraph position="0"> Selection among analyses in our set-up is done through a choice of elementary discriminants, basic and mostly independent contrasts between parses.</Paragraph>
    <Paragraph position="1"> These are (relatively) easy to judge by annotators.</Paragraph>
    <Paragraph position="2"> The system selects features that distinguish between different parses, and the annotator selects or rejects the features until only one parse is left. In a small number of cases, annotation may legitimately leave more than one parse active (see below). The system we used for treebanking was the [incr tsdb()] Redwoods environment2 (Oepen et al., 2002). The number of decisions for each sentence is proportional to the log of the number of parses. The number of decisions required depends on the ambiguity of the parses and the length of the input. For Hinoki, on average, the number of decisions presented to the annotator was 27.5. However, the average number of decisions needed to disambiguate each sentence was only 2.6, plus an additional decision to accept or reject the selected parses3. In general, even a sentence with 100 parses requires only around 5 decisions and 1,000 parses only around 7 decisions. A graph of parse results versus number of decisions presented and required is given in Figure 6.</Paragraph>
    <Paragraph position="3"> The primary data stored in the treebank is the derivation tree: the series of rules and lexical items the parser used to construct the parse. This, along with the grammar, can be combined to rebuild the complete HPSG sign. The annotators task is to select the appropriate derivation tree or trees. The possible derivation trees for a0a21a1a21a3a22a4 2 kflaten curtain are shown in Figure 3. Nodes in the trees indicate applied rules, simpli ed lexical types or words. We  ones, which only require a decision as to whether to accept or reject.</Paragraph>
    <Paragraph position="4"> will use it as an example to explain the annotation process. Figure 3 also displays POS tag from a separate tagger, shown in typewriter font.4 This example has two major sources of ambiguity.</Paragraph>
    <Paragraph position="5"> One is lexical: aru a certain/have/be is ambiguous between a reading as a determiner a certain (det-lex) and its use as a verb of possession have (aru-verb-lex). If it is a verb, this gives rise to further structural ambiguity in the relative clause, as discussed in Section 2. Reliable POS tags can thus resolve some ambiguity, although not all.</Paragraph>
    <Paragraph position="6"> Overall, this ve-word sentence has 6 parses. The annotator does not have to examine every tree but is instead presented with a range of 9 discriminants, as shown in Figure 4, each local to some segment of the utterance (word or phrase) and thus presenting a contrast that can be judged in isolation. Here the rst column shows deduced status of discriminants (typically toggling one discriminant will rule out others), the second actual decisions, the third the discriminating rule or lexical type, the fourth the constituent spanned (with a marker showing segmentation of daughters, where it is unambiguous), and the fth the parse trees which include the rule or lexical type.</Paragraph>
    <Paragraph position="7">  lected). D : deduced decisions, A : actual decisions After selecting a discriminant, the system recalculates the discriminant set. Those discriminants which can be deduced to be incompatible with the decisions are marked with ' ', and this information is recorded. The tool then presents to the annotator  only those discriminants which still select between the remaining parses, marked with '?'.</Paragraph>
    <Paragraph position="8"> In this case the desired parse can be selected with a minimum of two decisions. If the rst decision is that a45a47a46 aru is a determiner (det-lex), it eliminates four parses, leaving only three discriminants (corresponding to trees #1 and #2 in Figure 3) to be decided on in the second round of decisions. Selecting a17 mono thing as the gapped subject of a18a48a19 kakusu hide (rel-cl-sbj-gap) resolves the parse forest to the single correct derivation tree #1 in Figure 3.</Paragraph>
    <Paragraph position="9"> The annotator also has the option of leaving some ambiguity in the treebank. For example, the verbal noun a49 a1a16a50a44a4 flopun open is de ned with the single word a51a53a52 aku/hiraku open . This word however, has two readings: aku which is intransitive and hiraku which is transitive. As a49 a1a16a50a44a4 flopun open can be either transitive or intransitive, both parses are in fact correct! In such cases, the annotators were instructed to leave both parses.</Paragraph>
    <Paragraph position="10"> Finally, the annotator has the option of rejecting all the parses presented, if none have the correct syntax and semantics. This decision has to be made even for sentences with a unique parse.</Paragraph>
  </Section>
  <Section position="6" start_page="332" end_page="333" type="metho">
    <SectionTitle>
4 Using POS Tags to Blaze the Trees
</SectionTitle>
    <Paragraph position="0"> Sentences in the Lexeed dictionary were already part-of-speech tagged so we investigated exploiting this information to reduce the number of decisions the annotators had to make. More generally, there are many large corpora with a subset of the information we desire already available. For example, the Kyoto Corpus (Kurohashi and Nagao, 2003) has part of speech information and dependency information, but not the detailed information available from an HPSG analysis. However, the existing information can be used to blaze5 trees in the parse forest: that is to select or reject certain discriminants based on existing information.</Paragraph>
    <Paragraph position="1"> Because other sources of information may not be entirely reliable, or the granularity of the information may be different from the granularity in our 5In forestry, to blaze is to mark a tree, usually by painting and/or cutting the bark, indicating those to be cut or the course of a boundary, road, or trail.</Paragraph>
    <Paragraph position="2">  treebank, we felt it was important that the blazes be defeasible. The annotator can always reject the blazed decisions and retag the sentence.</Paragraph>
    <Paragraph position="3"> In [incr tsdb()], it is currently possible to blaze using POS information. The criteria for the blazing depend on both the grammar used to make the treebank and the POS tag set. The system matches the tagged POS against the grammar's lexical hierarchy, using a one-to-many mapping of parts of speech to types of the grammar and a subsumption-based comparison.</Paragraph>
    <Paragraph position="4"> It is thus possible to write very general rules. Blazes can be positive to accept a discriminant or negative to reject it. The blaze markers are de ned to be a POS tag, and then a list of lexical types and a score.</Paragraph>
    <Paragraph position="5"> The polarity of the score determines whether to accept or reject. The numerical value allows the use of a threshold, so that only those markers whose absolute value is greater than a threshold will be used. The threshold is currently set to zero: all blaze markers are used.</Paragraph>
    <Paragraph position="6"> Due to the nature of discriminants, having two positively marked but competing discriminants for the same word will result in no trees satisfying the conditions. Therefore, it is important that only negative discriminants should be used for more general lexical types.</Paragraph>
    <Paragraph position="7"> Hinoki uses 13 blaze markers at present, a simpli ed representation of them is shown in Figure 5. E.g. if hverb-aux, v-stem-lex, -1.0i was a blaze marker, then any sentence with a verb that has two non-auxiliary entries (e.g. hiraku/aku vt and vi) would be eliminated. The blaze set was derived from a conservative inspection of around 1,000 trees from an earlier round of annotation of similar data, identifying high-frequency contrasts in lexical ambiguity that can be con dently blazed from the POS granularity available for Lexeed.</Paragraph>
    <Paragraph position="8">  For the example shown in Figures 3 and 4, the blaze markers use the POS tagging of the determiner a45a54a46 aru to mark it as det-lex. This eliminates four parses and six discriminants leaving only three to be presented to the annotator. On average, marking blazes reduced the average number of blazes presented per sentence from 27.5 to 23.8 (a reduction of 15.6%). A graphical view of number of discriminants versus parse ambiguity is shown in Figure 6.</Paragraph>
  </Section>
  <Section position="7" start_page="333" end_page="335" type="metho">
    <SectionTitle>
5 Measuring Inter-Annotator Agreement
</SectionTitle>
    <Paragraph position="0"> Lacking a task-oriented evaluation scenario at this point, inter-annotator agreement is our core measure of annotation consistency in Hinoki. All trees (and associated semantics) in Hinoki are derived from a computational grammar and thus should be expected to demonstrate a basic degree of internal consistency. On the other hand, the use of the grammar exposes large amounts of ambiguity to annotators that might otherwise go unnoticed. It is therefore not a priori clear whether the Redwoods-style approach to treebank construction as a general methodology results in a high degree of internal consistency or a comparatively low one.</Paragraph>
    <Paragraph position="1"> a b b g g a Average  terms of the harshest possible measure, the proportion of sentences for which two annotators selected the exact same parse or both decided to reject all available parses. Each set was annotated by three annotators (a, b, g). They were all native speakers of Japanese with a high score in a Japanese pro ciency test (Amano and Kondo, 1998) but no linguistic training. The average annotation speed was 50 sentences an hour.</Paragraph>
    <Paragraph position="2"> In around 19 per cent of the cases annotators chose to not fully disambiguate, keeping two or even three active parses; for these we scored ij , with j being the number of identical pairs in the cross-product of active parses, and i the number of mismatches.</Paragraph>
    <Paragraph position="3"> One annotator keeping f1, 2, 3g, for example, and another f3, 4g would be scored as 16. In addition to  leaving residual ambiguity, annotators opted to reject all available parses in some eight per cent of cases, usually indicating opportunities for improvement of the underlying grammar. The Parse Agreement gures (65.4%) in Table 1 are those sentences where both annotators chose one or more parses, and they showed non-zero agreement. This gure is substantially above the published gure of 52% for NeGra Brants et al. (2003). Parse Disagreement is where both chose parses, but there was no agreement. Reject Agreement shows the proportion of sentences for which both annotators found no suitable analysis. Finally Reject Disagreement is those cases were one annotator found no suitable parses, but one selected one or more.</Paragraph>
    <Paragraph position="4"> The striking contrast between the comparatively high exact match ratios (over a random choice base-line of below seven per cent; k = 0.628) and the low agreement between annotators on which structures to reject completely suggests that the latter type of decision requires better guidelines, ideally tests that can be operationalized.</Paragraph>
    <Paragraph position="5"> To obtain both a more ne-grained measure and also be able to compare to related work, we computed a labeled precision f-score over derivation trees. Note that our inventory of labels is large, as they correspond in granularity to structures of the grammar: close to 1,000 lexical and 120 phrase types. As there is no 'gold' standard in contrasting two annotations, our labeled constituent measure F is the harmonic mean of standard labeled precision P (Black et al., 1991; Civit et al., 2003) applied in both 'directions': for a pair of annotators a and b, F is de ned as: F = 2P(a, b)P(b, a)P(a, b) + P(b, a) As found in the discussion of exact match inter-annotator agreement over the entire treebank, there are two fundamentally distinct types of decisions made by annotators, viz. (a) elimination of unwanted ambiguity and (b) the choice of keeping at least one analysis or rejecting the entire item. Of these, only (b) applies to items that are assigned only one parse by the grammar, hence we omit unambiguous items from our labeled precision measures (a little more than twenty per cent of the total) to exclude trivial agreement from the comparison. In the same spirit, to eliminate noise hidden in pairs of items where one or both annotators opted for multiple valid parses, we further reduced the comparison set to those pairs where both annotators opted for exactly one active parse. Intersecting both conditions for pairs of annotators leaves us with subsets of around 2,500 sentences each, for which we record F values ranging from 95.1 to 97.4, see Table 2. When broken down by pairs of annotators and sets of 1,000 items each, which have been annotated in strict sequential order, F scores in Table 2 con rm that: (a) inter-annotator agreement is stable, all three annotators appear to have performed equally (well); (b) with growing experience, there is a slight increase in F scores over time, particularly when taking into account that set E exhibits a noticeably higher average ambiguity rate (1208 parses per item) than set D (820 average parses); and (c) Hinoki inter-annotator agreement compares favorably to results reported for the German NeGra (Brants, 2000) and Spanish Cast3LB (Civit et al., 2003) treebanks, both of which used manual mark-up seeded from automated POS tagging and chunking.</Paragraph>
    <Paragraph position="6"> Compared to the 92.43 per cent labeled F score reported by Brants (2000), Hinoki achieves an 'error' (i.e. disagreement) rate of less than half, even though our structures are richer in information and should probably be contrasted with the 'edge label' F score for NeGra, which is 88.53 per cent. At the same time, it is unknown to what extent results are in uenced by differences in text genre, i.e. average sentence length of our dictionary de nitions is noticeably shorter than for the NeGra newspaper corpus. In addition, our measure is computed only over a subset of the corpus (those trees that can be parsed and that had multiple parses which were not rejected). If we recalculate over all 5,000 sentences, including rejected sentences (F measure of 0) and those with no ambiguity (F measure of 1) then the average F measure is 83.5, slightly worse than the score for NeGra. However, the annotation process itself identi es which the problematic sentences are, and how to improve the agreement: improve the grammar so that fewer sentences need to be rejected and then update the annotation. The Hinoki treebank is, by design, dynamic, so we expect to continue to improve the grammar and annotation continuously over the project's lifetime.</Paragraph>
    <Section position="1" start_page="335" end_page="335" type="sub_section">
      <SectionTitle>
5.1 The Effects of Blazing
</SectionTitle>
      <Paragraph position="0"> Table 3 shows the number of decisions per annotator, including revisions, and the number of decisions that can be done automatically by the part-of-speech blazed markers. The test sets where the annotators used the blazes are shown underlined. The nal decision to accept or reject the parses was not included, as it must be made for every sentence.</Paragraph>
      <Paragraph position="1"> The blazed test sets require far fewer annotator decisions. In order to evaluate the effect of the blazes, we compared the average number of decisions per sentence for the test sets in which some annotators used blazes and some did not (B D). The average number of decisions went from 2.63 to 2.11, a substantial reduction of 19.5%. similarly, the time required to annotate an utterance was reduced from 83 seconds per sentence to 70, a speed up of 15.7%.</Paragraph>
      <Paragraph position="2"> We did not include A and E, as there was variation in dif culty between test sets, and it is well known that annotators improve (at least in speed of annotation) over time. Research on other projects has shown that it is normal for learning curve differences to swamp differences in tools (Wallis, 2003, p. 65). The number of decisions against the number of parses is show in Figure 6, both with and without the blazes.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="335" end_page="336" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> Annotators found the rejections the most time consuming. If a parse was eliminated, they often redid the decision process several times to be sure  they had not eliminated the correct parse in error, which was very time consuming. This shows that the most important consideration for the success of treebanking in this manner is the quality of the grammar. Fortunately, treebanking offers direct feed-back to the grammar developers. Rejected sentences identify which areas need to be improved, and because the treebank is dynamic, it can be improved when we improve the analyses in the grammar. This is a notable improvement over semi-automatically constructed grammars, such as the Penn Treebank, where many inconsistencies remain (around 4,500 types estimated by Dickinson and Meurers, 2003) and the treebank does not allow them to be identied automatically or easily updated.</Paragraph>
    <Paragraph position="1"> Because we are simultaneously using the semantic output of the grammar in building an ontology, and the syntax and semantics are tightly coupled, the knowledge acquisition provides a further route for feedback. Extracting an ontology from the semantic representations revealed many issues with the semantics that had previously been neglected.</Paragraph>
    <Paragraph position="2"> Our top priority for further work within Hinoki  is to improve the grammar so as to both increase the cover and decrease the number of results with no acceptable parses. This will allow us to treebank a higher proportion of sentences, with even higher precision.</Paragraph>
    <Paragraph position="3"> For more general work on treebank construction, we would like to investigate (1) using other information for blazes (syntactic constituents, dependencies, translation data) and marking blazes automatically using con dent scores from existing POS taggers or parsers, (2) other agreement measures (for example agreement over the semantic representations), (3) presenting discriminants based on the semantic representations.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML