<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1068"> <Title>Towards a Resource for Lexical Semantics: A Large German Corpus with Extensive Semantic Annotation</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Evaluation of Annotated Data </SectionTitle> <Paragraph position="0"> Materials. Compared to the pilot study we previously reported (Erk et al., 2003), in which 3 annotators tagged 440 corpus instances of a single frame, resulting in 1,320 annotation instances, we now dispose of a considerably larger body of data. It consists of 703 corpus instances for the two frames shown in Figure 1, making up a total of 4,653 annotation instances. For the frame REQUEST, we obtained 421 instances with 8-fold and 114 with 7-fold annotation. The annotated lemmas comprise auffordern (to request), fordern, verlangen (to demand), zur&quot;uckfordern (demand back), the noun Forderung (demand), and compound nouns ending with -forderung. For the frame C T we have 30, 40 and 98 instances with 5-, 3-, and 2-fold annotation respectively. The annotated lemmas are kaufen (to buy), erwerben (to acquire), verbrauchen (to consume), and verkaufen (to sell).</Paragraph> <Paragraph position="1"> Note that the corpora we are evaluating do not constitute a random sample: At the moment, we cover only two frames, and REQUEST seems to be relatively easy to annotate. Also, the annotation results may not be entirely predictive for larger sample sizes: While the annotation guidelines were being developed, we used REQUEST as a &quot;calibration&quot; frame to be annotated by everybody. As a result, in some cases reliability may be too low because detailed guidelines were not available, and in others it may be too high because controversial instances were discussed in project meetings.</Paragraph> <Paragraph position="2"> Results. The results in this section refer solely to the assignment of fully specified frames and frame elements. Underspecification is discussed at length and frame elements (below).</Paragraph> <Paragraph position="3"> in Section 6. Due to the limited space in this paper, we only address the question of inter-annotator agreement or annotation reliability, since a reliable annotation is necessary for all further corpus uses.</Paragraph> <Paragraph position="4"> Table 1 shows the inter-annotator agreement on frame assignment and on frame element assignment, computed for pairs of annotators. The &quot;average&quot; column shows the total agreement for all annotation instances, while &quot;best&quot; and &quot;worst&quot; show the figures for the (lemma-specific) subcorpora with highest and lowest agreement, respectively. The upper half of the table shows agreement on the assignment of frames to FEEs, for which we performed 14,410 pairwise comparisons, and the lower half shows agreement on assigned frame elements (29,889 pair-wise comparisons). Agreement on frame elements is &quot;exact match&quot;: both annotators have to tag exactly the same sequence of words. In sum, we found that annotators agreed very well on frames. Disagreement on frame elements was higher, in the range of 12-25%. Generally, the numbers indicated considerable differences between the subcorpora.</Paragraph> <Paragraph position="5"> To investigate this matter further, we computed the Alpha statistic (Krippendorff, 1980) for our annotation. Like the widely used Kappa, is a chance-corrected measure of reliability. 
<Paragraph position="6"> Figure 4 shows single-category reliabilities for the assignment of frame elements. The graphs show not only that target lemmas varied in their difficulty, but also that the reliability of frame element assignment varied considerably. Firstly, frames introduced by nouns (Forderung and -forderung) were more difficult to annotate than those introduced by verbs. Secondly, the frame elements fall into three groups: those that were always annotated reliably, those whose reliability depended strongly on the FEE, and those that were impossible to annotate reliably (the last group is not shown in the graphs). In the REQUEST frame, SPEAKER, MESSAGE and ADDRESSEE belong to the first group, at least for verbal FEEs. MEDIUM is a member of the second group, and TOPIC was annotated at chance level (α ≈ 0).</Paragraph>
<Paragraph position="7"> In the COMMERCE frame, only BUYER and GOODS consistently show high reliability. SELLER can be annotated reliably only for the target verkaufen. PURPOSE and REASON fall into the third group.</Paragraph>
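Single-category reliabilities such as those plotted in Figure 4 can be obtained by binarizing the data, distinguishing only the frame element of interest from everything else, and reusing the α sketch above. This is one standard way of computing per-category reliability; we do not know whether it is the paper's exact procedure, and the names below are again our own.

def per_category_alpha(units, category):
    # Collapse all labels other than the category of interest into a
    # single OTHER class, then score the binarized data with the
    # nominal alpha defined above.
    binarized = [[lab if lab == category else "OTHER" for lab in labels]
                 for labels in units]
    return krippendorff_alpha_nominal(binarized)

# Hypothetical use, with one units list per target lemma:
# for fe in ("SPEAKER", "MESSAGE", "ADDRESSEE", "MEDIUM", "TOPIC"):
#     print(fe, per_category_alpha(units_for_lemma, fe))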
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 5.1 Discussion </SectionTitle>
<Paragraph position="0"> Interpretation of the data. Inter-annotator agreement on the frames shown in Table 1 is very high.</Paragraph>
<Paragraph position="1"> However, the lemmas we have considered so far are only moderately ambiguous, and we might see lower figures for frame agreement for highly polysemous FEEs like laufen (to run).</Paragraph>
<Paragraph position="2"> For frame elements, inter-annotator agreement is not as high. Can we expect improvement? The Prague Treebank reported a disagreement of about 10% for manual thematic role assignment (Žabokrtský, 2000). However, in contrast to our study, they also annotated temporal and local modifiers, which are easier to mark than other roles. One factor that may improve frame element agreement in the future is the display of syntactic structure directly in the annotation tool. Annotators were instructed to assign each frame element to a single syntactic constituent whenever possible, but could only access syntactic structure in a separate viewer. We found that in 35% of pairwise frame element disagreements, one annotator assigned a single syntactic constituent and the other did not. Since a total of 95.6% of frame elements were assigned to single constituents, we expect an increase in agreement once a dedicated annotation tool is available. As to the pronounced differences in reliability between frame elements, we found that while most central frame elements like SPEAKER or BUYER were easy to identify, annotators found it harder to agree on less frequent frame elements like MEDIUM, whose particularly low agreement (α < 0.8) contributes to the low overall inter-annotator agreement for the COMMERCIAL_TRANSACTION frame. We suspect that annotators saw too few instances of these elements to build up a reliable intuition. However, the elements may also be inherently difficult to distinguish.</Paragraph>
<Paragraph position="3"> How can we interpret the differences in frame element agreement across target lemmas, especially between verb and noun targets? While frame elements for verbal targets are usually easy to identify on syntactic grounds, this is not the case for nouns. Figure 3 shows an example: should SPD be tagged as INTERLOCUTOR_2 in the CONVERSATION frame? This appears to be a question of pragmatics; here, clearer annotation guidelines would be desirable.</Paragraph>
<Paragraph position="4"> FrameNet as a resource for semantic role annotation. Above, we asked about the suitability of FrameNet for semantic role annotation, and our data allow a first, though tentative, assessment. Concerning the portability of FrameNet to languages other than English, the English frames worked well for the German lemmas we have seen so far.</Paragraph>
<Paragraph position="5"> For COMMERCIAL_TRANSACTION, a number of frame elements seem to be missing, but these are not language-specific, e.g. CREDIT (for "on commission" and "in installments").</Paragraph>
<Paragraph position="6"> The FrameNet frame database is not yet complete.</Paragraph>
<Paragraph position="7"> How often do annotators encounter missing frames? The frame UNKNOWN was assigned in 6.3% of the instances of REQUEST, and in 17.6% of the COMMERCIAL_TRANSACTION instances. The latter figure is due to the overwhelming number of UNKNOWN cases for verbrauchen, whose main sense in our corpus, "to use up a resource", is not offered by FrameNet.</Paragraph>
<Paragraph position="8"> Is the choice of frame always clear? And can frame elements always be assigned unambiguously? We have already seen above that frame element assignment is problematic for nouns. In the next section, we discuss problematic cases of frame assignment as well as frame element assignment.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 6 Vagueness, Ambiguity and Underspecification </SectionTitle>
<Paragraph position="0"> Annotation Challenges. It is a well-known problem from word sense annotation that it is often impossible to make a safe choice among the set of possible semantic correlates for a linguistic item. In frame annotation, this problem appears on two levels: the choice of a frame for a target is a choice of word sense, and the assignment of frame elements to phrases poses a second disambiguation problem.</Paragraph>
<Paragraph position="1"> An example of the first problem is the German verb verlangen, which is associated with both the frame REQUEST and the frame COMMERCIAL_TRANSACTION. We found several cases where both readings seem to be equally present, e.g. sentence (3). Sentences (4) and (5) exemplify the second problem. The italicised phrase in (4) may be either a SPEAKER or a MEDIUM, and the one in (5) either a MEDIUM or not a frame element at all. In our exhaustive annotation, these problems are much more virulent than in the FrameNet corpus, which consists mostly of prototypical examples.</Paragraph>
<Paragraph position="2"> (3) Gleichwohl versuchen offenbar Assekuranzen, [das Gesetz] zu umgehen, indem sie von Nichtdeutschen mehr Geld verlangen.</Paragraph>
<Paragraph position="3"> (Nonetheless, insurance companies evidently try to circumvent [the law] by asking/demanding more money from non-Germans.) (4) Die nachhaltigste Korrektur der Programmatik fordert ein Antrag ...</Paragraph>
<Paragraph position="4"> (The most fundamental policy correction is requested by a motion ...)
(5) Der Parteitag billigte ein Wirtschaftskonzept, in dem der Umbau gefordert wird.</Paragraph>
<Paragraph position="5"> (The party congress approved an economic concept in which a change is demanded.) Following Kilgarriff and Rosenzweig (2000), we distinguish three cases in which the assignment of a single semantic tag is problematic: (1) cases in which, judging from the available context information, several tags are equally possible for an ambiguous utterance; (2) cases in which more than one tag applies at the same time, because the sense distinction is neutralised in the context; and (3) cases in which the distinction between two tags is systematically vague or unclear.</Paragraph>
<Paragraph position="6"> In SALSA, we use the concept of underspecification to handle all three cases: annotators may assign underspecified frame and frame element tags. While the three cases have different semantic-pragmatic status, we tag all of them as underspecified. This is in accordance with the general view on underspecification in semantic theory (Pinkal, 1996). Furthermore, Kilgarriff and Rosenzweig (2000) argue that it is impossible to distinguish those cases. Allowing underspecified tags has several advantages. First, it avoids (sometimes dubious) decisions for a unique tag during annotation. Second, it is useful to know whether annotators systematically found it hard to distinguish between two frames or two frame elements; this diagnostic information can be used to improve the annotation scheme (e.g. by removing vague distinctions). Third, underspecified tags may indicate frame relations beyond an inheritance hierarchy, horizontal rather than vertical connections. In (3), the use of underspecification can indicate that the frames REQUEST and COMMERCIAL_TRANSACTION are used in the same situation, which in turn can serve to infer relations between their respective frame elements.</Paragraph>
<Paragraph position="7"> Evaluating underspecified annotation. In the previous section, we disregarded annotation cases involving underspecification. In order to evaluate underspecified tags, we present a method for computing inter-annotator agreement in the presence of underspecified annotations. Representing frames and frame elements as predicates that each take a sequence of word indices as their argument, a frame annotation can be seen as a pair (CF, CE) of two formulae describing the frame and the frame elements, respectively. Without underspecification, CF is a single predicate and CE is a conjunction of predicates. For the CONVERSATION frame of sentence (1), CF has the form CONVERSATION(Gespräch), and CE is INTLC_1(Koalition) ∧ TOPIC(über Reform); for readability, we use words instead of word indices here. Underspecification is expressed by conjuncts that are disjunctions instead of single predicates. Table 2 shows the admissible cases. For example, the CE of (4) contains the conjunct SPKR(ein Antrag) ∨ MEDIUM(ein Antrag). Our annotation scheme guarantees that every frame element name appears in at most one conjunct of CE. Exact agreement means that every conjunct of annotator A must correspond to a conjunct of annotator B, and vice versa. For partial agreement, it suffices that for each conjunct of A, one disjunct matches a disjunct in a conjunct of B, and conversely.</Paragraph>
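The matching criteria can be stated compactly in code. The following minimal sketch uses our own encoding, not the paper's: each conjunct of CE is a frozenset of (label, span) disjuncts, an annotation is a list of such conjuncts, and spans are shown as strings rather than word-index sequences for readability.

def exact_agreement(ann_a, ann_b):
    # Exact match: both annotators produced the same set of conjuncts.
    return set(ann_a) == set(ann_b)

def partial_agreement(ann_a, ann_b):
    # Partial match: every conjunct of one annotation shares at least
    # one disjunct with some conjunct of the other, and conversely.
    def covered(xs, ys):
        return all(any(x & y for y in ys) for x in xs)
    return covered(ann_a, ann_b) and covered(ann_b, ann_a)

# Sentence (4): annotator A underspecifies, annotator B commits.
a = [frozenset({("SPKR", "ein Antrag"), ("MEDIUM", "ein Antrag")})]
b = [frozenset({("MEDIUM", "ein Antrag")})]
assert partial_agreement(a, b) and not exact_agreement(a, b)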
Table 2: Admissible tags (F is a frame name, E a frame element name; t and s are sequences of word indices, where t is the target (FEE)).

Frame annotation:
  F(t): single frame, F is assigned to t
  F1(t) ∨ F2(t): frame disjunction, F1 or F2 is assigned to t

Frame element annotation:
  E(s): single frame element, E is assigned to s
  E1(s) ∨ E2(s): frame element disjunction, E1 or E2 is assigned to s
  E(s) ∨ NOFE(s): optional element, E or no frame element is assigned to s
  E(s) ∨ E(s1 s s2): underspecified length, E is assigned to s or to the longer sequence s1 s s2

<Paragraph position="8"> Using this measure of partial agreement, we now evaluate underspecified annotation. The most striking result is that annotators made little use of underspecification: frame underspecification was used for 0.4% of all frames, and frame element underspecification for 0.9% of all frame elements. The frame element MEDIUM, which was rarely assigned outside underspecification, accounted for roughly half of all underspecification in the REQUEST frame. 63% of the frame element underspecifications are cases of optional elements, the third class in the lower half of Table 2. (Partial) agreement on underspecified tags was considerably lower than on non-underspecified tags, both for frames (86%) and for frame elements (54%). This was to be expected, since the cases with underspecified tags are the more difficult and controversial ones. Since underspecified annotation is so rare, overall frame and frame element agreement including underspecified annotation is virtually the same as in Table 1.</Paragraph>
<Paragraph position="9"> It is unfortunate that annotators use underspecification only infrequently, since it can indicate interesting cases of relatedness between different frames and frame elements. However, underspecification may well find its main use during the merging of independent annotations of the same corpus: not only underspecified annotation but also disagreement between annotators can point out vague and ambiguous cases. If, for example, one annotator has assigned SPEAKER and the other MEDIUM in sentence (4), the best course is probably to use an underspecified tag in the merged corpus.</Paragraph>
</Section>
</Section>
</Paper>