<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0307"> <Title>Tagging Grammatical Functions</Title> <Section position="10" start_page="511" end_page="511" type="evalu"> <SectionTitle> 6 Experiments </SectionTitle> <Paragraph position="0"> To investigate the possibility of automating annotation, experiments were performed with the cleaned part of the treebank (approx. 1,200 sentences, 24,000 words). (Footnote 6: the corpus is provided on the ECI CD-ROM and has been part-of-speech tagged and manually corrected previously.) The first run of experiments was carried out to test the tagging of grammatical functions, the second run to test the tagging of phrase categories.</Paragraph> <Section position="1" start_page="511" end_page="511" type="sub_section"> <SectionTitle> 6.1 Grammatical Functions </SectionTitle> <Paragraph position="0"> This experiment tested the reliability of assigning grammatical functions given the category of the phrase and the daughter nodes (supplied by the annotator). Let us consider the sentence in figure 6: two sequences of grammatical functions are to be determined, namely the grammatical functions of the daughter nodes of S and VP. The information given for selbst besucht Sabine is its category (VP) and the daughter categories: adverb (ADV), past participle (VVPP), and proper noun (NE). The task is to assign the functions modifier (MO) to ADV, head (HD) to VVPP, and direct (accusative) object (OA) to NE. Similarly, function tags are assigned to the components of the sentence (S).</Paragraph> <Paragraph position="1"> The tagger described in section 4 was used.</Paragraph> <Paragraph position="2"> The corpus was divided into two disjoint parts, one for training (90% of the respective corpus) and one for testing (10%). This procedure was repeated 10 times with different partitions, and the average accuracy was calculated.</Paragraph> <Paragraph position="3"> The thresholds for the search beams were set to θ1 = 5 and θ2 = 100, i.e., a decision is classified as reliable if there is no alternative with a probability larger than 1/100 of that of the best function tag. The prediction is classified as unreliable if the probability of an alternative is larger than 1/5 of that of the most probable tag.</Paragraph> <Paragraph position="4"> If there is an alternative between these two thresholds, the prediction is classified as almost reliable and marked in the output (cf. section 4.3: marked assignments are to be confirmed by the annotator, unreliable assignments are deleted and annotation is left to the annotator).</Paragraph>
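<Paragraph position="5"> As a concrete illustration of this three-way classification, the following Python sketch labels a decision as reliable, almost reliable, or unreliable from a list of candidate tags and their probabilities. It is not the implementation used for the experiments; the function name and the input format are assumptions made only for illustration.

    # Illustrative sketch of the reliability classification of section 6.1
    # (thresholds theta1 = 5, theta2 = 100); not the paper's actual code.
    THETA_1 = 5    # unreliable if an alternative exceeds 1/5 of the best probability
    THETA_2 = 100  # reliable if no alternative exceeds 1/100 of the best probability

    def classify_reliability(candidates):
        """candidates: list of (tag, probability) pairs for one node (assumed format)."""
        ranked = sorted(candidates, key=lambda tp: tp[1], reverse=True)
        best_tag, best_prob = ranked[0]
        # probability of the strongest competitor, 0.0 if there is none
        runner_up = ranked[1][1] if len(ranked) > 1 else 0.0

        if runner_up > best_prob / THETA_1:
            level = "unreliable"       # left to the annotator
        elif runner_up > best_prob / THETA_2:
            level = "almost reliable"  # marked, to be confirmed by the annotator
        else:
            level = "reliable"
        return best_tag, level

    # Example: OA is preferred, but SB is within 1/100 of it, so the
    # assignment is marked for confirmation; prints ('OA', 'almost reliable')
    print(classify_reliability([("OA", 0.92), ("SB", 0.05), ("MO", 0.03)]))
</Paragraph>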
<Paragraph position="6"> Table 1 shows tagging accuracy depending on the three different levels of reliability. The results confirm the choice of the reliability measures: the lower the reliability, the lower the accuracy.</Paragraph> <Paragraph position="7"> Table 1: Percentage of cases where the tagger assigned a correct grammatical function (or would have assigned one if a decision had been forced).</Paragraph> <Paragraph position="8"> Table 2 shows tagging accuracy depending on the category of the phrase and the level of reliability. The table contains the following information: the number of all mother-daughter relations (i.e., the number of words and phrases that are immediately dominated by a mother node of a particular category), the overall accuracy for that phrasal category, and the accuracies for the three reliability intervals.</Paragraph> <Paragraph position="9"> Table 2: Accuracy of assigning grammatical functions depending on the category of the mother node. For each category, the first row shows the percentage of branches that occur within this category and the overall accuracy; the following rows show the relative percentage and accuracy for different levels of reliability.</Paragraph> </Section> <Section position="2" start_page="511" end_page="511" type="sub_section"> <SectionTitle> 6.2 Error Analysis for Function Assignment </SectionTitle> <Paragraph position="0"> The inspection of tagging errors reveals several sources of wrong assignments. Table 3 shows the 10 most frequent errors, which constitute 25% of all errors (1,509 errors occurred during the 10 test runs).</Paragraph> <Paragraph position="1"> Table 3: The most frequent errors in assigning grammatical functions. The table shows a mother and a daughter node category, the frequency of this particular combination (summed over the 10 test runs), the grammatical function assigned manually (and its frequency), and the grammatical function assigned by the tagger (and its frequency).</Paragraph> <Paragraph position="2"> Read the table in the following way: line 2 shows the second-most frequent error. It concerns NPs occurring in a sentence (S); this combination occurred 1,477 times during testing. In 286 of these occurrences the NP was manually assigned the grammatical function OA (accusative object), and in 56 of these 286 cases the tagger assigned the function SB (subject) instead.</Paragraph> <Paragraph position="3"> The errors fall into the following classes: 1. There is insufficient information in the node labels to disambiguate the grammatical function.</Paragraph> <Paragraph position="4"> Line 1 is an example of insufficient information. The tag NP is uninformative about its case, and the tagger therefore has to distinguish SB (subject) and OA (accusative object) on the basis of position, which is not very reliable in German. Missing information in the labels is the main source of errors. Therefore, we are currently investigating the benefits of a morphological component and of percolating selected information to parent nodes.</Paragraph> <Paragraph position="5"> 2. Due to the n-gram approach, the tagger only sees a local window of the sentence.</Paragraph> <Paragraph position="6"> Some linguistic knowledge is inherently global, e.g., there is at most one subject in a sentence and one head in a VP. Errors of this type may be reduced by introducing finite-state constraints that restrict the possible sequences of functions within each phrase (a sketch of such a constraint check is given at the end of this subsection).</Paragraph> <Paragraph position="7"> 3. The manual annotation is wrong, and a correct tagger prediction is counted as an error.</Paragraph> <Paragraph position="8"> At earlier stages of annotation, the main source of errors was wrong or missing manual annotation. In some cases, the tagger was able to abstract away from these errors during the training phase and subsequently assigned the correct tag to the test data. However, when comparing against the corpus, these differences are counted as errors. Most of these errors were eliminated by comparing two independent annotations and cleaning up the data.</Paragraph>
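<Paragraph position="9"> The finite-state constraints mentioned under error class 2 can be made concrete with a small sketch that filters candidate function sequences for a phrase. This is not part of the tagger described here; the concrete constraints (at most one SB per S, exactly one HD per VP) and the function names are assumptions chosen only to illustrate the idea.

    # Illustrative sketch of simple global constraints on the function
    # sequence within one phrase; not part of the described tagger.
    def satisfies_constraints(mother_category, function_sequence):
        """function_sequence: proposed grammatical functions of the daughters."""
        if mother_category == "S" and function_sequence.count("SB") > 1:
            return False   # at most one subject in a sentence
        if mother_category == "VP" and function_sequence.count("HD") != 1:
            return False   # assumed here: exactly one head in a VP
        return True

    def filter_candidates(mother_category, ranked_sequences):
        """Keep only candidate sequences (best first) that obey the constraints."""
        return [seq for seq in ranked_sequences
                if satisfies_constraints(mother_category, seq)]

    # Example: the second-best sequence wins because the best one has two subjects
    candidates = [["SB", "HD", "SB"], ["SB", "HD", "OA"]]
    print(filter_candidates("S", candidates))  # [['SB', 'HD', 'OA']]
</Paragraph>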
</Section> <Section position="3" start_page="511" end_page="511" type="sub_section"> <SectionTitle> 6.3 Phrase Categories </SectionTitle> <Paragraph position="0"> In this experiment, the reliability of assigning phrase categories given the categories of the daughter nodes (supplied by the annotator) was tested.</Paragraph> <Paragraph position="1"> Consider the sentence in figure 6: two phrase categories are to be determined (VP and S). The information given for selbst besucht Sabine is the sequence of categories: adverb (ADV), past participle (VVPP), and proper noun (NE). The task is to assign the category VP. Subsequently, S is to be assigned based on the categories of the daughters VP, VAFIN, NE, and ADV.</Paragraph> <Paragraph position="2"> The extended tagger using the combined model described in section 5 was applied.</Paragraph> <Paragraph position="3"> Again, the corpus was divided into two disjoint parts, one for training (90% of the corpus) and one for testing (10%). The procedure was repeated 10 times with different partitions, and the average accuracy was calculated.</Paragraph> <Paragraph position="4"> The same thresholds for the search beams as in the first set of experiments were used.</Paragraph> <Paragraph position="5"> Table 4 shows tagging accuracy depending on the three different levels of reliability.</Paragraph> <Paragraph position="6"> Table 4: Percentage of cases in which the tagger assigned a correct phrase category (or would have assigned one if a decision had been forced).</Paragraph> <Paragraph position="7"> Table 5 shows tagging accuracy depending on the category of the phrase and the level of reliability. The table contains the following information: the percentage of occurrences of the particular phrase, the overall accuracy for that phrasal category, and the accuracy for each of the three reliability intervals.</Paragraph> <Paragraph position="8"> Table 5: Tagging accuracy for assigning phrase categories, depending on the manually assigned category. For each category, the first row shows the percentage of phrases belonging to a specific category (according to manual assignment) and the percentage of correct assignments; the following rows show the relative percentage and accuracy for different levels of reliability.</Paragraph>
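<Paragraph position="9"> The evaluation protocol used in this and the previous experiment (ten random 90%/10% partitions of the corpus, accuracy averaged over the runs) can be sketched as follows. The tagger interface (train and predict functions passed in as arguments) and the corpus representation are assumptions made for illustration, not the setup actually used.

    # Illustrative sketch of the ten-fold 90/10 evaluation protocol of
    # sections 6.1 and 6.3; interface and data layout are assumed here.
    import random

    def evaluate(corpus, train_fn, predict_fn, runs=10, train_fraction=0.9):
        """corpus: list of (input, gold_label) pairs; returns the average accuracy."""
        accuracies = []
        for _ in range(runs):
            shuffled = corpus[:]
            random.shuffle(shuffled)            # a different partition for every run
            cut = int(len(shuffled) * train_fraction)
            train_set, test_set = shuffled[:cut], shuffled[cut:]
            model = train_fn(train_set)         # e.g., estimate n-gram frequencies
            correct = sum(1 for item, gold in test_set
                          if predict_fn(model, item) == gold)
            accuracies.append(correct / len(test_set))
        return sum(accuracies) / len(accuracies)
</Paragraph>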
</Section> <Section position="4" start_page="511" end_page="511" type="sub_section"> <SectionTitle> 6.4 Error Analysis for Category Assignment </SectionTitle> <Paragraph position="0"> When forced to make a decision (even in unreliable cases), 435 errors occurred during the 10 test runs (an error rate of 4.5%). Table 6 shows the 10 most frequent errors, which constitute 50% of all errors.</Paragraph> <Paragraph position="1"> Table 6: The most frequent errors in assigning phrase categories (summed over reliability levels). The table shows the phrase category assigned manually (and its frequency) and the category erroneously assigned by the tagger (and its frequency).</Paragraph> <Paragraph position="2"> The most frequent error was the confusion of S and VP. They differ in that sentences (S) contain finite verbs, while verb phrases (VP) contain non-finite verbs. But the tagger is trained on data that contain incomplete sentences and therefore sometimes erroneously assumes an incomplete S instead of a VP. To avoid this type of error, the tagger should be able to take the neighborhood of phrases into account; it could then detect the finite verb that completes the sentence.</Paragraph> <Paragraph position="3"> Adjective phrases (AP) and noun phrases (NP) are confused by the tagger (line 5 in table 6), since almost all APs can be NPs. This error could also be fixed by inspecting the context and detecting the associated NP.</Paragraph> <Paragraph position="4"> As for the assignment of grammatical functions, insufficient information in the labels is a significant source of errors, cf. the second-most frequent error. A large number of cardinal-noun pairs form a numerical component (NM), like 7 Millionen, 50 Prozent, etc. (7 million, 50 percent). But this combination also occurs in NPs like 20 Leute, 3 Monate, ... (20 people, 3 months), which are mis-tagged since they are less frequent. This can be fixed by introducing an extra tag for nouns denoting numericals.</Paragraph> </Section> </Section> </Paper>