<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1040"> <Title>Detecting Errors in Discontinuous Structural Annotation</Title> <Section position="3" start_page="322" end_page="323" type="metho"> <SectionTitle> 2 The variation n-gram method </SectionTitle> <Paragraph position="0"> Our approach builds on the variation n-gram algorithm introduced in Dickinson and Meurers (2003a,b). The basic idea behind that approach is that a string occurring more than once can occur with different labels in a corpus, which we refer to as variation. Variation is caused by one of two reasons: i) ambiguity: there is a type of string with multiple possible labels and different corpus occurrences of that string realize the different options, or ii) error: the tagging of a string is inconsistent across compa- null The more similar the context of a variation, the more likely the variation is an error. In Dickinson and Meurers (2003a), contexts are composed of words, and identity of the context is required.</Paragraph> <Paragraph position="1"> The term variation n-gram refers to an n-gram (of words) in a corpus that contains a string annotated differently in another occurrence of the same n-gram in the corpus. The string exhibiting the variation is referred to as the variation nucleus.</Paragraph> <Section position="1" start_page="323" end_page="323" type="sub_section"> <SectionTitle> 2.1 Detecting variation in POS annotation </SectionTitle> <Paragraph position="0"> In Dickinson and Meurers (2003a), we explore this idea for part-of-speech annotation. For example, in the WSJ corpus the string in (2) is a variation 12gram since off is a variation nucleus that in one corpus occurrence is tagged as a preposition (IN), while in another it is tagged as a particle (RP).2 (2) to ward off a hostile takeover attempt by two European shipping concerns Once the variation n-grams for a corpus have been computed, heuristics are employed to classify the variations into errors and ambiguities. The first heuristic encodes the basic fact that the label assignment for a nucleus is dependent on the context: variation nuclei in long n-grams are likely to be errors. The second takes into account that natural languages favor the use of local dependencies over non-local ones: nuclei found at the fringe of an n-gram are more likely to be genuine ambiguities than those occurring with at least one word of surrounding context. Both of these heuristics are independent of a specific corpus, annotation scheme, or language.</Paragraph> <Paragraph position="1"> We tested the variation error detection method on the WSJ and found 2495 distinct3 nuclei for the variation n-grams between the 6-grams and the 224grams. 2436 of these were actual errors, making for a precision of 97.6%, which demonstrates the value of the long context heuristic. 57 of the 59 genuine ambiguities were fringe elements, confirming that fringe elements are more indicative of a true ambiguity. null</Paragraph> </Section> <Section position="2" start_page="323" end_page="323" type="sub_section"> <SectionTitle> 2.2 Detecting variation in syntactic annotation </SectionTitle> <Paragraph position="0"> In Dickinson and Meurers (2003b), we decompose the variation n-gram detection for syntactic annotation into a series of runs with different nucleus sizes. This is needed to establish a one-to-one relation between a unit of data and a syntactic category annotation for comparison. Each run detects the variation in the annotation of strings of a specific length. 
<Paragraph position="1"> For example, the variation 4-gram from a year earlier appears 76 times in the WSJ, where the nucleus a year is labeled noun phrase (NP) 68 times, and 8 times it is not annotated as a constituent and is given the special label NIL. An example with two syntactic categories involves the nucleus next Tuesday as part of the variation 3-gram maturity next Tuesday, which appears three times in the WSJ.</Paragraph> <Paragraph position="2"> Twice it is labeled as a noun phrase (NP) and once as a prepositional phrase (PP).</Paragraph> <Paragraph position="3"> To be able to efficiently calculate all variation nuclei of a treebank, in Dickinson and Meurers (2003b) we make use of the fact that a variation necessarily involves at least one constituent occurrence of a nucleus and calculate the set of nuclei for a window of length i by first finding the constituents of that length. Based on this set, we then find non-constituent occurrences of all strings occurring as constituents. Finally, the variation n-grams for these variation nuclei are obtained in the same way as for POS annotation.</Paragraph> <Paragraph position="4"> In the WSJ, the method found 34,564 variation nuclei, up to size 46; an estimated 71% of the 6277 non-fringe distinct variation nuclei are errors.</Paragraph> </Section> </Section> <Section position="4" start_page="323" end_page="324" type="metho"> <SectionTitle> 3 Discontinuous constituents </SectionTitle> <Paragraph position="0"> In Dickinson and Meurers (2003b), we argued that null elements need to be ignored as variation nuclei because the variation in the annotation of a null element as the nucleus is largely independent of the local environment. For example, in (3) the null element *EXP* (expletive) can be annotated a. as a sentence (S) or b. as a relative/subordinate clause (SBAR), depending on the properties of the clause it refers to.</Paragraph> <Paragraph position="1"> (3) a. For cities losing business to suburban shopping centers , it *EXP* may be a wise business investment [S * to help * keep those jobs and sales taxes within city limits] . b. But if the market moves quickly enough , it *EXP* may be impossible [SBAR for the broker to carry out the order] because the investment has passed the specified price . We found that removing null elements as variation nuclei of size 1 increased the precision of error detection to 78.9%.</Paragraph> <Paragraph position="2"> Essentially, null elements represent discontinuous constituents in a formalism with a context-free backbone (Bies et al., 1995). Null elements are co-indexed with a non-adjacent constituent; in the predicate argument structure, the constituent should be interpreted where the null element is.</Paragraph> <Paragraph position="3"> To be able to annotate discontinuous material without making use of inserted null elements, some treebanks have instead relaxed the definition of a linguistic tree and have developed more complex graph annotations. An error detection method for such corpora thus does not have to deal with the problems arising from inserted null elements discussed above, but instead it must function appropriately even if constituents are discontinuously realized.</Paragraph> <Paragraph position="4"> A technique such as the variation n-gram method is applicable to corpora with a one-to-one mapping between the text and the annotation. For corpora with positional annotation--e.g., part-of-speech annotated corpora--the mapping is trivial given that the annotation consists of one-to-one correspondences between words (i.e., tokens) and labels. For corpora annotated with more complex structural information--e.g., syntactically annotated corpora--the one-to-one mapping is obtained by considering every interval (continuous string of any length) which is assigned a category label somewhere in the corpus.</Paragraph> <Paragraph position="5"> While this works for treebanks with continuous constituents, a one-to-one mapping is more complicated to establish for syntactic annotation involving discontinuous constituents (NEGRA, Skut et al., 1997; TIGER, Brants et al., 2002). In order to apply the variation n-gram method to discontinuous constituents, we need to develop a technique which is capable of comparing labels for any set of corpus positions, instead of for any interval.</Paragraph> </Section>
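Before turning to the extension itself, the following sketch (again ours, with invented example data) fixes the kind of unit this calls for: a category label paired with an arbitrary set of token positions rather than with an interval.

```python
from typing import NamedTuple, Tuple

class Annotated(NamedTuple):
    """A category label assigned to a set of token positions.

    For continuous constituents the positions form an interval; for
    discontinuous constituents they need not. Keeping the positions
    sorted preserves the surface order of the words.
    """
    positions: Tuple[int, ...]  # sorted token indices within a sentence
    label: str                  # e.g. "NP", "AP", or "NIL"

# A toy discontinuous unit covering positions 2 and 4 of a sentence:
sentence = ["sie", "sind", "sich", "aber", "einig"]
ap = Annotated(positions=(2, 4), label="AP")
print([sentence[i] for i in ap.positions])  # ['sich', 'einig']
```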
<Section position="5" start_page="324" end_page="327" type="metho"> <SectionTitle> 4 Extending the variation n-gram method </SectionTitle> <Paragraph position="0"> To extend the variation n-gram method to handle discontinuous constituents, we first have to define the characteristics of such a constituent (section 4.1), in other words our units of data for comparison.</Paragraph> <Paragraph position="1"> Then, we can find identical non-constituent (NIL) strings (section 4.2) and expand the context into variation n-grams (section 4.3).</Paragraph> <Section position="1" start_page="324" end_page="325" type="sub_section"> <SectionTitle> 4.1 Variation nuclei: Constituents </SectionTitle> <Paragraph position="0"> For traditional syntactic annotation, a variation nucleus is defined as a contiguous string with a single label; this allows the variation n-gram method to be broken down into separate runs, one for each constituent size in the corpus. For discontinuous syntactic annotation, since we are still interested in comparing cases where the nucleus is the same, we will treat two constituents as having the same size if they consist of the same number of words, regardless of the amount of intervening material, and we can again break the method down into runs of different sizes. The intervening material is accounted for when expanding the context into n-grams.</Paragraph> <Paragraph position="1"> A question arises concerning the word order of elements in a constituent. Consider the German example in (4). (4) weil der Mann der Frau das Buch gab 'because the man gave the woman the book.' The three arguments of the verb gab ('give') can be permuted in all six possible ways and still result in a well-formed sentence. It might seem, then, that we would want to allow different permutations of nuclei to be treated as identical. If das Buch der Frau gab is a constituent in another sentence, for instance, it should have the same category label as der Frau das Buch gab.</Paragraph> <Paragraph position="2"> Putting all permutations into one equivalence class, however, amounts to stating that all orderings are always the same. But even &quot;free word order&quot; languages are more appropriately called free constituent order; for example, in (4), the argument noun phrases can be freely ordered, but each argument noun phrase is an atomic unit, and in each unit the determiner precedes the noun.</Paragraph> <Paragraph position="3"> Since we want our method to remain data-driven and order can convey information which might be reflected in an annotation system, we keep strings with different orders of the same words distinct, i.e., ordering of elements is preserved in our method.</Paragraph> </Section>
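This notion of nucleus identity can be spelled out as follows (a sketch under the assumptions just stated; the position indices are invented): the size of a nucleus counts only its own words, ignoring intervening material, while its identity is the ordered tuple of those words, so that permutations of the same words remain distinct.

```python
def nucleus_key(sentence, positions):
    """Identity of a (possibly discontinuous) variation nucleus.

    The size counts only the nucleus's own words, regardless of any
    intervening material, and the key preserves their surface order.
    """
    words = tuple(sentence[i] for i in sorted(positions))
    return len(words), words

s1 = "weil der Mann der Frau das Buch gab".split()
s2 = "weil der Mann das Buch der Frau gab".split()
# 'der Frau das Buch gab' vs. 'das Buch der Frau gab': same size (5) and
# the same words, but distinct keys because the order differs.
print(nucleus_key(s1, [3, 4, 5, 6, 7]) == nucleus_key(s2, [3, 4, 5, 6, 7]))  # False
```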
<Section position="2" start_page="325" end_page="326" type="sub_section"> <SectionTitle> 4.2 Variation nuclei: Non-constituents </SectionTitle> <Paragraph position="0"> The basic idea is to compare a string annotated as a constituent with the same string found elsewhere--whether annotated as a constituent or not. So we need to develop a method for finding all string occurrences not analyzed as a constituent (and assign them the special category label NIL). Following Dickinson and Meurers (2003b), we only look for non-constituent occurrences of those strings which also occur at least once as a constituent.</Paragraph> <Paragraph position="1"> But do we need to look for discontinuous NIL strings or is it sufficient to assume only continuous ones? Consider the TIGER treebank examples (5).</Paragraph> <Paragraph position="2"> In example (5a), sich einig ('SELF agree') forms an adjective phrase (AP) constituent. But in example (5b), that same string is not analyzed as a constituent, despite being in a nearly identical sentence. We would thus like to assign the discontinuous string sich einig in (5b) the label NIL, so that the labeling of this string in (5a) can be compared to its occurrence in (5b).</Paragraph> <Paragraph position="3"> In consequence, our approach should be able to detect NIL strings which are discontinuous--an issue which requires special attention to obtain an algorithm efficient enough to handle large corpora.</Paragraph> <Paragraph position="4"> Use sentence boundary information The first consideration makes use of the fact that syntactic annotation by its nature respects sentence boundaries. In consequence, we never need to search for NIL strings that span across sentences. Use tries to store constituent strings The second consideration concerns how we calculate the NIL strings. To find every non-constituent string in the corpus, discontinuous or not, which is identical to some constituent in the corpus, a basic approach would first generate all possible strings within a sentence and then test to see which ones occur as a constituent elsewhere in the corpus. For example, if the sentence is Nobody died when Clinton lied, we would see if any of the 31 subsets of strings occur as constituents (e.g., Nobody, Nobody when, Clinton lied, Nobody when lied, etc.). But such a generate-and-test approach is clearly intractable given that it generates 2^n - 1 potential matches for a sentence of n words.</Paragraph> <Paragraph position="5"> We instead split the task of finding NIL strings into two runs through the corpus. In the first, we store all constituents in the corpus in a trie data structure (Fredkin, 1960), with words as nodes. In the second run through the corpus, we attempt to match the strings in the corpus with a path in the trie, thus identifying all strings occurring as constituents somewhere in the corpus.</Paragraph>
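The two-pass idea can be sketched as follows (a simplified illustration, not the authors' code): constituents are stored as paths in a trie, and a depth-first scan of each sentence then enumerates every continuous or discontinuous position sequence whose words spell out a stored constituent; matches that are not themselves annotated as constituents receive the label NIL. Because the scan only follows paths that exist in the trie, it avoids generating the 2^n - 1 candidate strings of the naive approach.

```python
from collections import defaultdict

def make_trie():
    return defaultdict(make_trie)

END = "__end__"  # marks that the path from the root spells a constituent

def add_constituent(trie, words):
    node = trie
    for w in words:
        node = node[w]
    node[END] = True

def constituent_matches(trie, sentence):
    """All (possibly discontinuous) position tuples in `sentence` whose
    words spell a path in `trie` that ends at a stored constituent."""
    results = []
    def walk(node, start, positions):
        if END in node:
            results.append(tuple(positions))
        for i in range(start, len(sentence)):
            if sentence[i] in node:  # only extend paths present in the trie
                walk(node[sentence[i]], i + 1, positions + [i])
    walk(trie, 0, [])
    return results

trie = make_trie()
add_constituent(trie, ["sich", "einig"])
print(constituent_matches(trie, ["Sie", "sind", "sich", "aber", "einig"]))
# [(2, 4)]
```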
<Paragraph position="6"> Filter out unwanted NIL strings The final consideration removes &quot;noisy&quot; NIL strings from the candidate set. Certain NIL strings are known to be useless for detecting annotation errors, so we should remove them to speed up the variation n-gram calculations. Consider example (6) from the TIGER corpus, where the continuous constituent die Menschen is annotated as a noun phrase (NP).</Paragraph> <Paragraph position="7"> Our basic method of finding NIL strings would detect another occurrence of die Menschen in the same sentence since nothing rules out that the other occurrence of die in the sentence (preceding Weltbank) forms a discontinuous NIL string with Menschen.</Paragraph> <Paragraph position="8"> Comparing a constituent with a NIL string that contains one of the words of the constituent clearly goes against the original motivation for wanting to find discontinuous strings, namely that they show variation between different occurrences of a string.</Paragraph> <Paragraph position="9"> To prevent such unwanted variation, we eliminate occurrences of NIL-labeled strings that overlap with identical constituent strings from consideration.</Paragraph> </Section> <Section position="3" start_page="326" end_page="326" type="sub_section"> <SectionTitle> 4.3 Variation n-grams </SectionTitle> <Paragraph position="0"> The more similar the context surrounding a variation nucleus, the more likely it is for a variation in its annotation to be an error. For detecting errors in traditional syntactic annotation (see section 2.2), the context consists of the elements to the left and the right of the nucleus. When nuclei can be discontinuous, however, there can also be internal context, i.e., elements which appear between the words forming a discontinuous variation nucleus.</Paragraph> <Paragraph position="1"> As in our earlier work, an instance of the a priori algorithm is used to expand a nucleus into a longer n-gram by stepwise adding context elements.</Paragraph> <Paragraph position="2"> Where previously it was possible to add an element to the left or the right, we now also have the option of adding it in the middle--as part of the new, internal context. But depending on how we fill in the internal context, we can face a serious tractability problem.</Paragraph> <Paragraph position="3"> Given a nucleus with j gaps within it, we need to potentially expand it in j + 2 directions, instead of in just 2 directions (to the right and to the left).</Paragraph> <Paragraph position="4"> For example, the potential nucleus was werden appears as a verb phrase (VP) in the TIGER corpus in the string was ein Seeufer werden; elsewhere in the corpus was and werden appear in the same sentence with 32 words between them. The chances of one of the middle 32 elements matching something in the internal context of the VP are relatively high, and indeed the twenty-sixth word is ein. However, if we move stepwise out from the nucleus in order to try to match was ein Seeufer werden, the only options are to find ein directly to the right of was or Seeufer directly to the left of werden, neither of which occurs, thus stopping the search.</Paragraph> <Paragraph position="5"> In conclusion, we obtain an efficient application of the a priori algorithm by expanding the context only to elements which are adjacent to an element already in the n-gram. Note that this was already implicitly assumed for the left and the right context.</Paragraph>
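A single expansion step can then be sketched as follows (illustrative; the example sentence is constructed): starting from the positions already in the n-gram, the only candidate additions are their immediate left and right neighbours, which covers external and gap-internal context alike.

```python
def expansions(sentence, positions):
    """One a-priori-style expansion step: add a single word directly
    adjacent to a word already in the n-gram (to its left, to its right,
    or inside a gap). Words deeper inside a gap only become reachable
    once the positions next to them have been added."""
    occupied = set(positions)
    adjacent = set()
    for i in positions:
        for j in (i - 1, i + 1):
            if 0 <= j < len(sentence) and j not in occupied:
                adjacent.add(j)
    return [tuple(sorted(occupied | {j})) for j in sorted(adjacent)]

s = "was denn nun ein Seeufer werden soll".split()
# Nucleus 'was ... werden' at positions 0 and 5; the candidates are
# position 1 (right of 'was') and positions 4 and 6 (around 'werden').
print(expansions(s, (0, 5)))  # [(0, 1, 5), (0, 4, 5), (0, 5, 6)]
```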
<Paragraph position="6"> There are two other efficiency-related issues worth mentioning. Firstly, as with the variation nucleus detection, we limit the n-gram expansion to sentences only. Since the category labels do not represent cross-sentence dependencies, we gain no new information if we find more context outside the sentence, and in terms of efficiency, we cut off what could potentially be a very large search space. (Note that similar sentences which were segmented differently could potentially cause varying n-gram strings not to be found; we propose to treat this as a separate sentence segmentation error detection phase in future work.) Secondly, the methods for reducing the number of variation nuclei discussed in section 4.2 have the consequence of also reducing the number of possible variation n-grams. For example, in a test run on the NEGRA corpus we allowed identical strings to overlap; this generated a variation nucleus of size 63, with 16 gaps in it, varying between NP and NIL within the same sentence. Fifteen of the gaps can be filled in and still result in variation. The filter for unwanted NIL strings described in the previous section eliminates the NIL value from consideration. Thus, there is no variation and no tractability problem in constructing n-grams.</Paragraph> <Paragraph position="7"> 4.3.1 Generalizing the n-gram context So far, we have assumed that the context added around variation nuclei consists of words. Given that treebanks generally also provide part-of-speech information for every token, we experimented with part-of-speech tags as a less restrictive kind of context. The idea is that it should be possible to find more variation nuclei with comparable contexts if only the part-of-speech tags of the surrounding words have to be identical instead of the words themselves.</Paragraph> <Paragraph position="8"> As we will see in section 5, generalizing n-gram contexts in this way indeed results in more variation n-grams being found, i.e., increased recall.</Paragraph> </Section> <Section position="4" start_page="326" end_page="327" type="sub_section"> <SectionTitle> 4.4 Adapting the heuristics </SectionTitle> <Paragraph position="0"> To determine which nuclei are errors, we can build on the two heuristics from previous research (Dickinson and Meurers, 2003a,b)--trust long contexts and distrust the fringe--with some modification, given that we have more fringe areas to deal with for discontinuous strings. In addition to the right and the left fringe, we also need to take into account the internal context in a way that maintains the non-fringe heuristic as a good indicator for errors. As a solution that keeps internal context on a par with the way external context is treated in our previous work, we require one word of context around every terminal element that is part of the variation nucleus.</Paragraph>
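One plausible reading of this requirement can be sketched as follows (our interpretation, not code from the paper): an occurrence counts as non-fringe only if both neighbours of every nucleus word lie within the variation n-gram, whether that neighbour is surrounding context or another nucleus word.

```python
def non_fringe(nucleus_positions, ngram_positions, n_tokens):
    """Check the adapted non-fringe heuristic for a discontinuous
    nucleus: every nucleus word needs a directly adjacent word on both
    sides that belongs to the variation n-gram."""
    covered = set(ngram_positions)  # nucleus plus all added context
    return all(
        0 <= j < n_tokens and j in covered
        for i in nucleus_positions
        for j in (i - 1, i + 1)
    )

# Nucleus at (0, 5): position 0 has no left neighbour, so it is fringe.
print(non_fringe((0, 5), (0, 1, 4, 5, 6), 7))  # False
print(non_fringe((2, 4), (1, 2, 3, 4, 5), 7))  # True
```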
<Paragraph position="2"> As discussed below, this heuristic turns out to be a good predictor of which variations are annotation errors; expanding to the longest possible context, as in Dickinson and Meurers (2003a), is not necessary.</Paragraph> </Section> </Section> </Paper>