<?xml version="1.0" standalone="yes"?> <Paper uid="W01-0904"> <Title>Translating Treebank Annotation for Evaluation</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Task </SectionTitle> <Paragraph position="0"> Given a subset of the examples from the Penn Treebank annotated with syntactic and part-of-speech information (slightly modified), the system should return the examples annotated with the correct CG categories attached to the words of the sentence and the lexicons these imply.</Paragraph> <Paragraph position="1"> The context of the task explains some parts of its definition. The translated corpus is to be used as a standard against which to compare the lexical annotation (i.e. the categories assigned to the words) of the output of an unsupervised CG learner that annotates the words of the examples with CG categories and then extracts a probabilistic lexicon (see Watkinson and Manandhar (2001) for details).</Paragraph> <Paragraph position="2"> Hence, there is no need for specific tree annotation. The learner currently uses a slightly modified subset of the treebank, which is described below.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 The Corpus </SectionTitle> <Paragraph position="0"> The systems are applied to examples from the Penn Treebank (Marcus et al., 1993; Marcus et al., 1994; Bies et al., 1994), a corpus of over 4.5 million words of American English annotated with both part-of-speech and syntactic tree information. To be exact, we are using the Treebank II version (Bies et al., 1994; Marcus et al., 1994), which attempts to address the problem of complement/adjunct distinction, which previous versions had ignored. 
While the documentation is clear that the complement/adjunct structure is not explicitly marked (Marcus et al., 1994), the annotation includes a set of labels that relate to the role of a particular constituent in the sentence. These labels are attached to the standard constituent label, and it is possible to use heuristics to determine the probable complement/adjunct structure in the trees (Collins, 1999; Xia, 1999), which is obviously useful in translating the annotation.</Paragraph> <Paragraph position="1"> The full Penn Treebank is not being used. As mentioned already, the current research only uses sentences without null elements (i.e. without movement) from the treebank and does not include any of the sentence fragments. However, as Categorial Grammar formalisms do not usually change the lexical entries of words to deal with movement, but use further rules (Wood, 1993; Steedman, 1993; Hockenmaier et al., 2000), the lexicons learned here will be valid over corpora with movement. The extracted corpus, C1, in fact contains 5000 of the declarative sentences of fifteen words or fewer (although the sentence length makes little difference to either of the translation procedures described) from the Wall Street Journal section of the treebank. To give an indication of the complexity of the corpus, the number of tokens, i.e. the total number of words including repetitions of the same word, is 47,782. The total number of unique words, i.e. not including repetitions of the same word, is 12,277. We also extracted C2, a 1000-example corpus (also of declarative sentences from the Wall Street Journal section) with 9467 tokens and 3731 unique words, which is used in the evaluation process.</Paragraph> <Paragraph position="2"> The corpora also have some small modifications: adjacent nominals in the same subtree are combined to form a single nominal, and the punctuation is removed. 
These modifications are made for use with the unsupervised learner (Watkinson and Manandhar, 2000; Watkinson and Manandhar, 2001) to simplify the learning process. They may also slightly simplify the translation process, but they are necessary for the corpus annotation that we want.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Categorial Grammar </SectionTitle> <Paragraph position="0"> Categorial Grammar (CG) (Wood, 1993; Steedman, 1993) provides a functional approach to lexicalised grammar, and so can be thought of as defining a syntactic calculus. Below we describe the basic (AB) CG. The current work uses this simple form of the grammar, which suffices for the syntactic annotation of the corpora currently being used.</Paragraph> <Paragraph position="1"> There is a set of atomic categories in CG, which are usually nouns (n), noun phrases (np), sentences (s) and sometimes prepositional phrases (pp), although this can be considered shorthand for the full category (Wood, 1993). It is then possible to build up complex categories using the two slash operators &quot;/&quot; and &quot;\&quot;. If A and B are categories then A/B and A\B are categories, where (following Steedman's notation (Steedman, 1993)) A is the resulting category when B, the argument category, is found. The direction of the &quot;slash&quot; functors indicates the position of the argument in the sentence, i.e. a &quot;/&quot; indicates that a word or phrase with the category of the argument should immediately follow in the sentence. With the &quot;\&quot;, the word or phrase with the argument category should immediately precede the word or phrase with this category. This is most easily seen with examples.</Paragraph> <Paragraph position="2"> Suppose we consider an intransitive verb like &quot;run&quot;. The category that is required to complete the sentence is a subject noun phrase. 
Hence, the category of &quot;run&quot; is a sentence that is missing a preceding noun phrase, i.e. s\np. Similarly, with a transitive verb like &quot;ate&quot;, the verb requires a subject noun phrase. However, it also requires an object noun phrase, which is attached first. The category for &quot;ate&quot; is therefore (s\np)/np.</Paragraph> <Paragraph position="3"> With basic CG there are just two rules for combining categories: the forward (FA) and backward (BA) functional application rules. In the notation above, these are X/Y Y ⇒ X (FA) and Y X\Y ⇒ X (BA), where X and Y are CG categories. In Figure 1 the parse derivation for &quot;John ate the apple&quot; is presented, showing examples of how these rules are applied to categories.</Paragraph> <Paragraph position="4"> The CG formalism described above has been shown to be weakly equivalent to context-free phrase structure grammars (Bar-Hillel et al., 1964). While such expressive power covers a large amount of natural language structure, it has been suggested that a more flexible and expressive formalism may capture natural language more accurately (Wood, 1993; Steedman, 1993).</Paragraph> <Paragraph position="5"> In future, we may consider applying the principle developed here to perform translations to these more complex formalisms, although many of the changes will not actually change the lexical entries, just the way they can be combined.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Alternative Approaches </SectionTitle> <Paragraph position="0"> This section presents the two approaches to translation that are being compared. Firstly, there is the top-down method, which is a version of the algorithm described by Hockenmaier et al. (2000), but used for translating into simple (AB) CG rather than Steedman's Combinatory Categorial Grammar (CCG) (Steedman, 1993). The algorithm here does not need to deal with movement, as the corpus does not contain any. 
The atomic pp category is included in the CG with this approach, but not with our approach, as it is a convenient shorthand for the prepositional phrase category.</Paragraph> <Paragraph position="1"> The second approach is a multiple-pass data-driven system. Rules for translating the trees are applied in order of complexity, starting with simple part-of-speech translation and finishing with a category generation stage.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Top-Down Category Generation </SectionTitle> <Paragraph position="0"> The algorithm has two stages.</Paragraph> <Paragraph position="1"> Mark constituents: All the nodes of all trees are marked with their roles, i.e. as heads, complements or adjuncts. While Hockenmaier et al. (2000) do not state how this is done, it is assumed that this is achieved using heuristics.</Paragraph> <Paragraph position="2"> Collins (1999) describes such a set of heuristics, which are used with some minor modifications for CG and the changed Penn Treebank annotation. Figure 2 shows an example of an annotated tree.</Paragraph> <Paragraph position="3"> Assign categories: This is a recursive top-down process, where the top category in the tree is an s. The category of the complements is determined by a mapping between Treebank labels and categories, e.g. NP in the treebank becomes np. Hockenmaier et al. (2000) do not provide the mapping, so one was built specially for this system. This mapping led to the inclusion of the pp category as shorthand for prepositional complements. It should make no difference to the annotation process, but could lead to the generation of a few more categories. The head child of a subtree is given the category of the parent plus the complements required, which are found by looking first to the left of the head and then to the right, and adding them in the order they should be processed in. 
Finally, adjuncts are assigned the generic X/X or X\X, where X is the head category without the complements that have already been dealt with before the adjunct is processed.</Paragraph> <Paragraph position="4"> Figure 3 shows an example of a tree with the categories assigned to it.</Paragraph> <Paragraph position="5"> This algorithm has several advantages. It is simple and robust and has been shown by Hockenmaier et al. (2000) to provide good lexical annotation leading to useful CCG lexicons.</Paragraph> <Paragraph position="6"> However, it has two main disadvantages.</Paragraph> <Paragraph position="7"> Firstly, there is no control over category generation other than the rather weak constraints of the formalism and the heuristic syntactic roles. This is likely to lead to some linguistically implausible annotation. Secondly, the top-down nature of the algorithm is likely to lead to any translation errors being propagated down the tree, which will lead to some unusual and large categories, as Hockenmaier et al. (2000) report.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Bottom-Up Sequential </SectionTitle> <Paragraph position="0"> Our system uses a four-stage process, where the type of translation changes at each stage.</Paragraph> <Paragraph position="1"> This is the simplest level of translation. The mapping between the Penn Treebank part-of-speech annotation and the CG category annotation is many-to-many, but some parts-of-speech can be translated directly into categories using simple rules, e.g. the following rule states that words with the determiner part of speech (DT) can be translated into the CG category np/n. DT → np/n The system passes through the full set of examples and translates the appropriate parts-of-speech. 
See Figure 4 for an example of the output of this stage.</Paragraph> <Paragraph position="2"> The next pass through the data allows more complex rules to be used. Consider the part-of-speech label NNS, used in the Penn Treebank annotation scheme to indicate a plural noun. Its syntactic role can be that of a simple noun (n) or a noun phrase (np), so we need a mechanism for choosing between these two possibilities.</Paragraph> <Paragraph position="3"> The most obvious mechanism is to use the surrounding subtree to provide the context to select the correct rule. If the NNS tag is part of a noun phrase which begins with something fulfilling the determiner role, then the tag should be translated to the CG category n, otherwise it should be translated as an np.</Paragraph> <Paragraph position="4"> The algorithm for applying the set of context-based rules is a simple matching process throughout the treebank. Figure 5 shows the output from this stage on an example.</Paragraph> <Paragraph position="5"> In this stage, the system uses further knowledge to attempt to inform the translation process. Where words have not been translated, the system annotates the subtree with the head, complements and adjuncts using a modified version of Collins' heuristics (Collins, 1999).</Paragraph> <Paragraph position="6"> Further categories can now be obtained. For example, if the head of the subtree requires an np category to its right as its first complement and there is a word marked as a complement in this position, then it can be translated as an np. Alternatively, if the head category is unknown, but it is verbal according to the Penn Treebank label, then looking at the categories of the complements can determine the type of verb it is, e.g. no complements following a verb indicates the CG category s\np. 
Figure 6 shows the effects of this stage on the example.</Paragraph> <Paragraph position="7"> In the final stage, each lexical category that has not been annotated is given a variable as its category. The tree is then traversed bottom-up, instantiating these variables using the head, complement and adjunct annotation and the already annotated categories. The building of head and adjunct categories follows the same process as described for the top-down algorithm. Complements either gain their categories through this process or have already had them assigned. Figure 7 shows the final output.</Paragraph> <Paragraph position="8"> This approach has two main advantages.</Paragraph> <Paragraph position="9"> Firstly, the user has control over the type of CG to which the treebank is translated, due to the use of predefined categories for predefined contexts. Secondly, the bottom-up approach ensures that translation errors are not propagated seriously through the tree.</Paragraph> <Paragraph position="10"> A further potential advantage has not, as yet, been fully investigated. The system, due to its multi-pass nature, has the potential for translations to clash. Experience has shown that this occurs when there is an annotation error, so the system can be used to highlight these errors and can also provide some level of self-correction. This has not been investigated in detail, but the current approach, which gives satisfactory results, is to assume the head category is correct and adjust complements and adjuncts accordingly. In future, a simple correction scheme could easily be added to produce a self-correcting translator.</Paragraph> <Paragraph position="11"> The main weakness of the system is the reliance upon the head/complement/adjunct annotating heuristics, which were not designed to be used with a CG.</Paragraph> <Paragraph position="12"> The system also returns some categories with variables. 
This is due in part to the heuristics and in part to the small number of rules currently used in the early stages of the translation process. Most of the problem categories could be dealt with by the addition of a few more rules in stages 2 and 3.</Paragraph> </Section> </Section> </Paper>