File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/c04-1155_metho.xml
Size: 20,022 bytes
Last Modified: 2025-10-06 14:08:47
<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1155"> <Title>A Flexible Example Annotation Schema: Translation Corresponding Tree Representation</Title> <Section position="3" start_page="1" end_page="1" type="metho"> <SectionTitle> 2 Translation Corresponding Tree (TCT) Representation </SectionTitle> <Paragraph position="0"> The TCT structure, an extension of the structured string-tree correspondence representation (Boitet and Zaharin, 1988), is a general structure that not only associates the string of a sentence with its syntactic structure in the source language, but also allows the annotator to explicitly associate the string of its translation in the target language, in order to describe the correspondences between the two languages.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.1 The TCT Structure </SectionTitle> <Paragraph position="0"> The TCT representation uses a triple of sequence intervals [SNODE(n)/STREE(n)/STC(n)], encoded for each node in the tree, to represent the correspondences between the structure of the source sentence and the substrings of both the source and target sentences. In a TCT structure, the correspondence is made up of three interrelated correspondences: 1) one between the node and a substring of the source sentence, encoded by the interval SNODE(n), which denotes the interval containing the substring corresponding to the node; 2) one between the subtree and a substring of the source sentence, represented by the interval STREE(n), which indicates the interval of the substring dominated by the subtree rooted at the node; and 3) one between the subtree of the source sentence and a substring of the target sentence, expressed by the interval STC(n), which indicates the interval containing the substring in the target sentence that corresponds to the subtree. The associated substrings may be discontinuous in all cases. 
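The triple of intervals can be pictured as a small record attached to each node. The following is only a minimal sketch under our own assumptions: the class name TCTNode, the helper substring, and the choice to store each interval as a list of inclusive (start, end) pairs (so that discontinuous substrings are possible) are hypothetical and not part of the original formalism. The values used are the NP intervals given later in this section for the example 'Onde ficam as barracas de praia?'.

```python
from dataclasses import dataclass, field

# Hypothetical representation: an interval is a list of inclusive
# (start, end) position pairs, so discontinuous substrings are allowed.


@dataclass
class TCTNode:
    label: str     # grammatical category, e.g. 'NP'
    snode: list    # SNODE(n): source positions of the node's own substring
    stree: list    # STREE(n): source positions covered by the subtree
    stc: list      # STC(n): target positions of the subtree's translation
    children: list = field(default_factory=list)


def substring(tokens, interval):
    """Return the tokens selected by an interval (positions are 1-based)."""
    picked = []
    for start, end in interval:
        picked.extend(tokens[start - 1:end])
    return picked


source = ['Onde', 'ficam', 'as', 'barracas', 'de', 'praia']
target = ['Geng', 'Yi', 'Shi', 'Zai', 'Na', 'Li']  # one position per character

np_node = TCTNode('NP', snode=[(4, 4)], stree=[(3, 6)], stc=[(1, 3)])

print(substring(source, np_node.snode))  # ['barracas']
print(substring(source, np_node.stree))  # ['as', 'barracas', 'de', 'praia']
print(substring(target, np_node.stc))    # ['Geng', 'Yi', 'Shi']
```

Storing a list of pairs rather than a single pair is one simple way to honour the remark that the associated substrings may be discontinuous.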
This annotation schema is well suited to representing translation examples: it preserves the strength of the original representation in describing non-standard and non-projective linguistic phenomena of a language (Boitet and Zaharin, 1988; Al-Adhaileh et al., 2002), and it also allows the annotator to flexibly attach the corresponding translation substring from the target sentence to the representation tree of the source sentence when necessary. This is the central idea behind the TCT formalism.</Paragraph> <Paragraph position="1"> the translation example &quot;Onde ficam as barracas de praia? (Where are the bathhouses?) / Geng Yi Shi Zai Na Li ?&quot; and its phrase structure together with the correspondences between the substrings (of both the source and target sentences) and the subtrees of the sentence in the source language.</Paragraph> <Paragraph position="2"> As illustrated in Figure 2, the translation example &quot;Onde ficam as barracas de praia?/ Geng Yi Shi Zai Na Li ?&quot; is annotated as a TCT structure. Based on the interpretation structure of the source sentence &quot;Onde ficam as barracas de praia?&quot;, the correspondences between the substrings (of the source and target sentences) and the grammatical units at the different intermediate levels of the syntactic tree of the source sentence are expressed in terms of sequence intervals. The words of the sentence pair are assigned their positions, i.e. &quot;Onde (1)&quot;, &quot;ficam (2)&quot;, &quot;as (3)&quot;, &quot;barracas (4)&quot;, &quot;de (5)&quot; and &quot;praia (6)&quot; for the source sentence, and likewise for the target sentence. 
However, Chinese is written in ideograms without explicit word delimiters, and identifying word boundaries is itself the task of word segmentation (Teahan et al., 2000). Therefore, instead of assigning word-level indices with the help of a word segmentation utility, a position is assigned to each character of the target (Chinese) sentence, i.e. &quot;Geng (1)&quot;, &quot;Yi (2)&quot;, &quot;Shi (3)&quot;, &quot;Zai (4)&quot;, &quot;Na (5)&quot; and &quot;Li (6)&quot;. Hence, a substring of the source sentence that corresponds to a node of its representation is denoted by the interval encoded in SNODE(n) for that node; e.g. the shaded node NP, with interval SNODE(NP)=4, corresponds to the substring &quot;barracas&quot; in the source sentence, which has the same interval. A substring of the source sentence that corresponds to a subtree of its syntactic tree is denoted by the interval recorded in STREE(n) at the root of the subtree; e.g. the subtree of the shaded node NP, encoded with the interval STREE(NP)=3-6, corresponds to the substring &quot;as barracas de praia&quot; in the source sentence. The translation correspondence between a subtree of the source sentence and a substring of the target sentence is denoted by the interval assigned to STC(n) of each node; e.g. 
the subtree rooted at the shaded node NP, with interval STC(NP)=1-3, corresponds to the translation fragment (substring) &quot;Geng Yi Shi&quot; in the target sentence.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.2 Expressiveness of Linguistic Information </SectionTitle> <Paragraph position="0"> Another inherited characteristic of the TCT structure is that it can be flexibly extended to hold various kinds of linguistic information considered useful for a specific purpose, in particular information that captures the structural divergences between two languages (Wong et al., 2001). Each node representing a grammatical constituent in the TCT annotation is tagged with its grammatical category (part of speech). This feature makes the structure well suited to describing linguistic phenomena that are specific to a language. For instance, in our case, the crossing dependencies (syntax transformation rules) between the sentence constituents of Portuguese and Chinese are captured and attached to each constituent node in the TCT structure, indicating the order in which the translation for the node is formed from the subtrees it dominates. In many phrasal matching approaches, both constituency-oriented (Kaji et al., 1992; Grishman, 1994) and dependency-oriented (Matsumoto et al., 1993; Watanabe et al., 2000; Aramaki et al., 2001), crossing constraints are applied implicitly when finding the structural correspondences between the pair of representation trees of a source sentence and its translation in the target language. 
In our TCT representation, we instead adopt the constraint of (Wu, 1995) for a constituent unit, under which the immediate subtrees are allowed to cross only in inverted order.</Paragraph> <Paragraph position="1"> During target language generation, such constraints help determine the order in which the translation of an intermediate constituent unit is produced from its subtrees when the corresponding translation of the unit is not associated in the TCT representation.</Paragraph> <Paragraph position="2"> sentence-constituents of source language and its translation in target language are recorded in TCT structure.</Paragraph> <Paragraph position="3"> Figure 3 demonstrates the crossing relations between the source and target constituents in a TCT representation structure. In the graphical annotation, a horizontal line represents the inversion of the translation fragments of a node's immediate subtrees. For example, the translation substring &quot;Geng Yi Shi Zai&quot; of the shaded node VP can be obtained by inverting the order of the corresponding target translations &quot;Zai&quot; and &quot;Geng Yi Shi&quot; of the dominated nodes V and NP. Such a schema can therefore serve as a means of representing translation examples and of finding structural correspondences for transfer grammar learning (Watanabe et al., 2000; Matsumoto et al., 1993; Meyers et al., 1998).</Paragraph> </Section> </Section> <Section position="4" start_page="1" end_page="11" type="metho"> <SectionTitle> 3 Construction of Example Base </SectionTitle> <Paragraph position="0"> In the construction of a bilingual knowledge base (example base) for an example-based machine translation system (Sato and Nagao, 1990; Watanabe et al., 2000), translation examples are usually annotated by means of a pair of analyzed structures, where the correspondences between the source and target sentences are established at the structural level through explicit links. 
Here, to facilitate the representation of such examples, we use the Translation Corresponding Tree as the basic annotation structure. The main difference and advantage of our approach is that it uses a parser for a single language rather than two different parsers, one for each language (Tang and Al-Adhaileh, 2001).</Paragraph> <Paragraph position="1"> In our example base, each translation pair is stored as a TCT structure. The construction starts by analyzing the grammatical structure of the Portuguese sentence with the aid of a Portuguese parser, while a shallow analysis of the Chinese sentence is carried out using the Chinese Lexical Analysis System (ICTCLAS) (Zhang, 2002) to segment the words and tag them with parts of speech. The grammatical structure produced by the parser for the Portuguese sentence is then used to establish the correspondences between the surface substrings and the intermediate levels of its structure, which include the correspondences between nodes and their substrings as well as those between subtrees and substrings of the sentence. Next, to identify and establish the translation correspondences for the structural constituents of the Portuguese sentence, the process relies on the grammatical information of the analyzed Portuguese structure and a given bilingual dictionary to search for the corresponding translation substrings in the Chinese sentence.</Paragraph> <Paragraph position="2"> Finally, the resulting TCT structure is verified and edited manually to obtain the final representation, which is the basic element of the knowledge base.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.1 The TCT Generation Algorithm </SectionTitle> <Paragraph position="0"> In the overall construction process, the vital task is compiling the syntactic structure of the source sentence into the TCT representation by linking the translation fragments from the target sentence. 
The following steps present the complete process of generating a TCT structure for the translation example &quot;Actos anteriores a publicidade da accao (Publicity of action prior to acts) / Zai Su Song Gong Kai Qian Suo Zuo Zhi Xing Wei&quot;.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> Parsing Portuguese Sentence </SectionTitle> <Paragraph position="0"> The process begins by parsing the Portuguese sentence with a Portuguese parser. The parsing result is a phrase structure in bracketed notation. Each bracketed constituent of the structure tree is labeled with a grammatical category. Figure 4 shows the resulting parsed structure of the Portuguese sentence.</Paragraph> </Section> <Section position="3" start_page="1" end_page="1" type="sub_section"> <SectionTitle> Analyzing Chinese Sentence </SectionTitle> <Paragraph position="0"> The construction of the TCT structure is fundamentally based on the syntactic structure of the Portuguese sentence. Finding translation units between the sentence pair relies on the structure tree of the Portuguese sentence and the sequence of lexical words of the Chinese sentence. Thus, instead of analyzing the Chinese sentence in depth, we analyze it at the lexical level using the Chinese Lexical Analysis System (ICTCLAS) (Zhang, 2002). 
Each Chinese word is delimited by spaces and assigned a part of speech, as illustrated in Figure 5.</Paragraph> </Section> <Section position="4" start_page="1" end_page="1" type="sub_section"> <SectionTitle> Constructing Correspondence Structure for Portuguese Sentence </SectionTitle> <Paragraph position="0"> After parsing and obtaining the syntactic structure of the Portuguese sentence, the next step is to compute the correspondences between the structure and the surface string of the source sentence. These include the phrase corresponding to each constituent unit in the tree and the content word that heads the constituent unit; both correspondences are denoted by the sequence intervals of the substrings they span. To find the phrasal substrings corresponding to subtrees, we start by associating the lexical words with the corresponding terminal nodes of the structure tree, assigning the related offsets to SNODE(n) and STREE(n) of those nodes. We then proceed to the next higher level of constituent units in the tree, where the corresponding substrings are derived by connecting the lexical words of the lower-level nodes they dominate. In principle, if a node N has daughters N_1, ..., N_m,</Paragraph> <Paragraph position="2"> then the sequence interval for N is STREE(N) = STREE(N_1) ∪ ... ∪ STREE(N_m);</Paragraph> <Paragraph position="4"> that is, the interval is bounded by the spans of its immediate subtrees. To identify the lexical head of a constituent unit, we use a simple rule: we consider the grammatical category of the phrasal unit, choose the word of the same category among the daughter nodes, and assign the interval of the chosen word to SNODE(N). 
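The bottom-up computation of STREE(N) and SNODE(N) might be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually implemented: the Node class, the functions annotate and as_intervals, the head table HEAD_CATEGORY, and the internal shape of the NP subtree (ART, N, PP) are all hypothetical. Only the resulting values SNODE(NP)=4 and STREE(NP)=3-6 are taken from the example in Section 2.1.

```python
# Hypothetical sketch: positions are kept as sets of 1-based word indices,
# so STREE of a parent is simply the union of its daughters' STREE sets.

# Assumed head table (not from the paper): phrase category to head category.
HEAD_CATEGORY = {'NP': 'N', 'PP': 'PRP', 'VP': 'V'}


class Node:
    def __init__(self, category, position=None, children=()):
        self.category = category
        self.children = list(children)
        # Terminal nodes start with SNODE = STREE = their own word position.
        self.snode = {position} if position is not None else set()
        self.stree = set(self.snode)


def annotate(node):
    """Fill STREE and SNODE bottom-up for every non-terminal node."""
    if not node.children:
        return node
    for child in node.children:
        annotate(child)
        node.stree |= child.stree          # union of the daughters' spans
    wanted = HEAD_CATEGORY.get(node.category)
    for child in node.children:
        if child.category == wanted:
            node.snode = set(child.snode)  # head word of the constituent
            break
    return node


def as_intervals(positions):
    """Compress a set of positions into inclusive (start, end) pairs."""
    intervals = []
    for p in sorted(positions):
        if intervals and p == intervals[-1][1] + 1:
            intervals[-1] = (intervals[-1][0], p)
        else:
            intervals.append((p, p))
    return intervals


# Assumed subtree for 'as barracas de praia' (word positions 3 to 6).
np_node = annotate(Node('NP', children=[
    Node('ART', 3),
    Node('N', 4),
    Node('PP', children=[Node('PRP', 5), Node('N', 6)]),
]))

print(as_intervals(np_node.stree))  # [(3, 6)]
print(as_intervals(np_node.snode))  # [(4, 4)]
```

Keeping positions as sets until the last moment keeps the union step trivial, and a discontinuous constituent simply yields more than one pair from as_intervals.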
Figure</Paragraph> </Section> <Section position="5" start_page="1" end_page="11" type="sub_section"> <SectionTitle> Associating Translation Correspondences </SectionTitle> <Paragraph position="0"> In this process, we search for alignments between the constituent units of the Portuguese sentence and the corresponding translation fragments of the Chinese sentence, proceeding bottom-up through the tree. The search uses the possible lexical correspondences from a bilingual dictionary, together with the grammatical categories of the lexical words tagged in the previous stage, to generate the initial candidate alignments. Figure 7 presents the initial lexical alignments.</Paragraph> <Paragraph position="1"> sponding words.</Paragraph> <Paragraph position="2"> Based on the possible word correspondences, the associated structure of the Portuguese sentence, and the grammatical category information, the search proceeds to align phrases of gradually increasing length (phrasal correspondences at the different levels of the constituent tree) according to the following criteria.</Paragraph> <Paragraph position="3"> First, for any unaligned word sequence &quot;w &quot; (including the bounding words or phrases) as the corresponding substring for the parent node that immediately dominates the daughter nodes, such that STC(N)</Paragraph> <Paragraph position="5"> Second, when the unaligned fragment is not bounded by any aligned units, our approach relies on the assumption that if two sets of sentence constituents (from the source and target sentences) correspond, their grammatical categories and the number of constituents should be consistent. The essential idea of the search is to look for the intermediate levels at which the constituent units of the structure of the Portuguese sentence and the lexical words of the Chinese sentence can be projected one-to-one. We use the previous example &quot;Onde ficam as barracas de praia? 
(Where are the bathhouses?)/ Geng Yi Shi Zai Na Li ?&quot; to illustrate the search strategy. Besides the corresponding lexical items, e.g.</Paragraph> <Paragraph position="6"> &quot;Onde / Na Li&quot; and &quot;Ficam / Zai&quot;, which can be determined with the aid of a given dictionary, the process proceeds bottom-up through the tree, considering only the unmatched items and checking whether the assumption holds. For example, at the leaf level, the differing numbers of lexical items (&quot;as N &quot;) violate the assumption. The process repeats the check at the next higher level of the representation structure of the Portuguese sentence. As illustrated in Figure 8, the alignment can be identified only at the level where the number and part of speech of the constituent units are consistent with those of the lexical items in the Chinese sentence (&quot;[Geng Yi Shi ] N &quot;). Consequently, the correspondences between the associated structure of the Portuguese sentence and the translation fragments of the Chinese sentence can be determined and established. Any node in the structure that has no translation equivalent is assigned the &quot;empty (O)&quot; interval for STC(N).</Paragraph> <Paragraph position="7"> Third, the crossing constraint for a constituent node in the representation tree is determined by examining the order of the translation correspondences of the spanning nodes against the order in which they appear in the Chinese sentence. Any node representing a Portuguese phrase whose corresponding translation is derived from its daughters by inverting their translations is marked by assigning a Boolean value to INVERT(N) attached to the node. In the graphical annotation, a horizontal line indicates the inversion. As demonstrated in Figure 9, the corresponding translations of the daughters of node S are crossed between the Portuguese sentence and its Chinese translation. 
The corresponding translation &quot;Zai Su Song Gong Kai Qian&quot; of its second daughter appears before the translation &quot;Xing Wei&quot; of the first daughter node in the target translation of the Portuguese sentence. Hence the inversion property of the constituent node in the syntactic structure of the source sentence is determined.</Paragraph> <Paragraph position="8"> Finally, if the TCT representation generated by the previous process needs further editing, a TCT editor can be used to make the necessary amendments. Figure 10 presents the final TCT structure describing a translation example.</Paragraph> <Paragraph position="9"> translation example &quot;Actos anteriores a publicidade da accao (Publicity of action prior to acts) / Zai Su Song Gong Kai Qian Suo Zuo Zhi Xing Wei&quot;.</Paragraph> </Section> <Section position="6" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 3.2 Translation Equivalents </SectionTitle> <Paragraph position="0"> Through the translation corresponding structure used to represent translation examples in the bilingual knowledge base, the translation units between the Portuguese sentence and its Chinese translation are explicitly expressed by the sequence intervals STREE(n) and STC(n) encoded in the nodes of a TCT structure, which may represent both phrasal and lexical correspondences. For instance, in the translation example annotated under the TCT representation schema in Figure 10, the Chinese translation &quot;Su Song&quot; of the Portuguese word &quot;accao&quot; is denoted by [STREE(n)=6/ STC(n)=2-3] in the terminal node. 
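Reading a translation equivalent off a node amounts to selecting the source words under STREE(n) and the target characters under STC(n). The sketch below assumes, purely for illustration, that each node is given as a pair of interval lists with inclusive 1-based (start, end) positions; the function names select and translation_equivalents are hypothetical. The only node listed is the terminal node for 'accao' with the intervals quoted above.

```python
def select(tokens, interval):
    """Pick tokens by a list of inclusive, 1-based (start, end) pairs."""
    picked = []
    for start, end in interval:
        picked.extend(tokens[start - 1:end])
    return picked


def translation_equivalents(nodes, source_words, target_chars):
    """Map each node's source substring to its target substring.

    `nodes` is an iterable of (STREE, STC) interval pairs; nodes whose
    STC is empty have no translation equivalent and are skipped.
    """
    table = {}
    for stree, stc in nodes:
        if not stc:
            continue
        src = ' '.join(select(source_words, stree))
        table[src] = ' '.join(select(target_chars, stc))
    return table


source_words = ['Actos', 'anteriores', 'a', 'publicidade', 'da', 'accao']
target_chars = ['Zai', 'Su', 'Song', 'Gong', 'Kai', 'Qian',
                'Suo', 'Zuo', 'Zhi', 'Xing', 'Wei']

# Terminal node for 'accao' as given in the text: STREE=6, STC=2-3.
nodes = [([(6, 6)], [(2, 3)])]

print(translation_equivalents(nodes, source_words, target_chars))
# {'accao': 'Su Song'}
```

The same lookup applies unchanged to higher nodes, since a phrasal constituent differs only in having wider intervals.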
For phrasal translations, we may visit the higher-level constituents in the TCT structure and apply the same interval information to retrieve the corresponding translation of the unit representing a phrasal constituent of a sentence.</Paragraph> <Paragraph position="1"> Each TCT structure is indexed by its nodes in the bilingual knowledge base, so that the represented examples can be consulted efficiently.</Paragraph> </Section> </Section> </Paper>