<?xml version="1.0" standalone="yes"?> <Paper uid="J97-2004"> <Title>A Class-based Approach to Word Alignment</Title> <Section position="4" start_page="332" end_page="335" type="metho"> <SectionTitle> 4. Experiments with ClassAlign </SectionTitle> <Paragraph position="0"> To assess the proposed method's effectiveness, we have implemented the algorithms described in Section 3 and conducted a series of experiments. Tests are performed on the sentences found in the LecDOCE and a user's manual available in both languages to assess the method's robustness and generality. The similarities and differences between English and Mandarin texts are briefly reviewed, since our experiments involve the alignment of English-Mandarin parallel corpora. A general description of the materials used in the experiments follows. Finally, the success rates are quantitatively evaluated.</Paragraph> <Section position="1" start_page="332" end_page="335" type="sub_section"> <SectionTitle> 4.1 Contrastive Analysis of English and Mandarin Chinese </SectionTitle> <Paragraph position="0"> Language typology is the study of similarities and differences between languages, formalized in terms of parameters such as word order and morphological structure.</Paragraph> <Paragraph position="1"> Li and Thompson (1981) examine Mandarin Chinese according to four typological parameters that reveal the basic structure of Mandarin Chinese as compared to that of other languages, English in particular. These four parameters are the morphological structure of words, the number of syllables per word, topic prominence, and word order. Li and Thompson's typological description of Mandarin is summarized below, from the perspective of the task of word alignment.</Paragraph> <Paragraph position="2"> 8 A small percentage of connections (7.8%) in our evaluation are incomplete ones and are considered to be correct. Melamed (1996) takes the same stance in his study of deriving a probabilistic lexicon. 
He observes that even incomplete entries are useful for many applications and there are ways of expanding incomplete morphemes or words in a connection, so that they become complete (Smadja 1992).</Paragraph> <Paragraph position="3"> Compared with English, Mandarin is notable for the relative simplicity of its word structure. That is, most Mandarin words consist of a single morpheme rather than a stem morpheme and a suffix serving grammatical functions such as case (as in Turkish and Japanese), number, agreement, or tense (as in many other languages including English). Mandarin verbs do have aspect morphemes, including 了 (-le, perfective), 過 (-guo, experienced action) and 著 (-zhe, durative). Other grammatical functions are either non-existent or expressed through an additional function word. In contrast to this lack of inflectional morphological complexity, Mandarin is relatively rich in other types of morphological combinations, including compounding.</Paragraph> <Paragraph position="4"> These morphological differences result in a difference in the number of words between an English sentence and its Mandarin translation. In terms of alignment, this word-number difference means that multiword connections must be considered, a task which is beyond the reach of methods proposed in recent alignment work based on Brown et al.'s (1993) Models 1 and 2. Sue J. Ker and Jason S. Chang Word Alignment</Paragraph> <Paragraph position="5"> Basic Orientation of the Sentence: Topic vs. Subject. Another feature distinguishing Mandarin from other languages is topic prominence. In addition to the grammatical relation of subject, a description of Mandarin must include the topic element, which can be characterized as follows: First, a topic always comes first in the sentence and is optionally followed by a pause in speech. Second, a topic is the old information of which both the speaker and listener have some knowledge. 
Third, what distinguishes a topic from a subject is that the subject must always have a direct syntactic and semantic relation with the verb, but the topic does not need to. For instance, in the sentence (E13, C13), the first word 大象 (daxiang, 'elephant') is the topic and the second word 鼻子 (bizi, 'nose') is the subject; 大象 'elephant' is the focus of the discourse, but it is the subject 鼻子 'nose' that is very long, not 大象 'elephant'.</Paragraph> <Paragraph position="6"> (E13) The elephant has a very long nose.</Paragraph> <Paragraph position="7"> (C13) 大象 鼻子 很 長。 Daxiang bizi hen chang Elephant nose very long The topic prominence of Mandarin sentences gives rise to alignment connections with a large distortion in position, making it difficult to estimate the likelihood of a connection according to translational position.</Paragraph> <Paragraph position="8"> Word Order. Greenberg (1963) stated that the world's languages fall into three word order groups according to the order of the subject (S), verb (V), and object (O) in a simple transitive sentence. A language, in general, belongs to one of three basic word order types, SVO, SOV, and VSO. By this notion, English is an SVO language in which the verb typically follows the subject and precedes the object. For most languages, other aspects of word order, such as that of modifier and modified elements, correlate with the order of V and O. However, Mandarin is not an easy language to classify according to this typology for a number of reasons. First, the notion of subject is not well-defined. Second, unlike in English, word order in Mandarin is not determined solely on grammatical grounds but rather depends on semantics. For instance, whether an adverbial expression appears in pre- or postverbal position depends on subtle semantic differences. 
More specifically, a time phrase in preverbal position tends to denote punctual time, while that in postverbal position signals durative time, as in: In contrast, both kinds of time phrase appear in postverbal position in English. As a result of facts such as these, many linguists contend that Mandarin is a language in transition from SVO to SOV. Further details can be found in Li and Thompson (1981). Computational Linguistics Volume 23, Number 2 Similar to the situation created for topic prominent sentences, the SOV features of Mandarin represent a deviation from the SVO order of English. Such a deviation further weakens our ability to estimate the likelihood of a connection according to translational position.</Paragraph> </Section> <Section position="2" start_page="335" end_page="335" type="sub_section"> <SectionTitle> 4.2 The Experimental Setup </SectionTitle> <Paragraph position="0"> The experimental results obtained from the proposed algorithm with respect to word alignment are presented in this section. Nearly 42,000 example sentences and their translations from the LecDOCE were used as training data, primarily to acquire rules and to determine maximum likelihood estimates for the cases of LTP and DP. The algorithm's performance was evaluated using two sets of data. The closed test set consists of 200 examples and their Mandarin translations randomly selected from the LecDOCE.</Paragraph> <Paragraph position="1"> The English examples range from 8 to 23 words long; average example length is 11.5 words. There are, on average, 1.56 inversions per example-translation pair. The open test set consists of 200 sentences randomly drawn from the English and Chinese versions of the LightShip User's Guide. The English sentences in this test set range from 4 to 34 words long; average sentence length is 11.8 words. There are, on average, 1.60 inversions per sentence pair. 
Table 16 provides some examples from the LightShip User's Guides.</Paragraph> <Paragraph position="2"> The two thesauri, LLOCE and CILIN, are used as the classification systems of source and target words. The LLOCE contains 23,769 entries and CILIN contains 63,754 entries. Both thesauri cover just over 90% of the words in the test sets.</Paragraph> </Section> <Section position="3" start_page="335" end_page="335" type="sub_section"> <SectionTitle> 4.3 Evaluation </SectionTitle> <Paragraph position="0"> The first three experiments were designed to demonstrate the effectiveness of the naive DictAlign algorithm based on a bilingual MRD. According to the experimental results, although DictAlign produces high-precision alignment, the coverage for both test sets is below 30%. However, if the thesaurus effect is exploited, the coverage can be increased considerably, at the cost of a decrease of less than 4% in precision. Table 17 provides further details.</Paragraph> <Paragraph position="1"> In the fourth experiment, the ClassAlign algorithm is employed to align both sets of test data again. Table 18 reveals that the acquired conceptual information compensates for what is lacking in the LecDOCE to yield optimum alignment results. The ClassAlign algorithm expands coverage almost twofold to over 80%, while maintaining the same level of precision. The generality of the approach is evident from the open test's comparably high coverage and precision rates. As shown in Table 18, over 80% of the source words in both test sets are connected to a target and over 90% of the connections are true ones.</Paragraph> </Section> </Section> <Section position="5" start_page="335" end_page="339" type="metho"> <SectionTitle> 5. 
Discussion </SectionTitle> <Paragraph position="0"> This section thoroughly analyzes the alignment results from the experiments described in Section 4 and, in particular, the data relating to cases where the algorithms failed.</Paragraph> <Paragraph position="1"> Analytical results demonstrate the strengths and limitations of the methods and suggest possible improvements to the algorithms.</Paragraph> <Section position="1" start_page="335" end_page="338" type="sub_section"> <SectionTitle> 5.1 Compounding in Mandarin </SectionTitle> <Paragraph position="0"> As stated earlier, the compounding effect in Mandarin frequently results in a change in the number of words between an English sentence and its Mandarin translation.</Paragraph> <Paragraph position="1"> The correct alignment decision for a Mandarin compound frequently involves more</Paragraph> <Paragraph position="3"> ClassAlign incorrectly connects the compound @Ira in (C16) to a single English word company according to the alignment rule (Co292, Din07).</Paragraph> <Paragraph position="5"> Other methods for aligning English and Mandarin texts in the literature also fall prey to the problem of Mandarin compounds. For instance, the following partially correct connections complicated by compounding are reported in a recent study on alignment of Hong Kong Basic Law (Fung and McKeown 1994).</Paragraph> <Paragraph position="6"> tence (Fung and McKeown 1994; Wu and Xia 1994), ClassAlign avoids most instances of these errors. 
In addition, with elaborate preprocessing such as parsing, phrase grouping, and collocation analysis (Smadja 1992), the problem of word-number difference can be averted by performing alignment at various levels: parse tree (Matsumoto, Ishimoto, and Utsuro 1993; Meyers, Yangarber, and Grishman 1996), phrase (Kupiec 1993), and collocation (Smadja, McKeown, and Hatzivassiloglou 1996).</Paragraph> </Section> <Section position="2" start_page="338" end_page="339" type="sub_section"> <SectionTitle> 5.2 Function Words, Collocation, and Free Translation </SectionTitle> <Paragraph position="0"> Language-Specific Function Words. The morphological differences between English and Mandarin give rise to many language-specific function words. Such Mandarin function words are often quite ambiguous in part of speech as well as in word sense, leading to numerous alignment errors. For instance, ClassAlign connects the words for and of in (E20) erroneously to the morphemes T and ~ in (C20), respectively. Table 19 presents further details.</Paragraph> <Paragraph position="1"> (E20) He abdicated all responsibility for the care of the child.</Paragraph> <Paragraph position="2"> (C20) ~ ~ iF ~, @-~,\], ~x ~-- ty-J ~ ~ o Collocation. As mentioned in the previous section, collocation is one of the reasons why in-context translation usually deviates from the dictionary translation. However, unlike other deviations, bilingual collocation is not easily bounded within a couple of classes. For instance, the translation for take (Mb051, carrying, taking and bring) in the collocation take effect is usually ~ ('see') (Fc04, seeing and looking), as in example (E21, C21). However, there is insufficient evidence to support a class-to-class mapping from Mb051 to Fc04. In any case, deriving the Mb051-to-Fc04 mapping would be an overgeneralization.</Paragraph> <Paragraph position="3"> (E21) How soon does the medicine take effect?</Paragraph> <Paragraph position="5"> Paraphrased and Free Translations. 
For various reasons, such as language typology, style, and cultural differences, a translator does not always translate literally on a word-by-word basis. Adding and deleting words is commonplace, sometimes resulting in a paraphrased or free translation. Such translations obviously create problems for word alignment. For instance, in example (E24, C24), only one word, I, is translated literally, into ~. The main verb angle in example (E25) is given a paraphrased translation ~g~ ('to change the angle'). The noun phrase the people she is speaking to in (E25) is paraphrased as ~ 'audience.' A significant amount of free translation arises due to the use of four-morpheme Mandarin idioms for stylistic reasons. For instance, the clause as long as I breathe in (E22) translates into an idiom ~'al~.%Zde~ and the sentence (E23) translates into ,,k,~). Such free or paraphrased translations are beyond the reach of the proposed method.</Paragraph> <Paragraph position="6"> (E22) I shall love you as long as I breathe!</Paragraph> </Section> <Section position="3" start_page="339" end_page="339" type="sub_section"> <SectionTitle> 5.3 Class-based versus Word-based Models </SectionTitle> <Paragraph position="0"> ClassAlign achieves a degree of generality in the sense that a true connection can be identified, even when it occurs only rarely, or not at all, in the training corpus.</Paragraph> <Paragraph position="1"> This kind of generality is unattainable with statistically trained word-based models.</Paragraph> <Paragraph position="2"> Moreover, class-based models offer the advantages of a smaller storage requirement and higher system efficiency. Unfortunately, they have the disadvantage of erroneous overgeneralization from word-specific connections. 
For instance, due to the acquired mapping from Gg273 (element of sound in language) to Bg07 (sound, tone, etc.), the word accent in (E26) is connected erroneously to ~\[~ ('syllable') in (C26).</Paragraph> <Paragraph position="3"> (E26) The accent in the word &quot;important&quot; is on the second syllable.</Paragraph> <Paragraph position="4"> (C26) Important ~.~-~-~i~3~?~PS~\[~.~ o Nevertheless, our experiment has shown that the advantages outweigh the disadvantages, at least for this particular formulation of a class-based approach to alignment.</Paragraph> </Section> </Section> <Section position="6" start_page="339" end_page="340" type="metho"> <SectionTitle> 6. Concluding Remarks </SectionTitle> <Paragraph position="0"> In this paper, we have presented an algorithm capable of identifying words and their in-context translations in a bilingual corpus. The algorithm is effective for specific linguistic reasons. First, a significant majority of words have diversified translations that are not found in a bilingual dictionary or statistically derived lexicon but that are largely bounded within the word classes in thesauri. Therefore, we contend that a more successful alignment can be achieved using a class-based approach. Our assumption seems to hold, for the experiments in this study demonstrate that the method provides broad-coverage alignment with almost no loss in precision.</Paragraph> <Paragraph position="1"> In a broader sense, we have shown that thesauri and corpora can be used in combination to address the critical issues of generality and efficiency. The thesaurus provides classification that can be used to generalize the empirical knowledge gleaned from a corpus. 
The corpus provides training and testing materials, thereby allowing knowledge to be derived and evaluated objectively.</Paragraph> <Paragraph position="2"> The algorithm's performance could definitely be improved by enhancing the various modules of the algorithms, e.g., morphological analyses, bilingual dictionary, monolingual thesauri, and rule acquisition. Nevertheless, this work presents a functional core for processing bilingual corpora at lexical and conceptual levels.</Paragraph> <Paragraph position="3"> While this paper has specifically addressed English-Chinese corpora, the linguistic issues motivating the algorithms seem to be quite general and are, to a large extent, language independent, which means that the algorithm presented here should be adaptable to other language pairs. The prospects for English-Japanese or Chinese-Japanese, in particular, seem highly promising.</Paragraph> </Section> </Paper>