<?xml version="1.0" standalone="yes"?> <Paper uid="W01-0809"> <Title>Generation of Vietnamese for French-Vietnamese and English-Vietnamese Machine Translation</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Brief description of ITS3 </SectionTitle> <Paragraph position="0"> ITS3 (Wehrli, 1992; Etchegoyhen &amp; Wehrli, 1998; L'haire &amp; al, 2000) can now translate from French to English and vice versa. Modules for other languages, such as German and Italian, are under development. ITS3 is a principle-based system, linguistically inspired by Government &amp; Binding (GB) theory. (See e.g.</Paragraph> <Paragraph position="1"> Haegeman (1994) for an introduction to GB, and Berwick &amp; al (1991) for principle-based systems.) The system follows the classical analysis-transfer-generation approach to MT (see Hutchins &amp; Somers, 1992). ITS3 works on single, isolated sentences. A sentence in the source language is analyzed into a logico-linguistic structure, called a pseudo-semantic structure (PSS). After a lexical transfer phase, this PSS is passed to the generation phase, which finally produces the sentence in the target language. By default, ITS3 gives a single solution, the best one.</Paragraph> <Paragraph position="2"> Consider an example of French-English translation that illustrates the process. The analysis phase consists of two steps: GB-based syntactic analysis and PSS construction. Syntactic analysis is carried out by the IPS parser (Wehrli, 1992), which builds the X-bar structure of the sentence, using many filtering constraints (on thematic roles, on cases, etc.) to reduce overgeneration. (1) La maison a été vendue.</Paragraph> <Paragraph position="3"> (2) [TP [DP la [NP maison]]i [T' a [VP été [VP vendue [DP ei]]]]] A PSS is then derived from the results of the syntactic analysis (Etchegoyhen &amp; Wehrli, 1998).</Paragraph> <Paragraph position="4"> Components of the sentence are represented in corresponding frame-like structures. 
For example, a clause gives rise to a PSS of type CLS, which contains the main verb or adjective (the Predicate slot) and other information on tense, mood, voice, etc., as well as the PSSs of its arguments and adjuncts (the Satellites).</Paragraph> <Paragraph position="5"> Similarly, a noun phrase gives rise to a PSS of type DPS, which contains, besides the main noun (the Property slot), its number, gender, referential index for binding resolution, etc. A PSS thus contains abstract linguistic values for &quot;closed&quot; features (tense, mood, voice, number, gender, etc.), and lexical values for &quot;open&quot;</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> ]PSS </SectionTitle> <Paragraph position="0"> In the lexical transfer phase, the lexical units in the PSS are replaced by those of the target language, using frequency data for translation selection. In the generation phase, a generic engine called GBGEN (Etchegoyhen &amp; Wehrli, 1998; Etchegoyhen &amp; al, 1999) cooperates with language-specific modules to construct the output from the PSS in three steps. First, D-structure generation maps the PSS into an X-bar structure in a top-down fashion (see 3a). Next, S-structure generation carries out movements and bindings (3b). Finally, morphological realization is performed (3c), and the result is output, as in (3d).</Paragraph> <Paragraph position="1"> (3) (a) [CP [TP [VP aux [VP aux [VP sell [DP the [NP house]]i]]]]] (b) [CP [TP [DP the [NP house]]i [T' [VP aux [VP aux [VP sell [DP ei]]]]]]] (c) [CP [TP [DP the [NP house]]i [T' [VP has [VP been [VP sold [DP ei]]]]]]] (d) The house has been sold.</Paragraph> <Paragraph position="2"> Note that ITS3 performs only lexical transfer, not structural transfer. The approach can therefore be considered half transfer, half interlingual. It is not the purpose of this paper to discuss the pros and cons of the transfer and interlingual approaches to MT. See e.g. 
Gdaniec (1998) for a discussion of the advantages of a particular transfer-based MT system, and Dorr (1993) for an interlingual one. The latter, also based on GB, concentrates on treating mismatches across languages, an issue given less attention in ITS3. However, it needs very complex representations for its interlingual approach, and is therefore unlikely to become a practical system. As for the specification issue, ITS3 is purely procedural. All generic engines and language-specific modules are written in Modula-2. Procedure templates are designed so that one can fill in language-specific parameters when adding a new language. However, this is not always straightforward, as the integration of Vietnamese below will show. In general, any development requires reading, understanding, and often modifying parts of a very large body of code. This is an important reason why a declarative approach would be preferred (see e.g. Emele &amp; al, 1992; Nicolov &amp; Mellish, 2000).</Paragraph> <Paragraph position="3"> Unfortunately, we do not have at our disposal any declarative system with high-quality French analysis. Also, to the best of our knowledge, no parallel French-Vietnamese or English-Vietnamese corpora have been built so far that would make statistical or example-based MT approaches feasible.</Paragraph> <Paragraph position="4"> ITS3 is one of the few systems that can perform French syntactic analysis with large lexical and grammatical coverage. It can therefore serve our main purpose, which is to develop a prototype of French-Vietnamese MT in a short time.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Generation of Vietnamese </SectionTitle> <Paragraph position="0"> In this section, we present the problems and our solutions for constructing Vietnamese NPs, VPs, AdvPs, relative clauses, etc. in ITS3. 
Below we use the generalized GB notions of NP and VP, namely DP and TP, respectively.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 DP construction </SectionTitle> <Paragraph position="0"> 3.1.1 Vietnamese noun categorizers Many Vietnamese nouns have to be preceded by a &quot;categorizer&quot; to form an NP. For example, knowing that &quot;a&quot; = &quot;một&quot; and &quot;cat&quot; = &quot;mèo&quot;, we cannot translate &quot;a cat&quot; into *&quot;một mèo&quot;, but must use &quot;một con mèo&quot;. Here &quot;mèo&quot; needs the categorizer &quot;con&quot;. A categorizer is itself a noun, giving a vague idea of the semantic class of the noun that requires it. For example, almost every noun designating an animal needs &quot;con&quot;.</Paragraph> <Paragraph position="1"> However, there seems to be no general rule for determining the categorizer of a particular noun.</Paragraph> <Paragraph position="2"> We therefore specify the categorizer for each noun in the Vietnamese lexicon. This information helps to form Vietnamese NPs appropriately, e.g. &quot;a cat&quot; gives rise to (4) [DP một [NP con [NP mèo]]], but &quot;a language&quot; to (5) [DP một [NP ngôn ngữ]], because &quot;ngôn ngữ&quot; needs no categorizer. One important task in DP construction for many languages is to ensure agreement (in number, gender, etc.). Vietnamese words are morphologically invariant with respect to all these concepts. 
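The categorizer treatment described above, as in examples (4) and (5), can be sketched in Python as follows. This is only an illustrative sketch, not the ITS3 implementation (which is written in Modula-2): the LEXICON table and the functions build_np and build_dp are hypothetical names, and the Vietnamese forms are assumed readings of the examples.

```python
# Hypothetical lexicon: each Vietnamese noun lists its categorizer, or None.
LEXICON = {
    "mèo": "con",        # "cat" requires the animal categorizer "con"
    "ngôn ngữ": None,    # "language" requires no categorizer
}


def build_np(noun, lexicon=LEXICON):
    """Return the bracketed NP, adding a categorizer layer when the lexicon asks for one."""
    categorizer = lexicon.get(noun)
    core = f"[NP {noun}]"
    if categorizer is None:
        return core
    return f"[NP {categorizer} {core}]"


def build_dp(determiner, noun, lexicon=LEXICON):
    """Project a DP from the determiner, with the NP as its complement."""
    return f"[DP {determiner} {build_np(noun, lexicon)}]"
```

With this sketch, build_dp("một", "mèo") yields the structure in (4), and build_dp("một", "ngôn ngữ") the one in (5).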
For plural DPs, we need to add an appropriate determiner: a quantifier if it is specified (&quot;two students&quot; = [DP a38 a1a36a39a41a40a43a42a41a44a46a45a47a39 a31 a38</Paragraph> <Paragraph position="4"> GBGEN assumes a one-to-one mapping in which a determiner in a language corresponds to a universal operator and vice versa, e.g. (English / French / Operator): each / chaque / every; this, these / ce, cette, ces / demonstrative; no / aucun, aucune / no. &quot;Ces chats&quot;, e.g., is analyzed into a PSS like (note the Operator slot): After &quot;chat&quot; is replaced by &quot;cat&quot;, this gives [DP these [NP cats]]. This model does not fully apply to Vietnamese DPs. Some operators correspond to a determiner as prescribed by the model. Some do not, but require instead an adjective after the noun, and some others need both a determiner and an adjective.</Paragraph> <Paragraph position="5"> Constructing Vietnamese DPs in the generic model of GBGEN turns out to be somewhat problematic. First, the procedure template for deriving the determiner from the DPS Operator slot does not expect that there may be an adjective after the noun. Modifying this procedure template would force changes in the modules for all other languages of the system. Moreover, this would not guarantee that the template is generic enough for every human language. Second, the generic model evidently provides no facility for treating Vietnamese categorizers. We therefore found it more convenient to develop a specialized (footnote 1) procedure for Vietnamese DP construction. This allows a safe treatment of Vietnamese DPs while still respecting the existing system. This procedure computes the determiner and post-nominal adjective from the Operator and Number slots of the DPS. A DP is then projected from the determiner. Its NP complement is built from the main noun (the Property slot in the DPS). 
If the noun needs a categorizer, which is given in its lexical entry, the NP has the structure [NP Categorizer [NP Main]]; otherwise it is simply [NP Main]. Finally, the post-nominal adjective is added as a complement of the NP.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 TP construction </SectionTitle> <Paragraph position="0"> The principal strategy of GBGEN for TP construction is to create the following general frame and to attempt to fill it gradually with appropriate elements. [Footnote 1: As understood in the object-oriented paradigm.]</Paragraph> <Paragraph position="1"> [TP [T' Modal [VP Perfective [VP Passive [VP Progressive [VP Main]]]]]] Here Modal, Perfective, Passive, and Progressive stand for auxiliary verbs representing respectively the modal, perfective, passive, and progressive aspects of the TP, and Main is the main verb. See example (3) above. This model seems to work at least for French and English. However, Vietnamese differs from these languages in many respects regarding verbal notions and VP formation, as presented below.</Paragraph> <Paragraph position="2"> In Vietnamese, verbs are not conjugated, and tense and aspect are generally understood from context. &quot;He sleeps&quot;, &quot;He slept&quot;, and &quot;He is sleeping&quot;, e.g., can all be translated in suitable contexts into &quot;a48a50a49a23a51a53a52a2a54a55a49a57a56a59a58a60 &quot;. To make tense and aspect explicit, Vietnamese uses adverbs, as shown below.</Paragraph> <Paragraph position="3"> The Negation slot of a CLS specifies whether or not it is in negative form. The Modality slot contains an abstract value for the modality of the verb, e.g. possibility corresponds to English &quot;can&quot; and French &quot;pouvoir&quot;, obligation to &quot;must&quot; and &quot;devoir&quot;. 
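The general frame given at the start of this section can be illustrated with a small Python sketch. It is a hypothetical illustration only, not GBGEN code: the function fill_tp_frame and the slot names are assumptions, the modal is placed directly under T' as in the frame, and empty slots are simply skipped.

```python
# VP layers of the generic frame, outermost first (the modal sits under T').
VP_LAYERS = ["perfective", "passive", "progressive", "main"]


def fill_tp_frame(slots):
    """Nest the filled slots as VP layers inside [TP [T' ...]], skipping empty ones."""
    if not slots.get("main"):
        raise ValueError("the Main slot must be filled")
    vp = ""
    # Build from the innermost layer (the main verb) outwards.
    for name in reversed(VP_LAYERS):
        word = slots.get(name)
        if word:
            vp = f"[VP {word} {vp}]" if vp else f"[VP {word}]"
    modal = slots.get("modal")
    head = f"{modal} " if modal else ""
    return f"[TP [T' {head}{vp}]]"
```

For the auxiliaries of example (3c), fill_tp_frame({"perfective": "has", "passive": "been", "main": "sold"}) returns [TP [T' [VP has [VP been [VP sold]]]]], leaving aside the subject DP and its trace.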
GBGEN assumes an orthogonal combination of negation and modality; it inserts &quot;not&quot; after the modal verb for English, or &quot;ne&quot; and &quot;pas&quot; around it for French. In Vietnamese, one generally adds the adverb &quot;không&quot; before the verb to form a negation. Evidently, this orthogonal model will cause trouble in translation, because a modal verb in negative form may have different logical interpretations from one language to another.</Paragraph> <Paragraph position="4"> For example, &quot;must&quot; = &quot;a58a44a1a21a20 a59a11a16 &quot;, &quot;a26a61a60a62a30a64a63 a47 a28a24a30 a6 &quot; = At the moment, the specifications in the PSS do not make it possible to determine the logical interpretation of a negated modal verb. Pending an improvement of GBGEN on this issue, we implement a temporary solution that helps to translate negative modal verbs from English and French specifically into Vietnamese. The appropriate Vietnamese negative modal verb form is derived not only from the Modality slot of the CLS in question, as done in GBGEN, but also by examining its Negation slot.</Paragraph> <Paragraph position="5"> Passivization is realized in Vietnamese by adding &quot;được&quot; or &quot;bị&quot; before the verb. &quot;Bị&quot; is used when the subject suffers a bad effect from the action; otherwise &quot;được&quot; is used. We put &quot;được&quot; or &quot;bị&quot; in the specifier of the VP, i.e. [Spec, VP]. The choice between &quot;được&quot; and &quot;bị&quot; for a verb is treated as a lexical one, and is stored in the Vietnamese lexicon.</Paragraph> <Paragraph position="6"> The lexical transfer procedure in ITS3 does not take into account the interaction between the components of the sentence when it translates the lexical units in the PSS. 
In particular, the English &quot;be&quot; is always translated into the French &quot;être&quot;, and vice versa. However, to translate be/être into Vietnamese, one has to distinguish several cases. For the first case, it suffices to test the theta role of the complement of the verb in the PSS, which should be THEME, to obtain the right translation &quot;a162 a20 a83 &quot;. In the last two cases, whether to use &quot;a47a51a1a42a144 &quot; or &quot;a162 a20 a83 &quot; a128a21a163a37a109 a134a21a132 a112a123a111 a134 a163 is too delicate to explain, as it concerns pragmatic issues. We decide to put From the discussion above, it does not seem very natural to follow the construction order of GBGEN in building Vietnamese TPs, nor to reuse some of its pre-designed procedure templates, such as the one for selecting auxiliary verbs. Instead, we need to implement a different strategy. First, a simple frame [TP [T' [VP ...]]] is built as the D-structure. Verbal information, such as tense, aspect, modality, and negation, is gathered from the PSS as far as possible. The complete TP is then constructed from the combination of the gathered information, in an order particular to Vietnamese. The adverb representing the tense/aspect of the clause, if there is one, occupies the head position of the TP. The modal, passive, and main verbs make up layers of VPs in the TP.</Paragraph> <Paragraph position="7"> The values of negation and modality are computed together. The maximal frame looks like: [TP [T' Tense [VP Negation [V' Modal [VP Passive [V' Main]]]]]] For example, for the sentence (16) Il n'a pas pu être tué. (He could not be killed.) 
the past tense gives &quot;đã&quot;, the negation and the modality combine to give &quot;không thể&quot; (footnote 4), and the passive gives &quot;bị&quot; by consulting the lexical entry of the verb &quot;giết&quot;:</Paragraph> <Paragraph position="9"> In particular, if the main verb is a translation of be/être (checked with a bit in the lexical entry), its complements are examined to give the right translation.</Paragraph> <Paragraph position="11"/> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Other constructions 3.3.1 AdvP location </SectionTitle> <Paragraph position="0"> In ITS3, a large set of adverbs and, more generally, adverbial phrases (AdvPs) are classified into semantic groups, each identified by a value. For example, English &quot;much&quot; and French &quot;beaucoup&quot; are assigned the abstract value degree. GBGEN uses this information to place the generated AdvP in an appropriate position.</Paragraph> <Paragraph position="1"> This generic approach is not perfect. For example, the equivalent adverbs &quot;where&quot; (English), &quot;où&quot; (French), and &quot;ở đâu&quot; (Vietnamese) all have the where value, and would be moved to [Spec, CP] of the subordinate clause. This would give a bad Vietnamese sentence (20). The correct one is (21).</Paragraph> <Paragraph position="2"> (18) I know [CP [AdvP where]i [C' [TP he [T' [VP sleeps [AdvP ei]]]]]].</Paragraph> <Paragraph position="3"> (19) Je sais [CP [AdvP où]i [C' [TP il [T'</Paragraph> <Paragraph position="5"> This example shows that AdvP location should be language-specific and lexicalized. The generic procedure is in fact just a specialized one, valid for some class of languages. 
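The idea of language-specific, lexicalized AdvP location can be sketched in Python as follows. This is a hypothetical illustration, not the GBGEN procedure: the tables, the function names, and the site labels "spec_cp" and "in_situ" are assumptions. In particular, leaving the Vietnamese where adverb in place inside the VP is an assumption of this sketch, since the text only states that fronting it gives a bad sentence.

```python
# Hypothetical generic table: semantic value of the adverb -> landing site.
GENERIC_PLACEMENT = {"where": "spec_cp", "why": "spec_cp"}

# Hypothetical language-specific overrides of the generic table.
PLACEMENT_OVERRIDES = {
    "vietnamese": {"where": "in_situ"},  # assumption: "where" stays in the VP
}


def advp_site(language, value):
    """Look up the landing site, letting a language override the generic choice."""
    override = PLACEMENT_OVERRIDES.get(language, {})
    return override.get(value, GENERIC_PLACEMENT.get(value, "in_situ"))


def place_advp(language, value, adverb, subject, verb):
    """Return the bracketed subordinate clause with the AdvP at its landing site."""
    if advp_site(language, value) == "spec_cp":
        # Moved AdvP in [Spec, CP], coindexed with a trace in the VP.
        return f"[CP [AdvP {adverb}]i [C' [TP {subject} [T' [VP {verb} [AdvP ei]]]]]]"
    # AdvP left in place inside the VP.
    return f"[CP [C' [TP {subject} [T' [VP {verb} [AdvP {adverb}]]]]]]"
```

For English, place_advp("english", "where", "where", "he", "sleeps") reproduces the subordinate clause of (18), while the Vietnamese override keeps the adverb inside the VP.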
It is not difficult to imitate the generic procedure to obtain a treatment of AdvP location specific to Vietnamese.</Paragraph> <Paragraph position="6"> Translating structures with negative words, such as &quot;jamais&quot; = &quot;never&quot; = &quot;không bao giờ&quot;, &quot;rien&quot; = &quot;nothing&quot; = &quot;không cái gì cả&quot;, etc. into Vietnamese is problematic. A straightforward application of the generic engine might yield exactly the opposite meaning, e.g.: inserted before the verb to form a negation.</Paragraph> <Paragraph position="7"> The right sentence should be</Paragraph> <Paragraph position="9"> The same problem is known in French-English translation, and is handled in GBGEN by realizing the English sentence in affirmative rather than negative form. This solution does not work for Vietnamese: Our solution here is to keep the verb in the negative form, and use the &quot;indefinite&quot; counterparts &quot;bao giờ&quot;, &quot;cái gì cả&quot;, etc. of the expressions &quot;không bao giờ&quot;, &quot;không cái gì cả&quot;, etc. (footnote 6). The structure of, e.g., the translation (24) is thus</Paragraph> <Paragraph position="11"> where &quot;không&quot; and &quot;bao giờ&quot; are two different constituents. Note however that this solution gives a less good but still acceptable translation of (27), that of a86 a57a58a32a47a59 a103 a59a4a7a9a76a32a4a33a23a81a8a82a79a83a67a84a66 a22a4a24a25a15a115a20a14a27a29a28a31a30 a66a39a59a51a63a67a69a68 a70a21a34a41a63a43a72a116a32a47a59a51a63a4a32a47a59 a86 . 
We could have done better, but at the cost of much more complicated programming.</Paragraph> <Paragraph position="12"> (33) a57a58a32a47a59a117a81a4a63a118a116a45a43a59a51a63a44a119a68 a24a4a27 a120 (&quot;whom&quot;=&quot;ai&quot;) We therefore block the wh-movement procedure of GBGEN when constructing wh-questions.</Paragraph> <Paragraph position="13"> However, there is one case where movement is preferred and carried out, that of why (footnote 7).</Paragraph> <Paragraph position="14"> To form a relative clause in Vietnamese, one can generally add an optional complementizer &quot;mà&quot; before the clause. We decide to put &quot;(mà)&quot; for subject relative clauses, and &quot;mà&quot; for object relative clauses, as it is more acceptable to drop &quot;mà&quot; in the former case than in the latter. Translating adjunct relative clauses that begin with a preposition from French or English into Vietnamese is difficult. In general, we need to keep the preposition at the end of the relative clause, rather than move it to the beginning as GBGEN proposes: (41) La fille / avec qui John parle / est Mary.</Paragraph> <Paragraph position="15"> (42) The girl / with whom John talks / is Mary.</Paragraph> <Paragraph position="16"> Footnote 7: This is done by the AdvP location procedure (see section 3.3.1).</Paragraph> <Paragraph position="17"> Footnote 8: If &quot;mà&quot; is dropped, it is a sort of garden-path sentence. 
But this is common in Vietnamese, and may be an interesting subject for study.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Results </SectionTitle> <Paragraph position="0"> The implemented generation module for Vietnamese can realize almost all structures that can be generated from the intermediate PSSs.</Paragraph> <Paragraph position="1"> Many of them are of course not yet perfect, but a French-Vietnamese translation test on a sample of French sentences with many different syntactic structures gave encouraging results. We did not run tests on English-Vietnamese translation, because the English analysis module in ITS3 is not yet well developed.</Paragraph> <Paragraph position="2"> We have not yet been able to run a large-scale test on real corpora, because our lexicons are still small (about 400 entries for each bilingual lexicon, many of them function words such as prepositions, adverbs, pronouns, and conjunctions). However, tests are not necessarily restricted by the size of the lexicons: if a source-language word is not found in the bilingual lexicon, it is retained in the PSS during the lexical transfer phase. This word then appears in the target-language sentence exactly at the position of its supposed translation.</Paragraph> <Paragraph position="3"> As is well known, lexicon building requires a huge investment of human work and time. One can use methods of (semi-)automatic acquisition of dictionary resources (see e.g. Doan-Nguyen, 1998) to quickly obtain a large draft of the necessary lexicons, provided that such resources (e.g. a French-Vietnamese dictionary text file) exist. In the worst case, a human has to verify and complete this draft, but in general this is still much cheaper than developing a lexicon from scratch. Unfortunately, we did not have any of these resources. 
Nevertheless, we profited greatly from a French-English lexicon draft extracted from ITS3's lexicons: much of the lexical information in its entries can be reused in the corresponding Vietnamese entries (e.g. the part of speech, the verb theta grid). Moreover, English translations of a French word, as well as French translations of an English word, help to choose the correct corresponding Vietnamese translations.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> Although not perfect, ITS3, and in particular GBGEN, prove to be good systems for multilingual MT. They have a solid linguistic theoretical base, a modular computational design, and surprisingly good performance. Apart from the problems presented in this paper, we find it convenient to use many of the available procedure templates, such as those for PP construction, movements, and bindings. In particular, ITS3 is able to perform robust, high-quality, and broad-coverage syntactic analysis of French. Our experience can be seen as a test of integrating an &quot;exotic&quot; language into the system.</Paragraph> <Paragraph position="1"> As we have shown above, many difficulties in implementing the generation module for Vietnamese stem from &quot;mismatches&quot; between Vietnamese grammatical notions and the model of the generic engine GBGEN. It is widely agreed that designing a generic, flexible, and efficient system for practical applications of multilingual generation and MT is a very difficult problem. Our experience suggests that in a principle-based generation system such as GBGEN, the parameterized modules, which contain language-specific and lexicalized properties, should be given more importance. 
The flexibility of a generic system lies in designing good &quot;slots&quot; so that modules for a new language can be plugged in systematically and conveniently.</Paragraph> <Paragraph position="2"> As discussed in section 2, a declarative approach may be very beneficial for system development, including genericity and flexibility. The programming paradigm is also an important factor. The LATL has recently begun to reengineer ITS3 in an object-oriented language, which facilitates the development of the system while still guaranteeing its performance (footnote 9).</Paragraph> <Paragraph position="3"> Apart from the generation phase, the quality of an MT system depends heavily on the analysis modules. The construction of the PSS from the syntactic analysis of the input sentence is of crucial importance. We find that this is a real bottleneck in ITS3: in many cases, despite a good syntactic analysis, the translation fails because of poor PSS construction. PSS construction is obviously a very difficult task, as it is in fact a kind of translation, from a syntactic structure into a logical formalism. See e.g. Alshawi (1992) for a similar task, i.e.</Paragraph> <Paragraph position="4"> translating English sentences into a logical representation.</Paragraph> </Section> </Paper>