<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2031">
  <Title>Problems Of Reusing An Existing MT System</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The last decade has witnessed several attempts to increase the quality of MT systems by introducing new methods. The strong emphasis on stochastic methods in NLP in general and in MT in particular, the attempts to develop hybrid systems, the wide acceptance of translation-memory based systems among translation professionals, and the focus on limited-domain speech-to-speech translation systems: all these (and many other) trends have demonstrated encouraging results in recent years.</Paragraph>
    <Paragraph position="1"> Developing and using new methods definitely moves the whole MT field forward, but one should not forget about all the effort invested into the old systems. Reusing at least some parts of those systems may help to decrease the costs of new systems, especially when one of the languages is not a &quot;big&quot; language and therefore does not have as wide a range of tools, grammars and dictionaries available as, for example, English, German, Japanese or Spanish. In this paper we describe one such attempt to reuse an existing system for a new language pair. (Acknowledgement: the work described in this paper has been supported by grant No. 405/03/0914 of the Grant Agency of the Czech Republic (GACR) and partially also by grant No. 351/2005 of the Grant Agency of Charles University (GAUK).)</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="179" type="metho">
    <SectionTitle>
2 The original system
</SectionTitle>
    <Paragraph position="0"> One of the systems which was silently abandoned in the early nineties was RUSLAN, a system for translation from Czech to Russian (Oliva, 1989). It was developed in the second half of the eighties with the aim of translating texts from a relatively closed thematic domain, that of mainframe operating systems.</Paragraph>
    <Paragraph position="1"> The system used a transfer-based architecture.</Paragraph>
    <Paragraph position="2"> The system was implemented almost completely in Q-systems, a formalism created by Alain Colmerauer (Colmerauer, 1969) for the TAUM-METEO project. The Czech-to-Russian system also relied upon a set of dictionaries containing all the data exploited by the individual modules of the system. Each lexical item in the main (bilingual) dictionary contained not only lexico-syntactic data (valency frames etc.), but also a set of semantic features.</Paragraph>
    <Paragraph position="3"> Work on RUSLAN was terminated in 1990, in the final phase of system testing and debugging. The reason was quite simple: after the political changes in 1989 there was no longer any commercial demand for a Czech-to-Russian MT system.</Paragraph>
    <Paragraph position="4"> The demand for Czech-English translation has grown dramatically in the years following the abandonment of RUSLAN. At the same time, the range of methods, tools and resources for MT has also grown substantially. Several corpora were created for Czech, the most prominent being the morphologically annotated Czech National Corpus and the syntactically annotated Prague Dependency Treebank. In 2002 we started work on the parallel bilingual Prague Czech-English Dependency Treebank (PCEDT) (Cuřín et al., 2004), which contains about half of the texts from Penn Treebank 3 translated into Czech by native speakers. A large morphological dictionary of Czech has been developed (Hajič, 2001), allowing for good-quality morphological analysis of Czech; it has since been tested in numerous commercial applications and scientific projects.</Paragraph>
  </Section>
  <Section position="5" start_page="179" end_page="179" type="metho">
    <SectionTitle>
3 The background of the project
</SectionTitle>
    <Paragraph position="0"> The main motivation for our Czech-English MT experiment was to test several hypotheses. The most prominent of these concerns the level at which it is reasonable to perform the transfer. Due to the differences between the two languages it is not sufficient to perform the transfer immediately after morphological analysis or shallow parsing, as was done in the MT system Česílko, which aims at translation between closely related (and similar) languages (cf. Hajič et al., 2003). On the other hand, it is an open question whether the typological differences between Czech and English justify performing the transfer at the tectogrammatical (deep syntactic) level.</Paragraph>
    <Paragraph position="1"> Last but not least, one of our aims was to develop a rule-based MT system at the lowest possible cost, either by reusing existing modules or by using (semi)automatic methods whenever possible, concentrating on areas where human labor would be extremely expensive (for example, building a large-coverage bilingual dictionary, cf. the following paragraphs).</Paragraph>
  </Section>
  <Section position="6" start_page="179" end_page="183" type="metho">
    <SectionTitle>
4 Czech-English MT system
</SectionTitle>
    <Paragraph position="0"> The main goal of our project is to develop an experimental MT system for translating texts from the PCEDT from Czech to English. The system investigates the possibility of reusing existing resources (grammar, dictionary) in order to decrease the development time. It also exploits the parallel bilingual corpus of syntactically annotated texts, not as direct learning material but rather as an additional source of linguistic data, especially for dictionary development and for testing the system.</Paragraph>
    <Paragraph position="1"> The task is complicated by the fact that this translation direction is, in our opinion, harder than the reverse one. There are several reasons for this claim; the most prominent one is the free word order of the source language. It means that substantial changes of the word order are very often necessary to obtain a grammatical English sentence, whereas when translating from English to Czech the results are more or less grammatically correct and comprehensible even if the word order is not changed at all.</Paragraph>
    <Paragraph position="2"> Another problem of Czech-English translation is the insertion of articles. Czech does not use articles, and it is of course much easier to remove them from the text (when translating from English) than to insert the proper article in the proper place (when translating from Czech).</Paragraph>
    <Paragraph position="3"> Let us now look at the individual modules of the new system.</Paragraph>
    <Section position="1" start_page="179" end_page="180" type="sub_section">
      <SectionTitle>
4.1 Morphological analysis
</SectionTitle>
      <Paragraph position="0"> Due to the limited size of the original morpho-syntactic dictionary of the system it was necessary to replace the original module by a new one.</Paragraph>
      <Paragraph position="1"> The new module of morphological analysis of Czech (Hajič, 2001) has already been exploited in numerous applications. It covers almost the entire Czech language, with very few exceptions (it is estimated to contain about 800 000 lemmas).</Paragraph>
      <Paragraph position="2"> It is very reliable: thanks to its large coverage there are almost no unknown words in the whole PCEDT. The only problem was incorporating the new module into the system, since the original module of syntactic analysis of Czech from RUSLAN was very closely bound to dictionary lookup and to the morphological module.</Paragraph>
      <Paragraph position="3"> The new module also uses a different tagset.</Paragraph>
    </Section>
    <Section position="2" start_page="180" end_page="181" type="sub_section">
      <SectionTitle>
4.2 Bilingual dictionary
</SectionTitle>
      <Paragraph position="0"> The bilingual dictionary of RUSLAN contained approximately 8000 lexical items with rich lexico-syntactic information. We originally assumed that this information could be transformed and reused in the new system, but this assumption turned out to be false. Although the information contained in the original bilingual dictionary is extremely valuable for the module of syntactic analysis of Czech, we decided to sacrifice it. A mere 8000 lexical items constitute too small a part of the new bilingual dictionary, and we preferred to handle the dictionary in a uniform way.</Paragraph>
      <Paragraph position="1"> At the moment there are no Czech-English dictionaries exploitable in an MT system. The available machine-readable dictionaries, built mainly for a human user (such as WinGED or Svoboda (2001)), suffer from important limitations:
 * [...] to some extent. (E.g. valency frames are encoded by means of rather inconsistent abbreviations in plain text: accession to = vstoupení do or adjudge sb. to be guilty = uznat vinným koho.)
 * Usually, no morphological information is given with the entries, although it can be vital for correctly recognizing an occurrence of the entry in a text. For example, the expression kniha účetní can be translated as either an accounting book or a book of an accountant, depending on whether the Czech word účetní is an adjective or a noun.</Paragraph>
      <Paragraph position="2"> [...]onym to translation pair, i.e. a pair of Czech and English expressions.</Paragraph>
      <Paragraph position="3"> [...] lexicographers to annotate syntactic properties in plain text (such as putting the head of the clause as the first word).</Paragraph>
      <Paragraph position="4"> From the point of view of structural machine translation, the lack of syntactic information in the translation dictionary is crucial. In the course of translation, the input sentence is syntactically analyzed before searching for foreign language equivalents. In order to check for the presence of multi-word expressions in the input, the dictionary must encode the structural shape of such entries, otherwise the system does not know how to traverse the relevant part of the tree. Similarly, some expressions require certain constraints to be met in the input text (such as agreement in case or number). If these constraints are not fulfilled, the proposed foreign language equivalent is not applicable.</Paragraph>
      <Paragraph position="5"> The importance of valency (subcategorization) frames and their equivalents should be stressed, too. In the described system, the syntactic analyzer already requires verb and adjective valency frames in order to allow for specific syntactic constructions. In general, knowledge of the translation equivalents of valencies is important to preserve the meaning (přijít na nějaký nápad = come at an idea, literal translation: come on an idea; chodit na housle = attend violin lessons, lit. walk on violin) or to handle auxiliary words properly (čekat na někoho = wait for somebody, lit. wait on sb.; říci něco = tell something but přejet něco = run over something).</Paragraph>
      <Paragraph position="6">  In order to handle the problems mentioned above, we performed an extensive cleanup of the data from the available machine-readable dictionaries. The core steps of the cleanup are as follows: Identifying meta-information.</Paragraph>
      <Paragraph position="7"> We manually processed all the entries and searched for frequent words that typically encode some meta-information, such as sth., st., oneself. We also checked all entries ending with a word that is potentially a preposition. Based on the expression in the other language, we were able to recognize the meaning and identify whether the suspicious word expresses a &quot;slot&quot; in the expression or whether it is a fixed part of the expression. (E.g. in mít o sobě vysoké mínění = think something of oneself, only the word oneself encodes a slot; the word something is a fixed part of the expression.) During this phase, entries encoding several translation variants at once were also disassembled into separate translation pairs.</Paragraph>
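      <Paragraph> As a rough illustration of this cleanup step, the sketch below flags English expressions that contain a typical placeholder word or end in a possible preposition, so that a human can then decide whether the word is a slot or a fixed part of the expression. The word lists, the function name flag_entry and the sample entries are assumptions made for this example only; the inspection described above was done manually.

```python
# Hypothetical word lists; the text names only sth., st. and oneself.
PLACEHOLDERS = {"sth.", "st.", "sb.", "oneself", "something", "somebody"}
PREPOSITIONS = {"to", "for", "on", "of", "at", "with"}


def flag_entry(english):
    """Return the reasons why an English expression needs manual review."""
    tokens = english.lower().split()
    reasons = []
    for token in tokens:
        if token in PLACEHOLDERS:
            reasons.append("placeholder:" + token)
    # A final preposition may hide an unexpressed slot ("accession to ...").
    if tokens and tokens[-1] in PREPOSITIONS:
        reasons.append("final-preposition:" + tokens[-1])
    return reasons


print(flag_entry("think something of oneself"))
print(flag_entry("accession to"))
print(flag_entry("wealthy man"))
```

 The function only proposes candidates: in the first sample both something and oneself are flagged, and deciding that only oneself is a slot still requires looking at the Czech side of the entry.</Paragraph>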
      <Paragraph position="8"> Part-of-speech disambiguation.</Paragraph>
      <Paragraph position="9"> We processed the Czech part of each entry with a morphological analyzer (Hajič, 2001) and manually disambiguated the part of speech of ambiguous expressions. It should be noted that automatic tagging would not give satisfactory results due to the lack of sentential context around the expressions.</Paragraph>
      <Paragraph position="10"> Adding morphological constraints.</Paragraph>
      <Paragraph position="11"> Morphological constraints on word entries describe which values of morphological features are valid for each word of the entry or have to be shared among some words of the entry. Once identified, morphological constraints can be used to check whether a word group in the input text represents an entry or not. With respect to our final task (translation from Czech to English), we aim at Czech constraints only.</Paragraph>
      <Paragraph position="12"> We decided to induce morphological constraints automatically, based on corpus examples of the entries. For each entry, we look up sentences that contain all the lemmas of the entry in a close neighborhood (irrespective of the word order and of the possible presence of inserted extra words). We weight the instances to promote those with no intervening words and those with a connected dependency graph. The list of weighted instances is scanned for both unary (such as &quot;case is accusative&quot;, &quot;number is singular&quot;) and binary (&quot;the case of the first and second words match&quot;) pre-defined constraints, and we select those that are satisfied by at least 75% of the total weight.</Paragraph>
      <Paragraph position="13"> Most of the expressions with at least 10 corpus instances obtain a valid set of constraints. Only expressions containing very common words (which often appear close together without actually forming the expression) obtain constraints that are too weak. For instance, no case and gender agreement constraints are selected for the expression bohatý člověk (wealthy man).</Paragraph>
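      <Paragraph> The selection of constraints by weighted support can be sketched as follows. This is a minimal illustration, not the implementation used in the system: the data layout (a list of weighted instances, each a list of per-word feature dictionaries), the function name induce_constraints and the sample weights are all assumptions; only the 75% threshold and the unary/binary distinction come from the text. The weights are taken as given, since the exact weighting scheme is not specified.

```python
THRESHOLD = 0.75  # share of the total weight named in the text


def induce_constraints(instances, features=("case", "number")):
    """instances: list of (weight, words); words is a list of feature dicts."""
    total = sum(weight for weight, _ in instances)
    if total == 0:
        return []
    support = {}
    for weight, words in instances:
        satisfied = set()
        for i, word in enumerate(words):
            for feat in features:
                # Unary constraint: the feature of word i has a fixed value.
                satisfied.add(("unary", i, feat, word[feat]))
                # Binary constraint: word i agrees with a later word j.
                for j in range(i + 1, len(words)):
                    if word[feat] == words[j][feat]:
                        satisfied.add(("agree", i, j, feat))
        for constraint in satisfied:
            support[constraint] = support.get(constraint, 0.0) + weight
    return sorted(c for c, w in support.items() if w / total >= THRESHOLD)


# Invented instances of an adjective + noun entry; weights are assumed.
instances = [
    (2.0, [{"case": "nom", "number": "sg"}, {"case": "nom", "number": "sg"}]),
    (2.0, [{"case": "acc", "number": "sg"}, {"case": "acc", "number": "sg"}]),
    (0.5, [{"case": "gen", "number": "pl"}, {"case": "nom", "number": "sg"}]),
]
for constraint in induce_constraints(instances):
    print(constraint)
```

 With these invented data the low-weight third instance, where the two words merely co-occur, does not prevent the case and number agreement constraints from being selected, while no fixed case value reaches the threshold.</Paragraph>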
      <Paragraph position="14"> Adding syntactic information.</Paragraph>
      <Paragraph position="15"> Syntactic information (dependency relations among the words of the expression) is needed mainly during the analysis of input sentences, so we focused on adding it to the Czech part of the entries first. For most of the entries, it was possible to add the dependency structure manually, based on the part-of-speech pattern of the entry. For instance, all entries containing an adjective followed by a noun get the same structure: the noun governs the preceding adjective. For the remaining entries (with very varied POS patterns), we employ a corpus-based search similar to the automatic procedure for identifying morphological constraints.</Paragraph>
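      <Paragraph> Assigning one dependency structure per part-of-speech pattern amounts to a table lookup. The sketch below shows the idea for the adjective + noun pattern mentioned above; the table HEADS_BY_PATTERN, the tag names and the function attach are hypothetical and stand in for whatever representation the dictionary actually uses.

```python
# Hypothetical pattern table: POS sequence -> head index for every word
# (-1 marks the root of the expression).
HEADS_BY_PATTERN = {
    ("ADJ", "NOUN"): (1, -1),  # the noun governs the preceding adjective
}


def attach(words, pos_tags):
    """Return (word, head_word) pairs, or None for an unknown pattern."""
    heads = HEADS_BY_PATTERN.get(tuple(pos_tags))
    if heads is None:
        return None  # such entries are left to the corpus-based search
    return [(w, words[h] if h >= 0 else None) for w, h in zip(words, heads)]


print(attach(["bohatý", "člověk"], ["ADJ", "NOUN"]))
```

 One manually written table row then covers every entry that shares the pattern, which is why the manual route is feasible for most of the dictionary.</Paragraph>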
    </Section>
    <Section position="3" start_page="181" end_page="182" type="sub_section">
      <SectionTitle>
4.3 Named entity recognition module
</SectionTitle>
      <Paragraph position="0"> Named entities (NE) are atomic units such as proper names, temporal expressions (e.g., dates) and quantities (e.g., monetary expressions). They occur quite often in various texts and carry important information. Hence, proper analysis of NEs and their translation has an enormous impact on MT quality (Babych and Hartley, 2004). In our system they are extremely important due to the nature of the input texts. The Wall Street Journal section of the Penn Treebank shows a much higher density of named entities than ordinary texts. Their correct recognition therefore has a considerable impact on the performance of the whole system, especially if the evaluation of translation quality is based on gold standard translations.</Paragraph>
      <Paragraph position="1"> NE translation involves both semantic translation and phonetic transliteration. Each type of NE is handled in a different way. For instance, person names do not undergo semantic translation (only transliteration is required), while certain titles and parts of names do (e.g., první dáma Laura Bushová - first lady Laura Bush). In the case of organizations, applying regular transfer rules for NPs seems to be sufficient (e.g., Ústav formální a aplikované lingvistiky - Institute of formal and applied linguistics), although an idiomatic translation may sometimes be preferable. For geographical places we apply bilingual glossaries and a set of regular transfer rules as well.</Paragraph>
      <Paragraph position="2"> For NE recognition, we have developed a grammar based on regular expressions that processes typed feature structures. The grammar framework, like the formally somewhat weaker platform SProUT (Bering et al., 2003), uses finite-state techniques and unification, i.e., a grammar consists of pattern/action rules, where the left-hand side is a regular expression over typed feature structures (TFS) with variables, representing the recognition pattern, and the right-hand side is a TFS specification of the output structure.</Paragraph>
      <Paragraph position="3"> The NE grammar is based on the experiment described in (Piskorski et al., 2004). An example of a simple rule is: #subst[LEMMA: ministerstvo]$s1 + #top[CASE: gen, PHRASE: $phr]$s2</Paragraph>
      <Paragraph position="5"> The first TFS matches any morphological variant of the word ministerstvo (ministry), followed by a genitive NP. The variables $s1, $s2 and $phr create dynamic value assignments and allow these values to be transported to the slots of the output structure of type ministry. The output structure contains a new attribute called PHRASE with the lemmatized value of the whole phrase.</Paragraph>
      <Paragraph position="6"> If the input phrase is informace ministerstva zahraničí o cestování do ohrožených oblastí (2) then the phrase &quot;ministerstva zahraničí&quot; will be recognized as an NE and handled as an atomic unit in the whole MT process. Lemmatization of NEs is crucial in the context of MT. However, it may pose a serious problem for languages with rich inflection due to structural ambiguities; e.g., the internal bracketing of complex noun phrases might be difficult to analyze. The core of the framework is based on grammars that have been developed for the MT system Česílko (Hajič et al., 2003).</Paragraph>
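      <Paragraph> To make the behaviour of the ministry rule concrete, the following sketch imitates it in plain Python over a list of dictionaries. It is only an approximation under stated assumptions: real typed feature structures and unification are replaced by dictionary lookups, the genitive NP is assumed to arrive already chunked with a PHRASE value, and the layout of the output structure is guessed, because the right-hand side of the rule is not reproduced in this text. The function name match_ministry is hypothetical.

```python
def match_ministry(items):
    """Scan a list of feature dicts; return (start, end, output) or None."""
    for i in range(len(items) - 1):
        first, second = items[i], items[i + 1]
        if (first.get("POS") == "subst"
                and first.get("LEMMA") == "ministerstvo"
                and second.get("CASE") == "gen"):
            output = {
                "TYPE": "ministry",
                # Lemmatized head plus the phrase carried by the genitive NP.
                "PHRASE": first["LEMMA"] + " " + second["PHRASE"],
            }
            return i, i + 2, output
    return None


sentence = [
    {"POS": "subst", "LEMMA": "informace", "CASE": "nom"},
    {"POS": "subst", "LEMMA": "ministerstvo", "CASE": "gen"},
    {"POS": "np", "CASE": "gen", "PHRASE": "zahraničí"},
    {"POS": "prep", "LEMMA": "o"},
]
print(match_ministry(sentence))
```

 The match is anchored on the lemma rather than on the surface form, so the inflected form ministerstva is found and the span can be replaced by a single unit for the rest of the pipeline.</Paragraph>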
    </Section>
    <Section position="4" start_page="182" end_page="182" type="sub_section">
      <SectionTitle>
4.4 Syntactic analysis of Czech
</SectionTitle>
      <Paragraph position="0"> Although we originally assumed that the module of syntactic analysis of Czech would require only small modifications, and its reuse was one of the goals of our project, it turned out that this module is one of the main sources of problems.</Paragraph>
      <Paragraph position="1"> In the course of testing and debugging the system we had to create a number of new grammar rules covering phenomena which were not properly accounted for in the original system due to the different nature of the original domain. The texts from the PCEDT contain, for example, many more numerals and numeric expressions than the operating system manuals handled by RUSLAN, and some of these require special grammar or transfer rules. The complexity of input sentences with regard to the number of clauses and their mutual relationships is also much higher. This, of course, decreases the number of sentences which are completely syntactically analyzed and thus degrades the translation quality.</Paragraph>
      <Paragraph position="2"> One of the biggest problems of the grammar lies in the properties of Q-systems. It was quite clear from the start of the project that it is impossible to extract only the knowledge encoded in the grammar: the grammar rules written in Q-systems are so complicated that rewriting them in a different (even chart-parser based) formalism would actually mean writing a completely new grammar. Although we have at our disposal a new, modernized and reimplemented version of a Q-systems compiler and interpreter which overcomes the technical problems of the original version, the nature of the formalism is of course preserved.</Paragraph>
    </Section>
    <Section position="5" start_page="182" end_page="183" type="sub_section">
      <SectionTitle>
4.5 Transfer
</SectionTitle>
      <Paragraph position="0"> The main task of this module is to transform the syntactic structure (syntactic tree) of the input Czech sentence into the syntactic structure (tree) of the corresponding English sentence. The transfer module does not handle the translation of regularly translated lexical units; these are handled by the bilingual dictionary in the earlier phases of the system. The transfer concentrates on three main tasks:
 * The transformation of the Czech syntactic tree into the English one, reflecting the differences in word order between the two languages.
 * The identification and translation of those constructions in Czech which require a specific (irregular) translation into English.
 * The insertion of articles (which do not exist in Czech) into the target language sentences.</Paragraph>
      <Paragraph position="1"> The development of this module still continues; the initial tests confirmed that a substantial improvement can be achieved in the future.</Paragraph>
    </Section>
    <Section position="6" start_page="183" end_page="183" type="sub_section">
      <SectionTitle>
4.6 Syntactic synthesis of English
</SectionTitle>
      <Paragraph position="0"> The syntactic synthesis of Russian in RUSLAN is very closely bound to transfer; we therefore tried to reuse as large a portion of the grammar as possible, but substantial modifications of the grammar were of course necessary. As with the transfer module, work on this module still continues.</Paragraph>
    </Section>
    <Section position="7" start_page="183" end_page="183" type="sub_section">
      <SectionTitle>
4.7 Morphological synthesis of English
</SectionTitle>
      <Paragraph position="0"> Due to the simplicity of English morphology this module has a very limited role in our system. It handles plurals, third-person verb forms and irregular words.</Paragraph>
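      <Paragraph> The three duties of such a module can be illustrated with a few lines of Python. This is a toy sketch, not the module of the described system: the exception tables are tiny and invented for the example, the function names are hypothetical, and only the most common spelling patterns are covered.

```python
# Tiny hypothetical exception tables; a real module would need many more.
IRREGULAR_PLURALS = {"man": "men", "child": "children", "foot": "feet"}
IRREGULAR_THIRD = {"be": "is", "have": "has", "go": "goes", "do": "does"}
SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")


def add_s(word):
    """Attach the regular -s/-es ending shared by plurals and 3rd person."""
    if word.endswith(SIBILANT_ENDINGS):
        return word + "es"
    # Consonant + y becomes -ies (city, carry); vowel + y just takes -s.
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"


def plural(noun):
    return IRREGULAR_PLURALS.get(noun) or add_s(noun)


def third_person(verb):
    return IRREGULAR_THIRD.get(verb) or add_s(verb)


print(plural("book"), plural("man"), plural("city"))
print(third_person("tell"), third_person("have"), third_person("watch"))
```

 Irregular forms are looked up first and the regular ending is applied only as a fallback, so extending coverage means adding table entries rather than new rules.</Paragraph>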
    </Section>
  </Section>
</Paper>