<?xml version="1.0" standalone="yes"?> <Paper uid="J04-2003"> <Title>© 2004 Association for Computational Linguistics Statistical Machine Translation with Scarce Resources Using Morpho-syntactic Information</Title> <Section position="3" start_page="182" end_page="183" type="intro"> <SectionTitle> 1.3 Related Work </SectionTitle> <Paragraph position="0"> 1.3.1 Morphology. Several publications have already dealt with the treatment of morphology in the framework of language modeling and speech recognition: Kanevsky, Roukos, and Sedivy (1997) propose a statistical language model for inflected languages.</Paragraph> <Paragraph position="1"> They decompose word forms into stems and affixes. Maltese and Mancini (1992) report that a linear interpolation of word n-grams, part-of-speech n-grams, and lemma n-grams yields lower perplexity than purely word-based models. Larson et al. (2000) apply a data-driven algorithm for decomposing compound words in compounding languages, as well as for recombining phrases, to enhance the pronunciation lexicon and the language model of large-vocabulary speech recognition systems.</Paragraph> <Paragraph position="2"> With regard to machine translation, the treatment of morphology is part of the analysis and generation steps in virtually every symbolic machine translation system. For this purpose, the lexicon should contain base forms of words together with the grammatical category, subcategorization features, and semantic information, in order to reduce the size of the lexicon and to account for unknown word forms, that is, word forms not explicitly present in the dictionary.</Paragraph> <Paragraph position="3"> Today's statistical machine translation systems build upon the work of P. F. Brown and his colleagues at IBM. The translation models they presented in various papers between 1988 and 1993 (Brown et al. 1988; Brown et al.
1990; Brown, Della Pietra, Della Pietra, and Mercer 1993) are commonly referred to as IBM Models 1-5, based on the numbering in Brown, Della Pietra, Della Pietra, and Mercer (1993). The underlying (probabilistic) lexicon contains only pairs of full forms. On the other hand, Brown et al. (1992) had already suggested that word forms be annotated with morpho-syntactic information, but they did not investigate its effects.</Paragraph> <Paragraph position="4"> 1.3.2 Translation with Scarce Resources. Some recent publications, such as Al-Onaizan et al. (2000), have dealt with the problem of translation with scarce resources. Al-Onaizan et al. report on an experiment involving Tetun-to-English translation by different groups, including one using statistical machine translation. Al-Onaizan et al. assume the absence of linguistic knowledge sources such as morphological analyzers and dictionaries. Nevertheless, they found that the human mind is very well capable of deriving dependencies such as morphology, cognates, proper names, and spelling variations, and that this capability ultimately accounted for the better results produced by humans compared to corpus-based machine translation. This additional information results from complex reasoning and is not directly accessible from the full-word-form representation in the data.</Paragraph> <Paragraph position="5"> This article takes a different point of view: Even if full bilingual training data are scarce, monolingual knowledge sources such as morphological analyzers and data for training the target language model, as well as conventional dictionaries (one word and its translation[s] per entry), may be available and substantially useful for improving the performance of statistical translation systems. This is especially the case for more highly inflecting major languages like German. 
The use of dictionaries to augment or replace parallel corpora has already been examined by Brown, Della Pietra, Della Pietra, and Goldsmith (1993) and Koehn and Knight (2001), for instance.</Paragraph> </Section> </Paper>