<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-2012">
  <Title>Phrase Linguistic Classification and Generalization for Improving Statistical Machine Translation</Title>
  <Section position="2" start_page="0" end_page="67" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Since its revival in the beginning of the 1990s, statistical machine translation (SMT) has shown promising results in several evaluation campaigns. From original word-based models, results were further improved by the appearance of phrase-based translation models.</Paragraph>
    <Paragraph position="1"> However, many SMT systems still ignore morphological analysis and work at the surface level of word forms. For highly inflected languages, such as German or Spanish (or any language of the Romance family), this poses severe limitations both in training from parallel corpora and in producing a correct translation of an input sentence.</Paragraph>
    <Paragraph position="2"> This lack of linguistic knowledge in SMT forces the translation model to learn a separate translation probability distribution for every inflected form of a noun, adjective or verb ('vengo', 'vienes', 'viene', etc.), which aggravates the usual data sparseness problem. Despite recent efforts in the community to provide models with this kind of information (see Section 6 for details on related previous work), results have yet to be encouraging.</Paragraph>
    <Paragraph position="3"> In this paper we address the incorporation of morphological and shallow syntactic information regarding verbs and compound verbs, as a first step towards an SMT model based on linguistically classified phrases. Using POS tags and lemmas, we detect verb structures (with or without a personal pronoun, single-word or compound with auxiliaries) and replace them with the base form of the head verb. This improves statistical word alignment performance, and has the further advantages of improving the translation model and generalizing to unseen verb forms during translation.</Paragraph>
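    The substitution step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the tag names (PRP, AUX, V), the Token type, the classify_verbs function and the V[lemma] label format are all hypothetical, and the tags and lemmas are assumed to come from an external tagger.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    form: str   # surface word form
    pos: str    # POS tag (hypothetical tag set: PRP, AUX, V, anything else)
    lemma: str  # lemma, assumed to be supplied by an external tagger


def classify_verbs(tokens: List[Token]) -> List[str]:
    """Replace each [pronoun] [auxiliaries] verb group by V[lemma of head verb]."""
    out: List[str] = []
    i = 0
    n = len(tokens)
    while i != n:
        j = i
        # Optional personal pronoun opening the group.
        if tokens[j].pos == "PRP":
            j += 1
        # Any number of auxiliaries before the head verb.
        while j != n and tokens[j].pos == "AUX":
            j += 1
        if j != n and tokens[j].pos == "V":
            # Group found: emit one class token built from the head verb's lemma.
            out.append("V[" + tokens[j].lemma + "]")
            i = j + 1
        else:
            # No verb group starts here: keep the surface form unchanged.
            out.append(tokens[i].form)
            i += 1
    return out
```

    Under these assumptions, the tagged sentence "I have seen the house" (PRP AUX V DT N) becomes the three tokens V[see], the, house, so that all inflected and compound forms of a verb share one entry during alignment.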
    <Paragraph position="4"> Experiments are performed for the English-Spanish language pair.</Paragraph>
    <Paragraph position="5"> The paper is organized as follows. Section 2 describes the rationale of this classification strategy, discussing the advantages and difficulties of such an approach. Section 3 gives details of the implementation for verbs and compound verbs, whereas Section 4 presents the experimental setting used to evaluate the quality of the alignments. Section 5 describes the current state of our research, as well as our most immediate tasks and our medium- and long-term lines of experimentation. Finally, Sections 6 and 7 discuss related work in the literature and conclude, respectively.</Paragraph>
  </Section>
</Paper>