<?xml version="1.0" standalone="yes"?> <Paper uid="P05-2012"> <Title>Phrase Linguistic Classification and Generalization for Improving Statistical Machine Translation</Title> <Section position="3" start_page="67" end_page="68" type="metho"> <SectionTitle> 2 Morphosyntactic classification of translation units </SectionTitle> <Paragraph position="0"> State-of-the-art SMT systems use a log-linear combination of models to decide the best-scoring target sentence given a source sentence. Among these models, the basic ones are a translation model Pr(e|f) and a target language model Pr(e), which can be complemented by reordering models (if the language pair presents very long alignments in training), a word penalty to avoid favoring short sentences, class-based target-language models, etc. (Och and Ney, 2004).</Paragraph> <Paragraph position="1"> The translation model is based on phrases; we have a table of the probabilities of translating a certain source phrase ~fj into a certain target phrase ~ek. Several strategies to compute these probabilities have been proposed (Zens et al., 2004; Crego et al., 2004), but none of them takes into account the fact that, when it comes to translation, many different inflected forms of words share the same translation. Furthermore, they try to model the probability of translating certain phrases that contain only auxiliary words, which are not directly relevant in translation but play a secondary role.
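The log-linear decision rule just mentioned can be illustrated with a minimal sketch; the feature names, weights and scores below are invented for illustration and do not come from the paper.

```python
import math

def loglinear_score(features, weights):
    """Log-linear combination: score(e|f) = sum_m lambda_m * h_m(e, f)."""
    return sum(weights[name] * h for name, h in features.items())

# Hypothetical log-probability features for two candidate target sentences:
# tm = translation model, lm = language model, wp = word penalty (word count).
candidates = {
    "queremos ir": {"tm": math.log(0.4), "lm": math.log(0.3), "wp": 2.0},
    "vamos":       {"tm": math.log(0.2), "lm": math.log(0.5), "wp": 1.0},
}
weights = {"tm": 1.0, "lm": 0.8, "wp": -0.1}  # weights would be tuned, e.g. for BLEU

# Decoding picks the best-scoring target sentence.
best = max(candidates, key=lambda e: loglinear_score(candidates[e], weights))
```

In a real decoder the candidate set is built incrementally from the phrase table rather than enumerated, but the scoring rule is the same.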
These words are a consequence of the syntax of each language, and should be dealt with accordingly.</Paragraph> <Paragraph position="2"> For example, consider the probability of translating 'in the' into a phrase in Spanish, which does not make much sense in isolation (without knowing the following meaning-bearing noun), or the modal verb 'will', when Spanish future verb forms are written without any auxiliary.</Paragraph> <Paragraph position="3"> Given these two problems, we propose a classification scheme based on the base form of the phrase head, which is explained next.</Paragraph> <Section position="1" start_page="67" end_page="67" type="sub_section"> <SectionTitle> 2.1 Translation with classified phrases </SectionTitle> <Paragraph position="0"> Assuming we translate from f to e, and defining ~fj as a certain source phrase and ~ei as a certain target phrase (both sequences of contiguous words), the phrase translation model Pr(~ei |~fj) can be decomposed as:</Paragraph> <Paragraph position="1"> Pr(~ei |~fj) = Pr(~ei |T, ~fj) * Pr(~Ei |~Fj) * Pr(~Fj |~fj) (1) </Paragraph> <Paragraph position="2"> where ~Ei, ~Fj are the generalized classes of the target and source phrases, respectively, and T = ( ~Ei, ~Fj) is the pair of target and source classes used, which we call Tuple. In our current implementation, we consider a classification of phrases that is: The second condition implies Pr( ~F |~f) = 1, leading to the following expression:</Paragraph> <Paragraph position="3"> Pr(~ei |~fj) = Pr(~Ei |~Fj) * Pr(~ei |T, ~fj) (2) </Paragraph> <Paragraph position="4"> where we have just two terms, namely a standard phrase translation model based on the classified parallel data, and an instance model assigning a probability to each target instance given the source class and the source instance. The latter helps us choose among target words in combination with the language model.</Paragraph> </Section> <Section position="2" start_page="67" end_page="68" type="sub_section"> <SectionTitle> 2.2 Advantages </SectionTitle> <Paragraph position="0"> This strategy has three advantages: Better alignment.
By reducing the number of words to be considered during first word alignment (auxiliary words inside the classes disappear and no inflected forms are used), we lessen the data sparseness problem and can obtain a better word alignment.</Paragraph> <Paragraph position="1"> In a second step, one can learn word alignment relationships inside aligned classes by realigning them as a separate corpus, if desired.</Paragraph> <Paragraph position="2"> Improvement of translation probabilities. By considering many different phrases as different instances of a single phrase class, we reduce the size of our phrase-based (now class-based) translation model and increase the number of occurrences of each unit, producing a model Pr( ~E |~F) with lower perplexity.</Paragraph> <Paragraph position="3"> Generalizing power. Phrases not occurring in the training data can still be classified into a class, and therefore be assigned a probability in the translation model. The new difficulty that arises is how to produce the target phrase from the target class and the source phrase, if this was not seen in training.</Paragraph> </Section> <Section position="3" start_page="68" end_page="68" type="sub_section"> <SectionTitle> 2.3 Difficulties </SectionTitle> <Paragraph position="0"> Two main difficulties2 are associated with this strategy, which will hopefully lead to improved translation performance if tackled properly.</Paragraph> <Paragraph position="1"> Instance probability. On the one hand, when a phrase of the test sentence is classified into a class and then translated, how do we produce the instance of the target class given the tuple T and the source instance?
This problem is mathematically expressed by the need to model the term Pr(~ei |T, ~fj) in Equation 2.</Paragraph> <Paragraph position="2"> At the moment, we learn this model from relative frequency across all tuples that share the same source phrase, dividing the number of times we see the pair ( ~fj, ~ei) in the training data by the number of times we see ~fj. Unseen instances. To produce a target instance ~e given the tuple T and an unseen source instance ~f, our idea is to combine the information of verb forms seen in training with off-the-shelf knowledge for generation.</Paragraph> <Paragraph position="3"> A translation memory can be built with all the seen pairs of instances, with their inflectional affixes separated from base forms.</Paragraph> <Paragraph position="4"> For example, suppose we translate from English to Spanish and see the tuple T=(V[go],V[ir]) in training, with the following instances. (Footnote 2: We take it for granted that this is performed by an independent system based on other knowledge sources, and it is therefore out of scope here.)</Paragraph> <Paragraph position="5"> where the second row is the analyzed form in terms of person (1S: 1st singular, 2S: 2nd singular and so on) and tense (VB: infinitive and P: present, F: future). From these we can build a generalized rule independent of the person, 'PRP(X) will VB', that would enable us to translate 'we will go' into two different alternatives (future and present forms): 'we will go' (VB 1P F) and 'we will go' (VB 1P P). These alternatives can be weighted according to the number of times we have seen each case in training.
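The relative-frequency estimate of the instance model described above can be sketched as follows; the counts are invented for illustration, and the tuple is abstracted away since all pairs below share the same source phrase class.

```python
from collections import Counter

# Toy counts of (source instance, target instance) pairs observed in training
# for tuples sharing the same source class; the values are invented.
pair_counts = Counter({
    ("we will go", "iremos"): 3,  # future form
    ("we will go", "vamos"): 1,   # present form
    ("i go", "voy"): 5,
})

def instance_prob(f_inst, e_inst):
    """Pr(e_inst | T, f_inst) by relative frequency:
    count(f_inst, e_inst) / count(f_inst)."""
    total = sum(c for (f, _), c in pair_counts.items() if f == f_inst)
    return pair_counts[(f_inst, e_inst)] / total if total else 0.0
```

Under these toy counts, 'iremos' would receive probability 0.75 and 'vamos' 0.25 given 'we will go'; the language model then helps choose between them.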
An unambiguous form generator produces the forms 'iremos' and 'vamos' for the two Spanish translations.</Paragraph> </Section> </Section> <Section position="4" start_page="68" end_page="68" type="metho"> <SectionTitle> 3 Classifying Verb Forms </SectionTitle> <Paragraph position="0"> As mentioned above, our first and basic implementation deals with verbs, which are classified unambiguously before alignment in training and before translating a test sentence.</Paragraph> <Section position="1" start_page="68" end_page="68" type="sub_section"> <SectionTitle> 3.1 Rules used </SectionTitle> <Paragraph position="0"> We perform a knowledge-based detection of verbs using deterministic automata that implement a few simple rules based on word forms, POS-tags and word lemmas, and map the resulting expression to the lemma of the head verb (see Figure 1 for some rules and examples of detected verbs). This is done on both the English and the Spanish side, before word alignment.</Paragraph> <Paragraph position="1"> Note that we detect verbs containing adverbs and negations (underlined in Figure 1), which are moved before the verb to improve word alignment with Spanish; once aligned, they are reordered back to their original position inside the detected verb, representing the real instance of this verb.</Paragraph> </Section> </Section> <Section position="5" start_page="68" end_page="70" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> In this section we present experiments with the Spanish-English parallel corpus developed in the framework of the LC-STAR project. This corpus consists of transcriptions of spontaneously spoken dialogues in the tourist information, appointment scheduling and travel planning domains. Therefore, sentences often lack correct syntactic structure. Pre-processing includes:</Paragraph> <Paragraph position="2"> POS-tagging with the tool of (Carreras et al., 2004) (L: lemma, or base form).
This software also generates a lemma or base form for each input word.</Paragraph> <Section position="1" start_page="69" end_page="69" type="sub_section"> <SectionTitle> 4.1 Parallel corpus statistics </SectionTitle> <Paragraph position="0"> Table 1 shows the statistics of the data used, where the columns show the number of sentences, the number of words, the vocabulary size, and the mean sentence length, respectively.</Paragraph> <Paragraph position="1"> sent. words vocab. Lmean There are 116 unseen words in the Spanish test set (1.7% of all words), and 48 unseen words in the English test set (0.7% of all words), a large difference that is expected given the much more inflected nature of Spanish.</Paragraph> </Section> <Section position="2" start_page="69" end_page="69" type="sub_section"> <SectionTitle> 4.2 Verb Phrase Detection/Classification </SectionTitle> <Paragraph position="0"> Table 2 shows the number of detected verbs using the detection rules presented in Section 3.1, and the number of different lemmas they map to. For the test set, the percentages of unseen verb forms and lemmas are also shown.</Paragraph> <Paragraph position="1"> On average, detected English verbs contain 1.81 words, whereas Spanish verbs contain 1.08 words. This is explained by the fact that we include the personal pronouns in English, as well as the modals for future, conditional and other verb tenses.</Paragraph> </Section> <Section position="3" start_page="69" end_page="70" type="sub_section"> <SectionTitle> 4.3 Word alignment results </SectionTitle> <Paragraph position="0"> In order to assess the quality of the word alignment, we randomly selected 350 sentences from the training corpus and manually created a gold-standard alignment with the criterion of Sure and Possible links, in order to compute the Alignment Error Rate (AER) as described in (Och and Ney, 2000) and widely used in the literature, together with appropriately redefined Recall and Precision measures.
Mathematically, they can be expressed as:</Paragraph> <Paragraph position="1"> recall = |A ∩ S| / |S|, precision = |A ∩ P| / |A|, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|) </Paragraph> <Paragraph position="2"> where A is the hypothesis alignment, S is the set of Sure links in the gold standard reference, and P includes the set of Possible and Sure links in the gold standard reference.</Paragraph> <Paragraph position="3"> We have aligned our data using GIZA++ (Och, 2003) from English to Spanish and vice versa (performing 5 iterations of model IBM1 and HMM, and 3 iterations of models IBM3 and IBM4), and have evaluated two symmetrization strategies, namely the union and the intersection, with the union always performing best. Table 3 compares the results when aligning words (current baseline) and when aligning classified verb phrases. In the latter case, after alignment we replace each class with the original verb form, and each new word receives the same links the class had. Of course, adverbs and negations are kept apart from the verb and have separate links.</Paragraph> </Section> <Section position="4" start_page="70" end_page="70" type="sub_section"> <SectionTitle> </SectionTitle> <Paragraph position="0"> Results show a significant improvement in AER, which proves that verbal inflected forms and auxiliaries do harm alignment performance in the absence of the proposed classification.</Paragraph> </Section> <Section position="5" start_page="70" end_page="70" type="sub_section"> <SectionTitle> 4.4 Translation results </SectionTitle> <Paragraph position="0"> We have integrated our classification strategy in an SMT system which implements: * Pr(~ei |~fk) as a tuples language model (Ngram), as done in (Crego et al., 2004) * Pr(e) as a standard Ngram language model using the SRILM toolkit (Stolcke, 2002) Parameters have been optimised for BLEU score on a 350-sentence development set. Three references are available for both development and test sets.
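The Recall, Precision and AER measures used above can be computed directly from link sets, as in this minimal sketch; the example links are invented for illustration.

```python
def alignment_scores(A, S, P):
    """A: hypothesis alignment links; S: Sure gold links; P: Possible plus
    Sure gold links (S is a subset of P). Links are (src, tgt) index pairs.
    Returns (recall, precision, AER) as defined in Och and Ney (2000)."""
    A, S, P = set(A), set(S), set(P)
    recall = len(A & S) / len(S)
    precision = len(A & P) / len(A)
    aer = 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))
    return recall, precision, aer

# Toy example: one Sure link missed, one Possible link hypothesized.
r, p, a = alignment_scores(
    A={(0, 0), (1, 2)},
    S={(0, 0), (1, 1)},
    P={(0, 0), (1, 1), (1, 2)},
)
```

A perfect alignment covering all Sure links with only Possible/Sure links yields an AER of 0.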
Table 4 presents a comparison of English to Spanish translation results for the baseline system and for the configuration with classification (without dealing with unseen instances). Results are promising, as we achieve a significant mWER reduction, while still leaving about 5.6% of the verb forms in the test set without translation. Therefore, we expect a further improvement with the treatment of unseen instances.</Paragraph> <Paragraph position="1"> mWER BLEU
baseline 23.16 0.671
with class. verbs 22.22 0.686</Paragraph> </Section> </Section> </Paper>