File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/05/w05-0830_intro.xml

Size: 7,026 bytes

Last Modified: 2025-10-06 14:03:11

<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0830">
  <Title>Deploying Part-of-Speech Patterns to Enhance Statistical Phrase-Based Machine Translation Resources</Title>
  <Section position="3" start_page="0" end_page="164" type="intro">
    <SectionTitle>
2 System
</SectionTitle>
    <Paragraph position="0"> The resources used for our experiments are as follows. We used the GIZA++ statistical machine translation toolkit to generate a bilingual translation table from the sentence-aligned French-English Europarl parallel corpus. We also used a phrase table generated from the Europarl French-English corpus and a training test set of 2000 French and English sentences, both made available on the webpage of the ACL 2005 workshop. Part-of-speech tagging was performed with the TreeTagger, a probabilistic part-of-speech tagger and lemmatizer. The decoder used to produce machine translations was Pharaoh, version 1.2.3.</Paragraph>
    <Paragraph position="1"> We used GIZA++ to generate a translation table from the parallel corpus. The resulting table consisted of individual words and phrases, each followed by its corresponding translation and a probability value. Specifically, every line of the table consisted of a French entry (one or more tokens), followed by an English entry (one or more tokens), followed by P(f|e), the probability of the French entry f given the English entry e. We added the GIZA++-generated table to the phrase-based translation table downloaded from the workshop webpage. During this merging of translation tables, no word or phrase was omitted, replaced or altered. We combined the two translation tables in order to achieve better coverage. We called the resulting merged translation table the lexical phrase table.</Paragraph>
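    <Paragraph> As an illustration only, the table format and the merging step described above could be sketched as follows. The " ||| " field separator, the Entry type, and the function names are assumptions made for this sketch and are not taken from the experiments; duplicate entries are simply kept, since the text only states that nothing was omitted, replaced or altered.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    """One line of a translation table (hypothetical representation)."""
    french: str    # one or more French tokens
    english: str   # one or more English tokens
    prob: float    # P(f|e)


def parse_line(line: str, sep: str = " ||| ") -> Entry:
    # The separator is an assumption; adjust it to the actual table format.
    french, english, prob = line.rstrip("\n").split(sep)
    return Entry(french.strip(), english.strip(), float(prob))


def merge_tables(giza_table: List[Entry], workshop_table: List[Entry]) -> List[Entry]:
    # Plain concatenation: no entry is dropped, replaced or altered.
    return list(workshop_table) + list(giza_table)
```

    Concatenation is the simplest reading of the merging step; any deduplication or rescoring of repeated entries would be a different method.</Paragraph>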
    <Paragraph position="2"> In order to utilize the syntactic information stemming from our resources, we used the TreeTagger to tag both the parallel corpus and the lexical phrase table. The probability values included in the lexical phrase table were not tagged. The TreeTagger uses a slightly modified version of the Penn Treebank tagset, different for each language.</Paragraph>
    <Paragraph position="3"> In order to achieve tag-uniformity, we performed the following dual tag-smoothing operation.</Paragraph>
    <Paragraph position="4"> Firstly, we changed the French tags into their English equivalents, e.g. NOM (noun - French) became NN (noun - English). Secondly, we simplified the tags, so that they reflected nothing more than general part-of-speech information. Tags denoting predicate-argument structures, wh-movement, passive voice, inflectional variation, and so on, were simplified. For example, NNS (noun - plural) became NN (noun).</Paragraph>
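    <Paragraph> A minimal sketch of this two-step tag smoothing, assuming the mappings are stored as simple lookup tables, is given below. Only the NOM to NN and NNS to NN mappings come from the text; the other entries and all names are hypothetical and serve purely as illustration.

```python
from typing import Dict, List

# Step 1: French tags mapped to English equivalents.
FRENCH_TO_ENGLISH: Dict[str, str] = {
    "NOM": "NN",  # from the text
    "ADJ": "JJ",  # assumed for illustration
}

# Step 2: fine-grained tags reduced to general part-of-speech tags.
SIMPLIFY: Dict[str, str] = {
    "NNS": "NN",  # from the text
    "VBZ": "VB",  # assumed for illustration
}


def smooth_tag(tag: str) -> str:
    # Apply both steps in order; unknown tags pass through unchanged.
    tag = FRENCH_TO_ENGLISH.get(tag, tag)
    return SIMPLIFY.get(tag, tag)


def smooth_pattern(tags: List[str]) -> List[str]:
    return [smooth_tag(t) for t in tags]
```
</Paragraph>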
    <Paragraph position="5"> Once our resources were uniformly tagged, we used them to extract part-of-speech correspondences between the two languages. Specifically, we extracted a sentence-aligned parallel corpus of French and English part-of-speech patterns from the tagged Europarl parallel corpus. We called this corpus of parallel and corresponding part-of-speech patterns the pos-corpus. The format of the pos-corpus remained identical to the format of the original parallel corpus, with the sole difference that individual words were replaced by their corresponding part-of-speech tags. Similarly, we extracted a translation table of part-of-speech patterns from the tagged lexical phrase table. We called this part-of-speech translation table the pos-table. The pos-table had exactly the same format as the lexical phrase table, with the sole difference that individual words were replaced by their corresponding part-of-speech tags. The translation probability values included in the lexical phrase table were copied to the pos-table intact.</Paragraph>
    <Paragraph position="6"> Each of the part-of-speech patterns contained in the pos-corpus was matched against the part-of-speech patterns contained in the pos-table. Matching was realized similarly to conventional left-to-right string matching operations. A match was considered successful only when the two patterns were absolutely identical, and not when one pattern was merely contained in, or part of, a longer pattern. When a perfect match was found, the translation probability value of the specific pattern in the pos-table was increased to the maximum value of 1. If the score was already 1, it remained unchanged. When there were no matches, values remained unchanged. We chose to match identical part-of-speech patterns, and not to accept partial pattern matches, because the latter would require a revision of our probability recomputation method. This point is discussed in section 3 of this paper.</Paragraph>
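    <Paragraph> The exact-match rule above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: patterns are represented as space-separated tag strings, a match is taken to require that both the French and the English pattern of a pos-table entry are identical to a pattern pair in the pos-corpus, and all names are hypothetical.

```python
from typing import List, Tuple

PosEntry = Tuple[str, str, float]   # (French pattern, English pattern, probability)
PatternPair = Tuple[str, str]       # (French pattern, English pattern)


def boost_exact_matches(pos_table: List[PosEntry],
                        pos_corpus: List[PatternPair]) -> List[PosEntry]:
    # A set lookup accepts only identical patterns, never sub-patterns.
    seen = set(pos_corpus)
    boosted = []
    for french, english, prob in pos_table:
        if (french, english) in seen:
            prob = 1.0  # raised to the maximum; a score of 1 stays 1
        boosted.append((french, english, prob))
    return boosted
```

    Entries without an identical counterpart in the pos-corpus keep their original probability, as described in the text.</Paragraph>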
    <Paragraph position="7"> Once all matching was complete, the newly enhanced pos-table, which now contained translation probability scores reflecting the syntactic features of the relevant languages, was used to update the original lexical phrase table. This update consisted of matching every part-of-speech pattern with its original lexical phrase, and replacing the initial translation probability score with the value contained in the pos-table. The identification of the original lexical phrase that generated each part-of-speech pattern was facilitated by the use of pattern-identifiers (pos-ids) and phrase-identifiers (phrase-ids), which were introduced at a very early stage in the process for that purpose.</Paragraph>
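    <Paragraph> One way to picture this update step is sketched below. The exact form of the pos-ids and phrase-ids is not specified in the text, so the sketch assumes that a lexical phrase and the part-of-speech pattern derived from it share one integer identifier; the function and variable names are likewise hypothetical.

```python
from typing import Dict, Tuple

LexEntry = Tuple[str, str, float]  # (French phrase, English phrase, probability)


def update_lexical_table(lexical: Dict[int, LexEntry],
                         pos_probs: Dict[int, float]) -> Dict[int, LexEntry]:
    # Keep every lexical entry; only the probability is replaced by the
    # value stored under the same identifier in the pos-table.
    updated = {}
    for phrase_id, (french, english, prob) in lexical.items():
        new_prob = pos_probs.get(phrase_id, prob)  # fall back if the id is missing
        updated[phrase_id] = (french, english, new_prob)
    return updated
```
</Paragraph>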
    <Paragraph position="8"> The resulting translation phrase table contained exactly the same entries as the lexical phrase table, but had different probability scores assigned to some of these entries, in line with the parallel part-of-speech co-occurrences and correspondences found in the Europarl corpus. We called this table the enhanced phrase table. Table 1 illustrates the process described above with the example of a phrase, the part-of-speech analysis of which has been used to increase its original translation probability. [Table 1 caption, truncated in the source: "... speech pattern to increase translation probability."]</Paragraph>
    <Paragraph position="9"> We used the Pharaoh decoder first with our lexical phrase table and then with our enhanced phrase table, in order to generate statistical machine translations of the French and English training test set in both translation directions. We measured performance using the BLEU score [Papineni et al., 2001], which estimates the accuracy of translation output with respect to a reference translation. For both source-target language combinations, the lexical phrase table received a slightly lower score than the enhanced phrase table.</Paragraph>
    <Paragraph position="10"> The difference between these two approaches is not statistically significant (p-value &gt; 0.05). The results of our experiments are displayed in Table 2 and discussed</Paragraph>
  </Section>
</Paper>