<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0832">
  <Title>Gaming Fluency: Evaluating the Bounds and Expectations of Segment-based Translation Memory</Title>
  <Section position="3" start_page="175" end_page="176" type="metho">
    <SectionTitle>
2 A Simple Chinese-English
Translation Memory
</SectionTitle>
    <Paragraph position="0"> For our experiments below, we constructed a simple translation memory from a sentence-aligned parallel corpus. The system consists of three stages. A source-language input string is rewritten to form an information retrieval (IR) query. The IR engine is called to return a list of candidate translation pairs. Finally a single target-language translation as output is chosen.</Paragraph>
    <Section position="1" start_page="175" end_page="175" type="sub_section">
      <SectionTitle>
2.1 Query rewriting
</SectionTitle>
      <Paragraph position="0"> To retrieve a list of translation candidates from the IR engine, we first create a query which is a concatenation of all possible ngrams of the source sentence, for all ngram sizes from 1 to a fixed n.</Paragraph>
      <Paragraph position="1"> We rely on the fact that the Chinese data in the translation memory is tokenized and indexed at the unigram level. Each Chinese character in the source sentence is tokenized individually, and we make use of the IR engine's phrase query feature, which matches documents in which all terms in the phrase appear in consecutive order, to create the ngrams. For example, to produce a trigram + bigram + unigram query for a Chinesesentence of 10 characters, we would create a query consisting of eight three-character phrases, nine two-character phrases, and 10 single-character &amp;quot;phrases&amp;quot;. All phrases are weighted equally in the query.</Paragraph>
      <Paragraph position="2"> This approach allows us to perform lookups for arbitrary ngram sizes. Depending on the specifics of how idf is calculated, this may yield different results from indexing ngrams directly, but it is advantageous in terms of space consumed and scalability to different ngram sizes without reindexing.</Paragraph>
      <Paragraph position="3"> This is a slight generalization of the successful approach to Chineseinformation retrieval using bigrams (Kwok, 1997). Unlike that work, we perform no second stage IR after query expansion. Using a segmentation-independent engineering approach to Chinese IR allows us to sidestep the lack of a strong segmentation standard for our heterogeneous parallel corpus and prepares us to rapidly move to other languages with segmentation or lemmatization challenges.</Paragraph>
    </Section>
    <Section position="2" start_page="175" end_page="176" type="sub_section">
      <SectionTitle>
2.2 The IR engine
</SectionTitle>
      <Paragraph position="0"> Simply for performance reasons, an IR engine, or some other sort of index, is needed to implement a TM (Brown, 2004). We use the open-source Lucene v1.4.3, (Apa, 2004) as our IR engine. Lucene scores candidate segments from the parallel text using a modified tf-idf formula that includes normalizations for the input segment length and the candidate segment length.</Paragraph>
      <Paragraph position="1"> We did not modify any Lucene defaults for these experiments.</Paragraph>
      <Paragraph position="2"> To form our translation memory, we indexed all sentence pairs in the translation memory corpora, each pair as a separate document. We</Paragraph>
      <Paragraph position="4"> However , everything depended on the missions to be decided by the Security Council .</Paragraph>
      <Paragraph position="5"> The presentations focused on the main lessons learned from their activities in the field .</Paragraph>
      <Paragraph position="6"> It is wrong to commit suicide or to use ones own body as a weapon of destruction .</Paragraph>
      <Paragraph position="7"> There was practically full employment in all sectors .</Paragraph>
      <Paragraph position="8"> One reference translation (of four) Doug Collins said, &amp;quot;He may appear any time. It really depends on how he feels.&amp;quot; At present, his training is defense oriented but he also practices shots.</Paragraph>
      <Paragraph position="9"> He is elevating the intensity to test whether his body can adapt to it.</Paragraph>
      <Paragraph position="10"> So far as his knee is concerned, he thinks it heals a hundred percent after the surgery.&amp;quot;  indexed in such a way that IR searches can be restricted to just the source language side or just the target language side.</Paragraph>
    </Section>
    <Section position="3" start_page="176" end_page="176" type="sub_section">
      <SectionTitle>
2.3 Rescoring
</SectionTitle>
      <Paragraph position="0"> The IR engine returns a list of candidate translation pairs based on the query string, and the final stage of the TM process is the selection of a single target-language output sentence from that set.</Paragraph>
      <Paragraph position="1"> We consider a variety of selection metrics in the experiments below. For each metric, the source-language side of each pair in the candidate list is evaluated against the original source language input string. The target language segment of the pair with the highest score is then output as the translation.</Paragraph>
      <Paragraph position="2"> In the case of automated MT evaluation metrics, which are not necessarily symmetric, the source-language input string is treated as the reference and the source-language side of each pair returned by the IR engine as the hypothesis. null All tie-breaking is done via tf-idf, i.e. if multiple entries share the same score, the one ranked higher by the search engine will be output.</Paragraph>
      <Paragraph position="3"> Table 1 gives a typical example of how the TM performs. Four contiguous source segments are presented, followed by TM output and finally one of the reference translations for those source segments. The only indicator of the translation quality available to monolingual English speakers is the awkwardness of the segments as a group. By design, the TM performs with perfect fluency at the segment level.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="176" end_page="178" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> We performed several experiments in the course of optimizing this TM, all using the same set of parallel texts for the TM database and multiple-reference translation corpus for evalutation. The parallel texts for the TM come from several Chinese-English parallel corpora, all available from the Linguistic Data Consortium (LDC). These corpora are described in Table 2. We discarded any sentence pairs that seemed trivially incomplete, corrupt, or otherwise invalid. In the case of LDC2002E18, in which sentences were aligned automatically and confidence scores produced for each alignment, we dropped all pairs with scores above 9, indicating poor alignment. No duplication checks were performed. Our final corpus contained approximately 7 million sentence pairs and contained 3.2 GB of UTF-8 data.</Paragraph>
    <Paragraph position="1"> Our evaluation corpus and reference corpus  come from the data used in the NIST 2002 MT competition. (NIST, 2002). The evaluation corpus is 878 segments of Chinese source text. The reference corpus consists of four independent human-generated reference English translations of the evaluation corpus.</Paragraph>
    <Paragraph position="2"> All performance measurements were made using a fast reimplementation of NIST's bleu.</Paragraph>
    <Paragraph position="3"> bleu exhibits a high correlation with human judgments of translation quality when measuring on large sections of text (Papineni et al., 2001). Furthermore, using bleu allowed us to compare our performance to that of other systems that have been tested with the same evaluation data.</Paragraph>
    <Section position="1" start_page="177" end_page="177" type="sub_section">
      <SectionTitle>
3.1 An upper bound on whole-segment
translation memory
</SectionTitle>
      <Paragraph position="0"> Our first experiment was to determine an upper bound for the entire translation memory corpus. In other words, given an oracle that picks the best possible translation from the translation memory corpus for each segment in the evaluation corpus, what is the bleu score for the resulting document? This score is unlikely to approach the maximum, bleu = 100, because this oracle is constrained to selecting a translation from the target language side of the parallel corpus. All of the calculations for this experiment are performed on the target language side of the parallel text.</Paragraph>
      <Paragraph position="1"> We were able to take advantage of a trait particular to bleu for this experiment, avoiding many of bleu score calculations required to assess all of the 878 x 7.5 million combinations. bleu produces a score of 0 for any hypothesis string that doesn't share at least one 4-gram with one reference string. Thus, for each set of four references, we created a Lucene query that returned all translation pairs which matched at least one 4-gram with one of the references. We picked the top segment by calculating bleu scores against the references, and created a hypothesis document from these segments. null Note that, for document scores, bleu's brevity penalty (BP) is applied globally to an entire document and not to individual segments.</Paragraph>
      <Paragraph position="2"> Thus, the document score does not necessarily increase monotonically with increases in scores of individual segments. As more than 99% of the segment pairs we evaluated yielded scores of zero, we felt this would not have a significant effect on our experiments. Also, the TM does not have much liberty to alter the length of the returned segments. Individual segments were chosen to optimize bleu score, and the resulting documents exhibited appropriately increasing scores. While there is no efficient strategy for whole-document bleu maximization, an iterative rescoring of the entire document while optimizing the choice of only one candidate segment at a time could potentially yield higher scores than those we report here.</Paragraph>
    </Section>
    <Section position="2" start_page="177" end_page="177" type="sub_section">
      <SectionTitle>
3.2 TM performance with varied
Ngram length
</SectionTitle>
      <Paragraph position="0"> The second experiment was to determine the effect that different ngram sizes in the Chinese IR query have on the IR engine's ability to retrieve good English translations.</Paragraph>
      <Paragraph position="1"> We considered cumulative ngram sizes from 1 to 7, i.e. unigram, unigram + bigram, unigram + bigram + trigram, and so on. For each set of ngram sizes, we created a Lucene query for every segment of the (Chinese) evaluation corpus. We then produced a hypothesis document by combining the English sides of the top results returned by Lucene for each query. The hypothesis document was evaluated against the reference corpora by calculating a bleu score.</Paragraph>
      <Paragraph position="2"> While it was observed that IR performance is maximized by performing bigram queries (Kwok, 1997), we had reason to believe the TM would not be similar. TMs must attempt to match short sequences of stop words that indicate grammar as well as more traditional content words. Note that our system performed neither stemming nor stop word (or ngram) removal on the input Chinese strings.</Paragraph>
    </Section>
    <Section position="3" start_page="177" end_page="178" type="sub_section">
      <SectionTitle>
3.3 An upper bound on TM N-best list
rescoring
</SectionTitle>
      <Paragraph position="0"> The next experiment was to determine an upper bound on the performance of tf-idf for different result set sizes, i.e., for different (maximum) numbers of translation pairs returned by the IR engine. This experiment describes the trade-off between more time spent in the IR engine creating a longer list of returns and the potential increase in translation score. [Table 2 caption fragment: the &quot;pairs&quot; column gives the number of translation pairs available after trivial pruning.]</Paragraph>
      <Paragraph position="1"> To determine how much IR was &amp;quot;enough&amp;quot; IR, we performed an oracle experiment on different IR query sizes. For each segment of the evaluation corpus, we performed a cumulative 4-gram query as described in Section 4.2. We produced the n-best list oracle's hypothesis document by selecting the English translation from this result set with the highest bleu score when evaluated against the corresponding segment from the reference corpus. We then evaluated the hypothesis documents against the reference corpus by computing bleu scores.</Paragraph>
    </Section>
    <Section position="4" start_page="178" end_page="178" type="sub_section">
      <SectionTitle>
3.4 N-best list rescoring with several
MT evaluation metrics
</SectionTitle>
      <Paragraph position="0"> The fourth experiment was to determine whether we could improve upon tf-idf by applying automated MT metrics to pick the best sentence from the top n translation pairs returned by the IR engine. We compared a variety of metrics from the MT evaluation literature. All of these were run on the tokens in the source-language side of the IR result, comparing against a single pseudo-reference: the original source-language segment. While many of these metrics are not designed to perform well with one reference, they stand in as good approximate string matching algorithms.</Paragraph>
      <Paragraph position="1"> The score that the IR engine associates with each segment is retained and marked as tf-idf in this experiment. Naturally, bleu (Papineni et al., 2001) was the first choice metric, as it was well-matched to the target language evaluation function. rouge was a reimplementation of ROUGE-L from (Lin and Och, 2004). It computes an F-measure from precision and recall that are both based on the longest common sub-sequence of the hypothesis and reference strings. wer-g is a variation on traditional word error rate that was found to correlate very well with human judgments (Foster et al., 2003), and per is thetraditional position-independenterrorrate that was also shown to correlate well with human judgments (Leusch et al., 2003). Finally, a random metric was added to show the bleu value one could achieve by selecting from the top n strictly by chance.</Paragraph>
      <Paragraph position="2"> After the individual metrics are calculated for these segments, a uniform-weight log-linear combination of the metrics is calculated and used to produce a new rank ordering under the belief that the different metrics will make predictions that are constructive in aggregate.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>