<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0313"> <Title>Translation Spotting for Translation Memories</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> We describe here a series of experiments that were carried out to evaluate the performance of the TS methods described in section 3. We essentially identified a number of SL queries, looked up these segments in a TM to extract matching pairs of SL-TL sentences, and manually identified the TL tokens corresponding to the SL queries in each of these pairs, hence producing manual TS's. We then submitted the same sentence-pairs and SL queries to each of the proposed TS methods, and measured how the TL answers produced automatically compared with those produced manually. We describe this process and the results we obtained in more details below.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Test Material </SectionTitle> <Paragraph position="0"> The test material for our experiments was gathered from a translation memory, made up of approximately 14 years of Hansard (English-French transcripts of the Canadian parliamentary debates), i.e. all debates published between April 1986 and January 2002, totalling over 100 million words in each language. These documents were mostly collected over the Internet, had the HTML markup removed, were then segmented into paragraphs and sentences, aligned at the sentence level using an implementation of the method described in (Simard et al., 1992), and finally dumped into a document-retrieval system (MG (Witten et al., 1999)). We call this the Hansard TM.</Paragraph> <Paragraph position="1"> To identify SL queries, a distinct document from the Hansard was used, the transcript from a session held in March 2002. The English version of this document was segmented into syntactic chunks, using an implementation of Osborne's chunker (Osborne, 2000). All sequences of chunks from this text that contained three or more word tokens were then looked up in the Hansard TM. Among the sequences that did match sentences in the TM, 100 were selected at random. These made up the While some SL queries yielded only a handful of matches in the TM, others turned out to be very productive, producing hundreds (and sometimes thousands) of couples. For each test segment, we retained only the 100 first matching pair of sentences from the TM. This process yielded 4100 pairs of sentences from the TM, an average of 41 per SL query; we call this our test corpus.</Paragraph> <Paragraph position="2"> Within each sentence pair, we spotted translations manually, i.e. we identified by hand the TL word-tokens corresponding to the SL query for which the pair had been extracted. These annotations were done following the TS guidelines proposed by V'eronis (1998); we call this the reference TS.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Evaluation Metrics </SectionTitle> <Paragraph position="0"> The results of our TS methods on the test corpus were compared to the reference TS, and performance was measured under different metrics. Given each pair <S,T> from the test corpus, and the corresponding reference and evaluated TL answers r[?] and r, represented as sets of tokens, we computed: exactness : equal to 1 if r[?] 
</Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Experiments </SectionTitle> <Paragraph position="0"> We tested all three methods presented in section 3, as well as the three &quot;post-processings&quot; on Viterbi TS proposed in section 3.2. All of these methods are based on IBM Model 2. The same model parameters, computed with the GIZA program of the Egypt toolkit (Al-Onaizan et al., 1999), were used for all the experiments reported here. Training was performed on a subset of about 20% of the Hansard TM. The results of our experiments are presented in Table 1.</Paragraph> <Paragraph position="1"> The Zero-tolerance post-processing produces empty TL answers whenever the TL tokens are not contiguous. On our test corpus, over 70% of all Viterbi alignments turned out to be non-contiguous. These empty TL answers were counted in the statistics above (Viterbi + Zero-tolerance row), which explains the low performance obtained with this method. In practice, the intention of Zero-tolerance post-processing is to filter out non-contiguous answers, under the hypothesis that they would probably not be usable in a TM application. Table 2 presents the performance of this method, taking into account only non-empty answers.</Paragraph>
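<Paragraph position="2"> For concreteness, here is a minimal sketch (our reconstruction, not the authors' implementation) of the contiguity test behind the Zero-tolerance filter, again over sets of TL token positions:

# Zero-tolerance post-processing: keep a Viterbi TL answer only if its
# token positions form one unbroken block; otherwise return an empty answer.

def is_contiguous(positions):
    if not positions:
        return False
    return len(set(positions)) == max(positions) - min(positions) + 1

def zero_tolerance(tl_answer):
    return tl_answer if is_contiguous(tl_answer) else set()
</Paragraph>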
</Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Discussion </SectionTitle> <Paragraph position="0"> Globally, in terms of exactness, compositional TS produces the best TL answers, with 40% correct answers, an improvement of 135% over plain Viterbi TS. This gain is impressive, particularly considering the fact that all methods use exactly the same data. In more realistic terms, the gain in F-measure is over 20%, which is still considerable.</Paragraph> <Paragraph position="1"> The best results in terms of precision are obtained with contiguous TS, which in fact is not far behind compositional TS in terms of recall either. This clearly demonstrates the impact of a simple contiguity constraint in this type of TS application. Overall, the best recall figures are obtained with the simple Expansion post-processing on Viterbi TS, but at the cost of a sharp decrease in precision. Considering that precision is possibly more important than recall in a TM application, contiguous TS would probably be a good choice.</Paragraph> <Paragraph position="2"> The Zero-tolerance strategy, used as a filter on Viterbi alignments, turns out to be particularly effective. It is interesting to note that this method is equivalent to the one proposed by Marcu (2001) to automatically construct a sub-sentential translation memory. Taking only non-null TS's into consideration, it outclasses all other methods, regardless of the metric. But this comes at the cost of eliminating numerous potentially useful TL answers (more than 70%). This is particularly frustrating, considering that over 90% of all TL answers in the reference are indeed contiguous.</Paragraph> <Paragraph position="3"> To understand how this happens, one must go back to the definition of IBM-style alignments, which specifies that each SL token is linked to at most one TL token.</Paragraph> <Paragraph position="4"> This has a direct consequence on Viterbi TS's: if the SL query contains K word-tokens, then the TL answer will itself contain at most K tokens. As a result, this method has systematic problems when the actual TL answer is longer than the SL query. It turns out that this occurs very frequently, especially when aligning from English to French, as is the case here. For example, consider the English sequence airport security, most often translated into French as sécurité dans les aéroports. The Viterbi alignment normally produces the links airport - aéroport and security - sécurité, and the sequence dans les is then left behind (or accidentally picked up by erroneous links from other parts of the SL sentence), thus leaving a non-contiguous TL answer.</Paragraph> <Paragraph position="5"> The Expansion post-processing, which finds the shortest possible sequence that covers all the tokens of the Viterbi TL answer, solves the problem in simple situations such as the one in the above example. But in general, integrating contiguity constraints directly into the search procedure (contiguous and compositional TS) turns out to be much more effective, without solving the problem entirely. This is explained in part by the fact that these techniques are also based on IBM-style alignments.</Paragraph> <Paragraph position="6"> When &quot;surplus&quot; words appear at the boundaries of the TL answer, these words are not counted in the alignment probability, and so there is no particular reason to include them in the TL answer. Consider the following example: * These companies indicated their support for the government 's decision.</Paragraph> <Paragraph position="7"> * Ces compagnies ont déclaré qu' elles appuyaient la décision du gouvernement .</Paragraph> <Paragraph position="8"> When looking for the French equivalent of the English indicated their support, we will probably end up with an alignment that links indicated - déclaré and support - appuyaient. As a result of the contiguity constraints, the TL sequence qu' elles will naturally be included in the TL answer, possibly forcing a link their - elles in the process. However, the only SL token that could be linked to ont is the verb indicated, which is already linked to déclaré. As a result, ont will likely be left behind in the final alignment, and will not be counted when computing the alignment's probability.</Paragraph>
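<Paragraph position="9"> To close, a minimal sketch of the Expansion post-processing as described above (our reconstruction, under the same set-of-positions representation): it returns the shortest contiguous span of TL positions covering all tokens of the Viterbi answer.

# Expansion post-processing: widen a possibly non-contiguous Viterbi TL
# answer to the shortest contiguous span covering all of its tokens.

def expand(tl_answer):
    if not tl_answer:
        return set()
    return set(range(min(tl_answer), max(tl_answer) + 1))

# E.g. for "sécurité dans les aéroports" with positions 0..3, a Viterbi
# answer {0, 3} (sécurité ... aéroports) expands to {0, 1, 2, 3},
# recovering the full contiguous phrase.
</Paragraph> </Section> </Section> </Paper>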