<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2090"> <Title>Multi-level Similar Segment Matching Algorithm for Translation Memories and Example-Based Machine Translation</Title> <Section position="3" start_page="625" end_page="626" type="intro"> <SectionTitle> 3 Optimizing </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="625" end_page="625" type="sub_section"> <SectionTitle> 3.1 Triangularization of the array </SectionTitle> <Paragraph position="0"> In this algorithm, for each Ij, there must be at least one possible matching Ci. Hence, a valid path contains at least m matches. As a match between Ci and Ij occurs when &quot;stepping across a diagonal&quot;, the first (m-1) diagonals (from the lower left corner of the array) cannot yield a valid path. Therefore, we do not calculate d[i,j] across these small diagonals.</Paragraph> <Paragraph position="1"> Symmetrically, the small diagonals after the last full one (in the upper right corner) cannot yield a valid path either, so we also eliminate these last (m-1) diagonals. This gives a reduced matrix, as shown in the new example in Figure 10. The computed cells then lie in a parallelogram of dimensions (n-m+1) and m.</Paragraph> <Paragraph position="2"> As a result, only m(n-m+1) cells have to be computed. Instead of initializing the first row (row 0) to &quot;inf&quot;, we initialize the cells of the diagonal just before the last full top diagonal (between cell (0,1) and cell (3,4) in Figure 10) to &quot;inf&quot; to be sure that no insertion is possible.</Paragraph> </Section> <Section position="2" start_page="625" end_page="626" type="sub_section"> <SectionTitle> 3.2 Complexity </SectionTitle> <Paragraph position="0"> The worst-case time complexity of this algorithm is proportional to F times the number of cells in the computed array, which is m(n-m+1). With the &quot;lazy&quot; strategy, the F levels are often not all visited.
As the number of cells computed by the W&amp;F algorithm is m*n, our algorithm is always faster. Backtracking takes m+n operations in the W&amp;F algorithm as well as in ours, leading to a total of m(n-m+2)+n operations for the MSSM algorithm and m(n+1)+n operations for the W&amp;F algorithm.</Paragraph> <Paragraph position="1"> The general complexity is then sub-quadratic.</Paragraph> <Paragraph position="2"> When the lengths of the two segments to be compared are similar (as often happens in TMs), the complexity tends towards linearity.</Paragraph> <Paragraph position="3"> The two graphs in Figure 11 show two interesting particular cases (m=n, and m running from 1 to n=10), comparing W&amp;F and our algorithm. For strings of similar lengths, the longer they are, the more advantageous the MSSM algorithm becomes. When n is fixed, the MSSM algorithm is more advantageous for extreme values of the length of I: small values and values close to n.</Paragraph> <Paragraph position="4"> Conclusions. The first contribution of this algorithm is to provide TM and EBMT systems with a precise and quick way to compare segments of words with a similarity vector. This leads to an almost complete eradication of noise when retrieving similar sentences in TM systems (97% &quot;reusability&quot; in our prototype). The second is to offer an unambiguous word-to-word matching through the &quot;trace&quot;. This last point opens the way to the Shallow Translation paradigm.</Paragraph> </Section> </Section> </Paper>