<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2090">
  <Title>Multi-level Similar Segment Matching Algorithm for Translation Memories and Example-Based Machine Translation</Title>
  <Section position="3" start_page="625" end_page="626" type="intro">
    <SectionTitle>
3 Optimizing
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="625" end_page="625" type="sub_section">
      <SectionTitle>
3.1 Triangularization of the array
</SectionTitle>
      <Paragraph position="0"> In this algorithm, for each Ij, there must be at least one possible matching Ci. Hence, a valid path contains at least m matches. As a match between Ci and Ij occurs when &quot;stepping across a diagonal&quot;, the first (m-1) diagonals (from the lower left corner of the array) cannot yield a valid path. Therefore, we do not calculate d[i,j] across these small diagonals.</Paragraph>
      <Paragraph position="1"> Symmetrically, the small diagonals after the last full one (in the upper right corner) cannot yield a valid path. We therefore also eliminate these last (m-1) diagonals. This gives a reduced matrix, as shown in the new example in Figure 10. The computed cells then lie in a parallelogram of dimensions (n-m+1) and m.</Paragraph>
      <Paragraph position="2"> The result is that only m(n-m+1) cells have to be computed. Instead of initializing the first row (row 0) to &quot;inf&quot;, we initialize the cells of the diagonal just before the last full top diagonal (between cell (0,1) and cell (3,4) in Figure 10) to &quot;inf&quot; to be sure that no insertion is possible.</Paragraph>
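      <Paragraph> The cell count above can be checked with a minimal sketch. The indexing is an assumption made for illustration and is not taken from Figure 10: indices are 1-based, i runs over the m words of I, j runs over the n words of C, and every word of I must be matched, in order, to a distinct word of C. Under that assumption, row i can only use columns i to i+(n-m). The function name band_cells is hypothetical.</Paragraph>

```python
def band_cells(m, n):
    """Return the (i, j) cells kept after dropping the short corner diagonals.

    Assumed convention (illustrative, not the paper's): 1-based indices,
    i over the m words of I, j over the n words of C.
    """
    if m > n:
        return []  # I cannot be matched word for word inside a shorter C
    width = n - m + 1  # admissible columns per row
    # Row i needs i-1 words of C before column j and m-i words after it.
    return [(i, j) for i in range(1, m + 1) for j in range(i, i + width)]


cells = band_cells(4, 7)
print(len(cells))   # 16, i.e. m * (n - m + 1)
print(cells[:4])    # [(1, 1), (1, 2), (1, 3), (1, 4)]
```

      <Paragraph> Each row contributes n-m+1 cells, which gives the parallelogram of m(n-m+1) cells described above.</Paragraph>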
    </Section>
    <Section position="2" start_page="625" end_page="626" type="sub_section">
      <SectionTitle>
3.2 Complexity
</SectionTitle>
      <Paragraph position="0"> The worst-case time complexity of this algorithm is proportional to F times the number of cells in the computed array, which is m*(n-m+1). With the &quot;lazy&quot; strategy, often not all F levels are visited. As the W&amp;F algorithm computes m*n cells, our algorithm never computes more cells, and computes strictly fewer as soon as m exceeds 1. Backtracking takes m+n operations in the W&amp;F algorithm as well as in ours, leading to m(n-m+2)+n operations in the MSSM algorithm and m(n+1)+n operations in the W&amp;F algorithm.</Paragraph>
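      <Paragraph> The two operation counts can be compared numerically with a short sketch. The function names are hypothetical, and the counts are simply the formulas stated above (computed cells plus m+n backtracking steps), with the factor F left out.</Paragraph>

```python
def mssm_operations(m, n):
    """Cells of the reduced array plus m + n backtracking steps."""
    return m * (n - m + 1) + (m + n)   # equals m*(n-m+2) + n


def wf_operations(m, n):
    """Cells of the full m-by-n array plus m + n backtracking steps."""
    return m * n + (m + n)             # equals m*(n+1) + n


# Fixed n = 10 with a short, a medium and a full-length I.
for m in (1, 5, 10):
    print(m, mssm_operations(m, 10), wf_operations(m, 10))
# 1 21 21
# 5 45 65
# 10 30 120
```

      <Paragraph> Backtracking is included in both counts so that the totals match the closed forms given in the text.</Paragraph>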
      <Paragraph position="1"> The general complexity is then sub-quadratic.</Paragraph>
      <Paragraph position="2"> When the lengths of the two segments to be compared are similar (as often happens in TMs), the complexity tends towards linearity.</Paragraph>
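      <Paragraph> As a worked illustration of this tendency, write n = m + k, where k is a symbol introduced here (not in the original) for the length difference between the two segments. The cell counts of the reduced and full arrays then become:</Paragraph>

```latex
\[
  m\,(n - m + 1) = m\,(k + 1), \qquad m\,n = m\,(m + k).
\]
```

      <Paragraph> For a fixed small k, the reduced array grows linearly in m, whereas the full array grows quadratically.</Paragraph>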
      <Paragraph position="3"> The two graphs in Figure 11 show two interesting particular cases (m=n, and m running from 1 to n=10), comparing W&amp;F and our algorithm. For strings of similar lengths, the longer they are, the more advantageous the MSSM algorithm becomes. When n is fixed, the MSSM algorithm is more advantageous for extreme values of the length of I: small, and close to n.</Paragraph>
      <Paragraph position="4"> Conclusions: The first contribution of this algorithm is to provide TM and EBMT systems with a precise and quick way to compare segments of words with a similarity vector. This almost completely eliminates noise when retrieving similar sentences in TM systems (97% &quot;reusability&quot; in our prototype). The second is to offer an unambiguous word-to-word matching through the &quot;trace&quot;. This last point opens the way to the Shallow Translation paradigm.</Paragraph>
    </Section>
  </Section>
</Paper>