File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/c00-2090_metho.xml
Size: 2,440 bytes
Last Modified: 2025-10-06 14:07:11
<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2090"> <Title>Multi-level Similar Segment Matching Algorithm for Translation Memories and Example-Based Machine Translation</Title> <Section position="4" start_page="626" end_page="626" type="metho"> <SectionTitle> MSSM algorithms </SectionTitle> <Paragraph position="0"> For more information about the use of this algorithm, please refer to Planas (1999). These two contributions are the main difference from related research [4], which concentrates on similarity alone, represented by a single integer.</Paragraph> <Paragraph position="1"> The TELA structure, which allows the parallel use of different layers of analysis (linguistic paradigms, but possibly also non-linguistic information), is essential to this work because it provides the algorithm with the supplementary information that classical systems lack.</Paragraph> <Paragraph position="2"> Whether or not the shallow parser (lemmas, POS) is ambiguous does not significantly affect the performance of the algorithm. If the same parser is used for both example and input segments, parallel errors compensate for each other.</Paragraph> <Paragraph position="3"> Of course, these errors do matter for EBMT, where non-ambiguity is a must.</Paragraph> <Paragraph position="4"> A first evaluation of the MSSM speed gives 0.5 to 2 milliseconds for the comparison alone [5] of two randomly chosen English or Japanese sentences over 3 levels (words, lemmas, POS). [4] Cranias et al. (1997), Thompson & Brew (1994), or, more specifically, Lepage (1998)</Paragraph> </Section> <Section position="5" start_page="626" end_page="626" type="metho"> <SectionTitle> [5] Without the shallow analysis </SectionTitle> <Paragraph position="0"> The implementation was run on a DELL Optiplex GX1 (233 MHz) under Windows NT with Java 1.1.8.</Paragraph> <Paragraph position="1"> This algorithm can be improved in different ways. 
For speed, we can introduce a similarity threshold so that the last cells of the columns of the computed array are no longer evaluated once the threshold is exceeded. For adaptability, handling a different number of tokens on each layer will let us deal cleanly with compound words.</Paragraph> <Paragraph position="2"> In short, although this matching algorithm is based on the W&F algorithm, other algorithms can be adapted similarly to deal with multi-level data.</Paragraph> </Section></Paper>
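
The similarity threshold proposed for speed can be illustrated with a minimal sketch. It is not the MSSM implementation: it is a plain edit-distance dynamic program in the W&F style, extended under stated assumptions. The token layout (word, lemma, POS), the cost constants, and the function names `substitution_cost` and `bounded_distance` are all hypothetical. The cutoff used here also abandons the whole comparison once every cell of a row exceeds the threshold, which is simpler than skipping only the last cells of each column.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical token layout: (word, lemma, POS), one entry per layer.
Token = Tuple[str, str, str]

# Hypothetical costs: a substitution is cheaper when more layers agree.
COST_SAME_WORD = 0.0
COST_SAME_LEMMA = 0.25
COST_SAME_POS = 0.5
COST_DIFFERENT = 1.0
COST_INDEL = 1.0  # insertion or deletion of one token


def substitution_cost(a: Token, b: Token) -> float:
    """Cost of aligning two tokens, based on the first layer that matches."""
    if a[0] == b[0]:
        return COST_SAME_WORD
    if a[1] == b[1]:
        return COST_SAME_LEMMA
    if a[2] == b[2]:
        return COST_SAME_POS
    return COST_DIFFERENT


def bounded_distance(
    source: Sequence[Token], target: Sequence[Token], threshold: float
) -> Optional[float]:
    """Multi-level edit distance, or None once the threshold is exceeded."""
    previous = [j * COST_INDEL for j in range(len(target) + 1)]
    for i, s_tok in enumerate(source, start=1):
        current = [i * COST_INDEL]
        for j, t_tok in enumerate(target, start=1):
            current.append(min(
                previous[j] + COST_INDEL,       # delete s_tok
                current[j - 1] + COST_INDEL,    # insert t_tok
                previous[j - 1] + substitution_cost(s_tok, t_tok),
            ))
        # Costs never decrease along an alignment path, so the final
        # distance cannot be smaller than the minimum of any row.
        if min(current) > threshold:
            return None
        previous = current
    return previous[-1] if threshold >= previous[-1] else None


if __name__ == "__main__":
    segment = [("dogs", "dog", "NOUN"), ("bark", "bark", "VERB")]
    example = [("dog", "dog", "NOUN"), ("barked", "bark", "VERB")]
    print(bounded_distance(segment, example, threshold=1.0))
```

Because both token pairs in the usage example share a lemma but not a surface form, the comparison stays under the threshold and returns a distance; a candidate whose rows all exceed the threshold is rejected before the rest of the array is filled.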