<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1006"> <Title>The Effects of Word Order and Segmentation on Translation Retrieval Performance</Title> <Section position="3" start_page="35" end_page="35" type="intro"> <SectionTitle> 2 Segmentation and word order </SectionTitle> <Paragraph position="0"> Using segmentation to divide strings into component words or morphemes has the obvious advantage of clustering characters into semantic units, which in the case of ideogram-based languages such as Japanese (in the form of kanji characters) and Chinese, generally disambiguates character meaning. The kanji character '弁', for example, can be used to mean any of &quot;to discern/discriminate&quot;, &quot;to speak/argue&quot; and &quot;a valve&quot;, but word context easily resolves such ambiguity. In this sense, our intuition is that segmented strings should produce better results than non-segmented strings.</Paragraph> <Paragraph position="1"> Looking to past research on similarity metrics for TM systems, almost all systems involving Japanese as the source language rely on segmentation (e.g.</Paragraph> <Paragraph position="2"> Nakamura, 1989; Sumita and Tsutsumi, 1991; Kitamura and Yamamoto, 1996; Tanaka, 1997), with Sato (1992) and Sato and Kawase (1994) providing rare instances of character-based systems.</Paragraph> <Paragraph position="3"> By avoiding the need to segment text, we: (a) alleviate computational overhead; (b) avoid the need to commit ourselves to a particular analysis type in the case of ambiguity; (c) avoid the issue of how to deal with unknown words; (d) avoid the need for stemming/lemmatisation; and (e) to a large extent get around problems related to the normalisation of lexical alternation (see Baldwin and Tanaka (1999) for a discussion of problems related to lexical alternation in Japanese). 
Additionally, we can use the commonly ambiguous nature of individual kanji characters to our advantage, in modelling semantic similarity between related words with character overlap. With word-based indexing, this would only be possible with the aid of a thesaurus.</Paragraph> <Paragraph position="4"> Similarly for word order, we would expect that translation records that preserve the word (segment) order observed in the input string would provide closer-matching translations than translation records containing those same segments in a different order. Naturally, enforcing preservation of word order is going to place a significant burden on the matching mechanism, in that a number of different substring match schemata are inevitably going to be produced between any two strings, each of which must be considered on its own merits.</Paragraph> <Paragraph position="5"> To the authors' knowledge, there is no TM system operating from Japanese that does not rely on word/segment/character order to some degree.</Paragraph> <Paragraph position="6"> Tanaka (1997) uses pivotal content words identified by the user to search through the TM and locate translation records which contain those same content words in the same order and preferably the same segment distance apart. Nakamura (1989) similarly gives preference to translation records in which the content words contained in the original input occur in the same linear order, although there is the scope to back off to translation records which do not preserve the original word order. Sumita and Tsutsumi (1991) take the opposite tack in iteratively filtering out NPs and adverbs to leave only functional words and matrix-level predicates, and find translation records which contain those same key words in the same ordering, preferably with the same segment types between them in the same numbers. Nirenburg et al. 
(1993) propose a word order-sensitive metric based on &quot;string composition discrepancy&quot;, and incrementally relax the restriction on the quality of match required to include word lemmata, word synonyms and then word hypernyms, increasing the match penalty as they go. Sato and Kawase (1994) employ a more local model of character order in modelling similarity according to N-grams fashioned from the original string.</Paragraph> <Paragraph position="7"> The greatest advantage in ignoring word/segment order is computational, in that we significantly reduce the search space and require only a single overall comparison per string pair. Below, we analyse whether this gain in speed outweighs any losses in retrieval performance.</Paragraph> </Section> </Paper>