File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/01/p01-1004_metho.xml
Size: 8,240 bytes
Last Modified: 2025-10-06 14:07:41
<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1004"> <Title>Low-cost, High-performance Translation Retrieval: Dumber is Better</Title> <Section position="3" start_page="0" end_page="1" type="metho"> <SectionTitle> (SS 5), before concluding the paper (SS 6). 2 Basic Parameters </SectionTitle> <Paragraph position="0"> In this section, we review three parameter types that we suggest impinge on TR performance, namely segmentation, segment order, and segment contiguity.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Segmentation </SectionTitle> <Paragraph position="0"> Despite non-segmenting languages such as Japanese not making use of segment delimiters, it is possible to artificially partition off a given string into constituent morphemes through the process of segmentation. We will collectively term the resultant segments as words for the remainder of this paper.</Paragraph> <Paragraph position="1"> Looking to past research on string comparison methods for TM systems, almost all systems involving Japanese as the source language rely on segmentation (Nakamura, 1989; Sumita and Tsutsumi, 1991; Kitamura and Yamamoto, 1996; Tanaka, 1997), with Sato (1992) and Sato and Kawase (1994) providing rare instances of character-based systems. This is despite Fujii and Croft (1993) providing evidence from Japanese information retrieval that character-based indexing performs comparably to word-based indexing. In analogous research, Baldwin and Tanaka (2000) compared character- and word-based indexing within a Japanese-English TR context and found character-based indexing to hold a slight empirical advantage.</Paragraph> <Paragraph position="2"> The most obvious advantage of character-based indexing over word-based indexing is that there is no pre-processing overhead.
Other arguments for character-based indexing over word-based indexing are that we: (a) avoid the need to commit ourselves to a particular analysis type in the case of ambiguity or unknown words; (b) avoid the need for stemming/lemmatisation; and (c) to a large extent get around problems related to the normalisation of lexical alternation.</Paragraph> <Paragraph position="3"> Note that all methods described below are applicable to both word- and character-based indexing. To avoid confusion between the two lexeme types, we will collectively refer to the elements of indexing as segments.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Segment Order </SectionTitle> <Paragraph position="0"> Our expectation is that TRecs that preserve the segment order observed in the input string will provide closer-matching translations than TRecs containing those same segments in a different order. As far as we are aware, there is no TM system operating from Japanese that does not rely on word/segment/character order to some degree.</Paragraph> <Paragraph position="1"> Tanaka (1997) uses pivotal content words identified by the user to search through the TM and locate TRecs which contain those same content words in the same order and preferably the same segment distance apart. Nakamura (1989) similarly gives preference to TRecs in which the content words contained in the original input occur in the same linear order, although there is the scope to back off to TRecs which do not preserve the original word order. Sumita and Tsutsumi (1991) take the opposite tack in iteratively filtering out NPs and adverbs to leave only functional words and matrix-level predicates, and find TRecs which contain those same key words in the same ordering, preferably with the same segment types between them in the same numbers.
Sato and Kawase (1994) employ a more local model of character order in modelling similarity according to N-grams fashioned from the original string.</Paragraph> </Section> <Section position="3" start_page="0" end_page="1" type="sub_section"> <SectionTitle> 2.3 Segment contiguity </SectionTitle> <Paragraph position="0"> mixed unigrams/bigrams. These N-gram models are implemented as a pre-processing stage, following segmentation (where applicable). All this involves is mutating the original strings into N-grams of the desired order, while preserving the original segment order and segmentation schema.</Paragraph> <Paragraph position="1"> get are the vector space model (Manning and Schütze, 1999, p300) and &quot;token intersection&quot;. For segment order-sensitive approaches, we test</Paragraph> <Paragraph position="3"> parison throughout this paper, we assume that any segment made up entirely of punctuation is given a wt of 0, and any other segment a wt of 1.</Paragraph> <Paragraph position="4"> Character boundaries (which double as word boundaries in this case) indicated by &quot;*&quot;. All methods are subject to a threshold on translation utility, and in the case that the threshold is not achieved, the null string is returned. The various thresholds are as follows:</Paragraph> </Section> <Section position="4" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.1 Bag-of-Words Methods Vector Space Model </SectionTitle> <Paragraph position="0"> Within our implementation of the vector space model (VSM), the segment content of each string is described as a vector, made up of a single dimension for each segment type occurring within S or T. The value of each vector component is given as the weighted frequency of that type according to its wt value.
The string similarity of S and T is then defined as the cosine of the angle The token intersection of S and T is defined as the cumulative intersecting frequency of segment types appearing in each of the strings, normalised according to the combined segment lengths of S and T using Dice's coefficient. For- imum number of primitive edit operations on single segments required to transform S into T (and vice versa). The three edit operations are segment equality (segments s_i and t_j are identical), segment deletion (delete segment s_i) and segment insertion (insert segment a into a given position in string S). The cost associated with each operation is determined by the wt values of the operand segments, with the exception of segment equality which is defined to have a fixed cost of 0.</Paragraph> <Paragraph position="1"> Dynamic programming (DP) techniques are used to determine the minimum edit distance between a given string pair, following the classic 4-operation edit distance formulation of step further than edit distance in analysing not only segment sequentiality, but also the contiguity of matching segments.</Paragraph> <Paragraph position="2"> Weighted sequential correspondence associates an incremental weight (orthogonal to our wt weights) with each matching segment assessing the contiguity of left-neighbouring segments, in the manner described by Sato (1992) for character-based matching. Namely, the kth segment of a matched substring is given the multiplicative weight min(k, Max), where Max is a positive integer.
This weighting up of contiguous matches</Paragraph> <Paragraph position="4"> The fourth operator in 4-operation edit distance is segment substitution.</Paragraph> <Paragraph position="6"/> </Section> </Section> <Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 4 Evaluation Specifications </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.1 Details of the Dataset </SectionTitle> <Paragraph position="0"> As our main dataset, we used 3033 unique Japanese-English TRecs extracted from construction machinery field reports for the purposes of this research. Most TRecs comprise a single sentence, with an average Japanese character length of 27.7 and English word length of 13.3. Importantly, our dataset constitutes a controlled lan-</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.2 Semi-stratified Cross Validation </SectionTitle> <Paragraph position="0"> Retrieval accuracy was determined by way of 10-fold semi-stratified cross validation over the dataset. As part of this, all Japanese strings of length 5 characters or less were extracted from the dataset, and cross validation was performed over the residue, including the shorter strings in the training data (i.e. TM) on each iteration.</Paragraph> </Section> </Section> </Paper>
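The string comparison measures described in Section 3.1 can be illustrated with a short Python sketch. This is not the authors' implementation: all function names are hypothetical, the punctuation test (Unicode category starting with "P") is an assumption, and because the exact formulae are not preserved in this text, the Dice normalisation is assumed to use wt-weighted segment lengths. The edit distance uses only the three operations named in the text (equality at cost 0, deletion and insertion at the cost of the operand's wt value).

```python
import math
import unicodedata
from collections import Counter


def wt(segment):
    """Weight 0 for a segment made up entirely of punctuation, 1 otherwise."""
    all_punct = all(unicodedata.category(ch).startswith("P") for ch in segment)
    return 0 if segment and all_punct else 1


def weighted_counts(segments):
    """Weighted frequency of each segment type (one vector dimension per type)."""
    counts = Counter()
    for seg in segments:
        counts[seg] += wt(seg)
    return counts


def vsm_similarity(s, t):
    """Cosine of the angle between the weighted frequency vectors of S and T."""
    vs, vt = weighted_counts(s), weighted_counts(t)
    dot = sum(vs[k] * vt[k] for k in vs)
    norm_s = math.sqrt(sum(v * v for v in vs.values()))
    norm_t = math.sqrt(sum(v * v for v in vt.values()))
    return dot / (norm_s * norm_t) if norm_s and norm_t else 0.0


def token_intersection(s, t):
    """Dice-normalised intersecting frequency (weighted lengths are assumed)."""
    cs, ct = weighted_counts(s), weighted_counts(t)
    shared = sum(min(cs[k], ct[k]) for k in cs)
    total = sum(cs.values()) + sum(ct.values())
    return 2 * shared / total if total else 0.0


def edit_distance(s, t):
    """3-operation edit distance computed by dynamic programming."""
    # prev[j] holds the cost of building t[:j] from an empty prefix of s.
    prev = [0]
    for seg in t:
        prev.append(prev[-1] + wt(seg))
    for a in s:
        cur = [prev[0] + wt(a)]  # delete every segment of s seen so far
        for j, b in enumerate(t, start=1):
            best = min(prev[j] + wt(a),      # delete a from S
                       cur[j - 1] + wt(b))   # insert b into S
            if a == b:
                best = min(best, prev[j - 1])  # equality has fixed cost 0
            cur.append(best)
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    s, t = list("abcd"), list("abed")  # character-based segments
    print(vsm_similarity(s, t), token_intersection(s, t), edit_distance(s, t))
```

Segments are passed as lists, so the same functions apply to character-based and word-based indexing alike. Substitution is deliberately left out, so replacing one segment with another costs a deletion plus an insertion.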