<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1008">
  <Title>Sydney, July 2006. © 2006 Association for Computational Linguistics. A fast and accurate method for detecting English-Japanese parallel texts</Title>
  <Section position="5" start_page="60" end_page="63" type="metho">
    <SectionTitle>
3 Proposed Method
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="60" end_page="61" type="sub_section">
      <SectionTitle>
3.1 Problem settings
</SectionTitle>
      <Paragraph position="0"> There are several evaluation criteria for parallel text mining algorithms. They include accuracy, execution speed, and generality. We say an algorithm is general when it can be applied to texts of any format, not only to web pages with associated information specific to web pages (e.g., URLs and tags). In this paper, we focus on developing a fast and general algorithm for determining if a pair of texts is parallel.</Paragraph>
      <Paragraph position="1"> In general, there are two complementary ways to improve the speed of parallel text mining. One is to reduce the number of "candidate pairs" to be compared. The other is to make a single comparison of two texts faster. An example of the former is Resnik and Smith's URL matching method, which is able to mine parallel texts from very large, terabyte-scale corpora. However, this approach is very specific to the Web and, even if we restrict our interest to web pages, there may be a significant number of parallel pages whose URLs do not match the prescribed pattern and are therefore filtered out. Our method is in the latter category and is generally applicable to texts of any format. The approach depends only on the linguistic content of texts. Reducing the number of candidate pairs is left as future work.</Paragraph>
      <Paragraph position="2"> The outline of the method is as follows. First, we preprocess a bilingual dictionary and build a mapping from words to integers, which we call "semantic IDs." Texts are then preprocessed, converting each word to its corresponding semantic ID plus its position of occurrence. Then we compare all pairs of texts, using their converted representations (Figure 1). Comparing a pair of texts is fast because it is performed in time linear in the length of the texts and does not need any table lookup or string manipulation.</Paragraph>
    </Section>
    <Section position="2" start_page="61" end_page="62" type="sub_section">
      <SectionTitle>
3.2 Preprocessing a bilingual dictionary
</SectionTitle>
      <Paragraph position="0"> We take only nouns into account in our algorithm.</Paragraph>
      <Paragraph position="1"> For the language pair of English and Japanese, the correspondence between the part of speech of a word and that of its translation is not so clear and may make the problem more difficult. In fact, the result was worse when every open-class word was considered than when only nouns were.</Paragraph>
      <Paragraph position="2"> The first stage of the method is to assign an integer called a semantic ID to every word (in both languages) that appears in a bilingual dictionary.</Paragraph>
      <Paragraph position="3"> The goal is to assign the same ID to a pair of words that are translations of each other. In an ideal situation where each word of one language corresponds one-to-one with a word of the other language, all you need to do is to assign a different ID to every translational relationship between two words. The main purpose of this conversion is to make the comparison of two texts in a subsequent stage faster. However, it is not exactly that simple. A word very often has more than one word as its translation, so the naive method described above is not directly applicable. We devised an approximate solution to address this complexity. We build a bipartite graph (bigraph) whose nodes are words in the dictionary and whose edges are translational relationships between them. This graph consists of many small connected components, each representing a group of words that are expected to have similar meanings.</Paragraph>
      <Paragraph position="4"> We then make a mapping from a word to its semantic ID, which identifies the connected component containing the word. Two words are considered translations of each other when they have the same semantic ID.</Paragraph>
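      <Paragraph><![CDATA[ As an illustrative sketch only (not the authors' implementation), the mapping from words to component-based semantic IDs can be built with a union-find structure. The function name assign_semantic_ids, the "en:"/"ja:" prefixes, and the romanized toy entries are assumptions made for this example; they do not come from the EDR dictionary.

```python
from typing import Dict, Iterable, Tuple


def assign_semantic_ids(pairs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Map every word to an integer naming its connected component."""
    parent: Dict[str, str] = {}

    def find(word: str) -> str:
        # Return the representative of the component that holds `word`.
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    for en, ja in pairs:
        # Language prefixes (an assumption of this sketch) keep the two
        # vocabularies apart, so the graph stays bipartite.
        a, b = find("en:" + en), find("ja:" + ja)
        if a != b:
            parent[a] = b  # merge the two components

    ids: Dict[str, int] = {}
    root_to_id: Dict[str, int] = {}
    for word in list(parent):
        root = find(word)
        # The first time a root is seen it receives the next unused integer.
        ids[word] = root_to_id.setdefault(root, len(root_to_id))
    return ids


# Hypothetical toy dictionary entries, for illustration only.
toy_pairs = [("fruit", "kajitsu"), ("fruit", "mi"),
             ("result", "mi"), ("army", "guntai")]
toy_ids = assign_semantic_ids(toy_pairs)
```

In this toy input, "fruit" and "result" end up with the same ID because they share a translation, even though the dictionary does not list them as related. This is the side effect of component-based grouping.]]></Paragraph>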
      <Paragraph position="5"> This method has the side effect of connecting two words that are not directly related in the dictionary. This has both good and bad effects. A good effect is that it may connect two words that do not explicitly appear as translations in the dictionary but are used as translations in practice (see Section 4.3).</Paragraph>
      <Paragraph position="6"> In other words, new translational word pairs are detected. A bad effect, on the other hand, is that it potentially connects many words that do not share meanings at all. Figure 2 shows an actual example of such an undesirable component observed in our experiment. You can go from fruit to army through several hops, and these words are treated as an identical entity in subsequent steps of our technique. Furthermore, in the most extreme case, a very large connected component can be created.</Paragraph>
      <Paragraph position="7"> Table 1 shows the statistics of the component sizes for the English-Japanese dictionary we used in our experiment (EDR Electronic Dictionary).</Paragraph>
      <Paragraph position="8"> Most components are fairly small (&lt; 10 words).</Paragraph>
      <Paragraph position="9"> The largest connected component, however, consisted of 3563 nodes out of the total 28001 nodes in the entire graph and 3943 edges out of 19413.</Paragraph>
      <Paragraph position="10"> As we will see in the next section, this had a devastating effect on the quality of judgement, so we clearly need a method that circumvents the situation. One possibility is to simply drop very large components. Another is to divide the graph into small components. We have tried both approaches. For partitioning graphs, we used a very simple greedy method. Even though a more complex method that takes advantage of linguistic insights may be possible, this work uses a very simple partitioning method that looks only at the graph structure. A graph is partitioned into two parts having an equal number of nodes, and the partitioning is performed recursively until each part becomes smaller than a given threshold. The threshold is chosen so that it yields the best result for a training set and is then applied to test data. For each bisection, we begin with a random partition and improve it by a local greedy search. Given the current partition, the search seeks a pair of nodes which, if swapped, maximally reduces the number of edges crossing the two parts. Ties are broken arbitrarily when there are many such pairs. If no single swap reduces the number of edges across parts, we simply stop (i.e., local search). A semantic ID is then given to each part.</Paragraph>
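      <Paragraph><![CDATA[ The recursive bisection with greedy swaps can be sketched as follows. This is a deliberately naive illustration, assuming an edge-list representation and recomputing the cut for every candidate swap; the names cut_size, bisect, and partition are hypothetical, and the code is not meant to reflect how the authors implemented or optimized the search.

```python
import random
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def cut_size(edges: List[Edge], left: Set[str]) -> int:
    """Count the edges that have exactly one endpoint in `left`."""
    return sum(1 for a, b in edges if (a in left) != (b in left))


def bisect(nodes: Iterable[str], edges: List[Edge],
           rng: random.Random) -> Tuple[Set[str], Set[str]]:
    """Random balanced split, then greedy swaps until no swap helps."""
    order = sorted(nodes)
    rng.shuffle(order)
    half = len(order) // 2
    left, right = set(order[:half]), set(order[half:])
    current = cut_size(edges, left)
    while True:
        best = None
        for u in left:
            for v in right:
                # Cut size if u and v traded sides.
                c = cut_size(edges, (left - {u}) | {v})
                if c < current and (best is None or c < best[0]):
                    best = (c, u, v)  # ties keep the first pair found
        if best is None:
            return left, right  # local optimum reached
        current, u, v = best
        left.remove(u)
        left.add(v)
        right.remove(v)
        right.add(u)


def partition(nodes: Iterable[str], edges: List[Edge],
              threshold: int, rng: random.Random) -> List[Set[str]]:
    """Bisect recursively until every part has fewer than `threshold` nodes."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    part = set(nodes)
    if len(part) < threshold:
        return [part]
    inner = [(a, b) for a, b in edges if a in part and b in part]
    left, right = bisect(part, inner, rng)
    return (partition(left, inner, threshold, rng)
            + partition(right, inner, threshold, rng))
```

Each returned part would then receive its own semantic ID, for example by enumerating the list of parts. Edges in the original graph whose endpoints land in different parts are exactly the connections that the partitioning loses.]]></Paragraph>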
      <Paragraph position="11"> This process loses connections between words that are originally translations in the dictionary but are separated by the partitioning. We will describe a method to partially recover this loss at the end of the next section, after describing how texts are preprocessed.</Paragraph>
    </Section>
    <Section position="3" start_page="62" end_page="62" type="sub_section">
      <SectionTitle>
3.3 Preprocessing texts
</SectionTitle>
      <Paragraph position="0"> Each text (document) is preprocessed as follows.</Paragraph>
      <Paragraph position="1"> Texts are segmented into words and tagged with parts of speech. Inflection problems are addressed with lemmatization. Each word is converted into the pair (nid, pos), where nid is the semantic ID of the partition containing the word and pos is its position of occurrence. The position is normalized and represented as a floating point number between 0.0 and 1.0. Any word which does not appear in the dictionary is simply ignored. The position is used to judge whether words having an equal ID occur in similar positions in both texts, which suggests that they are translations. After converting each word, all (nid, pos) pairs are sorted first by their semantic IDs, breaking ties with positions. This sorting takes O(n log n) time for a document of n words. This preprocessing needs to be performed only once for each document. We recover the connections between word pairs separated by the partitioning in the following manner. Suppose words J and E are translations of each other in the dictionary, J is in a partition whose semantic ID is x, and E is in another partition whose semantic ID is y. In this case, we translate J into two elements, x and y. The result is as if two separate words, one in component x and another in y, appeared in the original text, so it may potentially have an undesirable side effect on the quality of judgement. It is therefore important to keep the number of such pairs reasonably small.</Paragraph>
      <Paragraph position="2"> We experimented with both cases, one in which we recover separated connections and the other in which we do not.</Paragraph>
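      <Paragraph><![CDATA[ A minimal sketch of this conversion is given below, assuming the text has already been segmented, tagged, lemmatized, and reduced to nouns. The function name encode, the extra_ids argument used for the optional recovery step, and the normalization i / (n - 1) over the input word list are assumptions of this example; the text above does not specify the exact normalization.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def encode(words: Sequence[str],
           sem_id: Dict[str, int],
           extra_ids: Optional[Dict[str, List[int]]] = None
           ) -> List[Tuple[int, float]]:
    """Convert a word sequence into (nid, pos) pairs sorted by nid, then pos."""
    n = len(words)
    pairs: List[Tuple[int, float]] = []
    for i, word in enumerate(words):
        if word not in sem_id:
            continue  # words missing from the dictionary are ignored
        # Assumed normalization: first word maps to 0.0, last word to 1.0.
        pos = i / (n - 1) if n > 1 else 0.0
        pairs.append((sem_id[word], pos))
        # Optional recovery: also emit the IDs of partitions that hold
        # translations of this word separated from it by the partitioning.
        for other in (extra_ids or {}).get(word, ()):
            pairs.append((other, pos))
    pairs.sort()  # tuple ordering sorts by semantic ID, ties broken by position
    return pairs
```

Passing extra_ids=None corresponds to the setting without recovery. With recovery enabled, a single word contributes one element per listed partition at the same position.]]></Paragraph>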
    </Section>
    <Section position="4" start_page="62" end_page="63" type="sub_section">
      <SectionTitle>
3.4 Comparing document pairs
</SectionTitle>
      <Paragraph position="0"> We judge if a text pair is likely to be a translation by comparing two sequences obtained by the preprocessing. We count the number of word pairs that have an equal semantic ID and whose positions are within a distance threshold. The threshold is chosen to yield the best result for a training set and is then applied to the test set. This process takes time linear in the length of texts since the sequences are sorted. First, we set cursors at the first element of each of the two sequences. When the semantic IDs of the elements under the cursors are equal and the difference between their positions is within the threshold, we count them as evidence of translationality and move both cursors forward. Otherwise, the cursor on the element which is less according to the sorting criteria is moved forward. In this step, we do not perform any further search to determine if the original words of the elements were related directly in the bilingual dictionary, giving preference to speed over accuracy. We repeat this operation until either of the cursors reaches the end of its sequence. Finally, we divide the number of matching elements by the sum of the lengths of the two documents.</Paragraph>
      <Paragraph position="1"> We define this value as "tscore," which stands for translational score. At least one cursor moves after each comparison, so this algorithm finishes in time linear in the length of the texts.</Paragraph>
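      <Paragraph><![CDATA[ The two-cursor comparison can be sketched as below. The function name tscore mirrors the term defined above, but the signature is hypothetical, and using the lengths of the encoded sequences as the document lengths in the denominator is an assumption of this example.

```python
from typing import List, Tuple

Encoded = List[Tuple[int, float]]  # (semantic ID, normalized position), sorted


def tscore(a: Encoded, b: Encoded, threshold: float) -> float:
    """Linear-time merge over two sorted (nid, pos) sequences."""
    i = j = matches = 0
    while i < len(a) and j < len(b):
        (id_a, pos_a), (id_b, pos_b) = a[i], b[j]
        if id_a == id_b and abs(pos_a - pos_b) <= threshold:
            # Same semantic ID at a similar position: count as evidence
            # and advance both cursors.
            matches += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the cursor on the smaller element
        else:
            j += 1
    total = len(a) + len(b)  # assumed definition of the document lengths
    return matches / total if total else 0.0
```

Every loop iteration advances at least one cursor, so the loop runs at most len(a) + len(b) times, which is the linear bound stated above. No dictionary lookup or string comparison happens inside the loop.]]></Paragraph>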
    </Section>
  </Section>
</Paper>