File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/06/w06-1008_evalu.xml
Size: 8,299 bytes
Last Modified: 2025-10-06 13:59:49
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1008"> <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics A fast and accurate method for detecting English-Japanese parallel texts</Title> <Section position="6" start_page="63" end_page="65" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="63" end_page="63" type="sub_section"> <SectionTitle> 4.1 Preparation </SectionTitle> <Paragraph position="0"> To evaluate our method, we used the EDR Electronic Dictionary1 as the bilingual dictionary and Fry's Japanese-English parallel web corpus (Fry, 2005) as sample data. In this experiment, we considered only nouns (see Section 3.2) and obtained a graph consisting of 28001 nodes, 19413 edges and 9178 connected components, of which the largest has 3563 nodes and 3943 edges. Large components, including this one, need to be partitioned.</Paragraph> <Paragraph position="1"> We conducted partitioning with different thresholds and developed various word-ID mappings. For each mapping, we made several variations in two respects. One is whether cut connections are recovered or not. The other is whether and how many numerals, which can easily be used to enlarge the vocabulary of the dictionary, are added to the bilingual dictionary.</Paragraph> <Paragraph position="2"> The parallel corpus we used had been collected by Fry from four news sites. Most texts in the corpus are news reports on computer technology, and the rest are on various fields of science. A single</Paragraph> <Paragraph position="4"> document is typically 1,000-6,000 bytes. Fry detected parallel texts based only on HTML tags and link structures, which depend on websites, without looking at textual content, so there are many false pairs in his corpus. 
Therefore, to evaluate our method precisely, we used only 400 true parallel pairs that were randomly selected and checked by human inspection. We divided them evenly and randomly into two parts and used one half as a training set and the other as a test set. In the experiments described in Sections 4.4 and 4.5, we used other portions of the corpus to scale up the experiments.</Paragraph> <Paragraph position="5"> For tokenization and POS tagging, we applied MeCab2 to Japanese texts and SS Tagger3 to English texts. Because SS Tagger does not act as a lemmatizer, we used the morphstr() function in the WordNet library4.</Paragraph> </Section> <Section position="2" start_page="63" end_page="64" type="sub_section"> <SectionTitle> 4.2 Effect of large components and partitioning </SectionTitle> <Paragraph position="0"> Figure 3 shows the results of experiments under several conditions. There are three groups of bars: (A) treat every connected component equally regardless of its size, (B) simply drop the largest component, and (C) divide large components into smaller parts. In each group, the upper bar corresponds to the case where the algorithm works without a distance threshold and the lower bar to the case with one (0.2). The figure attached to each bar is the maxF1 score, a popular measure for evaluating a classification algorithm, and indicates how accurately a method is able to detect the 200 true text pairs in the test set of 40,000 pairs. We did not recover word connections broken in the partitioning step and did not add any numerals to the vocabulary of the bilingual dictionary this time.</Paragraph> <Paragraph position="1"> The significant difference between (A) and (B) clearly shows the devastating effect of large components. The difference between (B) and (C) shows that the accuracy can be further improved if large components are partitioned into small ones in order to utilize as much information as possible.</Paragraph> <Paragraph position="2"> In addition, the accuracy consistently improves by and distance threshold and tested its performance through a 2-fold cross validation. 
The best mapping among those was the one which * divides a component recursively until the number of nodes of each language becomes no more than 30, * does not recover connections that are cut in the partitioning, and * adds numerals from 0 to 999.</Paragraph> <Paragraph position="3"> The best distance threshold was 0.2, and the best tscore threshold was 0.102. We tested this rule and these thresholds on the test set. The result was F1 = 0.960.</Paragraph> </Section> <Section position="3" start_page="64" end_page="64" type="sub_section"> <SectionTitle> 4.3 Effect of false translation pairs </SectionTitle> <Paragraph position="0"> Our method of matching words differs from Ma and Liberman's. While they only count word pairs that appear directly in a bilingual dictionary, we identify all words having the same semantic ID. Potential merits and drawbacks for accuracy have been described in Section 3.2. We compared the accuracy of the two algorithms to investigate the effect of our approximate matching. To this end, we implemented Ma and Liberman's method with all other conditions and input data being equal to those in the last section. We obtained maxF1 = 0.933 as a result, which is slightly worse than the figure reported in their paper. Though it is difficult to conclude where the difference stems from, there are several factors worth pointing out. First, our experiment was done for English-Japanese, while Ma and Liberman's experiment was for English-German, which are more similar than English and Japanese are. Second, their data set contains many more true pairs This number is also worse than that of our experiment (Figure 4). This shows that, at least in this experiment, our approach of identifying more pairs than the original dictionary has more good effects than bad in total. We looked at word pairs which are matched in our method but not in Ma and Liberman's. 
While most of the pairs can hardly be considered strict translations, some of them are pairs used as translations in practice. Examples of such pairs are shown in Figure 5.</Paragraph> </Section> <Section position="4" start_page="64" end_page="65" type="sub_section"> <SectionTitle> 4.4 Execution Speed </SectionTitle> <Paragraph position="0"> We have argued that execution speed is a major advantage of our method. We achieved a throughput of 250,000 pairs/sec on a single Xeon (2.4GHz) processor. It is difficult to make a fair comparison of execution speed because Ma and Liberman's paper does not give enough details about their experiments other than processing 3145 websites with 10 SPARC stations for 10 days.</Paragraph> <Paragraph position="1"> Just for a rough estimate, we introduce some bold assumptions. Say there were a thousand pages for each language in a website or, in other words, a million page pairs, and the performance of processors has grown 32-fold in the past seven years; then our method works more than 40 times faster than Ma and Liberman's. This difference seems to be caused by a difference in complexity between the two algorithms. To the extent described in their paper, Ma and Liberman calculated a translationality score by enumerating all combinations of two words within a distance threshold and searching a bilingual dictionary for each combination of words. This algorithm takes Ω(n^2) time, where n is the length of a text, while our method takes O(n) time. In addition, our method does not need any string manipulation in the comparison step.</Paragraph> </Section> <Section position="5" start_page="65" end_page="65" type="sub_section"> <SectionTitle> 4.5 Analysis of misdetections </SectionTitle> <Paragraph position="0"> We analyzed text pairs for which the judgements of Fry's method and ours differ.</Paragraph> <Paragraph position="1"> Among pairs Fry determined as a translation, we examined the 10 pairs ranked highest by our algorithm. 
Two of them are in fact translations, which were not detected by Fry's method because it uses no linguistic information. The remaining eight pairs are not translations. Three of the eight pairs are about bioscience, and the word &quot;cell&quot; occurred many times (Figure 6). When words with an identical semantic ID appear repeatedly in the two texts being compared, their distances are likely to be within the distance threshold and the pair gets an unreasonably high tscore. Therefore, if we take the number of occurrences of each semantic ID in a text into account, we might be able to improve the accuracy.</Paragraph> <Paragraph position="2"> We performed the same examination on the 10 pairs ranked lowest among those Fry determined not to be a translation, but no interesting feature has been found so far.</Paragraph> </Section> </Section> </Paper>