<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1008">
  <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. A fast and accurate method for detecting English-Japanese parallel texts</Title>
  <Section position="4" start_page="60" end_page="60" type="intro">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> There have been several attempts to collect parallel texts from the Web. We will mention two contrasting approaches among them.</Paragraph>
    <Section position="1" start_page="60" end_page="60" type="sub_section">
      <SectionTitle>
2.1 BITS
</SectionTitle>
      <Paragraph position="0"> Ma and Liberman collected English-German parallel webpages (Ma and Liberman, 1999). They began with a list of websites that belong to a domain associated with German-speaking areas and searched for parallel webpages in these sites. For each site, they downloaded a subset of the site to investigate what languages it is written in, and then downloaded all of its pages if it proved to be English-German bilingual. For each pair of English and German documents, they judged whether the two are mutual translations, in the following manner. First, they searched a bilingual dictionary for all English-German word pairs in the text pair. If a word pair is found in the dictionary, it is recognized as evidence of translation. Finally, they divided the number of recognized pairs by the sum of the lengths of the two texts and regarded this value as a score of translationality. When this score is greater than a given threshold, the pair is judged to be a mutual translation. They succeeded in creating a parallel corpus of about 63 MB with 10 machines in 20 days. The number of webpages is considered to have increased far more rapidly than the performance of computers in the past seven years. Therefore, we think it is important to reduce the computational cost of a system.</Paragraph>
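      <Paragraph position="1"> The scoring step described above can be illustrated with a minimal sketch. It is not the BITS implementation: the function names, the toy dictionary, and the threshold value are hypothetical, and the text does not say how repeated words are counted, so the sketch assumes that each distinct English-German word-type pair found in the dictionary counts once, while text length is measured in tokens.

```python
def translation_score(english_tokens, german_tokens, dictionary):
    """Recognized dictionary pairs divided by the summed lengths of both texts.

    `dictionary` maps an English word to a set of German translations.
    Assumption: each distinct (English, German) word-type pair counts once.
    """
    total_length = len(english_tokens) + len(german_tokens)
    if total_length == 0:
        return 0.0
    german_types = set(german_tokens)
    recognized = 0
    for word in set(english_tokens):
        for translation in dictionary.get(word, ()):
            if translation in german_types:
                recognized += 1  # one piece of evidence of translation
    return recognized / total_length


def is_mutual_translation(english_tokens, german_tokens, dictionary, threshold):
    """Judge a pair as a mutual translation when the score exceeds the threshold."""
    return translation_score(english_tokens, german_tokens, dictionary) > threshold


# Hypothetical toy data, for illustration only.
TOY_DICTIONARY = {
    "the": {"der", "die", "das"},
    "small": {"klein", "kleine"},
    "house": {"haus"},
}
english = ["the", "small", "house"]
german = ["das", "kleine", "haus"]
print(translation_score(english, german, TOY_DICTIONARY))
print(is_mutual_translation(english, german, TOY_DICTIONARY, threshold=0.3))
```

Because every English-German document pair within a site must be scored this way, the cost grows with the number of pairs, which is the calculation cost the paragraph above is concerned with.</Paragraph>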
    </Section>
    <Section position="2" start_page="60" end_page="60" type="sub_section">
      <SectionTitle>
2.2 STRAND
</SectionTitle>
      <Paragraph position="0"> If we simply make a decision for all pairs in a collection of texts, the calculation takes Ω(n²) comparisons of text pairs, where n is the number of documents in the collection. In fact, most studies exploit properties peculiar to certain parallel webpages to reduce the number of candidate pairs in advance. Resnik and Smith focused on the fact that a page pair tends to be a mutual translation when their URL strings meet a certain condition, and examined only page pairs which satisfy it (Resnik and Smith, 2003). A URL string sometimes contains a substring which indicates the language in which the page is written. For example, a webpage written in Japanese sometimes has a substring such as j, jp, jpn, n, euc or sjis in its URL. They regard a pair of pages as a candidate when their URLs match completely after removing such language-specific substrings, and only for these candidates did they make a detailed comparison with a bilingual dictionary. They were successful in collecting 2190 parallel pairs from 8294 candidates. However, this URL condition seems too strict for the purpose: they found only 8294 candidate pairs in as much as 20 terabytes of webpages.</Paragraph>
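      <Paragraph position="1"> The URL condition described above can be sketched as follows. This is an illustrative sketch, not the STRAND implementation. The Japanese markers are the ones listed in the text; the English markers, the function names, the example URLs, and the rule that a marker must be a whole token between the delimiters "/", ".", "_" and "-" are assumptions made here. The sketch also assumes that the language of each page is already known.

```python
import re
from collections import defaultdict

JAPANESE_MARKERS = {"j", "jp", "jpn", "n", "euc", "sjis"}  # listed in the text
ENGLISH_MARKERS = {"e", "en", "eng"}  # assumption, not given in the text


def strip_language_markers(url, markers):
    """Remove language-specific tokens from a URL, keeping the delimiters."""
    parts = re.split(r"([/._-])", url)  # the capturing group keeps delimiters
    return "".join(part for part in parts if part.lower() not in markers)


def candidate_pairs(english_urls, japanese_urls):
    """Pair URLs that match completely once the language markers are removed.

    Japanese URLs are bucketed by their normalized form, so every English URL
    needs one lookup instead of a comparison against every Japanese URL.
    """
    buckets = defaultdict(list)
    for url in japanese_urls:
        buckets[strip_language_markers(url, JAPANESE_MARKERS)].append(url)
    pairs = []
    for url in english_urls:
        key = strip_language_markers(url, ENGLISH_MARKERS)
        for match in buckets.get(key, []):
            pairs.append((url, match))
    return pairs


# Hypothetical URLs, for illustration only.
english_urls = ["http://example.com/en/index.html", "http://example.com/about_e.html"]
japanese_urls = ["http://example.com/jp/index.html", "http://example.com/about_j.html"]
for pair in candidate_pairs(english_urls, japanese_urls):
    print(pair)
```

Only the pairs returned by such a filter would then go through the detailed dictionary-based comparison, which avoids examining all pairs in the collection but misses parallel pages whose URLs do not follow the naming pattern.</Paragraph>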
    </Section>
  </Section>
</Paper>