<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1068">
  <Title>Mining the Web for Bilingual Text</Title>
  <Section position="3" start_page="527" end_page="528" type="metho">
    <SectionTitle>
2 STRAND Preliminaries
</SectionTitle>
    <Paragraph position="0"> This section is a brief summary of the STRAND system and previously reported preliminary results (Resnik, 1998).</Paragraph>
    <Paragraph position="1"> The STRAND architecture is organized as a pipeline, beginning with a candidate generation stage that (over-)generates candidate pairs of documents that might be parallel translations.</Paragraph>
    <Paragraph position="2"> (See Figure 1.) The first implementation of the generation stage used a query to the Altavista search engine to generate pages that could be viewed as &amp;quot;parents&amp;quot; of pages in parM\]el translation, by asking for pages containing one portion of anchor text (the readable material in a hyperlink) containing the string &amp;quot;English&amp;quot; within a fixed distance of another anchor text containing the string &amp;quot;Spanish&amp;quot;. (The matching process was case-insensitive.) This generated many good pairs of pages, such as those pointed to by hyperlinks reading Click here for English version and Click here for Spanish version, as well as many bad pairs, such as university pages containing links to English Literature in close proximity to Spanish Literature.</Paragraph>
    <Paragraph position="3"> The candidate generation stage is followed by a candidate evaluation stage that represents the core of the approach, filtering out bad candidates from the set of generated page pairs.</Paragraph>
    <Paragraph position="4"> It employs a structural recognition algorithm exploiting the fact that Web pages in parallel translation are invariably very similar in the way they are structured -- hence the 's' in STRAND. For example, see Figure 2.</Paragraph>
    <Paragraph position="5"> The structural recognition algorithm first runs both documents through a transducer that reduces each to a linear sequence of tokens corresponding to HTML markup elements, interspersed with tokens representing undifferentiated &amp;quot;chunks&amp;quot; of text. For example, the transducer would replace the HTML source text &lt;TITLE&gt;hCL'99 Conference Home Page&lt;/TITLE&gt; with the three tokens \[BEGIN: TITLE\], \[Chunk: 24\], and \[END:TITLE\]. The number inside the chunk token is the length of the text chunk, not counting whitespace; from this point on only the length of the text chunks is used, and therefore the structural filtering algorithm is completely language independent.</Paragraph>
    <Paragraph position="6"> Given the transducer's output for each document, the structural filtering stage aligns the two streams of tokens by applying a standard, widely available dynamic programming algorithm for finding an optimal alignment between two linear sequences. 1 This alignment matches identical markup tokens to each other as much as possible, identifies runs of unmatched tokens that appear to exist only in one sequence but not the other, and marks pairs of non-identical tokens that were forced to be matched to each other in order to obtain the best alignment pos-</Paragraph>
    <Section position="1" start_page="528" end_page="528" type="sub_section">
      <SectionTitle>
Highlights Best Practices of
</SectionTitle>
      <Paragraph position="0"> Seminar on Self-Regulation re$,ulla~ !mo~. AJ medm,te~ fm rile sw m, Zm~ Bro,~ DirecSc~ Gr.aera\]. Ccm*m'ael PSodu~ re.~a~t m ima= att~lmtive mm d*li~ (ASD) m~ atmh u ,~lut~at7 C/~d~a a~ in du.~T ~lf-nv*mq~nL He ~ thai * for~b,~m~n |~ ~ A~\[~ v, ua~l d e~ inch topi~ u wl~ck ASD= pm,~d= tl~ ram1 =pprop*u~ mecl=m~= *=d wire ~ m= ~udk=l~ ~m=d w~ din=.</Paragraph>
      <Paragraph position="1"> Vdmm*r~ C=I~ &amp;quot;A voluuuu7 code iJ * ,~ ~4 ~aadardized ~t~at~ -- ~ cxpl~:ifly ~ C/4 * I~isla~ve ~gut~orT ~gin'~ -* dc=iloed to ipB=oc~ ~**~, cc~Uol = ~C/ L~e b~i~ o( ~ who agre=d Treamry Board $ c~'*J.sr~, &amp;quot;t~imiam so~=u' rll6at to ~eguha~ They ,im#y c~IT~ the pm~ie,p~m* altetamln ~ bell I r(c)|tda~ed by the g~enm~&amp;quot; ~f~h~ to o~emen~ aed e~e mS~t~e f~, nSul=k~ Wht~ ~ ~des b~e * eemb~ ~ aC/~'=laSo, indudi~:  que ~alt prod~mm m &amp;~zl~mt ~ La di~Lf~ d~s nw~.s de pt~sLau~ des s~ qm traltax~t d= divm ~jets. ~ lu m/:caai=~ ~ ~ ~C/ r~e t~t ~iC/C/= I~ I~ ~prew~ ~mi gl~ IC/i probl~ae~ ~v~l pu chacua.</Paragraph>
      <Paragraph position="2"> c~l~ ,d~Ud~ t~i~ l~lillatif m t~Ic~salrC/ - ~ paur iaflt,C/~, f~, o~m34~ = ~va\]~ ~m dC/ = C/p~i ,tea oat ~. Ib ='$1imin~l p~. * p~rsui',i M. Bd= Gl~h~, ualy~e pn ~ap~. Affsi~* ~gle~mair=, = ~ aM C~il du T~sm, 5= rue&amp; aM S~ven~nt do Au ~nt o~ I= n!gtcmcmask~ fair I'ob~ d'~ e~ ~ du pabliC/, le= S~nu i I'L, chC/ll(c) * ill ~tt== d'~t= b pm~lh~ de ~ qul fraRumt I~ iuiti~v= de t~ikmmtmkm: * h f=illt ~l~iu a~ IJm$1~lle iLs peuvuq ~,e m~llft4u= Cu fcm~.~= d~ ~uB~ din,  sible. 2 At this point, if there were too many unmatched tokens, the candidate pair is taken to be prima facie unacceptable and immediately filtered out.</Paragraph>
      <Paragraph position="3"> Otherwise, the algorithm extracts from the alignment those pairs of chunk tokens that were matched to each other in order to obtain the best alignments. 3 It then computes the correlation between the lengths of these non-markup text chunks. As is well known, there is a re\]\]ably linear relationship in the lengths of text translations -- small pieces of source text translate to smaJl pieces of target text, medium to medium, and large to large. Therefore we can apply a standard statistical hypothesis test, and if p &lt; .05 we can conclude that the lengths are reliably correlated and accept the page pair as likely to be translations of each other. Otherwise, this candidate page pair is filtered out. 4 2An anonymous reviewer observes that diff has no preference for aligning chunks of similar lengths, which in some cases might lead to a poor alignment when a good one exists. This could result in a failure to identify true translations and is worth investigating further.</Paragraph>
      <Paragraph position="4"> 3Chunk tokens with exactly equal lengths are excluded; see (Resnik, 1998) for reasons and other details of the algorithm.</Paragraph>
      <Paragraph position="5"> 4The level of significance (p &lt; .05) was the initial selection during algorithm development, and never changed. This, the unmatched-tokens threshold for prima/aeie rejection due to mismatches (20~0), and the maximum distance between hyperlinks in the genera-In the preliminary evaluation, I generated a test set containing 90 English-Spanish candidate pairs, using the candidate generation stage as just described* I evaluated these candidates by hand, identifying 24 as true translation pairs. 5 Of these 24, STRAND identified 15 as true translation pairs, for a recall of 62.5%. Perhaps more important, it only generated 2 additional translation pairs incorrectly, for a precision of 15/17 = s8.2%.</Paragraph>
  </Section>
  <Section position="4" start_page="528" end_page="531" type="metho">
    <SectionTitle>
3 Adding Language Identification
</SectionTitle>
    <Paragraph position="0"> In the original STRAND architecture, additional filtering stages were envisaged as possible (see Figure 1), including such language-dependent processes as automatic language identification and content-based comparison of structually aligned document segments using cognate matching or existing bilingual dictionaries. Such stages were initially avoided in order to keep the system simple, lightweight, and independent of linguistic resources* Howtion stage (10 lines), are parameters of the algorithm that were determined during development using a small amount of arbitrarily selected French-English data downloaded from the Web. These values work well in practice and have not been varied systematically; their values were fixed in advance of the preliminary evaluation and have not been changed since.</Paragraph>
    <Paragraph position="1"> * The complete test set and my judgments for this preliminary evaluation can be found at http ://umiacs. umd* edu/~resnik/amt a98/.</Paragraph>
    <Paragraph position="2">  characteristics of parallel Web pages, it became evident that such processing would be important in addressing one large class of potential false positives. Figure 3 illustrates: it shows two documents that are generated by looking for &amp;quot;parent&amp;quot; pages containing hyperlinks to English and Spanish, which pass the structural filter with flying colors. The problem is potentially acute if the generation stage happens to yield up many pairs of pages that come from on-line catalogues or other Web sites having large numbers of pages with a conventional structure.</Paragraph>
    <Paragraph position="3"> There is, of course, an obvious solution that will handle most such cases: making sure that the two pages are actually written in the languages they are supposed to be written in. In order to filter out candidate page pairs that fail this test, statistical language identification based on character n-grams was added to the system (Dunning, 1994). Although this does introduce a need for language-specific training data for the two languages under consideration, it is a very mild form of language dependence: Dunning and others have shown that when classifying strings on the order of hundreds or thousands of characters, which is typical of the non-markup text in Web pages, it is possible to discriminate languages with accuracy in the high 90% range for many or most language pairs given as little as 50k characters per language as training material.</Paragraph>
    <Paragraph position="4"> For the language filtering stage of STRAND, the following criterion was adopted: given two documents dl and d2 that are supposed to be in languages L1 and L2, keep the document pair iff Pr(Llldl) &gt; Pr(L21dl) and Pr(/21d2) &gt; Pr(Llld2). For English and Spanish, this translates as a simple requirement that the &amp;quot;English&amp;quot; page look more like English than Spanish, and that the &amp;quot;Spanish&amp;quot; page look more like Spanish than English. Language identification is performed on the plain-text versions of the pages.</Paragraph>
    <Paragraph position="5"> Character 5-gram models for languages under consideration are constructed using 100k characters of training data from the European Corpus Initiative (ECI), available from the Linguistic Data Consortium (LDC).</Paragraph>
    <Paragraph position="6"> In a formal evaluation, STRAND with the new language identification stage was run for English and Spanish, starting from the top 1000 hits yielded up by Altavista in the candidate generation stage, leading to a set of 913 candidate  pairs. A test set of 179 items was generated for annotation by human judges, containing: * All the pairs marked GOOD (i.e. translations) by STRAND (61); these are the pairs that passed both the structural and language identification filter.</Paragraph>
    <Paragraph position="7"> * All the pairs filtered out via language idea-</Paragraph>
    <Paragraph position="9"> It was impractical to manually evaluate all pairs filtered out structurally, owing to the time required for judgments and the desire for two independent judgments per pair in order to assess inter-judge reliability.</Paragraph>
    <Paragraph position="10"> The two judges were both native speakers of Spanish with high proficiency in English, neither previously familiar with the project. They worked independently, using a Web browser to access test pairs in a fashion that allowed them to view pairs side by side. The judges were told they were helping to evaluate a system that identifies pages on the Web that are translations of each other, and were instructed to make decisions according to the following criterion: Is this pair of pages intended to show the same material to two different users, one a reader of English and the other a reader of Spanish? The phrasing of the criterion required some consideration, since in previous experience with human judges and translations I have found that judges are frequently unhappy with the quality of the translations they are looking at. For present purposes it was required neither that the document pair represent a perfect translation (whatever that might be), nor even necessarily a good one: STR,AND was being tested not on its ability to determine translation quality, which might or might not be a criterion for inclusion in a parallel corpus, but rather its ability to facilitate the task of locating page pairs that one might reasonably include in a corpus undifferentiated by quality (or potentially postfiltered manually).</Paragraph>
    <Paragraph position="11"> The judges were permitted three responses:  the two judges, between STRAND and each individual judge, and the agreement between STRAND and the intersection of the two judges' annotations -- that is, STRAND evaluated against only those cases where the two judges agreed, which are therefore the items we can regard with the highest confidence. The table also shows Cohen's to, an agreement measure that corrects for chance agreement (Carletta, 1996); the most important tC/ value in the table is the value of 0.7 for the two human judges, which can be interpreted as sufficiently high to indicate that the task is reasonably well defined. (As a rule of thumb, classification tasks with &lt; 0.6 are generally thought of as suspect in this regard.) The value of N is the number of pairs that were included, after excluding those for which the human judgement in the comparison was undecided.</Paragraph>
    <Paragraph position="12"> Since the cases where the two judges agreed can be considered the most reliable, these were used as the basis for the computation of recall and precision. For this reason, and because the human-judged set included only a sample of the full set evaluated by STRAND, it was necessary to extrapolate from the judged (by both judges) set to the full set in order to compute recall/precision figures; hence these figures are reported as estimates. Precision is estimated as the proportion of pages judged GOOD by STRAND that were also judged to be good (i.e.</Paragraph>
    <Paragraph position="13"> &amp;quot;yes&amp;quot;) by both judges -- this figure is 92.1% Recall is estimated as the number of pairs that should have been judged GOOD by STRAND (i.e. that recieved a &amp;quot;yes&amp;quot; from both judges) that STRAND indeed marked GOOD -- this figure is 47.3%.</Paragraph>
    <Paragraph position="14"> These results can be read as saying that of every 10 document pairs included by STRAND in a parallel corpus acquired fully automatically from the Web, fewer than 1 pair on average was included in error. Equivalently, one could say that the resulting corpus contains only about  8% noise. Moreover, at least for the confidently judged cases, STRAND is in agreement with the combined human judgment more often than the human judges agree with each other. The recall figure indicates that for every true translation pair it accepts, STRAND must also incorrectly reject a true translation pair. Alternatively, this can be interpreted as saying that the filtering process has the system identifying about half of the pairs it could in principle have found given the candidates produced by the generation stage. Error analysis suggests that recall could be increased (at a possible cost to precision) by making structural filtering more intelligent; for example, ignoring some types of markup (such as italics) when computing alignments. However, I presume that if the number M of translation pairs on the Web is large, then half of M is also large. Therefore I focus on increasing the total yield by attempting to bring the number of generated candidate pairs closer to M, as described in the next section.</Paragraph>
  </Section>
class="xml-element"></Paper>