File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/00/c00-1019_intro.xml

Size: 6,098 bytes

Last Modified: 2025-10-06 14:00:45

<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1019">
  <Title>Figure 2: Sample Clusters (French/English word pairs, e.g. PRISONNIERS PRISONERS; RETOUR BACK; REVENIR BACK; CONVENU AGREED; VU SEEN; OCCIDENTAL WESTERN; CHAUSSURES SHOES; CONSTRUCTEURS BUILDERS; RETRAITES PENSIONERS; VETEMENTS CLOTHING; POISSON FISH; PORC PORK)</Title>
  <Section position="4" start_page="0" end_page="126" type="intro">
    <SectionTitle>
2 Converting the Problem
</SectionTitle>
    <Paragraph position="0"> The task of clustering words according to their occurrence patterns can be restated as a standard document-clustering task by converting the problem space. For each unique word to be clustered, create a pseudo-document containing the words of the contexts in which that word appears, and use the word itself as the document identifier. After the pseudo-documents are clustered, retrieving the identifier for each document in a particular cluster produces the list of words occurring in sufficiently similar contexts to be considered equivalent for the purposes of generalizing an EBMT system.</Paragraph>
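The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `build_pseudo_documents`, the toy sentences, and the window size of 3 are assumptions made for the example.

```python
from collections import defaultdict


def build_pseudo_documents(sentences, window=3):
    """Map each unique word (the document identifier) to the list of
    words seen within `window` positions of any of its occurrences."""
    docs = defaultdict(list)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            docs[word].extend(left + right)
    return dict(docs)


# Hypothetical toy corpus, already tokenized and lower-cased.
sentences = [
    "did he come on friday".split(),
    "did he leave on monday".split(),
]
docs = build_pseudo_documents(sentences)
print(docs["come"])   # ['did', 'he', 'on', 'friday']
print(docs["leave"])  # ['did', 'he', 'on', 'monday']
```

Any standard document-clustering routine could then be run over `docs`, with the dictionary keys serving as the document identifiers that are read back out of each cluster.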
    <Paragraph position="1"> By itself, this approach only produces a monolingual clustering, but we require a bilingual clustering for proper generalization since different senses of a word will appear in differing contexts. The method of Barrachina and Vilar (1999) provides the means for injecting bilingual information into the clustering process.</Paragraph>
    <Paragraph position="2"> Using a bilingual dictionary -- which may be created from the corpus using statistical methods, such as those of Peter Brown et al (1990) or the author's own previous work (Brown, 1997) -- and the parallel text, create a rough mapping between the words in the source-language half of each translation example in the corpus and the target-language half of that example. Whenever there is exactly one possible translation candidate listed for a word by the mapping, generate a bilingual word pair consisting of the word and its translation. This word pair will be treated as an indivisible token in further processing, adding bilingual information to the clustering process. Forming pairs in this manner causes each distinct translation of a word to be treated as a separate sense; although translation pairs do not exactly correspond to word senses, pairs can be formed without any additional knowledge sources and are what the EBMT system requires for its equivalence classes.</Paragraph>
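The pairing rule can be illustrated with a short Python sketch. Everything here is hypothetical scaffolding: the function name `form_word_pairs`, the shape of the `candidates` mapping, and the example candidate lists are assumptions. Leaving unpaired words as plain tokens is also an assumption, since the text does not say how they are handled.

```python
def form_word_pairs(source_tokens, candidates):
    """Replace a source word by a 'word_translation' token whenever the
    rough mapping lists exactly one translation candidate for it.

    `candidates` maps a source word to the target-sentence words that
    the dictionary-based mapping proposes for this translation example.
    """
    tokens = []
    for word in source_tokens:
        options = candidates.get(word, [])
        if len(options) == 1:
            # Unambiguous: fuse into a single indivisible token.
            tokens.append(f"{word}_{options[0]}")
        else:
            # Zero or several candidates: no pair is generated.
            tokens.append(word)
    return tokens


# Hypothetical candidate lists for one translation example.
source = "elles commenceront en cinq jours .".split()
candidates = {
    "cinq": ["five"],
    "jours": ["days"],
    "en": ["in", "within"],  # ambiguous, so it stays unpaired
}
paired = form_word_pairs(source, candidates)
print(paired)
# ['elles', 'commenceront', 'en', 'cinq_five', 'jours_days', '.']
```

Because each distinct translation yields a distinct token, a word with two translations ends up as two separate items to be clustered, which matches the treatment of translations as senses described above.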
    <Paragraph position="3"> For every unique word pair found in the previous step, we accumulate counts for each word in the surrounding context of its occurrences.</Paragraph>
    <Paragraph position="4"> The context of an occurrence is defined to be the N words immediately prior to and the N words immediately following the occurrence; N currently is set to 3. Because word order is important, counts are accumulated separately for each position within the context, i.e. for N = 3, a particular context word may contribute to any of six different counts, depending on its location relative to the occurrence. Further, as the distance from the occurrence increases, the surrounding words become less likely to be a true part of the word-pair's context, so the counts are weighted to give the greatest importance to the words immediately adjacent to the word pair being examined. Currently, a simple linear decay from 1.0 to -~ is used, but other decay functions such as the reciprocal of the distance are also possible. The resulting weighted set of word counts forms the above-mentioned pseudo-document, which is converted into a term vector for cosine similarity computations (a standard measure in information retrieval, defined as the dot product of two term vectors normalized to unit length). If the clustering is seeded with a set of initial equivalence classes (which will be discussed below), then the equivalences will be used to generalize the contexts as they are added to the overall counts for the word pair. Any words in the context for which a unique correspondence can be found (and for which the word and its corresponding translation are one of the pairs in an equivalence class) will be counted as if the name of the equivalence class had been present in the text rather than the original word. For example, if days of the week are an equivalence class, then &quot;did he come on Friday&quot; and &quot;did he leave on Monday&quot; will yield identical context vectors for &quot;come&quot; and &quot;leave&quot;, making it easier for those two terms to cluster together.</Paragraph>
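A minimal Python sketch of the position-sensitive, distance-weighted counts and the cosine comparison follows. It rests on several assumptions that are not taken from the text: the names `position_weight`, `context_vector` and `cosine`, the plain string NUL standing in for the sentence-boundary placeholder, the default end weight of 1/N for the linear decay, and the simplification that equivalence classes are looked up on plain tokens rather than on bilingual word pairs.

```python
import math
from collections import Counter

PAD = "NUL"  # stand-in for the sentence-boundary placeholder


def position_weight(distance, n, end_weight):
    """Linear decay from 1.0 at distance 1 down to end_weight at distance n."""
    if n == 1:
        return 1.0
    return 1.0 - (1.0 - end_weight) * (distance - 1) / (n - 1)


def context_vector(sentences, target, n=3, end_weight=None, classes=None):
    """Accumulate weighted counts keyed by (context word, relative position)."""
    if end_weight is None:
        end_weight = 1.0 / n  # assumed default, not a value from the text
    classes = classes or {}
    vec = Counter()
    for tokens in sentences:
        padded = [PAD] * n + list(tokens) + [PAD] * n
        for i, tok in enumerate(padded):
            if tok != target:
                continue
            for d in range(1, n + 1):
                w = position_weight(d, n, end_weight)
                for offset in (-d, d):
                    ctx = padded[i + offset]
                    # Generalize: count the class name instead of the word.
                    ctx = classes.get(ctx, ctx)
                    vec[(ctx, offset)] += w
    return vec


def cosine(a, b):
    """Dot product of the two vectors after normalizing each to unit length."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


s1 = ["did he come on friday".split()]
s2 = ["did he leave on monday".split()]
days = {"friday": "DAY", "monday": "DAY"}  # hypothetical seed class

come_raw = context_vector(s1, "come")
leave_raw = context_vector(s2, "leave")
come_gen = context_vector(s1, "come", classes=days)
leave_gen = context_vector(s2, "leave", classes=days)

print(cosine(come_raw, leave_raw))  # below 1: friday and monday differ
print(cosine(come_gen, leave_gen))  # about 1: both contexts contain DAY
```

Keying the counts on the pair (word, offset) is what keeps the six positions separate for N = 3; swapping in a reciprocal decay would only require replacing `position_weight`.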
    <Paragraph position="5"> To illustrate the conversion process, consider the French word &quot;cinq&quot; in two examples where it translates into English as &quot;five&quot; (thus forming the word pair &quot;cinq_five&quot;):
      &lt;NUL&gt; &lt;NUL&gt; Le cinq jours depuis la
      &lt;NUL&gt; &lt;NUL&gt; The five days since the
      elles commenceront en cinq jours . &lt;NUL&gt;
      they will begin in five days . &lt;NUL&gt;
    where &lt;NUL&gt; is used as a placeholder when the word pair is too near the beginning or end of the sentence for the full context to be present. Note that the word order on the target-language side is not considered when building the term vector, so it need not be the same as on the source-language side; the examples were chosen with the same word order merely for clarity.</Paragraph>
    <Paragraph position="6"> The resulting term vector for &quot;cinq_five&quot; is as follows, where the numbers in parentheses indicate the context word's position relative to the word pair under consideration:</Paragraph>
    <Paragraph position="8"> Term vectors such as the above are then clustered to determine equivalent usages among words.</Paragraph>
  </Section>
</Paper>