<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1089">
  <Title>Learning Bilingual Collocations by Word-Level Sorting</Title>
  <Section position="3" start_page="0" end_page="525" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In the field of machine translation, there is a growing interest in corpus-based approaches (Sato and Nagao, 1990; Dagan and Church, 1994; Matsumoto et al., 1993; Kumano and Hirakawa, 1994; Smadja et al., 1996).</Paragraph>
    <Paragraph position="1"> The main motivation behind this is to handle domain-specific expressions well. Each application domain has various kinds of collocations ranging from word-level to sentence-level. The correct use of these collocations greatly influences the quality of output texts. Because such detailed collocations are difficult to hand-compile, the automatic extraction of bilingual collocations is needed.</Paragraph>
    <Paragraph position="2"> A number of studies have attempted to extract bilingual collocations from parallel corpora. These studies can be classified into two directions. One is based on full parsing techniques. (Matsumoto et al., 1993) proposed a method to find phrase-level correspondences while resolving syntactic ambiguities at the same time. Their methods determine phrase correspondences by using the phrase structures of the two languages and existing bilingual dictionaries. Unfortunately, these approaches are promising only for comparatively short sentences that can be analyzed by a CKY type parser.</Paragraph>
    <Paragraph position="3"> The other direction for extracting bilingual collocations involves statistics. (Fung, 1995) acquired bilingual word correspondences without sentence alignment. Although these methods are robust and assume no information source, their outputs are just word-word correspondences.</Paragraph>
    <Paragraph position="4"> (Kupiec, 1993; Kumano and Hirakawa, 1994) extracted noun phrase (NP) correspondences from aligned parallel corpora. In (Kupiec, 1993), NPs in English and French texts are first extracted by an NP recognizer. Their correspondence probabilities are then gradually refined by using an EM-like iteration algorithm. (Kumano and Hirakawa, 1994) first extracted Japanese NPs in the same way, and combined statistics with a bilingual dictionary for MT to find NP correspondences. Although their approaches attained high accuracy for the task considered, the most crucial knowledge for MT is more complex correspondences such as NP-VP correspondences and sentence-level correspondences. It seems difficult to extend these statistical methods to a broader range of collocations because they are specialized to NPs or single words.</Paragraph>
    <Paragraph position="5"> (Smadja et al., 1996) proposed a general method to extract a broader range of collocations.</Paragraph>
    <Paragraph position="6"> They first extract English collocations using the Xtract system (Smadja, 1993), and then look for French counterparts. Their search strategy is an iterative combination of two elements. This is based on the intuitive idea that "if a set of words constitutes a collocation, its subset will also be correlated". Although this idea is correct, the iterative combination strategy generates a number of useless expressions.</Paragraph>
    <Paragraph position="7"> In fact, Xtract employs a robust English parser to filter out the wrong collocations, which form more than half the candidates. In other languages such as Japanese, parser-based pruning cannot be used. Another drawback of their approach is that only the longest n-gram is adopted. That is, when 'Japan-US auto trade talks' is adopted as a collocation, 'Japan-US' cannot be recognized as a collocation though it is independently used very often.</Paragraph>
    <Paragraph position="8"> In this paper, we propose an alternative method based on word-level sorting. Our method comprises two steps: (1) extracting useful word chunks (n-grams) by word-level sorting and (2) constructing bilingual collocations by combining the word chunks acquired at stage (1). Given sentence-aligned texts in two languages (Haruno and Yamazaki, 1996), the first step detects useful word chunks by sorting and counting all uninterrupted word sequences in sentences. In this phase, we developed a new technique for extracting only useful chunks. The second step of the method evaluates the statistical similarity of the word chunks appearing in the corresponding sentences. Most of the fixed (uninterrupted) collocations are directly extracted from the word chunks. More flexible (interrupted) collocations are acquired level by level by iteratively combining the chunks. The proposed method, which uses effective word-level sorting, not only extracts fixed collocations with high precision, but also avoids the combinatorial explosion involved in searching flexible collocations. In addition, our method is robust and suitable for real-world applications because it only assumes part-of-speech taggers for both languages.</Paragraph>
    <Paragraph position="9"> Even if the part-of-speech taggers make errors in word segmentation, the errors can be recovered in the word chunk extraction stage.</Paragraph>
  </Section>
</Paper>