<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1015">
  <Title>You'll Take the High Road and I'll Take the Low Road: Using a Third Language to Improve Bilingual Word Alignment</Title>
  <Section position="2" start_page="0" end_page="97" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> For about a decade and a half now, researchers in natural language processing (NLP) and in general and applied linguistics have been working with parallel corpora, i.e., in the prototypical case, corpora consisting of original texts in some source language (SL) together with their translations into one or more target languages (TL). In general linguistics, they are used--in the same fashion as monolingual corpora--as handy sources of authentic language. In computational linguistics and language engineering, various methods for the (semi-)automatic extraction from such corpora of, among other things, translation equivalents have been explored.</Paragraph>
    <Paragraph position="1"> 2 Why is word alignment more interesting and why is it difficult? Alignment--the explicit linking of items in the SL and TL texts judged to correspond to each other--is a prerequisite for the extraction of translation equivalents from parallel corpora, and the granularity of the alignment naturally determines what kind of translation units you can get out of these resources. With sentence alignment, you get data which can be used in, e.g., translation memories. If you want to build bi- or multilingual lexica for machine translation systems (or for people), however, you want to be able to align parallel texts on the word (and phrase) level. This is because, in the last two decades, NLP grammars have become increasingly lexicalized, and grammars for machine translation--as opposed to translation memories, or example-based machine translation, neither of which uses a grammar in any interesting sense of the word--form no exception in this regard.</Paragraph>
    <Paragraph position="2"> The entries of the lexicon, which is the major repository of linguistic knowledge in a lexicalized grammar, are mainly made up of units on the linguistic levels of words and phrases.</Paragraph>
    <Paragraph position="3"> The problem here is that sentence alignment is a fairly well-understood task, but word alignment is much less so. This means that while language-independent sentence alignment programs typically achieve a recall in the 90 percent range, the same cannot be said about word alignment systems, where normal recall figures tend to fall somewhere between 20 and 40 percent in the language-independent case. Thus, we need methods to increase word alignment recall, preferably without sacrificing precision.</Paragraph>
    <Paragraph position="4"> There are many conceivable reasons for word alignment being less 'effective' than sentence alignment. Differences in language structure mean that words comparatively seldom stand in a one-to-one relationship between the languages in a parallel text, because, e.g., * SL function words may correspond to TL grammatical structural features, i.e.</Paragraph>
    <Paragraph position="5"> morphology or syntax, or even to nothing at all, if the TL happens not to express the feature in question. At the same time, function words tend to display a high type frequency, both because of high functional load (i.e., they are needed all over the place) and because they tend to be uninflected (i.e., each function word is typically represented by one text word type, while content words tend to appear in several inflectional variants). This of course means that function words will account for a relatively large share of the differences in recall figures between sentence and word alignment; * orthographic conventions may disagree on where word divisions should be written, as when compounds are written as several words in English, but as single words in German or Swedish, the extreme case being that some orthographies get along entirely without word divisions; * word alignment must by necessity (because word orders differ between languages) work with word types1 rather than with word tokens, while sentence</Paragraph>
    <Paragraph position="6"> alignment always works with sentence tokens,2 i.e., it relies on linear order. [Footnote 1: Alignment recall is here understood as the number of units aligned by the alignment program divided by the total number of correct alignments (established by independent means, normally by human annotation). Precision is the number of correct alignments (again established by independent means) divided by the number of units aligned by the alignment program (i.e., the numerator in the recall calculation). We will not in this paper go into a discussion of null alignments (source language units having no correspondence in the target language expression) or partial alignments (part, but not all, of a phrase aligned), as we believe that the results we present here are not dependent on a particular treatment of these--admittedly troublesome--phenomena.]</Paragraph>
    <Paragraph position="7"> This means that polysemy (one type in the SL corresponding to several types in the TL), homonymy (several types in the SL corresponding to one type in the TL), and combinations of polysemy and homonymy will disrupt the correspondence even between structurally similar languages; Thus, the circumstance that linear order cannot be used to constrain word alignment --beyond the restriction that putative word alignments must appear in one and the same sentence alignment unit--together with the other factors .just mentioned, conspire to make word alignment a much harder problem than sentence alignment in the language-independent case. 3</Paragraph>
  </Section>
</Paper>