<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0802">
  <Title>Cross language Text Categorization by acquiring Multilingual Domain Models from Comparable Corpora</Title>
  <Section position="4" start_page="9" end_page="9" type="intro">
    <SectionTitle>
2 Comparable Corpora
</SectionTitle>
    <Paragraph position="0"> Comparable corpora are collections of texts in different languages regarding similar topics (e.g. a collection of news articles published by agencies in the same period). Parallel corpora (i.e. corpora composed of texts that are mutual translations) must satisfy more restrictive requirements, while the class of multilingual corpora (i.e. collections of texts expressed in different languages without any additional requirement) is the most general. Parallel corpora are therefore also comparable, and comparable corpora are also multilingual.</Paragraph>
    <Paragraph position="1"> More precisely, let $L = \{L^1, L^2, \ldots, L^l\}$ be a set of languages, let $T^i = \{t^i_1, t^i_2, \ldots, t^i_n\}$ be a collection of texts expressed in the language $L^i \in L$, and let $\psi(t^j_h, t^i_z)$ be a function that returns 1 if $t^i_z$ is the translation of $t^j_h$ and 0 otherwise. A multilingual corpus is the collection of texts defined by $T^* = \bigcup_i T^i$. If the function $\psi$ exists for every text $t^i_z \in T^*$ and for every language $L^j$, and is known, then the corpus is parallel and aligned at document level.</Paragraph>
    <Paragraph position="2"> For the purposes of this paper it is enough to assume that two corpora are comparable, i.e. they are composed of documents about the same topics and produced in the same period (e.g. possibly by different news agencies), and that it is not known whether a function $\psi$ exists, even if in principle it could exist and return 1 for a strict subset of document pairs.</Paragraph>
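The definitions above can be illustrated with a minimal Python sketch. The toy texts, the `known_links` set, and the function names `multilingual_corpus`, `psi`, and `is_parallel` are all hypothetical and introduced only for illustration. The sketch reads "the function exists and is known for every text and every language" as "every text has a recorded translation in every other language", which is one possible interpretation of the definition.

```python
# Hypothetical toy data: language code -> list of texts (one T^i per language L^i).
corpus = {
    "en": ["Parliament approves the budget", "Storm hits the coast"],
    "it": ["Il parlamento approva il bilancio", "Tempesta sulla costa"],
}

# Hypothetical known translation links between (language, index) identifiers.
known_links = {
    frozenset({("en", 0), ("it", 0)}),
    frozenset({("en", 1), ("it", 1)}),
}


def multilingual_corpus(corpus):
    """T* as the union of all T^i, each text identified by (language, index)."""
    return [(lang, z) for lang, texts in corpus.items() for z in range(len(texts))]


def psi(a, b, links):
    """Return 1 if b is recorded as the translation of a, and 0 otherwise."""
    return 1 if frozenset({a, b}) in links else 0


def is_parallel(corpus, links):
    """True if every text has a known translation in every other language."""
    for lang_i, z in multilingual_corpus(corpus):
        for lang_j, texts_j in corpus.items():
            if lang_j == lang_i:
                continue
            has_translation = any(
                psi((lang_j, h), (lang_i, z), links) for h in range(len(texts_j))
            )
            if not has_translation:
                return False
    return True
```

With the links above, `is_parallel(corpus, known_links)` holds. With an empty set of links the same texts form only a multilingual corpus; whether they are also comparable depends on topic and period, which this sketch does not model.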
    <Paragraph position="3"> Many interesting studies use parallel corpora for multilingual applications (Melamed, 2001), such as Machine Translation, Cross language Information Retrieval (Littman et al., 1998), lexical acquisition, and so on.</Paragraph>
    <Paragraph position="4"> However, it is not always easy to find or build parallel corpora. This is the main reason why the weaker notion of comparable corpora has recently attracted interest in the field of Computational Linguistics (Gaussier et al., 2004).</Paragraph>
    <Paragraph position="5"> The texts inside comparable corpora, being about the same topics (i.e. about the same semantic domains), should refer to the same concepts by using various expressions in different languages. On the other hand, most proper nouns, relevant entities, and words that are not yet lexicalized in the language are expressed by using their original terms.</Paragraph>
    <Paragraph position="6"> As a consequence, the same entities will be denoted by the same words in different languages, making it possible to detect translation pairs automatically just by looking at the word shape (Koehn and Knight, 2002). Our hypothesis is that comparable corpora contain a large number of such words, precisely because texts referring to the same topics in different languages will often adopt the same terms to denote the same entities.</Paragraph>
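As a rough illustration of detecting such words by their shape, the following Python sketch collects the word forms that occur with identical spelling in two monolingual text collections. It uses exact string matching only, which is a simplification and not necessarily the procedure of Koehn and Knight (2002). The toy sentences, the function names, and the `min_length` filter (used here to discard short accidental matches such as function words) are assumptions made for this example.

```python
import re
from collections import Counter


def vocabulary(texts):
    """Count lowercased word forms over a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


def shared_words(texts_a, texts_b, min_length=4):
    """Word forms spelled identically in both collections (hypothetical heuristic)."""
    vocab_a = vocabulary(texts_a)
    vocab_b = vocabulary(texts_b)
    return {w for w in vocab_a if w in vocab_b and len(w) >= min_length}


# Hypothetical toy collections in English and Italian.
english_news = [
    "Microsoft releases a new Windows update",
    "Berlusconi meets the press in Rome",
]
italian_news = [
    "Microsoft rilascia un aggiornamento di Windows",
    "Berlusconi incontra la stampa a Roma",
]

print(sorted(shared_words(english_news, italian_news)))
# prints ['berlusconi', 'microsoft', 'windows']
```

Note that the word "a" appears in both collections with unrelated meanings; the length threshold removes it here, but identical spelling alone does not guarantee a translation pair.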
    <Paragraph position="7"> However, the mere presence of these shared words is not enough to obtain significant results in Text Categorization (TC) tasks. As we will see, we need to exploit these common words to induce a second-order similarity for the other words in the lexicons.</Paragraph>
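To give an intuition for what a second-order similarity might look like, the Python sketch below describes each word by the shared words it co-occurs with at document level and compares words across languages with the cosine of those profiles. This is a generic illustration under stated assumptions, not the method developed in this paper; the toy texts, the `anchors` set, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercased word forms of a text."""
    return re.findall(r"\w+", text.lower())


def cooccurrence_vector(word, texts, anchors):
    """Count the documents in which `word` appears together with each anchor."""
    vector = Counter()
    for text in texts:
        words = set(tokens(text))
        if word in words:
            for anchor in anchors:
                if anchor in words:
                    vector[anchor] += 1
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy data; the anchors are words shared by both languages.
english = ["Microsoft releases new software", "Juventus wins the match"]
italian = ["Microsoft rilascia nuovo programma", "Juventus vince la partita"]
anchors = {"microsoft", "juventus"}

software = cooccurrence_vector("software", english, anchors)
programma = cooccurrence_vector("programma", italian, anchors)
partita = cooccurrence_vector("partita", italian, anchors)
```

In this toy setting "software" and "programma" never appear in the same language, yet they receive a high similarity because both co-occur with the shared word "microsoft", whereas "software" and "partita" share no anchor and receive a similarity of zero.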
  </Section>
</Paper>