File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/p06-1070_intro.xml
Size: 5,572 bytes
Last Modified: 2025-10-06 14:03:35
<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1070"> <Title>Exploiting Comparable Corpora and Bilingual Dictionaries for Cross-Language Text Categorization</Title> <Section position="2" start_page="0" end_page="553" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In the worldwide scenario of the Web age, multilinguality is a crucial issue to deal with and to investigate, leading us to reformulate most of the classical Natural Language Processing (NLP) problems in a multilingual setting. For instance, the classical monolingual Text Categorization (TC) problem can be reformulated as a Cross Language Text Categorization (CLTC) task, in which the system is trained using labeled examples in a source language (e.g. English) and classifies documents in a different target language (e.g. Italian).</Paragraph> <Paragraph position="1"> The practical interest of CLTC is immediately clear in the globalized Web scenario.</Paragraph> <Paragraph position="2"> For example, in community-based trade (e.g.</Paragraph> <Paragraph position="3"> eBay) it is often necessary to archive texts in different languages under common product categories, which are very often defined by collections of documents in a source language (e.g. English).</Paragraph> <Paragraph position="4"> Another application along this direction is Cross Lingual Question Answering, in which it would be very useful to filter the candidate answers according to their topics.</Paragraph> <Paragraph position="5"> In the literature, this task has been proposed quite recently (Bel et al., 2003; Gliozzo and Strapparava, 2005). In those works, the authors exploited comparable corpora, showing promising results.
A more recent work (Rigutini et al., 2005) proposed the use of Machine Translation techniques to approach the same task.</Paragraph> <Paragraph position="6"> Classical approaches to multilingual problems have followed two main directions: (i) knowledge based approaches, mostly implemented by rule based systems, and (ii) empirical approaches, in general relying on statistical learning from parallel corpora. Knowledge based approaches are often affected by low accuracy. This limitation is mainly due to the problem of tuning large scale multilingual lexical resources (e.g. MultiWordNet, EuroWordNet) for the specific application task (e.g. discarding irrelevant senses, extending the lexicon with domain specific terms and their translations). On the other hand, empirical approaches are in general more accurate, because they can be trained on domain specific collections of parallel text to represent the application needs. There exist many interesting works on using parallel corpora for multilingual applications (Melamed, 2001), such as Machine Translation (Callison-Burch et al., 2004), Cross Lingual Information Retrieval (Littman et al., 1998), and so on.</Paragraph> <Paragraph position="7"> However, it is not always easy to find or build parallel corpora. This is the main reason why the weaker notion of comparable corpora is a matter of recent interest in the field of Computational Linguistics (Gaussier et al., 2004). In fact, comparable corpora are easier to collect for most languages (e.g.
collections of news from international agencies), providing a low cost knowledge source for multilingual applications.</Paragraph> <Paragraph position="8"> The main problem of adopting comparable corpora for multilingual knowledge acquisition is that only weaker statistical evidence can be captured.</Paragraph> <Paragraph position="9"> In fact, while parallel corpora provide stronger (text-based) statistical evidence to detect translation pairs by analyzing term co-occurrences in translated documents, comparable corpora provide weaker (term-based) evidence, because text alignments are not available.</Paragraph> <Paragraph position="10"> In this paper we present some solutions to deal with CLTC according to the availability of bilingual resources, and we show that it is possible to deal with the problem even when no such resources are accessible. The core technique relies on the automatic acquisition of Multilingual Domain Models (MDMs) from comparable corpora.</Paragraph> <Paragraph position="11"> This allows us to define a kernel function (i.e. a similarity function among documents in different languages) that is then exploited inside a Support Vector Machine classification framework. We also investigate this problem by exploiting synset-aligned multilingual WordNets and standard bilingual dictionaries (e.g. Collins).</Paragraph> <Paragraph position="12"> Experiments show the effectiveness of our approach, providing a simple and low cost solution for the Cross-Language Text Categorization task. In particular, when bilingual dictionaries/repositories are available, the performance of the categorization gets close to that of monolingual TC.</Paragraph> <Paragraph position="13"> The paper is structured as follows. Section 2 briefly discusses the notion of comparable corpora. Section 3 shows how to perform cross-lingual TC when no bilingual dictionaries are available and it is possible to rely on a comparability assumption.
Section 4 presents a more elaborate technique to acquire MDMs by exploiting bilingual resources, such as MultiWordNet (i.e.</Paragraph> <Paragraph position="14"> a synset-aligned WordNet) and the Collins bilingual dictionary. Section 5 evaluates our methodologies and Section 6 concludes the paper, suggesting some future developments.</Paragraph> </Section> </Paper>