<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1144">
  <Title>arantza.casillas@ehu.es</Title>
  <Section position="3" start_page="0" end_page="1145" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Multilingual Document Clustering (MDC) involves dividing a set of n documents, written in different languages, into a specified number k of clusters, so that documents similar to one another fall in the same cluster. A multilingual cluster is composed of documents written in different languages, whereas a monolingual cluster is composed of documents written in a single language. MDC has many applications. The growing number of documents in different languages that are available electronically has created a need for applications that manage this information by filtering, retrieving and grouping multilingual documents. MDC tools can facilitate tasks such as Cross-Lingual Information Retrieval, the training of parameters in statistics-based machine translation, and the alignment of parallel and non-parallel corpora, among others.</Paragraph>
    <Paragraph position="1"> MDC systems have developed different solutions to group related documents. The strategies employed fall into two main groups: those that use translation technologies, and those that transform the document into a language-independent representation.</Paragraph>
    <Paragraph position="2"> One of the crucial issues for methods based on document or feature translation is the correctness of the translation. Bilingual resources usually suggest more than one sense for a source word, and selecting the appropriate one is not a trivial task. Although word-sense disambiguation methods can be applied, they are not free of errors. On the other hand, methods based on a language-independent representation also have limitations. For instance, those based on a thesaurus depend on the scope of that thesaurus. Identifying numbers or dates can be appropriate for some types of clustering and documents; for other types, however, it may be less relevant and may even be a source of noise.</Paragraph>
    <Paragraph position="3"> In this work we address MDC and propose an approach based only on the identification of cognate Named Entities (NE). We have tested this approach on a comparable corpus of news written in English and Spanish, obtaining encouraging results. One of the main advantages of this approach is that it does not depend on multilingual resources such as dictionaries, machine translation systems, thesauri or gazetteers. In addition, the algorithm does not need to be told the right number of clusters. It depends only on the possibility of identifying cognate named entities between the languages involved in the corpus. It could be particularly appropriate for news corpora, where named entities play an important role.</Paragraph>
    <Paragraph position="4"> To compare the results of our approach with those of an approach based on feature translation, we also applied the latter as a baseline. The baseline system uses EuroWordNet (Vossen, 1998) to translate the features. We tried different feature categories and combinations of them to determine which ones improve MDC results in this approach.</Paragraph>
    <Paragraph position="5"> In the following section we review previous work in the field. In Section 3 we present our approach to MDC. Section 4 describes the system we compare our approach with, as well as the experiments and the results. Finally, Section 5 summarizes the conclusions and future work.</Paragraph>
  </Section>
</Paper>