<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1144"> <Title>arantza.casillas@ehu.es</Title> <Section position="4" start_page="1145" end_page="1148" type="relat"> <SectionTitle> 2 Related Work </SectionTitle> <Paragraph position="0"> MDC is normally applied to parallel (Silva et al., 2004) or comparable corpora (Chen and Lin, 2000), (Rauber et al., 2001), (Lawrence, 2003), (Steinberger et al., 2002), (Mathieu et al., 2004), (Pouliquen et al., 2004). In the case of comparable corpora, the documents are usually news articles.</Paragraph> <Paragraph position="2"> Considering the approaches based on translation technology, two different strategies are employed: (1) translate the whole document into an anchor language, or (2) translate only some features of the document into an anchor language.</Paragraph> <Paragraph position="3"> With regard to the first approach, some authors use machine translation systems, whereas others translate the document word by word by consulting a bilingual dictionary. In (Lawrence, 2003), the author presents several experiments for clustering a Russian-English multilingual corpus; several of these experiments are based on using a machine translation system. Columbia's Newsblaster system (Kirk et al., 2004) clusters news into events, categorizes events into broad topics, and summarizes multiple articles on each event. In the clustering process, non-English documents are translated using simple dictionary lookup techniques for Japanese and Russian documents, and the Systran translation system for the other languages handled by the system.</Paragraph> <Paragraph position="4"> When the solution involves translating only some features, it is first necessary to select these features (usually entities, verbs, and nouns) and then translate them with a bilingual dictionary and/or by consulting a parallel corpus.</Paragraph> <Paragraph position="5"> In (Mathieu et al., 2004), before the clustering process, the authors perform a linguistic analysis which extracts lemmas and recognizes named entities (location, organization, person, time expression, numeric expression, product or event); the documents are then represented by a set of terms (keywords or named entity types). In addition, they use document frequency to select relevant features among the extracted terms. Finally, the selected features are translated with bilingual dictionaries. In (Rauber et al., 2001), the authors present a methodology in which documents are parsed to extract features: all the words which appear in n documents, except stopwords. Then, standard machine translation techniques are used to create a monolingual corpus.</Paragraph> <Paragraph position="6"> After the translation process, the documents are automatically organized into separate clusters using an unsupervised neural network.</Paragraph> <Paragraph position="7"> Some approaches first carry out an independent clustering in each language, that is, a monolingual clustering, and then find relations among the obtained clusters, generating the multilingual clusters. Other solutions start with a multilingual clustering to look for relations between the documents of all the involved languages. This is the case of (Chen and Lin, 2000), where the authors propose an architecture for a multilingual news summarizer which includes monolingual and multilingual clustering; the multilingual clustering takes input from the monolingual clusters.
The authors select different types of features depending on the clustering: for the monolingual clustering they use only named entities, while for the multilingual clustering they extract verbs in addition to named entities.</Paragraph> <Paragraph position="8"> The strategies that use a language-independent representation try to normalize or standardize the document contents in a language-neutral way, for example: (1) by mapping text contents to an independent knowledge representation, or (2) by recognizing language-independent text features inside the documents. Both approaches can be employed in isolation or combined.</Paragraph> <Paragraph position="9"> The first approach involves the use of existing multilingual linguistic resources, such as a thesaurus, to create a text representation consisting of a set of thesaurus items. Normally, in a multilingual thesaurus, elements in different languages are related via language-independent items, so two documents written in different languages can be considered similar if they have similar representations according to the thesaurus. In some cases, it is necessary to use the thesaurus in combination with a machine learning method in order to correctly map documents onto the thesaurus. In (Steinberger et al., 2002) the authors present an approach to calculating semantic similarity by representing the document contents in a language-independent way, using the descriptor terms of the multilingual thesaurus Eurovoc.</Paragraph> <Paragraph position="10"> The second approach, recognition of language-independent text features, involves the recognition of elements such as dates, numbers, and named entities. In other works, for instance (Silva et al., 2004), the authors present a method based on Relevant Expressions (RE). The REs are multilingual lexical units of any length, automatically extracted from the documents using the LiPXtractor extractor, a language-independent statistics-based tool. The REs are used as base features to obtain a reduced set of new features for the multilingual clustering, but the clusters obtained are monolingual.</Paragraph> <Paragraph position="11"> Other works combine the recognition of language-independent text features (numbers, dates, names, cognates) with the mapping of text contents onto a thesaurus.</Paragraph> <Paragraph position="12"> In (Pouliquen et al., 2004) the cross-lingual news cluster similarity is based on a linear combination of three types of input: (a) cognates, (b) automatically detected references to geographical place names, and (c) the results of a mapping process onto a multilingual classification system which maps documents onto the multilingual thesaurus Eurovoc. In (Steinberger et al., 2004) the authors propose extracting language-independent text features using gazetteers and regular expressions, in addition to thesauri and classification systems.</Paragraph> <Paragraph position="13"> None of the reviewed works uses the identification of cognate named entities between both sides of the comparable corpora as the only evidence for multilingual clustering.</Paragraph> <SectionTitle> 3 MDC by Cognate NE Identification </SectionTitle> <Paragraph position="14"> We propose an approach for MDC based only on cognate NE identification. The NE categories that we take into account are PERSON, ORGANIZATION, LOCATION, and MISCELLANY. Numerical categories such as DATE, TIME or NUMBER are not considered because we think they are less relevant to the content of the document. In addition, they can lead to grouping documents with little content in common.</Paragraph>
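As an illustration of this document representation, the following minimal Python sketch keeps only the NEs of the four retained categories and discards the numerical ones. The (text, category) pair layout and the example entities are assumptions made for the illustration, not part of the original system.

RETAINED_CATEGORIES = {"PERSON", "ORGANIZATION", "LOCATION", "MISCELLANY"}

def content_bearing_nes(annotated_nes):
    """Keep the NEs of the retained categories, discarding numerical ones
    (DATE, TIME, NUMBER). The (text, category) layout is illustrative."""
    return [(text, category) for text, category in annotated_nes
            if category in RETAINED_CATEGORIES]

# Hypothetical example of annotated NEs for one document.
doc_nes = [("Ernesto Zedillo", "PERSON"), ("Mexico", "LOCATION"),
           ("1998", "DATE"), ("ONU", "ORGANIZATION")]
print(content_bearing_nes(doc_nes))
# -> [('Ernesto Zedillo', 'PERSON'), ('Mexico', 'LOCATION'), ('ONU', 'ORGANIZATION')]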
<Paragraph position="15"> The process has two main phases: (1) cognate NE identification and (2) clustering. Both phases are described in detail in the following sections.</Paragraph> <Section position="1" start_page="1146" end_page="1147" type="sub_section"> <SectionTitle> 3.1 Cognate NE identification </SectionTitle> <Paragraph position="0"> This phase consists of three steps: 1. Detection and classification of the NEs in each side of the corpus.</Paragraph> <Paragraph position="1"> 2. Identification of cognates between the NEs of both sides of the comparable corpus. 3. Computation of statistics on the number of documents that share cognates of the different NE categories.</Paragraph> <Paragraph position="2"> The first step is carried out on each side of the corpus separately. In our case, we used a corpus with morphosyntactic annotations in which the NEs had been identified and classified with the FreeLing tool (Carreras et al., 2004).</Paragraph> <Paragraph position="3"> In order to identify the cognates between NEs, four steps are carried out: * Obtaining two lists of NEs, one for each language. * Identification of entity mentions in each language. For instance, &quot;Ernesto Zedillo&quot;, &quot;Zedillo&quot;, and &quot;Sr. Zedillo&quot; will be considered as the same entity after this step, since they refer to the same person. This step is only applied to entities of the PERSON category. The identification of NE mentions, as well as of cognate NEs, is based on the Levenshtein edit-distance function (LD). This measure is obtained by finding the cheapest way to transform one string into another, where the transformations are the one-step operations of insertion, deletion and substitution. The result is an integer value that is normalized by the length of the longest string (a sketch of this measure is given at the end of this subsection). In addition, constraints regarding the number of words the NEs are made up of, as well as the order of those words, are applied.</Paragraph> <Paragraph position="4"> * Identification of cognates between the NEs of both sides of the comparable corpus. This is also based on the LD, and constraints regarding the number and the order of the words are applied as well. First, cognate identification is attempted only between NEs of the same category (PERSON with PERSON, . . . ) or between any category and MISCELLANY (PERSON with MISCELLANY, . . . ). Next, a further step is applied to the remaining NEs that have not been matched as cognates, without the constraint of belonging to the same category or to MISCELLANY. As a result of this step, a list of corresponding bilingual cognates is obtained.</Paragraph> <Paragraph position="5"> * The same procedure carried out for obtaining bilingual cognates is used to obtain two more lists of cognates, one per language, between the NEs of the same language.</Paragraph> <Paragraph position="6"> Finally, statistics on the number of documents that share cognates of the different NE categories are computed. This information can be used by the algorithm (or the user) to select the NE category used as a constraint in clustering steps 1(a) and 2(b).</Paragraph> </Section>
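To make the matching step concrete, here is a minimal Python sketch of a normalized Levenshtein distance and of a cognate test built on it. The threshold value and the simplified same-word-count constraint are illustrative assumptions; the exact constraints on word number and word order used in the system are not reproduced here.

def levenshtein(a, b):
    """Classic edit distance: cheapest sequence of one-character
    insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def normalized_ld(a, b):
    """Edit distance normalized by the length of the longest string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def are_cognates(ne1, ne2, threshold=0.2):
    """Illustrative cognate test: similar spelling (normalized LD below a
    placeholder threshold) and the same number of words, a simplified
    version of the word-count and word-order constraints."""
    if len(ne1.split()) != len(ne2.split()):
        return False
    return normalized_ld(ne1.lower(), ne2.lower()) <= threshold

# Hypothetical example: two spellings of the same entity in two languages.
print(are_cognates("Ernesto Zedillo", "Ernesto Zedilloren"))   # True
print(are_cognates("Ernesto Zedillo", "Boris Yeltsin"))        # False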
<Section position="2" start_page="1147" end_page="1148" type="sub_section"> <SectionTitle> 3.2 Clustering </SectionTitle> <Paragraph position="0"> The algorithm for clustering multilingual documents based on cognate NEs is heuristic in nature.</Paragraph> <Paragraph position="1"> It consists of three main phases: (1) creation of the first clusters, (2) addition of the remaining documents to existing clusters, and (3) final cluster adjustment. 1. First cluster creation. This phase consists of two steps (a simplified sketch of this phase is given at the end of this subsection).</Paragraph> <Paragraph position="2"> (a) First, documents in different languages that have more cognates in common than a threshold are grouped into the same cluster. In addition, at least one of the cognates has to be of a specific category (PERSON, LOCATION or ORGANIZATION), and the number of mentions has to be similar; a threshold determines the similarity degree. After this step, some documents are assigned to clusters while the others remain free (with no cluster assigned).</Paragraph> <Paragraph position="3"> (b) Next, the algorithm tries to assign each free document to an existing cluster. This is possible if there is a document in the cluster that has more cognates in common with the free document than a threshold, with no constraints regarding the NE category. If this is not possible, a new cluster is created. This step can also leave some documents free.</Paragraph> <Paragraph position="4"> At this point, the number of clusters created is fixed for the next phase.</Paragraph> <Paragraph position="5"> 2. Addition of the remaining documents to existing clusters. This phase is carried out in two steps.</Paragraph> <Paragraph position="6"> (a) A document is added to a cluster if the cluster contains a document with which it has more cognates in common than a threshold.</Paragraph> <Paragraph position="7"> (b) Until now, the cognate NEs have been compared between both sides of the corpus, that is, a bilingual comparison. In this step, the NEs of a language are compared with those of the same language, which can be described as a monolingual comparison step. The aim is to group similar documents of the same language when the bilingual comparison steps have not been successful. As in the other cases, a document is added to a cluster containing at least one document of the same language with which it has more cognates in common than a threshold. In addition, at least one of the cognates has to be of a specific category (PERSON, LOCATION or ORGANIZATION).</Paragraph> <Paragraph position="8"> 3. Final cluster adjustment. Finally, if there are still free documents, each one is assigned to the cluster with which it has the most cognates in common, without constraints or thresholds. Nonetheless, if free documents remain because they do not have any cognates in common with those assigned to the existing clusters, new clusters can be created.</Paragraph> <Paragraph position="9"> Most of the thresholds can be customized in order to make experimentation easier. In addition, parameter customization allows adaptation to different types of corpora or content. For example, in steps 1(a) and 2(b) we enforce at least one match in a specific NE category. This parameter can be customized in order to guide the grouping towards some type of NE. The exact values we used are described in Section 4.5.</Paragraph>
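The following Python sketch illustrates the first phase only (steps 1(a) and 1(b)) under simplifying assumptions: each document is represented as a mapping from cognate identifiers to their NE category, the cross-language and mention-count conditions of step 1(a) are omitted, and the thresholds are placeholders rather than the values actually used (those are given in Section 4.5).

# Simplified sketch of phase 1 of the clustering heuristic.
# A document is assumed to be a dict {cognate_id: NE_category};
# thresholds are illustrative placeholders.

KEY_CATEGORIES = {"PERSON", "LOCATION", "ORGANIZATION"}

def strict_match(doc_a, doc_b, min_shared=3):
    """Step 1(a) test: enough shared cognates, at least one of them
    belonging to a key category."""
    shared = set(doc_a) & set(doc_b)
    return (len(shared) >= min_shared and
            any(doc_a[c] in KEY_CATEGORIES for c in shared))

def loose_match(doc_a, doc_b, min_shared=2):
    """Step 1(b) test: enough shared cognates, no category constraint."""
    return len(set(doc_a) & set(doc_b)) >= min_shared

def first_phase(docs):
    """docs maps document names to their cognate/category dictionaries."""
    clusters, free = [], []
    # Step 1(a): group documents that pass the strict test.
    for name, cognates in docs.items():
        for cluster in clusters:
            if any(strict_match(cognates, docs[member]) for member in cluster):
                cluster.append(name)
                break
        else:
            partner = next((f for f in free if strict_match(cognates, docs[f])), None)
            if partner is not None:
                free.remove(partner)
                clusters.append([partner, name])
            else:
                free.append(name)
    # Step 1(b): attach the remaining free documents with the looser test,
    # or open a new cluster for them (a simplification of the original step).
    for name in free:
        for cluster in clusters:
            if any(loose_match(docs[name], docs[member]) for member in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

Phases 2 and 3 would extend this skeleton with the monolingual comparison and the final, unconstrained assignment described above.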
<Paragraph position="10"> Our approach is a heuristic method that, following an agglomerative approach and in an iterative way, decides the number of clusters and places each document in a cluster; everything is based on cognate NE identification. The final number of clusters depends on the threshold values.</Paragraph> </Section> </Section> </Paper>