File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/p06-1144_metho.xml
Size: 9,645 bytes
Last Modified: 2025-10-06 14:10:25
<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1144"> <Title>arantza.casillas@ehu.es</Title> <Section position="5" start_page="1148" end_page="1149" type="metho"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> We wanted not only to determine whether our approach was successful for MDC, but also to compare its results with another approach based on feature translation. That is why we also tried MDC by selecting and translating the features of the documents.</Paragraph> <Paragraph position="1"> In this Section, first MDC by feature translation is described; next, the corpus, the experiments and the results are presented.</Paragraph> <Section position="1" start_page="1148" end_page="1148" type="sub_section"> <SectionTitle> 4.1 MDC by Feature Translation </SectionTitle> <Paragraph position="0"> In this approach we emphasize feature selection based on NE identification and on the grammatical category of the words. The selection of features we applied is based on previous work (Casillas et al., 2004), in which several document representations are tested in order to study which of them leads to better monolingual clustering results.</Paragraph> <Paragraph position="1"> We used this MDC approach as the baseline method.</Paragraph> <Paragraph position="2"> The approach we implemented consists of the following steps (a minimal sketch of steps 3 and 4 is given after the list): 1. Selection of features (NE, noun, verb, adjective, ...) and of their context (the whole document or only the first paragraph). Since journalistic style normally places the heart of the news in the first paragraph, we experimented both with the whole document and with the first paragraph only.</Paragraph> <Paragraph position="3"> 2. Translation of the features by using EuroWordNet 1.0. We translate English into Spanish. When more than one sense is provided for a single word, we disambiguate by selecting a sense if it appears in the Spanish corpus. Since we work with a comparable corpus, we expect the correct translation of a word to appear in it.</Paragraph> <Paragraph position="4"> 3. In order to generate the document representation we use the TF-IDF function to weight the features.</Paragraph> <Paragraph position="5"> 4. Use of a clustering algorithm. In particular, we used a partitional algorithm of the CLUTO (Karypis, 2002) library for clustering.</Paragraph>
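The following is a minimal sketch of steps 3 and 4 under simplifying assumptions: each input string is assumed to already contain the selected (and, for the English documents, translated) features of one news item, and scikit-learn's KMeans is used as a stand-in for the CLUTO partitional algorithm; the function and parameter names are illustrative, not those of the original implementation.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_translated_features(feature_docs, n_clusters):
    """feature_docs: one string per news item holding its selected
    (and, for English items, EuroWordNet-translated) features."""
    # Step 3: TF-IDF weighting of the selected features.
    vectorizer = TfidfVectorizer()
    doc_term_matrix = vectorizer.fit_transform(feature_docs)
    # Step 4: partitional clustering into a fixed number of clusters.
    # KMeans stands in here for the CLUTO algorithm; the number of
    # clusters is supplied, as in the baseline experiments.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(doc_term_matrix)
```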
</Section> <Section position="2" start_page="1148" end_page="1148" type="sub_section"> <SectionTitle> 4.2 Corpus </SectionTitle> <Paragraph position="0"> A comparable corpus is a collection of similar texts in different languages or in different varieties of a language. In this work we compiled a collection of news written in Spanish and English belonging to the same period of time.</Paragraph> <Paragraph position="1"> The news items are categorized and come from the news agency EFE; they were compiled by the HERMES project (http://nlp.uned.es/hermes/index.html). That collection can be considered a comparable corpus. We used three subsets of that collection. The first subset, called S1, consists of 65 news items, 32 in Spanish and 33 in English; we used it in order to train the threshold values. The second one, S2, is composed of 79 Spanish and 70 English news items, that is, 149 items. The third subset, S3, contains 179 news items: 93 in Spanish and 86 in English.</Paragraph> <Paragraph position="2"> In order to test the MDC results we carried out a manual clustering of each subset. Three persons read every document and grouped the documents according to their content. They judged independently, and only the clusters that were identical across the three judgements were selected. The human clustering solution is composed of 12 clusters for subset S1, 26 clusters for subset S2, and 33 clusters for S3. All the clusters are multilingual in the three subsets.</Paragraph> <Paragraph position="3"> In our experiments the first subset, S1, was used to train the parameters and threshold values; the best parameter values were then applied to the second and third subsets.</Paragraph> </Section> <Section position="3" start_page="1148" end_page="1149" type="sub_section"> <SectionTitle> 4.3 Evaluation metric </SectionTitle> <Paragraph position="0"> The quality of the experimental results is determined by means of an external evaluation measure, the F-measure (van Rijsbergen, 1974). This measure compares the human solution with the system one. The F-measure combines the precision and recall measures:</Paragraph> <Paragraph position="1"> F(i,j) = \frac{2 \cdot P(i,j) \cdot R(i,j)}{P(i,j) + R(i,j)}, \qquad P(i,j) = \frac{n_{ij}}{n_j}, \qquad R(i,j) = \frac{n_{ij}}{n_i}</Paragraph> <Paragraph position="2"> where n_{ij} is the number of members of human-solution cluster i in cluster j, n_j is the number of members of cluster j, and n_i is the number of members of human-solution cluster i. For all the clusters:</Paragraph> <Paragraph position="3"> F = \sum_i \frac{n_i}{n} \max_j F(i,j), \qquad \text{where } n \text{ is the total number of documents}</Paragraph> <Paragraph position="4"> The closer the F-measure value is to 1, the better.</Paragraph> </Section> <Section position="4" start_page="1149" end_page="1149" type="sub_section"> <SectionTitle> 4.4 Experiments and Results with MDC by Feature Translation </SectionTitle> <Paragraph position="0"> After trying features of different grammatical categories and combinations of them, Table 1 and Table 2 show only the best results of the experiments. The first column of both tables indicates the features used in clustering: NOM (nouns), VER (verbs), ADJ (adjectives), ALL (all the lemmas), NE (named entities), and 1rst PAR (the previous categories restricted to the first paragraph). The second column is the F-measure, and the third one indicates the number of multilingual clusters obtained. Note that the total number of clusters of each subset is provided to the clustering algorithm. As can be seen in the tables, the results depend on the features selected.</Paragraph> </Section> <Section position="5" start_page="1149" end_page="1149" type="sub_section"> <SectionTitle> 4.5 Experiments and Results with MDC by Cognate NE </SectionTitle> <Paragraph position="0"> The LD threshold used to determine whether two NEs are cognates or not is 0.2, except for entities of the ORGANIZATION and LOCATION categories, for which it is 0.3 when they consist of more than one word (a minimal sketch of this cognate test is given below).</Paragraph>
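The sketch below assumes that LD denotes the Levenshtein edit distance normalized by the length of the longer string, and that two NEs count as cognates when the normalized distance does not exceed the threshold; the lowercasing, the single shared category argument and the function names are assumptions of this sketch, not details given in the paper.

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def are_cognates(ne_a, ne_b, category):
    """Category-dependent cognate test sketched from Section 4.5:
    threshold 0.3 for multi-word ORGANIZATION and LOCATION entities,
    0.2 otherwise."""
    multiword = len(ne_a.split()) > 1 or len(ne_b.split()) > 1
    threshold = 0.3 if category in ("ORGANIZATION", "LOCATION") and multiword else 0.2
    # Assumption: LD is normalized by the longer NE before thresholding.
    distance = levenshtein(ne_a.lower(), ne_b.lower())
    return distance / max(len(ne_a), len(ne_b), 1) <= threshold
```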
<Paragraph position="1"> Regarding the thresholds of the clustering phase (Section 3.2), after training the thresholds with the collection S1 of 65 news articles we concluded the following: * The first step in the clustering phase, 1(a), performs a good first grouping with a relatively high threshold, in this case 6 or 7. That is, documents in different languages that have more than 6 or 7 cognates in common are grouped into the same cluster. In addition, at least one of the cognates has to be of a specific category, and the difference between the numbers of mentions has to be equal to or less than 2 (a sketch of this grouping condition is given at the end of this section). Of course, these thresholds are applied only after checking that there are documents that meet the requirements; if there are none, the thresholds are reduced. This first step creates multilingual clusters with high cohesiveness.</Paragraph> <Paragraph position="2"> * Steps 1(b) and 2(a) lead to good results with small threshold values: 1 or 2. They are designed to give priority to the addition of documents to existing clusters. In fact, only step 1(b) can create new clusters.</Paragraph> <Paragraph position="3"> * Step 2(b) tries to group similar documents of the same language when the bilingual comparison steps were not able to deal with them. This step leads to good results with a threshold value similar to that of step 1(a), and with the same NE category.</Paragraph> <Paragraph position="4"> Regarding the NE category enforced on matches in steps 1(a) and 2(b), we tried the two NE categories of cognates shared by the largest number of documents. In particular, for the S2 and S3 corpora the NE categories of the cognates shared by the largest number of documents were LOCATION followed by PERSON. We experimented with both categories.</Paragraph> <Paragraph position="5"> Table 3 and Table 4 show the results of applying the cognate NE approach to subsets S2 and S3 respectively. The first column of both tables indicates the thresholds for each step of the algorithm. The second and third columns show the results when the PERSON category is selected as the NE category that at least one cognate has to belong to in steps 1(a) and 2(b), whereas the fourth and fifth columns are calculated with the LOCATION NE category. The results are quite similar but slightly better with the LOCATION category, which is the cognate NE category shared by the largest number of documents. Although none of the results reached the exact number of clusters, it is remarkable that the resulting values are close to the right ones. In fact, no information about the right number of clusters is provided to the algorithm.</Paragraph> <Paragraph position="6"> If we compare the performance of the two approaches (Table 3 with Table 1 and Table 4 with Table 2), our approach obtains better results. With subset S3 the F-measure values of the two approaches are more similar than with subset S2, but the values of our approach are still slightly better.</Paragraph> <Paragraph position="7"> To sum up, our approach obtains slightly better results than the one based on feature translation on the same corpora. In addition, the number of multilingual clusters it produces is closer to the reference solution. We think it is remarkable that our approach reaches results comparable to those obtained by means of feature translation.</Paragraph> <Paragraph position="8"> We will have to test the algorithm with different corpora (with some monolingual clusters, different languages) in order to confirm its performance.</Paragraph> </Section> </Section> </Paper>
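To make the step 1(a) grouping condition concrete, here is a minimal sketch under stated assumptions: each document is represented as a mapping from a cross-lingual cognate NE to its category and number of mentions, the mention-count condition is read as applying to the category-matching cognate, and the default threshold of 7 corresponds to the upper end of the 6-7 range mentioned above. The names and the document representation are illustrative, not the paper's actual implementation.

```python
def group_in_step_1a(doc_a, doc_b, min_shared_cognates=7,
                     required_category="LOCATION", max_mention_diff=2):
    """doc_a, doc_b: dicts mapping a cognate NE (already matched across
    languages) to a (category, mention_count) pair.  Returns True when
    the two documents should fall into the same cluster in step 1(a)."""
    shared = set(doc_a) & set(doc_b)
    # Condition 1: more cognates in common than the (high) threshold.
    if len(shared) < min_shared_cognates:
        return False
    # Conditions 2 and 3: at least one shared cognate of the required
    # category whose mention counts differ by at most max_mention_diff.
    for ne in shared:
        category_a, mentions_a = doc_a[ne]
        category_b, mentions_b = doc_b[ne]
        if (category_a == category_b == required_category
                and abs(mentions_a - mentions_b) <= max_mention_diff):
            return True
    return False
```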