<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1138"> <Title>Multilingual and cross-lingual news topic tracking</Title> <Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 4 Clustering of news articles </SectionTitle> <Paragraph position="0"> In this process, similar articles are grouped into larger clusters. Unlike in document classification, clustering is a bottom-up, unsupervised process, because the document classes are not known beforehand.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.1 Building a dendrogram </SectionTitle> <Paragraph position="0"> In the process, we build a hierarchical clustering tree (dendrogram), using an agglomerative algorithm (Jain et al. 1999). In a first step, (1) we calculate the similarity between each document pair in the collection (i.e. one full day of news in one language), applying the cosine formula to the document vector pairs. The vector for each single document consists of its keywords and their log-likelihood values, enhanced with the country profile as described in sections 3.1 and 3.2. (2) When two or more documents have a cosine similarity of 90% or more, we eliminate all but one of them as we assume that they are duplicates or near-duplicates, i.e. they are exact copies or slightly amended versions of the same news wire. (3) We then combine the two most similar documents into a cluster, for which we calculate a new representation by merging the two vectors into one. For the node combining the two documents, we also have an intra-cluster similarity value showing the degree to which the two documents are similar. For the rest of the clustering process, this node will be treated like a single document, with the exception that it will have twice the weight of a single document when being merged with another document or cluster of documents. 
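The three steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sparse keyword vectors, the duplicate threshold of 0.90, and the weighted group-average merge are taken from the text, while all function and variable names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse keyword->weight vectors (dicts)."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def merge(u, v, wu=1, wv=1):
    """Weighted merge of two vectors; a cluster node carries the weight of its members."""
    return {k: (wu * u.get(k, 0.0) + wv * v.get(k, 0.0)) / (wu + wv)
            for k in set(u) | set(v)}

def cluster_day(docs, dup_threshold=0.90):
    """Agglomerative clustering of one day's documents (steps 1-3, iterated)."""
    # (2) drop near-duplicates: any document with cosine >= 0.90 to a kept one is discarded
    kept = []
    for d in docs:
        if all(cosine(d, k) < dup_threshold for k in kept):
            kept.append(d)
    # each node: (vector, weight, intra-cluster similarity of the merge that formed it)
    nodes = [(d, 1, 1.0) for d in kept]
    while len(nodes) > 1:
        # (1)/(3) find the most similar pair and merge it into one node
        i, j = max(((a, b) for a in range(len(nodes)) for b in range(a + 1, len(nodes))),
                   key=lambda p: cosine(nodes[p[0]][0], nodes[p[1]][0]))
        (u, wu, _), (v, wv, _) = nodes[i], nodes[j]
        sim = cosine(u, v)
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)]
        nodes.append((merge(u, v, wu, wv), wu + wv, sim))
    return nodes[0]
```

A real dendrogram would keep the child links of every merge; the sketch only retains the root node and its intra-cluster similarity.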
We iteratively repeat steps (1) and (3) so as to include more and more documents into the binary dendrogram until all documents are included. The resulting dendrogram will have clusters of articles that are similar, and a list of keywords and their weight for each cluster. The degree of similarity for each cluster is shown by its intra-cluster similarity value.</Paragraph> </Section> <Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.2 Cluster extraction to identify main events </SectionTitle> <Paragraph position="0"> As a next step, we search the dendrogram for the major news clusters of the day, by identifying all sub-clusters of documents that fulfil the following conditions: (a) the intra-cluster similarity (cluster cohesiveness) is above the threshold of 50%; (b) the number X of articles in the cluster is at least 0.6% of the total number of articles of that language per day; (c) the number Y of different feeds is at least half the minimum number of articles per cluster (Y = X/2).</Paragraph> <Paragraph position="1"> The threshold of 50% in (a) was chosen because it guarantees that most related articles are included in the cluster, while unrelated ones are mostly excluded (see section 4.3). The minimum number of articles per cluster in (b) was chosen to limit the number of major news clusters per day. We requested a minimum number of different news feeds (c) so as to be sure that the news items are of general interest and that we are not dealing with some newspaper-specific or local issues.</Paragraph> <Paragraph position="2"> With the current settings, the system produces an average of 9 English major news clusters per day, 11 Italian, 16 German, 20 French and 21 Spanish.</Paragraph> <Paragraph position="3"> The varying numbers indicate that the settings should probably be changed so as to produce a similar number of major news clusters per day in the various languages. 
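Conditions (a)-(c) amount to a simple filter over dendrogram nodes. The sketch below uses the thresholds given in the text (50% cohesiveness, 0.6% of the day's articles, Y >= X/2); the `Cluster` record and all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Cluster:           # hypothetical record for one dendrogram sub-cluster
    n_articles: int      # X: number of articles in the cluster
    n_feeds: int         # Y: number of distinct news feeds
    intra_sim: float     # intra-cluster similarity (cohesiveness), 0..1

def is_major_cluster(c, total_articles_today, min_sim=0.50, min_share=0.006):
    """Apply conditions (a)-(c) from section 4.2 to one candidate cluster."""
    min_articles = min_share * total_articles_today        # (b) at least 0.6% of the day
    return (c.intra_sim >= min_sim                         # (a) cohesiveness >= 50%
            and c.n_articles >= min_articles
            and c.n_feeds >= c.n_articles / 2)             # (c) Y >= X/2
```

On a day with 1000 articles, condition (b) requires at least six articles per major cluster.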
Most likely, the minimum number of feeds should be capped for languages like English, which have thousands of news articles per day.</Paragraph> <Paragraph position="4"> For each cluster, we have the following information: number of articles, number of sources (feeds), intra-cluster similarity measure and keywords. Using our group-average approach we also have the centroid of the cluster (i.e. the vector of features that represents the cluster). For each cluster, we compute the article that is most similar to the centroid (short: the centroid article). We use the title of this centroid article as the title for the cluster and we present this article to the users as a first document to read about the contents of the whole cluster.</Paragraph> <Paragraph position="5"> The collection of clusters is mainly presented to the users as a flat and independent list of clusters. However, as we realised that some of the clusters are more related than others (e.g. with the recent interest in Iraq, there are often various clusters covering different aspects of the political situation of the country), we position clusters with an inter-cluster similarity of over 30% closer to each other when presenting them to the users.</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.3 Evaluation of the monolingual clustering </SectionTitle> <Paragraph position="0"> The evaluation of clustering results is rather tricky. According to Joachims (2003), clustering results can be evaluated in a variety of ways: (a) let the market decide (select the winner); (b) ask end users; (c) measure the 'tightness' or 'purity' of clusters; (d) use human-identified clusters to evaluate system-generated ones. The last solution (d) is out of our reach because it is very resource-consuming; several evaluators would be needed for cross-checking the human judgement. 
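The centroid-article selection described in section 4.2 above can be sketched as follows: average the article vectors of a cluster and return the article closest to that average. This is a minimal illustration with hypothetical names, assuming sparse keyword vectors as dicts.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse keyword->weight vectors (dicts)."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def centroid_article(articles):
    """Pick the article most similar to the cluster centroid (group-average)."""
    # centroid = per-keyword average over all article vectors in the cluster
    keys = set().union(*articles)
    centroid = {k: sum(a.get(k, 0.0) for a in articles) / len(articles) for k in keys}
    return max(articles, key=lambda a: cosine(a, centroid))
```

The title of the returned article then serves as the title of the whole cluster.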
The 'market' (a) and user groups (b) will use and evaluate our system in the near future, but we need to evaluate the system prior to showing it to a large number of customers. We therefore focus on method (c) by letting a person judge how consistently the articles of each cluster treat the same story.</Paragraph> <Paragraph position="1"> We evaluated the major clusters of English news articles (using the 50% intra-cluster similarity threshold) produced for the seven-day period starting 9 March 2004. During this period, 71 clusters containing 1072 news articles were produced. The evaluator was asked to decide, for each cluster and on a four-grade scale, to what extent the clustered articles were related to the centroid article. Comparing the clustered articles to the centroid article was chosen over evaluating the homogeneity of the cluster because it is both easier and closer to the real-life situation of the users: users will enter the cluster via the centroid article and will judge the other articles according to whether or not they contain the information they expect. The evaluation scale distinguishes the following ratings: (0) wrong link, e.g. Madrid football results vs. Madrid elections; this is a hypothetical example as no such link was found.</Paragraph> <Paragraph position="3"> (1) loosely connected story, e.g. Welsh documentary on drinking vs. alcohol policy in Britain; (2) interlinked news stories, e.g. 11/03 Madrid bombing vs. elections of the Spanish Prime Minister Zapatero vs. Spanish decision to pull troops out of Iraq; (3) same news story.</Paragraph> <Paragraph position="4"> In the evaluation, 91.5% of the articles were rated as good (3), 7.7% were rated as interlinked (2) and 0.8% were rated as loosely connected. No wrong links were found. 47 of the 71 clusters only contained good articles (3). Loosely connected articles (1) were distributed evenly. 
No more than two articles of this rating were found in a single cluster. They never amounted to more than 17% of all articles in a cluster (2 out of 12 articles). An evaluation of the clusters produced on one day's data with 30% and 40% intra-cluster similarity thresholds showed that the performance decreased drastically. In 30%-clusters, we found several wrong links (category 0), while no such wrong links were found in the 50%-clusters. The total number of wrong (0) or loosely connected (1) articles went up from one (in the 50%-cluster for that day) to 37. Furthermore, the worst clusters contained over 50% of such unrelated articles. The 40%-clusters were of a slightly better quality, but they were still clearly worse than the 50%-clusters: The percentage of wrong (0) and loosely connected (1) articles only went up from 0.8% (in the 50%-clusters) to 4%, but some of the 40%-clusters still had more bad (category 0 or 1) than good (category 2 or 3) articles. These numbers confirm that our choice of the 50% intra-cluster similarity threshold is the most useful one.</Paragraph> <Paragraph position="5"> We have not produced a quantitative evaluation of the miss rate of the clustering process (i.e. the number of related articles not included in the cluster, showing the recall). However, a full-text search of the relevant proper names in the rest of the news collection showed that the clustering process missed very few related articles. In any case, from our users' point of view, it is much more important to know the major news stories of a specific day than to be able to access all articles on the subject.</Paragraph> <Paragraph position="6"> Statistical evaluation showed no correlation between cluster size and accuracy. 
However, category (2) results were more frequently found in clusters pertaining to news stories that go on for a long time, such as the US presidential elections.</Paragraph> <Paragraph position="7"> These stories get wide coverage without being 'breaking news', and many of the articles involved are commentaries. Some of the category (2) results were also found in stories around the Madrid bombing and its consequences: some articles discussed the bombing itself on 11 March (number of dead, investigation, mourning); others discussed the fact that, in the 14 March elections, the Spanish people elected the socialists as they felt that former Prime Minister Aznar's politics were partially responsible for this tragedy; yet other articles discussed the post-election consequences such as the decision of the new Socialist government to pull out the Spanish troops from Iraq, etc. Many of the articles touched upon several of these issues. Articles were rated as good (3) if they had at least one core topic in common with the centroid article.</Paragraph> <Paragraph position="8"> 5 Monolingual linking of news over time Establishing automatic links between the major clusters of news published in one language in the last 24 hours and the news published in previous days can help users in their analysis of events. Establishing historical links between related news stories is the third of the TDT tasks (see the introduction in section 1).</Paragraph> <Paragraph position="9"> We track topics by calculating the cosine similarity between all major news clusters of one day with all major news clusters of the previous days, currently up to a maximum distance of seven days.</Paragraph> <Paragraph position="10"> The input for the similarity calculation is the cluster vector produced by the monolingual clustering process (see section 4.2). The output for each pair-wise similarity calculation is a similarity value between 0 and 1. 
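The topic-tracking step just described reduces to a pairwise cosine comparison between today's cluster vectors and those of the previous days, within a seven-day window. A minimal sketch, with hypothetical names and the low 15% test threshold mentioned below:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse keyword->weight vectors (dicts)."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def historical_links(today_clusters, past_days, threshold=0.15):
    """Link each of today's major clusters to similar clusters of previous days.

    today_clusters: list of cluster vectors (keyword -> weight)
    past_days: earlier days, most recent first, each a list of cluster vectors
    Returns (today_index, days_back, past_index, similarity) tuples above threshold.
    """
    links = []
    for i, cur in enumerate(today_clusters):
        for d, day in enumerate(past_days[:7], start=1):   # at most seven days back
            for j, old in enumerate(day):
                sim = cosine(cur, old)
                if sim >= threshold:
                    links.append((i, d, j, sim))
    return links
```

The input vectors are the cluster representations produced by the monolingual clustering process; each similarity falls between 0 and 1.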
Whether we decide that two clusters are related or not depends on the similarity threshold we set. We found that related clusters over time have an extremely high similarity, often around 90%, which shows that the vocabulary used in news stories over time changes very little. For testing purposes, we set the threshold very low, at 15%, so that we could determine a useful threshold during the evaluation process.</Paragraph> </Section> <Section position="4" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 5.1 Evaluation of historical linking </SectionTitle> <Paragraph position="0"> We evaluated the historical links for the 136 English clusters of major news produced for the two-week period starting on 9 March 2004, looking at the seven-day window preceding the day for which each major news cluster was identified. The total number of historical links found for this period is 228, i.e. on average 1.68 historical links per major news cluster. However, for 42 of the 136 major news clusters, the system did not find any related news clusters with a similarity of 15% or more.</Paragraph> <Paragraph position="1"> We made a binary distinction between 'closely related articles' (+) and 'unrelated, or not so related articles' (-). The evaluation results at varying cosine similarity thresholds, displayed in Table 1, show that there is no threshold which includes all good clusters and excludes all bad ones. Setting the threshold at 40% would mean that 173 (135+24+14) of the 203 good clusters (86%) would be found while three bad ones would also be shown to the user. Setting the threshold at the more inclusive level of 20% would mean that 199 of the 203 good clusters (98%) would be found, but the number of unrelated ones would increase to 17.</Paragraph> <Paragraph position="2"> Table 1: Evaluation results, at varying cosine similarity thresholds, of the automatically detected links between major news of the day and the major news published in the seven days before. 
The distinction was binary: Related (+) or Not (so) related (-).</Paragraph> </Section> </Section> <Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 6 Cross-lingual linking of news clusters </SectionTitle> <Paragraph position="0"> News analysts and employees in press rooms and public relations departments often want to see how the same news is discussed in different countries. To allow easy access to related news in other languages, we establish cross-lingual links between the clusters of major news stories. As major news in one country sometimes is only minor news in another, we calculate a second, alternative group of news clusters for each language and each day, containing a larger number of smaller clusters. To get this alternative group of clusters, we set the intra-cluster similarity to 25% and require that the news of the cluster come from at least two different news sources. These conditions are much weaker than the requirements described in section 4.2. For each major news cluster (50% intra-cluster similarity) per day and per language, we thus try to find related news in the other languages among any of the smaller clusters produced with the 25% intra-cluster similarity requirement.</Paragraph> <Paragraph position="1"> We use three types of input for the calculation of cross-lingual cluster similarity: (a) the vector of keywords, as described in section 3.1, not enhanced with geographical information, (b) the country score vector, as described in section 3.2, and (c) the vector of Eurovoc descriptors, as described in section 3.3. The impact of the three components is currently set to 20%, 30% and 50% respectively. Using the Eurovoc vector alone would give very high similarity values for, say, news about elections in France and in the United States. By adding the country score, a considerable weight in the cross-lingual similarity calculation is given to the countries that are mentioned in each news cluster. 
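The cross-lingual similarity just described is a weighted combination of three cosine scores. The sketch below uses the 20/30/50 weighting reported in the text; the cluster dict layout and all names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse feature->weight vectors (dicts)."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def crosslingual_similarity(c1, c2, weights=(0.2, 0.3, 0.5)):
    """Combine the three cluster representations of section 6.

    c1, c2: clusters as dicts with 'keywords' (a), 'countries' (b) and
    'eurovoc' (c) vectors; weights mirror the 20%/30%/50% settings.
    """
    parts = ('keywords', 'countries', 'eurovoc')
    return sum(w * cosine(c1[p], c2[p]) for w, p in zip(weights, parts))
```

Two clusters that match only on Eurovoc descriptors (e.g. elections in different countries) can thus score at most 0.5, keeping the country score decisive.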
The overlap between the keyword vectors of documents in two different languages will, of course, be extremely small, but it increases with the number of named entities that the documents have in common. According to Gey (2000), 30% of content-bearing words in journalistic text are proper names.</Paragraph> <Paragraph position="2"> The system ignores individual articles, but calculates the similarity between whole clusters of the different languages. The country score and the Eurovoc descriptor vector are thus assigned to the cluster as a whole, treating all articles of each cluster like one big bag of words.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 6.1 Evaluation of cross-lingual cluster links </SectionTitle> <Paragraph position="0"> The evaluation for the cross-lingual linking was carried out on the same corpus as the evaluation of the historical links, i.e. taking the 136 English major news clusters as a starting point. Cross-lingual cluster links were evaluated for two languages, English to French and English to Italian. The evaluation was again binary, i.e. clusters were either judged as being 'closely related' (+) or 'unrelated, or not so related' (-). For 31 English clusters, no French cluster was found. Similarly, for 32 English clusters, no Italian cluster was found. This means that for almost 25% of the English major news stories (31/136), there was no equivalent news cluster in the other languages.</Paragraph> <Paragraph position="1"> For the remaining English clusters, a total of 131 French and 133 Italian clusters were detected by the system, i.e. on average more than one for each English cluster. 
However, when several related news clusters were found, only the one with the highest score was considered in the evaluation.</Paragraph> <Paragraph position="2"> Table 2 not only shows that the English-Italian links are less reliable than the English-French ones (the Italian document representation is inferior to the French one because we spent less effort on optimising the Italian keyword assignment), but also that the quality of cross-lingual links is generally lower than that of the historical links presented in section 5.1. If we set the threshold for identifying related news across languages to 30%, the system catches 74 of the 75 good French clusters (99%) and 67 of the 69 Italian clusters (97%). However, the system then also proposes 13 bad French and 12 bad Italian clusters to the users. Setting the threshold higher would decrease the number of wrong hits. However, we decided to use the threshold of 30% because we consider it important for users to be able to find related news in other languages. Furthermore, unrelated clusters are usually very easy to detect just by looking at the title of the cluster.</Paragraph> <Paragraph position="3"> Table 2: Evaluation results, at varying cosine similarity thresholds, of the automatically detected cross-lingual links between English major news and French (FR) or Italian (IT) news of the same day. The distinction was binary: Related (+) or Not (so) related (-).</Paragraph> </Section> </Section> </Paper>