<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1138">
  <Title>Multilingual and cross-lingual news topic tracking</Title>
  <Section position="3" start_page="2" end_page="2" type="intro">
    <SectionTitle>
2 Related work
</SectionTitle>
    <Paragraph position="0"> Allan et al. (1998) identify new events and then track the topic like in an information filtering task by querying new documents against the profile of the newly detected topic. Topics are represented as a vector of stemmed words and their TF.IDF values, only considering nouns, verbs, adjectives and numbers. In their experiments, using between 10 and 20 features produced optimal results. Schultz (1999) took the alternative approach of clustering texts with a single-linkage unsupervised agglomerative clustering method, using cosine similarity and TF.IDF for term weighting. He concludes that &amp;quot;a successful clustering algorithm must incorporate a representation for a cluster itself as group average clustering does&amp;quot;. We followed Schultz' advice. Unlike Schultz, however, we use the log-likelihood test for term weighting as this measure seems to be better when dealing with varying text sizes (Kilgarriff 1996). We do not consider parts-of-speech, lemmatisation or stemming, as we do not have access to linguistic resources for all the languages we need to work with, but we use an extensive list of stop words.</Paragraph>
    <Paragraph position="1"> Approaches to cross-lingual topic tracking are rather limited. Possible solutions for this task are to either translate documents or words from one language into the other, or to map the documents in both languages onto some multilingual reference system such as a thesaurus. Wactlar (1999) used bilingual dictionaries to translate Serbo-Croatian words and phrases into English and using the translations as a query on the English texts to find similar texts. In TDT-3, only four systems tried to establish links between documents written in different languages. All of them tried to link English and Chinese-Mandarin news articles by using Machine Translation (e.g. Leek et al. 1999). Using a machine translation tool before carrying out the topic tracking resulted in a 50% performance loss, compared to monolingual topic tracking.</Paragraph>
    <Paragraph position="2"> Friburger &amp; Maurel (2002) showed that the identification and usage of proper names, and especially of geographical references, significantly improves document similarity calculation and clustering. Hyland et al. (1999) clustered news and detected topics exploiting the unique combinations of various named entities to link related documents.</Paragraph>
    <Paragraph position="3"> However, according to Friburger &amp; Maurel (2002), the usage of named entities alone is not sufficient.</Paragraph>
    <Paragraph position="4"> Our own approach to cross-lingual topic tracking, presented in section 6, is therefore based on three kinds of information. Two of them exploit the co-occurrence of named entities in related news stories: (a) cognates (i.e. words that are the same across languages, including names) and (b) geographical references. The third component, (c) a process mapping texts onto a multilingual classification scheme, provides an additional, more content-oriented similarity measure. Pouliquen et al.</Paragraph>
    <Paragraph position="5"> (2003) showed that mapping texts onto a multilingual classification system can be very successful for the task of identifying document translations.</Paragraph>
    <Paragraph position="6"> This approach should thus also be an appropriate measure to identify similar documents in other languages, such as news discussing the same topic.</Paragraph>
    <Paragraph position="7"> 3 Feature extraction for document representation null The similarity measure for monolingual news item clustering, discussed in section 4, is a cosine of weighted terms (see 3.1) enriched with information about references to geographical place names (see 3.2). Related news are tracked over time by calculating the cosine of their cluster representations, while setting certain thresholds (section 5). The cross-lingual linking of related clusters, as described in section 6, additionally uses the results of a mapping process onto a multilingual classification scheme (see 3.3).</Paragraph>
    <Paragraph position="8"> The news corpus consists of a daily average of 3350 English news items, 2100 German, 870 Italian, 800 French and 530 Spanish articles, coming from over three hundred different internet sources.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.1 Keyword identification
</SectionTitle>
      <Paragraph position="0"> For monolingual applications, we represent documents by a weighted list of their terms. For the weighting, we use the log-likelihood test, which is said to perform better than the alternatives TF.IDF or chi-square when comparing documents of different sizes (Kilgarriff 1996). The reference corpus was produced with documents of the same type, i.e. news articles. It is planned to update the reference word frequency list daily or weekly so as to take account of the temporary news bias towards specific subjects (e.g. the Iraq war). We set the p-value to 0.01 in order to limit the size of the vector to the most important words. Furthermore, we use a large list of stop words that includes not only function words, but also many other words that are not useful to represent the contents of a document.</Paragraph>
      <Paragraph position="1"> We do not consider part-of-speech information and do not carry out stemming or lemmatisation, in order to increase the speed of the process and to be able to include new languages quickly even if we do not have linguistic resources for them. Clustering results do not seem to suffer from this lack of linguistic normalisation, but when we extend the system to more highly inflected languages, we will have to see whether lemmatisation will be necessary. The result of the keyword identification process is thus a representation of each incoming news article in a vector space.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Geographical Place Name Recognition
</SectionTitle>
      <Paragraph position="0"> For place name recognition, we use a system that has been developed by Pouliquen et al. (2004).</Paragraph>
      <Paragraph position="1"> Compared to other named entity recognition systems, this tool has the advantage that it recognises exonyms (foreign language equivalences, e.g. Venice vs. Venezia) and that it disambiguates between places with the same name (e.g. Paris in France vs.</Paragraph>
      <Paragraph position="2"> the other 13 places called Paris in the world).</Paragraph>
      <Paragraph position="3"> However, instead of using the city and region names as they are mentioned in the article, each place name simply adds to the country score of each article. The idea behind this is that the place names themselves are already contained in the list of keywords. By adding the country score separately, we heighten the impact of the geographical information on the clustering process.</Paragraph>
      <Paragraph position="4"> The country scores are calculated as follows: for each geographical place name identified for a given country, we add one to the country counter.</Paragraph>
      <Paragraph position="5"> We then normalise this value using the log-likelihood value, using the average country counter in a large number of other news articles as a reference base. As with keywords, we plan to update the country counter reference frequency list on a daily or weekly basis. The resulting normalised country score has the same format as the keyword list so that it can simply be added to the document vector space representation.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.3 Mapping documents onto a multilingual classification scheme
</SectionTitle>
      <Paragraph position="0"> classification scheme For the semantic mapping of news articles, we use an existing system developed by Pouliquen et al. (2003), which maps documents onto a multilingual thesaurus called Eurovoc. Eurovoc is a wide-coverage classification scheme with approximately 6000 hierarchically organised classes. Each of the classes has exactly one translation in the currently 22 languages for which it exists. The system carries out category-ranking classification using Machine Learning methods. In an inductive process, it builds a profile-based classifier by observing the manual classification on a training set of documents with only positive examples. The outcome of the mapping process is a ranked list of the 100 most pertinent Eurovoc classes. Due to the multi-lingual nature of Eurovoc, this representation is independent of the text language so that it is very suitable for cross-lingual document similarity calculation, as was shown by Pouliquen et al. (2003).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>