File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/02/c02-1006_abstr.xml

Size: 3,646 bytes

Last Modified: 2025-10-06 13:42:18

<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1006">
  <Title>NLP and IR Approaches to Monolingual and Multilingual Link Detection</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> This paper considers several important issues for monolingual and multilingual link detection. The experimental results show that nouns, verbs, adjectives and compound nouns are useful for representing news stories; story expansion is helpful; topic segmentation has a small effect; and a translation model is needed to capture the differences between languages.</Paragraph>
    <Paragraph position="1"> Introduction In the digital era, helping users cope with the data explosion problem has become urgent.</Paragraph>
    <Paragraph position="2"> News stories on the Internet contain a large amount of real-time and new information.</Paragraph>
    <Paragraph position="3"> Several attempts have been made to extract information from news stories, e.g., multi-lingual multi-document summarization (Chen and Huang, 1999; Chen and Lin, 2000), topic detection and tracking (abbreviated as TDT hereafter, http://www.nist.gov/TDT), and so on. Of these, TDT, a long-term project, has proposed many diverse applications, e.g., story segmentation (Greiff et al., 2000), topic tracking (Levow et al., 2000; Leek et al., 2002), topic detection (Chen and Ku, 2002) and link detection (Allan et al., 2000).</Paragraph>
    <Paragraph position="4"> This paper focuses on the link detection application. TDT link detection aims to determine whether two stories discuss the same topic. Each story may discuss one or more topics, and the two stories being compared may differ greatly in length. For example, one story may contain 100 sentences and the other only 5 sentences. In addition, the stories may be written in different languages. These are the main challenges of this task. In this paper, we address several issues: 1. How to represent a news story? 2. How to measure the similarity of news stories? 3. How to expand a story vector using historic information? 4. How to identify the subtopics embedded in a news story? 5. How to deal with news stories in different languages? The multilingual issue was first introduced in 1999 (TDT-3), and the source languages are mainly English and Mandarin. Dictionary-based translation is widely applied. In addition, several strategies have been proposed to improve translation accuracy. Leek et al. (2002) proposed probabilistic term translation and co-occurrence statistics strategies. The co-occurrence statistics algorithm tended to favour those translations consistent with the rest of the document. Hui et al. (2001) proposed an enhanced translation approach that uses a parallel corpus as an additional resource. Levow et al. (2000) proposed a corpus-based translation preference, in which English translation candidates were sorted in an order that reflected the dominant usage in the collection. Most of these methods need extra resources, e.g., a parallel corpus. In this paper, we try to resolve multilingual issues without such extra resources.</Paragraph>
    <Paragraph position="5"> Topic segmentation is a technique used extensively in information retrieval and automatic document summarization (Hearst et al., 1993; Nakao, 2001), where it has been shown to be effective. This paper introduces topic segmentation into link detection. Several experiments are conducted to investigate its effects.</Paragraph>
  </Section>
</Paper>