File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/02/w02-0214_intro.xml

Size: 7,638 bytes

Last Modified: 2025-10-06 14:01:28

<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0214">
  <Title>Topic Identification In Natural Language Dialogues Using Neural Networks</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The analysis of the topic of a sentence or a document is an important task for many natural language applications. For example, in interactive dialogue systems that attempt to carry out and answer requests made by customers, the response strategy employed may depend on the topic of the request (Jokinen et al., 2002). In large vocabulary speech recognition knowledge of the topic can, in general, be utilized for adjusting the language model used (see, e.g., (Iyer and Ostendorf, 1999)).</Paragraph>
    <Paragraph position="1"> We describe two approaches to analyzing the topical information, namely the use of topically ordered document maps for analyzing the overall topic of dialogue segments, and identification of topic and focus words in an utterance for sentence-level analysis and identification of topically relevant specific information in short contexts.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 Document map as a topically ordered semantic space
</SectionTitle>
      <Paragraph position="0"> The Self-Organizing Map (Kohonen, 1982; Kohonen, 1995) is an unsupervised neural network method suitable for ordering and visualizing complex data sets. It has been shown that very large document collections can be meaningfully organized onto maps that are topically ordered: documents with similar content are found near each other on the map (Lin, 1992; Honkela et al., 1996; Lin, 1997; Kohonen et al., 2000).</Paragraph>
      <Paragraph position="1"> The document map can be considered to form an ordered representation of possible topics, i.e., a topical semantic space. Each set of map coordinates specifies a point in the semantic space, and additionally, corresponds to a subset of the corpus, forming a kind of associative topical-semantic memory.</Paragraph>
      <Paragraph position="2"> Document maps have been found useful in text mining and in improving information re-Philadelphia, July 2002, pp. 95-102. Association for Computational Linguistics. Proceedings of the Third SIGdial Workshop on Discourse and Dialogue, trieval (Lagus, 2000). Recent experiments indicate that the document maps ordered using the SOM algorithm can be useful in focusing the language model to the current active vocabulary (Kurimo and Lagus, 2002).</Paragraph>
      <Paragraph position="3"> In this article we examine the usefulness of document maps for analyzing the topics of transcripts of natural spoken dialogues. The topic identification from both individual utterances and longer segments is studied.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.2 Conceptual analysis of individual utterances
</SectionTitle>
      <Paragraph position="0"> Within a single utterance or sentence the speaker may provide several details that specify the request further or provide additional information that specifies something said earlier. Automatic extraction of the relevant words and the concepts they relate to may be useful, e.g., for a system filling out the fields of a database query intended to answer the user's request.</Paragraph>
      <Paragraph position="1"> If a small set of relevant semantic concepts can be defined, and if the sentence structures allowed are strictly limited, the semantic concept identification problem can be solved, at least to some degree, by manually designed rule-based systems (Jokinen et al., 2002).</Paragraph>
      <Paragraph position="2"> However, if the goal is the analysis of free-form dialogues, one cannot count on hearing full sentences. It is therefore important to try to formulate the task as a learning problem into which adaptive, statistical methods can be applied.</Paragraph>
      <Paragraph position="3"> The major challenge in adaptive language modeling is the complexity of the learning problem, caused by large vocabularies and large amount of variation in sentence structures, compared to the amount of learning data available. ForEnglish there alreadyexist various tagged and analyzed corpora. In contrast, for many smaller languages no tagged corpora generally exist. Yet the methods that are developed for English cannot as such be applied for many other languages, such as Finnish.</Paragraph>
      <Paragraph position="4"> In the analysis of natural language dialogues, theories of information structure (Sgall et al., 1986; Halliday, 1985) concern the semantic concepts and their structural properties within an utterance. Such concepts include the attitudes, prior knowledge, beliefs and intentions of the speaker, as well as concepts identifying information that is shared between the speakers. The terms 'topic' and 'focus' may be defined as follows: 'topic' is the general subject of which the user is talking about, and 'focus' refers to the specific additional information that the user now introduces about the topic. An alternative way of describing these terms is that 'topic' constitutes of the old information shared by both dialogue participants and 'focus' contains the new information which is communicated regarding the topic.</Paragraph>
      <Paragraph position="5"> A traditional way of finding the old and new information is the 'question test' (see (Vilkuna, 1989) about using it for Finnish).</Paragraph>
      <Paragraph position="6"> For any declarative sentence, a question is  composedsothatthesentencewouldbeanatural answer to that question. Then the items of the sentence that are repeated in the question belong to the topic and the new items to the focus.</Paragraph>
      <Paragraph position="7"> A usual approach for topic-focus identification is to use parsed data. The sentence, or it's semantic or syntactic-semantic representation, is divided into two segments, usually at the location of the main verb, and the words or semantical concepts in the first segment are regarded as 'topic' words/concepts and those in the second as 'focus' words/concepts. For example in (Meteer and Iyer, 1996), the division point is placed before the first strong verb, or, in the absence of such a verb, behind the last weak verb of the sentence. Similar division is also the starting point for the algorithm for topic-focus identification introduced in (Haji^cov'a et al., 1995). The initial division is then modified according to the verb's position and meaning, the subject's definiteness or indefiniteness and the number, type and order of the other sentence constituents.</Paragraph>
      <Paragraph position="8"> In language modeling for speech recognition improvements in perplexity and word error rate have been observed on English corpora when using language models trained separately for the topic and the focus part of the sentence (Meteer and Iyer, 1996; Ma et al., 1998). Identificationoftheseconceptsislikely to be important also for sentence comprehension and dialogue strategy selection.</Paragraph>
      <Paragraph position="9"> In this article we examine the application of a number of statistical approaches for identification of these concepts. In particular, we apply the notions of topic and focus in information structure (Sgall et al., 1986) to tagging a set of natural dialogues in Finnish. We then try several approaches for learning to identify the occurrences of these concepts from new data based on the statistical properties of the old instances.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>