<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0214">
  <Title>Topic Identification In Natural Language Dialogues Using Neural Networks</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Experiments on recognizing the dialogue topic of a dialogue turn
</SectionTitle>
    <Paragraph position="0"> dialogue topic of a dialogue turn The ordered document map can be utilized in the analysis of dialogue topics as follows: encode a dialogue turn, i.e., an utterance u (or an utterance combined with its recent history) as a document vector. Locate the bestmatchingmapunit, orseveralsuchunits. Utilize the identities of the best units as a semantic representation of the topic of the u. In effect, this is a latent semantic representation of the topical content of the utterance. Evaluation of such a latent representation directly amounts to asking whether the dialogue manager can benefit from the representation, and must therefore be carried out by the dialogue manager. This direct evaluation has not yet been done.</Paragraph>
    <Paragraph position="1"> Instead, we have utilized the following approach for evaluating the ordering of the maps and the generalization to new, unseen dialogues: An intermediate set of named semantic concepts has been defined in an attempt to approximate what is considered to be interesting for the dialogue manager. The latent semantic representation of the map is then labeled or calibrated to reflect these named concepts. In effect, each dialogue segment is categorized to a prior topical category. The organized map is labeled using part of the data ('training data'), and the remaining part is used to evaluate the map ('test data')1.</Paragraph>
    <Paragraph position="2"> 1Note that even in this case the map is ordered in Furthermore, a statistical model for document classification can be defined on top of the map. The probability model used for</Paragraph>
    <Paragraph position="4"> where Ai is the topic category, S denotes the text transcription of the spoken sentence and XN is the set of N best map vectors used for the classification. We approximate the probability P(XN|S) to be equal for each map vector inXN. We assume thatXN conveys all information about S. The terms P(Ai|XN) are calculated as the relative frequencies of the topics of the document vectors in the training data that were mapped to the nodes that correspond to XN.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Corpus: transcripts of 57 spoken dialogues
</SectionTitle>
      <Paragraph position="0"> dialogues The data used in the experiments were Finnish dialogues, recorded from the customer service phone line of Helsinki City Transport. The dialogues, provided by the Interact project (Jokinen et al., 2002), had been transcribed into text by a person listening to the tapes.</Paragraph>
      <Paragraph position="1"> The transcribed data is extremely colloquial. Both the customers and the customer service personnel use a lot of expletive words, such as 'nii' ('so', 'yea') and 'tota' ('hum', 'er', 'like'), often the words appear in reduced or otherwise non-standard forms. The word order does not always follow grammatical rules and quite frequently there is considerable overlap between the dialogue turns. For example, the utterance of speaker A may be interjected by a confirmation from speaker B. This had currently been transcribed as three separate utterances: A1 B A2.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Tagging and segmentation of dialogues
</SectionTitle>
      <Paragraph position="0"> dialogues The data set was split into training and test data so that the first 33 dialogues were used for organization and calibration of the map an unsupervised manner, although it is applied for the classification of new instances based on old ones.  and the 24 dialogues collected later for testing. null A small number of broad topic categories were selected so that they comprehensively encompass the subjects of discussion occurring in the data. The categories were 'timetables', 'beginnings', 'tickets', 'endings', and 'out of domain'.</Paragraph>
      <Paragraph position="1"> The dialogues were then manually tagged and segmented, so that each continuous dialogue segment of several utterances that belonged to one general topic category formed a single document. This resulted in a total of 196 segments, 115 and 81 in training and test sets, respectively. Each segment contained data from both the customer and the assistant. null Of particular interest is the analysis of the topics of individual customer utterances. The data was therefore split further into utterances, resulting in 450 and 189 customer utterances in the training and test set, respectively. The relative frequencies of utterances belonging to each topic category for both training and test data are shown in Table 1.</Paragraph>
      <Paragraph position="2"> Each individual utterance was labeled with the topic category of the segment it belonged to.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Creation of the document map
</SectionTitle>
      <Paragraph position="0"> The documents, whether segments or utterances, were encoded as vectors using the methods described in detail in (Kohonen et al., 2000). In short, the encoding was as follows. Stopwords (function words etc.) and words that appeared fewer than 2 times in the training data were removed. The remaining words were weighted using their entropy over document classes. The documents were encoded using the vector space model by Salton (Salton et al., 1975) with word weights. Furthermore, sparse random projection of was applied to reduce the dimensionality of the document vectors from the original 1738 to 500 (for details of the method, see, e.g., (Kohonen et al., 2000)).</Paragraph>
      <Paragraph position="1"> In organizing the map each longer dialogue segment was considered as a document.</Paragraph>
      <Paragraph position="2"> The use of longer segments is likely to make the organization of the map more robust.</Paragraph>
      <Paragraph position="3"> The inclusion of the utterances by the assistant is particularly important given the small amount of data--all information must be utilized. The document vectors were then organized on a SOM of 6x4 = 24 units.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Experiments and results
</SectionTitle>
      <Paragraph position="0"> We carried out three tests where the length of dialogue segments was varied. In each case, different values of N were tried. In the first case, longer dialogue segments in the training data were used to estimate the term P(Ai|XN) whereas recognition accuracy was calculated on customer utterances only. Next, individual customer utterances were used also in estimating the model term. The best recognition accuracy in both cases were obtained using the value N = 3, namely 60.3% for the first case and 65.1% for the second case.</Paragraph>
      <Paragraph position="1"> In the third case we used the longer dialogue segments both for estimating the model and for evaluation, to examine the effect of longer context on the recognition accuracy.</Paragraph>
      <Paragraph position="2"> The recognition accuracy was now 87.7%, i.e., clearly better for the longer dialogue segments than for the utterances.</Paragraph>
      <Paragraph position="3"> It seems that many utterances taken out of context are too short or nondescript to provide reliable cues regarding the topical category. An example of such an utterance is 'Onks sinne mit&amp;quot;a&amp;quot;a muuta?' (lit. 'Is to there anything else?', the intended meaning probably being 'Does any other bus go there?').</Paragraph>
      <Paragraph position="4"> In this case it is the surrounding dialogue (or perhaps the Finnish morpheme corresponding to 'to') that would identify the correct category, namely 'timetables'.</Paragraph>
      <Paragraph position="5"> Moreover, results on comparing a document map to Independent Component Analysis on the same corpus are reported in (Bingham et al., 2002). The slightly higher percentages in that paper are due to evaluating longer segments and to reporting the results on the whole data set instead of a separate test set.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Identification of old and new information in utterances
</SectionTitle>
    <Paragraph position="0"> information in utterances We define this task as the identification of 'topic words' and 'focus words' from utterances of natural Finnish dialogues. There are thus no restrictions regarding the vocabulary or the grammar. By observing previous, marked instances of these concepts we try to recognize the instances in new dialogues. It should be noted that this task definition differs somewhat from those discussed in Section 1.2 in that we do not construct any conceptual representation of the utterances, nor do we segment them into a 'topic' part and a 'focus' part. This choice is due to utilizing natural utterances in which the sentence borders do not always coincide with the turn-taking of the speakers--a turn may consist of several sentences or a partial one (when interrupted by a comment from the other speaker).</Paragraph>
    <Paragraph position="1"> In other words, we try to identify the central words that communicate the topic and focus in an utterance. We assume that they can appear in any part of the sentence and between them there may be other words that are not relevant to the topic or focus. Whether these central words form a single topic or focus or several such concepts is left open.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Corpus and tagging
</SectionTitle>
      <Paragraph position="0"> The corpus used includes the same data as in section 2 with additional 133 dialogues collected from the same source. Basically each dialogue turn was treated as an utterance, with the exception that long turns were segmented into sentence-like segments, which were then considered to be utterances2. Utterances consisting of only one word were re2Non-textual cues such as silences within turns could not be considered for segmenting because they were not marked in the data.</Paragraph>
      <Paragraph position="1"> moved from the data. The training data contained 11464 words in 1704 utterances. Of the words 17 % were tagged as topic, and 28 % as focus. The test data consisted of 11750 words in 1415 utterances, with 14 % tagged as topic and 25 % as focus.</Paragraph>
      <Paragraph position="2"> In tagging the topic and focus words in the corpus, the following definitions were employed: In interrogative clauses focus consists of those words that form the exact entity that is being asked and all the other words that define the subject are tagged as belonging to the topic. In declarative sentences that function as answers words that form the core of the answer are tagged as 'focus', and other words that merely provide context for the specific answer are tagged as 'topic'. In other declarative sentences 'topics' are words that define the subject matter and 'focus' is applied to words that communicate what is being said about the topic. Regardless, the tagging task was in many cases quite difficult, and the resulting choice of tags often debatable.</Paragraph>
      <Paragraph position="3"> As is charasteristic of spoken language, the data contained a noticeable percentage (35 %) of elliptic utterances, which didn't contain any topic words. Multiple topic constructs, on the other hand, were quite rare: more than one topic concept occurred in only 1 % of the utterances. The pronouns were quite evenly distributed with regard to position in the utterances: 32 % were in medial and 36 % in final position3.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 The probabilistic model
</SectionTitle>
      <Paragraph position="0"> The probability of a word belonging to the class topic, focus or other is modeled as</Paragraph>
      <Paragraph position="2"> where W denotes the word, S its position in an utterance, and Ti [?] {topic, focus, other} stands for the class. The model thus assumes that being a topic or a focus word is dependent on the properties of that particular word as well as its position in the utterance. Due 3We interpreted 'medial' to mean the middle third of the sentence, and 'final' to be the last third of the sentence.</Paragraph>
      <Paragraph position="3"> to computational reasons we made the simplifying assumption that these two effects are independent, i.e., P(W,S) = P(W)P(S).</Paragraph>
      <Paragraph position="4"> Maximum likelihood estimates are used for the terms P(Ti|W) for already seen words.</Paragraph>
      <Paragraph position="5"> Moreover, for unseen words we use the average of the models of words seen only rarely (once or twice) in the training data.</Paragraph>
      <Paragraph position="6"> For the term P(Ti|S) that describes the effect of the position of a word we use a softmax model, namely</Paragraph>
      <Paragraph position="8"> where the index j identifies the word and xj is the position of the word j. The functions qi are defined as simple linear functions</Paragraph>
      <Paragraph position="10"> The parameters ai and bi are estimated from the training data. For the class T3 (other), these parameters are set to a constant value of zero.</Paragraph>
      <Paragraph position="11">  When evaluating the rest of the model parameters we use two methods, first Maximum Likelihood estimation and then Bayesian variational analysis.</Paragraph>
      <Paragraph position="12"> In ML estimation the cost function is the log likelihood of the training data D given the model M, i.e,</Paragraph>
      <Paragraph position="14"> The logarithmic term is approximated by a Taylor series of first degree and the parameters can then be solved as usual, by setting the partial derivatives of lnP(D|M) to zero with regard to each parameter. The parameters bi can be solved analytically and the parameters ai are solved using Newton iteration.</Paragraph>
      <Paragraph position="15">  The ML estimation is known to be prone to overlearning the properties of the training data. In contrast, in the Bayesian approach, also the model cost is included in the cost function and can be used to avoid overlearning. For comparison, we thus tried also the Bayesian approach utilizing the software and methodology introduced in (Valpola et al., 2001). The method is based on variational analysis and uses ensemble learning for estimating the model parameters. The methodology and the software allows for the optimization of the model structure with roughly linear computational complexity without the risk of over-fitting the model. However, in these experiments the model structure was not optimized.</Paragraph>
      <Paragraph position="16">  information Furthermore, to study the importance of the position information, we calculated the probabilities using only ML estimates for P(T|W), i.e., disregarding the position of the word.</Paragraph>
      <Paragraph position="17">  As a comparison, we applied the tfxidf weighting scheme, which is commonly used in information retrieval for weighting content words. This method does not benefit from the labeling of the training data. For this reason, it does not differentiate between 'topic' and 'focus' words.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Experiments and results
</SectionTitle>
      <Paragraph position="0"> The following experiment was performed using each described method: For each utterance in the test data, n words were tagged as topic, and likewise for the focus category.</Paragraph>
      <Paragraph position="1"> Further, n was varied from 1 to 8 to produce the results depicted in Figure 1.</Paragraph>
      <Paragraph position="2"> As can be seen, the Bayesian variational analysis and the maximum likelihood estimation produce nearly identical performances. This is perhaps due to the use of very smooth model family, namely first-order polynomials, for taking into account the effect of the position of the word. For this reason, overlearn- null likelihood, Bayes = Bayesian variational analysis, No pos. inf. = without position information, Idf = tfxidf weighting, Random = the average precision with random selection.) ing is not problem even for the ML estimation. However, since the nearly identical results were obtained using two completely different implementations of quite similar methods, this can be considered as a validation experiment on either implementation and optimization method. In total, it seems that the full statistical model designed works rather well especially in focus identification.</Paragraph>
      <Paragraph position="3"> When compared to the full model, disregarding position information altoghether results in inferior performance. The difference is statistically significant (p [?] 0.05) in focus identification for all values of n and in topic identification for small values of n. Moreover, the performance of the tfxidf scheme is clearly inferior in either task. However, it seems that the tfxidf definition of word importance corresponds more closely with the definition of 'focus' than that of 'topic'.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>