File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/04/p04-1088_evalu.xml
Size: 11,356 bytes
Last Modified: 2025-10-06 13:59:13
<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1088"> <Title>FLSA: Extending Latent Semantic Analysis with features for dialogue act classification</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Results </SectionTitle>

Table 2: The reduced 2-dimensional matrix Ŵ

                 Doc 1    Doc 2    Doc 3    Doc 4     Doc 5     Doc 6    Doc 7
    Dim. 1       1.3076   0.4717   0.1529    1.6668    1.1737   0.1193   0.9101
    Dim. 2       1.5991   0.6797   0.0958   -1.3697   -0.4771   0.2844   0.4205

Table 3 (fragment): the Word-Document matrix augmented with the features <Giver> and <Follower>

                 Doc 1   Doc 2   Doc 3   Doc 4   Doc 5   Doc 6   Doc 7
    do           1       1       0       0       0       0       1
    ...          ...     ...     ...     ...     ...     ...     ...
    right        0       0       0       0       0       1       0
    <Giver>      1       0       1       1       0       1       0
    <Follower>   0       1       0       0       1       0       1

<Paragraph position="0"> Table 4 reports the results we obtained for each corpus and method (to train and evaluate each method, we used 5-fold cross-validation). We include: the baseline, computed as picking the most frequent DA in each corpus (the baseline is the same for CallHome37 and CallHome10 because in both, statement is the most frequent DA); the accuracy for LSA; the best accuracy for FLSA, and with what combination of features it was obtained; and the best published result, taken from (Ries, 1999) and from (Lager and Zinovjeva, 1999) respectively for CallHome and for MapTask. Finally, for both LSA and FLSA, Table 4 includes, in parentheses, the dimension k of the reduced semantic space. For each LSA method and corpus, we experimented with values of k between 25 and 350; the values of k that give the best results for each method were thus selected empirically.</Paragraph> <Paragraph position="1"> In all cases, we can see that LSA performs much better than the baseline. Moreover, FLSA further improves performance, dramatically in the case of MapTask: it reduces error rates by between 60% and 78% for all corpora other than DIAG-NLP (all differences in performance between LSA and FLSA are significant, other than for DIAG-NLP). DIAG-NLP may be too small a corpus to train FLSA, or Consult Type may not be an effective feature; however, it was the only feature appropriate for FLSA (Sec. 5 discusses how we chose appropriate features). Another extension to LSA we developed, Clustered LSA, did give an improvement in performance for DIAG (79.24%); please see (Serafin, 2003).</Paragraph>
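By way of illustration, here is a minimal sketch of the FLSA pipeline in Python/NumPy, run on the visible rows of Table 3. It is a sketch under stated assumptions, not the implementation used in the paper: the toy counts and DA tags are hypothetical, k = 2 mirrors the toy example of Table 2 rather than the 25-350 range used in the experiments, and tagging by the single nearest training document is one plausible reading of the LSA-based classification scheme.

    import numpy as np

    # Word-Document matrix in the style of Table 3: one row per word plus
    # one row per feature value (<Giver>, <Follower>), one column per doc.
    rows = ["do", "right", "<Giver>", "<Follower>"]
    W = np.array([
        [1, 1, 0, 0, 0, 0, 1],   # do
        [0, 0, 0, 0, 0, 1, 0],   # right
        [1, 0, 1, 1, 0, 1, 0],   # <Giver>
        [0, 1, 0, 0, 1, 0, 1],   # <Follower>
    ], dtype=float)

    k = 2  # dimension of the reduced space (25-350 in the experiments)

    # Truncated SVD: W ~ U_k S_k V_k^T. The columns of S_k V_k^T are the
    # k-dimensional document vectors, i.e., the reduced matrix of Table 2.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    W_hat = np.diag(s_k) @ Vt_k

    def fold_in(q):
        """Project a new utterance's word+feature counts into the reduced
        space (standard LSA fold-in): q_hat = q U_k S_k^{-1}."""
        return q @ U_k / s_k

    def classify(q, doc_tags):
        """Tag an utterance with the DA of the most similar training
        document, by cosine similarity in the reduced space."""
        qh = fold_in(np.asarray(q, dtype=float))
        D = W_hat.T  # documents as rows
        sims = D @ qh / (np.linalg.norm(D, axis=1) * np.linalg.norm(qh) + 1e-12)
        return doc_tags[int(np.argmax(sims))]

    # Hypothetical DA tags for the seven toy documents:
    tags = ["instruct", "acknowledge", "query-yn", "reply-y",
            "acknowledge", "instruct", "reply-n"]
    print(classify([1, 0, 1, 0], tags))  # "do" uttered by the <Giver>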
<Paragraph position="2"> As regards comparable approaches, the performance of FLSA is as good or better. For Spanish CallHome, (Ries, 1999) reports 76.2% accuracy with a hybrid approach that couples Neural Networks and ngram backoff modeling; the former uses prosodic features and POS tags and, interestingly, works best with unigram backoff modeling, i.e., without taking the DA history into account (see our discussion of the ineffectiveness of the DA history below). However, (Ries, 1999) does not mention his target classification, and the reported baseline of picking the most frequent DA appears compatible with both CallHome37 and CallHome10. Thus, our results with FLSA are slightly worse (-1.33%) or better (+2.68%) than Ries', depending on the target classification. On MapTask, (Lager and Zinovjeva, 1999) achieves 62.1% with Transformation Based Learning using single words, bigrams, word position within the utterance, previous DA, speaker and change of speaker. We achieve much better performance on MapTask with a number of our FLSA models.</Paragraph> <Paragraph position="3"> As regards results on DA classification for other corpora, the best performances obtained are up to 75% for task-oriented dialogues such as Verbmobil (Samuel et al., 1998). (Stolcke et al., 2000) reports an impressive 71% accuracy on transcribed Switchboard dialogues, using a tag set of 42 DAs; these are unrestricted English telephone conversations between two strangers discussing a general-interest topic. The DA classification task appears more difficult for corpora such as Switchboard and Spanish CallHome, which cannot benefit from the regularities imposed on the dialogue by a specific task. (Stolcke et al., 2000) employs a combination of HMMs, neural networks and decision trees trained on all available features (words, prosody, sequence of DAs and speaker identity).</Paragraph> <Paragraph position="4"> Table 5 reports a breakdown of the experimental results obtained with FLSA for the three tasks on which it was successful (Table 5 does not include k, which is always 25 for CallHome37 and CallHome10, and varies between 25 and 75 for MapTask). For each corpus, the results below the line are significantly better than those obtained with LSA. For MapTask, the first four results that improve on LSA (from POS to Previous DA) are still rather low; there is a 19% difference in FLSA's performance between adding Previous DA and adding Game.</Paragraph> <Paragraph position="5"> Analysis. A few general conclusions can be drawn from Table 5, as they apply in all three cases. First, using the previous DA does not help, either at all (CallHome37 and CallHome10) or very little (MapTask), and increasing the length of the dialogue history does not improve performance. In other experiments, we increased the history length up to n = 4 and found that the higher n, the worse the performance. As we will see in Section 5, introducing any new feature results in a larger and sparser initial matrix, which makes the task harder for FLSA; to be effective, the amount of information provided by the new feature must be sufficient to overcome this handicap. Clearly, the longer the dialogue history, the sparser the initial matrix becomes, which explains why performance decreases with history length. It does not explain, however, why using even only the previous DA does not help; the implication is that the previous DA does not provide much information, as is in fact shown numerically in Section 5. This is surprising, because the DA history is usually considered an important determinant of the current DA (though (Ries, 1999) observed the same).</Paragraph> <Paragraph position="6"> Second, the notion of Game appears to be really powerful, as it vastly improves performance on two corpora as different as CallHome and MapTask. This holds even though a game is identified by its initiating DA: with the FLSA model that employs Game + Speaker, the matching rates for initiating and non-initiating DAs are 78.12% and 71.67% respectively. Hence, even if Game makes initiating moves easier to classify, it is highly beneficial for the classification of non-initiating moves as well. We will come back to the usage of Game in a real dialogue system in Section 6.</Paragraph> <Paragraph position="7"> Third, the syntactic features we had access to do not seem to improve performance (they were available only for MapTask). In MapTask, SRule indicates the main structure of the utterance, such as Declarative or Wh-question. It is not surprising that SRule did not help, since it is well known that syntactic form is not predictive of DAs, especially those with an indirect speech act flavor (Searle, 1975). POS tags do not help LSA either, as already observed by (Wiemer-Hastings, 2001; Kanejiya et al., 2003) for other tasks. The likely reason is that a different 'word' must be added for each distinct word-POS pair, e.g., route is split into route-NN and route-VB (see the sketch below). This makes the Word-Document matrix much sparser: for MapTask, the number of rows increases from 1,835 for plain LSA to 2,324 for FLSA.</Paragraph>
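As a concrete illustration of the splitting just described, here is a small sketch; the function name and the (token, POS-tag) input format are assumptions rather than the paper's code, and any POS tagger producing such pairs would do.

    # Each distinct word-POS pair becomes its own row in the
    # Word-Document matrix, e.g. route -> route-NN vs. route-VB.
    def pos_augment(tagged_utterance):
        return [f"{word}-{tag}" for word, tag in tagged_utterance]

    print(pos_augment([("follow", "VB"), ("the", "DT"), ("route", "NN")]))
    # -> ['follow-VB', 'the-DT', 'route-NN']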
<Paragraph position="8"> These negative results on adding syntactic information to LSA may just reinforce one of the claims of the LSA proponents, that structural information is irrelevant for determining meaning (Landauer and Dumais, 1997). Alternatively, syntactic information may need to be added to LSA in different ways.</Paragraph> <Paragraph position="9"> (Wiemer-Hastings, 2001) discusses applying LSA to each syntactic component of the sentence (subject, verb, rest of sentence) and averaging those three measures to obtain a final similarity measure; the results are better than with plain LSA. (Kintsch, 2001) proposes an algorithm that successfully differentiates the senses of predicates on the basis of their arguments, in which items of the semantic neighborhood of a predicate that are relevant to an argument are combined with the [LSA] predicate vector.</Paragraph> <SectionTitle> 5 How to select features for FLSA </SectionTitle> <Paragraph position="10"> An important issue is how to select features for FLSA. One possible answer is to exhaustively train one FLSA model for every possible feature combination. The problem is that training FLSA models is in general time consuming: for example, each FLSA model takes about 35 minutes of CPU time to train on CallHome37 and 17 minutes on MapTask, on computers with one Pentium 1.7GHz processor and 1GB of memory. Thus, it is better to focus only on the most promising models, especially when the number of features is high, because the number of combinations is exponential. In this work, we trained FLSA on each individual feature. Then, we trained FLSA on those feature combinations that we expected to be effective, either because each individual feature performed well, or because they include features deemed predictive of DAs, such as the previous DA(s), even when those did not perform well individually.</Paragraph> <Paragraph position="11"> After we ran our experiments, we performed a post hoc analysis based on the notion of Information Gain (IG) from decision tree learning (Quinlan, 1993). One approach to choosing the next feature to add to the tree at each iteration is to pick the one with the highest IG. Suppose the data set S is classified using n categories v_1 ... v_n, each with probability p_i. S's entropy H can be seen as an indicator of how uncertain the outcome of the classification is, and is given by:</Paragraph> <Paragraph position="12"> H(S) = -\sum_{i=1}^{n} p_i \log_2 p_i </Paragraph> <Paragraph position="13"> If feature F divides S into k subsets S_1 ... S_k, then IG is the expected reduction in entropy caused by partitioning the data according to the values of F:</Paragraph> <Paragraph position="14"> IG(S, F) = H(S) - \sum_{i=1}^{k} \frac{|S_i|}{|S|} H(S_i) </Paragraph> <Paragraph position="15"> In our case, we first computed the entropy of the corpora with respect to the classification induced by the DA tags (see Table 6, which also includes the LSA accuracy for convenience). Then, we computed the IG of the features and feature combinations we used in the FLSA experiments; a sketch of this computation follows.</Paragraph>
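The sketch below implements the two formulas above and shows the use suggested at the end of this section, ranking candidate features by IG before training FLSA models. The toy DA tags and feature values are hypothetical; the paper's actual computation over its corpora is not reproduced here.

    import math
    from collections import Counter

    def entropy(labels):
        """H(S) = -sum_i p_i log2 p_i over the DA tags in S."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())

    def information_gain(labels, feature_values):
        """IG(S, F) = H(S) - sum_k |S_k|/|S| H(S_k), where the S_k are
        the subsets of S induced by the values of feature F."""
        n = len(labels)
        subsets = {}
        for tag, value in zip(labels, feature_values):
            subsets.setdefault(value, []).append(tag)
        return entropy(labels) - sum(
            len(s) / n * entropy(s) for s in subsets.values())

    # Hypothetical toy data: a DA tag and two candidate features per utterance.
    das     = ["instruct", "acknowledge", "instruct", "reply-y", "acknowledge"]
    speaker = ["G", "F", "G", "F", "F"]
    prev_da = ["ready", "instruct", "acknowledge", "query-yn", "reply-y"]

    # Rank candidate features by IG to decide which FLSA models to train first.
    for name, values in [("Speaker", speaker), ("Previous DA", prev_da)]:
        print(name, round(information_gain(das, values), 3))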
<Paragraph position="16"> Table 7 reports the IG for most of the features from Table 5; it is ordered by FLSA performance.</Paragraph> <Paragraph position="17"> On the whole, IG appears to be a reasonably accurate predictor of performance: when a feature or feature combination has a high IG, e.g., over 1, there is also a large performance improvement. Occasionally, when the IG is small, this does not hold. For example, using the previous DA reduces the entropy by 0.21 for CallHome37, but performance actually decreases; most likely, the amount of new information it introduces is too low to overcome the handicap of a larger and sparser initial matrix. Also, when performance improves, it does not necessarily increase linearly with IG (compare, e.g., Game + Speaker + Previous DA with Game + Speaker for MapTask). Nevertheless, IG can be effectively used to weed out unpromising features, or to rank feature combinations so that the most promising FLSA models are trained first.</Paragraph> </Section> </Paper>