<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3402">
<Title>Off-Topic Detection in Conversational Telephone Speech</Title>
<Section position="7" start_page="10" end_page="12" type="evalu">
<SectionTitle>6 Results</SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="10" end_page="11" type="sub_section">
<SectionTitle>6.1 Performance of a Learned Classifier</SectionTitle>
<Paragraph position="0"> We evaluated the results of our experiments according to three criteria: accuracy, error cost, and plausibility of the annotations produced. In all cases our best results were obtained with the SVM. When evaluated on accuracy, the SVM models were the only ones that exceeded a baseline accuracy of 72.8%, which is the average percentage of On-Topic utterances in our data set. Table 4 displays the numerical results for each of the machine learning algorithms.</Paragraph>
<Paragraph position="1"> Figure 1 shows the average accuracy obtained with an SVM classifier using all features described in Section 5.1 except part-of-speech features (for reasons discussed below), varying the number of words considered. While the best results were obtained at the 100-word level, all classifiers demonstrated significant improvement in accuracy over the baseline. The average standard deviation over the 4 cross-validation runs of the results shown is 6 percentage points.</Paragraph>
<Paragraph position="2"> From a practical perspective, accuracy alone is not an appropriate metric for evaluating our results. If the goal is to eliminate Small Talk regions from conversations, mislabeling On-Topic regions as Small Talk potentially results in the elimination of useful material. Table 5 shows a confusion matrix for an SVM classifier trained on a data set at the 100-word level. We can see that the classifier is conservative, identifying 55% of the Small Talk but incorrectly labeling On-Topic utterances as Small Talk only 8% of the time.</Paragraph>
<Paragraph position="3"> Finally, we analyzed (by hand) the test data annotated by the classifiers. We found that, in general, the SVM classifiers annotated the conversations in a manner similar to the human annotators, transitioning from one label to another relatively infrequently, as illustrated in Table 1. This is in contrast to the 1-nearest-neighbor classifiers, which tended to annotate in a far more "jumpy" style.</Paragraph>
</Section>
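To make the evaluation protocol of Section 6.1 concrete, the following is a minimal sketch of how the majority-class baseline, the cross-validated SVM accuracy, and the error-cost rates read off the confusion matrix could be computed with scikit-learn. This is not the authors' code: the feature matrix X, the label names, and the linear kernel are assumptions made for illustration.

```python
# Minimal sketch of the Section 6.1 evaluation (illustrative, not the authors' code).
# Assumes X is an (n_utterances x n_features) array and y holds the labels
# "small_talk" / "on_topic"; both label names are assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate(X, y, n_folds=4):
    # Majority-class baseline: always predict the most frequent label
    # (72.8% On-Topic utterances in the paper's data set).
    _, counts = np.unique(y, return_counts=True)
    baseline = counts.max() / counts.sum()

    # 4-fold cross-validated predictions, mirroring the paper's setup;
    # the linear kernel is an assumption, not stated in this section.
    preds = cross_val_predict(SVC(kernel="linear"), X, y, cv=n_folds)
    accuracy = accuracy_score(y, preds)

    # Error-cost view (Table 5): how much Small Talk is caught, and how
    # often On-Topic material would be wrongly discarded.
    cm = confusion_matrix(y, preds, labels=["small_talk", "on_topic"])
    small_talk_recall = cm[0, 0] / cm[0].sum()    # ~55% in Table 5
    on_topic_mislabeled = cm[1, 0] / cm[1].sum()  # ~8% in Table 5
    return baseline, accuracy, small_talk_recall, on_topic_mislabeled
```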
<Section position="2" start_page="11" end_page="12" type="sub_section">
<SectionTitle>6.2 Relative Utility of Features</SectionTitle>
<Paragraph position="0"> Several of the features we used to describe our training and test examples were selected due to the claims of researchers such as Laver and Cheepen.</Paragraph>
<Paragraph position="1"> We were interested in determining the relative contributions of these various linguistically motivated features to our learned classifiers. Figure 1 and Table 6 report some of our findings; Table 6 gives the accuracy of the SVM at the 100-word level when features were left out or put in individually. Using proximity to the beginning of the conversation ("line numbers") as the sole feature, the SVM classifier achieved an accuracy of 75.6%. This clearly verifies the hypothesis that utterances near the beginning of the conversation have different properties than those that follow. In contrast, when we used only POS tags to train the SVM classifier, it achieved an accuracy exactly at the baseline. Moreover, removing POS features from the SVM classifier improved results (Table 6). This may indicate that detecting off-topic categories will require focusing on the words rather than the grammar of utterances.</Paragraph>
<Paragraph position="2"> On the other hand, part-of-speech information is implicit in the words (for example, an occurrence of "are" also indicates a present-tense verb), so perhaps labeling POS tags does not add any new information. It is also possible that some other detection approach and/or richer syntactic information (such as parse trees) would be beneficial.</Paragraph>
<Paragraph position="3"> Finally, the words with the highest feature quality measure (Table 2) clearly refute most of the third linguistic prediction. Helper words like "it", "there", and "the" appear roughly evenly in each region type. Moreover, all of the verbs in the top-20 Small Talk list are forms of "to be" (some of them contracted, as in "I'm"), while no "to be" words appear in the list for On-Topic. This is further evidence that differentiating off-topic speech depends deeply on the meaning of the words rather than on some more easily extracted feature.</Paragraph>
</Section>
</Section>
</Paper>
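The ablation behind Table 6 amounts to retraining the classifier with each feature group left out, and again with each group used by itself. Below is a hedged sketch of that loop; the feature-group names, the column layout, and the SVM settings are all assumptions, since this section does not specify them.

```python
# Sketch of the leave-one-out / single-group feature ablation summarized in
# Table 6 (illustrative, not the authors' code). `groups` maps a feature-group
# name to the columns it occupies in X; the example names below are assumptions.
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def ablate(X, y, groups, n_folds=4):
    results = {}
    for name, cols in groups.items():
        rest = [c for g, idx in groups.items() if g != name for c in idx]
        # Accuracy with this feature group removed...
        left_out = cross_val_score(SVC(kernel="linear"), X[:, rest], y, cv=n_folds).mean()
        # ...and with this feature group on its own.
        alone = cross_val_score(SVC(kernel="linear"), X[:, list(cols)], y, cv=n_folds).mean()
        results[name] = (left_out, alone)
    return results

# Hypothetical usage: line-number, POS, and word feature groups.
# groups = {"line_number": [0], "pos_tags": list(range(1, 40)), "words": list(range(40, 140))}
```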