<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1025">
<Title>Determining Term Subjectivity and Term Orientation for Opinion Mining</Title>
<Section position="7" start_page="197" end_page="198" type="evalu">
<SectionTitle>5 Results</SectionTitle>
<Paragraph position="0"> We present results obtained from running every combination of (i) the three approaches to classification described in Section 4.3, (ii) the four learners mentioned in the same section, (iii) five different reduction factors for feature selection (0%, 50%, 90%, 95%, 99%), and (iv) the two different training sets (Tr3o and Tr4o) for Objective mentioned in Section 4.2. We discuss each of these four dimensions of the problem individually, in each case reporting results averaged across all the experiments we have run (see Table 1).</Paragraph>
<Paragraph position="1"> The first and most important observation is that, with respect to a pure term orientation task, accuracy drops significantly. In fact, the best SO-accuracy and the best PNO-accuracy results obtained across the 120 different experiments are .676 and .660, respectively (these were obtained by using Approach II with the PrTFIDF learner and no feature selection, with Tro = Tr3o for the .676 SO-accuracy result and Tro = Tr4o for the .660 PNO-accuracy result); this contrasts sharply with the accuracy obtained in (Esuli and Sebastiani, 2005) on discriminating Positive from Negative (where the best run obtained .830 accuracy), on the same benchmarks and with essentially the same algorithms. This suggests that good performance at orientation detection (as e.g. in (Esuli and Sebastiani, 2005; Hatzivassiloglou and McKeown, 1997; Turney and Littman, 2003)) is no guarantee of good performance at subjectivity detection, quite evidently a harder (and, as we have suggested, more realistic) task.</Paragraph>
<Paragraph position="2"> This hypothesis is confirmed by an experiment performed by Kim and Hovy (2004) on testing the agreement of two human coders at tagging words with the Positive, Negative, and Objective labels. The authors define two measures of such agreement: strict agreement, equivalent to our PNO-accuracy, and lenient agreement, which measures the accuracy at telling Negative from the rest. For any experiment, strict agreement values are therefore, by definition, lower than or equal to the corresponding lenient ones. The authors use two sets of 462 adjectives and 502 verbs, respectively, randomly extracted from the basic English word list of the TOEFL test. The inter-coder agreement results (see Table 2) show a deterioration in agreement (from lenient to strict) of 16.77% for adjectives and 36.42% for verbs. Following this, we evaluated our best experiment according to these measures, and obtained a &quot;strict&quot; accuracy value of .660 and a &quot;lenient&quot; accuracy value of .821, with a relative deterioration of 24.39%, in line with Kim and Hovy's observation (we observed this trend in all of our experiments). This confirms that determining subjectivity and orientation together is a much harder task than determining orientation alone.</Paragraph>
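<Paragraph> As a check on these figures, note that the relative deterioration reported here corresponds to the lenient-to-strict drop normalized by the strict value (a reading reconstructed from the reported numbers; the formula is not stated explicitly):
\[
\frac{A_{\mathrm{lenient}} - A_{\mathrm{strict}}}{A_{\mathrm{strict}}} = \frac{.821 - .660}{.660} \approx .2439 = 24.39\%.
\]
The 6.06% and 8.64% figures quoted in the next paragraph are instead drops relative to the best results: (.676 - .635)/.676 ≈ .0606 and (.660 - .603)/.660 ≈ .0864.</Paragraph>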
<Paragraph position="3"> The second important observation is that there is very little variance in the results: across all 120 experiments, the average SO-accuracy and PNO-accuracy results were .635 (with standard deviation s = .030) and .603 (s = .036), a mere 6.06% and 8.64% deterioration from the best results reported above. This seems to indicate that the levels of performance obtained may be hard to improve upon, especially if working in a similar framework.</Paragraph>
<Paragraph position="4"> Let us now analyse the individual dimensions of the problem. Concerning the three approaches to classification described in Section 4.3, Approach II outperforms the other two, but by an extremely narrow margin. As for the choice of learners, on average the best performer is NB, but again by a very small margin with respect to the others. On average, the best reduction factor for feature selection turns out to be 50%, but the performance drop we witness in approaching 99% (a dramatic reduction factor) is extremely graceful. As for the choice of TrKo, we note that Tr3o and Tr4o elicit comparable levels of performance, with the former performing best at SO-accuracy and the latter performing best at PNO-accuracy.</Paragraph>
<Paragraph position="5"> An interesting observation on the learners we have used is that NB, PrTFIDF and SVMs, unlike Rocchio, generate classifiers that depend on P(ci), the prior probabilities of the classes, which are normally estimated as the proportion of training documents that belong to ci. In many classification applications this is reasonable, as we may assume that the training data are sampled from the same distribution from which the test data are sampled, and that these proportions are thus indicative of the proportions that we are going to encounter in the test data. However, in our application this is not the case, since we do not have a &quot;natural&quot; sample of training terms. What we have is one human-labelled training term for each category in {Positive, Negative, Objective}, plus as many machine-labelled terms as we deem reasonable to include, possibly in different numbers for the different categories; and we have no indication whatsoever as to what the &quot;natural&quot; proportions among the three might be. This means that the proportions of Positive, Negative, and Objective terms we decide to include in the training set will strongly bias the classification results whenever the learner is one of NB, PrTFIDF and SVMs.</Paragraph>
<Paragraph position="6"> We may notice this by looking at Table 3, which shows the average proportion of test terms classified as Objective by each learner, depending on whether we have chosen Tro to coincide with Tr3o or Tr4o; note that the former (resp. latter) choice means having roughly as many (resp. roughly five times as many) Objective training terms as there are Positive and Negative ones. Table 3 shows that the more Objective training terms there are, the more test terms NB, PrTFIDF and (in particular) SVMs will classify as Objective; this is not true for Rocchio, which is basically unaffected by the variation in size of Tro.</Paragraph>
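<Paragraph> To make the prior-bias argument concrete, the following is a minimal sketch, not the code used in our experiments: the label names and term counts are assumptions chosen only to mimic the rough 1:1:1 proportions of Tr3o and 1:1:5 proportions of Tr4o. It shows how the estimated priors shift with the make-up of the training set:

# Minimal illustrative sketch (Python): how training-set proportions
# determine the class priors P(ci) that NB, PrTFIDF and SVMs rely on.
# The counts below are hypothetical, mimicking Tr3o vs Tr4o proportions.
from collections import Counter

def class_priors(train_labels):
    """Estimate P(ci) as the fraction of training terms labelled ci."""
    counts = Counter(train_labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

tr3o_like = ["Positive"] * 100 + ["Negative"] * 100 + ["Objective"] * 100
tr4o_like = ["Positive"] * 100 + ["Negative"] * 100 + ["Objective"] * 500

print(class_priors(tr3o_like))  # P(Objective) ~ 0.33
print(class_priors(tr4o_like))  # P(Objective) ~ 0.71: a prior-sensitive
                                # learner will label many more test terms
                                # Objective, as Table 3 shows.

A learner such as Rocchio, which ignores P(ci), is insensitive to this shift, consistently with the behaviour reported above.</Paragraph>
</Section>
</Paper>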