File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/03/w03-1017_evalu.xml

Size: 4,498 bytes

Last Modified: 2025-10-06 13:59:04

<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1017">
  <Title>Towards Answering Opinion Questions: Separating Facts from Opinions and Identifying the Polarity of Opinion Sentences</Title>
  <Section position="9" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
8 Results and Discussion
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Document Classification
</SectionTitle>
      <Paragraph position="0">We trained our Bayes classifier for documents on 4,000 articles from the WSJ portion of our combined TREC collection, and evaluated it on 4,000 other articles, also from the WSJ portion. Table 2 lists the F-measure scores (the harmonic mean of precision and recall) of our Bayesian classifier for document-level opinion/fact classification. The classifier achieved 97% F-measure, which is comparable to or higher than the 93% accuracy reported by Wiebe et al. (2002), who evaluated their work on a similar set of WSJ articles. The high classification performance is also consistent with the high inter-rater agreement (kappa=0.95) for document-level fact/opinion annotation reported by Wiebe et al. (2002). Note that we trained and evaluated only on WSJ articles for which we could obtain article class metadata, so the classifier may perform less accurately on other newswire articles.</Paragraph>
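To make the evaluation concrete, the following is a minimal sketch of a naive Bayes opinion/fact document classifier scored with the F-measure defined above. It assumes scikit-learn is available; load_wsj_articles() is a hypothetical helper, and none of this is the authors' actual implementation.

# Minimal sketch of document-level opinion/fact classification with a
# naive Bayes classifier, evaluated by F-measure (the harmonic mean of
# precision and recall). load_wsj_articles() is a hypothetical helper
# returning parallel lists of article texts and "opinion"/"fact" labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

train_texts, train_labels = load_wsj_articles("train")  # 4,000 WSJ articles
test_texts, test_labels = load_wsj_articles("test")     # 4,000 held-out WSJ articles

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

classifier = MultinomialNB().fit(X_train, train_labels)
predicted = classifier.predict(X_test)

# F = 2 * precision * recall / (precision + recall)
print(f1_score(test_labels, predicted, pos_label="opinion"))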
      <Paragraph position="1">Sentence Classification Table 3 shows the recall and precision of the similarity-based approach, while Table 4 lists the recall and precision of naive Bayes (single and multiple classifiers) for sentence-level opinion/fact classification. In both cases, the results are better when we evaluate against Standard B, which contains only the sentences to which both human annotators assigned the same label; unsurprisingly, it is easier for the automatic system to produce the correct label in these more clear-cut cases.</Paragraph>
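For illustration, here is a hedged sketch of how the two evaluation conditions differ, computed with scikit-learn; the tiny label lists are invented, and treating annotator 1's labels as the Standard A gold standard is an assumption, not the paper's exact definition.

# Sketch of evaluating the same predictions against a full gold standard
# ("Standard A" here is approximated by annotator 1's labels) and against
# Standard B, the sentences on which both annotators agree. The tiny
# label lists below are invented for illustration.
from sklearn.metrics import precision_score, recall_score

def scores(gold, pred, label="opinion"):
    return (precision_score(gold, pred, pos_label=label),
            recall_score(gold, pred, pos_label=label))

predictions = ["opinion", "opinion", "fact", "opinion", "fact"]
annotator_1 = ["opinion", "fact",    "fact", "opinion", "fact"]
annotator_2 = ["opinion", "opinion", "fact", "fact",    "fact"]

print("Standard A:", scores(annotator_1, predictions))

# Standard B: keep only the clear-cut sentences where the annotators agree.
agree = [i for i, (a, b) in enumerate(zip(annotator_1, annotator_2)) if a == b]
print("Standard B:", scores([annotator_1[i] for i in agree],
                            [predictions[i] for i in agree]))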
      <Paragraph position="2">Our naive Bayes classifier achieves higher recall and precision for detecting opinions (80-90%) than for detecting facts (around 50%). While word and n-gram features had little effect on performance for the opinion class, they increased recall for the fact class roughly fivefold compared to the approach of Wiebe et al. (1999).</Paragraph>
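As a rough sketch of the kind of feature set described here (words, bigrams, and trigrams plus part-of-speech counts), one could combine vectorizers as below; this is an illustrative approximation using scikit-learn and NLTK, it omits the polarity features, and it is not the authors' feature extractor.

# Sketch of a combined feature set: words, bigrams, and trigrams plus
# part-of-speech tag counts, fed to a naive Bayes sentence classifier.
# Requires one-time NLTK downloads: nltk.download("punkt") and
# nltk.download("averaged_perceptron_tagger"). The polarity features
# mentioned in the text are omitted for brevity.
import nltk
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def pos_tag_string(text):
    # Replace each sentence by its sequence of POS tags, e.g. "DT JJ NN".
    return " ".join(tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text)))

features = FeatureUnion([
    ("ngrams", CountVectorizer(ngram_range=(1, 3))),        # words, bigrams, trigrams
    ("pos", CountVectorizer(preprocessor=pos_tag_string)),  # POS tag counts
])

sentence_classifier = Pipeline([("features", features), ("nb", MultinomialNB())])
# sentence_classifier.fit(train_sentences, train_labels)  # "opinion"/"fact" labels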
      <Paragraph position="3">In general, the additional features helped the classifier; the best performance is achieved when words, bigrams, trigrams, part-of-speech, and polarity are included in the feature set. Further, using multiple classifiers to automatically identify an appropriate subset of the data for training slightly increases performance.</Paragraph>
      <Paragraph position="4">Polarity Classification Using the method of Section 5.1, we automatically identified a total of 39,652 (65,773), 3,128 (4,426), 144,238 (195,984), and 22,279 (30,609) positive (negative) adjectives, adverbs, nouns, and verbs, respectively. Extracted positive words include inspirational, truly, luck, and achieve. Negative ones include depraved, disastrously, problem, and depress. Figure 1 plots the recall and precision of the extracted adjectives (using the manually labeled positive and negative adjectives as the gold standard) for randomly selected seed sets of 1, 20, and 100 pairs of positive and negative adjectives from the list of Hatzivassiloglou and McKeown (1997). Both recall and precision increase as the seed set becomes larger. We obtained similar results with the ANEW list of adjectives (Section 7). As an additional experiment, we tested the effect of ignoring sentences with negative particles, obtaining a small increase in both precision and recall.</Paragraph>
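The scorer below is only a deliberately simplified illustration of the seed-expansion idea (scoring candidate words by sentence-level co-occurrence with positive versus negative seed adjectives); it is not the actual method of Section 5.1, and all names are invented for the sketch.

# Illustrative, deliberately simplified seed-based polarity scorer: a
# candidate word is scored by how often it co-occurs in a sentence with
# positive versus negative seed adjectives. This only sketches the
# general seed-expansion idea, not the paper's Section 5.1 method.
from collections import Counter

def polarity_scores(sentences, pos_seeds, neg_seeds):
    """sentences: iterable of token lists; returns word -> score in [-1, 1]."""
    pos_seeds, neg_seeds = set(pos_seeds), set(neg_seeds)
    pos_cooc, neg_cooc = Counter(), Counter()
    for tokens in sentences:
        words = set(t.lower() for t in tokens)
        has_pos, has_neg = words & pos_seeds, words & neg_seeds
        for w in words - pos_seeds - neg_seeds:
            if has_pos:
                pos_cooc[w] += 1
            if has_neg:
                neg_cooc[w] += 1
    scores = {}
    for w in set(pos_cooc) | set(neg_cooc):
        p, n = pos_cooc[w], neg_cooc[w]
        scores[w] = (p - n) / (p + n)   # > 0 leans positive, < 0 negative
    return scores

Under a scheme like this, a larger seed set supplies more co-occurrence evidence per candidate word, which is consistent with the observation above that both recall and precision grow with the seed set.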
      <Paragraph position="5">We subsequently used the automatically extracted polarity score for each word to assign an aggregate polarity to opinion sentences. Table 5 lists the accuracy of our sentence-level polarity tagging over gold standards A and B for different sets of parts of speech. We experimented with different combinations of part-of-speech classes for calculating the aggregate polarity scores, and found that the combined evidence from adjectives, adverbs, and verbs achieves the highest accuracy (90% over a baseline of 48%). As in the case of sentence-level classification between opinion and fact, we also found performance to be higher on Standard B, for which humans exhibited consistent agreement.</Paragraph>
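A minimal sketch of the aggregation step, assuming word-level polarity scores (e.g., from a scorer like the one sketched above) and NLTK part-of-speech tags; restricting the sum to adjectives, adverbs, and verbs mirrors the best-performing combination reported here, but the zero threshold and function names are assumptions.

# Sketch of assigning an aggregate polarity to an opinion sentence by
# summing word-level polarity scores, restricted to selected POS classes
# (adjectives JJ*, adverbs RB*, and verbs VB*, the best combination
# reported above). word_scores maps words to scores in [-1, 1]; the zero
# threshold is an assumption. Requires the NLTK tokenizer and tagger models.
import nltk

def sentence_polarity(sentence, word_scores, pos_prefixes=("JJ", "RB", "VB")):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    total = sum(word_scores.get(word.lower(), 0.0)
                for word, tag in tagged
                if tag.startswith(pos_prefixes))
    return "positive" if total > 0 else "negative"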
    </Section>
  </Section>
</Paper>