<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1200"> <Title>Determining the Sentiment of Opinions</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"> The first experiment examines the two word sentiment classifier models; the second examines the three sentence sentiment classifier models.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Word Sentiment Classifier </SectionTitle> <Paragraph position="0"> For test material, we asked three humans to classify data. We started with a basic English word list for foreign students preparing for the TOEFL test and intersected it with an adjective list of 19,748 English adjectives and a verb list of 8,011 verbs to obtain common adjectives and verbs. From this we randomly selected 462 adjectives and 502 verbs for human classification. Human1 and human2 each classified the 462 adjectives; human2 and human3 classified the 502 verbs.</Paragraph> <Paragraph position="1"> The classification task is defined as assigning each word to one of three categories: positive, negative, and neutral.</Paragraph> <Paragraph position="2"> The strict measure is defined over all three categories, whereas the lenient measure is taken over only two categories, in which positive and neutral have been merged, for the case where we care only about differentiating words of negative sentiment.</Paragraph> <Paragraph position="3"> Table 5 shows the results, using Equation (2) of Section 2.1.1, compared against a baseline that randomly assigns a sentiment category to each word (averaged over 10 iterations). The system achieves lower agreement than humans but higher than the random process.</Paragraph> <Paragraph position="4"> Of the test data, the algorithm classified 93.07% of adjectives and 83.27% of verbs as either positive or negative. The remaining adjectives and verbs could not be classified, since their synonym sets did not overlap with those of the seed words.</Paragraph> <Paragraph position="5"> In Table 5, the seed list included just a few manually selected seed words (23 positive and 21 negative verbs, and 15 positive and 19 negative adjectives). We decided to investigate the effect of more seed words. After collecting the annotated data, we added half of it (231 adjectives and 251 verbs) to the training set, retaining the other half for testing. As Table 6 shows, agreement with humans improves for both adjectives and verbs. Recall is also improved.</Paragraph> </Section>
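To make the seed-expansion procedure concrete, the following is a minimal Python sketch of a synonym-overlap word classifier in the spirit of Section 2.1.1. It is not the authors' implementation: the seed lists, the `synonyms` helper, and the count-based scoring are illustrative stand-ins for Equation (2) and for a real thesaurus such as WordNet.

```python
# Minimal sketch of seed-based word sentiment classification, in the spirit
# of Section 2.1.1. The seed lists and the synonym source are illustrative;
# the paper expands its seed lists through WordNet synonyms and scores
# candidates with Equation (2), which this count-based rule only gestures at.

POSITIVE_SEEDS = {"good", "excellent", "happy", "fortunate"}   # hypothetical seeds
NEGATIVE_SEEDS = {"bad", "poor", "unhappy", "unfortunate"}     # hypothetical seeds

def synonyms(word):
    """Stand-in for a thesaurus lookup (e.g., WordNet synsets)."""
    toy_thesaurus = {
        "great": {"good", "excellent", "grand"},
        "awful": {"bad", "poor", "dreadful"},
    }
    return toy_thesaurus.get(word, set())

def classify_word(word):
    """Return 'positive', 'negative', or None when no synonym overlaps a seed set.

    The words the paper reports as unclassifiable correspond to the None case:
    their synonym sets share nothing with either expanded seed list.
    """
    syns = synonyms(word) | {word}
    pos = len(syns & POSITIVE_SEEDS)
    neg = len(syns & NEGATIVE_SEEDS)
    if pos == neg == 0:
        return None  # no overlap: the word stays unclassified
    return "positive" if pos >= neg else "negative"

print(classify_word("great"))   # positive
print(classify_word("awful"))   # negative
print(classify_word("table"))   # None (unclassified)
```

Growing the seed lists, as Table 6 investigates, simply widens the sets that `syns` can overlap with, which is why both agreement and recall improve.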
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Sentence Sentiment Classifier 3.2.1 Data </SectionTitle> <Paragraph position="0"> 100 sentences were selected from the DUC 2001 corpus on the topics &quot;illegal alien&quot;, &quot;term limits&quot;, &quot;gun control&quot;, and &quot;NAFTA&quot;. Two humans annotated the 100 sentences with three categories (positive, negative, and N/A).</Paragraph> <Paragraph position="1"> To measure the agreement between humans, we used the Kappa statistic (Siegel and Castellan Jr. 1988). The Kappa value for the annotation task of 100 sentences was 0.91, which is considered reliable.</Paragraph>
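For reference, the sketch below computes a two-annotator Kappa (the Cohen formulation, a close relative of the Siegel and Castellan statistic cited above) over the three sentence categories. The label sequences are invented for illustration, not the actual DUC 2001 annotations.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators over the same items:
    kappa = (P_o - P_e) / (1 - P_e), where P_o is the observed agreement and
    P_e is the agreement expected from each annotator's label frequencies."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_expected = sum(
        (freq_a[c] / n) * (freq_b[c] / n) for c in freq_a.keys() | freq_b.keys()
    )
    return (p_observed - p_expected) / (1 - p_expected)

# Toy example with the paper's three sentence categories; the labels are
# invented, not the paper's annotation data.
a = ["positive", "negative", "negative", "N/A", "positive", "negative"]
b = ["positive", "negative", "positive", "N/A", "positive", "negative"]
print(round(cohens_kappa(a, b), 3))  # about 0.739 for this toy data
```

A value of 0.91, as reported above, sits well past the 0.8 threshold conventionally treated as reliable agreement.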
<Paragraph position="2"> We experimented with the three models of sentiment classifiers from Section 2.2.3, using the four different window definitions and four variations of word-level classifiers (the two word sentiment equations introduced in Section 2.1.1, each with and without normalization, to compare performance).</Paragraph> <Paragraph position="3"> Since Model 0 considers not the probabilities of words but only their polarities, the two word-level classifier equations yield the same results for it. Consequently, Model 0 has 8 combinations, and Models 1 and 2 have 16 each.</Paragraph> <Paragraph position="4"> To test the identification of the opinion holder, we first ran the models with holders that had been annotated by humans, and then ran the same models with the automatic holder-finding strategies.</Paragraph> <Paragraph position="5"> The results appear in Figures 2 and 3, where the models are numbered m0 through m2. Correctness of an opinion is determined when the system finds both a correct holder and the appropriate sentiment within the sentence. Since human1 classified 33 sentences positive and 33 negative, random classification gives 33 out of 66 sentences. Similarly, since human2 classified 29 positive and 34 negative, random classification gives 34 out of 63 when the system blindly marks all sentences as negative and 29 out of 63 when it marks all as positive. The system's best model performed at 81% accuracy with the manually provided holder and at 67% accuracy with automatic holder detection.</Paragraph> </Section>
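For intuition about Model 0, here is a minimal sketch of a polarity-only sentence classifier: it ignores sentiment strength and combines just the signs of the sentiment words found inside a chosen window. The toy lexicon and the sign-product combination rule are assumptions for illustration; Section 2.2.3 (not reproduced in this excerpt) defines the models actually evaluated.

```python
# Illustrative Model 0-style sentence classifier: only word polarities matter,
# not their strengths, which is why both word-level equations coincide for it.
# The tiny lexicon, the whitespace tokenizer, and the sign-product rule are
# assumptions for illustration only.

WORD_POLARITY = {"hit": -1, "give": +1, "opportunities": +1}  # hypothetical lexicon

def classify_region(tokens, start, end):
    """Multiply the polarities of known sentiment words in tokens[start:end].

    Returns 'positive', 'negative', or None when the window contains no
    sentiment-bearing word (the case the paper cannot classify).
    """
    sign = 1
    seen = False
    for token in tokens[start:end]:
        polarity = WORD_POLARITY.get(token.lower(), 0)
        if polarity != 0:
            seen = True
            sign *= polarity
    if not seen:
        return None
    return "positive" if sign > 0 else "negative"

# The window bounds stand in for the paper's region definitions (e.g., the
# whole sentence, or window4 from the holder to the end of the sentence).
tokens = "Term limits really hit at democracy says Prof. Fenno".split()
print(classify_region(tokens, 0, len(tokens)))  # 'negative', driven by "hit"
```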
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Problems 3.3.1 Word Sentiment Classification </SectionTitle> <Paragraph position="0"> As mentioned, some words carry both strong positive and strong negative sentiment. For these words, it is difficult to pick one sentiment category without considering context. Second, a unigram model is not sufficient: common words with little sentiment of their own can combine to produce reliable sentiment. For example, in &quot;'Term limits really hit at democracy,' says Prof. Fenno&quot;, the common and polysemous word &quot;hit&quot; is used to express a negative point of view about term limits. If such combinations occur adjacently, we can use bigrams or trigrams in the seed word list. When they occur at a distance, however, it is more difficult to identify the sentiment correctly, especially if one of the words falls outside the sentiment region.</Paragraph> <Paragraph position="1"> Even in a single sentence, a holder might express two different opinions. Our system detects only the closest one.</Paragraph> <Paragraph position="2"> Another difficult problem is that the models cannot infer sentiments from facts in a sentence. &quot;She thinks term limits will give women more opportunities in politics&quot; expresses a positive opinion about term limits, but the absence of sentiment-bearing adjectives, verbs, and nouns prevents a classification.</Paragraph> <Paragraph position="3"> Although a relatively easy task for people, detecting an opinion holder is not simple either. As a result, our system sometimes picks the wrong holder when multiple plausible opinion holder candidates are present. Employing a parser to delimit opinion regions and more accurately associate them with potential holders should help.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Discussion </SectionTitle> <Paragraph position="0"> Which combination of models is best? The best overall performance is provided by Model 0. Apparently, the mere presence of negative words is more important than sentiment strength. For manually tagged holder and topic, Model 0 has the highest single performance, though Model 1 averages best.</Paragraph> <Paragraph position="1"> Which is better, a sentence or a region? With manually identified topic and holder, the region window4 (from the holder to the end of the sentence) performs better than the other regions.</Paragraph> <Paragraph position="2"> How do scores differ from manual to automatic holder identification? Table 7 compares the average results with automatic holder identification to those with manually annotated holders, across 40 different models.</Paragraph> <Paragraph position="3"> Around 7 more sentences (around 11%) were misclassified by the automatic detection method.</Paragraph> <Paragraph position="4"> Table 7: Sentence classification results (positive, negative, total) under manual and automatic holder detection.</Paragraph> <Paragraph position="5"> How does adding neutral sentiment as a separate category affect the score? It is very confusing, even for humans, to distinguish between a neutral opinion and a non-opinion-bearing sentence. In previous research, we built a sentence subjectivity classifier. Unfortunately, in most cases it classifies neutral and weak-sentiment sentences as non-opinion-bearing.</Paragraph> </Section> </Section> </Paper>