<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1641">
  <Title>Sentiment Retrieval using Generative Models</Title>
  <Section position="17" start_page="348" end_page="352" type="evalu">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> trieval models. We summarize this data set as follows. null AF This corpus contains news articles collected from 187 different foreign and U.S. news sources from June 2001 to May 2002. The corpus contains 535 documents, a total of 11,114 sentences.</Paragraph>
    <Paragraph position="1"> AF The majority of the articles are on 10 different topics, which are labeled at document level, but, in addition to these, a number of additional articles were randomly selected from a larger corpus of 270,000 documents.</Paragraph>
    <Paragraph position="2"> AF Each article was manually annotated using an annotation scheme for opinions and other private states at phrase level. We only used the annotations for sentiments that included some attributes such as polarity and strength.</Paragraph>
    <Paragraph position="3"> In this data set, the topic relevance for the 10 topics is known at the document level, but unknown at the sentence level. We assumed that all the sentences in a relevant document could be considered relevant to the topic.4 This data set was annotated with sentiment polarities at the phrase level, but not explicitly annotated at the sentence level. Therefore, we provided sentiment polarities at the sentence level to prepare training data and data for evaluation. We set the sentence-level sentiment polarity equal to the polarity with the highest strength in each sentence.5 null Queries were expressed using the title of one of the 10 topics and specified as positive or negative. Thus, we had 20 types of queries for our experiments. Because the supposed relevance judgments in this setting are imperfect at sentence level, we used bpref (Buckley and Voorhees, 2004), in both the training and testing phases, as it is known to be tolerant of imperfect judgments. Bpref uses binary relevance judgments to define the preference relation (i.e., any relevant document is preferred over any nonrelevant document for a given topic), while other measures, such as mean average precision, depend only on the ranks of the relevant documents.</Paragraph>
    <Paragraph position="4">  peared. We can also set the sentence-level sentiment polarity according to the presence of polarity in each sentence, but we did not consider this setting here.</Paragraph>
    <Section position="1" start_page="349" end_page="350" type="sub_section">
      <SectionTitle>
5.2 Extracting sentiment expressions
5.2.1 Using manual annotation
</SectionTitle>
      <Paragraph position="0"> Because the MPQA corpus was annotated with phrase-level sentiments, we can use these annotations to split a sentence into a topic part DB  and a sentiment part DB D7. The Krovetz stemmer (Krovetz, 1993) was applied to the topic part, the sentiment part and to the query terms6 and, for the retrieval experiments in Sections 5.3 and 5.4, a total of 418 stopwords from a standard stopword list were removed when they appeared.</Paragraph>
      <Paragraph position="1">  In automatic extraction of sentiment expressions in this study, we detected sentiment-bearing words using lists of words with established polarities. At this stage, topic dependence was not considered; however, at the stage of sentiment modeling, the topic dependence can be reflected, as described in Sections 3 and 4.</Paragraph>
      <Paragraph position="2"> We first prepared a list of words indicating sentiments. We used Hatzivassiloglou and McKeown's sentiment word list (Hatzivassiloglou and McKeown, 1997), which consists of 657 positive and 679 negative adjectives, and The General Inquirer (Stone et al., 1966), which contains 1621 positive and 1989 negative words.7 By merging these lists, we obtained 1947 positive and 2348 negative words. After stemming these words in the same manner as in Section 5.2.1, we were left with 1667 positive and 2129 negative words, which we will use hereafter in this paper.</Paragraph>
      <Paragraph position="3"> The sentiment polarities are sometimes sensitive to the structural information, for instance, a negation expression reverses the following sentiment polarity. To handle negation, every sentiment-bearing word was rewritten with a 'NEG' suffix, such as 'good NEG', if an odd number of negation expressions was found within the five preceding words in the sentence. To detect negation expressions, we used a predefined negation expression list. This negation handling is similar to that used in (Das and Chen, 2001; Pang et al., 2002). We extracted sentiment-bearing expressions using the list of words with established po- null The upper and lower tables correspond to positive and negative sentiments, respectively. The topic-independent sentiment relevance models (in the left two columns) correspond to rms, and the topic-dependent models (in the rest of the columns) correspond to rms-base, which is used for slm. larities, considering negation, as described above.</Paragraph>
      <Paragraph position="4"> Note that we used the list of words with sentiments to extract sentiment expressions, but we did not use the predefined sentiments to model sentiment relevance.</Paragraph>
      <Paragraph position="5"> Some expressions are sometimes used to express a certain topic, such as settlements in &amp;quot;Israeli settlements in Gaza and West Bank&amp;quot;; but at other times are used to express a certain sentiment, such as the same word in &amp;quot;All parties signed courtmediated compromise settlements&amp;quot;. Therefore, we will use whole sentences to model topic relevance, while we will use the automatically extracted sentiment expressions to model sentiment relevance, in Sections 5.3 and 5.4.</Paragraph>
    </Section>
    <Section position="2" start_page="350" end_page="351" type="sub_section">
      <SectionTitle>
5.3 Experiments on training-based task
</SectionTitle>
      <Paragraph position="0"> We conducted experiments on the training-based task described in Section 4.1, using either manual annotation as described in Section 5.2.1 or automatic annotation as described in Section 5.2.2.</Paragraph>
      <Paragraph position="1"> Table 1 contrasts sample probabilities from topic-independent sentiment relevance models and those from topic-dependent sentiment relevance models.</Paragraph>
      <Paragraph position="2"> In the left two columns of this table, two sets of sample probabilities using the topic-independent model are presented. One was computed from the manual annotation and the other was computed from the automatic annotation. In the remaining columns, samples using the topic-dependent model are shown according to the three topics: (1) &amp;quot;reaction to President Bush's 2002 State of the Union Address&amp;quot;, (2) &amp;quot;2002 presidential election in Zimbabwe&amp;quot;, and (3) &amp;quot;Israeli settlements in Gaza and West Bank&amp;quot;. A number of positive expressions appeared topic dependent, such as 'promise' (stemmed from 'promising' or not) and 'support' for Topic (1), 'legitimate' and 'congratulate' for Topic (2) and 'justify' and 'secure' for Topic (3); while negative expressions appeared topic-dependent, such as 'critic' (stemmed from 'criticism') and 'eyesore' for Topic (1), 'flaw' and 'condemn' for Topic (2) and 'mistake' and 'secure NEG' (i.e., 'secure' was negated) for Topic (3).</Paragraph>
      <Paragraph position="3"> Some expressions were unexpectedly generated regardless of the types of annotation, e.g., 'palestinian' for Topic (3); however, we found some characteristics in the results using automatic annotation. Some expressions on opinions that did not convey sentiments, such as 'state', frequently appeared regardless of topic. This sort of expression may effectively function as degrading sentences only conveying facts, but may function harmfully by catching sentences conveying opinions without sentiments in the task of sentiment retrieval. Some topic expressions, such as 'settle' (stemmed from 'settlement' or not) for Topic (3), were generated, because such words convey positive sentiments in some other contexts and thus they were contained in the list of sentiment-bearing words that we used for automatic annotation. This will not cause a topic relevance model to drift, because we modeled the topic relevance using whole sentences, as described in Section 5.2.2; however, it may harm the sentiment relevance model to some extent.</Paragraph>
      <Paragraph position="4">  ment over rmtf where D4 BO BCBMBCBH with the two-sided Wilcoxon signed-rank test.</Paragraph>
      <Paragraph position="5"> We performed retrieval experiments in the steps described in Section 4.1. For this purpose, we split the data into three parts: (i) DC% as the training data, (ii) B4BHBC A0 DCB5% as the evaluation data, and (iii) BHBC% as the test data.</Paragraph>
      <Paragraph position="6"> The test results of training-based task using manually annotated data and automatically annotated data are shown in Tables 2 and 3, respectively. The scores were computed according to the bpref evaluation measure (Buckley and Voorhees, 2004), as mentioned in Section 5.1. In addition to the bpref, mean average precision values are presented as 'AvgP' in the tables, for reference.8 In these tables, the top row indicates the percentages of the training data DC. It turned out that in all our experiments the appropriate fraction of training data was 40%. In this setting, our slm model worked 76.1% better than the query likelihood model and 32.6% better than the conventional relevance model, when using manual annotation, and both improvements were statistically significant according to the Wilcoxon signed-rank test.9 When using automatic annotation, the slm model worked 67.2% better than the query likelihood model and 25.9% better than the conventional relevance model, where both improvements were statistically significant. The rmt-base model also worked well with automatic annotation.</Paragraph>
    </Section>
    <Section position="3" start_page="351" end_page="352" type="sub_section">
      <SectionTitle>
5.4 Experiments on seed-based task
</SectionTitle>
      <Paragraph position="0"> For experiments on the seed-based task that was described in Section 4.1, we used three groups of  ment over rmtf where D4 BO BCBMBCBH with the two-sided Wilcoxon signed-rank test.</Paragraph>
      <Paragraph position="1"> seed words: C3BTC5, CCCDCA and C7CABZ. Each group consists of a positive word set D5  icism, fear, rejectCV.</Paragraph>
      <Paragraph position="2"> C3BTC5 and CCCDCA were used in (Kamps and Marx, 2002) and (Turney and Littman, 2003), respectively. We constructed C7CABZ considering sentiment-bearing words that may frequently appear in newspaper articles.</Paragraph>
      <Paragraph position="3"> We experimented with the seed-based task, making use of each of these seed word groups, in the steps described in Section 4.1. For this purpose, we split the data into two parts: (i) 50% as the estimation data and (ii) 50% as the test data. The test results using manually annotated data and automatically annotated data are shown in Tables 4 and 5, respectively, where the scores were computed according to the bpref evaluation measure. Mean average precision values are also presented as 'AvgP' in the tables, for reference. When using the manually annotated approach, our slm model worked well, especially with the seed word group C7CABZ, as shown in Table 4. Using C7CABZ, the slm model worked 61.2% better than the query likelihood model and 15.2% better than the conventional relevance model, where both improvements were statistically significant according to the Wilcoxon signed-rank test. Even  ment over rmtf where D4 BO BCBMBCBH with the two-sided Wilcoxon signed-rank test.</Paragraph>
      <Paragraph position="4"> using the other seed word groups, the slm model worked 49-56% better than the query likelihood model and 6-12% better than the conventional relevance model; however, the latter improvement was not statistically significant. The rmt-slm model also worked well with manual annotation.</Paragraph>
      <Paragraph position="5"> When using automatic annotation, the slm model worked 46-48% better than the query likelihood model and 4-6% better than the conventional relevance model, as shown in Table 5. The improvements over the conventional relevance model were statistically significant only when using CCCDCA or C3BTC5; however, the score when using C7CABZ is almost comparable with the others.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML