<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0404">
  <Title>Learning Subjective Nouns using Extraction Pattern Bootstrapping</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Subjectivity Data
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The Annotation Scheme
</SectionTitle>
      <Paragraph position="0"> In 2002, an annotation scheme was developed for a U.S. government-sponsored project with a team of 10 researchers (the annotation instructions and project reports are available on the Web at http://www.cs.pitt.edu/~wiebe/pubs/ardasummer02/).</Paragraph>
      <Paragraph position="1"> Edmonton, May-June 2003 held at HLT-NAACL 2003 , pp. 25-32 Proceeings of the Seventh CoNLL conference The scheme was inspired by work in linguistics and literary theory on subjectivity, which focuses on how opinions, emotions, etc. are expressed linguistically in context (Banfield, 1982). The scheme is more detailed and comprehensive than previous ones. We mention only those aspects of the annotation scheme relevant to this paper.</Paragraph>
      <Paragraph position="2"> The goal of the annotation scheme is to identify and characterize expressions of private states in a sentence. Private state is a general covering term for opinions, evaluations, emotions, and speculations (Quirk et al., 1985). For example, in sentence (1) the writer is expressing a  negative evaluation.</Paragraph>
      <Paragraph position="3"> (1) &amp;quot;The time has come, gentlemen, for Sharon, the assassin, to realize that injustice cannot last long.&amp;quot; Sentence (2) reflects the private state of Western countries. Mugabe's use of &amp;quot;overwhelmingly&amp;quot; also reflects a private state, his positive reaction to and characterization of his victory.</Paragraph>
      <Paragraph position="4"> (2) &amp;quot;Western countries were left frustrated and impotent after Robert Mugabe formally declared that he had overwhelmingly won Zimbabwe's presidential election.&amp;quot; Annotators are also asked to judge the strength of each private state. A private state can have low, medium, high or extreme strength.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Corpus and Agreement Results
</SectionTitle>
      <Paragraph position="0"> Our data consists of English-language versions of foreign news documents from FBIS, the U.S. Foreign Broadcast Information Service. The data is from a variety of publications and countries. The annotated corpus used to train and test our subjectivity classifiers (the experiment corpus) consists of 109 documents with a total of 2197 sentences. We used a separate, annotated tuning corpus of 33 documents with a total of 698 sentences to establish some experimental parameters.1 Each document was annotated by one or both of two annotators, A and T. To allow us to measure interannotator agreement, the annotators independently annotated the same 12 documents with a total of 178 sentences. We began with a strict measure of agreement at the sentence level by first considering whether the annotator marked any private-state expression, of any strength, anywhere in the sentence. If so, the sentence should be subjective.</Paragraph>
      <Paragraph position="1"> Otherwise, it is objective. Table 1 shows the contingency table. The percentage agreement is 88%, and the value  contractors this summer. We are working to resolve copyright issues to make it available to the wider research community.  strength cases removed One would expect that there are clear cases of objective sentences, clear cases of subjective sentences, and borderline sentences in between. The agreement study supports this. In terms of our annotations, we define a sentence as borderline if it has at least one private-state expression identified by at least one annotator, and all strength ratings of private-state expressions are low. Table 2 shows the agreement results when such borderline sentences are removed (19 sentences, or 11% of the agreement test corpus). The percentage agreement increases to 94% and the value increases to 0.87.</Paragraph>
      <Paragraph position="2"> As expected, the majority of disagreement cases involve low-strength subjectivity. The annotators consistently agree about which are the clear cases of subjective sentences. This leads us to define the gold-standard that we use in our experiments. A sentence is subjective if it contains at least one private-state expression of medium or higher strength. The second class, which we call objective, consists of everything else. Thus, sentences with only mild traces of subjectivity are tossed into the objective category, making the system's goal to find the clearly subjective sentences.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Using Extraction Patterns to Learn
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Subjective Nouns
</SectionTitle>
      <Paragraph position="0"> In the last few years, two bootstrapping algorithms have been developed to create semantic dictionaries by exploiting extraction patterns: Meta-Bootstrapping (Riloff and Jones, 1999) and Basilisk (Thelen and Riloff, 2002).</Paragraph>
      <Paragraph position="1"> Extraction patterns were originally developed for information extraction tasks (Cardie, 1997). They represent lexico-syntactic expressions that typically rely on shallow parsing and syntactic role assignment. For example, the pattern &amp;quot;&lt;subject&gt; was hired&amp;quot; would apply to sentences that contain the verb &amp;quot;hired&amp;quot; in the passive voice. The subject would be extracted as the hiree.</Paragraph>
      <Paragraph position="2"> Meta-Bootstrapping and Basilisk were designed to learn words that belong to a semantic category (e.g., &amp;quot;truck&amp;quot; is a VEHICLE and &amp;quot;seashore&amp;quot; is a LOCATION). Both algorithms begin with unannotated texts and seed words that represent a semantic category. A bootstrapping process looks for words that appear in the same extraction patterns as the seeds and hypothesizes that those words belong to the same semantic class. The principle behind this approach is that words of the same semantic class appear in similar pattern contexts. For example, the phrases &amp;quot;lived in&amp;quot; and &amp;quot;traveled to&amp;quot; will co-occur with many noun phrases that represent LOCATIONS.</Paragraph>
      <Paragraph position="3"> In our research, we want to automatically identify words that are subjective. Subjective terms have many different semantic meanings, but we believe that the same contextual principle applies to subjectivity. In this section, we briefly overview these bootstrapping algorithms and explain how we used them to generate lists of subjective nouns.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Meta-Bootstrapping
</SectionTitle>
      <Paragraph position="0"> The Meta-Bootstrapping (&amp;quot;MetaBoot&amp;quot;) process (Riloff and Jones, 1999) begins with a small set of seed words that represent a targeted semantic category (e.g., 10 words that represent LOCATIONS) and an unannotated corpus. First, MetaBoot automatically creates a set of extraction patterns for the corpus by applying and instantiating syntactic templates. This process literally produces thousands of extraction patterns that, collectively, will extract every noun phrase in the corpus. Next, MetaBoot computes a score for each pattern based upon the number of seed words among its extractions. The best pattern is saved and all of its extracted noun phrases are automatically labeled as the targeted semantic category.2 MetaBoot then re-scores the extraction patterns, using the original seed words as well as the newly labeled words, and the process repeats. This procedure is called mutual bootstrapping.</Paragraph>
      <Paragraph position="1"> A second level of bootstrapping (the &amp;quot;meta-&amp;quot; bootstrapping part) makes the algorithm more robust. When the mutual bootstrapping process is finished, all nouns that were put into the semantic dictionary are reevaluated. Each noun is assigned a score based on how many different patterns extracted it. Only the five best nouns are allowed to remain in the dictionary. The other entries are discarded, and the mutual bootstrapping process starts over again using the revised semantic dictionary. null</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Basilisk
</SectionTitle>
      <Paragraph position="0"> Basilisk (Thelen and Riloff, 2002) is a more recent bootstrapping algorithm that also utilizes extraction patterns to create a semantic dictionary. Similarly, Basilisk begins with an unannotated text corpus and a small set of</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2Our implementation of Meta-Bootstrapping learns individ-
</SectionTitle>
      <Paragraph position="0"> ual nouns (vs. noun phrases) and discards capitalized words.</Paragraph>
      <Paragraph position="1"> seed words for a semantic category. The bootstrapping process involves three steps. (1) Basilisk automatically generates a set of extraction patterns for the corpus and scores each pattern based upon the number of seed words among its extractions. This step is identical to the first step of Meta-Bootstrapping. Basilisk then puts the best patterns into a Pattern Pool. (2) All nouns3 extracted by a pattern in the Pattern Pool are put into a Candidate Word Pool. Basilisk scores each noun based upon the set of patterns that extracted it and their collective association with the seed words. (3) The top 10 nouns are labeled as the targeted semantic class and are added to the dictionary. The bootstrapping process then repeats, using the original seeds and the newly labeled words.</Paragraph>
      <Paragraph position="2"> The main difference between Basilisk and Meta-Bootstrapping is that Basilisk scores each noun based on collective information gathered from all patterns that extracted it. In contrast, Meta-Bootstrapping identifies a single best pattern and assumes that everything it extracted belongs to the same semantic class. The second level of bootstrapping smoothes over some of the problems caused by this assumption. In comparative experiments (Thelen and Riloff, 2002), Basilisk outperformed Meta-Bootstrapping. But since our goal of learning subjective nouns is different from the original intent of the algorithms, we tried them both. We also suspected they might learn different words, in which case using both algorithms could be worthwhile.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Experimental Results
</SectionTitle>
      <Paragraph position="0"> The Meta-Bootstrapping and Basilisk algorithms need seed words and an unannotated text corpus as input.</Paragraph>
      <Paragraph position="1"> Since we did not need annotated texts, we created a much larger training corpus, the bootstrapping corpus, by gathering 950 new texts from the FBIS source mentioned in Section 2.2. To find candidate seed words, we automatically identified 850 nouns that were positively correlated with subjective sentences in another data set. However, it is crucial that the seed words occur frequently in our FBIS texts or the bootstrapping process will not get off the ground. So we searched for each of the 850 nouns in the bootstrapping corpus, sorted them by frequency, and manually selected 20 high-frequency words that we judged to be strongly subjective. Table 3 shows the 20 seed words used for both Meta-Bootstrapping and Basilisk.</Paragraph>
      <Paragraph position="2"> We ran each bootstrapping algorithm for 400 iterations, generating 5 words per iteration. Basilisk generated 2000 nouns and Meta-Bootstrapping generated 1996 nouns.4 Table 4 shows some examples of extraction pat3Technically, each head noun of an extracted noun phrase. 4Meta-Bootstrapping will sometimes produce fewer than 5 words per iteration if it has low confidence in its judgements. cowardice embarrassment hatred outrage  crap fool hell slander delight gloom hypocrisy sigh disdain grievance love twit dismay happiness nonsense virtue</Paragraph>
      <Paragraph position="4"> Table 4 (extraction patterns and examples of nouns they extracted): expressed &lt;dobj&gt;: condolences, hope, grief, views, worries, recognition; indicative of &lt;np&gt;: compromise, desire, thinking; inject &lt;dobj&gt;: vitality, hatred; reaffirmed &lt;dobj&gt;: resolve, position, commitment; voiced &lt;dobj&gt;: outrage, support, skepticism, disagreement, opposition, concerns, gratitude, indignation; show of &lt;np&gt;: support, strength, goodwill, solidarity, feeling; &lt;subject&gt; was shared: anxiety, view, niceties, feeling.</Paragraph>
      <Paragraph position="1"> Meta-Bootstrapping and Basilisk are semi-automatic lexicon generation tools because, although the bootstrapping process is 100% automatic, the resulting lexicons need to be reviewed by a human.5 So we manually reviewed the 3996 words proposed by the algorithms. This process is very fast; it takes only a few seconds to classify each word. The entire review process took approximately 3-4 hours. One author did this labeling; this person did not look at or run tests on the experiment corpus.  We classified the words as StrongSubjective, WeakSubjective, or Objective. Objective terms are not subjective at all (e.g., &amp;quot;chair&amp;quot; or &amp;quot;city&amp;quot;). StrongSubjective terms have strong, unambiguously subjective connotations, such as &amp;quot;bully&amp;quot; or &amp;quot;barbarian&amp;quot;. WeakSubjective was used for three situations: (1) words that have weak subjective connotations, such as &amp;quot;aberration&amp;quot; which implies something out of the ordinary but does not evoke a strong sense of judgement, (2) words that have multiple senses or uses, where one is subjective but the other is not. For example, the word &amp;quot;plague&amp;quot; can refer to a disease (objective) or an onslaught of something negative (subjective), (3) words that are objective by themselves but appear in idiomatic expressions that are subjective. For example, the word &amp;quot;eyebrows&amp;quot; was labeled WeakSubjective because the expression &amp;quot;raised eyebrows&amp;quot; probably occurs more often in our corpus than literal references to &amp;quot;eyebrows&amp;quot;. Table 5 shows examples of learned words that were classified as StrongSubjective or WeakSubjective.</Paragraph>
      <Paragraph position="2"> Once the words had been manually classified, we could go back and measure the effectiveness of the algorithms. The graph in Figure 1 tracks their accuracy as the bootstrapping progressed. The X-axis shows the number of words generated so far. The Y-axis shows the percentage of those words that were manually classified as subjective. As is typical of bootstrapping algorithms, accuracy was high during the initial iterations but tapered off as the bootstrapping continued. After 20 words, both algorithms were 95% accurate. After 100 words Basilisk was 75% accurate and MetaBoot was 81% accurate. After 1000 words, accuracy dropped to about 28% for MetaBoot, but Basilisk was still performing reasonably well at 53%. Although 53% accuracy is not high for a fully automatic process, Basilisk depends on a human to review the words so 53% accuracy means that the human is accepting every other word, on average. Thus, the reviewer's time was still being spent productively even after 1000 words had been hypothesized.</Paragraph>
      <Paragraph position="3"> Table 6 shows the size of the final lexicons created by the bootstrapping algorithms. The first two columns show the number of subjective terms learned by Basilisk and Meta-Bootstrapping. Basilisk was more prolific, generating 825 subjective terms compared to 522 for Meta-Bootstrapping. The third column shows the intersection between their word lists. There was substantial overlap, but both algorithms produced many words that the other did not. The last column shows the results of merging their lists. In total, the bootstrapping algorithms produced 1052 subjective nouns.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Creating Subjectivity Classifiers
</SectionTitle>
    <Paragraph position="0"> To evaluate the subjective nouns, we trained a Naive Bayes classifier using the nouns as features. We also incorporated previously established subjectivity clues, and added some new discourse features. In this section, we describe all the feature sets and present performance results for subjectivity classifiers trained on different combinations of these features. The threshold values and feature representations used in this section are the ones that produced the best results on our separate tuning corpus.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Subjective Noun Features
</SectionTitle>
      <Paragraph position="0"> We defined four features to represent the sets of subjective nouns produced by the bootstrapping algorithms.</Paragraph>
      <Paragraph position="1"> BA-Strong: the set of StrongSubjective nouns generated by Basilisk BA-Weak: the set of WeakSubjective nouns generated by Basilisk MB-Strong: the set of StrongSubjective nouns generated by Meta-Bootstrapping MB-Weak: the set of WeakSubjective nouns generated by Meta-Bootstrapping For each set, we created a three-valued feature based on the presence of 0, 1, or 2 words from that set. We used the nouns as feature sets, rather than define a separate feature for each word, so the classifier could generalize over the set to minimize sparse data problems. We will refer to these as the SubjNoun features.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Previously Established Features
</SectionTitle>
      <Paragraph position="0"> Wiebe, Bruce, &amp; O'Hara (1999) developed a machine learning system to classify subjective sentences. We experimented with the features that they used, both to compare their results to ours and to see if we could benefit from their features. We will refer to these as the WBO features.</Paragraph>
      <Paragraph position="1"> WBO includes a set of stems positively correlated with the subjective training examples (subjStems) and a set of stems positively correlated with the objective training examples (objStems). We defined a three-valued feature for the presence of 0, 1, or 2 members of subjStems in a sentence, and likewise for objStems. For our experiments, subjStems includes stems that appear 7 times in the training set, and for which the precision is 1.25 times the baseline word precision for that training set.</Paragraph>
      <Paragraph position="2"> objStems contains the stems that appear 7 times and for which at least 50% of their occurrences in the training set are in objective sentences. WBO also includes a binary feature for each of the following: the presence in the sentence of a pronoun, an adjective, a cardinal number, a modal other than will, and an adverb other than not.</Paragraph>
      <Paragraph position="3"> We also added manually-developed features found by other researchers. We created 14 feature sets representing some classes from (Levin, 1993; Ballmer and Brennenstuhl, 1981), some Framenet lemmas with frame element experiencer (Baker et al., 1998), adjectives manually annotated for polarity (Hatzivassiloglou and McKeown, 1997), and some subjectivity clues listed in (Wiebe, 1990). We represented each set as a three-valued feature based on the presence of 0, 1, or 2 members of the set.</Paragraph>
      <Paragraph position="4"> We will refer to these as the manual features.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Discourse Features
</SectionTitle>
      <Paragraph position="0"> We created discourse features to capture the density of clues in the text surrounding a sentence. First, we computed the average number of subjective clues and objective clues per sentence, normalized by sentence length.</Paragraph>
      <Paragraph position="1"> The subjective clues, subjClues, are all sets for which 3-valued features were defined above (except objStems).</Paragraph>
      <Paragraph position="2"> The objective clues consist only of objStems. For sentence S, let ClueRatesubj(S) = jsubjClues in SjjSj and ClueRateobj(S) = jobjStems in SjjSj . Then we define AvgClueRatesubj to be the average of ClueRate(S) over all sentences S and similarly for AvgClueRateobj.</Paragraph>
      <Paragraph position="3"> Next, we characterize the number of subjective and objective clues in the previous and next sentences as: higher-than-expected (high), lower-than-expected (low), or expected (medium). The value for ClueRatesubj(S) is high if ClueRatesubj(S) AvgClueRatesubj 1:3;</Paragraph>
      <Paragraph position="5"> erwise it is medium. The values for ClueRateobj(S) are defined similarly.</Paragraph>
      <Paragraph position="6"> Using these definitions we created four features: ClueRatesubj for the previous and following sentences, and ClueRateobj for the previous and following sentences. We also defined a feature for sentence length. Let AvgSentLen be the average sentence length. SentLen(S) is high if length(S) AvgSentLen 1:3; low if length(S) AvgSentLen=1:3; and medium otherwise. null</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Classification Results
</SectionTitle>
      <Paragraph position="0"> We conducted experiments to evaluate the performance of the feature sets, both individually and in various combinations. Unless otherwise noted, all experiments involved training a Naive Bayes classifier using a particular set of features. We evaluated each classifier using 25fold cross validation on the experiment corpus and used paired t-tests to measure significance at the 95% confidence level. As our evaluation metrics, we computed accuracy (Acc) as the percentage of the system's classifications that match the gold-standard, and precision (Prec) and recall (Rec) with respect to subjective sentences.</Paragraph>
      <Paragraph position="1">  represents the common baseline of assigning every sentence to the most frequent class. The Most-Frequent baseline achieves 59% accuracy because 59% of the sentences in the gold-standard are subjective. Row (2) is a Naive Bayes classifier that uses the WBO features, which performed well in prior research on sentence-level subjectivity classification (Wiebe et al., 1999). Row (1) shows a Naive Bayes classifier that uses unigram bag-of-words features, with one binary feature for the absence or presence in the sentence of each word that appeared during training. Pang et al. (2002) reported that a similar experiment produced their best results on a related classification task. The difference in accuracy between Rows (1) and (2) is not statistically significant (Bag-of-Word's higher precision is balanced by WBO's higher recall).</Paragraph>
      <Paragraph position="2"> Next, we trained a Naive Bayes classifier using only the SubjNoun features. This classifier achieved good precision (77%) but only moderate recall (64%). Upon further inspection, we discovered that the subjective nouns are good subjectivity indicators when they appear, but not every subjective sentence contains one of them.</Paragraph>
      <Paragraph position="3"> And, relatively few sentences contain more than one, making it difficult to recognize contextual effects (i.e., multiple clues in a region). We concluded that the appropriate way to benefit from the subjective nouns is to use them in tandem with other subjectivity clues.</Paragraph>
      <Paragraph position="4">  trained with different combinations of features. The accuracy differences between all pairs of experiments in Table 8 are statistically significant. Row (3) uses only the WBO features (also shown in Table 7 as a baseline).</Paragraph>
      <Paragraph position="5"> Row (2) uses the WBO features as well as the SubjNoun features. There is a synergy between these feature sets: using both types of features achieves better performance than either one alone. The difference is mainly precision, presumably because the classifier found more and better combinations of features. In Row (1), we also added the manual and discourse features. The discourse features explicitly identify contexts in which multiple clues are found. This classifier produced even better performance, achieving 81.3% precision with 77.4% recall. The 76.1% accuracy result is significantly higher than the accuracy results for all of the other classifiers (in both Table 8 and Table 7).</Paragraph>
      <Paragraph position="6"> Finally, higher precision classification can be obtained by simply classifying a sentence as subjective if it contains any of the StrongSubjective nouns. On our data, this method produces 87% precision with 26% recall. This approach could support applications for which precision is paramount.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>