File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/04/j04-3002_abstr.xml
Size: 7,765 bytes
Last Modified: 2025-10-06 13:43:23
<?xml version="1.0" standalone="yes"?> <Paper uid="J04-3002"> <Title>at Asheville</Title> <Section position="2" start_page="0" end_page="279" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Subjectivity in natural language refers to aspects of language used to express opinions, evaluations, and speculations (Banfield 1982; Wiebe 1994). Many natural language processing (NLP) applications could benefit from being able to distinguish subjective language from language used to objectively present factual information.</Paragraph> <Paragraph position="1"> Current extraction and retrieval technology focuses almost exclusively on the sub-ject matter of documents. However, additional aspects of a document influence its relevance, including evidential status and attitude (Kessler, Nunberg, Sch &quot;utze 1997).</Paragraph> <Paragraph position="2"> Information extraction systems should be able to distinguish between factual information (which should be extracted) and nonfactual information (which should be Computational Linguistics Volume 30, Number 3 discarded or labeled as uncertain). Question-answering systems should distinguish between factual and speculative answers. Multi-perspective question answering aims to present multiple answers to the user based upon speculation or opinions derived from different sources (Carbonell 1979; Wiebe et al. 2003). Multidocument summarization systems should summarize different opinions and perspectives. Automatic subjectivity analysis would also be useful to perform flame recognition (Spertus 1997; Kaufer 2000), e-mail classification (Aone, Ramos-Santacruze, and Niehaus 2000), intellectual attribution in text (Teufel and Moens 2000), recognition of speaker role in radio broadcasts (Barzialy et al. 2000), review mining (Terveen et al. 1997), review classification (Turney 2002; Pang, Lee, and Vaithyanathan 2002), style in generation (Hovy 1987), and clustering documents by ideological point of view (Sack 1995). In general, nearly any information-seeking system could benefit from knowledge of how opinionated a text is and whether or not the writer purports to objectively present factual material.</Paragraph> <Paragraph position="3"> To perform automatic subjectivity analysis, good clues must be found. A huge variety of words and phrases have subjective usages, and while some manually developed resources exist, such as dictionaries of affective language (General-Inquirer 2000; Heise 2000) and subjective features in general-purpose lexicons (e.g., the attitude adverb features in Comlex [Macleod, Grishman, and Meyers 1998]), there is no comprehensive dictionary of subjective language. In addition, many expressions with subjective usages have objective usages as well, so a dictionary alone would not suffice. An NLP system must disambiguate these expressions in context.</Paragraph> <Paragraph position="4"> The goal of our work is learning subjective language from corpora. In this article, we generate and test subjectivity clues and contextual features and use the knowledge we gain to recognize subjective sentences and opinionated documents.</Paragraph> <Paragraph position="5"> Two kinds of data are available to us: a relatively small amount of data manually annotated at the expression level (i.e., labels on individual words and phrases) of Wall Street Journal and newsgroup data and a large amount of data with existing document-level annotations from the Wall Street Journal (opinion pieces, such as editorials and reviews, versus nonopinion pieces). Both are used as training data to identify clues of subjectivity. In addition, we cross-validate the results between the two types of annotation: The clues learned from the expression-level data are evaluated against the document-level annotations, and those learned using the document-level annotations are evaluated against the expression-level annotations.</Paragraph> <Paragraph position="6"> There were a number of motivations behind our decision to use document-level annotations, in addition to our manual annotations, to identify and evaluate clues of subjectivity. The document-level annotations were not produced according to our annotation scheme and were not produced for the purpose of training and evaluating an NLP system. Thus, they are an external influence from outside the laboratory. In addition, there are a great number of these data, enabling us to evaluate the results on a larger scale, using multiple large test sets. This and cross-training between the two types of annotations allows us to assess consistency in performance of the various identification procedures. Good performance in cross-validation experiments between different types of annotations is evidence that the results are not brittle.</Paragraph> <Paragraph position="7"> We focus on three types of subjectivity clues. The first are hapax legomena, the set of words that appear just once in the corpus. We refer to them here as unique words.</Paragraph> <Paragraph position="8"> The set of all unique words is a feature with high frequency and significantly higher precision than baseline (Section 3.2).</Paragraph> <Paragraph position="9"> The second are collocations (Section 3.3). We demonstrate a straightforward method for automatically identifying collocational clues of subjectivity in texts. The method is first used to identify fixed n-grams, such as of the century and get out of here. Interest- null Wiebe, Wilson, Bruce, Bell, and Martin Learning Subjective Language ingly, many include noncontent words that are typically on stop lists of NLP systems (e.g., of, the, get, out, here in the above examples). The method is then used to identify an unusual form of collocation: One or more positions in the collocation may be filled by any word (of an appropriate part of speech) that is unique in the test data.</Paragraph> <Paragraph position="10"> The third type of subjectivity clue we examine here are adjective and verb features identified using the results of a method for clustering words according to distributional similarity (Lin 1998) (Section 3.4). We hypothesized that two words may be distributionally similar because they are both potentially subjective (e.g., tragic, sad, and poignant are identified from bizarre). In addition, we use distributional similarity to improve estimates of unseen events: A word is selected or discarded based on the precision of it together with its n most similar neighbors.</Paragraph> <Paragraph position="11"> We show that the various subjectivity clues perform better and worse on the same data sets, exhibiting an important consistency in performance (Section 4.2).</Paragraph> <Paragraph position="12"> In addition to learning and evaluating clues associated with subjectivity, we address disambiguating them in context, that is, identifying instances of clues that are subjective in context (Sections 4.3 and 4.4). We find that the density of clues in the surrounding context is an important influence. Using two types of annotations serves us well here, too. It enables us to use manual judgments to identify parameters for disambiguating instances of automatically identified clues. High-density clues are high precision in both the expression-level and document-level data. In addition, we give the results of a new annotation study showing that most high-density clues are in subjective text spans (Section 4.5). Finally, we use the clues together to perform document-level classification, to further demonstrate the utility of the acquired knowledge (Section 4.6).</Paragraph> <Paragraph position="13"> At the end of the article, we discuss related work (Section 5) and conclusions (Section 6).</Paragraph> </Section> class="xml-element"></Paper>