<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1034"> <Title>The Sentimental Factor: Improving Review Classification via Human-Provided Information</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Text documents are available in ever-increasing numbers, making automated techniques for information extraction increasingly useful. Traditionally, most research effort has been directed towards &quot;objective&quot; information, such as classification according to topic; however, interest is growing in extracting the opinions that a document contains, e.g., Morinaga et al. (2002). In March 2004, the American Association for Artificial Intelligence held a symposium in this area, entitled &quot;Exploring Affect and Attitude in Text.&quot; One task in opinion extraction is to label a review document d according to its prevailing sentiment s ∈ {-1, 1} (unfavorable or favorable). Several previous papers have addressed this problem by building models that rely exclusively upon labeled documents, e.g., Pang et al. (2002) and Dave et al. (2003). By learning models from labeled data, one can apply familiar, powerful techniques directly; however, in practice it may be difficult to obtain enough labeled reviews to estimate model parameters accurately.</Paragraph> <Paragraph position="1"> A contrasting approach (Turney, 2002) relies only upon documents whose labels are unknown. This makes it possible to use a large underlying corpus: in this case, the entire Internet as seen through the AltaVista search engine. As a result, estimates for model parameters are subject to a relatively small amount of random variation. The corresponding drawback of such an approach is that its predictions are not validated on actual documents.</Paragraph> <Paragraph position="2"> In machine learning, it has often been effective to use labeled and unlabeled examples in tandem, e.g.,
Nigam et al. (2000). Turney's model introduces the further consideration of incorporating human-provided knowledge about language. In this paper, we build models that utilize all three sources: labeled documents, unlabeled documents, and human-provided information.</Paragraph> <Paragraph position="3"> The basic concept behind Turney's model is quite simple. The &quot;sentiment orientation&quot; (Hatzivassiloglou and McKeown, 1997) of a pair of words is taken to be known. These words serve as &quot;anchors&quot; for positive and negative sentiment. Words that co-occur more frequently with one anchor than the other are themselves taken to be predictive of sentiment. As a result, information about a pair of words is generalized to many words, and then to documents.</Paragraph> <Paragraph position="4"> In the following section, we relate this model to Naive Bayes classification, showing that Turney's classifier is a &quot;pseudo-supervised&quot; approach: it effectively generates a new corpus of labeled documents, upon which it fits a Naive Bayes classifier. This insight allows the procedure to be represented as a probability model that is linear on the logistic scale, which in turn suggests the generalizations developed in subsequent sections.</Paragraph> </Section> </Paper>
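The anchor-word idea described in Paragraph 3 can be sketched in a few lines. This is a minimal illustration, not Turney's actual procedure (which issued AltaVista NEAR queries and used pointwise mutual information with the anchors "excellent" and "poor"); the anchor words, co-occurrence counts, and function names below are all hypothetical toy assumptions.

```python
from math import log

def sentiment_orientation(word, cooc, pos_anchor="excellent", neg_anchor="poor"):
    """PMI-style score: log-ratio of how often `word` co-occurs with the
    positive anchor versus the negative anchor (add-one smoothing).
    Positive score -> word leans favorable; negative -> unfavorable."""
    pos = cooc.get((word, pos_anchor), 0) + 1
    neg = cooc.get((word, neg_anchor), 0) + 1
    return log(pos / neg)

def classify_review(words, cooc):
    """Label a review +1 (favorable) or -1 (unfavorable) by summing
    per-word orientations, generalizing the two anchors to a document."""
    score = sum(sentiment_orientation(w, cooc) for w in words)
    return 1 if score >= 0 else -1

# Toy co-occurrence counts (hypothetical, for illustration only):
cooc = {
    ("superb", "excellent"): 20, ("superb", "poor"): 2,
    ("dull", "excellent"): 1,    ("dull", "poor"): 15,
}

print(classify_review(["superb", "dull", "superb"], cooc))  # prints 1
print(classify_review(["dull"], cooc))                      # prints -1
```

The design point the sketch makes concrete is the one the paragraph states: information known about only two anchor words is propagated first to every word via co-occurrence, and then to whole documents via a simple additive score.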