<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3808">
  <Title>Seeing stars when there aren't many stars: Graph-based semi-supervised learning for sentiment categorization</Title>
  <Section position="3" start_page="0" end_page="46" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Sentiment analysis of text documents has received considerable attention recently (Shanahan et al., 2005; Turney, 2002; Dave et al., 2003; Hu and Liu, 2004; Chaovalit and Zhou, 2005). Unlike traditional text categorization based on topics, sentiment analysis attempts to identify the subjective sentiment expressed (or implied) in documents, such as consumer product or movie reviews. In particular Pang and Lee proposed the rating-inference problem (2005). Rating inference is harder than binary positive / negative opinion classification. The goal is to infer a numerical rating from reviews, for example the number of &amp;quot;stars&amp;quot; that a critic gave to a movie. Pang and Lee showed that supervised machine learning techniques (classification and regression) work well for rating inference with large amounts of training data.</Paragraph>
    <Paragraph position="1"> However, review documents often do not come with numerical ratings. We call such documents unlabeled data. Standard supervised machine learning algorithms cannot learn from unlabeled data. Assigning labels can be a slow and expensive process because manual inspection and domain expertise are needed. Often only a small portion of the documents can be labeled within resource constraints, so most documents remain unlabeled. Supervised learning algorithms trained on small labeled sets suffer in performance. Can one use the unlabeled reviews to improve rating-inference? Pang and Lee (2005) suggested that doing so should be useful.</Paragraph>
    <Paragraph position="2"> We demonstrate that the answer is 'Yes.' Our approach is graph-based semi-supervised learning.</Paragraph>
    <Paragraph position="3"> Semi-supervised learning is an active research area in machine learning. It builds better classifiers or regressors using both labeled and unlabeled data, under appropriate assumptions (Zhu, 2005; Seeger, 2001). This paper contains three contributions: * We present a novel adaptation of graph-based semi-supervised learning (Zhu et al., 2003)  to the sentiment analysis domain, extending past supervised learning work by Pang and Lee (2005); * We design a special graph which encodes our assumptions for rating-inference problems (section 2), and present the associated optimization problem in section 3; * We show the benefit of semi-supervised learning for rating inference with extensive experimental results in section 4.</Paragraph>
    <Paragraph position="4"> 2 A Graph for Sentiment Categorization The semi-supervised rating-inference problem is formalized as follows. There are n review documents x1 ...xn, each represented by some standard feature representation (e.g., word-presence vectors). Without loss of generality, let the first l [?] n documents be labeled with ratings y1 ...yl [?] C. The remaining documents are unlabeled. In our experiments, the unlabeled documents are also the test documents, a setting known as transduction. The set of numerical ratings are C = {c1,...,cC}, with c1 &lt; ... &lt; cC [?] R. For example, a one-star to four-star movie rating system has C = {0,1,2,3}.</Paragraph>
    <Paragraph position="5"> We seek a function f : x mapsto- R that gives a continuous rating f(x) to a document x. Classification is done by mapping f(x) to the nearest discrete rating in C. Note this is ordinal classification, which differs from standard multi-class classification in that C is endowed with an order. In the following we use 'review' and 'document,' 'rating' and 'label' interchangeably. null We make two assumptions: 1. We are given a similarity measure wij [?] 0 between documents xi and xj. wij should be computable from features, so that we can measure similarities between any documents, including unlabeled ones. A large wij implies that the two documents tend to express the same sentiment (i.e., rating). We experiment with positive-sentence percentage (PSP) based similarity which is proposed in (Pang and Lee, 2005), and mutual-information modulated word-vector cosine similarity. Details can be found in section 4.</Paragraph>
    <Paragraph position="6"> 2. Optionally, we are given numerical rating predictions ^yl+1,..., ^yn on the unlabeled documents from a separate learner, for instance epsilon1-insensitive support vector regression (Joachims, 1999; Smola and Sch&amp;quot;olkopf, 2004) used by (Pang and Lee, 2005). This acts as an extra knowledge source for our semi-supervised learning framework to improve upon. We note our framework is general and works without the separate learner, too. (For this to work in practice, a reliable similarity measure is required.) We now describe our graph for the semi-supervised rating-inference problem. We do this piece by piece with reference to Figure 1. Our undirected graph G = (V,E) has 2n nodes V , and weighted edges E among some of the nodes.</Paragraph>
    <Paragraph position="7"> * Each document is a node in the graph (open circles, e.g., xi and xj). The true ratings of these nodes f(x) are unobserved. This is true even for the labeled documents because we allow for noisy labels. Our goal is to infer f(x) for the unlabeled documents.</Paragraph>
    <Paragraph position="8"> * Each labeled document (e.g., xj) is connected to an observed node (dark circle) whose value is the given rating yj. The observed node is a 'dongle' (Zhu et al., 2003) since it connects only to xj. As we point out later, this serves to pull f(xj) towards yj. The edge weight between a labeled document and its dongle is a large number M. M represents the influence of yj: if M - [?] then f(xj) = yj becomes a hard constraint.</Paragraph>
    <Paragraph position="9"> * Similarly each unlabeled document (e.g., xi) is also connected to an observed dongle node ^yi, whose value is the prediction of the separate learner. Therefore we also require that f(xi) is close to ^yi. This is a way to incorporate multiple learners in general. We set the weight between an unlabeled node and its dongle arbitrarily to 1 (the weights are scale-invariant otherwise). As noted earlier, the separate learner is optional: we can remove it and still carry out graph-based semi-supervised learning.</Paragraph>
    <Paragraph position="10">  * Each unlabeled document xi is connected to kNNL(i), its k nearest labeled documents.</Paragraph>
    <Paragraph position="11"> Distance is measured by the given similarity measure w. We want f(xi) to be consistent with its similar labeled documents. The weight between xi and xj [?] kNNL(i) is a*wij.</Paragraph>
    <Paragraph position="12"> * Each unlabeled document is also connected to kprimeNNU(i), its kprime nearest unlabeled documents (excluding itself). The weight between xi and xj [?] kprimeNNU(i) is b * wij. We also want f(xi) to be consistent with its similar unlabeled neighbors. We allow potentially different numbers of neighbors (k and kprime), and different weight coefficients (a and b). These parameters  are set by cross validation in experiments.</Paragraph>
    <Paragraph position="13"> The last two kinds of edges are the key to semi-supervised learning: They connect unobserved nodes and force ratings to be smooth throughout the graph, as we discuss in the next section.</Paragraph>
  </Section>
class="xml-element"></Paper>