<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1023">
  <Title>Predicting the Semantic Orientation of Adjectives</Title>
  <Section position="8" start_page="177" end_page="178" type="evalu">
    <SectionTitle>
8 Results and Evaluation
</SectionTitle>
    <Paragraph position="0"> Since graph connectivity affects performance, we devised a method of selecting test sets that makes this dependence explicit. Note that the graph density is largely a function of corpus size, and thus can be increased by adding more data. Nevertheless, we report results on sparser test sets to show how our algorithm scales up.</Paragraph>
    <Paragraph position="1"> We separated our sets of adjectives A (containing 1,336 adjectives) and conjunction- and morphology-based links L (containing 2,838 links) into training and testing groups by selecting, for several values of the parameter a, the maximal subset of A, An, which includes an adjective z if and only if there exist at least a links from L between x and other elements of An. This operation in turn defines a subset of L, L~, which includes all links between members of An. We train our log-linear model on L - La (excluding links between morphologically related adjectives), compute predictions and dissimilarities for the links in L~, and use these to classify and label the adjectives in An. c~ must be at least 2, since we need to leave some links for training.</Paragraph>
    <Paragraph position="2"> Table 3 shows the results of these experiments for a = 2 to 5. Our method produced the correct classification between 78% of the time on the sparsest test set up to more than 92% of the time when a higher number of links was present. Moreover, in all cases, the ratio of the two group frequencies correctly identified the positive subgroup. These results are extremely significant statistically (P-value less than 10 -16 ) when compared with the baseline method of randomly assigning orientations to adjectives, or the baseline method of always predicting the most frequent (for types) category (50.82% of the adjectives in our collection are classified as negative). Figure 2 shows some of the adjectives in set A4 and their classifications. null  Classified as positive: bold decisive disturbing generous good honest important large mature patient peaceful positive proud sound stimulating straightforward strange talented vigorous witty Classified as negative: ambiguous cautious cynical evasive harmful hypocritical inefficient insecure irrational irresponsible minor outspoken pleasant reckless risky selfish tedious unsupported vulnerable wasteful  tives from set A4. Correctly matched adjectives are shown in bold.</Paragraph>
    <Section position="1" start_page="178" end_page="178" type="sub_section">
      <SectionTitle>
9 Graph Connectivity and
Performance
</SectionTitle>
      <Paragraph position="0"> A strong point of our method is that decisions on individual words are aggregated to provide decisions on how to group words into a class and whether to label the class as positive or negative. Thus, the overall result can be much more accurate than the individual indicators. To verify this, we ran a series of simulation experiments. Each experiment measures how our algorithm performs for a given level of precision P for identifying links and a given average number of links k for each word. The goal is to show that even when P is low, given enough data (i.e., high k), we can achieve high performance for the grouping.</Paragraph>
      <Paragraph position="1"> As we noted earlier, the corpus data is eventually represented in our system as a graph, with the nodes corresponding to adjectives and the links to predictions about whether the two connected adjectives have the same or different orientation. Thus the parameter P in the simulation experiments measures how well we are able to predict each link independently of the others, and the parameter k measures the number of distinct adjectives each adjective appears with in conjunctions. P therefore directly represents the precision of the link classification algorithm, while k indirectly represents the corpus size. To measure the effect of P and k (which are reflected in the graph topology), we need to carry out a series of experiments where we systematically vary their values. For example, as k (or the amount of data) increases for a given level of precision P for individual links, we want to measure how this affects overall accuracy of the resulting groups of nodes.</Paragraph>
      <Paragraph position="2"> Thus, we need to construct a series of data sets, or graphs, which represent different scenarios corresponding to a given combination of values of P and k. To do this, we construct a random graph by randomly assigning 50 nodes to the two possible orientations. Because we don't have frequency and morphology information on these abstract nodes, we cannot predict whether two nodes are of the same or different orientation. Rather, we randomly assign links between nodes so that, on average, each node participates in k links and 100 x P% of all links connect nodes of the same orientation. Then we consider these links as identified by the link prediction algorithm as connecting two nodes with the same orientation (so that 100 x P% of these predictions will be correct). This is equivalent to the baseline link classification method, and provides a lower bound on the performance of the algorithm actually used in our system (Section 5).</Paragraph>
      <Paragraph position="3"> Because of the lack of actual measurements such as frequency on these abstract nodes, we also decouple the partitioning and labeling components of our system and score the partition found under the best matching conditions for the actual labels. Thus the simulation measures only how well the system separates positive from negative adjectives, not how well it determines which is which. However, in all the experiments performed on real corpus data (Section 8), the system correctly found the labels of the groups; any misclassifications came from misplacing an adjective in the wrong group. The whole procedure of constructing the random graph and finding and scoring the groups is repeated 200 times for any given combination of P and k, and the results are averaged, thus avoiding accidentally evaluating our system on a graph that is not truly representative of graphs with the given P and k.</Paragraph>
      <Paragraph position="4"> We observe (Figure 3) that even for relatively low t9, our ability to correctly classify the nodes approaches very high levels with a modest number of links. For P = 0.8, we need only about ? links per adjective for classification performance over 90% and only 12 links per adjective for performance over 99%. s The difference between low and high values of P is in the rate at which increasing data increases overall precision. These results are somewhat more optimistic than those obtained with real data (Section 8), a difference which is probably due to the uniform distributional assumptions in the simulation.</Paragraph>
      <Paragraph position="5"> Nevertheless, we expect the trends to be similar to the ones shown in Figure 3 and the results of Table 3 on real data support this expectation.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML