<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0625">
  <Title>Detecting Text Similarity over Short Passages: Exploring Linguistic Feature Combinations via Machine Learning</Title>
  <Section position="6" start_page="206" end_page="208" type="evalu">
    <SectionTitle>
5 Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="206" end_page="206" type="sub_section">
      <SectionTitle>
5.1 The Evaluation Corpus
</SectionTitle>
      <Paragraph position="0"> For evaluation, we use a set of articles already classified into topical subsets which we obtained from the Reuters part of the 1997 pilot Topic Detection and Tracking (TDT) corpus. The TDT corpus, developed by NIST and DARPA, is a collection of 16,000 news articles from Reuters and CNN where many of the articles and transcripts have been manually grouped into 25 categories each of which corresponds to a single event (see http://morph.ldc.</Paragraph>
      <Paragraph position="1"> uperm, edu/Cat alog/LDC98T25, html). Using the Reuters part of the corpus, we selected five of the larger categories and extracted all articles assigned to them from severM randomly chosen days, for a total of 30 articles.</Paragraph>
      <Paragraph position="2"> Since paragraphs in news stories tend to be short--typically one or two sentences--in this study we use paragraphs as our small text units, although sentences would also be a possibility.</Paragraph>
      <Paragraph position="3"> In total, we have 264 text units and 10,345 comparisons between units. As comparisons are made between all pairs of paragraphs from the same topic, the total number of comparisons is equal to</Paragraph>
      <Paragraph position="5"> where Ni is the number of paragraphs in all selected articles from topical category i.</Paragraph>
      <Paragraph position="6"> Training of our machine learning component was done by three-fold cross-validation, randomly splitting the 10,345 pairs of paragraphs into three (almost) equally-sized subsets. In each of the three runs, two of these subsets were used for training and one for testing.</Paragraph>
      <Paragraph position="7"> To create a reference standard, the entire collection of 10,345 paragraph pairs was marked for similarity by two reviewers who were given our definition and detailed instructions. Each re-viewer independently marked each pair of paragraphs as similar or not similar. Subsequently, the two reviewers jointly examined eases where there was disagreement, discussed reasons, and reconciled the differences.</Paragraph>
    </Section>
    <Section position="2" start_page="206" end_page="207" type="sub_section">
      <SectionTitle>
5.2 Experimental Validation of the
Similarity Definition
</SectionTitle>
      <Paragraph position="0"> In order to independently validate our definition of similarity, we performed two additional experiments. In the first, we asked three additional judges to determine similarity for a random sample of 40 paragraph pairs. High agreement between judges would indicate that our definition of similarity reflects an objective reality and can be mapped unambiguously to an operational procedure for marking text units as similar or not. At the same time, it would also validate the judgments between text units that we use for our experiments (see Section 5.1).</Paragraph>
      <Paragraph position="1"> In this task, judges were given the opportunity to provide reasons for claiming similarity or dissimilarity, and comments on the task were logged for future analysis. The three additional  judges agreed with the manually marked and standardized corpus on 97.6% of the comparisons. null Unfortunately, approximately 97% (depending on the specific experiment) of the comparisons in both our model and the subsequent validation experiment receive the value &amp;quot;not similar&amp;quot;. This large percentage is due to our fine-grained notion of similarity, and is parallel to what happens in randomly sampled IR collections, since in that case most documents will not be relevant to any given query. Nevertheless, we can account for the high probability of inter-reviewer agreement expected by chance, 0.97.0.97+ (1-0.97)-(1-0.97) -- 0.9418, by referring to the kappa statistic \[Cohen 1960; Carletta 1996\]. The kappa statistic is defined as</Paragraph>
      <Paragraph position="3"> where PA is the probability that two reviewers agree in practice, and P0 is the probability that they would agree solely by chance. In our case,</Paragraph>
      <Paragraph position="5"> indicating that the observed agreement by the reviewers is indeed significant. 2 If P0 is estimated from the particular sample used in this experiment rather than from our entire corpus, it would be only 0.9, producing a value of 0.76 for K.</Paragraph>
      <Paragraph position="6"> In addition to this validation experiment that used randomly sampled pairs of paragraphs (and reflected the disproportionate rate of occurrence of dissimilar pairs), we performed a balanced experiment by randomly selecting 50 of the dissimilar pairs and 50 of the similar pairs, in a manner that guaranteed generation of an independent sample. 3 Pairs in this sub-set were rated for similarity by two additional independent reviewers, who agreed on their decisions 91% of the time, versus 50% expected by chance; in this case, K --- 0.82. Thus, we feel confident in the reliability of our annotation  were randomly selected for inclusion in the sample but a pair (A, B) was immediately rejected if there were paragraphs X1,...,X,~ for n &gt; 0 such that all pairs (A, X1), (X1, X2), * * *, (Xn, B) had already been included in the sample.</Paragraph>
      <Paragraph position="7"> process, and can use the annotated corpus to assess the performance of our similarity measure and compare it to measures proposed earlier in the information retrieval literature.</Paragraph>
    </Section>
    <Section position="3" start_page="207" end_page="207" type="sub_section">
      <SectionTitle>
5.3 Performance Comparisons
</SectionTitle>
      <Paragraph position="0"> We compare the performance of our system to three other methods. First, we use standard TF*IDF, a method that with various alterations, remains at the core of many information retrieval and text matching systems \[Salton and Buckley 1988; Salton 1989\]. We compute the total frequency (TF) of words in each text unit.</Paragraph>
      <Paragraph position="1"> We also compute the number of units each word appears in in our training set (DF, or document frequency). Then each text unit is represented as a vector of TF*IDF scores calculated as Total number of units TF (word/) * log DF(wordi) Similarity between text units is measured by the cosine of the angle between the corresponding two vectors (i.e., the normalized inner product of the two vectors). A further cutoff point is selected to convert similarities to hard decisions of &amp;quot;similar&amp;quot; or &amp;quot;not similar&amp;quot;; different cutoffs result in different tradeoffs between recall and precision.</Paragraph>
      <Paragraph position="2"> Second, we compare our method against a standard, widely available information retrieval system developed at Cornell University, SMART \[Buckley 1985\]. 4 SMART utilizes a modified TF*IDF measure (ATC) plus stemming and a fairly sizable stopword list.</Paragraph>
      <Paragraph position="3"> Third, we use as a baseline method the default selection of the most frequent category, i.e., &amp;quot;not similar&amp;quot;. While this last method cannot be effectively used to identify similar paragraphs, it offers a baseline for the overall accuracy of any more sophisticated technique for this task.</Paragraph>
    </Section>
    <Section position="4" start_page="207" end_page="208" type="sub_section">
      <SectionTitle>
5.4 Experimental Results
</SectionTitle>
      <Paragraph position="0"> Our system was able to recover 36.6% of the similar paragraphs with 60.5% precision, as shown in Table 1. In comparison, the unmodified TF*IDF approach obtained only 32.6% precision when recall is 39.1%, i.e., close to our system's recall; and only 20.8% recall at precision of 62.2%, comparable to our classifier's aWe used version 11.0 of SMART, released in July 1992.</Paragraph>
      <Paragraph position="1">  ilarity metrics. For comparison purposes, we list the average recall, precision, and accuracy obtained by TF*IDF and SMART at the two points in the precision-recall curve identified for each method in the text (i.e., the point where the method's precision is most similar to ours, and the point where its recall is most similar to ours).</Paragraph>
      <Paragraph position="2"> precision. SMART (in its default configuration) offered only a small improvement over the base TF*IDF implementation, and significantly underperformed our method, obtaining 34.1% precision at recall of 36.7%, and 21.5% recall at 62.4% precision. The default method of always marking a pair as dissimilar obtains of course 0% recall and undefined precision. Figure 5 illustrates the difference between our systern and straight TF*IDF at different points of the precision-recall spectrum.</Paragraph>
      <Paragraph position="3"> When overall accuracy (total percentage of correct answers over both categories of similar and non-similar pairs) is considered, the numbers are much closer together: 98.8% for our approach; 96.6% and 97.8% for TF*IDF on the two P-R points mentioned for that method above; 96.5% and 97.6% for SMART, again at the two P-R points mentioned for SMART earlier; and 97.5% for the default baseline. 5 Nevertheless, since the challenge of identifying sparsely occurring similar small text units is our goal, the accuracy measure and the base-line technique of classifying everything as not similar are included only for reference but do 5Statistical tests of significance cannot be performed for comparing these values, since paragraphs appear in multiple comparisons and consequently the comparisons are not independent.</Paragraph>
      <Paragraph position="4">  method using RIPPER (solid line with squares) versus TF*IDF (dotted line with triangles).</Paragraph>
      <Paragraph position="5"> not reflect our task.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="208" end_page="209" type="evalu">
    <SectionTitle>
6 Analysis and Discussion of Feature Performance
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="208" end_page="209" type="sub_section">
      <Paragraph position="0"> We computed statistics on how much each feature helps in identifying similarity, summarized in Table 2. Primitive features are named according to the type of the feature (e.g., Verb for the feature that counts the number of matching verbs according to exact matches). Composite feature names indicate the restrictions applied to primitives. For example, the composite feature Distance &lt; ~ restricts a pair of matching primitives to occur within a relative distance of four words. If the composite feature also restricts the types of the primitives in the pair, the name of the restricting primitive feature is added to the composite feature name. For example the feature named Verb Distance &lt; 5 requires one member of the pair to be a verb and the relative distance between the primitives to be at most five.</Paragraph>
      <Paragraph position="1"> The second column in Table 2 shows whether the feature value has been normalized according to its overall rarity 6, while the third column indicates the actual threshold used in decisions assuming that only this feature is used for classification. The fourth column shows the applicability of that feature, that is, the percentage of  multiple times for the same feature and normalization option, highlighting the effect of different decision thresholds.</Paragraph>
      <Paragraph position="2"> paragraph pairs for which this feature would apply (i.e., have a value over the specified threshold). Finally, the fifth and sixth columns show the recall and precision on identifying similar paragraphs for each independent feature. Note that some features have low applicability over the entire corpus, but target the hard-to-find similar pairs, resulting in significant gains in recall and precision.</Paragraph>
      <Paragraph position="3"> Table 2 presents a selected subset of primitive and composite features in order to demonstrate our results. For example, it was not surprising to observe that the most effective primitive features in determining similarity are Any word, Simplex NPi and Noun while other primitives such as Verb were not as effective independently.</Paragraph>
      <Paragraph position="4"> This is to be expected since nouns name objects, entities, and concepts, and frequently exhibit more sense constancy. In contrast, verbs are functions and tend to shift senses in a more fluid fashion depending on context. Furthermore, our technique does not label phrasal verbs (e.g. look up, look out, look over, look for, etc.), which are a major source of verbal ambiguity in English.</Paragraph>
      <Paragraph position="5"> Whereas primitive features viewed independently might not have a directly visible effect on identifying similarity, when used in composite features they lead to some novel results. The most pronounced case of this is for Verb, which, in the composite feature Verb Distance _&lt; 5, can help identify similarity effectively, as seen in Table 2. This composite feature approximates verb-argument and verb-collocation relations, which are strong indicators of similarity. At the same time, the more restrictive a feature is, the fewer occurrences of that feature appear in the training set. This suggests that we could consider adding additional features suggested by current results in order to further refine and improve our similarity identification algorithm.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>