A Critique and Improvement of an Evaluation Metric for Text Segmentation

3. A Solution

It turns out that a simple change to the error metric algorithm remedies most of the problems described above, while retaining the desirable characteristic of penalizing near misses less than pure false positives and pure false negatives. The amended metric, which we call WindowDiff, works as follows: for each position of the probe, simply compare the number of reference segmentation boundaries that fall in this interval (r_i) with the number of boundaries assigned by the algorithm in the same interval (a_i), and penalize the algorithm whenever the two counts differ. More formally,

    WindowDiff(ref, hyp) = (1 / (N - k)) * sum_{i=1}^{N-k} ( |b(ref_i, ref_{i+k}) - b(hyp_i, hyp_{i+k})| > 0 )

where b(i, j) represents the number of boundaries between positions i and j in the text and N represents the number of sentences in the text.

This approach clearly eliminates the asymmetry between the false positive and false negative penalties seen in the P_k metric. It also catches false positives and false negatives within segments of length less than k.

To understand the behavior of WindowDiff with respect to the other problems, consider again the examples in Figure 5. This metric penalizes Algorithm A-4 (which contains both a false positive and a false negative) the most, assigning it a penalty of about 2k. Algorithms A-0, A-1, and A-2 receive the same penalty (about k), and Algorithm A-3 receives the smallest penalty (2e, where e is the offset from the actual boundary, presumed to be much smaller than k). Thus, although it makes the mistake of penalizing Algorithm A-1 as much as Algorithms A-0 and A-2, it correctly recognizes that the error made by Algorithm A-3 is a near miss and assigns it a smaller penalty than Algorithm A-1 or any of the others. We argue that this kind of error is less detrimental than the errors made by P_k. WindowDiff successfully distinguishes the near-miss error as a separate kind of error and penalizes it a different amount, something that P_k is unable to do.

We explored a weighted version of WindowDiff, in which the penalty is weighted by the difference |r_i - a_i|. However, the results of the simulations were nearly identical with those of the nonweighted version of this metric, so we do not consider the weighted version further.
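To make the computation concrete, the following is a minimal Python sketch of WindowDiff. It assumes a segmentation is represented as a 0/1 list of boundary indicators (1 meaning a boundary follows that position) and that k defaults to half the average reference segment size, as with P_k; the function name and representation are illustrative, not taken from the paper.

```python
def window_diff(reference, hypothesis, k=None):
    """WindowDiff sketched over 0/1 boundary-indicator lists of equal length N."""
    assert len(reference) == len(hypothesis)
    n = len(reference)
    if k is None:
        # Conventionally, k is half the average reference segment size.
        num_segments = sum(reference) + 1
        k = max(1, round(n / (2 * num_segments)))
    disagreements = 0
    # Slide a probe of width k across the text and compare boundary counts.
    for i in range(n - k):
        r = sum(reference[i:i + k])   # boundaries between positions i and i+k
        a = sum(hypothesis[i:i + k])
        if r != a:                    # penalize whenever the counts differ
            disagreements += 1
    return disagreements / (n - k)
```

The weighted variant mentioned above would simply accumulate abs(r - a) at each probe position instead of a 0/1 penalty.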
4. Validation via Simulations

This section describes a set of simulations that verify the theoretical analysis of the P_k metric presented above. It also reports the results of simulating two alternatives, including the proposed solution just described.

For the simulation runs described below, three metrics were implemented: the original P_k metric; the P_k metric modified to double the false positive penalty (henceforth P'_k); and our proposed alternative, WindowDiff (henceforth WD), which counts the number of segment boundaries between the two ends of the probe and assigns a penalty if this number is different for the experimental and reference segmentations.

In these studies, a single trial consists of generating a reference segmentation of 1,000 segments with some distribution, generating different experimental segmentations of a specific type 100 times, computing the metric based on the comparison of the reference and experimental segmentations, and averaging the 100 results. For example, we might generate a reference segmentation R, then generate 100 experimental segmentations that have false negatives with probability 0.5, and then compute the average of their P_k penalties. We carried out 10 such trials for each experiment and averaged the average penalties over these trials.
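The trial procedure just described can be sketched as follows. The helper names, the uniform segment-size range, and the restriction to false-negative errors are illustrative assumptions; the metric argument can be any of the three implementations (for example, the window_diff sketch above).

```python
import random

def make_reference(num_segments=1000, size_range=(15, 35)):
    """Reference segmentation as a 0/1 boundary list (1 = boundary after that position)."""
    boundaries = []
    for _ in range(num_segments):
        size = random.randint(*size_range)
        boundaries.extend([0] * (size - 1) + [1])
    boundaries[-1] = 0            # no boundary after the final segment
    return boundaries

def add_false_negatives(reference, p=0.5):
    """Drop each true boundary independently with probability p."""
    return [0 if (b == 1 and random.random() < p) else b for b in reference]

def run_trial(metric, num_runs=100):
    """One trial: one reference, num_runs experimental segmentations, averaged."""
    reference = make_reference()
    scores = [metric(reference, add_false_negatives(reference)) for _ in range(num_runs)]
    return sum(scores) / num_runs

# An experiment averages 10 such trials, e.g.:
#   sum(run_trial(window_diff) for _ in range(10)) / 10
```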
4.1 Variation in the Segment Sizes

The first set of tests was designed to test the metric's performance on texts with different segment size distributions (Problem 3). We generated four sets of reference segmentations with segment size uniformly distributed between two numbers. Note that the units of segmentation are deliberately left unspecified. So a segment of size 25 can refer to 25 words, clauses, or sentences--whichever is applicable to the task under consideration. Also note that the same tests were run using larger segment sizes than those reported here, with the results remaining nearly identical.

For these tests, the mean segment size was held constant at 25 for each set of reference segments, in order to produce distributions of segment size with the same means but different variances. The four ranges of segment sizes were (20, 30), (15, 35), (10, 40), and (5, 45). The results of these tests are shown in Table 1. The tests used the following types of experimental segmentations: FN: segmentation with false negative probability 0.5 at each boundary; FP: segmentation with false positive probability 0.5 in each segment, with the probability uniformly distributed within each segment; and FNP: segmentation with false positive probability 0.5 (uniformly distributed), and false negative probability 0.5.

The results indicate that variation in segment size does make a difference, but not a very big one. (As we will show, the differences are similar when we use a smaller probability of false negative/positive occurrence.) The P_k value for the (20, 30) range with FN segmentation is on average 0.245, and it decreases to 0.223 for the (5, 45) range. Similarly, the FP segmentation decreases from 0.128 for the (20, 30) range to 0.107 for the (5, 45) range, and the FNP segmentation decreases from 0.317 for the (20, 30) range to 0.268 for the (5, 45) range. Thus, variation in segment size has an effect on P_k, as predicted.

Note that for false negatives, the P_k value for the (20, 30) range is not much different than for the (15, 35) range. This is expected, since there are no segments of size less than k (12.5) in these conditions. For the (10, 40) range, the P_k value is slightly smaller; and for the (5, 45) range, it is smaller still. These results are to be expected, since more segments in these ranges will be of length less than k. For the FP segmentations, on the other hand, the decrease in P_k value is more pronounced, falling from 0.128 to 0.107 as the segment size range changes from (20, 30) to (5, 45). This is also consistent with our earlier analysis of the behavior of the metric on false positives as segment size decreases. Notice that the difference in P_k values between (15, 35) and (10, 40) is slightly larger than the other two differences. This happens because for segment sizes < k, the false positive penalty disappears completely. The results for the FNP segmentation are consistent with what one would expect of a mix of the FN and FP segmentations.

Several other observations can be made from Table 1. We can begin to make some judgments about how the metric performs on algorithms prone to different kinds of errors. First, P_k penalizes false negatives about twice as much as false positives, as predicted by our analysis. The experimental segmentations in Table 1a contain on average 500 false negatives, while the ones in Table 1b contain on average 500 false positives, but the penalty for the Table 1b segmentations is consistently about half that for those in Table 1a. Thus, algorithms prone to false positives are penalized less harshly than those prone to false negatives.

The table also shows the performance of the two other metrics. P'_k simply doubles the false positive penalty, while WD counts and compares the number of boundaries between the two ends of the probe, as described earlier. Both P'_k and WD appear to solve the problem of underpenalizing false positives, but WD has the added benefit of being more stable across variations in segment size distribution. Thus, WD essentially solves Problems 1, 2, and 3.
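For concreteness, the doubled-false-positive variant might be computed as sketched below. This assumes the standard probe-based P_k formulation (penalize a window of width k whenever the reference and the hypothesis disagree about whether its two ends fall in the same segment), which is described earlier in the paper; the function names and the 0/1 boundary-list representation are illustrative.

```python
def p_k(reference, hypothesis, k, fp_weight=1.0):
    """P_k over 0/1 boundary lists; fp_weight=2.0 gives the P'_k variant."""
    n = len(reference)
    penalty = 0.0
    for i in range(n - k):
        ref_same = sum(reference[i:i + k]) == 0   # probe ends in same reference segment?
        hyp_same = sum(hypothesis[i:i + k]) == 0
        if ref_same and not hyp_same:
            penalty += fp_weight                  # false positive disagreement
        elif hyp_same and not ref_same:
            penalty += 1.0                        # false negative disagreement
    return penalty / (n - k)

def p_k_prime(reference, hypothesis, k):
    return p_k(reference, hypothesis, k, fp_weight=2.0)
```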
Table 1c shows that for the FNP segmentation (in which both false positives and false negatives occur), there is a disparity between the performances of P'_k and WD: P'_k is harsher in this situation. From the above discussion, we know that WD is more lenient in situations where a false negative and a false positive occur near each other (where "near" means within a distance of k), while P'_k is more lenient for pure false positives that occur close to boundaries. Thus, it is not immediately clear why P'_k is harsher in this situation, but a more detailed look provides the answer.

Let us begin the analysis by trying to explain why the P_k scores for the FNP segmentation make sense. The FNP segmentation places both false negatives and false positives with probability 0.5. Since we are working with reference segmentations of 1,000 segments, this means 500 missed boundaries and 500 incorrect boundaries. Since the probabilities are uniformly distributed across all segments and all boundaries, on average about half of the false positives will fall within k of a missed boundary and thus combine with a false negative into a Type C error, while the remainder stay pure false positives and pure false negatives. Let us figure out what the average penalty is for a Type C error. Modeling the behavior of the metric, a Type C error occurrence in which a false positive and a false negative are some distance e < k from each other incurs a penalty of 2e, where e is assigned for the false positive and another e is assigned for the false negative. This may range from 0 to 2k, and since the error distribution is uniform, the penalty is k on average--the same as for a regular false negative. To translate this into actual values, we assume the metric is linear with respect to the number of errors (a reasonable assumption, supported by our experiments). Thus, if P_k outputs a penalty of p for 500 false negatives, it would output a penalty of p/2 for 250 false negatives.

Let a be the penalty for 500 Type A errors, b the penalty for 500 Type B errors, and c the penalty for 500 Type C errors; then the penalty for the FNP segmentation is approximately p = (a + b + c)/2, since on average half of the errors of each kind are absorbed into Type C pairs. As shown above, a Type C error incurs the same penalty as a false negative on average (and recall that P_k penalized false negatives twice as much as false positives on average). We can thus substitute either b or 2a for c. We choose to substitute 2a, because P_k is strongly affected by segment size variation for Type A and Type C errors, but not for Type B errors. Thus, replacing c with 2a is more accurate. Performing the substitution, we have p = (3a + b)/2. We have a and b from the FP and FN data, respectively, so we can compute p. The resulting estimates, arranged by segment size variation, are close to the actual P_k values for the FNP segmentations. The corresponding estimates for WD do not match the actual values quite as well as the P_k estimates did, but they are still very close. One reason why these estimates are a little less accurate is that for WD, Type C errors are more affected by variation in segment size than either Type A or Type B errors. This is clear from the fact that the decrease is greater in the actual data than in the estimate.

Table 2 shows data similar to those of Table 1, but using two different probability values for error occurrence: 0.05 and 0.25. These results have the same tendencies as those shown above for p = 0.5.
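As a quick sanity check on the estimate p ≈ (3a + b)/2 derived above, one can plug in the P_k values quoted earlier for the two extreme ranges; the figures below are only those already reported in the text.

```python
# (a, b, actual) = (FP penalty, FN penalty, observed FNP penalty) for P_k
quoted = {"(20, 30)": (0.128, 0.245, 0.317),
          "(5, 45)": (0.107, 0.223, 0.268)}
for size_range, (a, b, actual) in quoted.items():
    estimate = (3 * a + b) / 2        # roughly 0.314 and 0.272
    print(f"{size_range}: estimate {estimate:.3f} vs. actual {actual:.3f}")
```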
4.2 Variation in the Error Distributions

The second set of tests was designed to assess the performance of the metrics on algorithms prone to different kinds of errors. This would determine whether the metrics are consistent in applying penalties, or whether they favor certain kinds of errors over others. For these trials, we generated the reference segmentation using a uniform distribution of segment sizes in the (15, 35) range. We picked this range because it has reasonably high segment size variation, but segment size does not dip below k.

Table 2 caption: Average metric values for sets of 100 measurements each, shown by segment size distribution range. (a) False negatives were placed with probability 0.05 at each boundary; (b) false positives were placed with probability 0.05, uniformly distributed within each segment; (c) both false negatives and false positives were placed with probability 0.05; (d) false negatives were placed with probability 0.25 at each boundary; (e) false positives were placed with probability 0.25, uniformly distributed within each segment; and (f) both false negatives and false positives were placed with probability 0.25.

Table 3 caption: Average penalties for experimental segmentations built from reference segmentations with segment sizes uniformly distributed in the range (15, 35) and with error probabilities of 0.5. The average penalties computed by the three metrics are shown for seven different error distributions.

The tests analyzed below were performed using the high error occurrence probability of 0.5, but similar results were obtained using probabilities of 0.25 and 0.05 as well. The following error distributions were used: FN: false negatives, probability p = 0.5; FP1: false positives uniformly distributed in each segment, probability p = 0.5; FP2: false positives normally distributed around each boundary with standard deviation equal to the segment size, probability p = 0.5; FP3: false positives uniformly distributed throughout the document, occurring at each point with probability p = (number of segments) / (2 x document length) (this corresponds to a 0.5 probability value for each individual segment); FNP1: FN and FP1 combined; FNP2: FN and FP2 combined; FNP3: FN and FP3 combined.
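The three false positive placement schemes just listed can be sketched as follows; the reference is again a 0/1 boundary list, and details the text leaves open (such as clipping the normal draw to the document and skipping size-one segments) are assumptions.

```python
import random

def add_fp1(reference, p=0.5):
    """FP1: with probability p, place one false positive uniformly inside each segment."""
    hyp, start = list(reference), 0
    for end, b in enumerate(reference):
        if b == 1 or end == len(reference) - 1:       # segment spans [start, end]
            if random.random() < p and end > start:
                hyp[random.randrange(start, end)] = 1
            start = end + 1
    return hyp

def add_fp2(reference, p=0.5, sigma=25):
    """FP2: with probability p per boundary, place a false positive at a normally
    distributed offset around that boundary (sigma ~ average segment size)."""
    hyp = list(reference)
    for i, b in enumerate(reference):
        if b == 1 and random.random() < p:
            j = min(max(int(round(random.gauss(i, sigma))), 0), len(reference) - 1)
            hyp[j] = 1
    return hyp

def add_fp3(reference, num_segments=1000):
    """FP3: independent false positives at every position, at a rate chosen so the
    expected count is 0.5 per segment."""
    p_point = num_segments / (2 * len(reference))
    return [1 if (b == 1 or random.random() < p_point) else 0 for b in reference]
```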
The results are shown in Table 3. P_k penalizes FP2 less than FP1 and FP3, and FNP2 less than FNP1 and FNP3. This result is as expected. FP2 and FNP2 have false positives normally distributed around each boundary, which means that more of the false positives are close to the boundaries and thus are penalized less. If we made the standard deviation smaller, we would expect this difference to be even more apparent.

P'_k penalized FP2 and FNP2 the least in their respective categories, and FP1 and FNP1 the most, with FP3 and FNP3 falling in between. These results are as expected, for the same reasons as for P_k. The difference in the penalty for FP1 and FP3 (and FNP1 vs. FNP3)--for both P'_k and WD--is interesting. In FP/FNP1, false positive probability is uniformly distributed throughout each segment, whereas in FP/FNP3, false positive probability is uniformly distributed throughout the entire document. Thus, the FP/FNP3 segmentations are more likely to have boundaries that are very close to each other, since they are not segment dependent, while the FP/FNP1 segmentations can contain at most one false positive per segment. We would therefore expect smaller penalties for FP/FNP3, since groups of false positives close together (to be more exact, within k sentences of each other) would be underpenalized. This difference is also present in the P_k results, but is about half as large, for obvious reasons.

WD penalized FP1 the most and FP3 the least among the FP segmentations. Among the FNP segmentations, FNP1 was penalized the most and FNP2 the least. To see why, we examine the results for the FP segmentations. WD penalizes pure false positives the same amount regardless of how close they are to a boundary; the only way false positives are underpenalized is if they occur in bunches. As mentioned earlier, this is most likely to happen in FP3. It is least likely to happen in FP1, since in FP1 there is a maximum of one false positive per segment, and this false positive is not necessarily close to a boundary. In FP2, false positives are also limited to one per segment, but they are more likely to be close to boundaries. This increases the likelihood that two false positives will be within k sentences of each other and thus makes WD give a slightly lower score to the FP2 segmentation than to the FP1 segmentation.

Now let us look at the FNP segmentations. FNP3 is penalized less than FNP1 for the same reason described above, and FNP2 is penalized even less than FNP3. The closer a Type C error's false positive is to the missed boundary, the lower the penalty. FNP2 has more errors distributed near the boundaries than the others; thus, the FNP2 segmentation is penalized less than either FNP1 or FNP3.

The same tests were run for different error occurrence probabilities (p = 0.05 and p = 0.25), achieving results similar to those for p = 0.5 just described. There is a slight difference for the case of p = 0.05, because the error probability is too small for some of the trends to manifest themselves. In particular, the differences in the way WD treats the different segmentations disappear when the error probability is this small.

4.3 Variation in the Error Types

We also performed a small set of tests to verify the theoretical finding that P_k and P'_k overpenalize near-miss errors as compared with pure false positives, and that WD does the opposite, overpenalizing the pure false positives. Space limitations prevent detailed reporting of these results, but the simulations did indeed verify these expectations.