<?xml version="1.0" standalone="yes"?>
<Paper uid="P94-1010">
  <Title>REFERENCES</Title>
  <Section position="8" start_page="69" end_page="71" type="evalu">
    <SectionTitle>
EVALUATION
</SectionTitle>
    <Paragraph position="0"> In this section we present a partial evaluation of the current system in three parts. The first is an evaluation of the system's ability to mimic humans at the task of segmenting text into word-sized units; the second evaluates the proper name identification; the third measures the performance on morphological analysis. To date we have not done a separate evaluation of foreign name recognition.</Paragraph>
    <Paragraph position="1"> Evaluation of the Segmentation as a Whole: Previous reports on Chinese segmentation have invariably 4The current model is too simplistic in several respects. For instance, the common 'suffixes', -nia (e.g., Virginia) and -sia are normally transliterated as ~=~ ni2-ya3 and ~\]~ ~n~ xil-ya3, respectively. The interdependence between \]:~ or ~, and ~r~ is not captured by our model, but this could easily be remedied.</Paragraph>
    <Paragraph position="2">  cited performance either in terms of a single percentcorrect score, or else a single precision-recall pair. The problem with these styles of evaluation is that, as we shall demonstrate, even human judges do not agree perfectly on how to segment a given text. Thus, rather than give a single evaluative score, we prefer to compare the performance of our method with the judgments of several human subjects. To this end, we picked 100 sentences at random containing 4372 total hanzi from a test corpus. We asked six native speakers -- three from Taiwan (T1-T3), and three from the Mainland (M1-M3) -- to segment the corpus. Since we could not bias the subjects towards a particular segmentation and did not presume linguistic sophistication on their part, the instructions were simple: subjects were to mark all places they might plausibly pause if they were reading the text aloud. An examination of the subjects' bracketings confirmed that these instructions were satisfactory in yielding plausible word-sized units.</Paragraph>
    <Paragraph position="3"> Various segmentation approaches were then compared with human performance:  1. A greedy algorithm, GR: proceed through the sentence, taking the longest match with a dictionary entry at each point.</Paragraph>
    <Paragraph position="4"> 2. An 'anti-greedy' algorithm, AG: instead of the longest match, take the shortest match at each point.</Paragraph>
    <Paragraph position="5"> 3. The method being described -- henceforth ST.</Paragraph>
    <Paragraph position="6"> Two measures that can be used to compare judgments are: 1. Precision. For each pair of judges consider one  judge as the standard, computing the precision of the other's judgments relative to this standard.</Paragraph>
    <Paragraph position="7"> 2. Recall. For each pair of judges, consider one judge as the standard, computing the recall of the other's judgments relative to this standard.</Paragraph>
    <Paragraph position="8"> Obviously, for judges J1 and J2, taking ,/1 as standard and computing the precision and recall for J2 yields the same results as taking J2 as the standard, and computing for Jr, respectively, the recall and precision. We therefore used the arithmetic mean of each interjudge precision-recall pair as a single measure of interjudge similarity. Table 2 shows these similarity measures. The average agreement among the human judges is .76, and the average agreement between ST and the humans is .75, or about 99% of the inter-human agreement. (GR is .73 or 96%.) One can better visualize the precision-recall similarity matrix by producing from that matrix a distance matrix, computing a multidimensional scaling on that distance matrix, and plotting the first two most significant dimensions. The result of this is shown in Figure 4. In addition to the automatic methods, AG, GR and ST, just discussed, we also added to the plot the values for the current algorithm using only dictionary entries (i.e., no productively derived words, or names). This is to allow for fair comparison between the statistical method, and GR, which is also purely dictionary-based. As can be seen, GR and this 'pared-down' statistical method perform quite similarly, though the statistical method is still slightly better. AG clearly performs much less like humans than these methods, whereas the full statistical algorithm, including morphological derivatives and names, performs most closely to humans among the automatic methods. It can be also seen clearly in this plot, two of the Taiwan speakers cluster very closely together, and the third Taiwan speaker is also close in the most significant dimension (the z axis). Two of the Mainlanders also cluster close together but, interestingly, not particularly close to the Taiwan speakers; the third Mainlander is much more similar to the Taiwan speakers.</Paragraph>
    <Paragraph position="9"> Personal Name Identification: To evaluate personal name identification, we randomly selected 186 sentences containing 12,000 hanzi from our test corpus, and segmented the text automatically, tagging personal names; note that for names there is always a single un-ambiguous answer, unlike the more general question of which segmentation is correct. The performance was 80.99% recall and 61.83% precision. Interestingly, Chang et al. reported 80.67% recall and 91.87% precision on an 11,000 word corpus: seemingly, our system finds as many names as their system, but with four times as many false hits. However, we have reason to doubt Chang et al.'s performance claims. Without using the same test corpus, direct comparison is obviously difficult; fortunately Chang et al. included a list of about 60 example sentence fragments that exemplified various categories of performance for their system. The performance of our system on those sentences appeared rather better than theirs. Now, on a set of 11 sentence fragments where they reported 100% recall and precision for name identification, we had 80% precision and 73% recall. However, they listed two sets, one consisting of 28 fragments and the other of 22 fragments in which they had 0% precision and recall.</Paragraph>
    <Paragraph position="10"> On the first of these our system had 86% precision and 64% recall; on the second it had 19% precision and 33% recall. Note that it is in precision that our over-all performance would appear to be poorer than that of Chang et al., yet based on their published examples, our  system appears to be doing better precisionwise. Thus we have some confidence that our own performance is at least as good that of(Chang et al., 1992). s Evaluation of Morphological Analysis: In Table 3 we present results from small test corpora for some productive affixes; as with names, the segmentation of morphologically derived words is generally either right or wrong. The first four affixes are so-called resultative affixes: they denote some property of the resultant state of an verb, as in ~,,:;~ ~&amp;quot; wang4-bu4-1iao3 (forget-notattain) 'cannot forget'. The last affix is the nominal plural. Note that ~ in ~,:~: \]&amp;quot; is normally pronounced as leO, but when part of a resultative it is liao3. In the table are the (typical) classes of words to which the affix attaches, the number found in the test corpus by the method, the number correct (with a precision measure), and the number missed (with a recall measure).</Paragraph>
  </Section>
class="xml-element"></Paper>