<?xml version="1.0" standalone="yes"?>
<Paper uid="W95-0109">
  <Title>Automatic Construction of a Chinese Electronic Dictionary</Title>
  <Section position="5" start_page="114" end_page="118" type="evalu">
    <SectionTitle>
9. Performance Evaluation
</SectionTitle>
    <Paragraph position="0"> To estimate the system performance automatically, the extracted dictionary is compared against a manually constructed standard dictionary. This is necessary because the extracted dictionary is large, and human verification would be both subjective and time-consuming. The performance is evaluated in terms of the word precision rate and recall rate for the VTW and TCC modules. The word precision rate is the number of n-grams common to the extracted word list and the standard dictionary divided by the number of n-grams in the extracted word list; conversely, the recall rate is the number of common n-grams divided by the number of n-grams in the standard dictionary. The VTT module is evaluated in terms of several weighted tag precision and recall measures.</Paragraph>
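The precision/recall computation described above amounts to simple set arithmetic; a minimal Python sketch (the placeholder ASCII strings stand in for Chinese n-grams, and the data is hypothetical; the paper gives no code):

```python
def precision_recall(extracted, standard):
    """Word precision/recall between an extracted n-gram list
    and the standard dictionary, both given as sets."""
    common = extracted & standard              # n-grams found in both lists
    precision = len(common) / len(extracted)   # common / extracted
    recall = len(common) / len(standard)       # common / standard
    return precision, recall

# Hypothetical placeholder n-grams (real entries would be Chinese bigrams, etc.)
extracted = {"AB", "CD", "EF", "XY"}        # proposed by the system
standard = {"AB", "CD", "EF", "GH", "IJ"}   # manually constructed dictionary
p, r = precision_recall(extracted, standard)
print(p, r)  # 3/4 = 0.75 and 3/5 = 0.6
```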
    <Paragraph position="1"> The standard Word Dictionary against which the extracted word list is compared is acquired by merging the word lists of two electronically available dictionaries [CKIP 90, BDC 93] with the words included in the seed corpus. It excludes all n-grams that never appear in the 9676-sentence seed corpus or the untagged text corpus, because such n-grams can never be input to the dictionary construction system.</Paragraph>
    <Paragraph position="2"> The merged dictionary, excluding entries that appear less frequently than the frequency lower bound (5), contains 17,005 bigram words, 2,524 trigram words and 1,612 4-gram words.</Paragraph>
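The construction of the standard Word Dictionary described above can be sketched as follows; this is a hedged illustration under the stated merge-and-filter scheme, and the function and variable names are ours, not the paper's:

```python
from collections import Counter

def build_standard_dictionary(dict_a, dict_b, seed_words, corpus_ngrams, freq_lb=5):
    """Merge the word lists of two dictionaries and the seed-corpus words,
    then drop every n-gram rarer than the frequency lower bound (5),
    since such n-grams are never input to the construction system."""
    freq = Counter(corpus_ngrams)
    merged = set(dict_a) | set(dict_b) | set(seed_words)
    return {w for w in merged if freq[w] >= freq_lb}

# Tiny hypothetical example: "AB" occurs 5 times, "CD" only 4, "EF" never.
corpus = ["AB"] * 5 + ["CD"] * 4
print(build_standard_dictionary(["AB", "CD"], ["EF"], [], corpus))  # {'AB'}
```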
    <Paragraph position="3"> The standard Word-Tag Dictionary against which the extracted POSes are compared is constructed from the BDC English-Chinese electronic dictionary [BDC 93]. The derived Word-Tag Dictionary contains 87,551 entries, including 35,722 bigram words, 19,858 trigram words, and 24,092 4-gram words. The tagset used in this dictionary contains 62 tags (including two punctuation tags). Note that there are only 42 tags in the smaller seed corpus of 1000 sentences, and the whole seed corpus of 9676 sentences contains only 47 POS tags (including one punctuation tag). The missing tags will therefore introduce some tag extraction errors in the training processes.</Paragraph>
    <Paragraph position="4"> Since the Word Dictionary and Word-Tag Dictionary used for comparison with the extracted dictionary are constructed independently of the corpus from which the lexicon entries are extracted, the reported performance could be greatly underestimated. For instance, an n-gram that is identified as a lexicon entry by the system but excluded from the Word Dictionary would not necessarily be judged a wrong word entry by an expert lexicographer. Ideally, the Word Dictionary and Word-Tag Dictionary should be constructed by an expert lexicographer based on the corpus itself for a fair comparison. Unfortunately, we cannot afford the manpower for such an evaluation on the large corpus. Care should therefore be taken when interpreting the performance figures reported in the following sections.</Paragraph>
    <Section position="1" start_page="115" end_page="116" type="sub_section">
      <SectionTitle>
9.1 Performance for the Basic (VTW+VTT) Topology
</SectionTitle>
      <Paragraph position="0"> Table 1 shows the performances in different stages for the Basic Model (columns 1-4) and the Postfiltering Model (columns 1-6) by using the small (1000-sentence) seed corpus. (Columns 1-4 are shared because the Postfiltering is applied immediately after the Basic Model.) The numerators in the parentheses are the numbers of correctly identified n-grams; for precision, the denominators are the numbers of n-grams in the extracted word lists; and for recall, they stand for the numbers of n-grams in the standard dictionary.</Paragraph>
      <Paragraph position="1"> The third column simply shows the initial precision and recall for the n-grams that are more frequent than the frequency lower bound LB; these word candidates are the baseline for evaluating the effects of the VTW and TCC modules. The Viterbi training process for extracting the word list goes through 4 iterations. With the small seed corpus, the precision for bigram words improves from the initial 17.07% to 38.21%, an increase of 21.14%, while the recall drops from 100% to 89.87%, a decrease of 10.13%. This shows that the Viterbi training procedure provides a significant improvement in precision while maintaining a reasonable recall.</Paragraph>
      <Paragraph position="2"> Note that the precision of the initial (frequency-filtered) word candidates with respect to the dictionary is an indicator of the difficulty of the task: it shows what percentage of the word candidates is recognized as words by the standard dictionary. From the table, only 3 to 4% of the initial word candidates in the large corpus are recognized as words by the human-constructed dictionary. Furthermore, there are only 317 trigram words and 40 4-gram words in the training seed corpus. As a result, it is difficult to spot such candidates in the large candidate list with reasonable precision and recall. Hence, it is not surprising that the performance for 3-grams and 4-grams is poor.</Paragraph>
      <Paragraph position="3"> For these reasons, we make no further comments on the 3-gram and 4-gram performances, which are trained and observed under a very difficult training environment. A few comments will be given in the section on error analysis, though.</Paragraph>
    </Section>
    <Section position="2" start_page="116" end_page="116" type="sub_section">
      <SectionTitle>
9.2 Performance for the Postfiltering (VTW+TCC+VTT) Topology
</SectionTitle>
      <Paragraph position="0"> The performance of the Postfiltering model is shown in columns 1-6 of Table 1. The two VTW modules in Figure 4 are identical, and each goes through 4 training iterations. With the small seed corpus, the bigram precision improves from 38.21% to 56.80%, with the recall decreasing from 89.87% to 76.98%, after the post-filter is installed. The global system achieves a precision rate of 56.88% at a recall rate of 77.37%.</Paragraph>
      <Paragraph position="1"> It is observed that, by using the large corpus (about ten times the size), the precisions increase only slightly (by about 2%). Therefore, the corpus size may not be the critical issue in this task; a better extraction model is more likely to improve the system further.</Paragraph>
    </Section>
    <Section position="3" start_page="116" end_page="117" type="sub_section">
      <SectionTitle>
9.3 Error Analysis for Word Identification Models
</SectionTitle>
      <Paragraph position="0"> The 3-gram and 4-gram precision rates are quite poor in the above tests. An inspection of the entries that are not recognized as words shows that some entries which should be considered words are not registered in the standard general dictionary. This means that the system does find some new words that were never seen by the standard dictionary and are thus counted as wrong. Some of these n-grams are frequently encountered domain-specific terms in politics, economics, etc., which would be considered new words to a general dictionary. Others include frequently encountered proper names (company names, city names) or productive lexicon entries. Although such terms may not be considered when constructing a general dictionary, it is useful to include such frequently used, high-frequency terms in an electronic dictionary for practical processing purposes. Therefore, the precision performance, estimated by comparison with a general dictionary, is usually underestimated.</Paragraph>
      <Paragraph position="1"> Excluding such n-grams, the other incorrectly extracted n-grams show some special patterns which suggest that the extraction models might be refined by extracting or filtering out n-grams according to their substring patterns. In particular, a 3-gram (or 4-gram) may have the following relationships with its substrings: 1. compositional: the n-gram can be decomposed into legal words (e.g., "this afternoon", "announce ... today", "intervene in the election"). 2. collocational: parts of the n-gram are legal words, while the other parts are highly flexible (e.g., "do not + VERB", "many + NOUN", "not + ADJECTIVE").</Paragraph>
      <Paragraph position="2"> 3. idiomatic: none of the substrings are legal words, and all single characters are highly flexible (e.g., "cannot be enumerated one-by-one").</Paragraph>
      <Paragraph position="3"> All the above patterns are related to the internal structure of the n-grams; our features and models, however, are more closely related to the intrinsic properties of the n-gram itself or to its contextual information with other n-grams. This explains why some highly associated n-grams which are not word units are extracted as words by the system. It also suggests that we could filter out inappropriate candidates which contain frequently encountered substrings and whose other parts show high entropy (or similar measures). A few simple filtering rules based on this observation show that the precision could be increased more effectively by refining the models in this way than by increasing the seed corpus size. A more extensive survey is under way.</Paragraph>
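As one illustration of the filtering idea above, a hypothetical rule could flag an n-gram as collocational when one part is a known word and the characters filling the remaining slot show high entropy in the corpus. All names, the threshold, and the data below are our assumptions, not the paper's actual rules:

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a Counter of observed slot fillers."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def looks_collocational(ngram, known_words, slot_fillers, threshold=1.5):
    """Flag an n-gram whose prefix is a known word and whose remaining
    slot varies freely in the corpus (high entropy over its fillers).
    slot_fillers maps each known word to a Counter of following strings."""
    for i in range(1, len(ngram)):
        prefix = ngram[:i]
        if prefix in known_words:
            fillers = slot_fillers.get(prefix, Counter())
            if fillers and entropy(fillers) > threshold:
                return True
    return False
```

Under such a rule, a candidate of the "do not + VERB" type would be rejected as a word, because the slot after the known prefix takes many different values with roughly equal frequency.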
    </Section>
    <Section position="4" start_page="117" end_page="117" type="sub_section">
      <SectionTitle>
9.4 Tagging Accuracy: Weighted Tagging Recall and Precision
</SectionTitle>
      <Paragraph position="0"> Because a word may be tagged differently in different contexts, a word identified by the VTW or TCC module may have more than one tag. For the tagging accuracy, we use several measures to estimate the performance. First, the number of word-tag pairs common to the extracted word-tag list and the Word-Tag Dictionary divided by the number of pairs in the extracted list is defined as the raw precision rate; the raw recall rate is defined similarly as the number of common word-tag pairs divided by the number of word-tag pairs in the Word-Tag Dictionary. With this measure, if a word in the extracted list has M tags, then all M word-tag pairs for that word are evaluated independently of the other pairs.</Paragraph>
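The raw measure can be written directly over (word, tag) pairs; a minimal sketch with hypothetical pairs:

```python
def raw_precision_recall(extracted_pairs, dict_pairs):
    """Raw tagging precision/recall: each (word, tag) pair
    is counted independently of the word's other pairs."""
    common = extracted_pairs & dict_pairs
    return len(common) / len(extracted_pairs), len(common) / len(dict_pairs)

extracted = {("w1", "n"), ("w1", "v"), ("w2", "a")}  # hypothetical system output
standard = {("w1", "n"), ("w2", "a"), ("w2", "n")}   # hypothetical Word-Tag Dictionary
p, r = raw_precision_recall(extracted, standard)     # 2 common pairs: both 2/3 here
```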
      <Paragraph position="1"> Because the annotated tags for a word are usually considered as a whole when constructing a dictionary entry, it may be desirable to define a per-word precision and per-word recall to measure how well the tags for a word are annotated, and then associate a weight with each word to evaluate the performance of the whole system.</Paragraph>
      <Paragraph position="2"> The per-word precision for a word is defined as the number of tags annotated both in the dictionary entry and in the extracted word-tag list for the word, divided by the number of tags in the extracted word-tag entry for the word. Conversely, the number of common tags divided by the number of tags in the corresponding dictionary entry is defined as the per-word recall for the word. For instance, if a word is tagged with the parts of speech [n, v, a] by the system, and it has the parts of speech [n, adv] in the standard dictionary, then the per-word recall will be 1/2 for this word and the per-word precision will be 1/3.</Paragraph>
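The worked example above ([n, v, a] against [n, adv]) can be reproduced with a short helper (the function name is ours):

```python
def per_word_pr(extracted_tags, dict_tags):
    """Per-word precision and recall over a single word's tag sets."""
    common = set(extracted_tags) & set(dict_tags)
    return len(common) / len(extracted_tags), len(common) / len(dict_tags)

# System tags the word [n, v, a]; the standard dictionary has [n, adv].
p, r = per_word_pr({"n", "v", "a"}, {"n", "adv"})  # only "n" is shared: p = 1/3, r = 1/2
```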
      <Paragraph position="3"> Based on the per-word precision and recall, we define the average precision (resp. recall) of the system as the sum of per-word precisions (resp. recalls) divided by the number of words in the word list.</Paragraph>
      <Paragraph position="4"> Alternatively, we could take the frequencies of the n-grams into account so that more frequently used words are given a heavier weight on their per-word precision and recall. The weighted precision (or recall) is defined as the sum, over all words, of the product of the per-word precision (or recall) and the word probability.</Paragraph>
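Both averaging schemes can be sketched together; the scores and probabilities below are hypothetical:

```python
def average_and_weighted(per_word_scores, word_probs):
    """per_word_scores maps each word to its per-word precision (or recall);
    word_probs maps each word to its relative frequency (summing to 1).
    Returns the unweighted average and the frequency-weighted sum."""
    average = sum(per_word_scores.values()) / len(per_word_scores)
    weighted = sum(s * word_probs[w] for w, s in per_word_scores.items())
    return average, weighted

# A frequent word with perfect tags outweighs a rare, half-correct one.
scores = {"w1": 1.0, "w2": 0.5}
probs = {"w1": 0.8, "w2": 0.2}
avg, wtd = average_and_weighted(scores, probs)  # average 0.75, weighted 0.9
```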
    </Section>
    <Section position="5" start_page="117" end_page="118" type="sub_section">
      <SectionTitle>
9.5 Part-of-Speech Extraction Performance
</SectionTitle>
      <Paragraph position="0"> To evaluate the performance of the Viterbi Part-of-Speech Tagging Module on the POS extraction task, the words in the segmented and POS tagged text corpus are compared against the Word-Tag Dictionary mentioned in a previous section.</Paragraph>
      <Paragraph position="1"> Since not all extracted words have a corresponding entry in the Word-Tag Dictionary, we evaluate the performance of the POS extraction module only over the entries common to both the extracted dictionary and the standard dictionary. The sizes of the common entry sets for the various models are around 8 to 9 thousand entries. On average, each dictionary entry contains about 1.4 parts of speech, and each entry annotated by the Viterbi training module has about 1.7 parts of speech.</Paragraph>
      <Paragraph position="2"> Table 2 shows the raw precision (Praw), average precision (Pavrg), weighted precision (Pwavg), and their corresponding recall rates. (The left-hand-side performance is acquired with a seed corpus of 1000 sentences, and the right-hand side with 9676 sentences.) The performance does not seem significantly different between the two models. This may imply that the segmented text corpora passed from the various models do not differ significantly. Furthermore, unlike in the word identification stage, the increase</Paragraph>
    </Section>
  </Section>
</Paper>