<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2045">
  <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. A Collaborative Framework for Collecting Thai Unknown Words from the Web</Title>
  <Section position="7" start_page="349" end_page="350" type="evalu">
    <SectionTitle>
5 Experiments and Results
</SectionTitle>
    <Paragraph position="0"> In this section, we evaluate the performance of our proposed framework. The corpus used in the experiments consists of 8,137 newspaper articles collected during 2003 from the Web site of a top-selling Thai newspaper (Thairath, 2003). The corpus contains a total of 78,529 unknown words, of which 14,943 are unique. The corpus focuses on unknown words that are transliterated from foreign languages, e.g., English, Spanish, Japanese and Chinese. In our framework we use the publicly available Thai dictionary LEXiTRON, which contains approximately 30,000 words (Lexitron, 2006).</Paragraph>
    <Paragraph position="1"> We first analyze the unknown-word set to observe its characteristics. Figure 3 shows the plot of the unknown-word frequency distribution. Not surprisingly, the frequency of unknown-word usage follows a Zipf-like distribution. This means there is a group of unknown words which are used very often, while some unknown words are used only a few times over a time period. Based on the frequency statistics of unknown words, only about 3% (2,375 words out of 78,529) occur only once in the corpus. Therefore, this finding supports the use of the statistical pattern-matching algorithm described in the previous section.</Paragraph>
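    <Paragraph position="2"> As a rough illustration of the frequency statistics above, the following Python sketch counts how often each unknown word occurs and reports the share of corpus occurrences contributed by words seen only once. The function name singleton_share and the toy token list are assumptions made for this example and are not part of the framework; the final line only re-checks the arithmetic of the figures quoted in the text.</Paragraph>
    <Paragraph position="3"><![CDATA[
```python
from collections import Counter


def singleton_share(unknown_tokens):
    """Return (singleton_count, total_tokens, share) for a list of tokens.

    A singleton is an unknown word that occurs exactly once in the list.
    """
    counts = Counter(unknown_tokens)
    singletons = sum(1 for c in counts.values() if c == 1)
    total = len(unknown_tokens)
    share = singletons / total if total else 0.0
    return singletons, total, share


# Toy data (hypothetical): "a" occurs 3 times, "b" twice, "c" once.
toy = ["a", "a", "a", "b", "b", "c"]
toy_result = singleton_share(toy)

# Arithmetic check on the figures quoted in the text (2,375 out of 78,529).
reported_share = 2375 / 78529
```
]]></Paragraph>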
    <Section position="1" start_page="349" end_page="350" type="sub_section">
      <SectionTitle>
5.1 Evaluation of Unknown-Word Detection Approaches
</SectionTitle>
      <Paragraph position="0"> As discussed in Section 4, multiple unknown segments can be merged to form a representative unknown segment. Merging helps reduce the complexity of unknown-word boundary identification, as fewer segments need to be checked for the same set of unknown words.</Paragraph>
      <Paragraph position="1"> The following variations of the merging approach are compared.</Paragraph>
      <Paragraph position="2">  * N-character Merging (N-char): Allow a maximum of N characters per segment.</Paragraph>
      <Paragraph position="3"> * Merging all segments (all): No limit on the number of characters per segment.</Paragraph>
      <Paragraph position="4">  We measure the performance of the unknown-word detection task using two metrics. The first is the detection rate (or recall), which is equal to the number of detected unknown words divided by the total number of previously tagged unknown words in the corpus. The second is the averaged number of detected positions per word. The second metric directly represents the overhead, or complexity, imposed on the unknown-word boundary identification process, because all detected positions from a single unknown word must be checked by that process. The comparison results are shown in Figure 4. As expected, the approach none gives the maximum detection rate of 96.6%, while the approach all yields the lowest detection rate. Another interesting observation is that the approach 2-char yields a detection rate comparable to that of the approach none; however, its averaged number of detected positions per word is about three times lower. Therefore, to reduce the complexity during the unknown-word boundary identification process, one might want to consider using the merging approach of</Paragraph>
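      <Paragraph position="5"> The two detection metrics can be sketched in Python as follows. The data layout (a mapping from each tagged unknown word to the list of positions flagged for it) and the names detection_rate and avg_positions_per_word are assumptions made for illustration. In particular, the text does not state whether the average is taken over all tagged words or only over detected ones; this sketch assumes the former.</Paragraph>
      <Paragraph position="6"><![CDATA[
```python
def detection_rate(detected_positions):
    """Fraction of tagged unknown words with at least one detected position."""
    if not detected_positions:
        return 0.0
    detected = sum(1 for positions in detected_positions.values() if positions)
    return detected / len(detected_positions)


def avg_positions_per_word(detected_positions):
    """Mean number of detected positions per tagged unknown word.

    Assumption: the average is taken over all tagged words, including
    words for which nothing was detected.
    """
    if not detected_positions:
        return 0.0
    total = sum(len(positions) for positions in detected_positions.values())
    return total / len(detected_positions)


# Hypothetical example: four tagged unknown words, one of them missed.
example = {
    "word_1": [10, 12, 15],
    "word_2": [40],
    "word_3": [],
    "word_4": [77, 79],
}
rate = detection_rate(example)              # 3 detected out of 4 tagged
overhead = avg_positions_per_word(example)  # 6 positions over 4 words
```
]]></Paragraph>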
    </Section>
    <Section position="2" start_page="350" end_page="350" type="sub_section">
      <SectionTitle>
5.2 Evaluation of Unknown-Word Boundary Identification
</SectionTitle>
      <Paragraph position="0"> The unknown-word boundary identification is based on a string pattern-matching algorithm. The following variations of the string pattern-matching technique are compared.</Paragraph>
      <Paragraph position="1">  phological analysis (freq-morph): Similar to the approach freq, but with additional morphological analysis to guarantee that the word boundaries are grammatically correct. The comparison among all variations of the string pattern-matching approach is performed across all unknown-segment merging approaches. The results are shown in Figure 5. The performance metric is the word-boundary identification accuracy, which is equal to the number of unknown words correctly extracted divided by the total number of tested unknown segments. It can be observed that the choice of merging approach does not really affect the accuracy of the unknown-word boundary identification process. But since the approach none generates approximately 6 positions per unknown segment on average, it would be more efficient to apply a merging approach that reduces the number of positions by at least a factor of 3.</Paragraph>
      <Paragraph position="2"> The plot also shows the comparison among the three string pattern-matching approaches. Figure 6 summarizes the accuracy of each string pattern-matching approach, averaged over all the different merging approaches. The approach long performed poorly, with an averaged accuracy of 8.68%. This is not surprising, because selecting the longest matching pattern does not mean that its boundary will be identified correctly. The approaches freq and freq-morph yield similar accuracies of about 36%. The approach freq-morph improves on the approach freq by less than 1%. This small improvement is due to the fact that the matching strings are mostly grammatically correct already. The errors are instead caused by matched collocations in the unknown-word context: if an unknown word very frequently occurs adjacent to another word, the two are likely to be extracted together by the algorithm. Our solution to this problem is to provide users with a user-friendly interface so that unknown-word candidates can be easily filtered and corrected.</Paragraph>
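      <Paragraph position="3"> The accuracy metric and the averaging over merging approaches can be sketched in Python as follows. The function names, the counts, and the use of an unweighted mean are assumptions for illustration only; the counts are toy values and are not the results reported in Figures 5 and 6.</Paragraph>
      <Paragraph position="4"><![CDATA[
```python
def boundary_accuracy(correct, tested):
    """Correctly extracted unknown words divided by tested unknown segments."""
    if tested <= 0:
        raise ValueError("tested must be positive")
    return correct / tested


def mean_accuracy(per_merging_accuracy):
    """Unweighted mean of the accuracies over the merging approaches.

    Assumption: every merging approach contributes equally to the average.
    """
    values = list(per_merging_accuracy.values())
    if not values:
        raise ValueError("no accuracies given")
    return sum(values) / len(values)


# Hypothetical (correct, tested) counts per merging approach.
toy_counts = {"none": (30, 100), "2-char": (40, 100), "all": (35, 100)}
toy_acc = {
    name: boundary_accuracy(correct, tested)
    for name, (correct, tested) in toy_counts.items()
}
toy_mean = mean_accuracy(toy_acc)
```
]]></Paragraph>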
    </Section>
  </Section>
</Paper>