<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2045">
  <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. A Collaborative Framework for Collecting Thai Unknown Words from the Web</Title>
  <Section position="8" start_page="350" end_page="351" type="concl">
    <SectionTitle>
6 Conclusion
</SectionTitle>
    <Paragraph position="0"> We proposed a framework for collecting Thai unknown words from the Web. Our framework includes an information agent and an unknown-word analyzer. The task of the information agent is to collect and extract textual data from the Web pages of given URLs. The unknown-word analyzer involves two processes: unknown-word detection and unknown-word boundary identification. Because written Thai does not mark word boundaries, unknown-word detection is based on a word-segmentation algorithm with morphological analysis. To take advantage of the large amount of text available on the Web, unknown-word boundary identification is based on a statistical pattern-matching algorithm. We evaluated the proposed framework on a collection of Web pages obtained from a Thai newspaper's Web site. The evaluation was divided so that each of the two processes underlying the framework was tested separately. For unknown-word detection, the detection rate was found to be as high as 96%. In addition, by merging a few characters into a segment, the number of required unknown-word extractions was reduced by at least 3 times, while the detection rate was largely maintained. For unknown-word boundary identification, selecting the string pattern with the highest frequency of occurrence was found to be the most effective approach. The identification accuracy was found to be as high as approximately 36%. This relatively low accuracy is not a major concern, since the unknown-word candidates are verified and corrected by users before they are added to the dictionary.</Paragraph>
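    <Paragraph position="1"> As a rough illustration of frequency-based boundary identification, the following Python sketch picks, among candidate strings covering a detected unknown segment, the one that occurs most often in a text collection. It is not the implementation evaluated in this paper: the function names, the character window max_extend, and the tie-breaking rule (prefer the longer string) are assumptions made for illustration, and a toy ASCII string stands in for Thai text.

```python
def candidate_patterns(text, start, end, max_extend=3):
    """Enumerate substrings that cover text[start:end], extended by up to
    max_extend characters on each side (hypothetical candidate generator)."""
    candidates = set()
    for left in range(max(0, start - max_extend), start + 1):
        for right in range(end, min(len(text), end + max_extend) + 1):
            candidates.add(text[left:right])
    return candidates


def count_occurrences(corpus, pattern):
    """Count occurrences of pattern in corpus, allowing overlaps."""
    count = 0
    pos = corpus.find(pattern)
    while pos != -1:
        count += 1
        pos = corpus.find(pattern, pos + 1)
    return count


def pick_boundary(corpus, text, start, end, max_extend=3):
    """Return the candidate with the highest corpus frequency.
    Ties are broken in favour of the longer string (an assumption;
    otherwise the shortest candidate would always win)."""
    candidates = candidate_patterns(text, start, end, max_extend)
    return max(candidates, key=lambda p: (count_occurrences(corpus, p), len(p)))


# Toy example: the detected segment is "oob" (positions 3 to 6),
# and the string that recurs in every context is "foobar".
corpus = "xxfoobarxx yyfoobaryy zzfoobarzz"
print(pick_boundary(corpus, corpus, 3, 6))  # prints: foobar
```

In this sketch the detected segment is always one of the candidates, so the selection effectively returns the longest extension that is as frequent as the segment itself; the result would still be shown to users for verification before being added to the dictionary.</Paragraph>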
  </Section>
</Paper>