<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2045"> <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. A Collaborative Framework for Collecting Thai Unknown Words from the Web</Title> <Section position="8" start_page="350" end_page="351" type="concl"> <SectionTitle> 6 Conclusion </SectionTitle> <Paragraph position="0"> We proposed a framework for collecting Thai unknown words from the Web. Our framework comprises an information agent and an unknown-word analyzer. The role of the information agent is to collect and extract textual data from Web pages of given URLs. The unknown-word analyzer involves two processes: unknown-word detection and unknown-word boundary identification. Because written Thai does not mark word boundaries, unknown-word detection is based on a word-segmentation algorithm with morphological analysis. To take advantage of the large text resources available on the Web, unknown-word boundary identification is based on a statistical pattern-matching algorithm. We evaluated the proposed framework on a collection of Web pages obtained from a Thai newspaper's Web site. The evaluation was divided into two parts, one for each of the two processes underlying the framework. For unknown-word detection, the detection rate was found to be as high as 96%. In addition, by merging a few characters into a segment, the number of required unknown-word extractions was reduced by at least a factor of 3, while the detection rate was largely maintained. For unknown-word boundary identification, selecting the most frequently occurring string pattern was found to be the most effective approach. The identification accuracy was found to be as high as approximately 36%. This relatively low accuracy is not a major concern, since the unknown-word candidates are verified and corrected by users before they are added to the dictionary.</Paragraph> </Section> </Paper>