<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2045"> <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics A Collaborative Framework for Collecting Thai Unknown Words from the Web</Title> <Section position="4" start_page="346" end_page="346" type="intro"> <SectionTitle> 2 Previous Works </SectionTitle> <Paragraph position="0"> The unknown-word problem has been studied extensively over the past decades. Unknown words are viewed as a source of problems in NLP systems. Techniques for identifying and extracting unknown words are somewhat language-dependent. However, these techniques can be classified into two major categories, one for segmenting languages and another for non-segmenting languages. Segmenting languages, such as Latin-based languages, use delimiting characters to separate written words.</Paragraph> <Paragraph position="1"> Therefore, once unknown words are detected, their boundaries can be identified relatively easily compared to those in non-segmenting languages.</Paragraph> <Paragraph position="2"> Some examples of techniques for segmenting languages are as follows.</Paragraph> <Paragraph position="3"> Toole (2000) used multiple decision trees to identify names and misspellings in English texts. Features used in constructing the decision trees include POS (part-of-speech), word length, edit distance and character sequence frequency. Similarly, a decision-tree approach was used for POS disambiguation and unknown-word guessing in (Orphanos and Christodoulakis, 1999). Research on the unknown-word problem for segmenting languages is also closely related to the extraction of named entities.
These techniques differ from those for non-segmenting languages in that they parse the written text at the word level rather than the character level.</Paragraph> <Paragraph position="4"> Research on the unknown-word problem for non-segmenting languages is highly active for Chinese and Japanese. Many approaches have been proposed and tested. Asahara and Matsumoto (2004) proposed an SVM-based chunking technique to identify unknown words in Japanese texts. Their approach used a statistical morphological analyzer to divide texts into segments. The SVM was trained using POS tags to identify unknown-word boundaries. Chen and Ma (2002) proposed a practical unknown-word extraction system that considers both morphological and statistical rule sets for word segmentation. Chang and Su (1997) proposed an unsupervised iterative method for extracting unknown lexicons from a Chinese text corpus. Their idea is to add potential unknown words to an augmented dictionary in order to improve the word segmentation process. Their approach also uses both contextual constraints and a joint character association metric to filter out unlikely unknown words. Other approaches to identifying unknown words include statistical or corpus-based methods (Chen and Bai, 1998), the use of heuristic knowledge (Nie et al., 1995) and contextual information (Khoo and Loh, 2002).</Paragraph> <Paragraph position="5"> Some extensions to unknown-word identification have also been made. One example is the determination of POS for unknown words (Nakagawa et al., 2001).</Paragraph> <Paragraph position="6"> Research on unknown words for the Thai language has not been as extensive as for other languages. Kawtrakul et al. (1997) used a combination of a statistical model and a set of context-sensitive rules to detect unknown words. Our framework has a different goal from previous works.
We consider the unknown-word problem as a collaborative task among a group of interested users. As more textual content is provided to the system, new unknown words can be extracted with greater accuracy. Thus, our framework can be viewed as collaborative and statistical or corpus-based.</Paragraph> </Section> </Paper>