<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2045">
  <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. A Collaborative Framework for Collecting Thai Unknown Words from the Web</Title>
  <Section position="5" start_page="346" end_page="347" type="metho">
    <SectionTitle>
3 Unknown-Word Problem in Word Segmentation Algorithms
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="346" end_page="347" type="sub_section">
      <SectionTitle>
Segmentation Algorithms
</SectionTitle>
      <Paragraph position="0"> Like Chinese, Japanese, and Korean, Thai belongs to the class of non-segmenting languages, in which words are written continuously without any explicit delimiting character.</Paragraph>
      <Paragraph position="1"> To handle non-segmenting languages, the first required step is to perform word segmentation. Most word segmentation algorithms use a lexicon or dictionary to parse texts at the character level. A typical word segmentation algorithm yields three types of results: known words, ambiguous segments, and unknown segments. Known words are words that exist in the lexicon. Ambiguous segments are caused by the overlapping of two known words. Unknown segments are combinations of characters which are not defined in the lexicon.</Paragraph>
      <Paragraph position="2"> In this paper, we are interested in extracting unknown words with high precision and recall. There are three types of unknown words: hidden, explicit, and mixed (Kawtrakul et al., 1997). Hidden unknown words are composed of different words that already exist in the lexicon. To illustrate the idea, consider an unknown word ABCD, where A, B, C, and D represent individual characters. Suppose that AB and CD both exist in a dictionary; then ABCD is considered a hidden unknown word. Explicit unknown words are newly created words formed from different characters. Consider again the unknown word ABCD. If no substring of ABCD (i.e., AB, BC, CD, ABC, BCD) exists in the dictionary, then ABCD is considered an explicit unknown word. Mixed unknown words are composed of both words that exist in a dictionary and substrings that do not. In the example of the unknown string ABCD, if at least one substring of ABCD (i.e., AB, BC, CD, ABC, BCD) exists in the dictionary while the remaining characters do not form dictionary words, then ABCD is considered a mixed unknown word.</Paragraph>
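      <Paragraph> The three definitions above can be sketched in a few lines of Python. This is only an illustrative reading of the definitions, not the authors' implementation: the function names, the min_len parameter (substrings of at least two characters, mirroring the AB, BC, CD, ABC, BCD list), and the choice to test for the hidden case before the mixed case are all assumptions.</Paragraph>
      <Paragraph><![CDATA[
```python
def can_be_fully_segmented(word, lexicon):
    """Return True if word is a concatenation of one or more lexicon entries."""
    n = len(word)
    # reachable[k] is True when word[:k] can be split into lexicon entries.
    reachable = [True] + [False] * n
    for end in range(1, n + 1):
        reachable[end] = any(
            reachable[start] and word[start:end] in lexicon
            for start in range(end)
        )
    return reachable[n]


def classify_unknown_word(word, lexicon, min_len=2):
    """Label a word as known, hidden, mixed, or explicit (hypothetical helper)."""
    if word in lexicon:
        return "known"
    # Hidden: every part of the word is itself a lexicon entry.
    if can_be_fully_segmented(word, lexicon):
        return "hidden"
    # Mixed: some substring is in the lexicon; explicit: none is.
    has_known_substring = any(
        word[i:j] in lexicon
        for i in range(len(word))
        for j in range(i + min_len, len(word) + 1)
    )
    return "mixed" if has_known_substring else "explicit"
```
]]></Paragraph>
      <Paragraph> With a lexicon containing AB and CD, the string ABCD is labeled hidden; with only AB in the lexicon it is labeled mixed; with neither it is labeled explicit.</Paragraph>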
      <Paragraph position="3"> It can be immediately seen that the detection of hidden unknown words is not trivial, since the parser would mistakenly assume that all fragments of the word are valid, i.e., previously defined in the dictionary. In this paper, we limit ourselves to the extraction of explicit and mixed unknown words. These types of unknown words usually represent transliterations of foreign words. Detection of these unknown words can be accomplished mainly by using a word-segmentation algorithm with a morphological analysis. By using a dictionary-based word-segmentation algorithm, the locations of words which are not defined in the lexicon can be easily detected.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="347" end_page="349" type="metho">
    <SectionTitle>
4 The Proposed Framework
</SectionTitle>
    <Paragraph position="0"> The overall framework is shown in Figure 1.</Paragraph>
    <Paragraph position="1"> The two major components are the information agent and the unknown-word analyzer. The details of each component are given as follows.</Paragraph>
    <Paragraph position="2"> * Information agent: This module is composed of a Web crawler and an HTML parser.</Paragraph>
    <Paragraph position="3"> It is responsible for collecting HTML sources from the given URLs and extracting the textual data from the pages. Our framework is designed to support a multi-user, collaborative environment. The advantage of this design approach is that unknown words can be collected and verified more efficiently.</Paragraph>
    <Paragraph position="4"> More importantly, it allows users to select the Web pages which suit their interests.</Paragraph>
    <Paragraph position="5"> * Unknown-word analyzer: This module is composed of several components for analyzing and extracting unknown words. The word segmentation module receives text strings from the information agent and segments them into a list of words. The N-gram generation module is responsible for generating hidden unknown-word candidates. The morphological analysis module is used to form initial explicit unknown-word segments. The string pattern-matching unit performs the unknown-word boundary identification task. It takes the intermediate unknown segments and identifies their boundaries by analyzing string matching patterns. The results are processed unknown-word candidates, which are presented to linguists for final post-processing and verification. New unknown words are added to the dictionary to iteratively improve the performance of the word segmentation module. Details of each component are given in the following subsections.</Paragraph>
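    <Paragraph> As an illustration of the text-extraction step performed by the information agent, the following minimal sketch uses the HTMLParser class from Python's standard library to pull visible text out of an HTML source. The class and function names are hypothetical, the crawling step is omitted, and the paper does not state which parser the framework actually uses.</Paragraph>
    <Paragraph><![CDATA[
```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside script/style elements.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the list of text chunks found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```
]]></Paragraph>
    <Paragraph> The returned chunks would then be passed as text strings to the word segmentation module of the unknown-word analyzer.</Paragraph>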
    <Section position="1" start_page="347" end_page="348" type="sub_section">
      <SectionTitle>
4.1 Unknown-Word Detection
</SectionTitle>
      <Paragraph position="0"> As mentioned in Section 3, applying a word-segmentation algorithm to a text string yields three different segmented outputs: known, ambiguous, and unknown segments. Since our goal is simply to detect the unknown segments without solving or analyzing other related issues in word segmentation, the longest-matching word segmentation algorithm proposed by Poowarawan (1986) is sufficient. An example to illustrate the word-segmentation process is given as follows.</Paragraph>
      <Paragraph position="1"> Let the following string denote a text string written in Thai: {a_1 a_2 ... a_i b_1 b_2 ... b_j c_1 c_2 ... c_k}. Suppose that {a_1 a_2 ... a_i} and {c_1 c_2 ... c_k} are known words from the dictionary, and let {b_1 b_2 ... b_j} be an unknown word. For the explicit unknown-word case, applying the word-segmentation algorithm would yield the following segments: {a_1 a_2 ... a_i}{b_1}{b_2}...{b_j}{c_1 c_2 ... c_k}. It can be observed that the detected unknown positions for a single unknown word are the individual characters of the unknown word itself. Based on an initial statistical analysis of a Thai lexicon, the average number of characters in a word was found to be 7. This characteristic is quite different from other non-segmenting languages such as Chinese and Japanese, in which a word can be a single character or a combination of only a few characters. Therefore, to reduce the complexity of the unknown-word boundary identification task, the unknown segments can be merged to form multiple-character segments.</Paragraph>
      <Paragraph position="3"> For example, merging two characters per segment would give the following unknown segments: {b_1 b_2}{b_3 b_4}...{b_j-1 b_j}. In the following experiment section, merging two to five characters per segment is compared with merging all unknown segments without limitation.</Paragraph>
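      <Paragraph> The segmentation and merging steps described above can be sketched as follows. This is a generic greedy longest-matching segmenter written for illustration; it is not taken from Poowarawan (1986), and the function names and the chars_per_segment parameter are assumptions.</Paragraph>
      <Paragraph><![CDATA[
```python
def longest_matching_segment(text, lexicon):
    """Greedy left-to-right segmentation; returns (segment, is_known) pairs."""
    max_len = max(map(len, lexicon), default=1)
    segments = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in lexicon:
                match = text[i:i + size]
                break
        if match is None:
            # No lexicon entry starts here: emit one unknown character.
            segments.append((text[i], False))
            i += 1
        else:
            segments.append((match, True))
            i += len(match)
    return segments


def merge_unknown(segments, chars_per_segment=None):
    """Merge runs of consecutive unknown characters.

    chars_per_segment=None merges each run into a single segment.
    """
    merged = []
    run = ""
    # The empty known sentinel flushes a run that ends the text.
    for segment, is_known in segments + [("", True)]:
        if not is_known:
            run += segment
            continue
        if run:
            step = chars_per_segment or len(run)
            merged.extend((run[k:k + step], False) for k in range(0, len(run), step))
            run = ""
        if segment:
            merged.append((segment, True))
    return merged
```
]]></Paragraph>
      <Paragraph> For a lexicon containing aaa and ccc, the string aaabbbbccc is first segmented into aaa, four single unknown characters b, and ccc; merging two characters per segment then gives the unknown segments bb and bb between the two known words.</Paragraph>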
      <Paragraph position="4"> Morphological analysis is applied to ensure grammatically correct word boundaries. Simple morphological rules are used in the framework. The rule set is based on two types of characters: front-dependent characters and rear-dependent characters. Front-dependent characters are characters which must be merged with the segment preceding them. Rear-dependent characters are characters which must be merged with the segment following them. In written Thai, these dependent characters are certain vowels and tonal characters which have specific grammatical constraints. Applying morphological analysis helps make the unknown segments more reliable.</Paragraph>
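      <Paragraph> A minimal sketch of how such dependency rules might be applied to a list of segments is shown below. The two character sets are small illustrative samples (a few following vowels and tone marks as front-dependent, the leading vowels as rear-dependent) and are assumptions; the paper does not list its actual rule set, and the function name is hypothetical.</Paragraph>
      <Paragraph><![CDATA[
```python
# Illustrative samples only, written as Unicode escapes; not the paper's rule set.
FRONT_DEPENDENT = set("\u0e30\u0e32\u0e48\u0e49\u0e4a\u0e4b")  # attach to preceding segment
REAR_DEPENDENT = set("\u0e40\u0e41\u0e42\u0e43\u0e44")  # attach to following segment


def apply_dependency_rules(segments, front_dependent=FRONT_DEPENDENT,
                           rear_dependent=REAR_DEPENDENT):
    """Move dependent characters across segment boundaries."""
    fixed = []
    for segment in segments:
        # Leading front-dependent characters join the previous segment.
        while segment and fixed and segment[0] in front_dependent:
            fixed[-1] += segment[0]
            segment = segment[1:]
        if segment:
            fixed.append(segment)
    # Trailing rear-dependent characters join the next segment.
    for k in range(len(fixed) - 1):
        while fixed[k] and fixed[k][-1] in rear_dependent:
            fixed[k + 1] = fixed[k][-1] + fixed[k + 1]
            fixed[k] = fixed[k][:-1]
    # Drop segments that were emptied by the moves.
    return [s for s in fixed if s]
```
]]></Paragraph>
      <Paragraph> A front-dependent character at the very start of the text, or a rear-dependent character at the very end, has no neighbour to join and is left in place by this sketch.</Paragraph>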
    </Section>
    <Section position="2" start_page="348" end_page="349" type="sub_section">
      <SectionTitle>
4.2 Unknown-Word Boundary Identification
</SectionTitle>
      <Paragraph position="0"> Once the unknown segments are detected, they are stored in a hash table along with their contextual information. Our unknown-word boundary identification approach is based on the string pattern-matching algorithm proposed by Boyer and Moore (1977). Treating unknown-word boundary identification as a string pattern-matching problem, there are two possible strategies: taking the longest matching pattern or taking the most frequent matching pattern as the unknown-word candidate. Both strategies can be explained more formally as follows. Consider a set of N text strings, {S_1, S_2, ..., S_N}, where S_i is a series of len_i characters denoted by {c_i,1 c_i,2 ... c_i,len_i} and each string is marked with an unknown-segment position pos_i, where 1 ≤ pos_i ≤ len_i. Given a new string S_j with an unknown-segment position pos_j, the longest pattern-matching strategy iterates through each string, S_1 to S_N, and records the longest string pattern which occurs in both S_j and another string in the set. The most frequent pattern-matching strategy also iterates through each string, S_1 to S_N, but records the matching pattern which occurs most frequently.</Paragraph>
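      <Paragraph> The two strategies can be illustrated with the sketch below. It is one possible interpretation and rests on several assumptions: strings are aligned at their unknown-segment positions and the match is extended character by character in both directions, positions are 0-based rather than 1-based, and all function names are hypothetical. It uses plain character comparison instead of the Boyer-Moore algorithm cited above.</Paragraph>
      <Paragraph><![CDATA[
```python
from collections import Counter


def common_pattern(s, pos_s, t, pos_t):
    """Substring shared by s and t when aligned at their unknown positions."""
    if s[pos_s] != t[pos_t]:
        return ""
    left = 0
    # Extend to the left while both strings agree.
    while (pos_s - left > 0 and pos_t - left > 0
           and s[pos_s - left - 1] == t[pos_t - left - 1]):
        left += 1
    right = 0
    # Extend to the right while both strings agree.
    while (pos_s + right + 1 < len(s) and pos_t + right + 1 < len(t)
           and s[pos_s + right + 1] == t[pos_t + right + 1]):
        right += 1
    return s[pos_s - left:pos_s + right + 1]


def boundary_candidates(new, pos_new, stored):
    """Return (longest, most_frequent) patterns against the stored strings.

    stored is a list of (string, unknown_position) pairs.
    """
    patterns = [common_pattern(new, pos_new, s, p) for s, p in stored]
    patterns = [p for p in patterns if p]
    if not patterns:
        return None, None
    longest = max(patterns, key=len)
    most_frequent = Counter(patterns).most_common(1)[0][0]
    return longest, most_frequent
```
]]></Paragraph>
      <Paragraph> The two strategies can disagree: a long pattern that matches only one stored string is chosen by the longest strategy, while a shorter pattern that recurs across many stored strings is chosen by the most frequent strategy.</Paragraph>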
      <Paragraph position="1"> The results of the unknown-word boundary identification are unknown-word candidates. These candidates are presented to the users for verification. Our framework is implemented via a Web-browser interface which provides a user-friendly environment. Figure 2 shows a screen snapshot of our system. Each unknown word is listed within a text field box which allows a user to edit and correct its boundary. The contexts can serve as editing guidelines and are also stored in the database.</Paragraph>
    </Section>
  </Section>
</Paper>