<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0131">
  <Title>POC-NLW Template for Chinese Word Segmentation</Title>
  <Section position="4" start_page="0" end_page="177" type="metho">
    <SectionTitle>
2 The Basic Word Segmentation Stage
</SectionTitle>
    <Paragraph position="0"> In the first stage, the basic word segmentation is accomplished. The key issue in this stage is the ambiguity problem, which is mainly caused by the fact that a Chinese character can occur in different word internal positions in different words (Xue, 2003). A lot of machine learning techniques have been applied to resolve this problem, the n-gram language model is one of the most popular ones among them (Fu and Luke, 2003; Li et al., 2005). As such, we also employed n-gram model in this stage.</Paragraph>
    <Paragraph position="1"> When a sentence is inputted, it is first segmented into a sequence of individual characters (e.g. ASCII strings, basic Chinese characters, punitions, numerals and so on), marked as C 1,n .</Paragraph>
    <Paragraph position="2"> According to the system's dictionary, several word sequences W 1,m will be constructed as candidates. The function of the n-gram model is to find out the best word sequence W</Paragraph>
    <Paragraph position="4"> The Maximum Likelihood method was used to estimate the word n-gram probabilities used in our model, and the linear interpolation method (Jelinek and Mercer, 1980) was applied to smooth these estimated probabilities.</Paragraph>
  </Section>
  <Section position="5" start_page="177" end_page="178" type="metho">
    <SectionTitle>
3 The OOV Word Identification Stage
</SectionTitle>
    <Paragraph position="0"> The n-gram method is based on the exiting grams in the model, so it is good at judging the connecting relationship among known words, but does not have the ability to deal with unknown words in substance. Therefore, another OOV word identification model is required.</Paragraph>
    <Paragraph position="1"> OOV words are regarded as words that do not exist in a system's machine-readable dictionary, and a more detailed definition can be found in (Wu and Jiang, 2000). In general, Chinese word can be created through compounding or abbreviating of most of existing characters and words. Thus, the key to solve the OOV word identification lies on whether the new word creation mechanisms in Chinese language can be extracted. Therefore, a POC-NLW language tagging template is introduced to explore such information on the character-level within words.</Paragraph>
    <Section position="1" start_page="177" end_page="177" type="sub_section">
      <SectionTitle>
3.1 The POC-NLW Template
</SectionTitle>
      <Paragraph position="0"> Many character-level based works have been done for the Chinese word segmentation, including the LMR tagging methods (Xue, 2003; Nakagawa. 2004), the IWP mechanism (Wu and Jiang, 2000). Based on these previous works, this POC-NLW template was derived. Assume that the length of a word is the number of component characters in it, the template is consist of two component: L max and a Wl-Pn tag set. L max to denote the maximum length of a word expressed by the template; a Wl-Pn tag denotes that this tag is assigned to a character at the n-th position within a l-length word, . Apparently, the size of this tag set is</Paragraph>
      <Paragraph position="2"> For example, the Chinese word &amp;quot;Ren Min &amp;quot; is tagged as: Ren W2P1, Min W2P2 and &amp;quot;Zhong Guo Ren &amp;quot; is tagged as: Zhong W3P1, Guo W3P2, Ren W3P3 In the example, two words are tagged by the template respectively, and the Chinese character &amp;quot;Ren &amp;quot; has been assigned two different tags. In a sense, the Chinese word creation mechanisms could be extracted through statistics of the tags for each character on a certain large corpus. On the other hand, while a character sequence in a sentence is tagged by this template, the word boundaries are obvious. Meanwhile, the word segmentation is implemented.</Paragraph>
      <Paragraph position="3"> In addition, in this template, known words and unknown words are both regarded as sequences of individual characters. Thus, the basic word segmentation process, the disambiguation process and the OOV word identification process can be accomplished in a unified process. Thereby, this model can also be used alone to implement the word segmentation task. This characteristic will make the word segmentation system much more efficient.</Paragraph>
    </Section>
    <Section position="2" start_page="177" end_page="178" type="sub_section">
      <SectionTitle>
3.2 The HMM Tagger
</SectionTitle>
      <Paragraph position="0"> Form the description of POC-NLW template, it can be found that the word segmentation could be implemented as POC-NLW tagging, which is similar to the so-called part-of-speech (POS) tagging problem. In POS tagging, Hidden Markov Model (HMM) was applied as one of the most significant methods, as described in detail in (Brants, 2000). The HMM method can achieve high accuracy in tagging with low processing costs, so it was adopted in our model.</Paragraph>
      <Paragraph position="1"> According to the definition of POC-NLW template, the state set of HMM corresponds to the Wl-Pn tag set, and the symbol set is composed of all characters. However, the initial state probability matrix and the state transition probability matrix are not composed of all of the tags in the state set. To express more clearly, we define two subset of the state set: * Begin Tag Set (BTS): this set is consisted of tag which can occur in the begging position in a word. Apparently, these tags must have the Wl-P1 form.</Paragraph>
      <Paragraph position="2"> * End Tag Set (ETS): correspond to BTS, tags in this set should occur in the end position, and their form should be like Wl-Pl.</Paragraph>
      <Paragraph position="3"> Apparently, the size of BTS is L max as well as of ETS. Thus, the initial state probability matrix corresponds to BTS instead of the whole state set. On the other hand, because of the word internal continuity, if the current tag Wl-Pn is not in ETS, than the next tag must be Wl-P(n+1). In other words, the case in which the transition probability is need is that when the current tag is in ETS and the next tag belongs to BTS. So, the state transition matrix in our model corresponds to  The probabilities used in HMM were defined similarly to those in POS tagging, and were estimated using the Maximum Likelihood method from the training corpus.</Paragraph>
      <Paragraph position="4"> In the two-stage strategy, the output word sequence of the first stage is transferred into the second stage. The items in the sequence, including individual characters and words, which do not have a bigram or trigram relationship with the surrounding items, are picked out with its surrounding items to compose several sequences of items. These item sequences are processed by the HMM tagger to form new item sequences. At last, these processed items sequences are combined into the whole word sequence as the final output.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>