<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1723">
  <Title>A two-stage statistical word segmentation system for Chinese</Title>
  <Section position="3" start_page="0" end_page="21" type="metho">
    <SectionTitle>
2 The first stage: Segmentation of known words
</SectionTitle>
    <Paragraph position="0"> In a sense, known word segmentation is a process of disambiguation. Our system uses word bigram language models and the Viterbi algorithm (Viterbi, 1967) to resolve word boundary ambiguities in known word segmentation.</Paragraph>
    <Paragraph position="1"> For a particular input Chinese character string</Paragraph>
    <Paragraph position="3"> $C = c_1 c_2 \ldots c_n$, there is usually more than one possible segmentation</Paragraph>
    <Paragraph position="5"> $W = w_1 w_2 \ldots w_m$ according to the given system dictionary. Word bigram segmentation aims to find the most appropriate segmentation</Paragraph>
    <Paragraph position="7"> $\hat{W} = \arg\max_{W} \prod_{i=1}^{m} P(w_i \mid w_{i-1})$, where $P(w_i \mid w_{i-1})$ is the probability that word $w_i$ will occur given previous word $w_{i-1}$.</Paragraph>
    <Paragraph position="9"> To avoid the data sparseness problem of maximum likelihood estimation (MLE), we apply the linear interpolation technique (Jelinek and Mercer, 1980) to smooth the estimated word bigram probabilities.</Paragraph>
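    <Paragraph> As an illustration only, the following Python sketch shows one way a Viterbi search over dictionary segmentations could be combined with linearly interpolated bigram probabilities. It is not the system's implementation: the function names, the sentence-start marker BOS, the interpolation weight lam = 0.7, the maximum word length and the toy corpus are all assumptions made for this example.

```python
import math
from collections import Counter

BOS = "BOS"  # hypothetical sentence-start marker, not taken from the paper


def train(sentences):
    """Count unigrams and bigrams from pre-segmented sentences (lists of words)."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = [BOS] + list(words)
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def interpolated_prob(prev, word, unigrams, bigrams, lam=0.7):
    """Linear interpolation of the MLE bigram and MLE unigram estimates."""
    total = sum(unigrams.values())
    p_uni = unigrams[word] / total
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_bi + (1.0 - lam) * p_uni


def segment(text, unigrams, bigrams, max_len=4, lam=0.7):
    """Return the most probable known-word segmentation, or None if there is none."""
    # states[pos] maps the last word ending at pos to
    # (log-probability, start position of that word, previous word).
    states = [dict() for _ in range(len(text) + 1)]
    states[0][BOS] = (0.0, None, None)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word == BOS or word not in unigrams:
                continue  # only known (dictionary) words are considered
            for prev, (score, _, _) in states[start].items():
                p = interpolated_prob(prev, word, unigrams, bigrams, lam)
                cand = score + math.log(p)
                if word not in states[end] or cand > states[end][word][0]:
                    states[end][word] = (cand, start, prev)
    if not states[-1]:
        return None
    # Follow the back-pointers from the best final state.
    word = max(states[-1], key=lambda w: states[-1][w][0])
    pos, result = len(text), []
    while word != BOS:
        result.append(word)
        _, pos, word = states[pos][word]
    return result[::-1]


# Toy training data, invented for this example.
corpus = [
    ["研究", "生命", "起源"],
    ["研究", "生命"],
    ["研究生", "读", "书"],
    ["命", "好"],
]
unigrams, bigrams = train(corpus)
print(segment("研究生命起源", unigrams, bigrams))
```

With these toy counts the search prefers 研究 / 生命 / 起源 over 研究生 / 命 / 起源, because the second path relies on bigrams that were never observed and therefore receive only the interpolated unigram mass.</Paragraph>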
  </Section>
  <Section position="4" start_page="21" end_page="21" type="metho">
    <SectionTitle>
3 The second stage: Unknown word identification
</SectionTitle>
    <Paragraph position="0"> The second stage mainly concerns the segmentation of unknown words that remain unresolved after the first stage. This section describes a hybrid algorithm for unknown word identification that incorporates a word juncture model, word-formation patterns and contextual information. To avoid the complicated normalization of probabilities of different dimensions, the simple superposition principle is used in merging these models.</Paragraph>
    <Section position="1" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
3.1 Word juncture model
</SectionTitle>
      <Paragraph position="0"> The word juncture model scores an unknown word by assigning word juncture types. Most unknown words appear as a string of known words after segmentation in the first stage. Therefore, unknown word identification can be viewed as a process of re-assigning the correct word juncture type to each known word pair in the input. Given a known word string</Paragraph>
      <Paragraph position="2"> $W = w_1 w_2 \ldots w_n$, between each word pair $w_i w_{i+1}$ is a word juncture. In general, there are two types of junctures in unknown word identification, namely word boundary (denoted by</Paragraph>
      <Paragraph position="4"> $t(w_i w_{i+1})$ denote the type of a word juncture  $P(t(w_i w_{i+1}))$, the more likely the two words are merged together into one new word.</Paragraph>
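      <Paragraph> The exact probability definition is not recoverable from the text above, so the following Python sketch only illustrates one plausible reading: the probability that the juncture between two adjacent known words is a non-boundary is taken as a relative frequency over a training corpus. The input format (each corpus word given as a list of known-word pieces), the labels "boundary" and "non-boundary", the function names and the toy data are assumptions made for this example.

```python
from collections import Counter


def juncture_counts(sentences):
    """Count boundary / non-boundary junctures for adjacent known-word pairs.

    Each sentence is a list of corpus words; each corpus word is a list of
    the known-word pieces it is split into by the first stage.
    """
    counts = Counter()
    for sentence in sentences:
        # Remember which corpus word every piece came from.
        flat = [(piece, idx) for idx, word in enumerate(sentence) for piece in word]
        for (left, i), (right, j) in zip(flat, flat[1:]):
            kind = "non-boundary" if i == j else "boundary"
            counts[(left, right, kind)] += 1
    return counts


def non_boundary_prob(left, right, counts):
    """Relative frequency with which the pair (left, right) is merged into one word."""
    inside = counts[(left, right, "non-boundary")]
    total = inside + counts[(left, right, "boundary")]
    return inside / total if total else 0.0


# Toy data, invented for this example.
sentences = [
    [["张", "三"], ["来"], ["了"]],
    [["张", "三"], ["走"], ["了"]],
    [["老", "张"], ["三"], ["天"]],
]
counts = juncture_counts(sentences)
print(non_boundary_prob("张", "三", counts))
```

A higher value for a pair suggests that the two known words are more likely to be merged into one new word; pairs never seen in training receive 0.0 in this sketch, which a real system would presumably smooth.</Paragraph>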
    </Section>
    <Section position="2" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
3.2 Word-formation patterns
</SectionTitle>
      <Paragraph position="0"> The word-formation pattern model scores an unknown word according to the probability of how each internal known word contributes to its formation. In general, a known word w may take one of the following four patterns while forming a word: (1) w itself is a word. (2) w is the beginning of an unknown word. (3) w is in the middle of an unknown word. (4) w appears at the end of an unknown word. For convenience, we use S, B, M and E to denote the four patterns respectively.</Paragraph>
      <Paragraph position="1"> Let $pttn(w)$ denote a particular pattern of w in an</Paragraph>
      <Paragraph position="3"> Theoretically speaking, a known word can take any pattern while forming an unknown word. However, the probability is not uniform across different known words and different patterns. For example, the word Xing (xing4, nature) is more likely to act as a suffix of words. According to our investigation of the training corpus, the character Xing appears at the end of a multiword in more than 93% of cases.</Paragraph>
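      <Paragraph> The four patterns can be illustrated with a short Python sketch that labels the known-word pieces of each corpus word with S, B, M or E and then estimates, for every known word, the relative frequency of each pattern. The relative-frequency estimate, the function names and the toy data are assumptions made for this example, not the paper's definition.

```python
from collections import Counter, defaultdict


def label_patterns(pieces):
    """Label the known-word pieces of one corpus word with S, B, M or E."""
    if len(pieces) == 1:
        return [(pieces[0], "S")]
    labels = ["B"] + ["M"] * (len(pieces) - 2) + ["E"]
    return list(zip(pieces, labels))


def pattern_probabilities(corpus_words):
    """Relative frequency of each pattern for each known word."""
    counts = defaultdict(Counter)
    for pieces in corpus_words:
        for word, pattern in label_patterns(pieces):
            counts[word][pattern] += 1
    return {
        word: {p: c[p] / sum(c.values()) for p in "SBME"}
        for word, c in counts.items()
    }


# Toy data, invented for this example: each corpus word is given as the
# list of known words it is split into.
corpus_words = [
    ["可能", "性"],
    ["重要", "性"],
    ["积极", "性"],
    ["性"],
    ["可能"],
]
probs = pattern_probabilities(corpus_words)
print(probs["性"])
```

In this toy corpus the known word 性 takes pattern E in three of its four occurrences, which mirrors the kind of skew toward the suffix position described above, although the counts here are invented.</Paragraph>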
    </Section>
    <Section position="3" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
3.3 Hybrid algorithm for unknown word identification
</SectionTitle>
      <Paragraph position="0"> The current algorithm for unknown word identification consists of three major components: (1) An unknown word extractor first extracts a fragment of known words</Paragraph>
      <Paragraph position="2"> that may contain unknown words, based on the related word-formation power and word juncture probability, together with its left and right contextual word</Paragraph>
      <Paragraph position="4"> w from the output of the first stage. (2) A candidate word constructor then generates a lattice of all possible</Paragraph>
    </Section>
  </Section>
</Paper>