<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-3033">
  <Title>Towards a Hybrid Model for Chinese Word Segmentation</Title>
  <Section position="3" start_page="189" end_page="190" type="metho">
    <SectionTitle>
2 System Description
</SectionTitle>
    <Paragraph position="0"> The overall architecture of the segmenter is described in Figure 1. An input sentence is first segmented into a character sequence, with a space inserted after each character. The segmented character sequence is then processed by the tagging component, where it is initially tagged by an HMM tagger, and then by a TBL tagger. Finally, the tagged character sequence is transformed into a word-segmented sentence by the merging component.</Paragraph>
    <Section position="1" start_page="189" end_page="189" type="sub_section">
      <SectionTitle>
2.1 The Tagging Component
</SectionTitle>
      <Paragraph position="0"> The tagset used by the tagging component consists of the following four tags: L, M, R, and W, each of which indicates that the character is in a word-initial, word-middle, or word-final position or is a monosyllabic word respectively. The transformation-based error-driven learning algorithm is adopted as the backbone of the tagging component over other promising machine learning algorithms because, as Brill (1995) argued, it captures linguistic knowledge in a more explicit and direct fashion without compromising performance. This algorithm requires a gold standard, some initial tagging of the training corpus, and a set of rule templates. It then learns a set of rules that are ranked in terms of the number of tagging error reductions they can achieve.</Paragraph>
      <Paragraph position="1"> A number of different initial tagging schemes can be used, e.g., tagging each character as a monosyllabic word or with its most probable POC tag. We used a simple first-order HMM tagger to produce an initial tagging. Specifically,</Paragraph>
      <Paragraph position="3"> where ti denotes the ith tag in the tag sequence and wi denotes the ith character in the character sequence. The transition probabilities and lexical probabilities are estimated from the training data. The lexical probability for an unknown character, i.e., a character that is not found in the training data, is by default uniformly distributed among the four POC tags defined in the tagset.</Paragraph>
      <Paragraph position="4"> The Viterbi algorithm (Rabiner, 1989) is used to tag new texts.</Paragraph>
      <Paragraph position="5"> The transformation-based tagger was implemented using fnTBL (Florian and Ngai, 2001).</Paragraph>
      <Paragraph position="6"> The rule templates used are the same as the contextual rule templates Brill (1995) defined for the POS tagging task. These templates basically transform the current tag into some other tag based on the current character/tag and the character/tag one to three positions before/after the  current character. An example rule template is given below: (1) Change tag a to tag b if the preceding character is tagged z.</Paragraph>
      <Paragraph position="7">  The training process is iterative. At each iteration, the algorithm picks the instantiation of a rule template that achieves the greatest number of tagging error reductions. This rule is applied to the text, and the learning process repeats until no more rules reduce errors beyond a pre-defined threshold. The learned rules can then be applied to new texts that are tagged by the initial HMM tagger.</Paragraph>
    </Section>
    <Section position="2" start_page="189" end_page="190" type="sub_section">
      <SectionTitle>
2.2 The Merging Component
</SectionTitle>
      <Paragraph position="0"> The merging component transforms a POC-tagged character sequence into a word-segmented sentence. In general, the characters in a sequence are concatenated, and a space is inserted after each character tagged R (word-final position) or W (monosyllabic word).</Paragraph>
      <Paragraph position="1">  In addition, two sets of heuristics are used in this process. One set (H1) is used to handle non-Chinese characters and numeric type compounds, e.g., numbers, time expressions, etc. A few patterns of non-Chinese characters and numeric type compounds are generalized from the training data. If the merging algorithm detects such a pattern in the character sequence, it groups the characters that are part of the pattern accordingly. null The second set of heuristics (H2) is used to handle words that three or more characters long. Our hypothesis is that long words tend to have less fluidity than shorter words and their behavior is more predictable (Lu, 2005). We extracted a wordlist from the training data. Based on our hypothesis, if the merging algorithm detects that a group of characters form a long word found in the wordlist, it groups these characters into one word.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>