<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-3031">
  <Title>Two-Phase LMR-RC Tagging for Chinese Word Segmentation</Title>
  <Section position="3" start_page="0" end_page="184" type="metho">
    <SectionTitle>
2 Our Proposed Approach
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="183" type="sub_section">
      <SectionTitle>
2.1 Chinese Word Segmentation as Tagging
</SectionTitle>
      <Paragraph position="0"> One of the difficulties in Chinese word segmentation is that, Chinese characters can appear in different positions within a word (Xue and Shen, 2003), and LMR Tagging was proposed to solve the problem. The basic idea of LMR Tagging is toassigntoeachcharacter, basedonitscontextual  information,atagwhichrepresentsitsrelativeposition within the word. Note that the original tag set used by (Xue and Shen, 2003) is simplified and improved by (Ng and Low, 2004) . We shall then adopt and illustrate the simplified case here.</Paragraph>
      <Paragraph position="1"> The tags and their meanings are summarized in Table 1. Tag L, M, and R correspond to the character at the beginning, in the middle, and at the end of the word respectively. Tag S means the character is a &amp;quot;single-character&amp;quot; word. Figure 1 illustrates a Chinese sentence segmented by spaces, and the corresponding tagging results.</Paragraph>
      <Paragraph position="2"> After transforming the Chinese segmentation problem to the tagging problem, various solutions can be applied. Maximum Entropy model (MaxEnt) (Berger, S. A. Della Pietra, and  proposed in the original work to solve the LMR Tagging problem. In order to make MaxEnt success in LMR Tagging, feature templates used in capturing useful contextual information must be carefully designed. Furthermore, it is unavoidable that invalid tag sequences will occur if we just assign the tag with the highest probability. In the next subsection, we describe the feature templates and measures used to correct the tagging.</Paragraph>
      <Paragraph position="3">  ular Tagging, in which similar procedures as in the original LMR Tagging are performed. The difference in this phase as compared to the original one is that, we use extra feature templates to capture characteristics of Chinese word segmentation. The second phase, C-phase, is called Correctional Tagging, in which the sentences are re-tagged by incorporating the regular tagging results. We hope that tagging errors can be corrected under this way. The models used in both phases are trained using MaxEnt model.</Paragraph>
    </Section>
    <Section position="2" start_page="183" end_page="184" type="sub_section">
      <SectionTitle>
Regular Tagging Phase
</SectionTitle>
      <Paragraph position="0"> In this phase, each character is tagged similar to the original approach. In our scheme, given the contextual information (x) of current character, the tag (y[?]) with highest probability will be assigned:</Paragraph>
      <Paragraph position="2"> p(y|x).</Paragraph>
      <Paragraph position="3"> The features describing the characteristics of Chinese segmentation problem are instantiations of the feature templates listed in Table 2. Note that feature templates only describe the forms of features, but not the actual features. So the number of features used is much larger than the number of templates.</Paragraph>
      <Paragraph position="4">  Additional feature templates as compared to (Xue and Shen, 2003) and (Ng and Low, 2004) are template 5 and 6. Template 5 is used to handle documents with ASCII characters. For template 6, as it is quite common that word boundary occurs in between two characters with different types, this template is used to capture such characteristics. null Correctional Tagging Phase In this phase, the sequence of characters is re-tagged by using the additional information of tagging results after R-phase. The tagging procedure is similar to the previous phase, except extra features (listed in Table 3) are used to assist the tagging. null  models used in R- and C-phase: (1) Separated Mode, and (2) Integrated Mode. Separated Mode means the models used in two phases are separated. Model for R-phase is called R-model, and model for C-phase is called C-model. Integrated Mode means only one model, I-model is used in both phases.</Paragraph>
      <Paragraph position="5"> The training methods are illustrated now. First of all, training data are divided into three parts,  (1) Regular Training, (2) Correctional Training, and (3) Evaluation. Our method first trains using  observations extracted from Part 1 (observation is simply the pair (context,tag) of each character). The created model is used to process Part 2. After that, observationsextractedfromPart2(whichinclude previous tagging results) are used to create the final model. The performance is then evaluated by processing Part 3.</Paragraph>
      <Paragraph position="6"> Let O be the set of observations, with subscripts R or C indicating the sources of them. Let TrainModel : O - P, where P is the set of models, be the &amp;quot;model generating&amp;quot; function. The two proposed training methods can be illustrated as follow:  The advantage of Separated Mode is that, it is easy to aggregate different sets of training data. It also provides a mean to handle large training data under limited resources, as we can divide the training data into several parts, and then use the similar idea to train each part. The drawback of this mode is that, it may lose the features' characteristics captured from Part 1 of training data, and Integrated Mode is proposed to address the problem, in which all the features' characteristics in both Part 1 and Part 2 are used to train the model.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="184" end_page="185" type="metho">
    <SectionTitle>
3 Experimental Results and Discussion
</SectionTitle>
    <Paragraph position="0"> We conducted closed track experiments on the Hong Kong City University (CityU) corpus in The Second International Chinese Word Segmentation Bakeoff to evaluate the proposed training and tagging methods. The training data were split into three portions. Part 1: 60% of the data is trained for R-phase; Part 2: 30% for C-phase training; and Part 3: the remaining 10% for evaluation. The evaluation part was further divided into six parts to simulate actual size of test document. The MaxEnt classifier was implemented using Java opennlp maximum entropy package from (Baldridge, Morton, and Bierner, 2004), and training was done with feature cutoff of 2 and 160 iterations. The experiments were run on an Intel Pentium4 3.0GHz machine with 3.0GB memory.</Paragraph>
    <Paragraph position="1"> To evaluate our proposed scheme, we carried outfourexperimentsforeachevaluationdata. For Experiment 1, data were processed with R-phase only. For Experiment 2, data were processed with both R- and C-phase, using Separated Mode as training method. For Experiment 3, data were processed similar to Experiment 2, except Integrated Mode was used. Finally for Experiment 4, data were processed similar to Experiment 1, with both Part 1 and Part 2 data were used for Rmodeltraining. ThepurposeofExperiment4isto determine whether the proposed scheme can perform better than just the single Regular Tagging under the same amount of training data. Table 4 summarizes the experimental results measured in F-measure (the harmonic mean of precision and recall).</Paragraph>
    <Paragraph position="2"> From the results, we obtain the following observations. null  1. Both Integrated and Separated Training modes  bothExp3andExp4. Thereasonisthatthe C-model cannot capture enough features' characteristicsusedforbasictagging. Webelievethat by adjusting the proportion of Part 1 and Part 2 of training data, performance can be increased. 4. Under limited computational resources, in which constructing single-model using all available data (as in Exp 3 and Exp 4) is not possible, Separated Mode shows its advantage in constructing and aggregating multi-models by dividing the training data into different portions. null The official BakeOff2005 results are summarized in Table 5. We have submitted multiple results for CityU, MSR and PKU corpora by applying different tagging methods described in the paper.</Paragraph>
  </Section>
  <Section position="5" start_page="185" end_page="185" type="metho">
    <SectionTitle>
4 Conclusion
</SectionTitle>
    <Paragraph position="0"> We present a Two-Phase LMR-RC Tagging scheme to perform Chinese word segmentation.</Paragraph>
    <Paragraph position="1"> Correctional Tagging phase is introduced in addition to the original LMR Tagging technique, in which the Chinese sentences are re-tagged using extra information of first round tagging results.</Paragraph>
    <Paragraph position="2"> Two training methods, Separated Mode and Integrated Mode, are introduced to suit our scheme.</Paragraph>
    <Paragraph position="3"> Experimental results show that Integrated Mode achieve the highest accuracy in terms of Fmeasure, where Separated Mode shows its advantages in constructing and aggregating multi-models under limited resources.</Paragraph>
  </Section>
class="xml-element"></Paper>