<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0137">
  <Title>Chinese Word Segmentation based on an Approach of Maximum Entropy Modeling</Title>
  <Section position="5" start_page="0" end_page="202" type="metho">
    <SectionTitle>
2 System Overview
</SectionTitle>
    <Paragraph position="0"> Using Maximum Entropy approach for Chinese Word Segmentation is not a fresh idea, some previous works (Xue and Shen, 2003; Low, Ng and Guo, 2005) have got good performance in this field. But what we consider in the process of Segmentation is another way. We treat the input text which need to be segmented as a sequence of the Chinese characters, The segment process is, in fact, to find where we should split the character sequence. Thepointistogetthesegmentprobability between 2 Chinese characters, which is different from dealing with the character itself.</Paragraph>
    <Paragraph position="1"> In this section, training and segmentation process of the system is described to show how our system works.</Paragraph>
    <Section position="1" start_page="0" end_page="201" type="sub_section">
      <SectionTitle>
2.1 Pre-Process of Training
</SectionTitle>
      <Paragraph position="0"> For the first step we find the Minimal Segment Unit (MSU for short) of a text fragment in the training corpus. A MSU is a character or a string which is the minimal unit in a text fragment that cannot be segmented any more. According to the corpus, all of the MSUs can be divided into 5 type classes: &amp;quot;C&amp;quot; - Chinese Character (such as a47a92a48 and a47a208a48), &amp;quot;AB&amp;quot; - alphabetic string (such as &amp;quot;SIGHAN&amp;quot;), &amp;quot;EN&amp;quot; - digit string (such as &amp;quot;1234567&amp;quot;), &amp;quot;CN&amp;quot; - Chinese number string (such as a47a152a122a19a155a48) and &amp;quot;P&amp;quot; - punctuation (a47a167a48,a47a34a48,a47a182a48, etc). Besides the classes above, we define a tag &amp;quot;NL&amp;quot; as a special MSU, which refers to the beginning or ending of a text fragment. So, any MSU u can be described as: u[?]C[?]AB[?]EN[?]CN[?]P[?]{NL}. In order to check the capability of the pure Maximum Entropy model, in closed tracks, we didn't have any type of classes, the MSU here is every character of the text fragment, u[?]Cprime[?]{NL}. For instance, a47a183a130a235a92a10SIGHAN2006a169a99a140a109a34a48 is segmented into these MSUs: a47a183/a130/a235/a92/ a10/S/I/G/H/A/N/2/0/0/6/a169/a99/a140/a109/a34a48.</Paragraph>
      <Paragraph position="1"> Once we get all the MSUs of a text fragment, we can get the value of the Nexus Coefficient (NC for short) of any 2 adjacent MSUs according to the training corpus. The set of NC value can be  described as: NC [?] {0,1}, where 0 means those 2 MSUs are segmented and 1 means they are not segmented (Roughly, we appoint r = 0 if either one of the 2 adjacent MSUs is NL). For example, the NC value of these 2 MSUsa47a92a48anda47a208a48 in the text fragment a47a92a208a48 is 0 since these 2 characters is segmented according to the training corpus.</Paragraph>
    </Section>
    <Section position="2" start_page="201" end_page="201" type="sub_section">
      <SectionTitle>
2.2 Training
</SectionTitle>
      <Paragraph position="0"> Since the segmentation is to obtain NC value of any 2 adjacent MSUs (here we call the interspace of the 2 adjacent MSUs a check point, illustrated  In these features above, U+n (U[?]n) refers to the following (previous) n MSU of the check point with the information of relative position (Intuitively, We consider the same MSU has different effect on the NC value of the check point when its relative position is different to check point). And U[?]1U+1 is the 2 adjacent MSUs of the check point. r[?]2r[?]1 is the NC value of the previous 2 check points. Similarly, the (d) and (epsilon1) features represent the MSUs with their adjacent r. For instance, in the sentencea183a180a152a135a165a73a60, we can extract these features for the check point  since a152a135 is segmented into 2 characters, but in MSRA corpus,a152a135is treated as a word)</Paragraph>
      <Paragraph position="2"> After the extraction of the features, we use the ZhangLe's Maximum Entropy Toolkit1 to train the model with a feature cutoff of 1. In order to get the best number of iteration, 9/10 of the training data is used to train the model, and the other 1/10 portion of the training data is used to evaluate the model. Figure 1 and 2 show the results of the evaluation on MSRA and UPUC corpus.</Paragraph>
      <Paragraph position="3"> From the figures we can see the best iteration number range from 555 to 575 for MSRA corpus, and 360 to 375 for UPUC corpus. So we decide the iteration for 560 rounds for MSRA tracks and 365 rounds for UPUC tracks, respectively.</Paragraph>
    </Section>
    <Section position="3" start_page="201" end_page="202" type="sub_section">
      <SectionTitle>
2.3 Segmentation
</SectionTitle>
      <Paragraph position="0"> As we mentioned in the beginning of this section, the segmentation is the process to obtain the value  of every NC in a text fragment. This process is similar to the training process. Firstly, We scan the text fragment from start to end to get all of the MSUs. Then we can extract all of the features from the text fragment and decide which check point we should tag as r = 0 by this equation:</Paragraph>
      <Paragraph position="2"> where K is the number of features, Z is the normalization constant used to ensure that a probability distribution results, and c represents the context of the check point. aj is the weight for feature fj, here {a1a2 ...aK} is generated by the training data. We then compute P(r = 0|c) and</Paragraph>
      <Paragraph position="4"> After one check point is treated with value of r, the system shifts backward to the next check point until all of the check point in the whole text fragment are treated. And by calculating:</Paragraph>
      <Paragraph position="6"> toget anr sequencewhich canmaximizeP. From thisprocesswecanseethatthesequenceis,infact, a second-order Markov Model. Thus it is easily to think about more tags prior to the check point (as an nth-order Markov Model) to get more accuracy, but in this paper we only use the previous 2 tags from the check point.</Paragraph>
    </Section>
    <Section position="4" start_page="202" end_page="202" type="sub_section">
      <SectionTitle>
2.4 Identification of New words
</SectionTitle>
      <Paragraph position="0"> We perform the new word(s) identification as a post-process by check the word formation power (WFP) of characters. The WFP of a character is defined as: WFP(c) = Nwc/Nc, where Nwc is the number of times that the character c appears in a word of at least 2 characters in the training corpus, Nc is the number of times the character c occursinthetrainingcorpus. Afteratextfragment is segmented by our system, we extract all consecutive single characters. If at least 2 consecutive characters have the WFP larger than our threshold of 0.88, we polymerize them together as a word.</Paragraph>
      <Paragraph position="1"> For example,a47a178a214a151a48is a new word which is segmented as a47a178/a214/a151a48 by our system, WFP of these 3 characters is 0.9517,0.9818 and 1.0 respectively, then they are polymerized as one word.</Paragraph>
      <Paragraph position="2"> Besides the WFP, during the experiments, we find that the Maximum Entropy model can polymerizesomeMSUsasanewword(Wecallitpoly- null merization characteristic of the model), such asa170 a186a13a196 in the training corpus, we can extract a170 a186a13as the previous context feature of the check point aftera13, in another stringa83a44a13a142, we can extract the backward contexta142of the check point aftera13with r = 1. Then in the test, a new word a170a186a13a142 is recognized by the model since a170 a186a13 and a142 are polymerized if a13a142 appears together a large number of times in the training corpus. null</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="202" end_page="203" type="metho">
    <SectionTitle>
3 Performance analysis
</SectionTitle>
    <Paragraph position="0"> Here Table 1 illustrates the results of all 4 tracks we participate. The first column is the track name, and the 2nd column presents the Recall (R), the 3rd column the Precision (P), the 4th column is F-measure (F). The Roov refers to the recall of the out-of-vocabulary words and the Riv refers to the recall of the words in training corpus.</Paragraph>
    <Section position="1" start_page="202" end_page="202" type="sub_section">
      <SectionTitle>
3.1 Closed tracks
</SectionTitle>
      <Paragraph position="0"> For all of the closed tracks, we perform the segmentation as we mentioned in the section above, without any class defined. Every MSU we extract from the training data is a character, which may be a Chinese character, an English letter or a single digit. We extract the features based on this kind of MSUs to generate the models. The results show these models are not precise.</Paragraph>
      <Paragraph position="1"> For the UPUC closed track, the official released training data is rather small. Then the capability of the model is limited, this is the most reasonable negative effect on our F-measure 0.895.</Paragraph>
    </Section>
    <Section position="2" start_page="202" end_page="203" type="sub_section">
      <SectionTitle>
3.2 Open tracks
</SectionTitle>
      <Paragraph position="0"> The primary change between open tracks and closed tracks is that we have classified 5 classes (&amp;quot;C&amp;quot;,&amp;quot;AB&amp;quot;,&amp;quot;EN&amp;quot;,&amp;quot;CN&amp;quot; and &amp;quot;P&amp;quot;) to MSUs in order to improve the accuracy of the model. The classification really works and affects the performance of the system in a great deal. As this text fragment 1998a99 can be recognized as (EN)(C), which can also presents 1644a99, thus 1644a99can  be easily recognized though there is no 1664a99in the training data.</Paragraph>
      <Paragraph position="1"> The training corpus we used in UPUC open track is the same as in UPUC closed track. With those 5 classes, it is easily seen that the F-measure increased by 2.2% in the open tracks.</Paragraph>
      <Paragraph position="2"> For the MSRA open track, we adjust the class &amp;quot;P&amp;quot; by removing the punctuation &amp;quot;a33&amp;quot; from the class, because in the MSRA corpus, &amp;quot;a33&amp;quot; can be a part of a organization name, such as &amp;quot;a33&amp;quot; in a47a165a2a108a208a33a218a178a134a117a208a148a10a172a48. Besides, we add the Microsoft Research training data of SIGHAN bakeoff 2005 as extended training corpus. The larger training data cooperate with the classification method, the F-measure of the open track increased to 0.942 as comparison with 0.926 of closed track.</Paragraph>
    </Section>
    <Section position="3" start_page="203" end_page="203" type="sub_section">
      <SectionTitle>
3.3 Discussion of the tracks
</SectionTitle>
      <Paragraph position="0"> Through the tracks, we tested the performance by using the pure Maximum Entropy model in closed tracks and run with the improved model with classified MSUs in open tracks. It is shown that the pure model without any additional methods can hardly make us satisfied, for the open tracks, the model with classes are just acceptable in segmentation. null In both closed and open tracks, we use the same new word identification process, and with the polymerization characteristic of the model, we find the Roov is better than we expected.</Paragraph>
      <Paragraph position="1">  Ontheotherhand,inoursystem,thereisnodictionary used as we described in the sections above, the Riv of each track shows that affects the system performance.</Paragraph>
      <Paragraph position="2"> Another factor affects our system in the UPUC tracks is the wrongly written characters. Consider that our system is based on the sequence of characters, this kind of mistake is fatal. For example, in the sentencea166a130a195a35a117a129a140a79a27a60a27a123a146, wherea123a153is written asa123a146. The model cannot recognize it since a123a146 didn't occur in the training corpus. In the step of new word identification, the WFPs of the 2 characters a123a167a146 are 0.8917 and 0.8310, thus they are wrongly segmented into 2 single characters while they are treated as a word in the gold standard corpus. Therefore, we believe the results can increase if there are no such mistakes in the test data.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>