<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1105">
  <Title>An Enhanced Model for Chinese Word Segmentation and Part-of-speech Tagging</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The Model
</SectionTitle>
    <Paragraph position="0"> The first step to establish the model is to make a formal description for its input and output. Here, a Chinese word segmentation and POS tagging system is viewed as with input, nCCC ,...,, 21 where Ci is the i'th Chinese character of the input sentence, and with output pairs, ( nm [?] )</Paragraph>
    <Paragraph position="2"> where Li is the word length of the i'th word in the segmented word sequence, Ti is the word tag, and each (Li, Ti) pair is corresponding to a segmented and tagged word, and [?]</Paragraph>
    <Paragraph position="4"> It is easily seen that the distinction between this model and other models is that this one introduces word length. In fact, word length really works, and affects the performance of the system in a great deal, of which our later experiments will approve.</Paragraph>
    <Paragraph position="5"> The motivation to introduce word length into our model is initially from the classical Chinese poems. When we read these poems, we may spontaneously obey some laws in where to have a pause. For example, in most cases, a 7-characterlined Jueju(A kind of poem format) is read as **/**/***. And the pauses in a sentence are much related to the length of words or chunks. Even in modern Chinese, word length also plays a part.</Paragraph>
    <Paragraph position="6"> Sometimes we prefer to use disyllabic words rather than single one, though both are correct in grammar. For example, in our daily lives, we always say &amp;quot; /n /v /n&amp;quot; or &amp;quot; /n /v /n&amp;quot;, but seldom hear &amp;quot; /n /v /n&amp;quot;, where &amp;quot; &amp;quot;, &amp;quot; &amp;quot; and &amp;quot; &amp;quot; have the same meaning. So, it is reasonable to assume that the occurrence of the word length will obey some unwritten laws when human writes or speaks.</Paragraph>
    <Paragraph position="7"> Introducing the word length into the word segmentation and POS tagging model may be in accord with the needs for processing Chinese.</Paragraph>
    <Paragraph position="8"> Another main characteristic of the model is that it is an integrated model, because there is only one hop through the input sentence to the output word-tag sequence.</Paragraph>
    <Paragraph position="9"> The following text will introduce how the model works. We will also inherit n-gram assumption in our model.</Paragraph>
    <Paragraph position="10"> Our destination is to find a sequence of (Li, Ti) pairs that maximizes the probability,  So, we have, )),(|(*)|()),(|( TLWPWCPTLCP = ...2.3 Because W is the segmentation of C , )|( WCP is always 1, and by another assumption that the occurrence of every word is independent to each other, then</Paragraph>
    <Paragraph position="12"> where )),(|( iii TLWP means the conditional probability of Wi under Li and Ti. For example, P(&amp;quot; &amp;quot; |2, v) is the conditional probability of &amp;quot; &amp;quot; under a 2-charactered verb which may be computed as (the number of &amp;quot; &amp;quot; appearing as a verb) / (the number of all 2-charactered verbs).</Paragraph>
    <Paragraph position="13"> With 2.3 and 2.4, )),(|( TLCP is ready.</Paragraph>
    <Paragraph position="14"> Then consider ),( TLP , which is easy to retrieve when we apply n-gram assumption.</Paragraph>
    <Paragraph position="15"> Suppose n is 2, which means that (Li, Ti) only depends on (Li-1, Ti-1).</Paragraph>
    <Paragraph position="17"> Here )),(|),(( 11 [?][?] iiii TLTLP means the probability of a Tag Ti with Length Li appearing next to Tag Ti-1 with Length Li-1, which may be computed as (the number of (Li-1, Ti-1)(Li, Ti) appearing in corpus) / (the number of (Li-1, Ti-1) appearing in corpus). So, ),( TLP is also ready.</Paragraph>
    <Paragraph position="18"> Combining formula 2.1, 2.2, 2.3, 2.4 and 2.5, we</Paragraph>
    <Paragraph position="20"> Now, the enhanced model is complete with 2.6.</Paragraph>
    <Paragraph position="21"> When establishing the model, we have made several assumptions.</Paragraph>
    <Paragraph position="22">  1. the dependency assumption between tag-length pairs, words and characters like the Bayers network of figure 2.1 2. Word and word are independent.</Paragraph>
    <Paragraph position="23"> 3. n-gram assumption on (T,L) pairs.  The validation of these assumptions is still somewhat in doubt, but the computational complexity of the model is decreased. All the resources required to achieve this model are also listed, i.e., a word list with</Paragraph>
    <Paragraph position="25"> LWP , and an n-gram transition network with probability ),...,|(</Paragraph>
    <Paragraph position="27"> The algorithm to implement this model is also rather simple , and using Dynamic Programming, we could finish the algorithm in O(cn), where n is the length of input sentence, and c is a constant related to the maximum ambiguity in a position.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Discussion
</SectionTitle>
    <Paragraph position="0"> Though the model itself is not difficult to implement as we have presented in last section, there are still some problems that we will be probably encountered with in practice. The first one is the data sparseness when we do the statistics. Another is how to further integrate Chinese Named Entity Recognition into the new, word-lengthintroduced model.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Data Sparseness
</SectionTitle>
      <Paragraph position="0"/>
      <Paragraph position="2"> word length is introduced, the need for larger corpus is greatly increased. Suppose we are using a tri-gram assumption on length-tag pairs, the number of tags is 28 as that of our system, and the max word length is 6, then the number of patterns we should count is,</Paragraph>
      <Paragraph position="4"> To retrieve a reasonable statistical result, the scale of the corpus should at least be several times larger than that value. It is common that we don't have such a large corpus, and meet the problem so called Data Sparseness.</Paragraph>
      <Paragraph position="5"> One way to deal with the problem is to find a good smoothing, and another is to make further independent assumption between word length and</Paragraph>
      <Paragraph position="7"> Now, the patterns to count are just as many as those of a traditional n-gram assumption that only assumes the dependency among tags.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Named Entity Recognition Integration
</SectionTitle>
      <Paragraph position="0"> Named Entity Recognition is one of the most important parts of word segmentation and POS tagging systems, for the words in word list are limited while the language seems infinite. There are always new words appearing in human language, among which human names, place names and organization names are most common and most valuble to recognize. The performance of Named Entity Recognition will have a deep impact on the performance of a whole word segmentation and POS tagging system. The research on Named Entity Recognition has appeared for many years.</Paragraph>
      <Paragraph position="1"> No matter whether the current performance of Named Entity Recognition is ideal or not, we will not discuss it here, and instead, we will just show how to integrate the existing Name Entity Recognition methods into the new model.</Paragraph>
      <Paragraph position="2"> During the integration, more attention should be paid to the structural and probabilistic consistency. For structural consistency, the original system structure does not need modifying when a new method of Named Entity Recognition is applied.</Paragraph>
      <Paragraph position="3"> For probabilistic consistency, the probabilit ies outputted by the Named Entity Recognition should be compatible with the probabilit ies of the words in the original word list.</Paragraph>
      <Paragraph position="4"> Here, we will take the Human Name Recognition as an example to show how to do the integration.</Paragraph>
      <Paragraph position="5"> [Zheng Jiahen, et al. 2000] has presented a probabilistic method for Chinese Human Name Recognition, which is easy to understand and suitable to be borrowed as a demonstration.</Paragraph>
      <Paragraph position="6"> That paper defined the probability for a Chinese Human Name as: )(*)()|( kEiFiknsP = ............................3.2 )(*)(*)()|( kEjMiFijknpP = .............3.3 Where each one of &amp;quot;i&amp;quot;, &amp;quot;j&amp;quot;, &amp;quot;k&amp;quot; represents a single Chinese characters, &amp;quot;ik &amp;quot;, &amp;quot;ijk&amp;quot; are the strings which may be a human name, &amp;quot;ns&amp;quot; means a single name when &amp;quot;j&amp;quot; is empty, &amp;quot;np&amp;quot; means plural name when &amp;quot;j&amp;quot; is not empty, F(i) is the probability of &amp;quot;i&amp;quot; being a family name, M(j) means the probability of &amp;quot;j&amp;quot; being the middle character of a human name, E(k) means the probability of &amp;quot;k&amp;quot; being the tailing character of a human name, P(ns  |ik ) is the probability of &amp;quot;ik &amp;quot; being a single name, and P(np | ijk ) is the probability of &amp;quot;ijk &amp;quot; being a plural name. F(i), M(j), and E(k) are easily retrieved from corpus, so P(ns  |ik ) and P(np  |ijk) can be known.</Paragraph>
      <Paragraph position="7"> However, P(ns  |ik ) and P(np  |ijk ) do not satisfy the requirements of the word length introduced model. The model needs probabilit ies like )),(|( tlwP , where w is a word, t is a word tag, and l is the word length. Therefore, P(ns  |ik ) needs to be modified into P(ik  |nh, 2), for ik is always a 2-charactered word, and likewise, P(np  |ijk ) needs to be modified into P(ijk  |nh, 3), where &amp;quot;nh&amp;quot; is the word tag for human name in our system.</Paragraph>
      <Paragraph position="8"> P(ns  |ik ) is equivalent to P(nh, 2  |ik ) and P(np | ijk ) is equivalent to P(nh, 3  |ijk). P(ns  |ik ) can be converted into P(ik  |nh, 2) through following way,  where &amp;quot;i&amp;quot;, &amp;quot;k&amp;quot; have the same meaning with those in 3.2 and 3.3. and nh is the tag for human name.</Paragraph>
      <Paragraph position="9"> In this formula, &amp;quot;i&amp;quot; and &amp;quot;k&amp;quot; are assumed to be independent. P(nh, 2), P(i), P(k) are easy to retrieve, which represent the probability of a 2-charactered human name, the probability of character &amp;quot;i&amp;quot; and the probability of character &amp;quot;k&amp;quot;. P(nh, 2  |ik ) is computed from 3.2. Thus, the conversion of P(nk  |nh, 2) to P(nh, 2  |ik ) is done. In the same way, P(np  |ijk) can be converted  Finally, the Human Name Recognition Module is integrated into the whole system. The input string C1, C2, ..., Cn first goes through the Human Name Recognition module, and the module outputs a temporary word list, which consists of a column of words that are probably human names and a column of probabilities corresponding to the words, which can be computed by 3.4 and 3.5. The whole system then merges the temporary word list and the original word list into a new word list, and applies the new word list in segmenting and tagging C1, C2, ..., Cn.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Conclusion &amp; Experiments
</SectionTitle>
    <Paragraph position="0"> This paper has presented an enhanced probabilistic model of Chinese Lexical Analysis, which introduces word length as one of the features and achieves the integration of word segmentation, Named Entity Recognition and POS tagging.</Paragraph>
    <Paragraph position="1"> At last, we will briefly give the results of our experiments. In the previous experiments, we have compared many simple probabilistic models for Chinese word segmentation and POS tagging, and found that the system using maximum word frequency as segmentation strategy and forward tri-gram Markov model as POS tagging model (MWF + FTMM) reaches the best performance.</Paragraph>
    <Paragraph position="2"> Our comparisons will be done between the MWF+FTMM and the enhance model with tri-gram assumption. The training corpus is 40MB annotated Chinese text from People's Daily. The testing data is about 1MB in size and is from  with named entity considered NOTES: MWF: Maximum Word Frequency, a very simple strategy in word segmentation disambiguation, which chooses the word sequence with max probability as its result.</Paragraph>
    <Paragraph position="3"> FTMM: Forward Tri-gram Markov Model, a popular model in POS tagging.</Paragraph>
    <Paragraph position="4"> MWF+FTMM: A strategy, which chooses the output that makes a balance between the MWF and FTMM as its result.</Paragraph>
    <Paragraph position="5"> WSA (by word): Word Segmentation Accuracy, measured by recall, i.e. the number of correct segments divided by the number of segments in corpus.</Paragraph>
    <Paragraph position="6"> (In a problem like word segmentation, the result of precision measurement is commonly around that of recall measurement.) PTA (by word): POS Tagging Accuracy based on correct segmentation, the number of words that are correctly segmented and tagged divided by the number of words that are correctly segmented.</Paragraph>
    <Paragraph position="7"> Total (by word): total accuracy of the system, measured by recall, i.e. the number of words that are correctly segmented and tagged divided by the number of words in corpus, or simply WSA * PTA.</Paragraph>
    <Paragraph position="8"> WSA (by sentence): the number of correctly segmented sentences divided by the number of sentences in corpus. A correctly segmented sentence is a sentence whose words are all correctly segmented.</Paragraph>
    <Paragraph position="9"> PTA (by sentence): the number of correctly tagged sentences divided by the number of correctly segmented sentences in corpus. A correctly tagged sentence is a sentence whose words are all correctly segmented and tagged.</Paragraph>
    <Paragraph position="10"> Total (by sentence): WSA * PTA.</Paragraph>
    <Paragraph position="11"> Named entity considered or not: When named entity is not considered, all the unknown words in corpus are deleted before evaluation.</Paragraph>
    <Paragraph position="12"> Otherwise, nothing is done on the corpus.</Paragraph>
    <Paragraph position="13"> According to the results above (Table 4.1, Table 4.2, Table 4.3, Table 4.4), the new enhanced model does better than the MWF + FTMM in every field. Introducing the word length into a Chinese word segmentation and POS taggin g system seems effective.</Paragraph>
    <Paragraph position="14"> This paper just focuses on the pure probabilistic model for word segmetation and POS tagging. It can be predicted that, with more disambiguation strategies, such as some rule based approaches, being implemented into the new model to achieve a multi-engine system, the performance will be further improved.</Paragraph>
  </Section>
class="xml-element"></Paper>