<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3236">
  <Title>Chinese Part-of-Speech Tagging: One-at-a-Time or All-at-Once? Word-Based or Character-Based?</Title>
  <Section position="4" start_page="21" end_page="22" type="metho">
    <SectionTitle>
:CW
</SectionTitle>
    <Paragraph position="0"> This feature captures the word context in which the current character is found. For example, the character &quot;She&quot; within the word &quot;Xin Hua She&quot; will have the feature [?]. A punctuation symbol is usually a good indication of a word boundary. The punctuation feature checks whether the current character is a punctuation symbol (such as &quot;.&quot;, &quot;-&quot;, &quot;,&quot;).</Paragraph>
    <Paragraph position="2"> This feature is especially helpful in predicting the word segmentation of dates and numbers, whose exact characters may not have been seen in the training text. Four type classes are defined: numbers belong to class 1, dates (&quot;Ri&quot;, &quot;Yue&quot;, &quot;Nian&quot;, the Chinese characters for &quot;day&quot;, &quot;month&quot;, &quot;year&quot;, respectively) belong to class 2, English letters belong to class 3, and other characters belong to class 4. For example, when considering the character &quot;Nian&quot; in the character sequence &quot;Jiu 0 Nian Dai R&quot;, the feature $T(C_{-2})T(C_{-1})T(C_0)T(C_1)T(C_2)$</Paragraph>
  </Section>
  <Section position="5" start_page="22" end_page="22" type="metho">
    <SectionTitle>
CTCTK
</SectionTitle>
    <Paragraph position="0"> = 11243 will be set to 1 (&quot;Jiu&quot; and &quot;0&quot; are the Chinese characters for &quot;9&quot; and &quot;0&quot; respectively).</Paragraph>
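    <Paragraph position="1"> The type-class feature can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the implementation used in the experiments: the character inventories NUMBER_CHARS and DATE_CHARS, the padding symbol for positions outside the sentence, the function names, and the Chinese characters standing in for the romanized example are all assumptions.

```python
# Assumed character inventories; the text does not list them exhaustively.
NUMBER_CHARS = set("0123456789〇零一二三四五六七八九十百千万亿")
DATE_CHARS = set("日月年")  # day, month, year


def char_class(ch):
    """Return the type class (1 to 4) of a single character."""
    if ch in NUMBER_CHARS:
        return 1
    if ch in DATE_CHARS:
        return 2
    if ch.isascii() and ch.isalpha():
        return 3
    return 4


def type_feature(chars, i, pad="#"):
    """Concatenate the classes of the five characters centered on position i.

    Positions outside the sequence are padded (an assumption) and fall
    into class 4.
    """
    window = []
    for j in range(i - 2, i + 3):
        inside = j >= 0 and len(chars) > j
        window.append(chars[j] if inside else pad)
    return "".join(str(char_class(c)) for c in window)


# Assumed original characters of the romanized example sequence.
print(type_feature("九〇年代R", 2))  # prints 11243
```

Because the feature value depends only on character classes, an unseen date or number maps to a class pattern that may already have occurred in training.</Paragraph>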
    <Section position="1" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
2.3 Testing
</SectionTitle>
      <Paragraph position="0"> During testing, the probability of a boundary tag assignment is determined by using the maximum entropy classifier to compute the probability that a boundary tag t is assigned to each individual character.</Paragraph>
      <Paragraph position="2"> If we were to just assign each character the boundary tag with the highest probability, the classifier could produce a sequence of invalid tags (e.g., &quot;m&quot; followed by &quot;s&quot;). To eliminate such possibilities, we implemented a dynamic programming algorithm which considers only valid boundary tag sequences given an input character sequence. At each character position i, the algorithm considers each last word candidate ending at position i and consisting of K characters (K = 1, ..., 20 in our experiments). To determine the boundary tag assignment to the last word W with K characters, the first character of W is assigned boundary tag &quot;b&quot;, the last character of W is assigned tag &quot;e&quot;, and the intervening characters are assigned tag &quot;m&quot;. (If W is a single-character word, then the single character is assigned &quot;s&quot;.) In this way, the dynamic programming algorithm only considers valid tag sequences, and we are also able to make use of the CW feature during testing.</Paragraph>
      <Paragraph position="3"> After word segmentation is done by the maximum entropy classifier, a post-processing step is applied to correct inconsistently segmented words made up of 3 or more characters. A word W is defined to be inconsistently segmented if the concatenation of 2 to 6 consecutive words elsewhere in the segmented output document matches W. In the post-processing step, the segmentation of the characters of these consecutive words is changed so that they are segmented as a single word. To illustrate, if the concatenation of 2 consecutive words &quot;Ba Sai Luo Na&quot; in the segmented output document matches another word &quot;Ba Sai Luo Na&quot;, then &quot;Ba Sai Luo Na&quot; will be re-segmented as &quot;Ba Sai Luo Na&quot;.</Paragraph>
    </Section>
    <Section position="2" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
2.4 Word Segmenter Experimental Results
</SectionTitle>
      <Paragraph position="0"> To evaluate the accuracy of our word segmenter, we carried out 10-fold cross validation (CV) on the 250K-word Penn Chinese Treebank (CTB) (Xia et al., 2000) version 3.0. The Java opennlp maximum entropy package from sourceforge (http://maxent.sourceforge.net) was used in our implementation, and training was done with a feature cutoff of 2 and 100 iterations.</Paragraph>
      <Paragraph position="1"> The accuracy of word segmentation is measured by recall (R), precision (P), and F-measure (F = 2PR/(P + R)). Recall is the proportion of correctly segmented words in the gold-standard segmentation, and precision is the proportion of correctly segmented words in the word segmenter's output.</Paragraph>
      <Paragraph position="2"> Figure 1 gives the word segmentation F-measure of our word segmenter based on 10-fold CV on the 250K-word CTB. Our word segmenter achieves an average F-measure of 95.1%. This accuracy compares favorably with (Luo, 2003), which reported 94.6% word segmentation F-measure using his full parser without additional lexical features, and about 94.9% word segmentation F-measure using only word boundary information, no POS tags or constituent labels, but with lexical features derived from a 58K-entry word list.</Paragraph>
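      <Paragraph position="5"> The recall, precision, and F-measure defined above can be computed by comparing word spans, as in the following Python sketch. It is an illustration only and not the official scorer; the function names and the toy sentences are assumptions, and both segmentations are assumed to be non-empty and to cover the same characters.

```python
def word_spans(words):
    """Map a segmented sentence to the set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def prf(gold_words, pred_words):
    """Recall, precision and F-measure of a predicted segmentation."""
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold.intersection(pred))  # words with identical spans
    recall = correct / len(gold)
    precision = correct / len(pred)
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return recall, precision, f


# Toy example: only the word "cd" is segmented identically in both.
recall, precision, f = prf(["a", "b", "cd"], ["ab", "cd"])
print(round(recall, 3), round(precision, 3), round(f, 3))
```

A word counts as correct only when both of its boundaries agree with the gold standard, which is why spans rather than word strings are compared.</Paragraph>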
      <Paragraph position="3"> The average training time taken to train on 90% of the 250K-word CTB was 12 minutes, while testing on 10% of CTB took about 1 minute. The running times reported in this paper were all obtained on an Intel Xeon 2.4GHz computer with 2GB RAM.</Paragraph>
      <Paragraph position="4"> [Figure 1: word segmentation F-measure for our word segmenter] As further evaluation, we tested our word segmenter on all 4 test corpora (CTB, Academia Sinica (AS), Hong Kong CityU (HK), and Peking University (PK)) of the closed track of the 2003 ACL-SIGHAN-sponsored First International Chinese Word Segmentation Bakeoff (Sproat and Emerson, 2003).</Paragraph>
    </Section>
    <Section position="3" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
International Chinese Word Segmentation
</SectionTitle>
      <Paragraph position="0"> For each of the 4 corpora, we trained our word segmenter on only the official released training data of that corpus. Training was conducted with a feature cutoff of 2 and 100 iterations (these parameters were obtained by cross validation on the training set), except for the AS corpus, where we used a cutoff of 3 since the AS training corpus was too big to train with a cutoff of 2.</Paragraph>
      <Paragraph position="1"> Figure 2 shows our word segmenter's F-measure (based on the official word segmentation scorer of the 2003 SIGHAN bakeoff) compared to those reported by all the 2003 SIGHAN participants in the four closed tracks. [?] exceptionally high out-of-vocabulary (OOV) rate of the test data (18.1%), our word segmenter's F-measure ranked in the third position. (Note that the top participant in the CTB closed track (Zhang et al., 2003) used additional named entity knowledge/data in their word segmenter.)</Paragraph>
      <Paragraph position="2"> For the CTB open task, we used as additional training data the AS training corpus provided by SIGHAN, after converting the AS training corpus to GB encoding. We found that with this additional AS training data added to the original official released CTB training data of SIGHAN, our word segmenter achieved an F-measure of 92.2%, higher than the best reported F-measure in the CTB open task. With sufficient training data, our word segmenter can perform very well. (Footnote: the last-ranked participant of SIGHAN CTB (closed), with an F-measure of 73.2%, is not shown in Figure 2 due to space constraints.)</Paragraph>
      <Paragraph position="3"> In our evaluation, we also found that the additional features we introduced in Section 2.2 and the post-processing step consistently improved average word segmentation F-measure when evaluated on the 4 SIGHAN test corpora in the closed track. The additional features improved F-measure by an average of about 0.4%, and the post-processing step added on top of the use of all features further improved F-measure by 0.3% (i.e., a cumulative increase of 0.7% in F-measure).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="22" end_page="79" type="metho">
    <SectionTitle>
3 One-at-a-Time, Word-Based POS Tagger
</SectionTitle>
    <Paragraph position="0"> Now that we have successfully built a state-of-the-art Chinese word segmenter, we are ready to explore issues of processing architecture and feature representation for Chinese POS tagging.</Paragraph>
    <Paragraph position="1"> An English POS tagger based on maximum entropy modeling was built by (Ratnaparkhi, 1996). As a first attempt, we investigated whether simply porting the method used by (Ratnaparkhi, 1996) for English POS tagging would work equally well for Chinese. Applied to Chinese POS tagging, Ratnaparkhi's method assumes that words are pre-segmented and assigns POS tags on a word-by-word basis, making use of word features in the surrounding context. This gives rise to a one-at-a-time, word-based POS tagger.</Paragraph>
    <Paragraph position="2"> Note that in a one-at-a-time approach, the word-segmented input sentence given to the POS tagger may contain word segmentation errors, which can lower the POS tagging accuracy.</Paragraph>
    <Section position="1" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
3.1 Features
</SectionTitle>
      <Paragraph position="0"> The following feature templates were chosen.</Paragraph>
      <Paragraph position="1"> W refers to a word while POS refers to the POS tag assigned.</Paragraph>
      <Paragraph position="3"> The feature Pu(W) checks whether all characters in the current word are punctuation characters. Feature (e) encodes the class of characters that constitute the surrounding words (similar to feature (f) of the word segmenter in Section 2.1). Four type classes are defined: a word is of class 1 if it is a number; class 2 if the word is made up of only numeric characters followed by &quot;Ri&quot;, &quot;Yue&quot;, or &quot;Nian&quot;; class 3 when the word is made up of only English characters and optionally punctuation characters; class 4 otherwise.</Paragraph>
    </Section>
    <Section position="2" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
3.2 Testing
</SectionTitle>
      <Paragraph position="0"> The testing procedure is similar to the beam search algorithm of (Ratnaparkhi, 1996), which tags each word one by one and maintains, as it sees a new word, the N most probable POS tag sequence candidates up to that point in the sentence. For our experiment, we have chosen N to be 3.</Paragraph>
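      <Paragraph position="1"> A beam search of this kind can be sketched in Python as follows. The sketch is illustrative and is not the tagger used in the experiments: the scoring callback logprob, which stands in for the maximum entropy model, and the toy scores are hypothetical.

```python
def beam_search(words, tagset, logprob, n=3):
    """Tag words left to right, keeping the n best tag sequences so far.

    logprob(words, i, prev_tags, tag) is a hypothetical callback returning
    the log-probability of tag for word i given the tags chosen so far.
    """
    beam = [(0.0, [])]  # (cumulative log-probability, tags so far)
    for i in range(len(words)):
        candidates = []
        for score, tags in beam:
            for tag in tagset:
                new_score = score + logprob(words, i, tags, tag)
                candidates.append((new_score, tags + [tag]))
        # Keep only the n most probable partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:n]
    return beam[0][1]


def toy_logprob(words, i, prev_tags, tag):
    """Made-up scores that prefer one tag per word."""
    preferred = {"dog": "NN", "runs": "VV"}
    return 0.0 if preferred.get(words[i]) == tag else -5.0


print(beam_search(["dog", "runs"], ["NN", "VV", "AD"], toy_logprob))
# ['NN', 'VV']
```

With N = 3, at most three partial tag sequences survive after each word, so the cost grows linearly with sentence length instead of exponentially.</Paragraph>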
    </Section>
    <Section position="3" start_page="22" end_page="79" type="sub_section">
      <SectionTitle>
3.3 Experimental Results
</SectionTitle>
      <Paragraph position="0"> The 250K-word CTB corpus, tagged with 32 different POS tags (such as &quot;NR&quot;, &quot;PU&quot;, etc.), was employed in our evaluation of POS taggers in this study. We ran 10-fold CV on the CTB corpus, using our word segmenter's output for each of the 10 runs as the input sentences to the POS tagger. POS tagging accuracy is simply calculated as (number of characters assigned correct POS tag) / (total number of characters).</Paragraph>
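      <Paragraph position="3"> The character-level accuracy measure can be sketched in Python as follows. This is an illustration under an assumption: each character simply inherits the POS tag of the word that contains it, and word boundaries are not checked separately. The function names and the toy sentences are hypothetical.

```python
def char_tags(tagged_words):
    """Expand (word, tag) pairs into one POS tag per character."""
    return [tag for word, tag in tagged_words for _ in word]


def pos_accuracy(gold, pred):
    """(characters assigned the correct POS tag) / (total characters)."""
    gold_tags, pred_tags = char_tags(gold), char_tags(pred)
    if len(gold_tags) != len(pred_tags):
        raise ValueError("gold and predicted sentences must cover the same characters")
    correct = sum(1 for g, p in zip(gold_tags, pred_tags) if g == p)
    return correct / len(gold_tags)


# Toy example: the segmenter wrongly splits a three-character word,
# so one of its characters receives a different tag.
gold = [("新华社", "NR"), ("电", "NN")]
pred = [("新华", "NR"), ("社", "NN"), ("电", "NN")]
print(pos_accuracy(gold, pred))  # 0.75
```

Counting characters rather than words lets the measure be applied even when the tagger's input contains word segmentation errors.</Paragraph>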
      <Paragraph position="1"> [Figure 3: POS tagging accuracy using the one-at-a-time, word-based POS tagger] The POS tagging accuracy is plotted in Figure 3. The average POS tagging accuracy achieved for the 10 experiments was only 84.1%, far lower than the 96% achievable by English POS taggers on the English Penn Treebank tag set. The average training time was 25 minutes, while testing took about 20 seconds. As an experiment, we also conducted POS tagging using only the features (a), (f), and (g) in Section 3.1, similar to (Ratnaparkhi, 1996), and we obtained an average POS tagging accuracy of 83.1% for that set of features.</Paragraph>
      <Paragraph position="2"> The features that worked well for English POS tagging did not seem to apply to Chinese in the maximum entropy framework. Language differences between Chinese and English have no doubt made the direct porting of an English POS tagging method to Chinese ineffective.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="79" end_page="89" type="metho">
    <SectionTitle>
4 One-at-a-Time, Character-Based POS Tagger
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="79" end_page="79" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> Since one-at-a-time, word-based POS tagging did not yield good accuracy, we proceeded to investigate other combinations of processing architecture and feature representation. We observed that character features were successfully used to build our word segmenter and that of (Xue and Shen, 2003). Similarly, character features were used to build a maximum entropy Chinese parser by (Luo, 2003), whose parser could perform word segmentation, POS tagging, and parsing in an integrated, unified approach. We hypothesized that assigning POS tags on a character-by-character basis, making use of character features in the surrounding context, may yield good accuracy. We therefore next investigated such a one-at-a-time, character-based POS tagger.</Paragraph>
    </Section>
    <Section position="2" start_page="79" end_page="79" type="sub_section">
      <SectionTitle>
4.1 Features
</SectionTitle>
      <Paragraph position="0"> The features that were used for our word segmenter ((a) to (f)) in Section 2.1 were applied again, with two additional features (g) and (h) to aid POS tag prediction.</Paragraph>
      <Paragraph position="1"> This feature refers to the POS tag of the previous character before the current word. For example, in the character [?] = P_PN is set to 1 (assuming &quot;Dui&quot; was tagged as P and &quot;Ci&quot; was tagged as PN).</Paragraph>
    </Section>
    <Section position="3" start_page="79" end_page="79" type="sub_section">
      <SectionTitle>
4.2 Testing
</SectionTitle>
      <Paragraph position="0"> The testing algorithm is similar to that described in Section 3.2, except that the probability of a word being assigned a POS tag t is estimated by the product of the probabilities of its individual characters being assigned the same POS tag t.</Paragraph>
      <Paragraph position="1"> For example, when estimating the probability of &quot;Xin Hua She&quot; being tagged NR, we find the product of the probabilities of &quot;Xin&quot; being tagged NR, &quot;Hua&quot; being tagged NR, and &quot;She&quot; being tagged NR.</Paragraph>
      <Paragraph position="2"> That is, we enforce the constraint that all characters within a segmented word in the pre-segmented input sentence must have the same POS tag.</Paragraph>
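      <Paragraph position="3"> The product of character-level probabilities can be sketched in Python as follows. The sketch is illustrative only: the callback char_prob, which stands in for the classifier and ignores context for brevity, and the probabilities in TOY_PROBS are made up, and the Chinese characters are an assumed rendering of the romanized example word. Logarithms are summed instead of multiplying probabilities directly, a standard way to avoid numerical underflow.

```python
import math


def word_tag_logprob(word, tag, char_prob):
    """Log-probability that every character of word receives the same tag.

    char_prob(ch, tag) is a hypothetical callback returning a probability
    greater than zero.
    """
    return sum(math.log(char_prob(ch, tag)) for ch in word)


def best_word_tag(word, tagset, char_prob):
    """Pick the single tag that maximizes the product over all characters."""
    return max(tagset, key=lambda t: word_tag_logprob(word, t, char_prob))


# Made-up character-level probabilities, for illustration only.
TOY_PROBS = {
    ("新", "NR"): 0.8, ("华", "NR"): 0.5, ("社", "NR"): 0.5,
    ("新", "NN"): 0.1, ("华", "NN"): 0.4, ("社", "NN"): 0.4,
}


def toy_char_prob(ch, tag):
    return TOY_PROBS[(ch, tag)]


print(best_word_tag("新华社", ["NR", "NN"], toy_char_prob))  # NR
```

Scoring the word with one shared tag is what enforces the constraint that all characters of a pre-segmented word carry the same POS tag.</Paragraph>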
    </Section>
    <Section position="4" start_page="79" end_page="89" type="sub_section">
      <SectionTitle>
4.3 Experimental Results
</SectionTitle>
      <Paragraph position="0"> 10-fold CV for CTB is repeated for this POS tagger. Figure 4 shows the detailed POS tagging accuracy. With a one-at-a-time, character-based POS tagger, the average POS tagging accuracy improved to 91.7%, 7.6 percentage points higher than that achieved by the one-at-a-time, word-based POS tagger. The average training time was 55 minutes, while testing took about 50 seconds.</Paragraph>
      <Paragraph position="1"> Between the two one-at-a-time approaches, the character-based approach was found to be significantly better than the word-based approach, at the level of significance 0.01.</Paragraph>
      <Paragraph position="2"> Assuming a one-at-a-time processing architecture, Chinese POS tagging using a character-based approach gives higher accuracy compared to a word-based approach.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="89" end_page="89" type="metho">
    <SectionTitle>
5 All-at-Once, Character-Based POS Tagger and Segmenter
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="89" end_page="89" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> Encouraged by the success of character features, we next explored whether a change in processing architecture, from one-at-a-time to all-at-once, while still retaining the use of character features, could give further improvement to POS tagging accuracy. In this approach, word segmentation and POS tagging are performed simultaneously in a single combined step. Each character is assigned both a boundary tag and a POS tag, for example &quot;b_NN&quot; (i.e., the first character in a word with POS tag NN). Thus, given 4 possible boundary tags and 32 unique POS tags present in the training corpus, each character can potentially be assigned one of 4 x 32 = 128 classes.</Paragraph>
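      <Paragraph position="1"> The combined labels can be illustrated with a short Python sketch that converts between tagged words and per-character labels. It is an illustration rather than the implementation used in the experiments; the function names are assumptions, words are assumed to be non-empty, and the Chinese characters are an assumed example.

```python
BOUNDARY_TAGS = ["b", "m", "e", "s"]


def combined_labels(pos_tags):
    """All boundary_POS classes, e.g. 4 x 32 = 128 for 32 POS tags."""
    return [b + "_" + p for b in BOUNDARY_TAGS for p in pos_tags]


def encode(tagged_words):
    """Turn (word, POS) pairs into one boundary_POS label per character."""
    labels = []
    for word, pos in tagged_words:
        if len(word) == 1:
            bounds = ["s"]
        else:
            bounds = ["b"] + ["m"] * (len(word) - 2) + ["e"]
        labels.extend(b + "_" + pos for b in bounds)
    return labels


def decode(chars, labels):
    """Recover (word, POS) pairs from a valid per-character label sequence."""
    words, current = [], ""
    for ch, label in zip(chars, labels):
        bound, pos = label.split("_", 1)
        current += ch
        if bound in ("e", "s"):  # a word ends at this character
            words.append((current, pos))
            current = ""
    return words


labels = encode([("新华社", "NR"), ("电", "NN")])
print(labels)  # ['b_NR', 'm_NR', 'e_NR', 's_NN']
print(decode("新华社电", labels))  # [('新华社', 'NR'), ('电', 'NN')]
```

Since a single label carries both decisions, one pass of the classifier over the characters yields the segmentation and the POS tags together.</Paragraph>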
    </Section>
  </Section>
  <Section position="9" start_page="89" end_page="89" type="metho">
    <SectionTitle>
5.1 Features
</SectionTitle>
    <Paragraph position="0"> The features we used are identical to those employed in the character-based POS tagger described in Section 4.1, except that features (g) and (h) are replaced with those listed below. In the following templates, B refers to the boundary tag assigned. For example, given the character sequence &quot;Dui Ci Yi Jian&quot;, when considering the character &quot;Jian&quot;, template (g) results in the feature [?]. Note that this approach is essentially that used by (Luo, 2003), since his parser performs both word segmentation and POS tagging (as well as parsing) in one unified approach. The features we used are similar to his tag features, except that we did not use features with three consecutive characters, since we found that the use of these features did not improve accuracy. We also added additional features (d) to (f).</Paragraph>
    <Section position="1" start_page="89" end_page="89" type="sub_section">
      <SectionTitle>
5.2 Testing
</SectionTitle>
      <Paragraph position="0"> A beam search algorithm with N = 3 is used during the testing phase.</Paragraph>
    </Section>
    <Section position="2" start_page="89" end_page="89" type="sub_section">
      <SectionTitle>
5.3 Experimental Results
</SectionTitle>
      <Paragraph position="0"> 10-fold CV on CTB was carried out again, using unsegmented test sentences as input to the program.</Paragraph>
      <Paragraph position="1"> Figure 5 shows the word segmentation F-measure, while Figure 6 shows the POS tagging accuracy achieved by this approach. With an all-at-once, character-based approach, an average word segmentation F-measure of 95.2% and an average POS tagging accuracy of 91.9% were achieved. The average training time was 3 hours, while testing took about 20 minutes.</Paragraph>
      <Paragraph position="2"> There is a slight improvement in word segmentation and POS tagging accuracy using this approach, compared to the one-at-a-time, character-based approach. When a paired t-test was carried out at the level of significance 0.01, the all-at-once approach was found to be significantly better than the one-at-a-time approach for POS tagging accuracy, although the difference was insignificant for word segmentation.</Paragraph>
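      <Paragraph position="4"> For reference, the statistic behind a paired t-test can be sketched in Python as follows. This is a generic textbook computation, not the script used for the reported tests; the function name is an assumption, and the input lists are arbitrary toy numbers, not results from the experiments.

```python
import math


def paired_t(xs, ys):
    """t statistic and degrees of freedom for paired samples xs and ys.

    Assumes at least two pairs and differences that are not all identical
    (otherwise the variance is zero and the statistic is undefined).
    """
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1


# Arbitrary toy pairs, chosen only to make the arithmetic easy to follow.
t, df = paired_t([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
print(round(t, 3), df)  # 3.464 2
```

With 10 folds there are 9 degrees of freedom, for which the two-sided critical value at the 0.01 level is about 3.250; whether the reported tests were one-sided or two-sided is not stated.</Paragraph>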
      <Paragraph position="3"> [Figures 5 and 6: word segmentation F-measure and POS tagging accuracy using an all-at-once approach] However, the time required for training and testing is increased significantly for the all-at-once approach. When efficiency is a major consideration, or if high quality hand-segmented text is available, the one-at-a-time, character-based approach could indeed be a worthwhile compromise, performing only slightly worse than the all-at-once approach. Table 1 summarizes the methods investigated in this paper. Total testing time includes both word segmentation and POS tagging on 10% of CTB data. Note that an all-at-once, word-based approach is not applicable, as word segmentation requires character features to determine the word boundaries.</Paragraph>
    </Section>
  </Section>
  <Section position="10" start_page="89" end_page="89" type="metho">
    <SectionTitle>
6 Discussions
</SectionTitle>
    <Paragraph position="0"> Word-based or character-based? The finding that a character-based approach is better than a word-based approach for Chinese POS tagging is not too surprising. Unlike English, where an individual letter does not carry any meaning by itself, many Chinese characters have well-defined meanings. For example, the single Chinese character &quot;Zhi&quot; means &quot;know&quot;. And when a character appears as part of a word, the word derives part of its meaning from the component characters. For example, &quot;Zhi Shi&quot; means &quot;knowledge&quot;, &quot;Wu Zhi&quot; means &quot;ignorant&quot;, &quot;Zhi Ming&quot; means &quot;well-known&quot;, etc. In addition, since the out-of-vocabulary (OOV) rate for Chinese words is much higher than the OOV rate for Chinese characters, in the presence of an unknown word, using the component characters in the word to help predict the correct POS tag is a good heuristic.</Paragraph>
    <Paragraph position="1"> One-at-a-time or all-at-once? The all-at-once approach, which considers all aspects of available information in an integrated, unified framework, can make better informed decisions, but incurs a higher computational cost.</Paragraph>
  </Section>
</Paper>