<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3236"> <Title>Chinese Part-of-Speech Tagging: One-at-a-Time or All-at-Once? Word-Based or Character-Based?</Title> <Section position="3" start_page="0" end_page="21" type="intro"> <SectionTitle> 2 Word Segmentation </SectionTitle> <Paragraph position="0"> As a first step in our investigation, we built a Chinese word segmenter capable of performing word segmentation without using POS tag information. Since errors in word segmentation will propagate to the subsequent POS tagging phase in the one-at-a-time approach, in order for our study to give relevant findings, it is important that the word segmenter we use gives state-of-the-art accuracy.</Paragraph> <Paragraph position="1"> The word segmenter we built is similar to the maximum entropy word segmenter of (Xue and Shen, 2003). Our word segmenter uses a maximum entropy framework and is trained on manually segmented sentences. It classifies each Chinese character given the features derived from its surrounding context. Each character can be assigned one of 4 possible boundary tags: &quot;b&quot; for a character that begins a word and is followed by another character, &quot;m&quot; for a character that occurs in the middle of a word, &quot;e&quot; for a character that ends a word, and &quot;s&quot; for a character that occurs as a single-character word.</Paragraph> <Section position="1" start_page="0" end_page="21" type="sub_section"> <SectionTitle> 2.1 Word Segmenter Features </SectionTitle> <Paragraph position="0"> Besides implementing a subset of the features described in (Xue and Shen, 2003), we also came up with three additional types of features ((d)–(f) below) which improved the accuracy of word segmentation.
The default feature, the boundary tag feature of the previous character, and the boundary tag feature of the character two positions before the current character, all used in (Xue and Shen, 2003), were dropped from our word segmenter, as they did not improve word segmentation accuracy in our experiments.</Paragraph> <Paragraph position="1"> In the following feature templates used in our word segmenter, C refers to a Chinese character while W refers to a Chinese word. Templates (a)–(c) refer to a context of five characters (the current character and two characters to its left and right).</Paragraph> <Paragraph position="3"> C_0 denotes the current character, and C_n (C_{-n}) denotes the character n positions to the right (left) of the current character.</Paragraph> <Paragraph position="4"> CC =Ji Zhe to be set to 1.</Paragraph> </Section> <Section position="2" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 2.2 Our Additional Features </SectionTitle> <Paragraph position="0"/> </Section> </Section></Paper>
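<!--
The four boundary tags and the five-character context described in Section 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the padding symbol "#" used beyond the sentence edges, and the example sentence are all assumptions made for illustration. It shows only how manually segmented words map to per-character tags "b", "m", "e", "s", how a tag sequence maps back to words, and how the window C_-2 .. C_2 around a character could be read off. The maximum entropy classifier itself is not shown.

```python
def words_to_tags(words):
    """Map a list of segmented words to one boundary tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("s")  # single-character word
        else:
            # first character, any interior characters, last character
            tags.extend(["b"] + ["m"] * (len(word) - 2) + ["e"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their boundary tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("e", "s"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that stops mid-word
        words.append(current)
    return words


def context_window(chars, i, pad="#"):
    """Return the five characters C_-2 .. C_2 around position i.

    The pad symbol for positions outside the sentence is an assumption.
    """
    valid = range(len(chars))
    return [chars[j] if j in valid else pad for j in range(i - 2, i + 3)]


# Hypothetical example sentence, already segmented by hand.
example_words = ["我", "喜欢", "北京市"]
example_chars = "".join(example_words)
example_tags = words_to_tags(example_words)
print(example_tags)
print(tags_to_words(example_chars, example_tags))
print(context_window(example_chars, 0))
```

Training data for a character classifier of this kind would pair each character's context with the tag produced by words_to_tags; at test time, the predicted tags are turned back into words as in tags_to_words.
-->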