<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-3023">
  <Title>Perceptron Learning for Chinese Word Segmentation</Title>
  <Section position="3" start_page="0" end_page="154" type="intro">
    <SectionTitle>
2 Character Based Chinese Word
Segmentation
</SectionTitle>
    <Paragraph position="0"> We adopted the character based methodology for Chinese word segmentation, in which every character in a sentence was checked one by one to see whether it was a word on its own or the beginning, middle, or end character of a multi-character word. In contrast, another commonly used strategy, the word based methodology, segments a Chinese sentence into the words in a pre-defined word list, possibly with probability information about each word, according to some maximum probability criterion (see e.g. Chen (2003)). The performance of word based segmentation depends on the quality of the word list used, while the character based method does not need any word list: it segments a sentence based only on the characters in the sentence.</Paragraph>
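As an illustration of the character based view, the following minimal sketch converts a sentence that is already segmented into words into per-character class labels. The tag names S, B, M and E and the function name tag_characters are hypothetical choices made for this example; the paper does not name its classes this way, and the sample sentence is invented.

```python
# Hypothetical tag names for the four character classes described in the text:
# single-character word, beginning, middle and end of a multi-character word.
SINGLE, BEGIN, MIDDLE, END = "S", "B", "M", "E"


def tag_characters(words):
    """Return (character, tag) pairs for a sentence given as a list of words."""
    pairs = []
    for word in words:
        if len(word) == 1:
            # A word made of one character is its own class.
            pairs.append((word, SINGLE))
            continue
        last = len(word) - 1
        for i, ch in enumerate(word):
            if i == 0:
                tag = BEGIN
            elif i == last:
                tag = END
            else:
                tag = MIDDLE
            pairs.append((ch, tag))
    return pairs


# Invented example sentence, segmented by hand into three words.
example = tag_characters(["我", "喜欢", "自然语言"])
```

Under this labelling, segmenting a sentence amounts to predicting one of the four tags for every character, with no word list involved.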
    <Paragraph position="1"> Using the character based methodology, we transform the word segmentation problem into four binary classification problems, corresponding to the single-character word and the beginning, middle and end characters of a multi-character word, respectively. For each of the four classes a classifier was learnt from the training set using the one vs. all others paradigm, in which every character in the training data belonging to the class considered was regarded as a positive example and all other characters as negative examples.</Paragraph>
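The one vs. all others construction can be sketched as follows, assuming each training character already carries one of four class tags. The tag names, the +1/-1 label encoding and the function name one_vs_rest_targets are assumptions made for illustration; the paper does not specify how the targets were encoded.

```python
# Hypothetical tag names for the four classes (single, beginning, middle, end).
CLASSES = ("S", "B", "M", "E")


def one_vs_rest_targets(tags):
    """Map each class to a list of +1/-1 labels, one label per character.

    For a given class, characters carrying that tag are positive examples
    and all other characters are negative examples.
    """
    return {
        cls: [1 if tag == cls else -1 for tag in tags]
        for cls in CLASSES
    }


# Tags of an invented seven-character training sentence.
targets = one_vs_rest_targets(["S", "B", "E", "B", "M", "M", "E"])
```

Each of the four label lists would then serve as the training target for one binary classifier over the same sequence of characters.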
    <Paragraph position="2"> After learning, we applied the four classifiers to each character in the test text and assigned the character the class whose classifier had the maximal output among the four. This kind of strategy has been widely used in applications of machine learning to named entity recognition and has also been used in Chinese word segmentation (Xue and Shen, 2003). Finally, a word delimiter (often a blank space, depending on the particular corpus) was added to the right of a character if it was not the last character of a sentence and it was predicted as the end character of a word or as a single-character word.</Paragraph>
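The decoding step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tag names, the function names predict_tags and insert_delimiters, and the classifier scores are all invented for the example.

```python
# Hypothetical tag names for the four classes (single, beginning, middle, end).
CLASSES = ("S", "B", "M", "E")


def predict_tags(scores):
    """Pick, for each character, the class whose classifier output is largest.

    `scores` holds one dict per character, mapping class tag to that
    classifier's output. Ties go to the class listed first in CLASSES.
    """
    return [max(CLASSES, key=lambda cls: row[cls]) for row in scores]


def insert_delimiters(chars, tags, delimiter=" "):
    """Add a delimiter after every non-final character tagged E or S."""
    out = []
    last = len(chars) - 1
    for i, (ch, tag) in enumerate(zip(chars, tags)):
        out.append(ch)
        if i != last and tag in ("E", "S"):
            out.append(delimiter)
    return "".join(out)


# Invented classifier outputs for an invented seven-character sentence.
chars = "我喜欢自然语言"
scores = [
    {"S": 0.9, "B": -0.2, "M": -0.8, "E": -0.5},
    {"S": -0.4, "B": 0.7, "M": -0.3, "E": -0.6},
    {"S": -0.1, "B": -0.7, "M": -0.2, "E": 0.8},
    {"S": -0.6, "B": 0.5, "M": -0.1, "E": -0.9},
    {"S": -0.9, "B": -0.3, "M": 0.6, "E": 0.1},
    {"S": -0.8, "B": -0.5, "M": 0.4, "E": 0.2},
    {"S": -0.2, "B": -0.9, "M": -0.1, "E": 0.7},
]
tags = predict_tags(scores)
segmented = insert_delimiters(chars, tags)
```

Because each character is labelled independently, nothing in this sketch prevents an inconsistent tag sequence such as a middle tag directly after an end tag; the delimiter rule only looks at the E and S predictions.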
  </Section>
</Paper>