<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1731">
  <Title>Chunking-based Chinese Word Tokenization</Title>
  <Section position="3" start_page="111" end_page="211" type="intro">
    <SectionTitle>
2 HMM-based Chunking
2.1 HMM
</SectionTitle>
    <Paragraph position="0"> Given an input sequence $G_1^n = g_1 g_2 \cdots g_n$, the goal of chunking is to find a stochastically optimal tag sequence $T_1^n = t_1 t_2 \cdots t_n$ that maximizes (Zhou and Su 2002):</Paragraph>
    <Paragraph position="1"> $$\log P(T_1^n \mid G_1^n) = \log P(T_1^n) - \sum_{i=1}^{n} \log P(t_i) + \sum_{i=1}^{n} \log P(t_i \mid G_1^n)$$</Paragraph>
    <Paragraph position="3"> * The second term is the summation of log probabilities of all the individual tags.
    * The third term corresponds to the &quot;lexical&quot; component (dictionary) of the tagger. We will not discuss either the first or the second term further in this paper because n-gram modeling has been well studied in the literature. We will focus on the third term, $\sum_{i=1}^{n} \log P(t_i \mid G_1^n)$.</Paragraph>
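    <Paragraph> The three terms referred to here can be made explicit with a short derivation. The sketch below assumes the notation of Zhou and Su (2002), with $G_1^n$ for the input sequence and $T_1^n$ for the tag sequence, and their mutual information independence assumption in the second step; it is an illustration of how such a three-term objective can arise, not a restatement of the full model.

```latex
% Step 1: split the posterior into a tag prior and a mutual-information term
\log P(T_1^n \mid G_1^n) = \log P(T_1^n) + \log \frac{P(T_1^n, G_1^n)}{P(T_1^n)\,P(G_1^n)}

% Step 2 (assumption): the mutual information decomposes over the individual tags
\log \frac{P(T_1^n, G_1^n)}{P(T_1^n)\,P(G_1^n)} = \sum_{i=1}^{n} \log \frac{P(t_i, G_1^n)}{P(t_i)\,P(G_1^n)}

% Step 3: since P(t_i, G_1^n) / P(G_1^n) = P(t_i | G_1^n), each summand is log P(t_i | G_1^n) - log P(t_i)
\log P(T_1^n \mid G_1^n) = \log P(T_1^n) - \sum_{i=1}^{n} \log P(t_i) + \sum_{i=1}^{n} \log P(t_i \mid G_1^n)
```

    Read left to right, the first term is the tag-sequence prior handled by n-gram modeling, the second is the sum of individual tag log probabilities, and the third is the lexical component.</Paragraph>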
    <Paragraph position="5"/>
    <Section position="1" start_page="211" end_page="211" type="sub_section">
      <SectionTitle>
2.2 Chinese Word Tokenization
</SectionTitle>
      <Paragraph position="0"> Given the previous HMM, for Chinese word tokenization, we have (Zhou and Su 2002):</Paragraph>
      <Paragraph position="2"> $g_i = \langle p_i, w_i \rangle$, where $W_1^n = w_1 w_2 \cdots w_n$ is the word sequence; $P_1^n = p_1 p_2 \cdots p_n$ is the word formation pattern sequence and $p_i$ is the word formation pattern of $w_i$. Here $p_i$ consists of:</Paragraph>
      <Paragraph position="4"> o The percentage of $w_i$ occurring as a whole word (rounded to 10%)
      o The percentage of $w_i$ occurring at the beginning of other words (rounded to 10%)
      o The percentage of $w_i$ occurring at the end of other words (rounded to 10%)</Paragraph>
      <Paragraph position="6"> o The occurrence frequency feature, which is set to max(log(Frequency), 9).</Paragraph>
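      <Paragraph> The three percentage features listed above can be illustrated with a minimal Python sketch. It assumes that each unit is a single character, that the statistics are collected from a word-segmented training corpus, and that each percentage is taken over all occurrences of the character; none of these choices is stated in the text. The names pattern_features and round_to_10 are hypothetical, and the frequency feature is left out.

```python
from collections import Counter, defaultdict


def round_to_10(count, total):
    """Round count/total to the nearest 10 percent (halves round up), in integers."""
    return (20 * count + total) // (2 * total) * 10


def pattern_features(segmented_corpus):
    """Map each character to (whole, begin, end) percentages rounded to 10%.

    segmented_corpus is a list of sentences, each a list of words (assumed format).
    """
    counts = defaultdict(Counter)
    for sentence in segmented_corpus:
        for word in sentence:
            last = len(word) - 1
            for k, ch in enumerate(word):
                if last == 0:
                    counts[ch]["whole"] += 1    # the character is a word by itself
                elif k == 0:
                    counts[ch]["begin"] += 1    # first character of a longer word
                elif k == last:
                    counts[ch]["end"] += 1      # last character of a longer word
                else:
                    counts[ch]["middle"] += 1   # counted only in the denominator
    features = {}
    for ch, c in counts.items():
        total = sum(c.values())
        features[ch] = tuple(
            round_to_10(c[key], total) for key in ("whole", "begin", "end")
        )
    return features


# Toy corpus, invented for illustration only.
corpus = [["中国", "人"], ["人民", "中"], ["人", "中国人"]]
features = pattern_features(corpus)
print(features["人"])  # (50, 30, 30)
```

      Integer arithmetic is used for the rounding so that a value such as 25% is consistently rounded up to 30% rather than depending on floating-point behaviour.</Paragraph>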
      <Paragraph position="7"> * tag $t_i$: Here, a word is regarded as a chunk (called &quot;Word-chunk&quot;) and the tags are used to bracket and differentiate various types of Word-chunks. Chinese word tokenization can be regarded as a bracketing process, while differentiating between word types can help the bracketing process. For convenience, the tag used in Chinese word tokenization is called the &quot;Word-chunk tag&quot;. The Word-chunk tag is structural and consists of three parts:</Paragraph>
      <Paragraph position="9"> o Boundary category (B): a set of four values, 0, 1, 2 and 3, where 0 means that the current word is a whole entity and 1/2/3 means that the current word is at the beginning/in the middle/at the end of a word.</Paragraph>
      <Paragraph position="10"> o Word category (W): used to denote the class of the word. In our system, words are classified into two types: the pure Chinese word type and the mixed word type (for example, words including English characters, Chinese digits or Chinese numbers).</Paragraph>
      <Paragraph position="11"> o Word formation pattern (P): because the number of boundary and word categories is limited, the word formation pattern is added to the structural chunk tag so that more accurate models can be represented.</Paragraph>
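      <Paragraph> The bracketing role of the boundary category can be sketched in Python as follows. This is a minimal illustration that assumes the tagged units are single characters; the class WordChunkTag, the functions boundaries_for and bracket, the category label "chinese" and the empty pattern tuple are all hypothetical names and encodings, not taken from the paper.

```python
from typing import List, NamedTuple, Tuple


class WordChunkTag(NamedTuple):
    boundary: int             # 0 whole, 1 beginning, 2 middle, 3 end
    word_category: str        # e.g. "chinese" or "mixed" (hypothetical labels)
    pattern: Tuple[int, ...]  # word formation pattern (hypothetical encoding)


def boundaries_for(words: List[str]) -> List[int]:
    """Encode a segmentation as one boundary value per character."""
    out = []
    for word in words:
        if len(word) == 1:
            out.append(0)
        else:
            out.extend([1] + [2] * (len(word) - 2) + [3])
    return out


def bracket(units: List[str], tags: List[WordChunkTag]) -> List[str]:
    """Rebuild words from units using only the boundary part of each tag."""
    words, current = [], []
    for unit, tag in zip(units, tags):
        if tag.boundary in (0, 1) and current:
            # A new word starts, so close any unfinished one first.
            words.append("".join(current))
            current = []
        current.append(unit)
        if tag.boundary in (0, 3):
            # The unit is a whole word or ends one: emit the word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# Toy example, invented for illustration only.
words = ["中国", "的", "计算机"]
units = list("".join(words))
tags = [WordChunkTag(b, "chinese", ()) for b in boundaries_for(words)]
print(boundaries_for(words))  # [1, 3, 0, 1, 2, 3]
print(bracket(units, tags))   # ['中国', '的', '计算机']
```

      Only the boundary part is needed to recover the segmentation; in this sketch the word category and the word formation pattern are carried along in the tag without affecting the bracketing.</Paragraph>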
    </Section>
  </Section>
</Paper>