<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1731">
  <Title>Chunking-based Chinese Word Tokenization</Title>
  <Section position="4" start_page="211" end_page="211" type="metho">
    <SectionTitle>
3 Context-dependent Lexicons
</SectionTitle>
    <Paragraph position="0"> The major problem with Chunking-based Chinese word tokenization is how to effectively approximate . This can be done by adding lexical entries with more contextual information into the lexicon Ph. In the following, we will discuss five context-dependent lexicons which consider different contextual information.</Paragraph>
    <Paragraph position="2"/>
    <Section position="1" start_page="211" end_page="211" type="sub_section">
      <SectionTitle>
3.1 Context of current word formation pattern and current word
</SectionTitle>
      <Paragraph position="0"> and current word Here, we assume:</Paragraph>
      <Paragraph position="2"> [?] + and is a word formation pattern and word pair existing in the</Paragraph>
      <Paragraph position="4"/>
    </Section>
    <Section position="2" start_page="211" end_page="211" type="sub_section">
      <SectionTitle>
3.2 Context of previous word formation pattern and current word formation pattern
</SectionTitle>
      <Paragraph position="0"> and current word formation pattern Here, we assume :</Paragraph>
      <Paragraph position="2"> [?] and is a pair of previous word formation pattern and current word formation pattern existing in the</Paragraph>
      <Paragraph position="4"/>
    </Section>
    <Section position="3" start_page="211" end_page="211" type="sub_section">
      <SectionTitle>
3.3 Context of previous word formation pattern, previous word and current word formation pattern
</SectionTitle>
      <Paragraph position="0"> previous word and current word formation pattern Here, we assume :</Paragraph>
      <Paragraph position="2"> where is a triple pattern existing in the training corpus.</Paragraph>
      <Paragraph position="3"> iii pwp</Paragraph>
    </Section>
    <Section position="4" start_page="211" end_page="211" type="sub_section">
      <SectionTitle>
3.4 Context of previous word formation pattern, current word formation pattern and current word
</SectionTitle>
      <Paragraph position="0"> current word formation pattern and current word Here, we assume :</Paragraph>
    </Section>
    <Section position="5" start_page="211" end_page="211" type="sub_section">
      <SectionTitle>
3.5 Context of previous word formation pattern, previous word, current word formation pattern and current word
</SectionTitle>
      <Paragraph position="0"> previous word, current word formation pattern and current word Here, the context of previous word formation pattern, previous word, current word formation pattern and current word is used as a lexical entry to determine the current structural chunk tag and Ph=  where is a pattern existing in the training corpus. Due to memory limitation, only lexical entries which occurs at least 3 times are kept.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="211" end_page="211" type="metho">
    <SectionTitle>
4 Error-Driven Learning
</SectionTitle>
    <Paragraph position="0"> In order to reduce the size of lexicon effectively, an error-driven learning approach is adopted to examine the effectiveness of lexical entries and make it possible to further improve the chunking accuracy by merging all the above context-dependent lexicons in a single lexicon.</Paragraph>
    <Paragraph position="1"> For a new lexical entry e , the effectiveness is measured by the reduction in error which results from adding the lexical entry to the lexicon : . Here, is the chunking error number of the lexical entry e for the old lexicon</Paragraph>
    <Paragraph position="3"> is the list of new lexical entries added to the old lexicon ). If , we define the lexical entry e as positive for</Paragraph>
  </Section>
  <Section position="6" start_page="211" end_page="211" type="metho">
    <SectionTitle>
5 Implementation
</SectionTitle>
    <Paragraph position="0"> In training process, only the words occurs at least 5 times are kept in the training corpus and in the word table while those less-freqently occurred words are separated into short words (most of such short words are single-character words) to simulate the chunking. That is, those less-frequently words are regarded as chunked from several short words.</Paragraph>
    <Paragraph position="1"> In word tokenization process, the Chunking-based Chinese word tokenization can be implemented as follows: 1) Given an input sentence, a lattice of word and word formation pattern pair is generated by skimming the sentence from left-to-right, looking up the word table to determine all the possible words, and determining the word formation pattern for each possible word.</Paragraph>
    <Paragraph position="2"> 2) Viterbi algorithm is applied to decode the lattice to find the most possible tag sequence.</Paragraph>
    <Paragraph position="3"> 3) In this way, the given sentence is chunked into words with word category information discarded.</Paragraph>
  </Section>
class="xml-element"></Paper>