<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1731">
<Title>Chunking-based Chinese Word Tokenization</Title>
<Section position="4" start_page="211" end_page="211" type="metho">
<SectionTitle> 3 Context-dependent Lexicons </SectionTitle>
<Paragraph position="0"> The major problem with chunking-based Chinese word tokenization is how to effectively approximate the probability of the current structural chunk tag given its context. This can be done by adding lexical entries with more contextual information to the lexicon $\Phi$. In the following, we discuss five context-dependent lexicons that consider different contextual information. In the notation below, $p_i$ and $w_i$ are the current word formation pattern and current word, and $p_{i-1}$ and $w_{i-1}$ are the previous word formation pattern and previous word.</Paragraph>
<Section position="1" start_page="211" end_page="211" type="sub_section">
<SectionTitle> 3.1 Context of current word formation pattern and current word </SectionTitle>
<Paragraph position="0"> Here, we assume that the current structural chunk tag is determined by the context of the current word formation pattern and the current word: $\Phi = \{p_i w_i\}$, where $p_i w_i$ is a word formation pattern and word pair existing in the training corpus.</Paragraph>
</Section>
<Section position="2" start_page="211" end_page="211" type="sub_section">
<SectionTitle> 3.2 Context of previous word formation pattern and current word formation pattern </SectionTitle>
<Paragraph position="0"> Here, we assume that the current structural chunk tag is determined by the context of the previous and current word formation patterns: $\Phi = \{p_{i-1} p_i\}$, where $p_{i-1} p_i$ is a pair of previous word formation pattern and current word formation pattern existing in the training corpus.</Paragraph>
</Section>
<Section position="3" start_page="211" end_page="211" type="sub_section">
<SectionTitle> 3.3 Context of previous word formation pattern, previous word and current word formation pattern </SectionTitle>
<Paragraph position="0"> Here, we assume that the current structural chunk tag is determined by the context of the previous word formation pattern, the previous word and the current word formation pattern: $\Phi = \{p_{i-1} w_{i-1} p_i\}$, where $p_{i-1} w_{i-1} p_i$ is a triple pattern existing in the training corpus.</Paragraph>
</Section>
<Section position="4" start_page="211" end_page="211" type="sub_section">
<SectionTitle> 3.4 Context of previous word formation pattern, current word formation pattern and current word </SectionTitle>
<Paragraph position="0"> Here, we assume that the current structural chunk tag is determined by the context of the previous word formation pattern, the current word formation pattern and the current word: $\Phi = \{p_{i-1} p_i w_i\}$, where $p_{i-1} p_i w_i$ is a triple pattern existing in the training corpus.</Paragraph>
</Section>
<Section position="5" start_page="211" end_page="211" type="sub_section">
<SectionTitle> 3.5 Context of previous word formation pattern, previous word, current word formation pattern and current word </SectionTitle>
<Paragraph position="0"> Here, the context of the previous word formation pattern, the previous word, the current word formation pattern and the current word is used as a lexical entry to determine the current structural chunk tag: $\Phi = \{p_{i-1} w_{i-1} p_i w_i\}$, where $p_{i-1} w_{i-1} p_i w_i$ is a pattern existing in the training corpus. Due to memory limitations, only lexical entries that occur at least 3 times are kept.</Paragraph>
</Section>
</Section>
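The five lexicons above differ only in which context fields make up a lexical entry. As a concrete illustration, the following sketch collects all five entry types in one pass over a training corpus; the corpus representation, the function name, and the cutoff parameter are assumptions for illustration rather than the paper's implementation (only the at-least-3-occurrences cutoff for the Section 3.5 lexicon comes from the text).

```python
from collections import Counter

def extract_lexicons(corpus, min_count_35=3):
    """Collect the five context-dependent lexicons of Sections 3.1-3.5.

    `corpus` is assumed to be an iterable of sentences, each a list of
    (word_formation_pattern, word) pairs; this representation and all
    names are illustrative, not the paper's implementation.
    """
    lexicons = {key: Counter() for key in ("3.1", "3.2", "3.3", "3.4", "3.5")}
    for sentence in corpus:
        for i, (p, w) in enumerate(sentence):
            lexicons["3.1"][(p, w)] += 1                      # p_i w_i
            if i > 0:
                p_prev, w_prev = sentence[i - 1]
                lexicons["3.2"][(p_prev, p)] += 1             # p_{i-1} p_i
                lexicons["3.3"][(p_prev, w_prev, p)] += 1     # p_{i-1} w_{i-1} p_i
                lexicons["3.4"][(p_prev, p, w)] += 1          # p_{i-1} p_i w_i
                lexicons["3.5"][(p_prev, w_prev, p, w)] += 1  # p_{i-1} w_{i-1} p_i w_i
    # Section 3.5: due to memory limits, keep only entries occurring >= 3 times.
    lexicons["3.5"] = Counter({entry: count
                               for entry, count in lexicons["3.5"].items()
                               if count >= min_count_35})
    return lexicons
```

Merging these counters into a single lexicon is what the error-driven learning of Section 4 then prunes.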
<Section position="5" start_page="211" end_page="211" type="metho">
<SectionTitle> 4 Error-Driven Learning </SectionTitle>
<Paragraph position="0"> In order to reduce the size of the lexicon effectively, an error-driven learning approach is adopted to examine the effectiveness of lexical entries and to make it possible to further improve the chunking accuracy by merging all the above context-dependent lexicons into a single lexicon.</Paragraph>
<Paragraph position="1"> For a new lexical entry $e$, the effectiveness is measured by the reduction in error that results from adding the lexical entry to the lexicon: $\Delta(e) = \mathrm{Error}_{\Phi}(e) - \mathrm{Error}_{\Phi \cup \Phi'}(e)$. Here, $\mathrm{Error}_{\Phi}(e)$ is the chunking error number of the lexical entry $e$ for the old lexicon $\Phi$, and $\mathrm{Error}_{\Phi \cup \Phi'}(e)$ is the chunking error number of $e$ for the new lexicon $\Phi \cup \Phi'$ ($\Phi'$ is the list of new lexical entries added to the old lexicon $\Phi$). If $\Delta(e) > 0$, we define the lexical entry $e$ as positive for $\Phi'$ and keep it; otherwise, it is negative and discarded.</Paragraph>
</Section>
<Section position="6" start_page="211" end_page="211" type="metho">
<SectionTitle> 5 Implementation </SectionTitle>
<Paragraph position="0"> In the training process, only words that occur at least 5 times are kept in the training corpus and in the word table, while the less frequently occurring words are separated into short words (most of these short words are single-character words) to simulate the chunking. That is, a less frequent word is regarded as chunked from several short words.</Paragraph>
<Paragraph position="1"> In the word tokenization process, chunking-based Chinese word tokenization can be implemented as follows (a code sketch follows this section): 1) Given an input sentence, a lattice of word and word formation pattern pairs is generated by scanning the sentence from left to right, looking up the word table to determine all the possible words, and determining the word formation pattern for each possible word.</Paragraph>
<Paragraph position="2"> 2) The Viterbi algorithm is applied to decode the lattice and find the most probable tag sequence.</Paragraph>
<Paragraph position="3"> 3) In this way, the given sentence is chunked into words, with word category information discarded.</Paragraph>
</Section>
</Paper>
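A minimal sketch of the three-step procedure above, assuming a word table that maps known words to their word formation patterns and a hypothetical `score` function standing in for the lexicon-based estimates of Sections 3 and 4; all names and the scoring interface are illustrative assumptions, not the paper's code.

```python
import math

def tokenize(sentence, word_table, score, max_word_len=6):
    """Sketch of the tokenization procedure in Section 5.

    `word_table` maps known words to their word formation patterns;
    `score(prev, cur)` is a hypothetical log-probability of the current
    (pattern, word) pair given the previous one (`prev` is None at the
    sentence start). All names here are illustrative.
    """
    n = len(sentence)
    # Step 1: build the lattice of (start, word, pattern) edges by scanning
    # the sentence left to right and looking up the word table.
    lattice = [[] for _ in range(n + 1)]        # lattice[end] = edges ending at `end`
    for start in range(n):
        for end in range(start + 1, min(start + max_word_len, n) + 1):
            word = sentence[start:end]
            if end - start == 1 or word in word_table:
                # Unknown single characters are treated as single-character words.
                pattern = word_table.get(word, "SINGLE")
                lattice[end].append((start, word, pattern))
    # Step 2: Viterbi decoding over the lattice.
    best = [(-math.inf, None)] * (n + 1)        # best[pos] = (log-prob, back edge)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start, word, pattern in lattice[end]:
            prev_logp, prev_edge = best[start]
            prev = (prev_edge[2], prev_edge[1]) if prev_edge else None
            logp = prev_logp + score(prev, (pattern, word))
            if logp > best[end][0]:
                best[end] = (logp, (start, word, pattern))
    # Step 3: follow the back-pointers to recover the word sequence.
    words, pos = [], n
    while pos > 0:
        start, word, _ = best[pos][1]
        words.append(word)
        pos = start
    return list(reversed(words))
```

A full implementation would decode structural chunk tags rather than a plain segmentation; the sketch collapses that detail into the choice of lattice edges and the scoring interface.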