File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/w03-1711_metho.xml
Size: 7,550 bytes
Last Modified: 2025-10-06 14:08:37
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1711"> <Title>A Chinese Efficient Analyser Integrating Word Segmentation, Part-Of-Speech Tagging, Partial Parsing and Full Parsing</Title> <Section position="5" start_page="3" end_page="111" type="metho"> <SectionTitle> 3 HMM-based Tagger </SectionTitle> <Paragraph position="0"> The Chinese efficient analyser is based on the HMM-based tagger described in Zhou et al 2000a.</Paragraph> <Paragraph position="1"> Given a token sequence G , the goal of tagging is to find a stochastic optimal tag sequence that maximizes</Paragraph> <Paragraph position="3"/> <Paragraph position="5"/> <Paragraph position="7"> )|(log )(log)(log)|(log Both the first and second items correspond to the language model component of the tagger. We will not discuss these two items further in this paper since they are well studied in ngram modeling. This paper will focus on the third item , which is the main difference between our tagger and other HMM-based taggers. Ideally, it can be estimated by using the forward-backward algorithm (Rabiner 1989) recursively for the first-order (Rabiner 1989) or second-order HMMs (Watson et al 1992). To simplify the complexity, several context dependent approximations on it will be attempted in this paper instead, as detailed in sections 3 and 4.</Paragraph> <Paragraph position="9"> All of this modelling would be for naught were it not for the existence of an efficient algorithm for finding the optimal state sequence, thereby &quot;decoding&quot; the original sequence of tags. The stochastic optimal tag sequence can be found by maximizing the previous equation over all the possible tag sequences. This is implemented via the well-known Viterbi algorithm (Viterbi 1967) by using dynamic programming and an appropriate merging of multiple theories when they converge on a particular state. Since we are interested in recovering the tag state sequence, we pursue 16 theories at every given step of the algorithm.</Paragraph> </Section> <Section position="6" start_page="111" end_page="111" type="metho"> <SectionTitle> 4 Word Segmentation and POS Tagging </SectionTitle> <Paragraph position="0"> Traditionally, in Chinese Language Processing, word segmentation and POS tagging are implemented sequentially. That is, the input Chinese sentence is segmented into words first and then the segmented result (in the form of word lattice or N-best word sequences) is passed to POS tagging component. However, this processing strategy has following disadvantages: * The word lexicons used in word segmentation and POS tagging may be different. This difference is difficult to overcome and largely drops the system accuracy although different optimal algorithms may be applied to word segmentation and POS tagging.</Paragraph> <Paragraph position="1"> * With speed in consideration, the two-stage processing strategy is not efficient.</Paragraph> <Paragraph position="2"> Therefore, we apply the strategy of integrating word segmentation and POS tagging in a single stage. 
<Paragraph position="6"> In order to overcome the coarse n-gram models caused by the limited number of original POS tags used in the current Chinese POS tag bank (corpus), a word clustering algorithm (Bai et al. 1998) is first applied to classify words into classes, and then the N (e.g. N=500) most frequently occurring word class and POS pairs are added to the original POS tag set to achieve more accurate models. For example, ADJ(<Xu Duo >) represents a special POS tag ADJ which pairs with the word class <Xu Duo >. Here, <Xu Duo > is a word class label; for convenience and clarity, we use the most frequently occurring word in a word class as the label representing that word class.</Paragraph> </Section>
<Section position="7" start_page="111" end_page="111" type="metho"> <SectionTitle> 5 Partial Parsing and Full Parsing </SectionTitle>
<Paragraph position="0"> As discussed in section 2, partial parsing can have different levels, and full parsing can be achieved by cascading several levels of partial parsing (e.g. three levels of cascaded partial parsing achieve full parsing for the example shown in Figure 1).</Paragraph>
<Paragraph position="1"> In this paper, a given level (say, the l-th level) of partial parsing is implemented via a chunking model, built on the HMM-based tagger described in section 3, with the (l-1)-th level 3-tuple sequence as input. That is, for the l-th level of partial parsing, the chunking model takes the (l-1)-th level 3-tuple sequence as input (here, each 3-tuple pairs a chunk category with a POS tag and a word, e.g. NP(NN, Guo Jia)). In the meantime, the chunk tag t used in the chunking model is structural and consists of the following three parts:</Paragraph>
<Paragraph position="3"> * Boundary Category B: a set of four values 0, 1, 2, 3, where &quot;0&quot; means that the current 3-tuple is a whole chunk, &quot;1&quot; means that the current 3-tuple is at the beginning of a chunk, &quot;2&quot; means that the current 3-tuple is in the middle of a chunk and &quot;3&quot; means that the current 3-tuple is at the end of a chunk.</Paragraph>
<Paragraph position="4"> * Chunk Category C: used to denote the output chunk category of the chunking model, which includes the normal chunks and the special chunk (&quot;.&quot;). The reason for including the special chunk is that some 3-tuples in the input sequence may not be chunked at the current chunking stage.</Paragraph>
<Paragraph position="5"> * POS Category POS: because of the limited number of boundary categories and output chunk categories, the POS category is added to the structural tag to represent more accurate models.</Paragraph>
<Paragraph position="6"> Therefore, a chunk tag t can be represented as B_C_POS (e.g. 1_S_NN), where B is the boundary type of t, C is the output chunk type of t and POS is the POS type of t. A small sketch of how such tags drive one level of chunking is given below.</Paragraph>
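To illustrate the structural tag just defined, here is a minimal sketch of one level of chunking: it parses B_C_POS tags and groups an input 3-tuple sequence into the next level's 3-tuple sequence. Only the tag format, the boundary values 0-3 and the special chunk "." come from the text above; the Triple container and the rule that the last element of a chunk supplies its representative POS and word are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Triple:
    chunk: str   # chunk (or special-chunk) category
    pos: str     # POS category
    word: str    # word (or the text the chunk covers)

def parse_tag(tag):
    """Split a structural tag such as '1_S_NN' into (B, C, POS)."""
    b, c, pos = tag.split("_")
    return int(b), c, pos

def apply_chunk_tags(triples, tags):
    """One level of chunking: group the input 3-tuple sequence according to
    the boundary categories (0 whole chunk, 1 begin, 2 middle, 3 end) and
    emit the next level's 3-tuple sequence.  Elements tagged with the
    special chunk '.' pass through unchunked."""
    output, buffer = [], []
    for triple, tag in zip(triples, tags):
        b, c, pos = parse_tag(tag)
        if c == ".":
            output.append(triple)
            continue
        buffer.append(triple)
        if b in (0, 3):                       # chunk is complete
            head = buffer[-1]                 # placeholder choice of representative
            output.append(Triple(chunk=c, pos=head.pos, word=head.word))
            buffer = []
    return output

# 2nd-level sequence from the paper's example: NP(NN, Guo Jia) VP(VB, Cun Zai)
level2 = [Triple("NP", "NN", "国家"), Triple("VP", "VB", "存在")]
# The output tag sequence 1_S_NN 3_S_VB groups both elements into one S chunk:
print(apply_chunk_tags(level2, ["1_S_NN", "3_S_VB"]))
# -> [Triple(chunk='S', pos='VB', word='存在')]  i.e. the 3rd-level sequence S(VB, Cun Zai)
```

How the representative POS and word of a new chunk are chosen is not specified in the text; the last-element rule above is only a placeholder that happens to reproduce the S(VB, Cun Zai) example.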
<Paragraph position="7"> Obviously, there exist some constraints between adjacent chunk tags t_{i-1} and t_i on the boundary categories and output chunk categories, as summarised in Table 1, where &quot;valid&quot;/&quot;invalid&quot; means that the chunk tag pair is valid/invalid, while &quot;valid on&quot; means that the pair is valid only under the condition given in Table 1.</Paragraph>
<Paragraph position="8"> For example, in the 3rd-level partial parsing, the input 3-tuple sequence is the 2nd-level 3-tuple sequence NP(NN, Guo Jia ) VP(VB, Cun Zai ) and the output tag sequence is 1_S_NN 3_S_VB, from which the 3rd-level 3-tuple sequence S(VB, Cun Zai ) is derived. In this way, a fully parsed tree is reached. * In the cascaded chunking procedure, the necessary information is stored for back-tracing. Partially/fully parsed trees can be constructed by tracing from the final 3-tuple sequence back to the 0th-level 3-tuple sequence. Different levels of partial parsing can be achieved according to the needs of the application.</Paragraph> </Section> </Paper>
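As a closing illustration of the cascaded chunking procedure and the back-tracing it relies on, here is a minimal sketch in which every level's output sequence is stored and the per-level chunker is replaced by canned tag sequences (in the paper, the HMM-based chunking model produces these tags at every level). The tree representation, the number of levels and the chosen tag sequences are illustrative assumptions; only the B_C_POS format, the boundary values and the Guo Jia / Cun Zai example come from the section above.

```python
# A node is (category, children); children are nested nodes or word strings.
def leaf(pos, word):
    return (pos, [word])

def cascade(level0, tag_sequences):
    """Run several levels of chunking.  level0 is the 0th-level sequence of
    (POS, [word]) nodes; tag_sequences holds one structural B_C_POS tag
    sequence per level, standing in for the HMM chunker's output.  Every
    intermediate sequence is stored in `levels`, which is the information
    kept for back-tracing (the lower-level nodes are simply nested inside
    each new chunk)."""
    levels = [level0]
    for tags in tag_sequences:
        output, buffer = [], []
        for node, tag in zip(levels[-1], tags):
            b, c, pos = tag.split("_")
            if c == ".":               # special chunk: not chunked at this level
                output.append(node)
                continue
            buffer.append(node)
            if b in ("0", "3"):        # a whole chunk, or the end of a chunk
                output.append((c, buffer))
                buffer = []
        levels.append(output)
    return levels

def show(node):
    """Render a node as a bracketed tree, e.g. S(NP(NN(...)) VP(VB(...)))."""
    category, children = node
    parts = [c if isinstance(c, str) else show(c) for c in children]
    return f"{category}({' '.join(parts)})"

# 0th-level sequence for the example: (NN, Guo Jia) (VB, Cun Zai)
level0 = [leaf("NN", "国家"), leaf("VB", "存在")]
# Hypothetical chunker outputs for two cascaded levels (the paper's example
# reaches the full parse after three): level 1 builds NP and VP, level 2 builds S.
levels = cascade(level0, [["0_NP_NN", "0_VP_VB"], ["1_S_NN", "3_S_VB"]])
print(show(levels[-1][0]))   # S(NP(NN(国家)) VP(VB(存在))) -- the fully parsed tree
print(len(levels))           # 3 stored sequences: the 0th, 1st and 2nd levels
```

Because every intermediate sequence is kept in `levels`, any desired degree of partial parsing can be read off directly, mirroring the back-tracing from the final 3-tuple sequence to the 0th-level sequence described above.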