<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0129"> <Title>Character Language Models for Chinese Word Segmentation and Named Entity Recognition</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Named Entity Recognition </SectionTitle> <Paragraph position="0"> Named entities consist of proper noun mentions of persons (PER), locations (LOC), and organizations (ORG). Two training corpora were provided. Each line consists of a single character, a single space character, and then a tag. The tags were in the standard BIO (begin/in/out) encoding. B-PER tags the first character in a person entity, I-PER tags subsequent characters in a person, and O tags characters not part of entities. We segmented the data into sentences by taking Unicode character 0x3002, which is rendered as a baseline-aligned small circle, as marking end of sentence (EOS). As judged by our own sentence numbers (see Figures 1 and 2), this missed around 20% of the sentence boundaries in the City U NE corpus and 5% of the boundaries in the Microsoft NE corpus. Test data is in the same format as the word segmentation task.</Paragraph> </Section> <Section position="5" start_page="0" end_page="171" type="metho"> <SectionTitle> 3 LingPipe </SectionTitle> <Paragraph position="0"> LingPipe is a Java-based natural language processing toolkit distributed with source code by Alias-i (2006). For this bakeoff, we used two LingPipe packages, com.aliasi.spell for Chinese word segmentation and com.aliasi.chunk for named-entity extraction. Both of these depend on the character language modeling package com.aliasi.lm, and the chunker also depends on the hidden Markov model package com.aliasi.hmm. 
The experiments reported in this paper were carried out in May 2006 using (a prerelease version of) LingPipe 2.3.0.</Paragraph> <Section position="1" start_page="0" end_page="169" type="sub_section"> <SectionTitle> 3.1 LingPipe's Character Language Models </SectionTitle> <Paragraph position="0"> LingPipe provides n-gram based character language models with a generalized form of Witten-Bell smoothing, which performed better than other approaches to smoothing in extensive English trials (Carpenter 2005). Language models provide a probability distribution $P(\sigma)$ defined for strings $\sigma \in \Sigma^*$ over a fixed alphabet of characters $\Sigma$. We begin with Markovian language models normalized as random processes. This means the sum of the probabilities for strings of a fixed length is 1.0.</Paragraph> <Paragraph position="1"> The chain rule factors $P(\sigma c) = P(\sigma) \cdot P(c \mid \sigma)$ for a character $c$ and string $\sigma$. The n-gram Markovian assumption restricts the context to the previous $n-1$ characters, taking $P(c_n \mid \sigma c_1 \cdots c_{n-1}) = P(c_n \mid c_1 \cdots c_{n-1})$.</Paragraph> <Paragraph position="2"> The maximum likelihood estimator for n-grams is $\hat{P}_{ML}(c \mid \sigma) = \mathrm{count}(\sigma c) / \mathrm{extCount}(\sigma)$, where $\mathrm{count}(\sigma)$ is the number of times the sequence $\sigma$ was observed in the training data and $\mathrm{extCount}(\sigma)$ is the number of single-character extensions of $\sigma$ observed: $\mathrm{extCount}(\sigma) = \sum_c \mathrm{count}(\sigma c)$.</Paragraph> <Paragraph position="3"> Witten-Bell smoothing uses linear interpolation to form a mixture model of all orders of maximum likelihood estimates down to the uniform estimate $P_U(c) = 1/|\Sigma|$. The interpolation ratio $\lambda(d\sigma)$ ranges between 0 and 1 depending on the context $d\sigma$, a character $d$ followed by a string $\sigma$:</Paragraph> <Paragraph position="4"> $\hat{P}(c \mid d\sigma) = \lambda(d\sigma) \, \hat{P}_{ML}(c \mid d\sigma) + (1 - \lambda(d\sigma)) \, \hat{P}(c \mid \sigma)$</Paragraph> <Paragraph position="5"> Generalized Witten-Bell smoothing defines the interpolation ratio with a hyperparameter $\theta$:</Paragraph> <Paragraph position="6"> $\lambda(\sigma) = \mathrm{extCount}(\sigma) / (\mathrm{extCount}(\sigma) + \theta \cdot \mathrm{numExts}(\sigma))$</Paragraph> <Paragraph position="7"> We take $\mathrm{numExts}(\sigma) = |\{c \mid \mathrm{count}(\sigma c) > 0\}|$ to be the number of different symbols observed following $\sigma$ in the training data. The original Witten-Bell estimator set the hyperparameter $\theta = 1$. 
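</Paragraph> <Paragraph position="8"> The interpolation scheme described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not LingPipe's implementation: the class name WittenBellCharLM and its methods are hypothetical, the interpolation ratio is assumed to be extCount / (extCount + theta * numExts) for the hyperparameter theta, and an unseen context is assumed to back off entirely to the next shorter context.

```python
from collections import Counter


class WittenBellCharLM:
    """Character n-gram model with generalized Witten-Bell smoothing (illustrative sketch)."""

    def __init__(self, order, num_chars, theta=None):
        self.order = order              # n-gram length n
        self.num_chars = num_chars      # assumed alphabet size for the uniform base estimate
        # Assumption: default theta to the n-gram order, as the text says LingPipe does.
        self.theta = float(order) if theta is None else float(theta)
        self.count = Counter()          # count(context + c), keyed by (context, c)
        self.ext_count = Counter()      # extCount(context)
        self.num_exts = Counter()       # numExts(context)

    def train(self, text):
        for i, c in enumerate(text):
            start = max(0, i - (self.order - 1))
            # Record c after every context suffix, from n-1 characters down to the empty context.
            for j in range(start, i + 1):
                ctx = text[j:i]
                if self.count[(ctx, c)] == 0:
                    self.num_exts[ctx] += 1
                self.count[(ctx, c)] += 1
                self.ext_count[ctx] += 1

    def prob(self, c, context):
        # Markov assumption: keep only the last n-1 characters of the context.
        keep = min(len(context), self.order - 1)
        return self._interpolate(c, context[len(context) - keep:])

    def _interpolate(self, c, ctx):
        # Lower-order estimate: drop the oldest character, or use uniform below the empty context.
        lower = self._interpolate(c, ctx[1:]) if ctx else 1.0 / self.num_chars
        ext = self.ext_count[ctx]
        if ext == 0:
            return lower                # unseen context contributes nothing (lambda = 0)
        lam = ext / (ext + self.theta * self.num_exts[ctx])
        ml = self.count[(ctx, c)] / ext
        return lam * ml + (1.0 - lam) * lower


lm = WittenBellCharLM(order=3, num_chars=26)
lm.train("abracadabra")
print(lm.prob("r", "ab"))   # character seen after this context
print(lm.prob("z", "ab"))   # unseen character: all mass comes from the uniform base
```

For a fixed context, summing prob over every character of the assumed 26-letter alphabet should give 1.0, which is the process normalization described above.</Paragraph> <Paragraph position="9">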
LingPipe's default sets the hyperparameter $\theta$ equal to the n-gram order.</Paragraph> </Section> <Section position="2" start_page="169" end_page="169" type="sub_section"> <SectionTitle> 3.2 Noisy Channel Spelling Correction </SectionTitle> <Paragraph position="0"> LingPipe performs spelling correction with a noisy-channel model. A noisy-channel model consists of a source model $P_s(u)$ defining the probability of message $u$, coupled with a channel model $P_c(s \mid u)$ defining the likelihood of a signal $s$ given a message $u$. In LingPipe, the source model $P_s$ is a character language model. The channel model $P_c$ is a (probabilistically normalized) weighted edit distance (with transposition). LingPipe's decoder finds the most likely message $u$ to have produced a signal $s$: $\arg\max_u P(u \mid s) = \arg\max_u P_s(u) \cdot P_c(s \mid u)$.</Paragraph> <Paragraph position="1"> For spelling correction, the channel $P_c(s \mid u)$ is a model of what is likely to be typed given an intended message. Uniform models work fairly well and ones tuned to brainos and typos work even better. The source model is typically estimated from a corpus of ordinary text.</Paragraph> <Paragraph position="2"> For Chinese word segmentation, the source model is trained over the corpus with spaces inserted. The noisy channel deterministically eliminates spaces so that $P_c(s \mid u) = 1.0$ if $s$ is identical to $u$ with all of the spaces removed, and 0.0 otherwise. This channel is easily implemented as a weighted edit distance where deletion of a single space is 100% likely (log probability edit &quot;cost&quot; is zero) and matching a character is 100% likely, with any other operation being 0% likely (infinite cost). This makes any segmentation equally likely according to the channel model, reducing decoding to finding the highest-likelihood hypothesis consisting of the test string with spaces inserted. This approach reduces to the cross-entropy/compression-based approach of Teahan et al. (2000). 
Experiments showed that skewing these space-insertion/matching probabilities reduces decoding accuracy.</Paragraph> </Section> <Section position="3" start_page="169" end_page="171" type="sub_section"> <SectionTitle> 3.3 LingPipe's Named Entity Recognition </SectionTitle> <Paragraph position="0"> LingPipe 2.1 introduced a hidden Markov model interface with several decoders: first-best (Viterbi), n-best (Viterbi forward, A* backward with exact Viterbi estimates), and confidence-based (forward-backward).</Paragraph> <Paragraph position="1"> LingPipe 2.2 introduced a chunking implementation that codes a chunking problem as an HMM tagging problem using a refinement of the standard BIO coding. The refinement both introduces context and greatly simplifies confidence estimation over the approach using standard BIO coding of Culotta and McCallum (2004). The tags are B-T for the first character in a multi-character entity of type T, M-T for a middle character in a multi-character entity, E-T for the end character in a multi-character entity, and W-T for a single-character entity. The out tags are similarly contextualized, with additional information on the start/end tags to model their context. Specifically, the tags used are B-O-T for a character not in an entity following an entity of type T, I-O for any middle character not in an entity, E-O-T for a character not in an entity but preceding a character in an entity of type T, and W-O-T for a character that is a single character between two entities, the following entity being of type T. Finally, the first tag is conditioned on the begin-of-sentence tag (BOS) and after the last tag, the end-of-sentence tag (EOS) is generated. Thus the probabilities normalize to model string/tag joint probabilities. In the HMM implementation considered here, transitions between states (tags) in the HMM are modeled by a maximum likelihood estimate over the training data. 
Tag emissions are generated by bounded character language models. Rather than the process estimate P(X), we use P(X#|#), where # is a distinguished boundary character not in the training or test character sets. We also train with boundaries. For Chinese at the character level, this bounding is irrelevant as all tokens are length 1, so probabilities are already normalized and there is no contextual position to take account of within a token. In the more usual word-tokenized case, it normalizes probabilities over all strings and accounts for the special status of prefixes and suffixes (e.g. capitalization, inflection). Consider the chunking consisting of the string &quot;John J. Smith lives in Seattle.&quot; with &quot;John J. Smith&quot; a person mention and &quot;Seattle&quot; a location mention. In the coded HMM model, the joint estimate is:</Paragraph> <Paragraph position="3"> LingPipe 2.3 introduced an n-best chunking implementation that adapts an underlying n-best chunker via rescoring. In rescoring, each of the n-best outputs is scored on its own and the new best output is returned. The rescoring model is a longer-distance generative model that produces alternating out/entity tags for all characters. The joint probability of the specified chunking is:</Paragraph> <Paragraph position="5"> where each estimator is a character language model, and where the $c_T$ are distinct characters not in the training/test sets that encode begin-of-sentence (BOS), end-of-sentence (EOS), and type (e.g. PER, LOC, ORG). In words, we generate an alternating sequence of OUT and type estimates, starting and ending with an OUT estimate. We begin by conditioning on the begin-of-sentence tag. Because the first character is in an entity, we do not generate any text, but rather generate a character indicating that we are done generating the OUT characters and ready to switch to generating person characters. We then generate the phrase John J. 
Smith in the person model; note that type estimates always begin and end with the $c_{OUT}$ character, essentially making them bounded models. After generating the name and the character to end the entity, we revert to generating more out characters, starting from a person and ending with a location. Note that we are generating the phrase &quot;lives in&quot; including the preceding and following space. All such spaces are generated in the OUT models for English; there are no spaces in the Chinese input. Next, we generate the location phrase the same way as the person phrase. Finally, we generate the final period in the OUT model and then the end-of-sentence symbol. Note that the OUT category's language model shoulders the brunt of the burden of estimating contextual effects. It conditions on the preceding type, so that the likelihood of &quot;lives in&quot; is conditioned on following a person entity. Furthermore, the choice to begin an entity of type location is based on the fact that it follows &quot;lives in&quot;. This includes begin-of-sentence and end-of-sentence effects, so the model is sensitive to initial capitalization in the out model as a distribution of character sequences likely to follow BOS. Similarly, the end-of-sentence marker is conditioned on the preceding text, in this case a single period. The resulting model defines a (properly normalized) joint probability distribution over chunkings.</Paragraph> </Section> </Section> <Section position="6" start_page="171" end_page="171" type="metho"> <SectionTitle> 4 Held-out Parameter Tuning </SectionTitle> <Paragraph position="0"> We ran preliminary tests on MUC 6 English and City University of Hong Kong data for Chinese and found baseline performance around 72% and rescored performance around 82%. The underlying model was designed to have good recall in generating hypotheses. Over 99% of the MUC test sentences had their correct analysis in a 1024-best list generated by the underlying model. 
Nevertheless, setting the number of hypotheses beyond 64 did not improve results in either English or Chinese, so we reported runs with n-best set to 64.</Paragraph> <Paragraph position="1"> We believe this is because the two language-model based approaches make highly correlated ranking decisions based on character n-grams.</Paragraph> <Paragraph position="2"> Held-out scores peaked with 5-grams for Chinese; 3-grams and 4-grams were not much worse and longer n-grams performed nearly identically.</Paragraph> <Paragraph position="3"> We used 7500 as the number of distinct characters, though results are not at all sensitive to this parameter within an order of magnitude. We used LingPipe's default of setting the interpolation parameter equal to the n-gram length; for the final evaluation $\theta = 5.0$. Higher interpolation ratios favor precision over recall, lower ratios favor recall. Values within an order of magnitude performed within 1% F-measure and 2% precision/recall.</Paragraph> </Section> <Section position="7" start_page="171" end_page="171" type="metho"> <SectionTitle> 5 Bakeoff Time and Effort </SectionTitle> <Paragraph position="0"> The total time spent on this SIGHAN bakeoff was about 2 hours for the word segmentation task and 10 hours for the named-entity task (not including writing this paper). We started from a working word segmentation system for the last SIGHAN.</Paragraph> <Paragraph position="1"> Most of the time was spent munging entity data, with the rest devoted to held-out analysis. The final code was roughly one page per task, with only a dozen or so LingPipe-specific lines. The final run, including unpacking, training and testing, took 45 minutes on a 512MB home PC; most of the time was named-entity decoding.</Paragraph> </Section> </Paper>