<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1718">
  <Title>Single Character Chinese Named Entity Recognition</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 SCNE Recognition and Related Work
</SectionTitle>
    <Paragraph position="0"> We consider three types of SCNE in this paper:  single character location name (SCL), person name (SCP), and organization name (SCO). Below are examples: 1. SCL: &amp;quot;a0&amp;quot;and &amp;quot;a1&amp;quot; in &amp;quot;a0a1a4a5&amp;quot; 2. SCP: &amp;quot;a0&amp;quot; (zhou1, Zhou) in &amp;quot;a0a1a2&amp;quot; (zhou1-zong3-li3,Premier Zhou), 3. SCO: &amp;quot;a3&amp;quot; (guo2, Kuomingtang Party)</Paragraph>
    <Paragraph position="2"> Cooperation between Kuomingtang Part and Communist Party) SCNE is very common in written Chinese text. As shown in Table 1, SCNE accounts for 8.17% of all NE tokens on the 10MB corpus. Especially, 14.65% of location names are SCLs. However, due to the lack of research, SCNE is a major source of errors in NER. In our experiments described below, we focus on SCL and SCP, while SCO is not considered because of its small number in the data.</Paragraph>
    <Paragraph position="3">  To our knowledge, most NER systems do not report SCNE recognition results separately. Some systems (e.g. Liu, 2001) even do not include SCNE in recognition task. SCNE recognition is achieved using the same technologies as for NER, which can be roughly classified into rule-based methods and statistical-based methods, while most of state-of-the-art systems use hybrid approaches.</Paragraph>
    <Paragraph position="4"> Wang (1999) and Chen (1998) used linguistic rules to detect NE with the help of the statistics from dictionary. Ji(2001), Zheng (2000), Shen(1995) and Sun(1994) used statistics from dictionaries and large corpus to generate PN or LN candidates, and used linguistic rules to filter the result, and Yu (1998) used language model to filter. Liu (2001) applied statistical and linguistic knowledge alternatively in a seven-step procedure. Unfortunately, most of these results are incomparable due to the different test sets used, except the results of Chen (1998) and Yu (1998). They took part in Multilingual Entity Task (MET-2) on Chinese, held together with MUC-7.</Paragraph>
    <Paragraph position="5"> Between them, Yu (1998)'s results are slightly better. However, these two comparable systems did not report their results on SCNE separately. To evaluate our results, we compare with three state-of-the-art system we have. These systems include: MSWS, PBWS and LCWS. The former two are developed by Microsoft(r) and the last one comes from by Beijing Language University. null</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 SCNE Recognition Using an Improved Source-Channel Model
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Improved Source-Channel Model
</SectionTitle>
      <Paragraph position="0"> We first conduct SCNE recognition within a framework of improved source-channel models, which is applied to Chinese word segmentation.</Paragraph>
      <Paragraph position="1"> We define Chinese words as one of the following four types: (1) entries in a lexicon, (2) morphologically derived words, (3) named entity (NE),  and (4) factoid. Examples are 1. lexicon word: a7a8 (peng2-you3, friend).</Paragraph>
      <Paragraph position="2"> 2. morph-derived word: a9a9a10a10 (gao1-gao1xing4-xing4 , happily) 3. named entity: a11a12a13a14(wei1-ruan3-gong1si1, Microsoft Corporation) 4. factoid2: a15a16a17a18 (yi1-yue4-jiu3-ri4, Jan 9th)  Chinese NER is achieved within the framework. To make our later discussion on SCNE clear, we introduce the model briefly.</Paragraph>
      <Paragraph position="3"> We are given Chinese sentence S, which is a character string. For all possible word segmentations W, we will choose the one which achieves the highest conditional probability W* = argmax w P(W|S). According to Bayes' law and dropping the constant denominator, we acquire the following equation:  age, money, number (NUM), measure, e-mail, phone number, and WWW.</Paragraph>
      <Paragraph position="5"> Following our Chinese word definition, we define word class C as follows: (1) each lexicon word is defined as a class; (2) each morphologically derived word is defined as a class; (3) each type of named entities is defined as a class, e.g.</Paragraph>
      <Paragraph position="6"> all person names belong to a class PN, and (4) each type of factoids is defined as a class, e.g. all time expressions belong to a class TIME. We therefore convert the word segmentation W into a word class sequence C. Eq. 1 can then be rewritten as:</Paragraph>
      <Paragraph position="8"> Eq. 2 is the basic form of the source-channel models for Chinese word segmentation. The models assume that a Chinese sentence S is generated as follows: First, a person chooses a sequence of concepts (i.e., word classes C) to output, according to the probability distribution P(C); then the person attempts to express each concept by choosing a sequence of characters, according to the probability distribution P(S|C).</Paragraph>
      <Paragraph position="9"> We use different types of channel models for different types of Chinese words. This brings several advantages. First, different linguistic constraints can be easily added to corresponding channel models (see Figure 1). These constraints can be dynamic linguistic knowledge acquired through statistics or intuitive rules compiled by linguists. Second, this framework is data-driven, which makes it easy to adapt to other languages.</Paragraph>
      <Paragraph position="10"> We have three channel models for PN, LN and ON respectively. (see Figure 1) However, although Eq. 2 suggests that channel model probability and source model probability can be combined through simple multiplication, in practice some weighting is desirable. There are two reasons. First, some channel models are poorly estimated, owing to the sub-optimal assumptions we make for simplicity and the insufficiency of the training corpus. Combining the channel model probability with poorly estimated source model probabilities according to Eq. 2 would give the context model too little weight. Second, as seen in Figure 1, the channel models of different word classes are constructed in different ways (e.g. name entity models are n-gram models trained on corpora, and factoid models are compiled using linguistic knowledge). Therefore, the quantities of channel model probabilities are likely to have vastly different dynamic ranges among different word classes. One way to balance these probability quantities is to add several channel model weight CW, each for one word class, to adjust the channel model probability P(S|C) to P(S|C)CW. In our experiments, these weights are determined empirically on a development set.</Paragraph>
      <Paragraph position="11"> Given the source-channel models, the procedure of word segmentation involves two steps: first, given an input string S, all word candidates are generated (and stored in a lattice). Each candidate is tagged with its class and the probability P(S'|C), where S' is any substring of S. Second, Viterbi search is used to select (from the lattice) the most probable word segmentation (i.e. word class sequence C*) according to Eq. 2.</Paragraph>
      <Paragraph position="12"> Word class Channel model Linguistic Constraints</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Improved Model for SCNE Recognition
</SectionTitle>
      <Paragraph position="0"> Although our results show that the source-channel models achieve the state-of-the-art word segmentation performance, they cannot handle SCNE very well. Error analysis shows that 11.6% person name errors come from SCP, and 47.7% location names come from SCL. There are two reasons accounting for it: First, SCNE is generated in a different way from that of multi-character NE. Second, the context of SCNE is different from other NE. For example, SCNE usually appears one after another such as &amp;quot;a0a1 a2a3&amp;quot;. But this is not the case for multi-character NE.</Paragraph>
      <Paragraph position="1"> To solve the first problem, we add two new channel models to Figure 1, that is, define each type of SCNE (i.e. SCL and SCP) as a individual class (i.e. NE_SCL and NE_SCP) with its channel probability P(Sj |NE_SCL), and P(Sj |NE_SCP). P(Sj |NE_SCL) is calculated by Eq. 3.</Paragraph>
      <Paragraph position="3"> Here, Sj is a character in SCL list which is extracted from training corpus. |SCL(Sj) |is the number of tokens Sj , which are labeled as SCL in training corpus. n is the size of SCL list, which includes 177 SCL. Similarly, P(Sj |NE_SCP) is calculated by Eq. 4, and the SCP list includes 151 SCP.</Paragraph>
      <Paragraph position="5"> We also use two CW to balance their channel probabilities with other NE's.</Paragraph>
      <Paragraph position="6"> To solve the second problem, we trained a new source model P(C) on the re-annotated training corpus, where all SCNE are tagged by SCL or SCP. For example, &amp;quot;a4&amp;quot; in &amp;quot;a4a5a6&amp;quot;is tagged as SCP instead of PN, and &amp;quot;a0&amp;quot; in &amp;quot;a0a1&amp;quot; is tagged as SCL in stead of LN.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Character-based Classifiers
</SectionTitle>
    <Paragraph position="0"> In this section, SCNE recognition is formulated as a binary classification problem. Our motivations are two folds. First, most NER systems do not use source-channel model, so our method described in the previous section cannot be applied. However, if we define SCNE as a binary classification problem, it would be possible to build a separate recognizer which can be used together with any NER systems. Second, we are interested in comparing the performance of source-channel models with that of other methods. null For each Chinese character, a classifier is built to estimate how likely an occurrence of this Chinese character in a text is a SCNE. Some examples of these Chinese character as well as their probabilities of being a SCNE is shown in</Paragraph>
    <Paragraph position="2"> We can see that the probabilities of being a SCNE of many characters are very small. Thus, SCNE recognition is an 'unbalanced' classification problem. That is, in most cases, it is safer to assume that a character is not a SCNE.</Paragraph>
    <Paragraph position="3"> We construct two classifiers respectively based on two statistical models: maximum entropy model (ME) and vector space model (VSM). Local context characters (i.e. left or right characters within a window) are used as features.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Maximum Entropy
</SectionTitle>
      <Paragraph position="0"> ME provides a good framework to integrate various features from different knowledge sources. Each feature is typically represented as a binary constraint f. All features are then combined using a log-linear model shown in Eq. 5.</Paragraph>
      <Paragraph position="2"> where a0 i is a weight of the feature fi , and Z(x) is a normalization factor.</Paragraph>
      <Paragraph position="3"> Weights (a0 ) are estimated using the maximum entropy principle: to satisfy constraints on observed data and assume a uniform distribution (with the maximum entropy) on unseen data. The training algorithm we used is the improved iterative scaling (IIS) described in (Berger et al, 1996)3. The context features include six characters: three on the left of the SCNE, and three on the right. Given the context features, the ME classifier would estimate the probability of the candidate being a SCNE. In our example, we treat candidates with the probability larger than 0.5 as SCNEs. To get the precision-recall curve, we can vary the probability threshold from 0.1 to 0.9.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Vector Space Model
</SectionTitle>
      <Paragraph position="0"> VSM is another model we used to detect SCNE.</Paragraph>
      <Paragraph position="1"> Similar to ME, we use six surrounding characters as the features, as shown in Figure 2.</Paragraph>
      <Paragraph position="2"> Figure 2. Context window In this approach, we apply the standard tf-idf weighting technique with one minor adaptation: the same character appearing in different positions within the context window is considered as different terms. For example, character Cj appearing at position i, ia1{-3,-2,1,1,2,3}, is regarded as term Cji,. Term weighting of Cji is acquired with Eq.6.</Paragraph>
      <Paragraph position="4"> With this adaptation, we can apply an additional weighting coefficient PWi to different position, so as to reflect the importance of different positions.</Paragraph>
      <Paragraph position="5"> PWi is determined in a heuristic way as shown in Table.3 with the underlying principle that the closer the context character is to the SCNE candidate, the larger PWi is.</Paragraph>
      <Paragraph position="6">  A precision/recall curve can be obtained by multiplying a factor to one of the two cosine distances we get, before comparing them.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>