<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0110">
  <Title>Sydney, July 2006. c(c)2006 Association for Computational Linguistics Hybrid Models for Chinese Named Entity Recognition</Title>
  <Section position="5" start_page="111" end_page="111" type="metho">
    <SectionTitle>
1) Chinese Person Names Chunk Tags
</SectionTitle>
    <Paragraph position="0"> We use the Inside/Outside representation for proper chunks: B, the current token is the beginning of a chunk; I, the current token is inside a chunk; O, the current token is outside any chunk. A chunk here is a Chinese person name. Every character in the training set is given a tag y in {B, I, O}.</Paragraph>
    <Paragraph position="2"> Here, the pairwise multi-class decision method is selected.</Paragraph>
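The B/I/O representation above can be sketched as follows; the sentence and the name span are illustrative only, not taken from the paper's data.

```python
def bio_tags(chars, name_spans):
    """Assign B/I/O tags to a character sequence, given the (start, end)
    spans of person-name chunks (end is exclusive)."""
    tags = ["O"] * len(chars)
    for start, end in name_spans:
        tags[start] = "B"                  # first character of the chunk
        for i in range(start + 1, end):
            tags[i] = "I"                  # characters inside the chunk
    return tags

chars = list("对张帅说")                    # "say to Zhangshuai"
print(list(zip(chars, bio_tags(chars, [(1, 3)]))))
```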
    <Paragraph position="3"> 2) Features Extraction for Chinese Person</Paragraph>
    <Section position="1" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
Names
</SectionTitle>
      <Paragraph position="0"> Since Chinese person names are identified from segmented texts, word segmentation errors can lead to misidentified person names. We therefore break words into characters and extract features for every character. Table 1 summarizes the feature types and their values.</Paragraph>
      <Paragraph position="1">  The POS tag from the output of lexical analysis is subcategorized to include the position of the character in the word. The list of POS tags is shown in Table 2.</Paragraph>
      <Paragraph position="2">  If the character is a surname, the value is assigned to Y, otherwise assigned to N.</Paragraph>
      <Paragraph position="3"> The &amp;quot;character&amp;quot; is surface form of the character in the word.</Paragraph>
      <Paragraph position="4"> We extract all person names from the January 1998 issues of the People's Daily to set up a person names table, and calculate the frequency P(F) of every character F in that table over the training corpus: P(F) = (the number of occurrences of F as a character of person names) / (the total number of occurrences of F);</Paragraph>
      <Paragraph position="6"> if P(F) is greater than the given threshold, the value is assigned to Y, otherwise assigned to N.</Paragraph>
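The thresholded frequency feature can be sketched as below; the counts and the threshold value are toy assumptions, since the paper does not state its threshold here.

```python
def char_name_frequency(ch, name_char_counts, total_counts):
    """P(F): occurrences of ch inside person names / total occurrences of ch."""
    total = total_counts.get(ch, 0)
    if total == 0:
        return 0.0
    return name_char_counts.get(ch, 0) / total

def frequency_feature(ch, name_char_counts, total_counts, threshold=0.5):
    """Return 'Y' if P(F) exceeds the given threshold, otherwise 'N'."""
    return "Y" if char_name_frequency(ch, name_char_counts, total_counts) > threshold else "N"

# Toy counts: '张' appears 80 times in the corpus, 60 of them in person names.
name_counts = {"张": 60}
totals = {"张": 80, "说": 40}
print(frequency_feature("张", name_counts, totals))  # P(F)=0.75 -> 'Y'
print(frequency_feature("说", name_counts, totals))  # P(F)=0.0  -> 'N'
```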
      <Paragraph position="7"> We also use previous BIO-tags as features.</Paragraph>
      <Paragraph position="8"> Whether a character is inside a person name depends on the context of the character.</Paragraph>
      <Paragraph position="9"> Therefore, we use contextual information of two previous and two successive characters of the current character as features.</Paragraph>
      <Paragraph position="10"> Figure 1 shows an example of feature extraction for the i-th character. When training, the features of the character &amp;quot;Min&amp;quot; contain all the features surrounded by the frames. If the same sentence is used for testing, the same features are extracted.</Paragraph>
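A minimal sketch of the contextual window described above (two previous and two following characters around the current position); the sentence and the padding token are illustrative assumptions.

```python
def window_features(chars, i, window=2, pad="<PAD>"):
    """Surface-form features of the characters in a +/-2 window around
    position i, padding at sentence boundaries."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        feats["char[%+d]" % offset] = chars[j] if 0 <= j < len(chars) else pad
    return feats

chars = list("江泽民说")          # toy sentence; i=2 is the character "Min"
print(window_features(chars, 2))
```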
    </Section>
  </Section>
  <Section position="6" start_page="111" end_page="111" type="metho">
    <SectionTitle>
3) Choosing Kernel Functions
</SectionTitle>
    <Paragraph position="0"> Here, we choose a polynomial kernel function to build an optimal separating hyperplane.</Paragraph>
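A common form of the polynomial kernel is K(x, y) = (x·y + c)^d; the paper's exact constant term is not legible in this extraction, so c = 1 below is an assumption (the degree d = 2 matches the best setting reported later).

```python
def polynomial_kernel(x, y, d=2, c=1.0):
    """Polynomial kernel (x . y + c)^d over two feature vectors.
    c is an assumed constant; d is the polynomial degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + c) ** d

print(polynomial_kernel([1.0, 0.0, 1.0], [1.0, 1.0, 0.0], d=2))  # (1+1)^2 = 4.0
```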
    <Paragraph position="2"/>
    <Section position="1" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
2.3 Recognition of Chinese Location Names Based on SVM
</SectionTitle>
      <Paragraph position="0"> The identification process for location names is the same as that for person names except for the feature extraction. Table 3 summarizes the feature types and their values for location names extraction. The location names characteristic table is set up in advance; it includes the characters or words expressing the characteristics of location names, such as &amp;quot;sheng (province)&amp;quot;, &amp;quot;shi (city)&amp;quot; and &amp;quot;xian (county)&amp;quot;. If the character is in the location names characteristic table, the value is assigned to Y, otherwise assigned to N.</Paragraph>
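The table-lookup feature can be sketched as a set membership test; the three characters below are the examples named in the text, and the full table would be larger.

```python
# Characters indicating location names: sheng (province), shi (city),
# xian (county). Illustrative subset of the characteristic table.
LOCATION_CHARACTERISTIC_TABLE = {"省", "市", "县"}

def location_characteristic_feature(ch):
    """'Y' if the character is in the location names characteristic table."""
    return "Y" if ch in LOCATION_CHARACTERISTIC_TABLE else "N"

print(location_characteristic_feature("市"))  # 'Y'
print(location_characteristic_feature("说"))  # 'N'
```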
    </Section>
  </Section>
  <Section position="7" start_page="111" end_page="111" type="metho">
    <SectionTitle>
3 Statistical Models
</SectionTitle>
    <Paragraph position="0"> Many statistical models for NER have been presented (Zhang et al., 1992; Huang et al., 2003, etc.). In this section, we propose our statistical models for Chinese person name recognition and Chinese location name recognition.</Paragraph>
    <Section position="1" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
3.1 Chinese Person Names
</SectionTitle>
      <Paragraph position="0"> We define a function to evaluate the person name candidate PN. The evaluated function Total-Probability(PN) is composed of two parts: the lexical probability LP(PN) and contextual probability CP(PN) based on POS tags.</Paragraph>
      <Paragraph position="2"> where PN is the evaluated person name and a is the balance coefficient.</Paragraph>
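The interpolation equation itself is lost in this extraction; a linear combination is one form consistent with a "balance coefficient" and is assumed in the sketch below, not confirmed by the source (a = 0.4 is the value reported best for person names later in the paper).

```python
def total_probability(lp, cp, a=0.4):
    """Assumed linear interpolation of lexical probability LP and
    contextual probability CP with balance coefficient a; the paper's
    equation (5) is not legible here, so this exact form is a guess."""
    return a * lp + (1.0 - a) * cp

print(total_probability(0.8, 0.5, a=0.4))  # 0.4*0.8 + 0.6*0.5 = 0.62
```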
      <Paragraph position="3"> 1) lexical probability LP(PN) We establish the surname table (SurName) and the first name table (FirstName) from the students of year 1999 in a university, where the count for L is the number of occurrences of L as the single- or multiple-character surname of person names in SurName.</Paragraph>
      <Paragraph position="4">  Chinese person names have characteristic contextual POS tags in real Chinese texts; for example, in the phrase &amp;quot;dui Zhangshuai shuo (say to Zhangshuai)&amp;quot;, the POS tag before the person name &amp;quot;Zhangshuai&amp;quot; is a preposition and a verb occurs after the person name. We define the bigram contextual probability CP(PN) of the person name PN by the following equation:</Paragraph>
      <Paragraph position="6"> where lpos is the POS tag of the character before PN (called POS forward), rpos is the POS tag of the character after PN (called POS backward), and the numerator is the number of occurrences of PN as a person name whose POS forward is lpos and POS backward is rpos in the training corpus. The denominator is the total number of contextual POS-tag pairs of every person name in the whole training corpus.</Paragraph>
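The count-based estimate of CP can be sketched as below; the POS tag symbols and counts are toy assumptions.

```python
from collections import Counter

def contextual_probability(lpos, rpos, context_counts, total_contexts):
    """CP estimated from counts: occurrences of person names whose POS
    forward is lpos and POS backward is rpos, divided by the total number
    of contextual POS-tag pairs of person names in the training corpus."""
    if total_contexts == 0:
        return 0.0
    return context_counts.get((lpos, rpos), 0) / total_contexts

# Toy counts: person names occurred 100 times in training; 30 of them
# between a preposition ('p') and a verb ('v'), as in "dui Zhangshuai shuo".
counts = Counter({("p", "v"): 30, ("v", "u"): 10})
print(contextual_probability("p", "v", counts, 100))  # 0.3
```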
      <Paragraph position="8"/>
    </Section>
    <Section position="2" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
3.2 Chinese Location Names
</SectionTitle>
      <Paragraph position="0"> We also define a function to evaluate the location name candidate LN. The evaluated function TotalProbability(LN) is composed of two parts: the lexical probability LP (LN) and contextual probability CP (LN) based on POS tags.</Paragraph>
      <Paragraph position="2"> where LN is the evaluated location name and a is the balance coefficient.</Paragraph>
      <Paragraph position="3"> The evaluated location name LN consists of a first character, middle characters, and a last character S. The probability of the first character of LN is defined as the number of its occurrences</Paragraph>
      <Paragraph position="5"> as the first character of location names in the Chinese Location Names Record, divided by its total number of occurrences in the Chinese Location Names Record.</Paragraph>
      <Paragraph position="6"> The probability of the i-th middle character of LN is defined as the number of its occurrences</Paragraph>
      <Paragraph position="8"> as the i-th middle character of location names in the Chinese Location Names Record,</Paragraph>
      <Paragraph position="10"> divided by its total number of occurrences in the Chinese Location Names Record.</Paragraph>
      <Paragraph position="11"> The probability P(S) of the last character of LN is defined as</Paragraph>
      <Paragraph position="13"> the number of occurrences of S as the last character of location names in the Chinese Location Names Record, divided by the total number of occurrences of S in the Record; LP(LN) then combines these character probabilities, where Len(LN) is the length of the evaluated location name LN.</Paragraph>
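The equation combining these character probabilities is garbled in this extraction; since the text normalizes by Len(LN), a geometric mean over the per-character probabilities is one plausible reading and is assumed in this sketch, not confirmed by the source.

```python
import math

def lexical_probability(first_p, middle_ps, last_p):
    """Sketch of LP(LN): geometric mean of the first-, middle- and
    last-character probabilities. The geometric-mean form is an
    assumption; only the Len(LN) normalization is stated in the text."""
    ps = [first_p] + list(middle_ps) + [last_p]
    return math.prod(ps) ** (1.0 / len(ps))

print(round(lexical_probability(0.5, [0.5], 0.5), 6))  # 0.5
```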
      <Paragraph position="14"> 2) contextual probability based on POS tags CP(LN) Location names also have characteristic contextual POS tags in real Chinese texts; for example, in the phrase &amp;quot;zai Chongqing shi juxing (to be held in Chongqing)&amp;quot;, the POS tag before the location name &amp;quot;Chongqing&amp;quot; is a preposition and a verb occurs after the location name. We define the bigram contextual probability CP(LN) of the location name LN similarly to that of the person name PN in equation (9), with PN replaced by LN.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="111" end_page="111" type="metho">
    <SectionTitle>
4 Recognition of Chinese Named Entity Using Hybrid Model
</SectionTitle>
    <Paragraph position="0"> Analyzing the classification results (obtained by the sole SVMs described in Section 2) between B and I, B and O, and I and O respectively, we find that the error is mainly caused by the second classification: samples belonging to the B class are misclassified into the O class, which diminishes the votes for the B class, so the corresponding named entities are lost and the Recall is lower. Meanwhile, the misclassified samples whose function distances to the SVM hyperplane in feature space are less than 1 account for over 83% of all misclassified samples. That means the misclassification occurs in the region where the two classes overlap. Considering this fact, we can expect to improve the SVM with the following hybrid model.</Paragraph>
    <Paragraph position="1"> The hybrid model includes the following procedure: 1) compute the distance from the test sample to the hyperplane of SVM in feature space.</Paragraph>
    <Paragraph position="2"> 2) compare the distance with given threshold.</Paragraph>
    <Paragraph position="3"> The algorithm of the hybrid model can be described as follows: suppose T is the testing set,</Paragraph>
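The two-step procedure above can be sketched as a decision rule; the function names and the fallback model below are illustrative, not from the paper.

```python
def hybrid_decision(svm_label, g_x, fallback_model, sample, epsilon=0.3):
    """Sketch of the hybrid model: keep the SVM's label when the sample
    lies outside the overlap region (|g(x)| >= epsilon); otherwise defer
    to the statistical model. epsilon is the distance threshold."""
    if abs(g_x) >= epsilon:
        return svm_label                 # far from the hyperplane: trust SVM
    return fallback_model(sample)        # overlap region: statistical model

# Toy fallback that relabels ambiguous samples as 'B'.
print(hybrid_decision("O", 0.1, lambda s: "B", None))  # inside region -> 'B'
print(hybrid_decision("O", 1.5, lambda s: "B", None))  # outside region -> 'O'
```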
  </Section>
  <Section position="9" start_page="111" end_page="111" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> Our experimental results are all based on the corpus of Peking University.</Paragraph>
    <Section position="1" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
5.1 Extracting Chinese Person Names
</SectionTitle>
      <Paragraph position="0"> We use a 180-thousand-character corpus from the 1998 People's Daily as the training corpus and extract other sentences (containing 1526 Chinese person names) as the testing corpus to conduct an open test experiment. The results based on different models are as follows.</Paragraph>
      <Paragraph position="1"> 1) Based on Sole SVM An experiment is carried out to recognize Chinese person names based on the sole SVM by the method described in Section 2. The Recall, Precision and F-measure for different degrees d of the polynomial kernel function are given in Table 4. The best result is obtained when d=2.</Paragraph>
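The three scores can be computed from counts as below; the balanced F1 form of the F-measure is assumed, and the predicted/correct counts are toy numbers (only the 1526 gold names come from the text).

```python
def precision_recall_f1(num_correct, num_predicted, num_gold):
    """Standard Precision / Recall / F-measure for extracted names.
    F1 (the balanced F-measure) is assumed."""
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Toy counts: 1200 correct names out of 1400 predicted; 1526 gold names.
p, r, f = precision_recall_f1(1200, 1400, 1526)
print(round(p, 4), round(r, 4), round(f, 4))
```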
    </Section>
  </Section>
  <Section position="10" start_page="111" end_page="111" type="metho">
    <SectionTitle>
2) Using Hybrid Model
</SectionTitle>
    <Paragraph position="0"> As mentioned in Section 4, test samples belonging to the B class are misclassified into the O class, so the Recall of person name extraction with the sole SVM is lower. We therefore only deal with the test samples (B class and O class) whose function distances to the SVM hyperplane in feature space (i.e. g(x)) lie between 0 and e. We move the class boundary learned by the SVM towards the O class; that is, O-class samples in that area are considered B class. 93.64% of the Chinese person names in the testing corpus are recalled when e=0.9 (here, e also represents how far the boundary is moved). However, a number of non-person names are also wrongly identified as person names, and the Precision decreases correspondingly. Table 5 shows the Recall and Precision of person name extraction with different e.</Paragraph>
    <Paragraph position="1">  We use the evaluation function TotalProbability(PN) described in Section 3 to filter out the wrongly recalled person names from the SVM. We tune a in equation (5) to obtain the best results. The results based on the hybrid model with different a are listed in Table 6 (when d=2). We can observe that the result is best when a=0.4.</Paragraph>
    <Paragraph position="2"> Table 7 shows the results based on the hybrid model with different e when a=0.4. We can observe that, on the whole, the Recall rises and the Precision drops as e increases. The synthetic index F-measure is improved when e is between 0.1 and 0.8 compared with the sole SVM.</Paragraph>
    <Paragraph position="3"> The best result is obtained when e=0.3: the Recall and the F-measure increase by 3.27% and 1.77% respectively.</Paragraph>
    <Section position="1" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
5.2 Extracting Chinese Location Names
</SectionTitle>
      <Paragraph position="0"> We use a 1.5M-character corpus from the 1998 People's Daily as the training corpus and extract sentences from the 2000 People's Daily (containing 2919 Chinese location names) as the testing corpus to conduct an open test experiment. The results based on different models are as follows.</Paragraph>
      <Paragraph position="1"> 1) Based on Sole SVM The Recall, Precision and F-measure for different degrees d of the polynomial kernel function are given in Table 8. The best result is obtained when d=2.</Paragraph>
    </Section>
  </Section>
  <Section position="11" start_page="111" end_page="111" type="metho">
    <SectionTitle>
2) Using Hybrid Model
</SectionTitle>
    <Paragraph position="0"> The results for Chinese location name extraction based on the hybrid model are listed in Table 9 (when d=2; a=0.2 in equation (10)). We can observe that, on the whole, the Recall rises and the Precision drops as e increases. The synthetic index F-measure is improved when e is between 0.1 and 0.7 compared with the sole SVM. The best result is obtained when e=0.3.</Paragraph>
    <Paragraph position="1"> The Recall increases by 3.55%, the Precision decreases by 1.05% and the F-measure increases by 1.37%.</Paragraph>
  </Section>
</Paper>