<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0140">
  <Title>Chinese Named Entity Recognition with a Multi-Phase Model</Title>
  <Section position="4" start_page="0" end_page="214" type="metho">
    <SectionTitle>
2 Chinese NER with multi-level models
2.1 Recognition Process
</SectionTitle>
    <Paragraph position="0"> The input to the recognition algorithm is an unsegmented Chinese character sequence, and the output is the set of recognized entity names. The recognition process is illustrated in figure 1. First, we segment the text with a character-level CRF model. After basic segmentation, a small number of named entities in the text, such as "Shan Xi Dui", "Xin Hua She" and "Fu Jian Sheng", are segmented as single words. These simple single-word entities are labeled by rules in the last phase. However, a great number of named entities in the text, such as "Zhong Guo Lu Se Zhao Ming Gong Cheng Ban Gong Shi" and "Xi Bo Po Ji Nian Guan", are not segmented as single words. Therefore, unlike (Andrew et al. 2003), we apply three trained CRF models with carefully designed and selected features to label person names, location names and organization names in the segmentation results, respectively. In the last phase, we apply rules to tag names not recognized by the CRF models and to adjust some of the organization names recognized by the CRF models.</Paragraph>
    <Section position="1" start_page="213" end_page="213" type="sub_section">
      <SectionTitle>
2.2 Word segmentation
</SectionTitle>
      <Paragraph position="0"> We implemented the basic segmentation component with linear-chain CRFs. CRFs are undirected graphical models that encode a conditional probability distribution using a given set of features. In the special case in which the designated output nodes of the graphical model are linked by edges in a linear chain, CRFs make a first-order Markov independence assumption among output nodes, and thus correspond to finite state machines (FSMs). CRFs define the conditional probability of a state sequence s = s_1 ... s_T given an input sequence o = o_1 ... o_T as</Paragraph>
      <Paragraph position="1"> P_\Lambda(s \mid o) = \frac{1}{Z_o} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(s_{t-1}, s_t, o, t) \right)</Paragraph>
      <Paragraph position="2"> where Z_o is a normalization factor over all state sequences, f_k(s_{t-1}, s_t, o, t) is an arbitrary feature function over its arguments, and \lambda_k is a learned weight for each feature function. Based on the CRF model, we cast the segmentation problem as a sequence tagging problem. Unlike (Peng et al., 2004), we represent the position of a hanzi (Chinese character) with four tags: B for a hanzi that starts a word, I for a hanzi that continues a word, F for a hanzi that ends a word, and S for a hanzi that occurs as a single-character word. Basic segmentation is the process of labeling each hanzi with a tag given the features derived from its surrounding context. The features used in our experiment fall into two categories: character features and word features. The character features are instantiations of the following templates, similar to those described in (Ng and Jin, 2004), where C refers to a Chinese hanzi.</Paragraph>
      <Paragraph position="3"> (a) Cn (n = -2, -1, 0, 1, 2) (b) CnCn+1 (n = -2, -1, 0, 1) (c) C-1C1 (d) Pu(C0)  In addition to the character features, we introduce another type of feature, the word context feature, which we found very useful in our experiments. The feature captures the relationship between a hanzi and the word that contains it. For a two-hanzi word, for example, the first hanzi "Lian" within the word "Lian Xu" has the feature WC0=TWO_F set to 1, and the second hanzi "Xu" within the same word has the feature WC0=TWO_L set to 1. For a three-hanzi word, for example, the first hanzi "Shu" within the word "Shu Zhuang Jing" has the feature WC0=TRI_F set to 1, the second hanzi "Zhuang" has the feature WC0=TRI_M set to 1, and the last hanzi "Jing" has the feature WC0=TRI_L set to 1. The feature can be extended to four-hanzi words in the same way.</Paragraph>
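      <Paragraph position="4"> The tagging scheme and feature templates above can be sketched in Python as follows. This is an illustrative sketch, not the implementation used in the experiments. It rests on several assumptions: each hanzi is stood in for by its romanized syllable, as in the text; "PAD" is a hypothetical symbol for positions outside the sentence; Pu(C0) is assumed to test whether C0 is a punctuation mark, checked against a hypothetical set PUNCT; and the word containing each hanzi is taken as given. All function names are hypothetical.

```python
# Illustrative sketch; hanzi are stood in for by romanized syllables.
PUNCT = {",", ".", "!", "?"}  # hypothetical punctuation set for template (d)
WC_LABELS = {2: ["TWO_F", "TWO_L"], 3: ["TRI_F", "TRI_M", "TRI_L"]}


def position_tags(words):
    """Tag every hanzi of a segmented sentence with B, I, F or S."""
    tags = []
    for word in words:  # each word is a non-empty list of hanzi
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["F"])
    return tags


def char_features(chars, i):
    """Instantiate templates (a)-(d) for the hanzi at index i."""
    def c(n):
        j = i + n
        # "PAD" is a hypothetical symbol for positions outside the sentence.
        return chars[j] if j in range(len(chars)) else "PAD"

    feats = ["C%d=%s" % (n, c(n)) for n in (-2, -1, 0, 1, 2)]            # (a)
    feats += ["C%dC%d=%s%s" % (n, n + 1, c(n), c(n + 1))
              for n in (-2, -1, 0, 1)]                                   # (b)
    feats.append("C-1C1=%s%s" % (c(-1), c(1)))                           # (c)
    feats.append("Pu(C0)=%d" % (c(0) in PUNCT))                          # (d)
    return feats


def word_context_features(words):
    """Give each hanzi its WC0 feature; None where the text defines no label."""
    feats = []
    for word in words:
        labels = WC_LABELS.get(len(word))  # only two- and three-hanzi words
        for k in range(len(word)):
            feats.append(("WC0=" + labels[k]) if labels else None)
    return feats
```

For the two example words in the text, position_tags([["Lian", "Xu"], ["Shu", "Zhuang", "Jing"]]) yields B, F, B, I, F, and word_context_features yields WC0=TWO_F, WC0=TWO_L, WC0=TRI_F, WC0=TRI_M, WC0=TRI_L. Labels for four-hanzi words are left out because the text does not name them.</Paragraph>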
    </Section>
    <Section position="2" start_page="213" end_page="214" type="sub_section">
      <SectionTitle>
2.3 Named entity tagging with CRFs
</SectionTitle>
      <Paragraph position="0"> After basic segmentation, we use three word-level CRF models to label person names, location names and organization names, respectively.</Paragraph>
      <Paragraph position="1"> The key factor in applying a CRF model to named entity recognition is the selection of a proper feature set. Most entity names share no common structural characteristics apart from containing certain feature words, such as "Gong Si", "Xue Xiao", "Xiang", "Zhen" and so on. In addition, most person names include a common surname, e.g. "Zhang" or "Wang". As a proper noun, however, an entity name occurs in a specific context. In this section, we present only our approach to organization name recognition. The context information of an organization name mainly consists of the boundary words and some title words (e.g. Ju Chang, Dong Shi Chang). By analyzing a large corpus of entity names, we find that the indicative intensity of different boundary words varies greatly. We therefore divide the left and the right boundary words into two classes each according to their indicative intensity, and accordingly construct four boundary word lexicons. To select and classify the boundary words, we make use of mutual information I(x, y). If there is a genuine association between x and y, then I(x, y) &gt;&gt; 0. If there is no interesting relationship between x and y, then I(x, y) ≈ 0. If x and y are in complementary distribution, then I(x, y) &lt;&lt; 0.</Paragraph>
      <Paragraph position="2"> Using mutual information, we compute the association between each boundary word and the organization name type, and then select and classify the boundary words. Some example boundary words for organization names are listed in table 1.</Paragraph>
      <Paragraph position="3"> Based on the considerations given in the preceding section, we constructed a set of atomic feature patterns, listed in table 2. Additionally, we defined a set of conjunctive feature patterns, which combine atomic patterns into feature conjunctions that express more complicated contextual information.</Paragraph>
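      <Paragraph position="4"> As an illustration of the mutual-information selection of boundary words described in this section, the following Python sketch scores candidate words and splits them into two classes. It is a sketch under stated assumptions, not the procedure used in the experiments: I(x, y) is taken to be the pointwise measure log2(P(x, y) / (P(x) P(y))) estimated from raw counts, and the function names, the class names "strong" and "weak", and both thresholds are hypothetical.

```python
import math


def mutual_information(n_xy, n_x, n_y, n):
    """I(x, y) = log2(P(x, y) / (P(x) * P(y))), estimated from counts.

    n_xy: times word x occurs at an organization-name boundary
    n_x:  total occurrences of x, n_y: number of organization names
    n:    total number of positions considered
    """
    if n_xy == 0:
        return float("-inf")  # x and y never co-occur
    return math.log2((n_xy * n) / (n_x * n_y))


def classify_boundary_words(counts, n_org, n, strong, weak):
    """Split candidate boundary words into two classes by I(x, y).

    counts maps a word to the pair (n_xy, n_x). The thresholds `strong`
    and `weak` are hypothetical; words scoring below `weak` are dropped.
    """
    classes = {"strong": [], "weak": []}
    for word, (n_xy, n_x) in sorted(counts.items()):
        score = mutual_information(n_xy, n_x, n_org, n)
        if score >= strong:
            classes["strong"].append(word)
        elif score >= weak:
            classes["weak"].append(word)
    return classes
```

Running classify_boundary_words once on left-boundary counts and once on right-boundary counts would give four lists, matching the four lexicons mentioned above.</Paragraph>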
    </Section>
    <Section position="3" start_page="214" end_page="214" type="sub_section">
      <SectionTitle>
2.4 Processing with rules
</SectionTitle>
      <Paragraph position="0"> Some single-word named entities are not tagged by the CRF models. We recognize these single-word named entities with rules. We first construct two dictionaries, one of known location names and one of known organization names, and two lists of feature words, one for location names and one for organization names. In the closed track, we collect known location names and organization names only from the training corpus. The recognition process is as follows. For each word in the text, we first check whether it is a known location or organization name according to the two dictionaries.</Paragraph>
      <Paragraph position="1"> If it is not a known name, we check whether it is a known word. If it is not a known word either, we check whether it ends with a feature word of location or organization names. If it does, we label it as a location or organization name.</Paragraph>
      <Paragraph position="2"> In addition, we introduce some rules to adjust the organization names recognized by the CRF model, based on the labeling specification of the MSRA corpus. For example, the string "Yang Cheng Xian Li Ge Ta Xiang Wei Sheng Yuan" is recognized as a single organization name, but according to the labeling specification it should be divided into two names: a location name ("Yang Cheng Xian") and an organization name ("Li Ge Ta Xiang Wei Sheng Yuan"). We add rules to make this adjustment.</Paragraph>
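      <Paragraph position="3"> The rule-based checks described in this section can be sketched in Python as follows. This is a minimal sketch under assumptions, not the rule set used in the system: words are tuples of romanized syllables standing in for hanzi, all function and parameter names are hypothetical, locations are checked before organizations only for definiteness (the text does not fix an order), and the split rule shown, which cuts off the longest known location at the start of an organization name, is just one possible reading of the adjustment example.

```python
def ends_with(word, feature_word):
    """True if the hanzi tuple `word` ends with the tuple `feature_word`."""
    return len(feature_word) != 0 and word[-len(feature_word):] == feature_word


def label_single_word(word, loc_dict, org_dict, known_words,
                      loc_feats, org_feats):
    """Return "LOC", "ORG" or None for one segmented word."""
    if word in loc_dict:      # step 1: known-name dictionaries
        return "LOC"
    if word in org_dict:
        return "ORG"
    if word in known_words:   # step 2: an ordinary known word is not a name
        return None
    # step 3: an unknown word ending with a feature word
    if any(ends_with(word, f) for f in loc_feats):
        return "LOC"
    if any(ends_with(word, f) for f in org_feats):
        return "ORG"
    return None


def split_leading_location(name, loc_dict):
    """Split an organization name into (location, organization).

    Assumed rule: cut off the longest known location name that starts
    the string, keeping a non-empty remainder. Returns (None, name) when
    no known location starts the string.
    """
    for k in range(len(name) - 1, 0, -1):  # longest prefix first
        if name[:k] in loc_dict:
            return name[:k], name[k:]
    return None, name
```

With ("Yang", "Cheng", "Xian") assumed to be in the location dictionary, split_leading_location applied to the example string returns that location followed by the organization name ("Li", "Ge", "Ta", "Xiang", "Wei", "Sheng", "Yuan").</Paragraph>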
    </Section>
  </Section>
</Paper>