<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1708">
  <Title>CHINERS: A Chinese Named Entity Recognition System for the Sports Domain</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Repairing the Errors for Word Segmentation and POS Tagging
</SectionTitle>
    <Paragraph position="0"> To ensure good quality in word segmentation and POS tagging, we compared several existing Chinese word segmentation and POS tagging systems and adopted one of them (Liu, 2000) as the first component of our system. Unfortunately, we found that this system still makes considerable word segmentation and POS tagging errors when processing our texts in the sports domain. For example:  In the above examples, A, N, N5, and W represent an adjective, a general noun, a Chinese LN, and a punctuation mark, respectively. According to the domain knowledge, the word &amp;quot;a44a2a45 &amp;quot; is a city name serving as a constituent of a TN and should not be segmented, while the word &amp;quot;a26a50a46a51a47 &amp;quot; is an attack strategy in a football match and should not be tagged as a Chinese LN. Obviously, these errors have an unfavorable effect on the subsequent recognition of NEs.</Paragraph>
    <Paragraph position="1"> In order to improve the quality for word segmentation and POS tagging, there may be two ways to achieve such goal:  * Develop a novel general Chinese word segmentation and POS tagging system, which will have higher performance than the current systems of the same kind or * Utilize a baseline system with good quality and further improve its performance on a specific domain, so that it can be suitable to real-world application.</Paragraph>
    <Paragraph position="2"> We have chosen the second way in our investigation. First, the research of word segmentation and POS tagging is a secondary task for us in the project. In order to ensure the overall quality of the system, we have to enhance basic quality. Second, it is more effective to improve the quality for word segmentation and POS tagging on a specific domain. null The transformation based error-driven machine learning approach (Brill, 1995) is adopted to repair word segmentation and POS tagging errors, because it is suitable for fixing Chinese word segmentation and POS tagging errors as well as producing effective repairing rules automatically. Following (Hockenmaier and Brew, 1998; Palmer, 1997) we divide the error repairing operations of word segmentation into three types, that is, concat, split and slide. In addition, we add context-sensitive or context-free constraints in the rules to repair the errors of word segmentation and POS tagging. It is important that the context constraints can help us distinguish different sentence environments. The error repairing rules for word segmentation and POS tagging are defined as follows:</Paragraph>
    <Paragraph position="4"> Using these rules, we can move the word segmentation position newly and replace an error tag with a correct tag. e.g. a32 |N|a11 |N|a0 |V|a1 |N (The national football team arrived in Shanghai). The word &amp;quot;a32 a11 &amp;quot; (the national football team) is an abbreviated TN that should not be segmented; while the word &amp;quot;a1 &amp;quot; (Hu) is an abbreviated Chinese LN for Shanghai. We can use the following two rules to repair such errors: rectify_segmentation_error ( concat, a32 |N|a11 |N, 1, J, _|_, a0 |V) rectify_tag_error (a1 |N, J, a0 |V, _|_ ) Here, the digit 1 means the operation number of concat. J is a POS tag for the abbreviated word.</Paragraph>
    <Paragraph position="5"> After the errors are repaired, the correct result is a32</Paragraph>
    <Paragraph position="7"> In the training algorithm (Yao et al., 2002), the error positions are determined by comparing between manually error-repaired text and automatically processed text from the baseline system.</Paragraph>
    <Paragraph position="8"> Simultaneously, the error environments are recorded. Based on such information, the candidate transformation rules are generated and the final error repairing rules are selected depending on their error repairing number in the training set. In order to use these rules with priority, the rules in the final rule library are sorted.</Paragraph>
    <Paragraph position="9"> Considering the requirements of context constraints for different rules, we manually divide the rule context constraints into three types: whole POS context constraint, preceding POS context constraint and without context constraint. Hence, each error repairing rule can be used in accordance with either common or individual cases of errors.</Paragraph>
    <Paragraph position="10"> In the testing algorithm (Yao et al., 2002), the usage of error repairing rules with context constraints is prior to those without context constraints, the employment of error repairing rules for word segmentation has priority over those for POS tagging.</Paragraph>
    <Paragraph position="11"> Thus, it ensures that the rules can repair more errors. At the same time, it prevents new errors occur during repairing existing errors.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Named Entity Recognition
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 An Automatically Constructed FSC
</SectionTitle>
      <Paragraph position="0"> After error repairing, the text with repaired word segmentations and POS tags is used as the input text for NE recognition.</Paragraph>
      <Paragraph position="1"> We make use of Finite-State Cascades (FSC) (Abney, 1996) as a shallow parser for NE recognition in our system. An FSC is automatically constructed by the NE recognition rule sets and consists of three recognition levels. Each level has a NE recognizer, that is, TN, CT and PI recognizer (Other three NEs, namely, PN, DT and LN are immediately recognized after error repairing).</Paragraph>
      <Paragraph position="2"> In order to build a flexible and reliable FSC, we propose the following construction algorithm to automatically construct FSC by the recognition rule sets.</Paragraph>
      <Paragraph position="3"> The NE recognition rule is defined as follows:</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Semantic Constraintn
</SectionTitle>
      <Paragraph position="0"> The NE recognition rule is composed of POS rule and its corresponding semantic constraints.</Paragraph>
      <Paragraph position="1"> The rule sets include 19 POS tags and 29 semantic constraint tags.</Paragraph>
      <Paragraph position="2"> Four adjacent matrices, POS matrix, POS index matrix, semantic index matrix and semantic constraint matrix, are used in this algorithm as data structure. The POS matrix is used for the corresponding POS tags between two states. The POS index matrix provides the position of indexes related with POS tags between two states in the semantic index matrix. The semantic index matrix indicates the position of semantic constraints for each POS tag in semantic constraint matrix. The semantic constraint matrix saves the semantic constraint information for each POS tag in the POS matrix. We store the information for both the multi-POS tags between two states and the POS tags that have multi-semantic constraints in these matrices. As an example, Figure 2 shows that the following CT recognition rule set is used to build a deterministic finite automaton, that is, CT recognizer, using the above adjacent matrices. In the figure of the automaton, the semantic constraints in the rule set is omitted.</Paragraph>
      <Paragraph position="3">  In the POS tags, B, N1, and N7 represent a discrimination word, a proper noun and a transliterated noun separately. In the semantic constraint tags, Rank and Range mean the competition rank and range, such as super, woman etc.</Paragraph>
      <Paragraph position="4">  nite automaton using four adjacent matrices The construction algorithm is summarized as follows: * Input a NE recognition rule set and initialize four adjacent matrices.</Paragraph>
      <Paragraph position="5"> * Get the first POS tag of a POS rule, start from the initial state of the NE recognizer, add its corresponding edge into the POS</Paragraph>
      <Paragraph position="7"> condition (see below explanation). At the same time, add its corresponding semantic constraints into the semantic constraint adjacent matrix by the POS and semantic index adjacent matrices.</Paragraph>
      <Paragraph position="8"> * If a tag's edge is successfully added, but it doesn't arrive in the final state, temporarily, push its POS tag, semantic constraints and related states (edge information) into a stack. If the next tag's edge isn't successfully added, pop the last tag's edge information from the stack. If the added edge arrives in the final state, pop all tag's edge information of the POS rule and add them into the POS and the semantic constraints adjacent matrix.</Paragraph>
      <Paragraph position="9"> * If all existing states in the NE recognizer are tried, but the current edge can not be added, add a new state to the NE recognizer. In the following adding edge procedure, share the existing edge with tag's edge to be added as much as possible.</Paragraph>
      <Paragraph position="10"> * If all the POS rules are processed, the construction for certain NE recognition level of FSC is completed.</Paragraph>
      <Paragraph position="11"> It is important that the correct construction condition in the procedure of adding POS tag's edge must be met. For example, whether its corresponding semantic constraints conflict with the existing edge's semantic constraints between two states or the in-degree of starting state and the out-degree of arriving state must be less than or equal to 1, etc in the NE recognizer. Otherwise give up adding this tag's edge. Figure 3 is a part of the constructed recognition level of FSC for CT.</Paragraph>
      <Paragraph position="12"> The construction algorithm is a rule-driven algorithm. It only depends on the format of rules. Therefore, it is easy to change the size of POS and semantic constraint tags or easy to add, modify or delete the rules. Additionally, the algorithm can be applied to build all the recognition levels, it is also easy to expand the NE recognizers in FSC for new NEs.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Recognition Procedure
</SectionTitle>
      <Paragraph position="0"> When FSC has been constructed, we use it to recognize TN, CT and PI. First of all, input the text to be processed and different resources, such as country name, club name, product name library and HowNet knowledge lexicon (Dong and Dong, 2000). Then attach keyword, potential personal identity, digital or alphabetic string, and other semantic constraint tags to the corresponding constituents in a sentence. Thirdly, match the words one by one with a NE recognizer. Start from the initial state, match the POS tag's edge in the POS adjacent matrix, then match its corresponding semantic constraint tags in the semantic constraint adjacent matrix through two index adjacent matrices. If it is successful, push the related information into the stack. Otherwise find another edge that can be matched. Until arriving in the final state, pop the recognized named entity from the stack.</Paragraph>
      <Paragraph position="1"> Fourthly, if the current word is not successfully matched and the stack is not empty, pop the information of the last word and go on matching this word with other edges. If some words are successfully matched, the following words will be matched until all of the words in the sentence are tried to match. Finally, if there is still a sentence which is not processed in the text, continue. Otherwise finish the NE recognition procedure.</Paragraph>
      <Paragraph position="2"> The matching algorithm guarantees that any NE match is a match with maximum length. Because the finite automaton in FSC is deterministic and has semantic constraints, it can process ambiguous linguistic cases. Therefore, it has reliability and accuracy during the NE recognition procedure.</Paragraph>
      <Paragraph position="3">  level of FSC for CT The following is an example to give the NE recognition procedure with FSC. L1 to L3 represent three NE recognition levels of FSC, namely, TN,</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Some Special Named Entities
</SectionTitle>
      <Paragraph position="0"> Sometimes there are TNs or CTs without keyword in sentences. For instance, TNs without keyword:  We organize domain verb lexicon and collect domain verbs, such as a40 (win), a65a67a66 (lose), a23 (vs.), a68a49a69 (attack), a70a14a71 (guard), a72a73a63 (take on), a74a58a75 (disintegrate), and their corresponding valence constraints. For instance, the valence constituents for the verb &amp;quot;a40 &amp;quot; in our domain are defined as follows:  In the basic format, Essiv is a subject that represents a non-spontaneous action and state in an event. Object is a direct object that deals with an non-spontaneous action. In the extended format, Link indicates a type, identity or role of the subject. In general, it begins with the word &amp;quot;a76a78a77 ...&amp;quot; (As ...). Accompaniment expresses an indirect object that is accompanied or excluded. It often begins with the word &amp;quot;a79a12a80 ...&amp;quot; (Except ...). Step 2 : Keep the equity of domain verbs and analyze the constituents of TN candidates According to the valence constraints, we examine whether the constituents in both sides of domain verbs are identical with the valence basic format or extended format, e.g. in Ex. 1 the team name1 should be balanced with the team name2 in the light of the valence basic format of domain verb a40 (win1). Besides, the candidate of team name2 is checked, its constituent is a city name (Dalian) that can be as a constituent of team name.</Paragraph>
      <Paragraph position="1"> Step 3 : Utilize context clue of TN candidates We find whether there is a TN that is equal to current TN candidate with the keyword through the context of the TN candidate in the text, in order to enhance the correctness of TN recognition. As an example, in Ex. 2, depending on Step 2, a team name can occur on both sides of the domain verb a48a81a40 (win victory). A country name can be a constituent of team name. At the same time, the context of two TN candidates will be examined.</Paragraph>
      <Paragraph position="2"> Finally, if there is such context clue, the candidates are determined. Otherwise, continue to recognize the candidates by the next step.</Paragraph>
      <Paragraph position="3"> Step 4: Distinguish team name from location name Because a LN can be a constituent of a TN, we should distinguish TNs without keywords from LNs. With the help of other constituents (e.g.</Paragraph>
      <Paragraph position="4"> nouns, propositions etc.) in a sentence, the differences of both NEs can be distinguished to a certain extent. In Ex. 3 the noun a63 (fight) is an analogy for the match in sports domain. Therefore, here &amp;quot;a60 &amp;quot; (Shanghai) and &amp;quot;a43 &amp;quot; (Dalian) represent two TNs separately. But it is still difficult to further improve the precision of TN recognition. (see the third experimental result.)</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 System Implementation and Experimental Results
</SectionTitle>
    <Paragraph position="0"> CHINERS has been implemented with Java 2 (ver.1.4.0) under Windows 2000. The user interface displays the result of word segmentation and POS tagging from the baseline system, the error repairing result, the recognized result for six types of NEs and the statistic report for error repairing and NE recognition. The recognized text can be entered from a disk or directly downloaded from WWW (http://www.jfdaily.com/). HowNet knowledge lexicon is used to provide English and concept explanation for Chinese words in the recognized results. Except the error repairing rule library (for the most part) and HowNet knowledge lexicon, other resources have been manually built.</Paragraph>
    <Paragraph position="1"> To evaluate this system, we have completed three different experiments. The first one is only for the performance of error repairing component.</Paragraph>
    <Paragraph position="2"> The second one is about comparison for NE recognition performance with or without error repairing. The third one is to test the recognition performance for TNs and CTs without keyword. The training set consists of 94 texts including 3473 sentences (roughly 37077 characters) from Jiefang Daily in 2001. The texts come from the football sports news. After machine learning, we obtain 4304 transformation rules. Among them 2491 rules are for word segmentation error repairing, 1813 rules are for POS tagging error repairing. There are 1730 rules as concat rules, 554 rules as split rules and 207 rules as slide rules in the word segmentation error repairing rules. Subsequently, we distinguish above rules into context-sensitive or context-free category manually. In the error repairing rules for word segmentation, 790, 315 and 77 rules are as concat, split and slide context-sensitive rules respectively; while 940, 239 and 130 rules are as concat, split and slide context-free rules separately. In the error repairing rules for POS tagging, 1052 rules are context-sensitive rules and 761 are context-free rules. The testing set is a separate set, which contains 20 texts including 658 sentences (roughly 8340 characters). The texts in the testing set have been randomly chosen from Jiefang Daily in May 2002. The texts also come from football sports news.</Paragraph>
    <Paragraph position="3"> Table 1 and 2 show the first experimental results for the performance in different cases. These results indicate that the average F-measure of word segmentation has increased by 5.11%; while one of POS tagging has even increased by 12.54%.</Paragraph>
    <Paragraph position="4"> In addition, using same testing set, we give the second and third experimental results in Figure 4, 5 and Table 3. In Figure 4 and 5, the performance of  six types of NEs has manifestly been improved.</Paragraph>
    <Paragraph position="5"> The total average recall is increased from 58% to 83%, and the total average precision has also increased from 65% to 85%. In Table 3, the average recall for TN without keyword has exceeded the average recall of TN in Figure 4; the average recall and precision of CT without keyword have also exceeded the average recall and precision of CT in Figure 4 and 5. But the average precision of TN only reaches 66%. We analyze the error reasons for the recognition of TN without keyword. Among 19 errors there are 17 errors from the wrong recognition for LN and 2 errors from imperfect recognition for TN. That is to say, the Step 4 of the recognition strategy in section 3.3 should be further improved.</Paragraph>
    <Paragraph position="6"> In short, the experimental results have shown that the performance of whole system has been significantly improved after error repairing for  without keyword word segmentation and POS tagging as well as the recognition for special NEs. It also proves that the system architecture is reasonable and effective.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Conclusions and Future Work
</SectionTitle>
    <Paragraph position="0"> During the research for Chinese NE recognition, we noted that the errors from word segmentation and POS tagging have adversely affected the performance of NE recognition to a certain extent. We utilize transformation based error-driven machine learning to perform error repairing for word segmentation and POS tagging simultaneously. In the error repairing procedure, we add context-sensitive or context-free constraints in the rules. Thus, the introduction of further errors during error repairing can be avoided. In order to recognize NEs flexibly, reliably and accurately, we design and implement a FSC as a shallow parser for the NE recognition, which can be automatically constructed on basis of the recognition rule sets. In accordance with special NEs, additionally, we hit on the corresponding solutions for the recognition correctness.</Paragraph>
    <Paragraph position="1"> The experimental results have shown that the performance of word segmentation and POS tagging has been improved, leading to an improved performance for NE recognition in our system.</Paragraph>
    <Paragraph position="2"> Such a hybrid approach used in our system synthesizes the advantages of knowledge engineering and machine learning.</Paragraph>
    <Paragraph position="3"> For future work we will focus on relation extraction. On one hand, we will build an ontology including sports objects, movements and properties as a knowledge base to support corpus annotation.</Paragraph>
    <Paragraph position="4"> On the other hand, we utilize machine learning to automatically build relation pattern library for relation recognition.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Acknowledgement
</SectionTitle>
    <Paragraph position="0"> This work is a part of the COLLATE project under contract no. 01INA01B, which is supported by the</Paragraph>
  </Section>
class="xml-element"></Paper>