File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/01/p01-1041_metho.xml
Size: 14,629 bytes
Last Modified: 2025-10-06 14:07:40
<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1041"> <Title>Japanese Named Entity Recognition based on a Simple Rule Generator and Decision Tree Learning</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> -> ORGANIZATION </SectionTitle> <Paragraph position="0"> However, this rule is not very good. For instance, OO-SAKA-WAN (= Osaka Bay) follows this pattern, but it is a location's name. GIN-KOU and WAN strongly imply ORGANIZATION andLOCATION, respectively. Thus, the last word of an NE is often a head that is more useful than other words for the classification. Therefore, we register the last word into a suffix dictionary for each non-numerical NE class (i.e., ORGANIZA-TION, PERSON, LOCATION, and ARTIFACT) in order to accept only reliable candidates. If the last word appears in two or more different NE, we call it a reliable NE suffix. We register only reliable ones.</Paragraph> <Paragraph position="1"> NE candidatesdocument recog. rule 1 recog. rule 2 recog. rule n In the above examples, the last words were common nouns. However, the last word can also be a proper noun. For instance, we will get the following rule from <ORGANIZATION>OO-</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> SAKA-TO-YO-TA</ORGANIZATION> (= Os- </SectionTitle> <Paragraph position="0"> aka Toyota) because Japanese POS taggers know that TO-YO-TA is an organization name (a kind of proper noun).</Paragraph> <Paragraph position="1"> *:*:location-name, *:*:org-name -> ORGANIZATION,0,0 Since Yokohama Honda and Kyoto Sony also follow this pattern, the second element *:*:org-name should not be restricted to the words in the training data. Therefore, we do not restrict proper nouns by a suffix dictionary, and we do not restrict numbers either.</Paragraph> <Paragraph position="2"> In addition, the first or last word of an NE may contain an NE boundary as we described before (SHI</LOCATION>NAI). In this case, we can get OO-SAKA-SHI by removing no character of the first word OO-SAKA and one character of the last word SHI-NAI. Accordingly, this modification can be represented by two integers: 0,1. Furthermore, one-word NEs are different from other NEs in the following respects.</Paragraph> <Paragraph position="3"> a0 The word is usually a proper noun, an unknown word, or a number; otherwise, it is an exceptional case.</Paragraph> <Paragraph position="4"> a0 The character type of a one-word NE gives a useful hint for its classification. For instance, all-uppercasewords (e.g., IOC) are often classified as ORGANIZATION.</Paragraph> <Paragraph position="5"> Since unknown words are often proper nouns, we assume they are tagged as misc-proper-noun. If the training data contains <ORGANIZATION>I-O-C</ORGANIZATION> and I-O-C (= IOC) is an unknown word, we will get I-O-C:alluppercase:misc-proper-noun. null By considering these facts, we modify the above rule generation. That is, we replace every word in an NE and its character type by '*' to get the left-hand side of the corresponding recognition rule except the following cases.</Paragraph> <Paragraph position="6"> A word that contains an NE boundary If the first or last word of the NE contains an NE boundary (e.g, SHI</LOCATION>NAI), the word is not replaced by '*'. The number of characters to be deleted is also recorded in the right-hand side of the recognition rule. One-word NE The following exceptions are applied to one-word NEs. If the word is a proper noun or a number, its character type is not replaced by '*'. Otherwise, the word is not replaced by '*'.</Paragraph> <Paragraph position="7"> The last word of a longer NE The following exceptions are applied to the last word of a non-numerical NE that is composed of two or more words when the word is neither a proper noun nor a number. If the last word is a reliable NE suffix (i.e., it appears in two or more different NEs in the class), its information (i.e., the last word, its character type, and its POS tag) is registered into a suffix dictionary for the NE class. The last word of the recognition rule must be an element of the suffix dictionary. Unreliable NE suffixes are not replaced by '*'. Suffixes of numerical NEs (i.e., DATE, TIME, MONEY, PERCENT) are not replaced, either.</Paragraph> <Paragraph position="8"> Now, we obtain the following recognition rules from the above examples.</Paragraph> <Paragraph position="9"> *:all-uppercase:misc-proper-noun -> ORGANIZATION,0,0.</Paragraph> <Paragraph position="10"> *:*:location-name, SHI-NAI:*:common-noun -> LOCATION,0,1.</Paragraph> <Paragraph position="11"> *:*:location-name, *:*:common-noun -> ORGANIZATION,0,0.</Paragraph> <Paragraph position="12"> The first rule extracts CNN as an organization. The second rule extracts YOKO-HAMA-SHI (= Yokohama City) from YOKO-HAMA-SHI-NAI (= in Yokohama City). The third rule extracts YOKO-HAMA-GIN-KOU (= Yokohama Bank) as an organization. Note that, in this rule, the second element (*:*:common-noun) is constrained by the suffix dictionary for ORGANIZATION because it is neither a proper noun nor a number. Hence, the rule does not match YOKO-HAMA-WAN (= Yokohama Bay). If the suffix dictionary also happens to have KOU-KOU:all-kanji: commmon-noun (= senior high school), the rule also matches YOKO-HAMA-KOU-KOU (= Yokohama Senior High School).</Paragraph> <Paragraph position="13"> IREX introduced <ARTIFACT> for product names, prizes, pacts, books, and fine arts, among other nouns. Titles of books and fine arts are often long and have atypical word patterns. However, they are often delimited by a pair of symbols that correspond to quotation marks in English. Some atypical organization names are also delimited by these symbols. In order to extract such a long NE, we concatenate all words within a pair of such symbols into one word. We employ the first and last word of the quoted words as extra features. In addition, we do not regard the quotation symbols as adjacent words because they are constant and lack semantic meaning.</Paragraph> <Paragraph position="14"> When a large amount of training data is given, thousands of recognition rules are generated. For efficiency, we compile these recognition rules by using a hash table that converts a hash key into a list of relevant rules that have to be examined. We make this hash table as follows. If the left-hand side of a rule contains only one element, the element is used as a hash key and its rule identifier is appended to the corresponding rule list. If the left-hand side contains two or more elements, the first two elements are concatenated and used as a hash key and its rule identifier is appended to the corresponding rule list. After this compilation, we can efficiently apply all of the rules to a new document. By taking the first two elements into consideration, we can reduce the number of rules that need to be examined.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Refinement of recognition rules </SectionTitle> <Paragraph position="0"> Some recognition rules are not reliable. For instance, we get the following rule when a person's name is incorrectly tagged as a location's name</Paragraph> <Paragraph position="2"> Therefore, we have to consider a way to refine the recognition rules.</Paragraph> <Paragraph position="3"> By applying each recognition rule to the untagged training data, we can obtain NE candidates for the rule. By comparing the candidates with the given answer for the training data, we can classify them into positive examples and negative examples for the recognition rule. Consequently, we can apply decision tree learning to classify these examples correctly. We represent each example by a list of features: words in the NEs, a1 preceding words, a2 succeeding words, their character types, and their POS tags. If we consider one preceding word and two succeeding words, the feature list for a two-word named entity (a3a5a4a7a6a8a3a10a9 ) will be a3a10a11 a9 , a12a13a11 a9 , a14a15a11 a9 , a3 a4 , a12 a4 , a14 a4 , a3 a9 , a12 a9 , a14 a9 , a3a17a16 , a12a18a16 , a14 a16 , a3a17a19 , a12a18a19 , a14a20a19 , a14a22a21 , where a3 a11 a9 is the preceding word and a3 a16 and a3a23a19 are the succeeding words.</Paragraph> <Paragraph position="4"> a12a8a24 is a3a25a24 's character type and a14a26a24 is a3a27a24 's POS tag. a14a22a21 is a boolean value that indicates whether it is a positive example. If a feature value appears less than three times in the examples, it is replaced by a dummy constant. We also replace numbers by dummy constants because most numerical NEs follow typical patterns, and their specific values are often useless for NE recognition.</Paragraph> <Paragraph position="5"> Here, we discuss handling short NEs. For example, NO-O-BE-RU-SHOU-SEN-KOU-I-IN-KAI (= the Nobel Prize Selection Committee) is an organization's name that contains a person's name NO-O-BE-RU (= Nobel) and an artifact name NO-O-BE-RU-SHOU (= Nobel Prize), but <PERSON>NO-O-BE-RU</PER-</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> SON> and <ARTIFACT>NO-O-BE-RU-SHOU </SectionTitle> <Paragraph position="0"> </ARTIFACT> are incorrect in this case. If the training data contain NO-O-BE-RU as both positive and negative examples of a person's name, the decision tree learner will be confused. They are rejected because there is a longer named entity and overlapping tags are not allowed. We do not have to change our knowledge that Nobel is a person's name. Therefore, we remove such negative examples caused by longer NEs. Consequently, the decision tree may fail to reject <PERSON> NO-O-BE-RU</PERSON>, but it will disappear in the final output because we use a longest match method for arbitration.</Paragraph> <Paragraph position="1"> For readability, we translate each decision tree into a set of production rules by c4.5rules (Quinlan, 1993). Throughout this paper, we call them dt-rules (Fig. 1) in order to distinguish them from recognition rules. Thus, each recognition rule is enhanced by a set of dt-rules. The dt-rules removes unlikely candidates.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Arbitration of candidates </SectionTitle> <Paragraph position="0"> Once the refined rules are generated, we can apply them to a new document. This obtains a large number of NE candidates (Fig. 1). Since overlapping tags are not allowed, we use a kind of left-to-right longest match method. First, we compare their starting points and select the earliest ones.</Paragraph> <Paragraph position="1"> If two or more candidates start at the same point, their ending points are compared and the longest candidate is selected. Therefore, the candidates overlapping the selected candidate are removed from the candidate set. This procedure is repeated until the candidate set becomes empty.</Paragraph> <Paragraph position="2"> The rank of a candidate starting at the a28 -th word boundary and ending at the a29 -th word boundary can be represented by a pair a30a31a28a32a6a13a33a34a29a36a35 . The beginning of a sentence is the zeroth word boundary, and the first word ends at the first word boundary, etc. Then, the selected candidate should have the minimum rank according to the lexicographical ordering of a30a31a28a37a6a38a33a34a29a36a35 . When a candidate starts or ends within a word (e.g., SHI-NAI), we assume that the entire word is a member of the candidate for the definition of a30a31a28a32a6a13a33a34a29a36a35 . According to this ordering, two candidates can have the same rank. One of them might assert that a certain word is an organization's name and another candidate might assert that it is a person's name. In order to apply the most frequently used rule, we extend this ordering by a30a31a28a32a6a13a33a34a29a36a6a38a33a40a39a42a41a13a35 , where a39a43a41 is the number of positive examples for the rulea44 .</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Maximum entropy system </SectionTitle> <Paragraph position="0"> In order to compare our method with the ME approach, we also implement an ME system based on Ristad's toolkit (1997). Borthwick's (1999) and Uchimoto's (2000) ME systems are quite similar but differ in details. They regarded Japanese NE recognition as a classification problem of a word. The first word of a per-son name is classified as PERSON-BEGIN. The last word is classified as PERSON-END. Other words in the person's name (if any) are classified as PERSON-MIDDLE. If the person's name is composed of only one word, it is classified as PERSON-SINGLE. Similar labels are given to all other classes such as LOCATION. Non-NE words are classified as OTHER. Thus, every word is classified into 33 classes, i.e., a45 ORGANIZATION, PERSON, LOCATION, ARTIFACT, DATE, TIME, We use the following features for each word in the training data: the word itself, a1 preceding words, a2 succeeding words, their character types, and their POS tags. By following Uchimoto, we disregard words that appear fewer than five times and other features that appear fewer than three times.</Paragraph> <Paragraph position="1"> Then, the ME-based classifier gives a probability for each class to each word in a new sentence. Finally, the Viterbi algorithm (see textbooks, e.g., (Allen, 1995)) enhanced with consistency checking (e.g., PERSON-END should follow PERSON-BEGIN or PERSON-MIDDLE) determines the best combination for the entire sentence.</Paragraph> <Paragraph position="2"> We generate the word boundary rewriting rules as follows. First, the NE boundaries inside a word are assumed to be at the nearest word boundary outside the named entity. Hence, SHI</LOCATION>NAI is rewritten as SHI-NAI</LOCATION>. Accordingly, SHI-NAI is classified as LOCATION-END. The original NE boundary is recorded for the pair SHI-NAI/</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> LOCATION-END, If SHI-NAI/LOCATION-END </SectionTitle> <Paragraph position="0"> is found in the output of the Viterbi algorithm, it is rewritten as SHI</LOCATION>NAI. Since rewriting rules from rare cases can be harmful, we employ a rewriting rule only when the rule correctly works for more than 50% of the word/class pairs in the training data.</Paragraph> </Section> class="xml-element"></Paper>