File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/98/w98-1120_metho.xml
Size: 18,576 bytes
Last Modified: 2025-10-06 14:15:14
<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1120"> <Title>A Decision Tree Method for Finding and Classifying Names in Japanese Texts</Title> <Section position="5" start_page="171" end_page="172" type="metho"> <SectionTitle> 3 Algorithm </SectionTitle> <Paragraph position="0"> In this section, the algorithm of the system will be presented. There are two phases, one for creating the decision tree from training data (training phase) and the other for generating the tagged output based on the decision tree (testing phase). We use a Japanese morphological analyzer, JUMAN (JUMAN, 1997), and a program package for decision trees, C4.5 (Quinlan, 1993).</Paragraph> <Paragraph position="1"> We use three kinds of feature sets in the decision tree: * Part-of-speech tagged by JUMAN: We define the set of our categories based on its major category and minor category.</Paragraph> <Paragraph position="2"> * Character type information: Character type, like Kanji, Hiragana, Katakana, alphabet, number or symbol, etc., and some combinations of these.</Paragraph> </Section> <Section position="6" start_page="172" end_page="173" type="metho"> <SectionTitle> * Special Dictionaries </SectionTitle> <Paragraph position="0"> Lists of entities created based on JUMAN dictionary entries, lists found on the Web or based on human knowledge. Table 2 shows the number of entities in each dictionary.</Paragraph> <Paragraph position="1"> Organization name has two types of dictionary; one for proper names and the other for general nouns. An example of the latter case is &quot;Executive Staff&quot;, mentioned before. Creating the special dictionaries is not very easy, but it is not very laborious work. The initial dictionary was built in about a week. In the course of the system development, in particular while creating the training corpus, we added some entities to the dictionaries.</Paragraph> <Paragraph position="2"> The decision tree gives an output for the beginning and the ending position of each token. It is one of the 4 possible combinations of opening, continuation and closing for each named entity type, or having no named entity, as shown in Table 3. When we have 8 named entity types, there are 33 kinds of output. For example, if an organization name covers three words, A, B and C, and the next word D has no named entity, then we will have the following data:</Paragraph> <Paragraph position="3"> A: org-OP-CN, B: org-CN-CN, C: org-CN-CL, D: none</Paragraph> <Paragraph position="4"> Note that there is no overlapping or embedding of named entities. An example of real data is shown in Appendix A.</Paragraph> <Paragraph position="5"> There could be a problem, in the testing phase, if we just use the deterministic decision created by the tree. Because the decisions are made locally, the system could make an inconsistent sequence of decisions overall. For example, one token could be tagged as the opening of an organization, while the next token might be tagged as the closing of a person name. We can think of several strategies to solve this problem (for example, the method adopted by (Bennett et al., 1997) will be described in a later section), but we used a probabilistic method.</Paragraph> <Paragraph position="6"> There will usually be more than one tag in the leaf of a decision tree. At a leaf we don't just record the most probable tag; rather, we keep the probabilities of all possible tags for that leaf. In this way we can salvage cases where a tag is part of the most probable globally-consistent tagging of the text, even though it is not the most probable tag for this token, and so would be discarded if we made a deterministic decision at each token. Note that we did not apply a smoothing technique, which might have helped avoid the data sparseness problem. More about the probabilistic method will be explained in the next section.</Paragraph>
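To make the tag inventory and the per-leaf probabilities concrete, here is a minimal Python sketch. It is our own illustration, not the authors' code; the concrete tag names and the probability values are assumptions made only for demonstration.

```python
# Illustrative sketch of the per-token output classes described above.
# A token's tag is either 'none' or a pair of begin/end decisions for one
# entity type: OPen or CoNtinue at the token's start, CoNtinue or CLose
# at its end.

PAIRS = ["OP-CN", "OP-CL", "CN-CN", "CN-CL"]   # OP-CL covers a one-token entity

def tagset(ne_types):
    """E.g. with 8 entity types this yields the 33 output classes mentioned above."""
    return ["none"] + [f"{t}-{p}" for t in ne_types for p in PAIRS]

assert len(tagset([f"type{i}" for i in range(8)])) == 33   # 8 * 4 + 1

# The three-word organization example: A opens, B continues, C closes, D is outside.
example = {"A": "org-OP-CN", "B": "org-CN-CN", "C": "org-CN-CL", "D": "none"}

# A decision-tree leaf stores a probability for every tag observed there, not
# just the most frequent one (the values here are made up for illustration):
leaf_distribution = {"none": 0.6, "person-OP-CL": 0.3, "loc-OP-CN": 0.1}
```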
<Section position="1" start_page="172" end_page="173" type="sub_section"> <SectionTitle> Training Phase </SectionTitle> <Paragraph position="0"> First, the training sentences are segmented and part-of-speech tagged by JUMAN. Then each token is analyzed by its character type and is matched against entries in the special dictionaries. One token can match entries in several dictionaries. For example, &quot;Matsushita&quot; could match the organization, person and location dictionaries.</Paragraph> <Paragraph position="1"> Using the training data, a decision tree is built. It learns about the opening and closing of named entities based on the three kinds of information of the previous, current and following tokens.</Paragraph> <Paragraph position="2"> The three types of information are the part-of-speech, character type and special dictionary information described above.</Paragraph> </Section> <Section position="2" start_page="173" end_page="173" type="sub_section"> <SectionTitle> Testing Phase </SectionTitle> <Paragraph position="0"> In the testing phase, the first three steps, token segmentation and part-of-speech tagging by JUMAN, analysis of character type, and special dictionary look-up, are identical to those in the training phase. Then, in order to find the probabilities of opening and closing a named entity for each token, the properties of the previous, current and following tokens are examined against the decision tree. Appendix B shows two example paths in the decision tree. For each token, the probabilities of 'none' and the four combinations of answer pairs for each named entity type are assigned. For instance, if we have 7 named entity types, then 29 probabilities are generated.</Paragraph> <Paragraph position="1"> Once the probabilities for all the tokens in a sentence are assigned, the remaining task is to discover the most probable consistent path through the sentence. Here, a consistent path means, for example, that a path can't have org-OP-CN and date-OP-CL in a row, but can have loc-OP-CN and loc-CN-CL. The output is generated from the consistent sequence with the highest probability for each sentence. The Viterbi algorithm is used in the search; this can be run in time linear in the length of the input.</Paragraph> </Section> </Section> <Section position="7" start_page="173" end_page="173" type="metho"> <SectionTitle> 4 Example </SectionTitle> <Paragraph position="0"> Appendix A shows an example sentence along with the three types of information, part-of-speech, character type and special dictionary information, and the information on the opening and closing of named entities. Appendix B shows two example paths in the decision tree. For the purpose of demonstration, we used the seventh and eighth tokens of the example sentence in Appendix A.</Paragraph> <Paragraph position="1"> Each line corresponds to a question asked by the tree nodes along the path. The last line shows the probabilities of named entity information which have non-zero probability. This instance demonstrates how the probability method works. As we can see, the probability of none for the seventh token (Isuraeru = Israel) is higher than that for the opening of an organization (0.67 to 0.33), but for the eighth token (Keisatsu = Police), the probability of closing an organization is much higher than that of none (0.86 to 0.14). The combined probabilities of the two consistent paths are calculated. One of these paths makes the two tokens an organization entity, while along the other path, neither token is part of a named entity. The probability is higher in the first case (0.28) than in the latter case (0.09), so the two tokens are tagged as an organization entity.</Paragraph> </Section>
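The consistent-path search just illustrated can be sketched as follows. This is our own simplified Python illustration under assumed data structures, not the system's actual implementation; the final lines reproduce the two-token calculation discussed above.

```python
# A Viterbi-style dynamic program over per-token tag distributions: path
# probabilities are multiplied along the sentence and inconsistent adjacent
# tags are forbidden.  Tag names follow the notation used above, e.g. 'org-OP-CN'.

def etype(tag):   # entity type, e.g. 'org'; None for 'none'
    return None if tag == "none" else tag.split("-")[0]

def begins(tag):  # 'OP' (token opens an entity) or 'CN' (continues one)
    return None if tag == "none" else tag.split("-")[1]

def ends(tag):    # 'CN' (entity stays open after this token) or 'CL' (it closes)
    return None if tag == "none" else tag.split("-")[2]

def consistent(prev, cur):
    """An entity left open by prev (ending in CN) must be continued by cur with
    the same type; otherwise cur must be 'none' or open a new entity."""
    if prev != "none" and ends(prev) == "CN":
        return cur != "none" and begins(cur) == "CN" and etype(cur) == etype(prev)
    return cur == "none" or begins(cur) == "OP"

def viterbi(token_probs):
    """token_probs: one {tag: probability} dict per token, read off the decision
    tree leaves.  Returns (score, tags) for the most probable consistent path."""
    # The first token behaves as if it followed 'none': no entity may be open yet.
    paths = {t: (p, [t]) for t, p in token_probs[0].items() if consistent("none", t)}
    for dist in token_probs[1:]:
        new_paths = {}
        for tag, prob in dist.items():
            # For each tag, keep only the best-scoring consistent extension.
            candidates = [(score * prob, seq + [tag])
                          for prev, (score, seq) in paths.items()
                          if consistent(prev, tag)]
            if candidates:
                new_paths[tag] = max(candidates)
        paths = new_paths
    # A sentence may not end with an entity still left open.
    return max(v for t, v in paths.items() if t == "none" or ends(t) == "CL")

# The seventh and eighth tokens of the Appendix A sentence, with the
# probabilities quoted above:
probs = [{"none": 0.67, "org-OP-CN": 0.33},
         {"none": 0.14, "org-CN-CL": 0.86}]
print(viterbi(probs))  # roughly (0.284, ['org-OP-CN', 'org-CN-CL']): 0.33*0.86 = 0.28 beats 0.67*0.14 = 0.09
```

The consistency check is simply the natural reading of the tag notation above; the real system may enforce additional constraints.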
<Section position="8" start_page="173" end_page="175" type="metho"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"> In this section, the experiments will be described. We chose two domains for the experiments. One is the vehicle accident report domain. Newspaper articles in this domain report accidents of vehicles, like cars, trains or airplanes.</Paragraph> <Paragraph position="1"> The other is the executive succession domain; articles in this domain report succession events of executives, like president, vice president or CEO. We have 103 training articles in the accident domain, which contain 2,368 NE's, and 11 evaluation articles which were hidden from the developer. In the evaluation articles, there are 258 NE items (58 organization, 30 person, 100 location, 47 date, 21 time and 2 money expressions). Also, we have 70 training articles, which contain 2,406 NE's, and 17 evaluation articles in the succession domain. In the evaluation articles, there are 566 NE items (113 organization, 114 person, 67 location, 183 position, 77 date, 1 time, 9 money and 2 percent expressions).</Paragraph> <Section position="1" start_page="173" end_page="173" type="sub_section"> <SectionTitle> 5.1 Accident Report Domain </SectionTitle> <Paragraph position="0"> First, we will report on the experiment on the accident domain. Basically, this is the initial target domain of the system.</Paragraph> <Paragraph position="1"> The result is shown in Table 4. The F-scores based on recall and precision are shown. 'Recall' is the percentage of the correct answers among the answers in the key provided by humans. 'Precision' is the percentage of the correct answers among the answers proposed by the system. 'F-score' is a measurement combining the two figures. See (Tipster2, 1996) for a more detailed definition of F-score, recall and precision. They are compared with the results produced by JUMAN's part-of-speech information and the average scores in MET1, reported in (Tipster2, 1996). The result from JUMAN is created based on JUMAN version 3.3's output alone (the latest version may have better performance than the results reported here; also remember that the definitions are different). When it identifies a sequence of locations, persons or other proper nouns, then we tag the sequence with location, person or organization, respectively.</Paragraph> <Paragraph position="2"> The MET1 evaluation was conducted on completely different texts and on a different domain, so it is not directly comparable, but since the task definitions are almost the same, we believe it gives a rough point of comparison. Note that for the MET1 evaluation, there were about 300 training articles compared to our 100 training articles. Also, they did not report the scores by each individual participant.</Paragraph>
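For reference, the scoring measures quoted here relate as in the short sketch below. This assumes the standard balanced F-measure of the MUC/MET scoring scheme, and the counts are hypothetical, not figures from Table 4.

```python
# Recall, precision and F-score as defined above, assuming the balanced
# (beta = 1) F-measure used in the MUC/MET evaluations; see (Tipster2, 1996)
# for the official definitions.

def ne_scores(num_correct, num_in_key, num_proposed):
    recall = num_correct / num_in_key        # correct answers / answers in the human key
    precision = num_correct / num_proposed   # correct answers / answers proposed by the system
    f_score = 2 * precision * recall / (precision + recall)
    return recall, precision, f_score

# Hypothetical counts for illustration only (258 is the size of the accident
# evaluation key quoted above; the other two numbers are made up):
print(ne_scores(num_correct=220, num_in_key=258, num_proposed=250))
```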
<Paragraph position="3"> We believe these results are quite good and indicate the capability of our system. In terms of execution time, the training phase takes about 5 minutes, of which JUMAN and the decision tree creation take most of the time. It takes less than a minute to create the named entity output, and again JUMAN takes the bulk of the time.</Paragraph> </Section> <Section position="2" start_page="173" end_page="175" type="sub_section"> <SectionTitle> 5.2 Issue of Training Size </SectionTitle> <Paragraph position="0"> It is quite nice that we can get this level of performance with only about 100 training articles. It is interesting to investigate how much training data is needed to achieve good performance.</Paragraph> <Paragraph position="1"> We created 8 small training sets of different sizes, and ran the system using these training data.</Paragraph> <Paragraph position="2"> Note that we used the same dictionaries for all the experiments, which were generated by several means including the items in the entire training data. Table 5 shows the results. The size of the training set is indicated by the number of articles and the number of NE's in the training data.</Paragraph> <Paragraph position="3"> It is amazing that the performance is not greatly degraded even with 9 articles. Also, even with only one article, our system can achieve a 68 F-score. Actually, the three sets of 1-article training data were created from each article in the 3-article training data, and we can see that the performance using the 3-article training data is mainly derived from the high-performance single article. So, we believe that once you have good-coverage dictionaries and some amount of standard patterns in the training data, the system can achieve fairly good performance. We observed that the article which gives high performance contains a good variety of many named entities.</Paragraph> <Paragraph position="4"> In general, one of the advantages of automatic learning systems is their portability. In this sub-section, we will report an experiment of moving the system to a new domain, the executive succession domain. Also, in order to see the portability of the system, we add a new kind of named entity. In this domain, executive positions appear very often and it is an important entity type for understanding those articles. So, we add a new entity class, 'position'. When porting the system, only the following two changes are required.</Paragraph> <Paragraph position="5"> 1. Add a new dictionary: Create a new dictionary for positions. In practice, many of them were listed in the person prefix dictionary in the previous experiment. So we separate them and add several position names which appeared in or could be inferred from the training data. This took less than an hour. Note that we did not change any other dictionaries, i.e. the organization, location dictionaries, etc. We believe that these dictionaries can be relatively domain independent.</Paragraph> <Paragraph position="6"> 2. Modify the program: Assign a new ID number for the position entity in the decision tree program and modify the input/output routine accordingly. This also took less than an hour.</Paragraph> <Paragraph position="7"> In less than two hours for the system modification, and about a day's work for the preparation of the training data, the new system becomes runnable.</Paragraph>
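As an illustration of how small the program-side change is, the sketch below reuses the assumed tag inventory from the earlier sketch. It is our own analogy to the ID-number change described above, not the actual C4.5-based program, and the seven original type names are assumptions.

```python
# Registering the new 'position' class only extends the entity-type list,
# growing the output classes from 7*4 + 1 = 29 to 8*4 + 1 = 33.
PAIRS = ["OP-CN", "OP-CL", "CN-CN", "CN-CL"]
NE_TYPES = ["org", "person", "loc", "date", "time", "money", "percent"]
assert len(["none"] + [f"{t}-{p}" for t in NE_TYPES for p in PAIRS]) == 29

NE_TYPES.append("position")          # the one substantive program-side change
TAGSET = ["none"] + [f"{t}-{p}" for t in NE_TYPES for p in PAIRS]
assert len(TAGSET) == 33
```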
<Paragraph position="8"> Table 6 shows the result of the experiment. The result is quite satisfactory. However, it is not as good as the result in the previous domain, in particular for organization and location. Observing the output, we noticed domain idiosyncrasies which we had not thought of before. For example, in the new domain, there are many Chinese company names, which have the suffix &quot;Yuugenkoushi&quot;. This is never used for Japanese company names and we don't have the suffix in our organization suffix dictionary. Another interesting example is the Chinese character &quot;Shou&quot;. In Japanese, the character is used as a suffix of official organizations, like &quot;Monbu-Shou&quot; (Department of Education), but in Chinese it is used as a suffix of location names, like &quot;Kanton-Shou&quot; (Canton District). In the accident domain, we did not encounter such Chinese location names, so we just had the token in the organization suffix dictionary. This led to many errors in location names in the new domain. Also, we find many unfamiliar foreign location names and company names. We believe these make the result relatively worse.</Paragraph> </Section> <Section position="3" start_page="175" end_page="175" type="sub_section"> <SectionTitle> 5.4 Domain Dependency </SectionTitle> <Paragraph position="0"> As we have training and evaluation data on two different domains, it is interesting to observe the domain dependency of the system. Namely, we will see how the performance differs if we use the knowledge (decision tree) created from a different domain. We conducted two new experiments, tagging named entities for texts in the succession domain based on the decision tree created for the accident domain, and vice versa.</Paragraph> <Paragraph position="1"> Table 7 shows the comparison of these results. The performance in the accident domain decreased from 85 to 71 using the decision tree of the other domain. Also, the performance decreased from 82 to 59 in the succession domain.</Paragraph> <Paragraph position="2"> The result demonstrates the domain dependency of the method used, at least for the two domains. Obviously, making a general comment based on these small experiments is dangerous, but it suggests that we should consider the domain dependency when we port the system to a new domain.</Paragraph> </Section> </Section> <Section position="9" start_page="175" end_page="175" type="metho"> <SectionTitle> 6 Related Work </SectionTitle> <Paragraph position="0"> There have been several efforts to apply machine learning techniques to the same task (Cowie, 1995) (Bikel et al., 1997) (Gallippi, 1996) (Bennett et al., 1997) (Borthwick et al., 1997). In this section, we will discuss a system which is one of the most advanced and which closely resembles our own (Bennett et al., 1997). A good review of most of the other systems can be found in their paper.</Paragraph> <Paragraph position="1"> Their system uses the decision tree algorithm and almost the same features. However, there are significant differences between the systems. The main difference is that they have more than one decision tree, each of which decides if a particular named entity starts/ends at the current token. In contrast, our system has only one decision tree which produces probabilities of information about the named entity. In this regard, we are similar to (Bikel et al., 1997), which also uses a probabilistic method in their HMM-based system. This is a crucial difference which also has important consequences.
Because the system of (Bennett et al., 1997) makes multiple decisions at each token, they could assign multiple, possibly inconsistent tags. They solved the problem by introducing two somewhat idiosyncratic methods. One of them is the distance score, which is used to find an opening and closing pair for each named entity mainly based on distance information. The other is a tag priority scheme, which chooses a named entity among different types of overlapping candidates based on the priority order of named entities. These methods require parameters which must be adjusted when they are applied to a new domain. In contrast, our system does not require such methods, as the multiple possibilities are resolved by the probabilistic method. This is a strong advantage, because we don't need manual adjustments.</Paragraph> <Paragraph position="2"> The result they reported is not comparable to our result, because the text and definition are different. But the total F-score of our system is similar to theirs, even though the size of our training data is much smaller.</Paragraph> </Section> </Paper>