<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0140"> <Title>Chinese Named Entity Recognition with a Multi-Phase Model</Title> <Section position="5" start_page="214" end_page="215" type="evalu"> <SectionTitle> 3 Experimental results </SectionTitle> <Paragraph position="0"> We participated in three GB tracks of the Third International Chinese Language Processing Bakeoff: NER msra-closed, NER msra-open and WS msra-open. In the closed track, we constructed all dictionaries using only words that appear in the training corpus. We also did not reuse the feature-character lists for location names and organization names from the open tracks; instead, we collected the feature characters from the closed-track training data. We constructed these lists as follows. First, we extracted all suffix strings of all location names and organization names in the training data and counted how often each suffix string occurs across those names. Second, we checked whether each suffix string is a known word and discarded it if it is not. Finally, from the remaining suffix words, we selected as feature characters those whose counts exceed a threshold. We set different thresholds for single-character feature words and multi-character feature words.</Paragraph> <Paragraph position="1"> A similar approach was used to collect common Chinese surnames in the closed track.</Paragraph> <Paragraph position="2"> When preparing training data for the segmentation model, we adopted different tagging methods for organization names in the closed track and in the open track. In the closed track, we treated every organization name, such as &quot;Nei Meng Gu Ren Min Chu Ban She&quot;, as a single word. In the open track, however, we segmented a long organization name into several words. 
For example, the organization name &quot;Nei Meng Gu Ren Min Chu Ban She&quot; would be divided into three words: &quot;Nei Meng Gu&quot;, &quot;Ren Min&quot; and &quot;Chu Ban She&quot;. These different tagging methods at the segmentation phase affect organization name recognition differently. The training data used in the open tracks is the same size as that used in the closed track.</Paragraph> <Paragraph position="3"> We did not use any additional training data in the open tracks. Table 3 shows the performance of our systems for NER in the bakeoff.</Paragraph> <Paragraph position="4"> For the separate word segmentation (WS) task, the NER process described above is performed first. We then applied several additional processing steps to the named entity recognition output. Disambiguation is one of the key issues in Chinese word segmentation. In this task, some ambiguities were resolved with a rule set that was constructed automatically by error-driven learning. The preconstructed rule set stores many pseudo-ambiguity strings together with their correct segmentations. After analyzing the output of our CRF-based NER model, we noticed that it achieves high recall on out-of-vocabulary (OOV) words. At the same time, however, some characters and words were wrongly combined into new words, which lowered OOV precision and in-vocabulary (IV) recall. To address this, we adopted an unconditional rule: if a word other than a recognized named entity was detected as a new word and its length was more than 6 Chinese characters, it was segmented into several in-vocabulary words using a combination of the forward maximum matching (FMM) and backward maximum matching (BMM) methods. Table 4 shows the results of our systems for word segmentation in the bakeoff.</Paragraph> </Section> </Paper>
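As a minimal sketch, the feature-character collection and the long-word rule described in this section could be written as follows. The function names, the toy string representation (one Python character standing for one Chinese character), and the test for a "new word" (not in the vocabulary) are assumptions made for illustration. In particular, the paper does not say how the FMM and BMM results are combined; the tie-breaking used here (fewer words, then fewer single-character words, then the backward result) is a common heuristic, not the authors' stated method.

```python
from collections import Counter


def collect_feature_suffixes(names, known_words, single_threshold, multi_threshold):
    """Pick frequent suffix words of entity names as feature characters.

    All parameter names are hypothetical; the paper gives no threshold values.
    """
    counts = Counter()
    for name in names:
        # Count every suffix of the name (including the full name itself).
        for start in range(len(name)):
            counts[name[start:]] += 1

    selected = set()
    for suffix, count in counts.items():
        if suffix not in known_words:
            continue  # suffix strings that are not known words are discarded
        # Separate thresholds for single- and multi-character suffix words.
        threshold = single_threshold if len(suffix) == 1 else multi_threshold
        if count > threshold:
            selected.add(suffix)
    return selected


def forward_max_match(chars, vocab, max_len):
    """Forward maximum matching: greedily take the longest word from the left."""
    words, i = [], 0
    while i < len(chars):
        for size in range(min(max_len, len(chars) - i), 0, -1):
            piece = chars[i:i + size]
            if size == 1 or piece in vocab:  # fall back to a single character
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(chars, vocab, max_len):
    """Backward maximum matching: greedily take the longest word from the right."""
    words, j = [], len(chars)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = chars[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def resegment_long_new_word(word, vocab, is_named_entity=False, length_limit=6):
    """Split a long new word into in-vocabulary words; leave other words alone."""
    if is_named_entity or word in vocab or len(word) <= length_limit:
        return [word]
    max_len = max((len(w) for w in vocab), default=1)
    fwd = forward_max_match(word, vocab, max_len)
    bwd = backward_max_match(word, vocab, max_len)

    def cost(seg):
        # Assumed heuristic: fewer words first, then fewer single characters.
        return (len(seg), sum(1 for w in seg if len(w) == 1))

    return fwd if cost(fwd) < cost(bwd) else bwd
```

Because the functions only slice their input and test set membership, they also accept tuples of romanized syllables, provided the vocabulary holds tuples as well.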