<?xml version="1.0" standalone="yes"?> <Paper uid="X96-1054"> <Title>NameTag™ Japanese and Spanish Systems as Used for MET</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 MET System Description </SectionTitle> <Paragraph position="0"> For MET, we used NameTag in its Japanese and Spanish configurations. NameTag is an automated text indexing system that recognizes and classifies names and other key phrases such as time and numeric expressions. It is an enhanced offspring, implemented in C++, of the preprocessing module of SRA's multilingual natural language processing system [1]. NameTag combines dynamic pattern recognition with static lexical look-up to achieve high recall and precision at high speed.</Paragraph> <Paragraph position="1"> The NameTag engine is designed to be multilingual. The same engine serves different languages through language-specific &quot;plug-ins&quot; such as tokenizers, patterns, lexical data, alias generators, morphological analyzers, and segmenters. Table 1 compares plug-ins used for different languages/tasks.</Paragraph> <Paragraph position="2"> NameTag has several unique features besides being able to handle multiple languages. First, it can generate and link aliases of names automatically using language-specific alias generators. For example, the Spanish system can recognize &quot;Peña&quot; as an alias for &quot;José Francisco Peña Gómez&quot; by generating the paternal name alone. 
The Japanese system currently performs limited organization alias recognition, such as recognizing &quot;DAIWA&quot; as an alias for &quot;DAIWABANK.&quot; *I'd like to thank Mila Ramos-Santacruz and Misa Miyachi, who assisted me with manual tagging of development texts and lexical development for Spanish and Japanese respectively, and Kevin Hausman, who is the principal developer of the C++ NameTag engine.</Paragraph> <Paragraph position="3"> Second, SRA's Japanese segmenter, which uses a large lexicon, morphological analysis, and heuristic-based rules to segment a sentence into words, is fully integrated into the NameTag engine. Thus, the system can utilize the results of name recognition in the subsequent segmentation process and increase segmentation accuracy.</Paragraph> <Paragraph position="4"> Third, NameTag can take advantage of SGML markers to improve performance. For MET, headlines (marked by &lt;SLUG&gt; or &lt;HL&gt;) were processed after the main body of text (marked by &lt;TXT&gt;).</Paragraph> <Paragraph position="5"> This was advantageous for Japanese because any entity recognized in the main body was utilized in segmentation of headlines. It also increased precision for Spanish because only lexicon and alias lookups were applied to all-upper-case headlines, avoiding spurious or erroneous names generated by patterns.</Paragraph> <Paragraph position="6"> Finally, NameTag provides a GUI-based multilingual development environment which facilitates rapid development of patterns.</Paragraph> </Section> <Section position="3" start_page="0" end_page="463" type="metho"> <SectionTitle> 3 MET Results and Analysis </SectionTitle> <Paragraph position="0"> The MET final blind tests were conducted using 100 Kyodo articles for Japanese and 100 AFP articles for Spanish. 
These articles were retrieved using the keyword &quot;press conference.&quot; Thus, they encompassed various subject domains, including business, politics, sports, and arts, unlike the Wall Street Journal articles used for MUC-6. In addition, the MET guidelines imposed requirements not found in MUC-6, such as tagging of relative dates and tagging organizations as locations when they are used as facilities. Despite the differences in types of articles and the additional requirements, the NameTag Japanese and Spanish systems achieved high performance in both recall and precision. The Japanese system is slower (15 MB/h on a SPARC 20) than its Spanish and English counterparts (93 MB/h and 80 MB/h respectively) because of the segmentation overhead. Both the Japanese and Spanish systems still have room for higher recall, partly because of shorter development time and partly because of difficult language-specific issues yet to be solved, which we discuss below.</Paragraph> <Section position="1" start_page="463" end_page="463" type="sub_section"> <SectionTitle> 3.1 Japanese-specific Issues </SectionTitle> <Paragraph position="0"> The MET evaluation has revealed several Japanese-specific challenges which must be solved in order for the system to achieve even higher performance.</Paragraph> <Paragraph position="1"> First, we have encountered what we call chicken-and-egg problems. Good name recognition requires good segmentation, as name recognition patterns rely on properties of words segmented by the segmenter such as part-of-speech and other linguistic attributes.</Paragraph> <Paragraph position="2"> However, good segmentation, in turn, relies on good name recognition, as names are usually not in the lexicons and thus tend to cause segmentation errors.</Paragraph> <Paragraph position="3"> As discussed in Section 2, NameTag can utilize the results of name recognition in subsequent segmentation to partially solve this problem. 
Additionally, it is essential that the segmenter be more robust and accurate in order to improve performance on name recognition and other Japanese text processing tasks even further.</Paragraph> <Paragraph position="4"> Another chicken-and-egg problem was encountered in constructions where a person name and an organization name appear next to each other in a sentence with no delimiter between the two (e.g., &quot;ABCDEFG,&quot; where ABC is a person name and DEFG an organization name, with no space or other punctuation in between). Here, recognizing the person name requires recognizing the adjacent organization name first, while recognizing the organization name requires recognizing the person name first. In these cases, the system often mistags the whole string as a person and misses the organization name.</Paragraph> <Paragraph position="5"> The second big challenge is dealing with Japanese aliases, which are more complex than English aliases.</Paragraph> <Paragraph position="6"> The NameTag Japanese system currently generates aliases like &quot;SILICONGRAPHICS&quot; for &quot;SILICONGRAPHICSCORP.&quot; by stripping off certain corporate designators at the end of names. But it does not currently generate an alias which is a character subsequence of its full name like &quot;NIKKOU&quot; for &quot;NIHONKOUKUU.&quot; Since aliases are, by definition, already recognized as names in a given article, they often appear in contexts where patterns do not apply. 
In these cases, not generating aliases results in missing names (i.e., loss in recall).</Paragraph> </Section> <Section position="2" start_page="463" end_page="463" type="sub_section"> <SectionTitle> 3.2 Spanish-specific Issues </SectionTitle> <Paragraph position="0"> In addition to the general differences between MET and the MUC-6 NE task described earlier, there were a few Spanish-specific issues which had to be tackled for MET.</Paragraph> <Paragraph position="1"> In the MET Spanish articles, the capitalization convention was rather unpredictable (e.g., &quot;Oficina de lucha contra la droga,&quot; &quot;puerto cubano de Mariel&quot;). Thus, capitalization was not as useful a clue for recognizing proper names in Spanish MET texts as in English WSJ texts. Consequently, the Spanish system needed to perform deeper analysis of the texts to achieve comparable results.</Paragraph> <Paragraph position="2"> The presence of &quot;de&quot; in Spanish person names made person name recognition more difficult, as &quot;de&quot; is also a preposition, and sometimes caused a Spanish version of the chicken-and-egg problem. For example, the system thought &quot;Valle Rivas&quot; in &quot;Olijela del Valle Rivas&quot; was a location, as &quot;valle&quot; also means &quot;valley.&quot; On the other hand, it tagged &quot;Roverto Marquevich de San Isidro&quot; as a full person name, though here &quot;de&quot; is a preposition and &quot;San Isidro&quot; is a location name.</Paragraph> </Section> </Section> <Section position="4" start_page="463" end_page="463" type="metho"> <SectionTitle> 4 Summary </SectionTitle> <Paragraph position="0"> The MET evaluation has demonstrated that NameTag can be ported to other languages with a level of performance similar to English, despite various language-specific challenges. We plan to port it to other unsegmented languages such as Chinese and Thai in addition to other European languages.</Paragraph> </Section> </Paper>