<?xml version="1.0" standalone="yes"?> <Paper uid="C92-4199"> <Title>Recognizing Unregistered Names for Mandarin Word Identification</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Word Identification (WI, also known as Segmentation) has been an important and active issue in Chinese Natural Language Processing. Various approaches have been proposed for this problem [1], such as the MM (Maximum Matching) method [8], the RMM (Reverse Directional Maximum Matching) method, the OM (Optimum Matching) method, statistical approaches [5], and unification approaches [12]. However, a number of problems must still be overcome before a satisfactory WI system is possible. Among them are a clear definition of Chinese words, an objective evaluation suite with appropriate corpora, and the processing of unknown words (such as personal names, place names, and organization names).</Paragraph> <Paragraph position="1"> In this paper, we deal with the problem of unknown words, especially personal names, although the proposed approach can be easily extended to cover place names and organization names. According to Chang, et al. [2], proper nouns (which make up a major part of unknown words) account for more than fifty percent of the errors made by a typical system. Thus, successful processing of proper nouns is essential for a satisfactory WI system.</Paragraph> <Paragraph position="2"> Almost all WI systems use a lexicon to guide the segmentation process. In fixed domains such as a classical novel or technical texts, we can put all possible words in the lexicon and avoid the unknown-word problem. However, in a dynamic domain such as newspapers, it is impossible to enumerate all possible words in advance. For example, some personal names, such as those of suspects or victims, often appear in only one day's news. Thus, recognition of these personal names and other unknown words is very important.
Chang, et al. [2] (at National Tsing-Hua University, Hsinchu, Taiwan) proposed a Multiple-Corpus approach to solve the problem. They consider the WI problem as a constraint satisfaction problem (CSP) and use a number of corpora to train their statistic-based system. The probabilities of each Chinese character occurring as a surname, as the first character of a first name, and as the second character of a first name are computed from the training corpora. Using these statistics, two-character and three-character personal names are proposed to compete with the words in the lexicon. Then, a dynamic programming technique is used to decide the most probable solution to the CSP. They reported an average correct rate of 90 percent for surname-name identification. To the best of our knowledge, this is the only group that has proposed a solution to the problem.</Paragraph> <Paragraph position="3"> Chang's approach is completely statistic-based and easy to implement. However, we argue that syntactic and semantic information must be considered in a successful WI system.</Paragraph> </Section> </Paper>