<?xml version="1.0" standalone="yes"?> <Paper uid="C92-4199"> <Title>Recognizing Unregistered Names for Mandarin Word Identification</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 A Sublanguage Approach </SectionTitle> <Paragraph position="0"> The concept of sublanguages (i.e., languages in restricted domains) has been considered very important in natural language processing [6, 7]. A sublanguage usually has its own special syntax, semantics, and style, which are more restricted than those of the language as a whole. In this paper, we show how the study of a sublanguage can help identify names and form them in a dynamic, adaptive way.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Observation </SectionTitle> <Paragraph position="0"> From the United News, one of the most popular daily newspapers in Taiwan, we have acquired a newspaper corpus of more than one million characters.</Paragraph> <Paragraph position="1"> This corpus has been used for building our lexicon, computing statistics, and testing our WI systems for spell-checking, preprocessing for speech synthesis, and phoneme-to-word conversion. [ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992, p. 1239; PROC. OF COLING-92, NANTES, AUG. 23-28, 1992]</Paragraph> <Paragraph position="2"> After studying the segmentation output of the newspaper corpus, we observed that (1) unknown words are mostly personal names (translation names or otherwise), place names, and organization names, in addition to words that should have been included in the lexicon (a similar conclusion was reached in Chang's papers); and (2) when a personal name appears for the first time, it is usually accompanied by a title (such as taibei shizhang, Taipei mayor) or a role noun (such as jizhe, reporter, or houxianren, 
candidate).</Paragraph> <Paragraph position="3"> From these observations, we propose the following mechanisms to help identify unknown words in the WI process: (1) title-driven name recognition and (2) adaptive dynamic word formation.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Title-driven Name Recognition </SectionTitle> <Paragraph position="0"> As mentioned above, it is not practical to put all proper names in the lexicon for a dynamic domain such as news articles. Since a new personal name usually appears with a title or a role noun, we can use this clue to design a set of word formation rules in our parsing-based WI system [11] (see the next section). Part of the rule set, in augmented CFG format, is: (1) &lt;name&gt; ← &lt;title&gt; &lt;last&gt; &lt;first&gt; { Build &lt;last&gt; &lt;first&gt; as a name } (2) &lt;name&gt; ← &lt;last&gt; &lt;first&gt; &lt;title&gt; { Build &lt;last&gt; &lt;first&gt; as a name } (3) &lt;title&gt; ← &lt;word&gt; { Test if &lt;word&gt; is a title } (4) &lt;last&gt; ← &lt;word&gt; { Test if &lt;word&gt; is a surname } (5) &lt;first&gt; ← &lt;word&gt; { Test if &lt;word&gt; is 1- or 2-char } (6) &lt;first&gt; ← &lt;word&gt; &lt;word&gt; { Test if both &lt;word&gt; are 1-char } A Chinese name usually consists of two to four characters: a one- or two-character surname and a one- or two-character first name. Furthermore, surnames come from a limited set. Thus, in rule 4, the augmented part is just a membership test. We can store the surname information as a feature in the lexical entries. Similarly, we have title and role features in the lexicon for rule 3. Note that in the current design, translation names of foreigners and the husband's surname prefixed to a married woman's name cannot be correctly identified. 
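The rules above can be sketched in Python as follows. This is only a minimal illustration, not the system's implementation (which is a word grammar run by a generalized LR parser): the function names, the tiny title and surname sets, and the toneless romanized input are all assumptions made for the example. Each word is a tuple of syllables, one syllable per character.

```python
# Hypothetical lexicon features; the real system stores them in lexical entries.
TITLES = {("ji", "zhe"), ("lao", "shi")}   # reporter, teacher
SURNAMES = {("ye",), ("ni",)}


def given_name_candidates(words, i):
    """Return (given_name, next_index) pairs allowed by the two first-name rules."""
    candidates = []
    # First-name rule with two consecutive 1-character words.
    if len(words) >= i + 2 and len(words[i]) == 1 and len(words[i + 1]) == 1:
        candidates.append((words[i] + words[i + 1], i + 2))
    # First-name rule with a single word of 1 or 2 characters.
    if len(words) >= i + 1 and len(words[i]) in (1, 2):
        candidates.append((words[i], i + 1))
    return candidates


def title_driven_names(words):
    """Collect names licensed by the title-first or the title-last name rule."""
    names = []
    for i, word in enumerate(words):
        # Title, surname, given name.
        if word in TITLES and len(words) > i + 1 and words[i + 1] in SURNAMES:
            candidates = given_name_candidates(words, i + 2)
            if candidates:
                # Assumption: take the longest given name first.
                names.append(words[i + 1] + candidates[0][0])
        # Surname, given name, title.
        if word in SURNAMES:
            for given, nxt in given_name_candidates(words, i + 1):
                if len(words) > nxt and words[nxt] in TITLES:
                    names.append(word + given)
                    break
    return names


# Made-up input: a title followed by a surname and two 1-character words.
sentence = [("ji", "zhe"), ("ye",), ("ying",), ("hao",), ("bao", "dao")]
print(title_driven_names(sentence))
```

A real parser would keep every candidate and let the preference scores decide among them; this sketch simply commits to the longest given name.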
Despite these limitations, the approach works for common personal names, which account for a major part of unknown words.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Adaptive Dynamic Word Formation </SectionTitle> <Paragraph position="0"> After a new personal name is recognized through the set of rules described above, the system dynamically builds a lexical entry for it. Thus, if the name appears in later sentences of the news article, it can be correctly identified.</Paragraph> <Paragraph position="1"> Figure 1 shows an example of adaptive dynamic word formation. In the article, there are four Chinese names: ni2 shu2 yan2 (4 instances), ye4 ying1 hao2 (1 instance), cai4 jia1 ting2 (4 instances), and wu2 xun2 long2 (1 instance). In its first instance, each of the four names comes with a title: lao3shi1 (teacher), ji4zhe3 (reporter), er2tong2 (child), and jian3cha2guan1 (prosecutor). Since the names are added to the lexicon dynamically, later instances of these names can be identified with higher scores than names without a title. In other words, names with a title are built with much more confidence.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Names without Title </SectionTitle> <Paragraph position="0"> In addition to names with a title or role, other personal names are proposed through a surname-driven rule. In other words, when the WI system meets a surname word, a personal-name proposing rule is invoked, although its preference score is much lower than that of regular words and names with a title.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.5 Place Names and Organization Names </SectionTitle> <Paragraph position="0"> The proposed mechanism can be extended to cover place names and organization names. 
Just as personal names appear with titles, place names can be identified through unit words such as xian (county), shi (city), jie (street), and lu (road). Similarly, organization names can be identified by type words such as gongsi (company), bu (department or ministry), and ke (section). This part has not yet been implemented in our system.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The System </SectionTitle> <Paragraph position="0"> Since July 1986, we have been involved in developing a series of Chinese-related NLP systems, including an English-Chinese MT system, a Japanese-Chinese MT system, a Chinese Word Knowledge Base, a Chinese Parser, and a Chinese Spell-Checker. Here, we only briefly describe the Chinese WI system, which serves as a front end for the Chinese Parser. For more details, the reader is referred to Wang, et al. [11].</Paragraph> <Paragraph position="1"> We consider the WI process as a parsing process with a word composition grammar, instead of a CSP problem [2], a unification problem [12], ... scanning process. A set of Chinese word composition grammar rules is designed to capture the characteristics of Chinese words. The grammar representation is augmented CFG, which is also used to write the English grammar in our English-Chinese MT system.</Paragraph> <Paragraph position="2"> The parser we use is based on Tomita's Generalized LR Parser [10]. 
However, the augmented parts (tests and actions) and the preference scoring module have been [...]. In the WI process, the basic unit is a character.</Paragraph> <Paragraph position="3"> A Chinese word is composed of one to five (possibly more) characters.</Paragraph> <Paragraph position="4"> The WI system consists of a lexicon, the word composition grammar, the preference scoring module, the test functions, and the parser.</Paragraph> <Paragraph position="5"> The lexicon contains a list of Chinese words (sorted by internal code order) with the following information: the characters of which the word is composed, its frequency count, its part of speech, and some semantic features (such as title, surname, and role). The lexicon is a general-purpose one; that is, it is built independently of the testing corpora. Currently, there are more than 90,000 lexical entries in the lexicon.</Paragraph> <Paragraph position="6"> A rule in the word grammar consists of a context-free part and an augmented part. In addition to the unknown word identification described in the previous section, augmented parts are used for recognizing (1) replication of words; (2) numbers; (3) prefixes; (4) suffixes; and (5) determiner-measure constructions. Since the word parser may produce two or more parses for an ambiguous sentence, a preference scoring module has been designed to choose the correct parse. Currently, the preference score is assigned based on (1) the length of the word (longer words are preferred), (2) the frequency count, and (3) semantic considerations (e.g., three-character personal names are preferred to two-character ones). 
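As a rough illustration of the three criteria just listed, the sketch below ranks candidate parses in Python. The paper does not say how the criteria are combined, so the strict priority order (length, then frequency, then name preference), the squared-length term, and all names and counts used here are assumptions. Words are tuples of toneless syllables, one per character.

```python
def parse_score(parse, freq, names=frozenset()):
    """Score one candidate segmentation, given as a list of words."""
    # (1) Longer words are preferred: squaring rewards fewer, longer words.
    length_term = sum(len(word) ** 2 for word in parse)
    # (2) Frequent words are preferred (counts would come from the lexicon).
    freq_term = sum(freq.get(word, 0) for word in parse)
    # (3) Semantic preference: count the 3-character personal names.
    name_term = sum(1 for word in parse if word in names and len(word) == 3)
    # Tuples compare left to right, which gives the assumed priority order.
    return (length_term, freq_term, name_term)


def choose_parse(parses, freq, names=frozenset()):
    """Return the candidate parse with the highest preference score."""
    return max(parses, key=lambda parse: parse_score(parse, freq, names))


# Made-up counts for a two-way ambiguity: two 1-character words or one word.
freq = {("ba", "shou"): 50, ("ba",): 900, ("shou",): 700}
parses = [[("ba",), ("shou",)], [("ba", "shou")]]
print(choose_parse(parses, freq))
```

Under these assumptions the single two-character word wins on length alone, even though the one-character words are far more frequent; a weighted sum would behave differently, which is why the combination is flagged as a guess.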
The WI system is written in Common Lisp and runs on a TI Micro-Explorer machine.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experimental Results </SectionTitle> <Paragraph position="0"> Before presenting the experimental results, we define two performance indices of a WI system, recall rate and precision rate, following Sproat and Shih [9] and Chang, et al. [3]. Let C be the segmentation result produced by the computer, H the result produced by a human (the correct result), and I the intersection of C and H. Then, counting words, the recall rate is I divided by H, and the precision rate is I divided by C. For example, if there are 20 words in a sentence (i.e., H equals 20), the WI system produces 22 words for the sentence (i.e., C equals 22), and there are 18 words in common (i.e., I equals 18), the recall rate would be 18/20 = 0.90 and the precision rate 18/22 ≈ 0.82.</Paragraph> <Paragraph position="1"> To demonstrate the proposed mechanism, we have tested the WI system with two corpora: (1) ten articles from a newspaper corpus, the United Daily corpus, and (2) 61 sentences from Chang et al. [4]. The first corpus was selected from the United Daily of March 8, 1991. The selection criterion is that an article does not contain any table or figure and, preferably, contains Chinese names. The second corpus is composed of difficult cases for which the NTHU WI system either cannot identify the names or overgenerates some Chinese names.</Paragraph> <Paragraph position="2"> In the experiment, we use four versions of the WI system to segment the ten articles. 
Version 1 is the WI system without name recognition capability, Version 2 the system recognizing only names with title, Version 3 the system recognizing both names with title and 3-character names without title, and Version 4 the system that also recognizes 2-character names without title.</Paragraph> <Paragraph position="3"> Recall rates (RR) and precision rates (PR) are computed automatically by comparing the segmentation output with the correct answers produced by a human.</Paragraph> <Paragraph position="4"> The experimental results are summarized in Table 1.</Paragraph> <Paragraph position="5"> From the table, we can observe the following facts: 1. Version 2 (RR: 96.17, PR: 93.46) shows a significant improvement over Version 1 (RR: 94.77, PR: 89.28). In other words, the capability for name recognition is very important in a WI system. Although Version 2 has only a limited capability (for names with title), the improvement is quite apparent. Note that in Version 2, the dynamic word formation mechanism is much more useful than in Version 3 or 4.</Paragraph> <Paragraph position="6"> 2. Version 3 has the best results (RR: 97.51, PR: 98.19) among the four versions. It is better than Version 2 for the obvious reason: the capability for identifying 3-character names without title.</Paragraph> <Paragraph position="7"> 3. Although Version 4 has one more function than Version 3, the identification of 2-character names without title, its result (RR: 96.32, PR: 97.51) is slightly worse than that of Version 3. This is mainly because the gain (recognition of 2-character names) is less than the loss (misinterpreting two single-character words as a 2-character name).</Paragraph> <Paragraph position="8"> 4. We will analyze the imperfections of the WI system in a subsection after the comparison with NTHU's system.</Paragraph> <Paragraph position="9"> Comparison with NTHU's System. In Chang, et al. 
[4], whose system we will call NTHU's system, the authors reported a 95 percent precision rate and a recall rate greater than 95 percent, and listed 5 samples (A-samples) in which their system identifies the names correctly, 34 examples (B-samples) in which the names are missed, and 22 examples (C-samples) in which Chinese names are overgenerated. Among them, we found that 3 A-samples, 6 B-samples, and 3 C-samples contain personal names with title. Since NTHU's system is completely statistics-based, it cannot make use of the title information. Our sublanguage-based system, on the other hand, processes these samples correctly.</Paragraph> <Paragraph position="10"> These 61 examples were fed to our WI system to compare the name recognition algorithms. The following results are for reference only, since the comparison is rather unfair (the examples are mostly cases that their system cannot recognize correctly).</Paragraph> <Paragraph position="11"> 1. For the 5 A-samples, our system can recognize four of them. The only A-sample it failed to identify is: huang2 rong2 you2 you2 de0 dao4. Our segmentation result is huang2-rong2-you2 you2 de0 dao4, while the correct result is huang2-rong2 you2-you2 de0 dao4. The reasons are that (1) our lexicon does not have the adverb you2-you2, and (2) we prefer 3-character names over 2-character ones. Note that NTHU's system can process all 5 cases successfully.</Paragraph> <Paragraph position="12"> 2. For the 34 B-samples, our system can identify 25 of them correctly. That is, there are 9 B-samples whose names neither our system nor NTHU's system can identify. We will discuss the reasons why these cases cannot be recognized in the next subsection.</Paragraph> <Paragraph position="13"> 3. For the 22 C-samples for which NTHU's system overgenerates personal names, our system has processed 16 of them correctly. 
In the next subsection, we discuss why our system also overgenerates personal names for the other 6 C-samples.</Paragraph> <Paragraph position="14"> 4. Of these 61 samples, our system can process 45 correctly.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Some Imperfections </SectionTitle> <Paragraph position="0"> Some problems remain unsolved in our WI system. Some are problems for WI systems in general. The others are specific to name recognition systems.</Paragraph> <Paragraph position="1"> 1. Two-character names are difficult to recognize, especially when followed by a single-character word. For example, in yi1 jing4 gang1 ba3 fa3 bao3 qu3 chu1, yi1jing4 is a 2-character name. However, our WI system produces a 3-character name yi1-jing4-gang1, since gang1 (just) is a single-character word. Although humans can usually identify such names correctly from context, our WI system understandably proposes the 3-character name.</Paragraph> <Paragraph position="2"> 2. The name of a married woman is usually prefixed with her husband's surname. Thus, a 3-character name becomes a 4-character one, i.e., husband's surname, father's surname, and a 2-character given name, e.g., xu3 lin2 yan2 mei2. Currently, this kind of name cannot be identified correctly, although a word-grammar rule can easily be added.</Paragraph> <Paragraph position="3"> 3. Some single-character surnames, such as nian2 (year), tang1 (soup), ceng2 (once), and huang2 (yellow), are also common single-character words. Thus, the name recognition algorithm sometimes overgenerates a personal name by combining one such word with the two following characters.</Paragraph> <Paragraph position="4"> 4. Some surnames are rather unusual, such as lian (lotus), ping2 (duckweed), and que4 (but). 
This makes the names unrecognizable. There is a tradeoff between a complete surname list and a minimal list of common surnames. On the one hand, a complete surname list helps name recognition but also increases overgeneration. On the other hand, a minimal list limits overgeneration while missing some would-be names.</Paragraph> <Paragraph position="5"> 5. Some single-character words are very difficult to identify when they can be grouped into two-character words with neighbouring characters. A famous example is ba3 shou3 (a handle). The problem is very difficult to solve for any WI system.</Paragraph> <Paragraph position="6"> 6. Even when the title information is used, overgeneration of personal names is still hard to avoid. The following is one such example: wu4 [...] are produced by our system. A fine adjustment of the scoring function should be able to overcome this problem. However, there are so many similar cases that this would become a real problem when we develop a full-scale system.</Paragraph> <Paragraph position="7"> 7. In Version 4 of our system, 2-character names without title are recognized in addition to those of Version 3, i.e., names with title and 3-character names without title. However, both the recall rate and the precision rate of Version 4 are lower than those of Version 3. The major reason is that too many 2-character names are generated.</Paragraph> </Section> </Section> </Paper>