File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/97/w97-0125_metho.xml
Size: 24,672 bytes
Last Modified: 2025-10-06 14:14:37
<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0125"> <Title>A Local Grammar-based Approach to Recognizing of Proper Names in Korean Texts</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> * CAIR, Department of Computer Science, KAIST, Korea ** IGM, University of Marne-la-Vallée, France Abstract </SectionTitle> <Paragraph position="0"> We present an LG-based approach to recognizing Proper Names in Korean texts.</Paragraph> <Paragraph position="1"> Local grammars (LGs) are constructed by examining specific syntactic contexts of lexical elements, since general syntactic rules, independent of lexical items, cannot provide accurate analyses. The LGs are represented in the form of Finite State Automata (FSA) in our system.</Paragraph> <Paragraph position="2"> Since we do not have a dictionary that provides all proper names, we need auxiliary tools to analyze them. We examine the contexts where strings containing proper names occur. Our approach consists in building an electronic lexicon of PNs in a more satisfactory way than existing methods, such as recognition in texts by means of statistical or rule-based approaches.</Paragraph> </Section> <Section position="2" start_page="0" end_page="283" type="metho"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> In this paper, we present a description of the typology of nominal phrases containing Proper Names (PN) and the local grammars \[Gro97\], \[Moh94\] constructed on the basis of this description.</Paragraph> <Paragraph position="1"> The goal is to implement a system which automatically detects PNs in a given text, allowing the construction of an electronic lexicon of PNs.</Paragraph> <Paragraph position="2"> The definition of Proper Nouns, as opposed to that of Common Nouns, is often a problematic issue in linguistic descriptions \[Gar91\]. 
PNs are generally understood as phonic sequences associated with one referent, without any intrinsic meaning, such as Socrates, Bach or Paris. They are usually characterized by nominal determination, the upper case marker, the prohibition of pluralization, or non-translatability \[Wil91\]. However, semantic or syntactic criteria do not allow us to distinguish these two categories in an operational way. For example, nouns such as sun, earth or moon, which semantically fit the definition of proper nouns such as Mars or Jupiter, do not have to be written with an upper case initial: hence, they are not considered proper nouns. Conversely, some proper nouns such as Baudelaire or Napoleon can also be used as common nouns in contexts where they stand in metonymic or metaphorical relations with common nouns, as in: I read some (Baudelaire + poems of Baudelaire) He is a real (Napoleon + general) Moreover, like common nouns, they often allow the derivation of adjectives: e.g. Socratic, Napoleonic or Parisian. These are also written with an initial upper case, unlike ordinary adjectives.</Paragraph> <Paragraph position="3"> The situation in French is similar. Let us consider \[Gar91\]: J'ai écouté (du Bach + de la musique) J'ai bu (du Champagne + du vin rouge) Derivational productivity is also observed: socratique, parisien or newyorkais, which however do not begin with the upper case.</Paragraph> <Paragraph position="4"> In the case of English or French, one could formally delimit the category of proper nouns by means of the upper case, even though this criterion does not correspond entirely to our intuition about proper nouns. However, in Korean, there are no typographical markers such as upper case vs. lower case, while one assumes that nouns such as those in (1) could be semantically and syntactically different from those of (2): (1) 
Kim Minu, Seoul, France (2) namja \[man\], sudo \[capital\], nala \[country\] This situation makes the distinction between proper nouns and common nouns more difficult than in French or English, since the former appear in the same grammatical positions as the latter, as in: (I listened to (Mozart + classical music) all day)</Paragraph> <Paragraph position="6"> (He only drinks (Bordeaux + red wine)) The derivation of some other categories from PNs is also observed: \[in Park JungHee's manner\], \[France style\], \[Chinese (language)\]. In fact, the distinction between these two categories might be arbitrary. We should perhaps consider a continuum of the noun system: a thesaurus of nouns ranging from the most generic nouns to the most specific nouns (which we call proper nouns). The following example shows a part of a noun Therefore, in the automatic analysis of texts written in Korean, we intend to consider the problem of defining proper nouns from a different viewpoint: whatever definition of proper nouns is given, once a complete list of them is available, we presumably no longer need this particular distinction between proper and common nouns. All nouns have some semantic and syntactic properties, which lead us to group them into several classes rather than by binary distinctions. Nevertheless, it still seems hard to establish an exhaustive list of what we call proper nouns. Indeed, proper nouns, important in number and in frequency, are among the most problematic units for all automatic analyzers of natural language texts.</Paragraph> <Paragraph position="7"> In this study, we focus on the problems of recognizing proper names. We do not try to characterize them as an axiomatic class, but attach to them a formal definition in order to determine explicitly the object of our study. Here is our formal criterion: $\{\, X \in (\text{Interrogative Pron } \textit{nugu?}\ \text{[Who?]}) \mid X \notin (\text{DECOS-NS}) \,\}$ \[Diagram: human nouns vs. common nouns (DECOS-NS)\] That is, proper names are determined by the fact that they do not exist in our lexicon of Korean common nouns (DECOS-NS/V01) \[Nam94\], and by their correspondence with the interrogative pronoun 'nugu? \[who?\]'. The nouns considered as proper names according to these conditions do not always correspond to our semantic intuition. Nevertheless, they usually do not have intrinsic meanings; and they do not have explicitly distinct referents. Given that a lexicon of Korean common nouns (DECOS-NS) has already been built \[Nam94\], a noun that is ambiguous between the common and proper categories is assigned to only one of the two lexicons by looking up DECOS-NS: if it is already included in this lexicon, we do not enter it in the lexicon of proper nouns, without questioning its linguistic status. Remember that our goal is not to discuss the nature of this noun class, but to complete the lexicon of Korean nouns in NLP systems. Since we do not yet have a dictionary that provides all proper nouns, handling them in an NLP system requires auxiliary methods, such as syntactic information or local grammars that allow us to analyze them.</Paragraph> <Paragraph position="8"> In the following sections, we classify the contexts where Proper Names can appear into five types, and describe their characteristics in detail.</Paragraph> <Paragraph position="9"> 2. Typology of PN Contexts 2.1. Type I < PN-(Postposition + E) > This type of noun phrase has no particular characteristics inherent to Proper Names (PN). They actually occur in the positions of common nouns, as shown in the following graph functions of the attached noun. When they appear in this context, there is no way to sort out proper names by analyzing their syntactic structures alone. 
Let us consider:</Paragraph> <Paragraph position="11"> We cannot distinguish this PN <Kim Jung Il> from other nouns that can be found in this position, such as in the following:</Paragraph> <Paragraph position="13"> ((This man + A Korean) is the President of North Korea) As mentioned above, in English or in French, proper names can be distinguished from common nouns at least by the use of the upper case for the former, even though it is not an absolute criterion. Consider: Jacques Chirac est le Président de la France Bill Clinton is the President of USA Nevertheless, the upper case does not totally satisfy our semantic intuition, since we also observe nouns with the upper case, such as Président or President, which certainly do not designate one particular person (here, we encounter the fundamental problem of the definition of the term 'proper'). Likewise, the nouns Français and American, which start with the upper case, cannot be considered proper names, whatever the definition of proper name is: (Dr. Kim MinU has studied in U.S.A. for 5 years) The noun phrase in subject position, 'Kim MinU bagsaneun', is composed of three strings. However, in Korean, the typographical constraint is not a reliable criterion, since we cannot prohibit writing this phrase in other ways like:</Paragraph> <Paragraph position="15"> When proper names occur attached to other elements of noun phrases, their analysis becomes more complicated. Therefore, a local grammar recognizing PTs such as (Figure 3): will reduce numerous mismatches between strings like (2b) and the combination of the items found in a dictionary.</Paragraph> <Paragraph position="16"> Since a family name alone can precede PTs, the grammar above should be refined (Figure 4): Thus we observe (3) instead of (1):</Paragraph> <Paragraph position="18"> Kim bagsa-neun migug-eise 5 nyengan gongbuha-essda (Dr. Kim has studied in the U.S.A. 
for 5 years) while a given name alone hardly appears with PTs: ??MinU bagsa-neun migug-eise 5 nyengan gongbuha-essda (Dr. MinU has studied in U.S.A. for 5 years) When we list the nouns of professional title, the number of PNs recognized by the local grammar presented in Figure 4 will increase. Nevertheless, listing these nouns does not automatically guarantee the recognition of PNs, since we can come across specific nouns (Spec) inside these sequences: Thus, in order to analyze the string followed by a PT in contexts such as (5), the system should first look up a lexicon of Common Nouns (and possibly a lexicon of Determiners), and if the search fails, one can suppose that we have found a proper name: (5a) igonggyei bagsa-ga ingi-ga nop-da (Doctors of Natural Science are highly requested) (5b) i gonghag bagsa-ga ingi-ga nop-da (This doctor of Science is highly requested) (5c) iminu bagsa-ga ingi-ga nop-da (Doctor Lee MinU is highly requested) In (5a), the string found with 'bagsa \[doctor\]' is a simple noun 'igonggyei \[natural science\]'; the sequence that precedes 'bagsa \[doctor\]' in (5b) is a phrase composed of a determiner 'i \[this\]' and a common noun 'gonghag \[science\]'; the element followed by 'bagsa \[doctor\]' in (5c) will not be matched with any entry of the lexicon of common nouns: only this string will then be recognized as a proper name.</Paragraph> <Paragraph position="19"> The local grammar proposed so far should be completed by the description of the following transformation. Let us compare (4) with (6):</Paragraph> <Paragraph position="21"> The structure can be formalized in the following graph (Figure 5): Figure 5. 
Type III of noun phrases containing PNs The strings &quot;N-(Gen+E) FR&quot; do not automatically guarantee the existence of proper names, since common nouns that have a human feature can also appear with an FR like:</Paragraph> <Paragraph position="23"> In fact, strings containing FRs are necessarily based upon human nouns, proper names being only one class of human nouns. This context helps to find proper names, but is not a sufficient condition to recognize them automatically.</Paragraph> <Paragraph position="24"> 2.4. Type IV <PN Vocative Term-(Postposition+E) > We call Vocative Terms (VT) the following utterances: yenggam! \[Sir!\] senbainim! \[Senior!\] nuna! \[Elder sister! (for a boy)\] enni! \[Elder sister! (for a girl)\] hyeng! \[Elder brother! (for a boy)\] obba! \[Elder brother! (for a girl)\] The nouns above can all be used as VTs, that is, terms one can use to indicate some social or familial relation between oneself (i.e. the speaker) and one's interlocutor(s), or to call on somebody while paying due respect to his social status (honorific terms). In addition, with proper names, they can also occur in assertive sentences, like: Kim yenggam-i wa-ssda PN<Kim> sir-Postp come-Past (Sir Kim came) InA nuna-ga ddena-ssda PN<InA> elder sister-Postp leave-Past (Elder Sister InA left) These VTs should be compared with the nouns of professional title (PT) that we examined in section 2.2: ?*gyosu! \[Professor!\] ?*janggwan! \[Minister!\] For PTs, one should either attach to them a vocative suffix such as 'nim', or adjoin them to proper names: gyosu-nim!/Kim MinU gyosu! janggwan-nim!/Kim janggwan! 
Semantically, PTs designate professions, the list of which we can determine a priori, while VTs are vaguer and cannot be predicted without examining pragmatic situations: the latter are closer to the nouns of Family Relation (FR), since, as mentioned above, they imply familial or social relations between a speaker and his interlocutor(s).</Paragraph> <Paragraph position="25"> * 4. Difference between VT and FR What we call nouns of Family Relation (FR) cannot appear with a proper name when they are used in the vocative case. Thus, the following internal structure is not allowed:</Paragraph> <Paragraph position="27"> This type of Noun Phrase is similar to the preceding one: what we call Incomplete Nouns (IN) are also used for social appellation. However, they differ from the preceding ones in that they do not have syntactic autonomy, and therefore can never appear alone in any position of a sentence. Here is their list:</Paragraph> <Paragraph position="29"/> <Paragraph position="31"> These nouns (INs), syntactically and semantically incomplete, always require proper names to their left. In this sense, this type of context is appropriate to PNs: if an IN is recognized, we can be assured of finding a PN next to it. In spite of this strong constraint, since all INs are monosyllabic, ambiguity problems are often hard to handle. For example, the IN 'ga \[Mr.\]' is a homograph of several items. Let us consider some of them (Figure 9): The following sentence illustrates this ambiguity problem: (1) bag-ga chingu-deul-gwa uli-ga muhega jutaig-ga geunche han umul-ga-eise yuhaing-ga-leul buleu-go isseul ddai, yengyang-ga ebs-neun bbangbuseulegi juwi-ei myech mali sai-ga anja iss-ess-den-ga ? (When we were singing popular songs with Mr. 
Park's friends at the edge of a well near the area of unlicensed buildings, how many birds were there sitting around bread crumbs without any taste?) We observe the morpheme ga 9 times. But only the first occurrence of ga is an Incomplete Noun which accompanies a PN. In the 8 other strings, we should not expect occurrences of PNs: in order to recognize an IN ga, dictionaries of all common nouns (i.e. simple nouns, derived nouns, and compound nouns) must first be available. If the string containing ga is not found in these dictionaries, then the final syllable ga might be a verb, a nominative postposition attached to a noun, or an inflectional suffix attached to a verb; or else, it is an IN ga.</Paragraph> <Paragraph position="32"> In the case of (1), strings containing ga, such as the following ones, are detected as common nouns (simple or derived ones):</Paragraph> <Paragraph position="34"> and the following ones are either nouns followed by a postposition ga or a verb including the inflectional suffix (IS) ga: uli-ga we-Postp sai-ga bird-Postp iss-ess-den-ga be-IS \[Past-Past-Interrogation\] The string 'bag-ga' will not be recognized as one of these cases, even though there exists a simple noun 'bag \[pumpkin\]' in the dictionary of common nouns, since the postposition required by this noun is not 'ga', but 'i'. Therefore, bag-ga will be analyzed as a proper name bag (family name alone) followed by an IN ga.</Paragraph> <Paragraph position="35"> 3. Building Local Grammars of PNs Let us summarize the formal definition of the five contexts where a Proper Name (PN) can occur:</Paragraph> <Paragraph position="37"> Notice that when we recognize Incomplete Nouns (i.e. 
ssi, yang, ga, nim, gun, ong), the occurrence of proper names is guaranteed, since INs cannot occur without PNs.</Paragraph> <Paragraph position="38"> Nevertheless, as mentioned above, serious ambiguity problems appear in the distinction of INs from their homographs. We therefore propose two complex local grammars in order to increase the identification rate of INs.</Paragraph> <Paragraph position="39"> 3.1. Use of PostHN appropriate to Human Nouns There are specific items appropriate to human nouns: we name them PostHN. They do not constitute autonomous units, but are attached to human nouns at the syntactic level. Thus, they appear even after the plural marker deul \[/s/\]. For example, in the following sentences, a PostHN 'nei \['s family/house\]' appears with a PN alone, or with a PN followed by an IN (here, ssi \[Mr.\]): MinU-nei-ga imaeul-eise jeiil bujilenha-da PN<MinU>-PostHN\[family\]-Postp this village-Postp most diligent-St (MinU's family is most diligent in this village) GangGinO-ssi-nei-eise bul-i na-ssda PN<Kang GinO>-IN\[Mr.\]-PostHN\[house\]-Postp fire-Postp occur-Past (There was a fire in Mr. Kang GinO's house) In French, we observe a preposition similar to this PostHN: chez ('s family/house), a locative preposition, like at one's in English, which selects only human nouns: Il y a eu un feu chez M. Pierre Picon There was a fire at M. Pierre Picon's Therefore, when we encounter a sequence that ends with an IN-PostHN-Postp, the likelihood of finding a PN increases. For example, the following string: jang-ga-nei-neun can be analyzed in 510 ways (i.e. (7 x 7 x 5 x 2) + (2 x 5 x 2) = 510) after a simple matching of the words of this string with their lexicon entries (Figure 11; node labels include IN: Mr. and PostHN: family). Let us examine the following sentences: Here, several of the noun phrases we have examined so far occur piled together. 
The internal structures of the examples above are respectively: (2a) PN - <Type V> - <Type III> - PostHN - Noun - Postp (2b) PN - <Type II> - <Type III> - PostHN - Postp (2c) PN - <Type IV> - PostHN - Postp Hence, by providing information about the combinations of these strings, we could raise the accuracy of PN recognition. For example, the string that includes the sequence ssi-dongsaing-nei jib in (1a) can hardly be anything other than a noun phrase containing a PN. Thus, even though we find several entries for kim in the lexicon of nouns, such as: kim 1. Noun = steam 2. Noun = dried laver 3. Noun = hope 4. Completive Noun = chance we can eliminate these interpretations, since these forms precede the complex sequence that necessarily requires a PN.</Paragraph> <Paragraph position="40"> 4. Experimental results So far, we have examined contexts where we expect to encounter Proper Names (PN). In order to recognize PNs automatically on a large scale in texts, in the absence of a complete lexicon of PNs, a description of the noun phrases containing PNs is necessary. We constructed local grammars based upon our description of the types of nominal phrases containing proper names. Notice that implementing such a system requires considering the relation between Recall and Precision. In general, Recall is understood as the ratio of relevant documents retrieved for a given query over the number of relevant documents for that query in a database, and Precision as the ratio of the number of relevant documents retrieved over the total number of documents retrieved \[Fra92\].</Paragraph> <Paragraph position="41"> However, Recall-Precision plots show that Recall and Precision are inversely related. That is, when Precision goes up, Recall typically goes down and vice-versa. 
If we want to recognize PNs automatically in a given text in order to construct an electronic lexicon of PNs, Recall, that is, the ratio of PNs retrieved by a given grammar over the number of PNs in the text, should certainly be higher than Precision.</Paragraph> <Paragraph position="42"> Let us consider some experimental results of our study. In the contexts of Type V, i.e. <PN Incomplete Noun-(Postposition + E)>, the Incomplete Noun (IN) 'ssi \[Mr./Miss/Mrs.\]' can appear with a family name alone, a given name alone, or a full name (cf. 2.5, Figure 7). Remember that, in Korean, a typographical unit delimited by blanks cannot directly be taken as a basic element for morphological analysis \[Nam97\]: we should therefore analyze the strings occurring with a blank on the left side of INs as well as the strings attached to INs in order to examine the context Type V. Thus, the local grammar of Type V for 'ssi' is the following graph (Figure 13): In the second text, composed of 30869 characters, 69 occurrences of &quot;X-ssi&quot; are observed. All &quot;Full name-ssi&quot; sequences here appear attached, whereas, in the preceding text, they all appear with a blank (i.e. 'X-#-ssi'). Here is the result (Figure 15): Looking up our dictionaries of Korean Simple Nouns (DECOS-NS/V01) \[Nam94\] and of Korean Postpositions (DECOS-POST/V01) \[Nam96b\] eliminates \[2\], \[3\], \[4\], \[5\], \[6\], \[8\], \[9\], which are sequences composed of a common noun and a postposition (or a typographical separator such as a comma). Because \[1\] is a dialectal adverb, and \[7\] a 'Noun-Verb' string, they were not detected in our system.</Paragraph> <Paragraph position="43"> The third text, composed of 33982 characters, contains 10 occurrences of &quot;X-ssi&quot;, one of which is a nonPN ('peulo ssi-leumi \[N/N/Postp\]') (Figure 16). This nonPN was eliminated after looking up the dictionary of Postpositions: there is no postposition 'leumi'. 
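The lexicon-based filtering of &quot;X-ssi&quot; candidates described above can be sketched as follows. This is only a minimal illustration in Python under stated assumptions, not the implementation used in the experiments: the function name is_pn_candidate, the toy lexicons, and the romanized inputs are all hypothetical stand-ins for the DECOS-NS and DECOS-POST dictionaries.

```python
# Hypothetical toy lexicons (assumption): the real DECOS-NS and DECOS-POST
# dictionaries are far larger and are not reproduced here.
COMMON_NOUNS = {"namja", "sudo", "nala", "gonghag", "igonggyei"}
POSTPOSITIONS = {"ga", "i", "neun", "leul", "eise"}


def is_pn_candidate(left, tail):
    """Return True if the sequence left-ssi-tail may contain a proper name.

    left: romanized string found to the left of the Incomplete Noun 'ssi'
    tail: romanized string found to the right of 'ssi' ('' if there is none)
    """
    # Whatever follows the Incomplete Noun must be empty or a postposition;
    # otherwise 'ssi' is just a syllable of a longer word.
    if tail and tail not in POSTPOSITIONS:
        return False
    # A string already listed as a common noun is not taken as a proper name.
    if left in COMMON_NOUNS:
        return False
    return True


print(is_pn_candidate("GangGinO", "neun"))  # kept as a PN candidate
print(is_pn_candidate("peulo", "leumi"))    # rejected: 'leumi' is no postposition
```

With these toy lexicons, the nonPN 'peulo ssi-leumi' is rejected for the same reason as in the text, namely that 'leumi' is not a postposition.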
The analysis of the text above on the basis of the local grammar presented in Figure 3 (Type II <PN Spec-PT-(Postp + E)>) in 2.2. allows us to recognize PNs in a more satisfactory way. Besides &quot;X-ssi&quot; strings, with two PTs, 'daitonglyeng \[the President\]' and 'susang \[the prime minister\]', we could recognize 73 % of PNs, that is, 49 occurrences out of 67 (i.e. Recall is 0.73). However, the use of the local grammars of Figure 13 and Figure 3 (with only the two PTs above) leaves some nonPNs: Precision is 0.7 (49 of the 70 strings which occurred with these IN and PTs are PNs). Since our goal is to recognize most contexts where PNs can occur, in order to construct a lexicon of PNs that is as complete as possible, Recall should be more important than Precision in our system. By adding a few more PTs (cf. Type II) such as 'janggun \[general\]' or 'sensu \[player\]', FRs (cf. Type III) such as 'bunye \[father-daughter\]', or INs (cf. Type V) such as 'yang \[Miss\]' to the lexicon on the basis of which our local grammars are constructed, we could obtain a more reliable result, as shown in the following table (Figure 17):</Paragraph> <Paragraph position="45"> To guarantee that all occurrences of PNs are covered by local grammars, it would be necessary to consider a large part of the contexts where common nouns appear.</Paragraph> <Paragraph position="46"> In this paper, we have described the contexts where proper names can occur, but the complete lists of the nouns requiring PNs have not yet been established. We are sure that these lists are not unlimited; they will be presented in further studies. Notice that these studies are deeply related to the syntax of nouns, especially that of human nouns. In this sense, human noun, a semantic concept, can nonetheless become an operational term in the formal description of natural languages, indispensable to many procedures of Natural Language Processing (NLP) systems.</Paragraph> </Section> </Paper>
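The Recall and Precision figures reported in Section 4 can be checked with a short calculation. The sketch below is illustrative Python only: the function names are hypothetical, and the counts (49 PNs recognized, 67 PNs in the text, 70 strings matched by the grammars) are taken from the figures quoted in the paper.

```python
def recall(true_positives, relevant_total):
    """Share of the PNs present in the text that the grammars retrieved."""
    return true_positives / relevant_total


def precision(true_positives, retrieved_total):
    """Share of the strings retrieved by the grammars that really are PNs."""
    return true_positives / retrieved_total


# Counts quoted in Section 4: 49 PNs recognized, 67 PNs in the text,
# 70 strings matched by the local grammars.
r = recall(49, 67)
p = precision(49, 70)
print(round(r, 2), round(p, 2))  # 0.73 0.7
```

Under these counts the two ratios agree with the reported values of 0.73 and 0.7.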