<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0209"> <Title>Orthographic Co-Reference Resolution Between Proper Nouns Through the Calculation of the Relation of &quot;Replicancia&quot;</Title> <Section position="3" start_page="0" end_page="62" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The proliferation of texts in electronic form during the last two decades has revived interest in Information Retrieval and given rise to new disciplines such as Information Extraction and automatic summarization. These three disciplines share the use of Natural Language Processing techniques (Jacobs and Rau, 1993) and can therefore evolve synergistically.</Paragraph> <Paragraph position="1"> The identification and treatment of noun phrases is one of the fields of interest shared by Information Retrieval and Information Extraction.</Paragraph> <Paragraph position="2"> This interest must be understood within the trend to carry out only a partial analysis of texts so as to process them in a reasonable time (Chinchor et al., 1993).</Paragraph> <Paragraph position="3"> The present work proposes a new instrument designed for the treatment of proper nouns and other simple noun phrases in texts written in Spanish. The tool can be used in both Information Retrieval and Information Extraction systems.</Paragraph> <Section position="1" start_page="0" end_page="61" type="sub_section"> <SectionTitle> 1.1 Information Retrieval </SectionTitle> <Paragraph position="0"> Information Retrieval systems aim to discriminate among the documents that form the system's input, according to the information need posed by the user. In the field of Information Retrieval we can find two different approaches: information filtering -or routing- and retrospective retrieval, also known as ad hoc. 
In the first modality, the information need remains fixed and the documents that form the input are always new.</Paragraph> <Paragraph position="1"> The ad hoc modality retrieves relevant texts from a relatively static set of documents but, in contrast, admits changing information needs (Oard, 1996).</Paragraph> <Paragraph position="2"> Information Retrieval systems build representations of the texts and compare them with the representation of the information need raised by the user. The most commonly used representation is a vector whose coordinates depend on the terms' frequency of occurrence, where such &quot;terms&quot; are elements of the represented text that can coincide with words, stems or word associations (n-grams) (Evans and Zhai, 1996).</Paragraph> <Paragraph position="3"> At present, designers try to enrich the sets of terms used in the representations with noun phrases or parts of them, for example proper names. Thompson and Dozier (1997) conclude that: first, the recognition of proper nouns in texts can be done with remarkable effectiveness; second, proper nouns occur frequently enough, in queries as well (see footnote 1), to warrant their separate treatment in document retrieval, at least when working with case law documents or press news; third, the inclusion of proper nouns as terms in the documents' representations can improve information retrieval when proper nouns appear in the query.</Paragraph> <Paragraph position="4"> As Gerald Salton has noted (Salton and Allan, 1995), in the field of information retrieval, determining the meaning of words is less important than ascertaining whether the meaning of the terms within the detection need coincides with the meaning of the terms in each document.</Paragraph> <Paragraph position="5"> When the terms are proper nouns, we speak not of meaning but of reference, so what matters is whether the referents of the proper nouns included in the query 
coincide with the referents of the proper nouns that appear in each document.</Paragraph> <Paragraph position="6"> Referent resolution cannot be done without additional context (Thompson and Dozier, 1997). Nonetheless, in an Information Retrieval process we cannot count on the additional information of a database record, such as a personal identification number, profession or age. Nor can we carry out a linguistic analysis as thorough as the one possible in a language understanding system or in an Information Extraction system. For these reasons, the identification of proper nouns as co-referent is usually confined to verifying whether the superficial forms of the two nouns under examination are close enough to suppose that both refer to the same individual or entity.</Paragraph> </Section> <Section position="2" start_page="61" end_page="61" type="sub_section"> <SectionTitle> 1.2 Aim of the Work </SectionTitle> <Paragraph position="0"> Co-reference resolution is difficult because it demands the establishment of relationships between linguistic expressions -for example, proper names- and the entities denoted. To establish the co-reference between two expressions, we must first identify the referent of each of them and then check whether both coincide.</Paragraph> <Paragraph position="1"> 1 &quot;Query&quot; is the translation of the user's information need into a format appropriate for the system.</Paragraph> <Paragraph position="2"> In this work we propose a tool to resolve proper noun co-reference, based only on the examination of those nouns' superficial form. To carry out this examination we define, in Section Two, the relation of replicancia. This relation links pairs of nouns -replicantes- which show a certain resemblance between their superficial forms and usually have the same referent. In Section Three we propose an algorithm for the calculation of replicancia. 
This algorithm is able to learn pairs of replicantes and also allows the manual introduction of pairs of proper nouns that have the same referent and occur with high frequency, even though their orthographic forms bear no similarity at all. Section Four contains the results of the evaluation of co-reference resolution between proper nouns using only the relation of replicancia. Finally, in Section Five we draw some conclusions.</Paragraph> </Section> <Section position="3" start_page="61" end_page="61" type="sub_section"> <SectionTitle> 2 Definition of Replicancia 2.1 Antecedent </SectionTitle> <Paragraph position="0"> The need to develop an algorithm capable of resolving proper noun co-reference in a simple way has been pointed out by several authors. Among them, we can quote Dragomir Radev, who in (Radev and McKeown, 1997) includes, among the forthcoming extensions of the PROFILE system, an algorithm &quot;...that will match different instances of the same entity appearing in different syntactic forms -e.g. to establish that 'PLO' is an alias for 'Palestine Liberation Organization'...&quot; The second antecedent is (Bikel et al., 1998), where Bikel and his collaborators say about the future improvements of their own system: &quot;... We would like to incorporate the following into the current model: ...an aliasing algorithm, which dynamically updates the model (where e.g.</Paragraph> <Paragraph position="1"> IBM is an alias of International Business Machines)...&quot;</Paragraph> </Section> <Section position="4" start_page="61" end_page="62" type="sub_section"> <SectionTitle> 2.2 Definition </SectionTitle> <Paragraph position="0"> We call replicancia the relation that links proper nouns which presumably refer to the same entity when we attend exclusively to the nouns themselves, paying no attention to their respective contexts or to what their actual referents may be. 
We shall call replicantes the nouns which maintain a relation of replicancia between them. We also assign the label replicante to the predicate which resolves the replicancia relation.</Paragraph> <Paragraph position="1"> Replicancia is a dyadic, reflexive, symmetric, but not necessarily transitive relation. It is reflexive because every noun is a replicante of itself; symmetric because if noun A is a replicante of noun B, noun B is also a replicante of noun A; it is not transitive because the replicancia relation is established by attending only to the nouns' form, not to their referents, and the same noun may denote two or more entities depending on the context in which it is used. For example, let noun A be José Salazar Fernández, noun B be JSF and noun C be Juan Sánchez Figueroa; in this case A is a replicante of B and B a replicante of C, but A is not a replicante of C. Not being transitive, replicancia is not an equivalence relation either and does not induce a set of equivalence classes, as would have been desirable (Hirsman, 1997).</Paragraph> <Paragraph position="2"> We do not ask the predicate replicante to recognize pseudonyms, diminutives, nicknames, abbreviations and familiar names as replicantes, although the most frequent tokens can be manually introduced as if they had previously been learned by the system.</Paragraph> <Paragraph position="3"> Two proper nouns are replicantes when any of the following conditions is satisfied: a) Both nouns or their canonical forms are identical, as in José María Aznar and Jose Maria Aznar.</Paragraph> <Paragraph position="4"> b) One of them coincides with the initials of the other, be it in the singular, as in Unión Europea and UE, or in the plural, as in Estados Unidos and EE UU. 
We also admit nouns with a nexus, as in the case of Boletín Oficial del Estado and BOE.</Paragraph> <Paragraph position="5"> c) The shorter noun is contained in the longer one and there is no nexus among the lexemes not shared. Under this rule Julio Pérez de Lucas is a replicante of Pérez de Lucas; nevertheless, Caja de Madrid is not a replicante of Madrid because the part not shared, Caja de, includes the nexus de.</Paragraph> <Paragraph position="6"> d) Every word of the shorter noun, N2, is a version of some word included in the longer noun, N1, in the same relative order, although there can be words in N1 which have no version in N2. A word P2 is a version of another word P1 when their canonical forms are identical, when P2 is the initial of P1 or when the initials of both coincide. According to this fourth rule, the noun JM Pacheco is a replicante of Juan Manuel Pacheco Fernández; to verify it we compare the lexemes of the longer noun, one by one, with the lexemes which form the shorter one. In the comparison of Juan with JM Pacheco we identify the letter J as a version of Juan, after which M Pacheco remains as the rest. In the comparison between Manuel and M Pacheco we identify the letter M as a version of Manuel, after which Pacheco remains as the new rest, which can be identified with the same lexeme in the longer noun. As the rest is then empty, we decide that both names are replicantes of each other although some lexemes in the longer noun are left unmatched.</Paragraph> <Paragraph position="7"> This definition of replicante makes it possible to identify the following list of nouns as replicantes of the first:</Paragraph> </Section> </Section> </Paper>
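<!--
Editorial sketch, not part of the original paper. Conditions a) to d) of Section 2.2 can be illustrated with a small Python predicate. This is a minimal sketch under stated assumptions, not the authors' algorithm, which the paper presents in its Section Three. All names here (NEXUS, canonical, tokens, rule_b, rule_c, rule_d, replicante) are hypothetical. Assumed and not given by the source: the list of nexus words; the canonical form (lowercase, no diacritics, no periods); plural initials written as doubled letters; a token in capitals such as JM read as a run of initials; and an unmatched nexus also blocking rule d, which is needed for Caja de Madrid and Madrid to be rejected as the text states. The third "version" condition (coinciding initials) is left out because the text does not say enough to implement it.

```python
import unicodedata

# Hypothetical list of Spanish nexus words; the source does not enumerate them.
NEXUS = {"de", "del", "la", "las", "los", "el", "y"}

def canonical(text):
    """Assumed canonical form: lowercase, no periods, no diacritics."""
    decomposed = unicodedata.normalize("NFD", text.replace(".", ""))
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def tokens(noun):
    return noun.replace(".", "").split()

def rule_b(short, long_):
    """The shorter noun spells the initials of the longer one, skipping
    nexus words; doubled letters stand for plurals (EE UU)."""
    initials = "".join(canonical(w)[0] for w in tokens(long_)
                       if canonical(w) not in NEXUS)
    doubled = "".join(ch * 2 for ch in initials)
    acronym = canonical(short).replace(" ", "")
    return bool(initials) and acronym in (initials, doubled)

def rule_c(short, long_):
    """The shorter noun is a contiguous part of the longer one and no
    unshared word is a nexus."""
    s = [canonical(w) for w in tokens(short)]
    l = [canonical(w) for w in tokens(long_)]
    if not s:
        return False
    for start in range(len(l) - len(s) + 1):
        if l[start:start + len(s)] == s:
            unshared = l[:start] + l[start + len(s):]
            if not any(w in NEXUS for w in unshared):
                return True
    return False

def rule_d(short, long_):
    """Walk the longer noun word by word, consuming the 'rest' of the
    shorter noun as in the JM Pacheco walkthrough."""
    rest = tokens(short)
    if not rest:
        return False
    for word in tokens(long_):
        head = rest[0] if rest else ""
        if head and canonical(head) == canonical(word):
            rest.pop(0)
        elif head.isupper() and canonical(head[0]) == canonical(word)[0]:
            rest[0] = head[1:]          # consume one initial, e.g. JM to M
            if not rest[0]:
                rest.pop(0)
        elif canonical(word) in NEXUS:
            return False                # assumption: unmatched nexus blocks
    return not rest

def replicante(n1, n2):
    if canonical(n1) == canonical(n2):  # rule a
        return True
    # Order by length, with a deterministic tie-break to keep symmetry.
    short, long_ = sorted((n1, n2), key=lambda n: (len(canonical(n)), canonical(n)))
    return rule_b(short, long_) or rule_c(short, long_) or rule_d(short, long_)
```

Under these assumptions the sketch accepts the pairs used as examples in the text, rejects Caja de Madrid against Madrid, and reproduces the non-transitive case: JSF is accepted against both José Salazar Fernández and Juan Sánchez Figueroa, while those two full names are not accepted against each other.
-->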