<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1117"> <Title>Cognate Mapping -- A Heuristic Strategy for the Semi-Supervised Acquisition of a Spanish Lexicon from a Portuguese Seed Lexicon</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Medical language presents a unique combination of challenges for language engineering, with a focus on applications such as information retrieval, text mining and information extraction. Document collections - on the Web or in clinical databases - are usually very large and dynamic. In addition, medical document collections are truly multilingual. Furthermore, the user population that accesses medical documents is highly diverse, ranging from physicians and nurses to laypersons, who use different jargons and sublanguages. Therefore, the simplicity of the content representation of the documents, as well as automatically performed intra- and interlingual lexical mappings or transformations of equivalent expressions, become crucial issues for adequate machine support.</Paragraph> <Paragraph position="1"> We respond to these challenges with the MORPHOSAURUS system (an acronym for MORPHeme TheSAURUS). It is centered around a new type of lexicon, in which the entries are subwords, i.e., semantically minimal, morpheme-style units (Schulz and Hahn, 2000). Intralingual as well as interlingual synonymy is then expressed by the assignment of subwords to concept-like equivalence classes. 
As subword equivalence classes abstract away from subtle particularities within and between languages, and reference to them is achieved via a language-independent code system, they form an interlingua characterized by semantic identifiers.</Paragraph> <Paragraph position="2"> Compared to relationally richer interlinguas, e.g., WORDNET-based ones, as applied to cross-language information retrieval (CLIR) (Gonzalo et al., 1999; Ruiz et al., 1999), we use a rather limited set of semantic relations and pursue a more restrictive approach to synonymy. In particular, we restrict ourselves to the specific sublanguage used in the context of the medical domain. Our claim that this interlingual approach is useful for the purpose of cross-lingual text retrieval and categorization has already been experimentally supported (Schulz et al., 2002; Markó et al., 2003).</Paragraph> <Paragraph position="3"> The quality of cross-lingual indexing fundamentally depends on the underlying lexicon and thesaurus. Their manual construction and maintenance are costly and error-prone. Therefore, machine-supported lexical acquisition techniques increasingly deserve attention. Whereas in the medical domain parallel corpora are only available for a limited number of language pairs, unrelated (i.e., non-parallel, non-aligned) corpora might provide sufficient evidence for cognate identification, at least for closely related languages.</Paragraph> <Paragraph position="4"> In this paper, we present the results of such an experiment. We have chosen Spanish and Portuguese as a pair of closely related languages. Both languages exhibit a high degree of similarity in their lexical inventory, as well as in the rules governing word formation. Accordingly, a Portuguese native speaker is able to understand technical texts in Spanish without much effort, and vice versa. 
For both languages, an increasing number of electronic texts is also available, so that a cross-lingual search interface would significantly improve the accessibility of domain-relevant documents.</Paragraph> </Section> </Paper>