<?xml version="1.0" standalone="yes"?> <Paper uid="E93-1043"> <Title>Coping With Derivation in a Morphological Component *</Title> <Section position="2" start_page="0" end_page="368" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> This paper is about words. Since word is a rather fuzzy term, we will first clarify what word means in the context of this paper. Following [di Sciullo and Williams, 1989] we distinguish two senses.</Paragraph> <Paragraph position="1"> One is the morphological word, which is built from morphs according to the rules of morphology. The other is the syntactic word, which is the atomic entity from which sentences are built according to the rules of syntax.</Paragraph> <Paragraph position="2"> *Work on this project was partially sponsored by the Austrian Federal Ministry for Science and Research and the &quot;Fonds zur Förderung der wissenschaftlichen Forschung&quot; grant no. P7986-PHY. I would also like to thank John Nerbonne, Klaus Netter and Wolfgang Heinz for comments on earlier versions of this paper.</Paragraph> <Paragraph position="3"> These two views support two different sets of information, which are to be kept separate but which are not disjoint. The syntactic word carries information about category, valency and semantics, information that is important for the interpretation of a word in the context of the sentence. It also carries information like case, number, gender and person. The former information is basically the same for all different surface forms of the syntactic word;¹ the latter is conveyed by the different surface forms produced by the inflectional paradigm and is therefore shared with the morphological word.</Paragraph> <Paragraph position="4"> Besides this shared information, the morphological word carries information about the inflectional paradigm, the stem, and the way it is internally structured. 
In our view, the lexicon should be a mediator between these two views of word.</Paragraph> <Paragraph position="5"> Traditionally, the lexicon in natural language processing (NLP) systems is viewed as a finite collection of syntactic words. Words have stored with them their syntactic and semantic information. In the simplest case the lexicon contains an entry for every different word form. For highly inflecting (or agglutinating) languages this approach is not feasible for realistic vocabulary sizes. Instead, morphological components are used to map between the different surface forms of a word and its canonical form stored in the lexicon. We will call this canonical form, together with the information associated with it, the lexeme.</Paragraph> <Paragraph position="6"> There are problems with such a static view of the lexicon. In the open word classes our vocabulary is potentially infinite. Making use of derivation and compounding, speakers (or writers) can and constantly do create new words. ¹ For some forms like the passive PPP, some authors assume different syntactic features. Nevertheless, they are derived regularly, e.g., by lexical rules.</Paragraph> <Paragraph position="7"> A majority of these words are invented on the spot and may never be used again. Skimming through real texts, one will always find such ad-hoc words, which appear in no lexicon yet are readily understood by any competent reader. A realistic NLP system should therefore have means to cope with ad-hoc word formation.</Paragraph> <Paragraph position="8"> Efficiency considerations also support the idea of extending morphological components to treat derivation. Because of the regularities found in derivation, a lexicon based purely on words will be highly redundant and waste space. On the other hand, a large percentage of lexicalized derived words (and compounds) is no longer transparent syntactically and/or semantically and has to be treated like a monomorphemic lexeme. 
What we need, then, is a system that is flexible enough to allow for both a compositional and an idiosyncratic reading of polymorphemic stems.</Paragraph> <Paragraph position="9"> The system described in this paper is a combination of a feature-based hierarchical lexicon and word grammar with an extended two-level morphology. Before describing the system in more detail, we will briefly discuss these two strands of research.</Paragraph> </Section> </Paper>