<?xml version="1.0" standalone="yes"?>
<Paper uid="E95-1028">
  <Title>Rapid Development of Morphological Descriptions for Full Language Processing Systems</Title>
  <Section position="2" start_page="0" end_page="202" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The paradigm of two-level morphology (Koskenniemi, 1983) has become popular for handling word formation phenomena in a variety of languages. The original formulation has been extended to allow morphotactic constraints to be expressed by feature specification (Trost, 1990; Alshawi et al., 1991) rather than Koskenniemi's less perspicuous device of continuation classes. Methods for the automatic compilation of rules from a notation convenient for the rule-writer into finite-state automata have also been developed, allowing the efficient analysis and synthesis of word forms.</Paragraph>
    <Paragraph position="1"> The automata may be derived from the rules alone (Trost, 1990), or involve composition with the lexicon (Karttunen, Kaplan and Zaenen, 1992).</Paragraph>
    <Paragraph position="2"> However, there is often a trade-off between run-time efficiency and factors important for rapid and accurate system development, such as perspicuity of notation, ease of debugging, speed of compilation and the size of its output, and the independence of the morphological and lexical components. In compilation, one may compose any or all of (a) the two-level rule set, (b) the set of affixes and their allowed combinations, and (c) the lexicon; see Kaplan and Kay (1994) for an exposition of the mathematical basis. The type of compilation appropriate for rapid development and acceptable run-time performance depends on, at least, the nature of the language being described and the number of base forms in the lexicon; that is, on the position in the three-dimensional space defined by (a), (b) and (c).</Paragraph>
    <Paragraph position="3"> For example, English inflectional morphology is relatively simple; dimensions (a) and (b) are fairly small, so if (c), the lexicon, is known in advance and is of manageable size, then the entire task of morphological analysis can be carried out at compile time, producing a list of analysed word forms which need only be looked up at run time, or a network which can be traversed very simply. Alternatively, there may be no need to provide as powerful a mechanism as two-level morphology at all; a simpler device such as affix stripping (Alshawi, 1992, p119ff) or merely listing all inflected forms explicitly may be preferable.</Paragraph>
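The two simpler options mentioned above for English can be illustrated with a minimal sketch. It is not the paper's mechanism: the suffix list, the two-word lexicon and the function names compile_full_forms and strip_affixes are all hypothetical, and spelling changes (dimension (a)) are ignored entirely, so suffixes are attached by plain concatenation.

```python
# Toy illustration only: the rules, lexicon and function names below are
# assumptions made for this sketch and do not come from the paper.

# Dimension (b), reduced to a flat list of (suffix, feature label) pairs.
SUFFIX_RULES = [
    ("", "base"),
    ("s", "plural-or-3sg"),
    ("ed", "past"),
    ("ing", "progressive"),
]

# Dimension (c): a tiny lexicon mapping base forms to a category.
LEXICON = {"walk": "verb", "jump": "verb"}


def compile_full_forms(lexicon, rules):
    """Expand every stem with every suffix ahead of time.

    The result maps each surface form to its analyses, so run-time
    analysis is a single dictionary lookup.
    """
    table = {}
    for stem, category in lexicon.items():
        for suffix, feature in rules:
            surface = stem + suffix  # no spelling rules in this sketch
            table.setdefault(surface, []).append((stem, category, feature))
    return table


def strip_affixes(word, lexicon, rules):
    """Analyse a word at run time by removing each candidate suffix
    and checking whether the remainder is a known stem."""
    analyses = []
    for suffix, feature in rules:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if stem in lexicon:
                analyses.append((stem, lexicon[stem], feature))
    return analyses


FULL_FORMS = compile_full_forms(LEXICON, SUFFIX_RULES)
print(FULL_FORMS.get("walked", []))
print(strip_affixes("jumping", LEXICON, SUFFIX_RULES))
```

The precompiled table grows with the product of lexicon size and the number of affix combinations, which is why this route is only attractive when both are small; affix stripping needs no table but consults the lexicon at run time.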
    <Paragraph position="4"> For agglutinative languages such as Korean, Finnish and Turkish (Kwon and Karttunen, 1994; Koskenniemi, 1983; Oflazer, 1993), dimension (b) is very large, so creating an exhaustive word list is out of the question unless the lexicon is trivial. Compilation to a network may still make sense, however, and because these languages tend to exhibit few non-concatenative morphophonological phenomena other than vowel harmony, the continuation class mechanism may suffice to describe the allowed affix sequences at the surface level. Many European languages are of the inflecting type, and occupy still another region of the space of difficulty. They are too complex morphologically to yield easily to the simpler techniques that can work for English. The phonological or orthographic changes involved in affixation may be quite complex, so dimension (a) can be large, and a feature mechanism may be needed to handle such varied but interrelated morphosyntactic phenomena as umlaut (Trost, 1991), case, number, gender, and different morphological paradigms. On the other hand, while there may be many different affixes, their possibilities for combination within a word are fairly limited, so dimension (b) is quite manageable.</Paragraph>
    <Paragraph position="5"> This paper describes a representation and associated compiler intended for two-level morphological descriptions of the written forms of inflecting languages. The system described is a component of the Core Language Engine (CLE; Alshawi, 1992), a general-purpose language analyser and generator implemented in Prolog which supports both a built-in lexicon and access to large external lexical databases. In this context, highly efficient word analysis and generation at run-time are less important than ensuring that the morphology mechanism is expressive, is easy to debug, and allows relatively quick compilation. Morphology also needs to be well integrated with other processing levels. In particular, it should be possible to specify relations among morphosyntactic and morphophonological rules and lexical entries; for the convenience of developers, this is done by means of feature equations. Further, it cannot be assumed that the lexicon has been fully specified when the morphology rules are compiled. Developers may wish to add and test further lexical entries without frequently recompiling the rules, and it may also be necessary to deal with unknown words at run time, for example by querying a large external lexical database or attempting spelling correction (Alshawi, 1992, pp124-7). Also, both analysis and generation of word forms are required. Run-time speed need only be enough to make the time spent on morphology small compared to sentential and contextual processing.</Paragraph>
    <Paragraph position="6"> These parameters - languages with a complex morphology/syntax interface but a limited number of affix combinations, tasks where the lexicon is not necessarily known at compile time, bidirectional processing, and the need to ease development rather than optimize run-time efficiency - dictate the design of the morphology compiler described in this paper, in which spelling rules and possible affix combinations (items (a) and (b)), but not the lexicon (item (c)), are composed in the compilation phase. Descriptions of French, Polish and English inflectional morphology have been developed for it, and I show how various aspects of the mechanism allow phenomena in these languages to be handled.</Paragraph>
  </Section>
</Paper>