File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/90/c90-3069_metho.xml
Size: 7,850 bytes
Last Modified: 2025-10-06 14:12:32
<?xml version="1.0" standalone="yes"?> <Paper uid="C90-3069"> <Title>AN INTEGRATED SYSTEM FOR MORPHOLOGICAL ANALYSIS OF THE SLOVENE LANGUAGE</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. Structure </SectionTitle> <Paragraph position="0"> The system was implemented on VAX/VMS in Quintus Prolog and consists of the following parts: (1) The compiler, which takes two-level rules as its input and produces finite-state automata (transducers).</Paragraph> <Paragraph position="1"> (2) The lexicon module, which provides a user interface for the creation and updating of the lexicon - the lexicon input module. This module embodies that part of the knowledge of Slovene inflectional morphology which cannot be (elegantly) covered by two-level rules. It is also the part of the system responsible for passing lexical word forms to (3) - the lexicon output module.</Paragraph> <Paragraph position="2"> (3) The MAS itself, which, having access to the transducers and (indirectly) to the lexicon, is able to analyze Slovene word forms into their lexical counterparts, and to synthesize word forms from lexical data.</Paragraph> <Paragraph position="3"> As we can see, the MAS, with its knowledge of phono-morphological alternations embodied in the transducers, guides the lexicon module in choosing the correct lexical word from the lexicon. The MAS module is of course also able to synthesize words, given their lexical representation. The &quot;feeding&quot; of lexical words to the MAS is, however, application dependent, and will thus not be dealt with further in this paper. The workings of the compiler will also not be discussed, as this is not its first implementation (Karttunen 87).</Paragraph> <Paragraph position="4"> 3. Lexicon module A basic part of our MAS system is the lexicon. 
The structure of the lexicon accords with the two-level model type lexicon; that is, the lexicon is composed of letter-tree sub-lexicons (tries), consisting of morphemes with a common property. We can have, for instance, a sub-lexicon for stems, another for endings of male noun declension, another for conjugative endings of certain verbs, etc. A set of sub-lexicons is marked as initial, meaning that a (recognizable) word can only start with a member of these sub-lexicons. The other sub-lexicons are connected to initial sub-lexicons through pointers, typically making them inflectional paradigms of various word classes.</Paragraph> <Paragraph position="5"> An entry in a sub-lexicon consists of three parts: (1) the &quot;morpheme&quot;, which, in stem sub-lexicons (two-level rules aside), is the invariant part of the stem lexeme, written in the symbols of the lexical alphabet; (2) the continuation lexicon(s) of the morpheme; (3) morpho-syntactic features of the morpheme. To illustrate: bolezEn decl subst 12 / bv=subst gen=fem; (1) (2) (3) (1) - the stem of the lexeme &quot;illness&quot;; the lexical symbol &quot;E&quot; denotes an unstressed &quot;e&quot; (schwa sound), deleted in word forms with non-null endings (&quot;bolezen&quot; - nom. sg., but &quot;bolezni&quot; - gen. sg.); (2) - the name of the lexicon with endings of second female declension; (3) - inherent morpho-syntactic properties of the lexeme (noun, female gender).</Paragraph> <Paragraph position="6"> We can see that the lexicon system can take care of regular paradigms of inflecting words of the language (at least for suffixing languages, such as Slovene), while the two-level rules handle phono-morphological alternations. The Slovene language, however, abounds in alternations that are lexically conditioned. This is not to say that no rules can be constructed to cover these alternations, but rather that they are not (purely) phonologically conditioned. 
There is, for instance, an alternation that affects only nouns of male gender which have the &quot;animate&quot; property, and another one which pertains only to the plural and dual of certain Slovene nouns. Since two-level rules are sensitive only to the form of the word (string) they process, they are insufficient for expressing such alternations.</Paragraph> <Paragraph position="7"> To handle lexically conditioned types of alternations, we have concentrated on the linking mechanism between the sub-lexicons. The &quot;continuation&quot; information belonging to an entry can also, along with a pointer to another sub-lexicon, include a list of lexical alternations. When accessing word forms from the lexicon, these alternations tell the lexicon output module how to modify the continuation sub-lexicon to express the desired changes. The rules governing such modifications of the continuation sub-lexicon can perform a certain number of primitive &quot;transformational&quot; operations on the sub-lexicon in question.</Paragraph> <Paragraph position="8"> To make the point clearer, we give a simple case of an alternation that affects certain nouns of male gender. The alternation &quot;j epenthesis&quot; inserts a &quot;j&quot; in the stem-final position in word forms with a non-null ending; e.g. &quot;krompir&quot; (potato), but &quot;krompirja&quot; for the singular genitive form. The lexicon entry looks like this: krompir decl_subst_m(pod_j) / bv=subst gen=mas deganim; When the lexicon output module &quot;jumps&quot; to the continuation lexicon, the &quot;pod_j&quot; item will trigger the corresponding alternation in the morphological rule base of the system. The alternation procedure then takes as its input the continuation lexicon, modifies it, and returns the modified lexicon (with &quot;j&quot; prefixed to the non-null gramatemes). 
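The effect of such an alternation procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Prolog implementation: the dictionary representation of a continuation lexicon, the function names and the three-ending paradigm fragment are all hypothetical; only the entry name pod_j and the forms krompir and krompirja come from the text above.

```python
# Hypothetical sketch: a continuation lexicon is modelled as a dict that maps
# each ending to its features. The real system stores sub-lexicons as tries.

def apply_j_epenthesis(continuation_lexicon):
    """Return a modified copy with "j" prefixed to every non-null ending."""
    modified = {}
    for ending, features in continuation_lexicon.items():
        new_ending = "j" + ending if ending else ending
        modified[new_ending] = features
    return modified

# Assumed fragment of a male noun paradigm (illustrative, not the full paradigm).
DECL_SUBST_M = {"": "nom sg", "a": "gen sg", "u": "dat sg"}

# Assumed rule base: alternation name -> procedure that rewrites a lexicon.
ALTERNATIONS = {"pod_j": apply_j_epenthesis}

def continuation_for(lexicon, alternation_names):
    """Apply the alternations listed in an entry, in order, to its continuation lexicon."""
    for name in alternation_names:
        lexicon = ALTERNATIONS[name](lexicon)
    return lexicon

modified = continuation_for(DECL_SUBST_M, ["pod_j"])
forms = sorted("krompir" + ending for ending in modified)
print(forms)  # ['krompir', 'krompirja', 'krompirju']
```

Note that the sketch leaves the shared paradigm untouched and hands back a modified copy, so entries without the alternation still see the original endings.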
Analysis then proceeds with entries of the modified lexicon.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. Input Module </SectionTitle> <Paragraph position="0"> If new entries are to be added to our lexicon by persons not acquainted with implementation details of the system (lexical alphabet and alternations), an input module with a friendly user interface is of prime importance. In our system the user is therefore expected to enter only the base form of the new word (e.g. nom. sg. for nouns) along with inherent morpho-syntactic properties of the word (e.g. noun, male, animate), and another &quot;comparative&quot; word form of the same word (e.g. gen. pl.). Both word forms are entered in the surface alphabet.</Paragraph> <Paragraph position="1"> With this information at its disposal, the input module must, in order to store the entry into the lexicon, do the following: - extract from the word its lexical stem; - transcribe it from surface into lexical characters; - determine the continuation lexicon(s) (paradigms) and lexical alternations. Extracting the lexical stem and assigning lexical alternations are performed by comparing the (base and comparative) word forms entered. For example, the comparison of &quot;ladja&quot; (ship) and &quot;ladij&quot; (gen. pl.) shows an insertion of &quot;i&quot; into the stem, so the name of the lexical alternation for &quot;i&quot; epenthesis is added to the entry.</Paragraph> <Paragraph position="2"> The &quot;lemmatization&quot; of words, especially the mapping from surface to lexical symbols, is basically nondeterministic; i.e. the input module &quot;guesses&quot; the correct lemmatization of the word, produces the lexical word form of the comparative word, and synthesizes its surface word form. 
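A minimal Python sketch of this guess-and-verify search is given here for illustration. It is an assumption-laden toy, not the module itself: the function names are hypothetical, the only ambiguity modelled is whether a surface e stands for lexical e or for the schwa symbol E, the ending of the comparative form is assumed to be known, and a one-line function stands in for synthesis through the two-level transducers.

```python
from itertools import product

def candidate_stems(surface):
    """Yield each lexical spelling where a surface "e" is read as "e" or as schwa "E"."""
    options = [("e", "E") if ch == "e" else (ch,) for ch in surface]
    for combo in product(*options):
        yield "".join(combo)

def synthesize(lexical_stem, ending):
    """Toy stand-in for the transducers: "E" is deleted before a non-null ending
    and surfaces as "e" before a null ending."""
    return lexical_stem.replace("E", "" if ending else "e") + ending

def lemmatize(base_form, comparative_form, ending):
    """Try candidate stems in turn; keep the first whose synthesized form matches."""
    for stem in candidate_stems(base_form):
        if synthesize(stem, ending) == comparative_form:
            return stem
    return None  # no candidate mapping reproduces the comparative form

print(lemmatize("bolezen", "bolezni", "i"))  # bolezEn
```
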
If the synthesized word-form matches the one entered by the user, the lemmatization is correct; if not, the module tries again, with a different mapping.</Paragraph> </Section> </Paper>