<?xml version="1.0" standalone="yes"?> <Paper uid="E95-1028"> <Title>Rapid Development of Morphological Descriptions for Full Language Processing Systems</Title> <Section position="3" start_page="202" end_page="203" type="metho"> <SectionTitle> 2 The Description Language </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="202" end_page="203" type="sub_section"> <SectionTitle> 2.1 Morphophonology </SectionTitle> <Paragraph position="0"> The formalism for spelling rules (dimension (a)) is a syntactic variant of that of Ruessink (1989) and Pulman (1991). A rule is of the form spell(Name, Surface Op Lexical, Classes, Features).</Paragraph> <Paragraph position="1"> Rules may be optional (Op is &quot;⇒&quot;) or obligatory (Op is &quot;⇔&quot;). Surface and Lexical are both strings of the form &quot;LContext|Target|RContext&quot;, meaning that the surface and lexical targets may correspond if the left and right contexts and the Features specification are satisfied. The vertical bars simply separate the parts of the string and do not themselves match letters. The correspondence between surface and lexical strings for an entire word is licensed if there is a partitioning of both so that each partition (pair of corresponding surface and lexical targets) is licensed by a rule, and no partition breaks an obligatory rule. A partition breaks an obligatory rule if the surface target does not match but everything else, including the feature specification, does.</Paragraph> <Paragraph position="2"> The Features in a rule is a list of Feature = Value equations. The allowed (finite) set of values of each feature must be prespecified. Value may be atomic or it may be a boolean expression.</Paragraph> <Paragraph position="3"> Members of the surface and lexical strings may be characters or classes of single characters.
The latter are represented by a single digit N in the string and an item N/ClassName in the Classes list; multiple occurrences of the same N in a single rule must all match the same character in a given application.</Paragraph> <Paragraph position="4"> Figure 1 shows three of the French spelling rules developed for this system. The change_e_è1 rule (simplified slightly here) makes it obligatory for a lexical e to be realised as a surface è when followed by t, r, or l, then a morpheme boundary, then e, as long as the feature cdouble has an appropriate value. The default rule that copies characters between surface and lexical levels and the boundary rule that deletes boundary markers are both optional. Together these rules permit the realization of cher (&quot;expensive&quot;) followed by e (feminine gender suffix) as chère, as shown in Figure 2. Because of the obligatory nature of change_e_è1, and the fact that the orthographic feature restriction on the root cher, [cdouble=n], is consistent with the one on that rule, an alternative realisation chere, involving the use of the default rule in third position, is ruled out. [1] Unlike many other flavours of two-level morphology, the Target parts of a rule need not consist of a single character (or class occurrence); they can contain more than one, and the surface target may be empty. This obviates the need for &quot;null&quot; characters at the surface. However, although surface targets of any length can usefully be specified, it is in practice a good strategy always to make lexical targets exactly one character long, because, by definition, an obligatory rule cannot block the application of another rule if their lexical targets are of different lengths. The example in Section 4.1 below clarifies this point.
Footnote 1: The cdouble feature is in fact used to specify the spelling changes when e is added to various stems: cher+e=chère, achet+e=achète, but jet+e=jette.</Paragraph> <Paragraph position="5"> [Figure 1]
spell(change_e_è1, &quot;|è|&quot; ⇔ &quot;|e|1+e&quot;, [1/trl], [cdouble=n]).
spell(default, &quot;|1|&quot; ⇒ &quot;|1|&quot;, [1/letter], []).
spell(boundary, &quot;||&quot; ⇒ &quot;|1|&quot;, [1/bmarker], []).</Paragraph> <Paragraph position="6"> [Figure 2]
Surface: c h è r e
Lexical: c h e r + e +
Rule: def. def. c_e_è1 def. bdy. def. bdy.</Paragraph> </Section> <Section position="2" start_page="203" end_page="203" type="sub_section"> <SectionTitle> 2.2 Word Formation and Interfacing to Syntax </SectionTitle> <Paragraph position="0"> The allowed sequences of morphemes, and the syntactic and semantic properties of morphemes and of the words derived by combining them, are specified by morphosyntactic production rules (dimension (b)) and lexical entries both for affixes (dimension (b)) and for roots (dimension (c)), essentially as described by Alshawi (1992) (where the production rules are referred to as &quot;morphology rules&quot;). Affixes may appear explicitly in production rules or, like roots, they may be assigned complex feature-valued categories. Information, including the creation of logical forms, is passed between constituents in a rule by the sharing of variables. These feature-augmented production rules are just the same device as those used in the CLE's syntactico-semantic descriptions, and are a much more natural way to express morphotactic information than finite-state devices such as continuation classes (see Trost and Matiasek, 1994, for a related approach).</Paragraph> <Paragraph position="1"> The syntactic and semantic production rules for deriving the feminine singular of a French adjective by suffixation with &quot;e&quot; are given, with some details omitted, in Figure 3. In this case, nearly all features are shared between the inflected word and the root, as is the logical form for the word (shown as Adj in the deriv rule).
The only differing feature is that for gender, shown as the third argument of the @agr macro, which itself expands to a category.</Paragraph> <Paragraph position="2"> Irregular forms, either complete words or affixable stems, are specified by listing the morphological rules and terminal morphemes from which the appropriate analyses may be constructed, for example: irreg(dit, [dire, 'PRESENT_3s'], [v_v_affix-only]).</Paragraph> <Paragraph position="3"> Here, PRESENT_3s is a pseudo-affix which has the same syntactic and semantic information attached to it as (one sense of) the affix &quot;t&quot;, which is used to form some regular third person singulars.</Paragraph> <Paragraph position="4"> However, the spelling rules make no reference to PRESENT_3s; it is simply a device allowing categories and logical forms for irregular words to be built up using the same production rules as for regular words.</Paragraph> </Section> </Section> <Section position="4" start_page="203" end_page="205" type="metho"> <SectionTitle> 3 Compilation </SectionTitle> <Paragraph position="0"> All rules and lexical entries in the CLE are compiled to a form that allows normal Prolog unification to be used for category matching at run time. The same compiled forms are used for analysis and generation, but are indexed differently.</Paragraph> <Paragraph position="1"> Each feature for a major category is assigned a unique position in the compiled Prolog term, and features for which finite value sets have been specified are compiled into vectors in a form that allows boolean expressions, involving negation as well as conjunction and disjunction, to be conjoined by unification (see Mellish, 1988; Alshawi, 1992, pp. 46-48).</Paragraph> <Paragraph position="2"> The compilation of morphological information is motivated by the nature of the task and of the languages to be handled.
As discussed in Section 1, we expect the number of affix combinations to be limited, but the lexicon is not necessarily known in advance. Morphophonological interactions may be quite complex, and the purpose of morphological processing is to derive syntactic and semantic analyses from words and vice versa for the purpose of full NLP. Reasonably quick compilation is required, and run-time speed need only be moderate.</Paragraph> <Section position="1" start_page="203" end_page="204" type="sub_section"> <SectionTitle> 3.1 Compiling Spelling Patterns </SectionTitle> <Paragraph position="0"> Compilation of individual spell rules is straightforward; feature specifications are compiled to positional/boolean format, characters and occurrences of character classes are also converted to boolean vectors, and left contexts are reversed (cf. Abramson, 1992) for efficiency. However, although it would be possible to analyse words directly with individually compiled rules (see Section 5 below), it can take an unacceptably long time to do so, largely because of the wide range of choices of rule available at each point and the need to check at each stage that obligatory rules have not been broken.</Paragraph> <Paragraph position="2"> We therefore take the following approach.</Paragraph> <Paragraph position="3"> First, all legal sequences of morphemes are produced by top-down nondeterministic application of the production rules (Section 2.2), selecting affixes but keeping the root morpheme unspecified because, as explained above, the lexicon is undetermined at this stage.
For example, for English, the sequences *+ed+ly and un+*+ing are among those produced, the asterisk representing the unspecified root.</Paragraph> <Paragraph position="4"> Then, each sequence, together with any associated restrictions on orthographic features, undergoes analysis by the compiled spelling rules (Section 2.1), with the surface sequence and the root part of the lexical sequence initially uninstantiated. Rules are applied recursively and nondeterministically, somewhat in the style of Abramson (1992), taking advantage of Prolog's unification mechanism to instantiate the part of the surface string corresponding to affixes and to place some spelling constraints on the start and/or end of the surface and/or lexical forms of the root.</Paragraph> <Paragraph position="5"> This process results in a set of spelling patterns, one for each distinct application of the spelling rules to each affix sequence suggested by the production rules. A spelling pattern consists of partially specified surface and lexical root character sequences, fully specified surface and lexical affix sequences, orthographic feature constraints associated with the spelling rules and affixes used, and a pair of syntactic category specifications derived from the production rules used. One category is for the root form, and one for the inflected form.</Paragraph> <Paragraph position="6"> Spelling patterns are indexed according to the surface (for analysis) and lexical (for generation) affix characters they involve.
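As an illustration only, this indexing can be sketched in Python (the system itself is implemented in Prolog). The patterns, the names PATTERNS, by_surface, by_lexical and candidate_roots, and the flat tuple layout are all assumptions made for this sketch; real spelling patterns also carry orthographic feature constraints and a pair of categories, which are omitted here.

```python
from collections import defaultdict

# Hypothetical, heavily simplified spelling patterns, one tuple each:
# (surface affix, lexical affix, surface root ending, lexical root ending).
# Feature constraints and category pairs are deliberately left out.
PATTERNS = [
    ("e", "+e+", "èr", "er"),  # cher+e+ realised as chère
    ("e", "+e+", "", ""),      # plain suffixation, root ending unchanged
    ("s", "+s+", "", ""),
]

by_surface = defaultdict(list)  # index consulted for analysis
by_lexical = defaultdict(list)  # index consulted for generation
for pattern in PATTERNS:
    by_surface[pattern[0]].append(pattern)
    by_lexical[pattern[1]].append(pattern)


def candidate_roots(word):
    """Return every (lexical root, lexical affix) pair that some pattern allows."""
    results = []
    for cut in range(1, len(word)):
        stem, affix = word[:cut], word[cut:]
        # Look up patterns by the stripped surface affix characters.
        for _, lex_affix, surf_end, lex_end in by_surface.get(affix, []):
            if stem.endswith(surf_end):
                # Swap the surface root ending for its lexical counterpart.
                root = stem[:len(stem) - len(surf_end)] + lex_end
                results.append((root, lex_affix))
    return results
```

With these assumed patterns, candidate_roots("chère") proposes both cher and chèr as roots. The sketch stops there; deciding which candidate roots actually exist is a separate lexical lookup that it does not model.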
At run time, an inflected word is analysed nondeterministically in several stages, each of which may succeed any number of times including zero.</Paragraph> <Paragraph position="7">
* stripping off possible (surface) affix characters in the word and locating a spelling pattern that they index;
* matching the remaining characters in the word against the surface part of the spelling pattern, thereby, through shared variables, instantiating the characters for the lexical part to provide a possible root spelling;
* checking any orthographic feature constraints on that root;
* finding a lexical entry for the root, by any of a range of mechanisms including lookup in the system's own lexicon, querying an external lexical database, or attempting to guess an entry for an undefined word; and
* unifying the root lexical entry with the root category in the spelling pattern, thereby, through variable sharing with the other category in the pattern, creating a fully specified category for the inflected form that can be used in parsing.</Paragraph> <Paragraph position="8"> In generation, the process works in reverse, starting from indexes on the lexical affix characters.</Paragraph> </Section> <Section position="2" start_page="204" end_page="205" type="sub_section"> <SectionTitle> 3.2 Representing Lexical Roots </SectionTitle> <Paragraph position="0"> Complications arise in spelling rule application from the fact that, at compile time, neither the lexical nor the surface form of the root, nor even its length, is known. It would be possible to hypothesize all sensible lengths and compile separate spelling patterns for each.
However, this would lead to many times more patterns being produced than are really necessary.</Paragraph> <Paragraph position="1"> Lexical (and, after instantiation, surface) strings for the unspecified roots are therefore represented in a more complex but less redundant way: as a structure L1 ... Lm v(L, R) R1 ... Rn</Paragraph> <Paragraph position="3"> Here the Li's are variables later instantiated to single characters at the beginning of the root, and L is a variable, which is later instantiated to a list of characters, for its continuation. Similarly, the Ri's represent the end of the root, and R is the continuation (this time reversed) leftwards into the root from R1. The v(L, R) structure is always matched specially with a Kleene-star of the default spelling rule. For full generality and minimal redundancy, Lm and R1 are constrained not to match the default rule, but the other Li's and Ri's may. The values of n required are those for which, for some spelling rule, there are k characters in the target lexical string and n - k from the beginning of the right context up to (but not including) a boundary symbol. The lexical string of that rule may then match R1, ..., Rk, and its right context match Rk+1, ..., Rn, +, .... The required values of m may be calculated similarly with reference to the left contexts of rules. [2] During rule compilation, the spelling pattern that leads to the run-time analysis of chère given above is derived from m = 0 and n = 2 and the specified rule sequence, with the variables R1 R2 matching as in Figure 4.</Paragraph> </Section> <Section position="3" start_page="205" end_page="205" type="sub_section"> <SectionTitle> 3.3 Applying Obligatory Rules </SectionTitle> <Paragraph position="0"> In the absence of a lexical string for the root, the correct treatment of obligatory rules is another problem for compilation.
If an obligatory rule specifies that lexical X must be realised as surface Y when certain contextual and feature conditions hold, then a partitioning where X is realised as something other than Y is only allowed if one or more of those conditions is unsatisfied. Because of the use of boolean vectors for both features and characters, it is quite possible to constrain each partitioning by unifying it with the complement of one of the conditions of each applicable obligatory rule, thereby preventing that rule from applying. For English, with its relatively simple inflectional spelling changes, this works well. However, for other languages, including French, it leads to excessive numbers of spelling patterns, because there are many obligatory rules with non-trivial contexts and feature specifications.</Paragraph> <Paragraph position="1"> For this reason, complement unification is not actually carried out at compile time. Instead, the spelling patterns are augmented with the fact that certain conditions on certain obligatory rules need to be checked on certain parts of the partitioning when it is fully instantiated. This slows down run-time performance a little but, as we will see below, the speed is still quite acceptable.</Paragraph> </Section> <Section position="4" start_page="205" end_page="205" type="sub_section"> <SectionTitle> 3.4 Timings </SectionTitle> <Paragraph position="0"> The compilation process for the entire rule set takes just over a minute for a fairly thorough description of French inflectional morphology, running on a Sparcstation 10/41 (SPECint92=52.6).</Paragraph> <Paragraph position="1"> Footnote 2 (to Section 3.2): Alternations in the middle of a root, such as umlaut, can be handled straightforwardly by altering the root/affix pattern from L1...Lm v(L,R) R1...Rn to L1...Lm v(L,R) M v(L',R') R1...Rn, with M forbidden to be the default rule. This has not been necessary for the descriptions developed so far, but its implementation is not expected to lead to any great decrease in run-time performance, because the non-determinism it induces in the lookup process is no different in kind from that arising from alternations at root-affix boundaries.</Paragraph> <Paragraph position="2"> Run-time speeds are quite adequate for full NLP, and reflect the fact that the system is implemented in Prolog rather than (say) C and that full syntactico-semantic analyses of sentences, rather than just morpheme sequences or acceptability judgments, are produced.</Paragraph> <Paragraph position="3"> Analysis of French words using this rule set and only an in-core lexicon averages around 50 words per second, with a mean of 11 spelling analyses per word leading to a mean of 1.6 morphological analyses (the reduction being because many of the roots suggested by spelling analysis do not exist or cannot combine with the affixes produced). If results are cached, subsequent attempts to analyse the same word are around 40 times faster still. Generation is also quite acceptably fast, running at around 100 words per second; it is slightly faster than analysis because only one spelling, rather than all possible analyses, is sought from each call. Because of the separation between lexical and morphological representations, these timings are essentially unaffected by in-core lexicon size, as full advantage is taken of Prolog's built-in indexing. Development times are at least as important as computation times. A rule set embodying a quite comprehensive treatment of French inflectional morphology was developed in about one person month. The English spelling rule set was adapted from Ritchie et al. (1992) in only a day or two.
A Polish rule set is also under development, and Swedish is planned for the near future.</Paragraph> </Section> </Section> <Section position="5" start_page="205" end_page="206" type="metho"> <SectionTitle> 4 Some Examples </SectionTitle> <Paragraph position="0"> To clarify further the use of the formalism and the operation of the mechanisms, we now examine several further examples.</Paragraph> <Section position="1" start_page="205" end_page="206" type="sub_section"> <SectionTitle> 4.1 Multiple-letter spelling changes </SectionTitle> <Paragraph position="0"> Some obligatory spelling changes in French involve more than one letter. For example, masculine adjectives and nouns ending in eau have feminine counterparts ending in elle: beau (&quot;nice&quot;) becomes belle, chameau (&quot;camel&quot;) becomes chamelle. The final e is a feminizing affix and can be seen as inducing the obligatory spelling change au → ll.</Paragraph> <Paragraph position="1"> However, although the obvious spelling rule, spell(change_au_ll, &quot;|ll|&quot; ⇔ &quot;|au|+e&quot;), allows this change, it does not rule out the incorrect realization of beau+e as *beaue, shown in Figure 5, because it only affects partitionings where the au at the lexical level forms a single partition, rather than one for a and one for u.
[Figure 5]
Surface: b e a u e
Lexical: b e a u + e +
Rule: def. def. def. def. bdy. def. bdy.
Instead, the following pair of rules, in which the lexical targets have only one character each, achieve the desired effect:</Paragraph> <Paragraph position="2"> It is not necessary for the surface target to contain exactly one character for the blocking effect to apply, because the semantics of obligatoriness is that the lexical target and all contexts, taken together, make the specified surface target (of whatever length) obligatory for that partition.
The reverse constraint, on the lexical target, does not apply.</Paragraph> </Section> <Section position="2" start_page="206" end_page="206" type="sub_section"> <SectionTitle> 4.2 Using features to control rule application </SectionTitle> <Paragraph position="0"> Features can be used to control the application of rules to particular lexical items where the applicability cannot be deduced from spellings alone. For example, Polish nouns with stems whose final syllable has vowel ó normally have inflected forms in which the accent is dropped. Thus in the nominative plural, krój (&quot;style&quot;) becomes kroje, bór (&quot;forest&quot;) becomes bory, bój (&quot;combat&quot;) becomes boje. However, there are exceptions, such as zbój (&quot;bandit&quot;) becoming zbóje.</Paragraph> <Paragraph position="1"> Similarly, some French verbs whose infinitives end in -eler take a grave accent on the first e in the third person singular future (modeler, &quot;model&quot;, becomes modèlera), while others double the l instead (e.g. appeler, &quot;call&quot;, becomes appellera).</Paragraph> <Paragraph position="2"> These phenomena can be handled by providing an obligatory rule for the case where the letter changes, but constraining the applicability of the rule with a feature and making the feature clash with that for roots where the change does not occur. In the Polish case: spell(change_ó_o, &quot;|o|&quot; ⇔ &quot;|ó|1+2&quot;, [1/c, 2/v], [chngo=y]).</Paragraph> <Paragraph position="3"> orth(zbój, [chngo=n]).</Paragraph> <Paragraph position="4"> Then the partitionings given in Figure 6 will be the only possible ones.
For bój, the change_ó_o rule must apply, because the chngo feature for bój is unspecified and therefore can take any value; for zbój, however, the rule is prevented from applying by the feature clash, and so the default rule is the only one that can apply.</Paragraph> </Section> </Section> <Section position="6" start_page="206" end_page="207" type="metho"> <SectionTitle> 5 Debugging the Rules </SectionTitle> <Paragraph position="0"> The debugging tools help in checking the operation of the spelling rules, either (1) in conjunction with other constraints or (2) on their own.</Paragraph> <Paragraph position="1"> For case (1), the user may ask to see all inflections of a root licensed by the spelling rules, production rules, and lexicon; for cher, the output is
[cher,e]: adjp -> chère
[cher,e,s]: adjp -> chères
[cher,s]: adjp -> chers
meaning that when cher is an adjp (adjective) it may combine with the suffixes listed to produce the inflected forms shown. This is useful in checking over- and undergeneration. It is also possible to view the spelling patterns and production rule tree used to produce a form; for chère, the trace (slightly simplified here) is as in Figure 7. The spelling pattern 194 referred to here is the one depicted in a different form in Figure 4. The notation {clmnprstv=A} denotes a set of possible consonants represented by the variable A, which also occurs on the right hand side of the rule, indicating that the same selection must be made for both occurrences. Production rule tree 17 is that for a single application of the rule adjp_adjp_fem, which describes the feminine form of an adjective, where the root is taken to be the masculine form. The Root and Infl lines show the features that differ between the root and inflected forms, while the Both line shows those that they share.
Tree 18, which is also pointed to by the spelling pattern, describes the feminine forms of nouns analogously.</Paragraph> <Paragraph position="2"> [Figure 6]
Surface: b o j e
Lexical: b ó j + e +
Rule: def. c_ó_o def. bdy. def. bdy.</Paragraph> <Paragraph position="3"> Surface: z b ó j e
Lexical: z b ó j + e +
Rule: def. def. def. def. bdy. def. bdy.</Paragraph> <Paragraph position="4"> For case (2), the spelling rules may be applied directly, just as in rule compilation, to a specified surface or lexical character sequence, as if no lexical or morphotactic constraints existed. Feature constraints, and cases where the rules will not apply if those constraints are broken, are shown. For the lexical sequence cher+e+, for example, the output is as follows.</Paragraph> <Paragraph position="5"> This indicates to the user that if cher is given a lexical entry consistent with the constraint cdouble=n, then only the first analysis will be valid; otherwise, only the second will be.</Paragraph> </Section> </Paper>