File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/92/c92-1014_metho.xml
Size: 22,343 bytes
Last Modified: 2025-10-06 14:12:55
<?xml version="1.0" standalone="yes"?> <Paper uid="C92-1014"> <Title>A High-level Morphological Description Language Exploiting Inflectional Paradigms</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Paradigm Description Language </SectionTitle> <Paragraph position="0"> Our paradigm description language (PDL) is composed of three major components - form rules, an inheritance hierarchy of paradigms, and orthographic rules.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Form Rules </SectionTitle> <Paragraph position="0"> We divide word forms into * surface forms, which are those that show up in a text, * lexical forms, which are those that are stored directly in the lexicon, and * intermediate forms, those forms created by affixation or stem-change operations applied to other forms. These forms may not ever show up in a text but are useful in describing intermediate steps in the construction of surface forms from lexical forms. In the form construction rules, we distinguish between two major categories of strings. Stems are any forms which include the primary lexical base of the word, whereas affixes comprise the prefixes and suffixes which can be concatenated with a stem in the process of word formation. Once an affix is appended to or removed from a stem, the result is also a stem, since the result also includes the primary lexical base. Form construction rules are restricted to the five cases below:</Paragraph> <Paragraph position="2"> The &lt;form&gt; is a name for the string form created by the rule. &lt;stem&gt; is the name of a stem form. &lt;affix&gt; may be a prefix or suffix string (or string variable), its type (i.e., prefix or suffix) implied by its position before or after the &lt;stem&gt; in the rule. The operator (+ or -) always precedes the affix. If +, then the affix is appended to the stem as a prefix or suffix. If -, then the affix is removed from the stem. 
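The two operators can be illustrated with a minimal Python sketch. It is not part of PDL: the function name apply_form_rule, its argument layout, and the sample words are assumptions made only for this illustration.

```python
def apply_form_rule(stem: str, op: str, affix: str, position: str) -> str:
    """Apply one form construction step to a stem.

    op is "+" (append the affix) or "-" (remove it); position is
    "prefix" or "suffix". All names here are hypothetical.
    """
    if op not in ("+", "-"):
        raise ValueError(f"unknown operator: {op!r}")
    if position not in ("prefix", "suffix"):
        raise ValueError(f"unknown position: {position!r}")
    if op == "+":
        return stem + affix if position == "suffix" else affix + stem
    # Removal only makes sense if the stem actually carries the affix.
    if position == "suffix":
        if not stem.endswith(affix):
            raise ValueError(f"{stem!r} does not end with {affix!r}")
        return stem[: len(stem) - len(affix)]
    if not stem.startswith(affix):
        raise ValueError(f"{stem!r} does not start with {affix!r}")
    return stem[len(affix):]


# Two steps in succession: remove "er" from an infinitive, then add "ons".
base = apply_form_rule("parler", "-", "er", "suffix")
surface = apply_form_rule(base, "+", "ons", "suffix")
print(base, surface)
```

Each call corresponds to a single rule; the string returned by one call is itself a stem and can be passed to the next.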
The resulting &lt;form&gt; name may in turn be used as a stem in the construction of some other form. In this way, the construction of a surface form may be described via a succession of affixation or stem-change operations, each operation described in a single rule.</Paragraph> <Paragraph position="3"> The special symbol LEX may be used in the right-hand-side of a form rule to indicate that the form is stored as a lexical stem in the lexicon.</Paragraph> <Paragraph position="4"> Grammatical features may be associated with form names, as follows:</Paragraph> <Paragraph position="6"/> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Paradigms </SectionTitle> <Paragraph position="0"> A paradigm in PDL is composed of a set of form construction rules which collectively characterize the family of surface forms for those words which belong to that paradigm. To capture the similarities among paradigms and to avoid redundancy in the description of a language, we allow one paradigm to be based on another paradigm.</Paragraph> <Paragraph position="1"> If paradigm B is based on paradigm A, then all the forms and form construction rules that have been defined for paradigm A also apply, by default, to paradigm B. We can then differentiate paradigm B from A in three ways: 1. We can add new forms and their construction rules for forms that do not exist in A.</Paragraph> <Paragraph position="2"> ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 68 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 2. 
We can rewrite (override) the construction rules for forms that do exist in A.</Paragraph> <Paragraph position="3"> 3. If a form in A is no longer applicable in B, we can delete it from B.</Paragraph> <Paragraph position="4"> Note that the feature set(s) associated with form names cannot change from paradigm to paradigm; form names are universal, denoting the same features regardless of where they appear.</Paragraph> <Paragraph position="5"> In order to facilitate the capture of generalizations across paradigms, we allow the definition of abstract paradigms. These are paradigms to which no words of a language actually belong, but which contain a set of forms and constructions which other paradigms have in common. Thus a concrete paradigm may be based on some other concrete paradigm or on an abstract paradigm. Likewise, an abstract paradigm may itself be based on yet another abstract (or concrete) paradigm.</Paragraph> <Paragraph position="6"> The ability to base one paradigm on another, combined with the ability to represent intermediate stem forms as slots in a paradigm, is a very powerful feature of our morphological description language. Not only does it allow for paradigm descriptions that correspond closely with the kinds of descriptions found in grammar books, but, since the regularities among paradigms can be abstracted out and shared by multiple paradigms, it allows for very concise descriptions of inflectional behavior (including subregularities often overlooked in grammar books), as illustrated in section 3.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Orthographic Rules </SectionTitle> <Paragraph position="0"> Form construction rules describe which stems can combine with which affixes to create new forms. 
The concatenation or removal of an affix may in some cases result in a spelling change other than the mere concatenation or removal of the affix string. In English, many words ending in a vowel followed by a consonant will double the final consonant when an affix starting with a vowel is appended. In French, the addition of certain affixes requires that an &quot;e&quot; in the stem of some verbs be rewritten as &quot;è&quot;. Since these spelling change rules are often based on general phonological/orthographic properties of affixes and stems, rather than the specific form rules themselves, and hence may apply across paradigms, we support the independent specification of spelling rules capturing these changes. Each rule is written to apply to the orthographic context of a stem and affix at the point of the concatenation or deletion operation. Thus, there are two kinds of spelling rules: 1. Suffix rules, which describe spelling changes applying to the end of the stem and the beginning of the suffix, and 2. Prefix rules, which describe spelling changes applying to the end of the prefix and the beginning of the stem. A spelling rule can make reference to literal strings and variables. A variable refers to a named set of characters and/or strings, such as Vowel (a,e,i,o,u) or Dental (d,t,dn,m,chn,fn,gn). The grammar writer may define such sets and variables ranging over those sets.</Paragraph> <Paragraph position="1"> The general form of a suffix spelling rule is as follows: (&lt;parameter&gt;*) [&lt;stem-pattern&gt;] &lt;operator&gt;[&lt;affix-pattern&gt;] -&gt; [&lt;merged-pattern&gt;] {&lt;locs&gt;} The operator may be either + or -, indicating concatenation and deletion respectively. The &lt;merged-pattern&gt; refers to the form constructed by performing the operation on a stem and affix. The two patterns on the left of the arrow refer to the stem and affix participating in the construction. Each pattern is a list of variables and/or literal strings. 
Whenever the same variable name appears more than once in the rule, it is assumed to take on the same value throughout.</Paragraph> <Paragraph position="2"> &lt;parameter&gt; is a boolean condition on the applicability of the spelling rule. It is necessary for those cases where the application of the rule depends on information about the lexical item which is not included in the orthography.</Paragraph> <Paragraph position="3"> (Like [BEAR88], we choose to represent these conditions as features rather than as diacritics [KOSKENNIEMI84].) An example in English where a parameter is necessary is the case of geminating final consonants. Gemination depends on phonological characteristics which are not predictable from the spelling alone. Only words whose lexical entries contain the specified parameter value will undergo spelling changes sensitive to that parameter.</Paragraph> <Paragraph position="4"> Specifying orthographic rules independently of the specific affixes to which they apply allows for a more concise declarative representation, as regularities across paradigms and forms can be abstracted out. However, there are cases in which the application of an orthographic rule is constrained to specific paradigms or to specific forms within a paradigm. The optional &lt;locs&gt; qualifier can be used to limit the paradigms and/or specific forms to which the orthographic rule applies.</Paragraph> <Paragraph position="5"> Prefixation rules are expressed in a similar manner, except that the &lt;operator&gt; precedes the first pattern in the left hand side. Stem changes (in which a stem undergoes a spelling change in the absence of any affixation operation) are handled by the association of an orthographic rule with a form rule of the form &lt;form&gt; : &lt;stem&gt;. The orthographic rule in such a case would contain no affix pattern. 
Here we illustrate a hypothetical spelling rule: [&quot;a&quot; Cons Cons] +[Vowel] -&gt; [&quot;t2&quot; Cons Vowel] This is a suffix rule, since the operator precedes the second left-hand-side pattern. Accordingly, the &lt;stem-pattern&gt; refers to the characters at the end of the stem while the &lt;affix-pattern&gt; refers to the letters at the beginning of the affix. This rule states that, if we are appending an affix which begins with a vowel to a stem which ends in the character &quot;a&quot; followed by two identical consonants, then we construct the resulting form (&lt;merged-pattern&gt;) as follows: 1. Remove the last three characters from the stem, leaving &lt;sub-stem&gt;.</Paragraph> <Paragraph position="6"> 2. Remove the first character from the suffix, leaving &lt;sub-affix&gt;.</Paragraph> <Paragraph position="7"> 3. Construct the string &lt;spell-change&gt; by concatenating the strings and instantiated character variables described by the right-hand-side pattern.</Paragraph> <Paragraph position="8"> 4. Construct the resulting form as the concatenation of the strings &lt;sub-stem&gt;, &lt;spell-change&gt;, and &lt;sub-affix&gt;.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 The Lexicon </SectionTitle> <Paragraph position="0"> We have seen above how one paradigm can be based on another, thereby allowing form construction rules to be &quot;inherited&quot; by paradigms. This inheritance is controlled through the form names themselves. 
If we have a paradigm B based on paradigm A, then any form rules in A for which there is no rule in B with the same form name are by default assumed to be part of paradigm B.</Paragraph> <Paragraph position="1"> Although our lexicon is maintained as a secondary storage database with entries represented and indexed differently from the (memory resident) paradigms, it is useful to think of a lexical entry as &quot;inheriting&quot; rules from its paradigm as well. The inflectional behavior of any individual word will depend on both the information inherited from its paradigm and the information stored in the lexicon.</Paragraph> <Paragraph position="2"> Lexicon entries contain the equivalent of a single kind of form construction rule: &lt;form&gt; : &lt;stem&gt;/{supersede | augment} The interaction of lexical information with the word's paradigm is as follows: * If &lt;form&gt; corresponds to a lexical stem rule in the paradigm (i.e., one whose right-hand-side is the special symbol LEX), then this form provides the stem for that rule.</Paragraph> <Paragraph position="3"> * If &lt;form&gt; corresponds to a surface form in the paradigm or an intermediate form qualified with the qualifier /allow_lexical_override, then the lexical form either supersedes or augments the construction rule in the paradigm, depending on the value of the stem's /{supersede | augment} qualifier.</Paragraph> <Paragraph position="4"> The qualifier /allow_lexical_override is necessary to inform the run-time inflectional analyzer when to attempt a lexical lookup of an intermediate form stem. By default, the analyzer looks up any form found directly in the text (surface form) and any forms whose right hand side is LEX. 
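The default inheritance by form name and the supersede/augment behavior of lexical entries can be pictured with a small Python sketch. It is illustrative only: the dictionary layout, the names resolve_rules and stems_for, the toy paradigms A and B, and the reading of "augment" as "keep the constructed form and add the lexical one" are all assumptions, not the system's actual data structures.

```python
# name: (name of the paradigm it is based on, its own form rules).
# A rule value of None marks a form deleted from the paradigm.
PARADIGMS = {
    "A": (None, {"inf": "LEX", "base": 'inf - "er"', "pres_1p": 'base + "ons"'}),
    "B": ("A", {"base": 'inf - "ir"', "pres_1p": None}),
}


def resolve_rules(name):
    """Collect the form rules of a paradigm, inherited ones included."""
    parent, own = PARADIGMS[name]
    rules = dict(resolve_rules(parent)) if parent else {}
    for form, rule in own.items():
        if rule is None:
            rules.pop(form, None)  # form no longer applicable here
        else:
            rules[form] = rule     # new form, or override by form name
    return rules


def stems_for(constructed, lexical_entry):
    """Combine a constructed form with an optional lexical entry.

    lexical_entry is (stem, qualifier), with qualifier "supersede" or
    "augment", or None when the lexicon holds nothing for this form.
    """
    if lexical_entry is None:
        return [constructed]
    stem, qualifier = lexical_entry
    if qualifier == "supersede":
        return [stem]
    return [constructed, stem]


print(resolve_rules("B"))
print(stems_for("dived", ("dove", "augment")))
```

In this toy setup, B inherits the inf rule from A unchanged, overrides base, and drops pres_1p; the English pair dived/dove is only a convenient example of a lexical form that adds to, rather than replaces, the regular one.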
The use of the /allow_lexical_override flag can save disk accesses by limiting lexical lookups of intermediate forms to just those cases in which lexical overrides may actually occur.</Paragraph> <Paragraph position="5"> Utilizing the /allow_lexical_override qualifier and the default lookup of surface forms, one could write lexical entries in which all the rules in a paradigm were overridden by lexical information. In general, this is not a good idea, since it fails to take advantage of the generalizations that paradigms provide, but there are exceptional cases, such as the verb &quot;be&quot;, for which there must necessarily be a large number of lexical stems. Allowing lexical overrides in this manner eliminates the need to create an excessive number of highly idiosyncratic paradigms specifically to accommodate irregular verbs in languages like French and German (see section 3).</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Using Paradigm Inheritance to Capture </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Linguistic Generalizations </SectionTitle> <Paragraph position="0"> In PDL, word formation is characterized as a sequence of discrete transformational steps. In many cases, paradigms (as well as individual lexical items) will differ with respect to one or more of these intermediate steps, yet share the bulk of the rules that apply to the results of the intermediate operations. Default inheritance, including the inheritance of the partially derived forms, makes it possible to express such facts very succinctly. Figure 1 depicts the hierarchy of paradigms we have developed for the French verbs. The root of the hierarchy (VERB_ROOT) represents the &quot;greatest common denominator&quot; of all the paradigms in the hierarchy. 
(All of the intermediate form rules in the root paradigm are shown in Figure 1, but many of the surface form rules are omitted because of space limitations. However, all of the form rules, both intermediate and surface, in the other paradigms are listed.) The first sub-paradigm, VERB_ER, represents what are commonly referred to as first conjugation verbs, VERB_IR represents the second conjugation, and VERB_RE_IR, VERB_OIR, and VERB_RE together represent the third conjugation, which includes virtually all of the &quot;irregular&quot; verbs.</Paragraph> <Paragraph position="1"> [BESCHERELLE90] describes over 70 conjugation types that fall within one of the three basic groups, the third group being subdivided into three sections, one for the irregular verbs ending in -ir, one for the -oir verbs and one for the -re verbs. These sections map directly onto paradigms VERB_RE_IR, VERB_OIR, and VERB_RE, respectively, with the exception of several types (which actually fit VERB_ROOT directly). Through the use of form rule inheritance, intermediate form rules, lexical override and orthographic rules, we are able to condense the rules for the 78 types into these six paradigms, which capture in a straightforward way most of the linguistic regularities within and among the paradigms.</Paragraph> <Paragraph position="2"> The useful role played by intermediate form rules in inheritance can be seen by comparing the VERB_ER and VERB_IR paradigms. Both share (inherit) the imp intermediate form and the set of six surface forms that describe the imperfect tense (e.g., imp_1s). However, they differ in the surface form prés_1p, which is overridden in VERB_IR, and in the intermediate form base, which is overridden in VERB_ER. 
The interesting point here is that even though the imperfect indicative forms employ the stem imp, a form that is generated from a form that is not shared (prés_1p) and which is in turn generated from an unshared form (base), both the imp stem and the set of imperfect indicative forms may still be shared.</Paragraph> <Paragraph position="3"> Another example of how overridable intermediate form rules can be used to condense paradigms is provided by the VERB_RE_IR paradigm (which handles all of the irregular verbs ending in -ir that behave more like the -re verbs, e.g., dormir and vêtir) and its sub-paradigms. This is accomplished by first defining a new intermediate form, prés_s, which may be overridden by a lexical entry (or stem change rule). This allows for an irregular stem in the singular forms of the present indicative (e.g., dormir -> dor, mouvoir -> meu) while not overriding the base form, which is used elsewhere. Secondly, allowing lexical override of the stems used to generate the future and present conditional tense forms (fut) and the past simple and imperfect subjunctive forms (pas), respectively, allows for irregular stems such as valoir -> vaudr (fut) and mouvoir -> mu (pas).</Paragraph> <Paragraph position="4"> We have found this combination of intermediate form rules and lexical override useful for defining paradigms for German verbs as well. Because some strong verbs</Paragraph> <Paragraph position="5"> undergo a stem change in the 2nd and 3rd person singular forms of the present tense, an additional intermediate form may be defined to accommodate possible stem
changes in these two forms, just as the intermediate form prés_s was employed in the French paradigm VERB_RE_IR. This allows all of the strong verbs to be combined into a single paradigm.</Paragraph> <Paragraph position="6"> [Figure 1: Hierarchy of paradigms for French verbs. The diagram did not survive text extraction; items for each paradigm are in single boxes.]</Paragraph> <Paragraph position="7"> [Legible fragments of Figure 1 include the lexical entries cueillir (inf = &quot;cueillir&quot;, fut = &quot;cueiller&quot;) and assaillir (inf = &quot;assaillir&quot;), and the form rules imp_1s: imp + &quot;ais&quot;, prés_2p: prés_p + &quot;ez&quot;, and part_passé_masc_s: base + &quot;u&quot;.]</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Compilation and Run-time Algorithms </SectionTitle> <Paragraph position="0"> A PDL description is compiled into a non-deterministic transition network, suitable for the recognition and generation of word forms, as follows. First, the form rules are chained into a network based on the form names appearing in the rules' left and right hand sides. 
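The chaining step can be pictured with a small Python sketch. It is a simplification under stated assumptions: the tuple layout, the function names chain_rules and path_from_lex, and the specific rules (loosely modelled on the French forms discussed in section 3) are hypothetical, and the real network also carries paradigm sets and orthographic patterns.

```python
from collections import defaultdict

# (form name, stem name or "LEX", operator, affix); all values illustrative.
RULES = [
    ("inf", "LEX", None, None),
    ("base", "inf", "-", "er"),
    ("pres_1p", "base", "+", "ons"),
    ("imp", "pres_1p", "-", "ons"),
    ("imp_1s", "imp", "+", "ais"),
]


def chain_rules(rules):
    """Link each rule's right-hand stem name to the form name it defines."""
    successors = defaultdict(list)
    for form, stem, _op, _affix in rules:
        successors[stem].append(form)
    return dict(successors)


def path_from_lex(form, rules):
    """Follow stem names back from a form until a LEX rule is reached."""
    by_form = {rule[0]: rule for rule in rules}
    path = [form]
    while by_form[path[-1]][1] != "LEX":
        path.append(by_form[path[-1]][1])
    return list(reversed(path))


print(chain_rules(RULES))
print(path_from_lex("imp_1s", RULES))
```

The successor map is the forward view of the network, from lexical stems toward surface forms; path_from_lex walks the same links in the opposite direction, from a surface form back to the form stored in the lexicon.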
The full set of paradigms to which each form rule applies is calculated and stored at each corresponding node in the network.</Paragraph> <Paragraph position="1"> Then the orthographic rules are conflated with the word formation rules by unifying the orthographic patterns with the affixes in the form rules. Finally, a character discrimination net is constructed from all suffix surface form rules to optimize the run-time matching of the outermost suffix patterns in the form rule transition net.</Paragraph> <Paragraph position="2"> During morphological analysis, the conflated patterns are matched against the input string and the string undergoes whatever transformation the corresponding word formation rule dictates. At each step through the network, the set of paradigms for which that step is valid is intersected with the set that has been valid up to that point in the derivation. If this intersection becomes NULL, then the path is abandoned as invalid. Traversal through the net proceeds along all possible paths for as long as patterns continue to match. Lexicon lookups of candidate stem strings occur only when a LEX node or node marked as lexically overridable is reached. If a lexical stem matching the form name, paradigm set, and feature constraints acquired from the net is found, then its lemma is returned.</Paragraph> <Paragraph position="3"> For generation, the traversal is reversed. However, in order to calculate the sequence of rules to traverse to generate a surface form, we must work backwards from the rule that produces the desired surface form (given the paradigm of the lemma) to the rule that precedes that rule, and so on, until we reach a form whose stem is stored with the lemma in the lexicon. At this point, we know both the proper starting lexical stem form and the sequence of rules to apply to that stem.</Paragraph> </Section> </Paper>