<?xml version="1.0" standalone="yes"?>
<Paper uid="E89-1007">
  <Title>ON THE GENERATIVE POWER OF TWO-LEVEL MORPHOLOGICAL RULES</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
THE FORMALISM
TWO-LEVEL NOTATION
</SectionTitle>
    <Paragraph position="0"> The original notation proposed in Koskenniemi(1983a) included some rather complex notational conventions which have not survived into later versions. The formalisation given here will deal only with the core ideas, as embodied in Koskenniemi(1985) (and other implementations such as Karttunen et al.(1987),</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <Paragraph position="0"> Ritchie, Black, et al.(1987)). By way of illustration, here is a two-level morphological rule taken from Ritchie, Pulman et al. (1987): Rules are phrased in terms of symbol-pairs (written with an infix colon), where the first in the pair is a lexical symbol and the second is a surface symbol. In the above example, the pair of symbols on the left (lexical &amp;quot;e&amp;quot; and surface null) is allowed to occur only in the contexts listed on the right of the rule, where .... indicates the position of the pair &amp;quot;e:0&amp;quot;. Each context has a left part and a right part, each of these being essentially a regular expression over symbol-pairs, where angle brackets indicate sequences of pairs and braces indicate alternatives (disjunction). Certain versions of the notation may also allow the &amp;quot;Kleene star&amp;quot; symbol &amp;quot;*&amp;quot; to indicate zero or more repetitions, and the insertion of optional elements. In this example, &amp;quot;C&amp;quot;, &amp;quot;V&amp;quot;, &amp;quot;C2&amp;quot;, and &amp;quot;=&amp;quot; represent subsets of the relevant symbol alphabets and &amp;quot;+&amp;quot; is an abstract symbol occurring in certain lexical forms.</Paragraph>
      <Paragraph position="1"> The formalism here will not include symbolic mnemonics for sets of symbols, nor variables ranging over sets of symbols. The semantics of both these notations (which are commonly used in two-level morphology) can be stated in terms of equivalent sets of rules without such abbreviatory conventions, so all that is required is a definition of the interpretation of rules containing only actual character symbols, together with the various devices for indicating disjunction, repetition, etc. (Most of the latter could also be ignored here by a similar assumption, but the presentation is perhaps easier to follow if the resemblance to the actual notation is retained).</Paragraph>
      <Paragraph position="2"> One of the more peripheral aspects of two-level morphology is the role of the rules in segmenting surface input strings into lexical forms (i.e. the interface between a rule interpreter and a lexicon of morphemes). It is only there that the special null symbol &amp;quot;0&amp;quot; takes on special significance (see later section). Hence most of the definitions, and the subsequent discussion of generative power, are concerned with sequences of symbol-pairs, which is equivalent to considering only pairs of strings of equal length.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
BASIC DEFINITIONS
</SectionTitle>
    <Paragraph position="0"> Given any two finite symbolic alphabets, A and A', a symbol-pair from A and A' is a pair &lt;a, a'&gt; where a ∈ A and a' ∈ A'. Such symbol-pairs will normally be written as &amp;quot;a:a'&amp;quot;. A symbol-pair sequence from A and A' is simply a sequence (possibly empty) of symbol-pairs from A and A', and a symbol-pair language over A and A' is a set of symbol-pair sequences (i.e. a subset of (A × A')*).</Paragraph>
    <Paragraph position="1"> Given two alphabets A and A', and a symbol-pair sequence S from A and A', a sequence &lt;P1,...,Pn&gt; of symbol-pair sequences from A and A' is said to be a partition of S iff S = P1P2...Pn (i.e. the concatenation of the Pi).</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
CONTEXTS AND RULES
</SectionTitle>
    <Paragraph position="0"> Given two symbol sets A and A', a context-expression from A and A' is a regular expression over A × A'. That is, a context-expression characterises a regular set of sequences of symbol-pairs. For example, the expression b:b ∨ (a:a b:b)* characterises the set {e, b:b, a:a b:b, a:a b:b a:a b:b, ...} where e denotes the empty sequence.</Paragraph>
    <Paragraph position="1"> Given two alphabets A and A', a two-level morphological rule over A and A' consists of a pair &lt;P, C&gt; where P is a symbol-pair from A and A', and C is a non-empty set of pairs &lt;LC, RC&gt; where LC and RC are context-expressions from A and A'. The reason for including a set of pairs of contexts is that we must cater, in the general case, for there being a disjunction of pairs of contexts (as in the illustrative example above, where the disjuncts are separated by &amp;quot;or&amp;quot;). In the case where the set is a singleton, this reduces to the simple (non-disjunctive) case.</Paragraph>
    <Paragraph position="2"> A context-expression ce is said to match at the right-end a symbol-pair sequence S iff there is a partition &lt;P1, P2&gt; of S such that P2 is an element of the set characterised by ce.</Paragraph>
    <Paragraph position="3"> A context-expression ce is said to match at the left-end a symbol-pair sequence S iff there is a partition &lt;P1, P2&gt; of S such that P1 is an element of the set characterised by ce.</Paragraph>
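The two matching definitions above can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of the formalism: a symbol-pair sequence is assumed to be a tuple of (lexical, surface) pairs, and a context-expression is assumed to be supplied as a predicate that decides membership in the regular set it characterises. The names literal, matches_right_end and matches_left_end are hypothetical.

```python
from typing import Callable, Tuple

Pair = Tuple[str, str]            # (lexical symbol, surface symbol)
Seq = Tuple[Pair, ...]            # a symbol-pair sequence
Context = Callable[[Seq], bool]   # assumed stand-in for a context-expression


def literal(*pairs: Pair) -> Context:
    """Context-expression characterising exactly one fixed sequence."""
    return lambda seq: tuple(seq) == tuple(pairs)


def matches_right_end(ce: Context, s: Seq) -> bool:
    """True iff some partition (P1, P2) of s has P2 in the set of ce."""
    return any(ce(s[i:]) for i in range(len(s) + 1))


def matches_left_end(ce: Context, s: Seq) -> bool:
    """True iff some partition (P1, P2) of s has P1 in the set of ce."""
    return any(ce(s[:i]) for i in range(len(s) + 1))
```

Enumerating every split point mirrors the quantification over partitions in the definitions; a practical implementation would more likely run a finite-state automaton over the sequence.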
    <Paragraph position="4"> In a two-level morphological grammar, there are generally three sorts of rule, although one of them can be re-expressed as a combination of rules of the two more basic sorts. The first basic form of rule is the context restriction rule, written with the operator &amp;quot;=&gt;&amp;quot; separating the symbol-pair from the specification of the contexts. For example, l:i =&gt; b:b e:e would mean &amp;quot;if there is a lexical l paired with a surface i, then there must be a lexical b and surface b on its left, and a lexical e and surface e on its right&amp;quot;.</Paragraph>
    <Paragraph position="5"> On the other hand, a surface coercion rule, written using the operator &amp;quot;&lt;=&amp;quot;, indicates that wherever the contexts (i.e. the right side of the rule) occur, and the lexical symbol is as given in the pair on the left side of the rule, then the surface symbol must be as given on the left side of the rule. For example: l:i &lt;= b:b e:e would mean that &amp;quot;whenever there is a lexical b and surface b on the left, a lexical e and surface e on the right, and a lexical l, then the surface symbol must be i&amp;quot;.</Paragraph>
    <Paragraph position="6"> The third type of rule, illustrated earlier, uses the &amp;quot;&lt;=&gt;&amp;quot; operator, and is defined to be equivalent to a pair of rules, one of each of the two basic types, but with the same content.</Paragraph>
    <Paragraph position="7"> Hence, no formal definition will be given of the third type of rule, on the grounds that a grammar written using the &amp;quot;&lt;=&gt;&amp;quot; operator is merely an abbreviation for a larger set of rules of the two basic types. We will first define the form of restriction imposed by rules normally written with the &amp;quot;=&gt;&amp;quot; operator (&amp;quot;context restriction&amp;quot; rules).</Paragraph>
    <Paragraph position="8"> A set R of two-level morphological rules contextually allows a symbol-pair sequence S iff, for every partition &lt;P1, a:a', P2&gt; of S, either there is no rule of the form &lt;a:a', C&gt; in R, or there is at least one rule &lt;a:a', C&gt; in R such that C contains a context pair &lt;LC, RC&gt; such that LC matches P1 at the right end and RC matches P2 at the left end.</Paragraph>
    <Paragraph position="9"> The definition corresponding to a &amp;quot;surface coercion&amp;quot; rule (operator &amp;quot;&lt;=&amp;quot;) is as follows. A two-level morphological rule R = &lt;&lt;a,a'&gt;, C&gt; coercively allows a symbol-pair sequence S iff for every possible partition &lt;P1, b:b', P2&gt; of S and every element &lt;LC, RC&gt; in C such that LC matches P1 at the right end, and RC matches P2 at the left end, if b = a, then b' = a'. An alternative but equivalent variation on the last definition would be that a two-level morphological rule R = &lt;&lt;a,a'&gt;, C&gt; coercively disallows a symbol-pair sequence S iff there is a possible partition &lt;P1, b:b', P2&gt; of S and an element &lt;LC, RC&gt; in C such that LC matches P1 at the right end, RC matches P2 at the left end, b = a and b' ≠ a'.</Paragraph>
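The definitions of "contextually allows" and "coercively allows" can be checked mechanically. The following Python sketch is one possible reading, under the assumptions that a sequence is a tuple of (lexical, surface) pairs, that each context-expression is given as a membership predicate, and that a rule is a pair of a symbol-pair and a list of (LC, RC) context pairs. All function names are hypothetical, and the example rule is the l:i rule discussed above with b:b as left context and e:e as right context.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]
Seq = Tuple[Pair, ...]
Context = Callable[[Seq], bool]
Rule = Tuple[Pair, List[Tuple[Context, Context]]]   # (pair, list of (LC, RC))


def literal(*pairs: Pair) -> Context:
    """Context-expression characterising exactly one fixed sequence."""
    return lambda seq: tuple(seq) == tuple(pairs)


def right_end(ce: Context, s: Seq) -> bool:
    return any(ce(s[i:]) for i in range(len(s) + 1))


def left_end(ce: Context, s: Seq) -> bool:
    return any(ce(s[:i]) for i in range(len(s) + 1))


def context_holds(contexts, p1: Seq, p2: Seq) -> bool:
    """Some (LC, RC) has LC matching p1 at the right end and RC matching p2 at the left end."""
    return any(right_end(lc, p1) and left_end(rc, p2) for lc, rc in contexts)


def contextually_allows(rules: List[Rule], s: Seq) -> bool:
    """Every pair that has a rule must be licensed by at least one of its rules."""
    for k, pair in enumerate(s):
        relevant = [ctx for p, ctx in rules if p == pair]
        if relevant and not any(context_holds(c, s[:k], s[k + 1:]) for c in relevant):
            return False
    return True


def coercively_allows(rule: Rule, s: Seq) -> bool:
    """No position has the rule's lexical symbol, a matching context, and a different surface symbol."""
    (a, a_surf), contexts = rule
    for k, (b, b_surf) in enumerate(s):
        if b == a and b_surf != a_surf and context_holds(contexts, s[:k], s[k + 1:]):
            return False
    return True


rule = (("l", "i"), [(literal(("b", "b")), literal(("e", "e")))])
print(contextually_allows([rule], (("b", "b"), ("l", "i"), ("e", "e"))))
print(coercively_allows(rule, (("b", "b"), ("l", "l"), ("e", "e"))))
```

The coercion check is written in the "disallows" form of the definition and negated, which is equivalent to the universally quantified form.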
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
TWO-LEVEL GRAMMARS
</SectionTitle>
    <Paragraph position="0"> Given two alphabets A and A', a two-level morphological grammar based on A and A' consists of a pair &lt;CR, SC&gt; where CR and SC are finite sets of two-level morphological rules over A and A'. The two sets of rules are the context restriction and surface coercion rules respectively.</Paragraph>
    <Paragraph position="1"> One minor detail which must now be considered is the question of feasible pairs. When set-mnemonics and variables are used within rules, these are deemed to cover not all possible symbol-pairs, but only those which are &amp;quot;feasible&amp;quot;. Even when not using these abbreviatory devices, it is necessary to have some notion of feasible symbol-pair, since such pairs are allowed to occur freely even if licensed by no rule (providing no rule forbids them). Usually, pairs of the form x:x (where x is in the intersection of the two alphabets) are taken as feasible, but any pairs which appear in a rule are also deemed feasible. If we assume that the notion of a symbol-pair occurring in a regular expression is clear enough, occurrence within a rule set is straightforward-- a symbol-pair a:a' is said to occur in a rule &lt;b:b', C&gt; iff either a:a' = b:b' or for at least one element &lt;LC, RC&gt; of C, a:a' occurs in at least one of LC and RC.</Paragraph>
    <Paragraph position="2"> Given a two-level morphological grammar G = &lt;CR, SC&gt;, the set of feasible pairs in G is the set of symbol-pairs {a:a' | a:a' occurs in some element of CR ∪ SC}. (In an implemented system, the user may be allowed to declare certain pairs as feasible, but at this level of abstraction we do not need to include this in our definition of a two-level morphological grammar, since such an effect could be represented by including rather vacuous context-restriction rules of the form [...].) A symbol-pair sequence S is generated by G iff all the following hold: (i) all the symbol-pairs in S are feasible pairs in G; (ii) each rule in SC coercively allows S; (iii) the set CR of rules contextually allows S. Notice that the two classes of rules are treated slightly differently - surface coercion rules are conjoined, forming a set of constraints all of which must be met, and context restriction rules are disjoined, giving a set of possible licensing contexts. If no rules apply to a particular symbol-pair, it is acceptable if and only if it is feasible.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
THE LEXICON
</SectionTitle>
    <Paragraph position="0"> The mechanisms described so far have provided a way of relating one sequence of symbols to another sequence (of the same length). There has been little or no asymmetry between the roles played by the two sequences, and no explicit indication of how these rules might achieve the practical task of segmenting a word into a set of lexical forms which appear in a given dictionary. 1 The first convention that is needed is quite simple - the string of lexical symbols is regarded as being supplied by any valid concatenation of lexical forms. That is, the set of lexical entries implicitly defines an infinite set of strings of indefinite length, formed by any concatenation of lexical forms. It is in the course of integrating the string-matching with the segmentation that the special null symbol will be needed, so we must first define the notion of two strings being the same after the removal of nulls.</Paragraph>
    <Paragraph position="1"> Suppose we have some symbolic alphabet A.</Paragraph>
    <Paragraph position="2"> We define the function &amp;quot;delete&amp;quot; from A × A* to A* as follows, where e denotes the empty string: delete(a, e) = e; delete(a, aS) = delete(a, S); delete(a, bS) = b delete(a, S) for any b ≠ a.</Paragraph>
    <Paragraph position="3"> 1 The formal arguments concerning generative power concern only the mechanisms presented so far, so readers uninterested in the interface to the lexicon may skip this section.</Paragraph>
    <Paragraph position="4"> The other minor formal definition we need is to allow us to move from equal-length sequences of symbol-pairs to pairs of equal-length symbol-sequences in the obvious way. Suppose S1 and S2 are two sequences of symbols, of equal length, with S1 = a1...an and S2 = b1...bn. Then the symbol-pair sequence associated with S1 and S2 is the sequence a1:b1...an:bn. We can then define a two-level morphological grammar as licensing a pair of strings of equal length, iff their associated symbol-pair sequence is generated by the grammar.</Paragraph>
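The "delete" function and the pairing of equal-length strings translate directly into code. The sketch below assumes that every symbol is a single character, so that Python strings can stand in for symbol sequences; the function name pair_sequence and the example strings are illustrative assumptions rather than material from the formal definitions.

```python
def delete(a: str, s: str) -> str:
    """Remove every occurrence of the symbol a from the string s."""
    if s == "":
        return ""                     # delete(a, e) = e
    head, rest = s[0], s[1:]
    if head == a:
        return delete(a, rest)        # delete(a, aS) = delete(a, S)
    return head + delete(a, rest)     # delete(a, bS) = b delete(a, S)


def pair_sequence(s1: str, s2: str):
    """Symbol-pair sequence associated with two equal-length strings."""
    if len(s1) != len(s2):
        raise ValueError("strings must have equal length")
    return list(zip(s1, s2))


# Hypothetical lexical and surface strings padded with the null symbol "0".
lexical, surface = "move+ed", "mov00ed"
print(pair_sequence(lexical, surface))
print(delete("0", surface))
```

The recursion follows the three defining clauses one for one; for long strings an iterative version, or simply s.replace(a, ""), would do the same job.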
    <Paragraph position="5"> A lexical segmentation system consists of a tuple (AL, AS, 0, L, G) where AL is a finite set (the lexical alphabet), AS is a finite set (the surface alphabet), 0 is a symbol which is not an element of AL ∪ AS, L is a set (the set of lexical forms) of non-null elements of AL*, and G is a two-level morphographemic grammar based on AL ∪ {0} and AS ∪ {0}.</Paragraph>
    <Paragraph position="6"> Given a lexical segmentation system (AL, AS, 0, L, G), a string S ∈ AS* can be segmented as &lt;l1,...,ln&gt; where li ∈ L for all i, if there are strings S1 ∈ AL*, S2 ∈ AS* such that the following all hold:</Paragraph>
    <Paragraph position="8"> Notice that there is no distinguished symbol indicating a morpheme boundary or word boundary. Although the writer of the two-level rules will probably find it useful to insert certain special symbols (e.g. the &amp;quot;+&amp;quot; used in the example above), these have no special significance, and rules must be written to define how they relate to other symbols. The boundaries between morphemes are implicit in the successful match between the surface form (via the two-level rules) and the concatenated sequence of lexical forms.</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
CROSS-LINKED LEXICONS
</SectionTitle>
    <Paragraph position="0"> In Koskenniemi(1983a) (and in the papers in Dalrymple et al.(1983)) the interface to the lexicon is slightly more complicated, since the representation of morphotactic information is built into the interface, in the following way. A lexical entry (for a single morpheme) contains one or more continuation classes which indicate what categories of morpheme might follow it within a valid word; for example, a noun stem is marked as allowing a noun suffix as a possible continuation. The morphemes are not held in a single, uniform dictionary, but in a set of sublexicons, where each sublexicon corresponds to some single morphotactic class. Hence, when the lookup process has found a particular morpheme (say, a noun stem) by matching entries in the noun-stem sublexicon, the indication that noun-suffix is a possible continuation will cause the lookup process to continue scanning in the noun-suffix sublexicon as it matches the input word from left to right. This can be rephrased in a more declarative way by stating that an input string S corresponds to a sequence of lexical forms w1,...,wn if S matches w1w2...wn (the concatenation of the forms) according to the morphographemic rules, and for each i between 1 and n-1, wi+1 is in a continuation class of wi.</Paragraph>
    <Paragraph position="1"> A lexical segmentation system would then have to include a function which mapped each lexical entry to its set of continuation classes. Hence the definitions given above would have to be altered to the following.</Paragraph>
    <Paragraph position="2"> A lexical segmentation system consists of a tuple (AL, AS, 0, {L1,...,Ln}, f, G) where AL is a finite set (the lexical alphabet), AS is a finite set (the surface alphabet), 0 is a symbol which is not an element of AL ∪ AS, {Li} is a finite set of finite sets of non-null elements of AL* (the sublexicons), f is a function which associates with each pair &lt;w, j&gt; (where w ∈ Lj) a subset of {L1,...,Ln} (the continuation class mapping) and G is a two-level morphographemic grammar based on AL ∪ {0} and AS ∪ {0}.</Paragraph>
    <Paragraph position="3"> Given a lexical segmentation system (AL, AS, 0, {L1,...,Ln}, f, G), a string S in AS* can be segmented as &lt;l1,...,ln&gt; where lj is in L_{g(j)} for each j, if there are strings S1 in AL*, S2 in AS* such that the following all hold:</Paragraph>
    <Paragraph position="5"> L_{g(j+1)} ∈ f(lj, g(j)) for each j from 1 to n-1. The advantage of introducing cross-linked lexicons is that some form of morphotactic information can be inserted directly into the lexicon, and the processing of this information incorporated into the scanning of the surface string very easily. One theoretical disadvantage is that it imposes a finite-state structure on the morphotactics, which may well be undesirable. If cross-linked sublexicons are not used, some further descriptive device is needed to express morphotactic information in a usable form, but this could be completely separate from the two-level morphology system (cf. Ritchie, Pulman et al.(1987)).</Paragraph>
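The continuation-class condition on its own can be illustrated with a small Python sketch. Everything concrete here is hypothetical: the sublexicon names, the lexical forms, and the function name respects_continuations are invented for illustration, and sublexicons are indexed by name rather than by number. The sketch checks only that each form belongs to its stated sublexicon and that each following sublexicon is among the continuations of the preceding form; it does not model the two-level rules.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical sublexicons, one per morphotactic class.
SUBLEXICONS: Dict[str, Set[str]] = {
    "noun-stem": {"cat", "dog"},
    "noun-suffix": {"+s"},
}

# Hypothetical continuation class mapping f:
# (form, its sublexicon) maps to the sublexicons that may follow it.
CONTINUATIONS: Dict[Tuple[str, str], Set[str]] = {
    ("cat", "noun-stem"): {"noun-suffix"},
    ("dog", "noun-stem"): {"noun-suffix"},
    ("+s", "noun-suffix"): set(),
}


def respects_continuations(forms: List[Tuple[str, str]]) -> bool:
    """forms lists (lexical form, sublexicon name) pairs in left-to-right order."""
    for form, lex in forms:
        if form not in SUBLEXICONS.get(lex, set()):
            return False
    for (form, lex), (_, next_lex) in zip(forms, forms[1:]):
        if next_lex not in CONTINUATIONS.get((form, lex), set()):
            return False
    return True


print(respects_continuations([("cat", "noun-stem"), ("+s", "noun-suffix")]))
```

Because each step consults only the current form and its sublexicon, the check is a walk through a finite-state graph, which is the finite-state structure on morphotactics noted above.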
  </Section>
  <Section position="10" start_page="0" end_page="0" type="metho">
    <SectionTitle>
LANGUAGES GENERATED
</SectionTitle>
    <Paragraph position="0"> With the above definitions, it is now possible to ask what sorts of symbol-pair languages can be characterised using a two-level morphological grammar. Here we shall ignore the issue of the interface to the lexicon, and simply consider the capacity of two-level morphological grammars to characterise sets of sequences of symbol-pairs.</Paragraph>
    <Paragraph position="1"> Lemma 1 : Let R be a set of two-level morphological rules. Let E1 and E2 be symbol-pair sequences such that R contextually allows E1, and R contextually allows E2. Then R contextually allows the concatenation E1E2.</Paragraph>
    <Paragraph position="2"> Proof : If there is no symbol-pair a:a' in E1E2 such that there is some rule &lt;a:a', C&gt; in R, then R contextually allows E1E2 for trivial reasons. Let a:a' be a symbol-pair occurring in E1E2 such that there is at least one rule &lt;a:a', C&gt; in R. Let &lt;P1, a:a', P2&gt; be a partition of E1E2. It follows from the definitions of a partition and concatenation that either P1 is a proper initial subsequence of E1 and P2 = S2E2 for some sequence S2 (i.e. this occurrence of a:a' is in E1), or P1 = E1S1 for some sequence S1 and P2 is a proper final subsequence of E2 (i.e. this occurrence of a:a' is in E2). That is, either &lt;P1, a:a', S2&gt; is a partition of E1, or &lt;S1, a:a', P2&gt; is a partition of E2. Assume the former is true (a symmetrical argument can be followed for the latter). Since R contextually allows E1, for the partition &lt;P1, a:a', S2&gt; of E1 there is at least one rule &lt;a:a', C&gt; in R whose set C contains at least one context-pair &lt;LC, RC&gt; such that LC matches P1 at the right end and RC matches S2 at the left end. If RC matches S2 at the left end, then RC will also match S2E2 = P2 at the left end. Hence, for the partition &lt;P1, a:a', P2&gt; of E1E2 there is at least one rule &lt;a:a', C&gt; in R whose set C contains at least one context-pair &lt;LC, RC&gt; such that LC matches P1 at the right end and RC matches P2 at the left end. A similar argument can be given for the occurrence of a:a' being in E2.</Paragraph>
    <Paragraph position="3"> Since this will be true for any such a:a' in E1E2, R contextually allows E1E2.</Paragraph>
    <Paragraph position="4"> Lemma 2 : Let R = &lt;a:a', C&gt; be a two-level morphological rule. Let E1, E2, E3 be symbol-pair sequences such that E1E2E3 is coercively allowed by R. Then E2 is coercively allowed by R.</Paragraph>
    <Paragraph position="5"> Proof : If E2 were not coercively allowed by R, it would mean that there is a partition &lt;S1, a:b, S2&gt; of E2 such that for some &lt;LC, RC&gt; in C, LC matches S1 at the right end, RC matches S2 at the left end, and b ≠ a'. If this were the case, there would be a corresponding partition &lt;E1S1, a:b, S2E3&gt; of E1E2E3, with LC matching E1S1 at the right end, and RC matching S2E3 at the left end. This would (by definition) mean that R does not coercively allow E1E2E3, which is not the case by hypothesis.</Paragraph>
    <Paragraph position="6"> Corollary : Let C be a set of two-level morphological rules, all of which coercively allow a symbol-pair sequence E. Then all of the rules in C coercively allow any subsequence of E.</Paragraph>
    <Paragraph position="7"> Lemma 3 : Let G be a two-level morphological grammar &lt;CR, SC&gt;, and let L(G) be the set of symbol-pair sequences generated by G. Suppose that there are sequences E1, E2, E3, E4 such that E2 ∈ L(G), E3 ∈ L(G), and E1E2E3E4 ∈ L(G). Then E2E3 ∈ L(G).</Paragraph>
    <Paragraph position="8"> Proof: (i) Since E1E2E3E4 ∈ L(G), all the symbol-pairs in it are feasible with respect to G, hence all the symbol-pairs in E2E3 are feasible. (ii) Since E2 and E3 are in L(G), it follows that CR contextually allows E2 and E3 (by definition). By Lemma 1 above, this means that CR contextually allows E2E3.</Paragraph>
    <Paragraph position="9"> (iii) Since E1E2E3E4 ∈ L(G), it follows (by definition) that all of the rules in SC coercively allow E1E2E3E4. Hence, by the corollary to Lemma 2 above, all of the rules in SC coercively allow E2E3.</Paragraph>
    <Paragraph position="10"> This establishes the three defining conditions for E2E3 ∈ L(G).</Paragraph>
  </Section>
  <Section position="11" start_page="0" end_page="0" type="metho">
    <SectionTitle>
REGULAR RELATIONS
</SectionTitle>
    <Paragraph position="0"> As mentioned in the introduction, two-level grammars have historically been written in two different ways-- as rules as defined here, and as sets of finite-state transducers. In the latter case, each transducer deals with some linguistic phenomenon, and a sequence of symbol-pairs is generated by the grammar if every transducer in the grammar accepts it.</Paragraph>
    <Paragraph position="1"> That is, the symbol-pair sequence must be in the intersection of the languages accepted by the transducers (viewed as acceptors); in procedural terms, this is often referred to as &amp;quot;having the transducers executed in parallel&amp;quot;. Hence, when working with the transducer formalism the linguist has to devise independent transducers whose intersection is the required language.</Paragraph>
    <Paragraph position="2"> Kaplan(1988) discussed the notion of a regular relation, which is, roughly speaking, a symbol-pair language which can be characterised by a regular expression of symbol-pairs. Not surprisingly, a set of symbol-pair sequences is regular if and only if it can be accepted by a finite-state transducer in the obvious way.</Paragraph>
    <Paragraph position="3"> Kaplan has developed an algebraic way of manipulating regular expressions over symbol-pairs together with ordinary regular expressions over symbols, and one of his results is that the intersection of several regular relations is also a regular relation. It follows that the symbol-pair languages accepted by the two-level transducer model are exactly the regular relations.</Paragraph>
    <Paragraph position="4"> Kaplan also formalises the re-expression of two-level morphological rules as transducers (i.e. the compilation mentioned in the introduction above) by constructing regular relations equivalent to languages generated by individual two-level morphological rules. This re-expression is one-way - from a two-level morphological rule an equivalent regular relation can be formed.</Paragraph>
    <Paragraph position="5"> All this suggests that the &amp;quot;parallel transducer&amp;quot; model is at least as powerful as the strict two-level grammar model defined earlier. The obvious question is whether there is a difference in power; in fact, there is.</Paragraph>
    <Paragraph position="6"> Theorem: There are regular relations (i.e. symbol-pair languages characterised by regular expressions of symbol-pairs) which cannot be generated by any two-level morphological grammar. Proof: This follows directly from Lemma 3 above. Any language L generated by a two-level morphological grammar must have the property that if E2, E3, and E1E2E3E4 are in L, then E2E3 is in L. There are regular relations which do not have this property, such as the language b:b ∨ (a:a b:b)* mentioned earlier (which contains b:b and a:a b:b but not b:b a:a b:b, even though that sequence is a subsequence of other elements of the language).</Paragraph>
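The counterexample in the proof can be checked with an ordinary regular-expression library. In the Python sketch below, each identity pair x:x is abbreviated to the single character x, so the symbol-pair language b:b ∨ (a:a b:b)* becomes the string language b|(ab)*. This encoding and the name in_language are assumptions made for illustration, and the particular choice of E1 to E4 is one witness among several.

```python
import re

# Identity pairs x:x are written as the single character x, so the
# symbol-pair language b:b v (a:a b:b)* becomes the string language b|(ab)*.
LANGUAGE = re.compile(r"b|(?:ab)*")


def in_language(seq: str) -> bool:
    """True iff the whole encoded sequence belongs to the language."""
    return LANGUAGE.fullmatch(seq) is not None


# One choice of witnesses: E1 = a:a, E2 = b:b, E3 = a:a b:b, E4 = empty.
e1, e2, e3, e4 = "a", "b", "ab", ""
print(in_language(e2), in_language(e3))   # E2 and E3 are members
print(in_language(e1 + e2 + e3 + e4))     # E1E2E3E4 is a member
print(in_language(e2 + e3))               # E2E3 is not a member
```

Since E2, E3 and E1E2E3E4 are all in the language while E2E3 is not, the language lacks the closure property of Lemma 3.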
    <Paragraph position="7"> There is another, rather trivial, difference between the power of two-level morphological rules and regular relations. According to the definitions given here, the empty sequence of symbol-pairs is in every language generated by a two-level morphological grammar, since it conforms to the definition regardless of the content of the rules. The definitions could be altered to exclude the empty sequence from every language, but it is hard to see how the rule mechanism could be used to allow the empty sequence in some languages but not others.</Paragraph>
  </Section>
</Paper>