<?xml version="1.0" standalone="yes"?>
<Paper uid="E87-1003">
  <Title>FORMALISMS FOR MORPHOGRAPHEMIC DESCRIPTION</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
FORMALISMS FOR MORPHOGRAPHEMIC DESCRIPTION
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ABSTRACT
</SectionTitle>
    <Paragraph position="0"> Recently there has been some interest in rule formalisms for describing morphologically significant regularities in the orthography of words, largely influenced by the work of Koskenniemi. Various implementations of these rules are possible, but there are some weaknesses in the formalism as it stands. An alternative specification formalism is possible which solves some of the problems. This new formalism can be viewed as a variant of the &quot;pure&quot; Koskenniemi model with certain constraints relaxed. The new formalism has particular advantages for multiple character changes. An interpreter has been implemented for the formalism and a significant subset of English morphographemics has been described, but it has yet to be used for describing other languages.</Paragraph>
    <Paragraph position="1"> Background. This paper describes work in a particular area of computational morphology, that of morphographemics. Morphographemics is the area dealing with systematic discrepancies between the surface form of words and the symbolic representation of the words in a lexicon. Such differences are typically orthographic changes that occur when basic lexical items are concatenated; e.g. when the stem move and suffix +ed are concatenated they form moved with the deletion of an e. The work discussed here does not deal with the wider issue of which morphemes can join together. (The way we have dealt with that question is described in Russell et al. (1986)).</Paragraph>
    <Paragraph position="2"> The framework described here is based on the two-level model of morphographemics (Koskenniemi 1983) where rules are written to describe the relationships between surface forms (e.g. moved) and lexical forms (e.g. move+ed). In his thesis, Koskenniemi (1983) presents a formalism for describing morphographemics. In the early implementations (Koskenniemi 1983, Karttunen 1983), although a high-level notation was specified, the actual implementation was by hand-compilation into a form of finite state machine. Later implementations have included automatic compilation techniques (Bear 1986, Ritchie et al. 1987), which take in a high-level specification of surface-to-lexical relationships and produce a directly interpretable set of automata. This pre-compilation is based on the later work of Koskenniemi (1985).</Paragraph>
    <Paragraph position="3"> Note that there is a distinction between the formalism and its implementation. Although the Koskenniemi formalism is often discussed in terms of automata (or transducers) it is not always necessary for the morphologist using the system to know exactly how the rules are implemented, but only that the rules adhere to their defined interpretation. A suitable formalism should make it easier to specify spelling changes in an elegant form. Obviously for practical reasons there should be an efficient implementation, but it is not necessary for the specification formalism to be identical to the low-level representation used in the implementation.</Paragraph>
    <Paragraph position="4"> As a result of our experience with these rule systems, we have encountered various limitations or inelegances, as follows:</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="14" type="metho">
    <SectionTitle>
II
</SectionTitle>
    <Paragraph position="0">
* in a realistically sized rule set, the description may be obscure to the human reader;
* different rules may interact with each other in non-obvious and inconvenient ways;
* certain forms of correspondence demand the use of several rules in a clumsy manner;
* some optional correspondences are extremely difficult to describe.</Paragraph>
    <Paragraph position="1"> Some of these problems can be overcome using a modified formalism, which we have also implemented and tested, although it also has its limitations.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Koskenniemi Rules
</SectionTitle>
      <Paragraph position="0"> The exact form of rule described here is that used in our work (Russell et al. 1986, Ritchie et al. 1987) but is the same as Koskenniemi's (1983, 1985) apart from some minor changes in surface syntax.</Paragraph>
      <Paragraph position="1"> Koskenniemi Rules describe relationships between a sequence of surface characters and a sequence of lexical characters. A rule consists of a rule pair (which consists of a lexical and a surface character), an operator, a left context and a right context. There are three types of rule. Context Restriction: These are of the form pair → LeftContext --- RightContext. This specifies that the rule pair may appear only in the given context.</Paragraph>
      <Paragraph position="2"> Surface Coercion: These are of the form pair ← LeftContext --- RightContext. This specifies that if the given contexts and lexical character appear then the surface character must appear.</Paragraph>
      <Paragraph position="3"> Combined Rule: This final rule type is a combination of the above two forms and is written pair ↔ LeftContext --- RightContext. This form of rule specifies that the surface character of the rule pair must appear if the left and right contexts appear and the lexical character appears, and also that this is the only context in which the rule pair is allowed.</Paragraph>
      <Paragraph position="4"> The operator types may be thought of as a form of implication. Contexts are specified as regular expressions of lexical and surface pairs. For example the following rule:</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="14" type="sub_section">
      <SectionTitle>
Epenthesis
</SectionTitle>
      <Paragraph position="0"> +:e ↔ { s:s x:x z:z &lt; {s:s c:c} h:h &gt; } --- s:s specifies (some of) the cases when an e is inserted at the conjunction of a stem morpheme and the suffix +s (representing plurals for nouns and third person singular for verbs). The braces in the left context denote optional choices, while the angled brackets denote sequences. The above rule may be summarised as &quot;an e must be inserted in the surface string when it has s, x, z, ch or sh in its left context and s in its right&quot;.</Paragraph>
      <Paragraph position="1"> Another addition to the formalism is that alternative contexts may be specified for each rule pair. This is done with the or connective for multiple left and right contexts on the right hand side of the rule e.g.</Paragraph>
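A minimal sketch of how the combined Epenthesis rule above could be checked against an aligned lexical/surface string pair. It is an illustration only, not the authors' interpreter: the function names (align, left_context_ok, right_context_ok, epenthesis_ok) and the use of the character 0 for the null symbol are assumptions, and the rule is hard-coded rather than compiled from the notation.

```python
# Pairs are (lexical, surface) tuples; "0" stands for the null symbol (assumed).
SIBILANT_PAIRS = {("s", "s"), ("x", "x"), ("z", "z")}
H_PREFIX_PAIRS = {("s", "s"), ("c", "c")}


def align(lexical, surface):
    """Pair up two equal-length strings, character by character."""
    if len(lexical) != len(surface):
        raise ValueError("lexical and surface strings must have equal length")
    return list(zip(lexical, surface))


def left_context_ok(pairs, i):
    """Left context: s:s, x:x or z:z, or h:h preceded by s:s or c:c."""
    if i >= 1 and pairs[i - 1] in SIBILANT_PAIRS:
        return True
    return i >= 2 and pairs[i - 1] == ("h", "h") and pairs[i - 2] in H_PREFIX_PAIRS


def right_context_ok(pairs, i):
    """Right context: the next pair is s:s."""
    return len(pairs) > i + 1 and pairs[i + 1] == ("s", "s")


def epenthesis_ok(pairs):
    """Check both halves of the combined rule for every lexical '+'."""
    for i, (lex, surf) in enumerate(pairs):
        if lex != "+":
            continue
        in_context = left_context_ok(pairs, i) and right_context_ok(pairs, i)
        if surf == "e" and not in_context:
            return False  # context restriction violated
        if in_context and surf != "e":
            return False  # surface coercion violated
    return True


print(epenthesis_ok(align("box+s", "boxes")))  # True
print(epenthesis_ok(align("box+s", "box0s")))  # False: e is coerced here
print(epenthesis_ok(align("boy+s", "boy0s")))  # True: context does not apply
```

The two early returns correspond to the two one-directional rule types: the first is the context restriction, the second the surface coercion.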
      <Paragraph position="3"> This example also introduces sets - C and V (which are elsewhere declared to represent consonants and vowels). The or construct states that e can correspond to 0 (the null symbol) when (and only when) in either of the two given contexts.</Paragraph>
      <Paragraph position="4"> The first option above copes with words such as moved resolving with move+ed and the second deals with examples like agreed resolving with agree+ed.</Paragraph>
      <Paragraph position="5"> Sets have a somewhat non-standard interpretation within this basic formalism. The expansion of them is done in terms of the feasible set. This is the set of all lexical and surface pairs mentioned anywhere in the set of rules. That is, all identity pairs from the intersection of the lexical and surface alphabets and all concrete pairs from the rules, where concrete pairs are those pairs that do not contain sets. The interpretation of a pair containing a set is all members of the feasible set that match. This means that if y:i is a member of the feasible set and a set Ve is declared for the set {a e i o u y}, the pair Ve:Ve represents the pair y:i as well as the more obvious ones.</Paragraph>
      <Paragraph position="6"> Traditionally (if such a word can be used), Koskenniemi Rules are implemented in terms of finite state machines (or transducers). KIMMO (Karttunen 1983), one of the early implementations, required the morphologist to specify the rules directly in transducer form, which was difficult and prone to error. Koskenniemi (1985) later described a possible method for compilation of the high-level specification into transducers. This means the morphologist does not have to write and debug low-level finite state machines.</Paragraph>
      <Paragraph position="7"> The basic idea behind the Koskenniemi Formalism - that rules should describe correspondences between a surface string and a lexical string (which effectively represents a normal form) - appears to be sound. The problems listed here are not fundamental to the underlying theory, that of describing relationships between surface and lexical strings, but are more problems with the exact form of the rule notation. The formalism as it stands does not make it impossible to describe many phenomena but can make it difficult and unintuitive.</Paragraph>
      <Paragraph position="8"> One problem is that of interaction between rules. This is when a pair that is used in a context part of a rule A is also restricted by some other rule B, but the context within which the pair appears in A is not a valid context with respect to B. An example will help to illustrate this. Suppose, having developed the Elision rule given above, the linguist wishes to introduce a rule which expresses the correspondence between reduction and the lexical form reduce+ation, a phenomenon apparently unrelated to elision. The obvious rules are:</Paragraph>
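The feasible-set interpretation of sets described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names feasible_pairs and expand_pair, the small alphabets, and the use of 0 for the null symbol are all hypothetical and do not come from the paper.

```python
def feasible_pairs(lexical_alphabet, surface_alphabet, concrete_rule_pairs):
    """Identity pairs over the shared alphabet plus concrete pairs from the rules."""
    shared = set(lexical_alphabet).intersection(surface_alphabet)
    identities = {(c, c) for c in shared}
    return identities | set(concrete_rule_pairs)


def expand_pair(lex_sym, surf_sym, sets, feasible):
    """All feasible pairs matched by a pair that may contain set names."""
    lex_members = sets.get(lex_sym, {lex_sym})
    surf_members = sets.get(surf_sym, {surf_sym})
    return {(l, s) for (l, s) in feasible if l in lex_members and s in surf_members}


sets = {"Ve": set("aeiouy")}       # the set Ve from the text
letters = "aeiouybcd"              # a deliberately tiny alphabet (assumed)
with_yi = feasible_pairs(letters + "+", letters + "0", {("y", "i"), ("e", "0")})
without_yi = feasible_pairs(letters + "+", letters + "0", {("e", "0")})

print(("y", "i") in expand_pair("Ve", "Ve", sets, with_yi))     # True
print(("y", "i") in expand_pair("Ve", "Ve", sets, without_yi))  # False
```

The second call shows the dependence on the rest of the rule set: removing the rule that mentions y:i silently changes what Ve:Ve stands for.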
      <Paragraph position="10"> However, these rules do not operate independently. The pair e:0 in the left context of the A-deletion rule is not licensed by the Elision rule as it occurs in a context (c:c --- &lt; +:0 a:0 &gt;) which is not valid with respect to the right context of the Elision rule, since the V:V pair does not match the pair a:0. The necessary Elision rule to circumvent this problem is:</Paragraph>
      <Paragraph position="12"> Such possible situations mean that the writer of the rules must check, every time the rule pair from a rule A is used within one of the context statements of another rule B, that the character sequence in that context statement is valid with respect to rule A. Theoretically it would be possible for a compiler to check for such cases, although this would require finding the intersection of the languages generated by the set of finite state automata, which is computationally expensive (Garey and Johnson 1979, p. 266).</Paragraph>
      <Paragraph position="13"> A similar problem which is more easily detected is what can be termed double coercion.</Paragraph>
      <Paragraph position="14"> This is when two rules have the same lexical character in their rule pair, and their respective left and right contexts have an intersection. The situation which could cause this is where an underlying lexical character can correspond to two different surface characters, in different contexts, with the correspondence being completely determined by the context, but with one context description being more general than (subsuming) the other. For example, the following rules allow lexical l to map to surface null or surface i (and might be proposed to describe the generation of forms like probably and probability from probable):</Paragraph>
      <Paragraph position="16"> Matching the surface string b00 to the lexical string ble (as demanded by the first rule) would be invalid because the second rule is coercing the lexical l to a surface i; similarly the surface string bi0 would not be able to match the lexical string ble because of the first rule coercing the lexical l to a surface 0. (Again, such conflicts between rules could in principle be detected by a compiler).</Paragraph>
      <Paragraph position="17"> There appears to be no simple way round this within the formalism. A possible modification to the formalism which would stop conflicts occurring would be to disallow the inclusion of more than one rule with the same lexical character in the rule pair, but this seems a little too restrictive. One argument that has been made against the Koskenniemi Formalism is that multiple character changes require more than one rule. That is where a group of characters on the surface matches a group in the lexicon (as opposed to one character changing twice, which is not catered for, nor is intended to be, in the frameworks presented here).</Paragraph>
      <Paragraph position="18"> For example in English we may wish to describe the relationship between the surface form application and the lexical form apply+ation as a two-character change i c to y +. The general way to deal with multiple character changes in the Koskenniemi Formalism is to write a rule for each character change. Where a related character change is referred to in a context of a rule it should be written as a lexical character and an &quot;=&quot; on the surface, where &quot;=&quot; is defined as a surface set that consists of all surface characters. Thus the application example can be encoded as follows.</Paragraph>
      <Paragraph position="20"> The &quot;=&quot; on the surface must be used to ensure that the rules enforce each other. If the following were</Paragraph>
      <Paragraph position="22"> then ap~3~atlon would ~ be matched with apply+ation. This technique is not particularly intuitive but does work. It has been suggested that a compiler could automatically do this.</Paragraph>
      <Paragraph position="23"> Another problem is that because only one rule may be written for each pair, the rules are effectively sorted by pair rather than by phenomenon, so when a change is genuinely a multiple change the changes in it cannot necessarily be described together, thus making a rule set difficult to read.</Paragraph>
      <Paragraph position="24"> Because of the way sets are expanded, the interpretation of rules depends on all the other rules. The addition or deletion of a spelling rule may change the feasible pair set and hence a rule's interpretation may change. The problem is not so much that the rules then need to be recompiled (which is not a very expensive operation) but that the interpretation of a rule cannot be viewed independently from the rest of the rule set.</Paragraph>
      <Paragraph position="25"> The above problems are all actually criticisms of the elegance of the formalism for describing spelling phenomena as opposed to actual restrictions in its descriptive power. However, one problem that has been pointed out by Bear is that rule pairs can only have one type of operator, so that a pair may not be optional in one context but mandatory in another.</Paragraph>
      <Paragraph position="26"> There has also been some discussion of the formal descriptive power of the formalism, particularly the work of Barton (1986). Barton has shown that the question of finding a lexical/surface correspondence from an arbitrary Koskenniemi rule set is NP-complete. It seems intuitively wrong to suggest that the process of morphographemic analysis of natural language is computationally difficult, and hence Barton's result suggests that the formalism is actually more powerful than is really needed to describe the phenomenon. A less powerful formalism would be desirable.</Paragraph>
      <Paragraph position="27"> A final point is that although initially this high-level formalism appears to be easy to read and comprehend from the writer's point of view, in practice when a number of rules are involved this ceases to be the case. We have found that debugging these rules is a slow and difficult task. Alternative Formalism. This section proposes a formalism which is basically similar to the &quot;pure&quot; Koskenniemi one. Again a description consists of a set of rules.</Paragraph>
      <Paragraph position="28"> There are two types of rule which allow the description of the two types of changes that can occur, mandatory changes and optional changes.</Paragraph>
      <Paragraph position="29"> The rules can be of two types: first, surface-to-lexical rules, which are used to describe optional changes, and second, lexical-to-surface rules, which are used to describe mandatory changes. The interpretation is as follows. Surface-to-Lexical Rules: These rules are of the form LHS → RHS</Paragraph>
      <Paragraph position="31"> The LHS and RHS are strings of surface and lexical characters respectively, each of the same length. The interpretation is that for a surface string and lexical string to match there must be a partition of the surface string such that each partition is a LHS of a rule and that the lexical string is equal to the concatenation of the corresponding RHSs.</Paragraph>
      <Paragraph position="32"> Lexical-to-Surface Rules: These rules are of the form</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="14" end_page="16" type="metho">
    <SectionTitle>
LHS ← RHS
</SectionTitle>
    <Paragraph position="0"> The LHS and RHS are equal length strings of surface and lexical characters respectively.</Paragraph>
    <Paragraph position="1"> Their interpretation is that any substring of a lexical string that is a RHS of a rule must correspond to the surface string given in the corresponding LHS of the rule.</Paragraph>
    <Paragraph position="2"> This asymmetry in the application of the rules means that LS-Rules (lexical-to-surface rules) can overlap while SL-Rules (surface-to-lexical rules) do not. An example may help to explain their use. A basic set of spelling rules in this formalism would consist first of the simple list of identities, which could be automatically generated from the intersection of the surface and lexical alphabets.</Paragraph>
    <Paragraph position="3"> In addition to this basic set we would wish to add the rule 0 → + which would allow us to match null with a special character marking the start of a suffix. These rules would then allow us to match strings like boy0s to boy+s, girl to girl and walk0ing to walk+ing.</Paragraph>
    <Paragraph position="4"> To cope with epenthesis we can add SL-Rules such as xes → x+s and ches → ch+s, which would allow matching of forms like boxes with box+s and matches with match+s but still allow box0s with box+s. We can make the adding of the e on the surface mandatory rather than just optional by adding a corresponding LS-Rule for each SL-Rule. In this case, if we add the LS-Rules xes ← x+s and ches ← ch+s, the surface string box0s would not match box+s because this would violate the LS-Rule; similarly, match0s would not match match+s.</Paragraph>
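The interpretation of SL-Rules and LS-Rules given above can be sketched in a few lines. This is a simplified illustration, not the automata-based interpreter the paper describes. Every rule is written as a (surface, lexical) tuple; the function names, the use of 0 for null, and the concrete epenthesis rules (xes → x+s, ches → ch+s, modelled on the ves → f+s rule quoted in the text) are assumptions.

```python
import string


def sl_match(surface, lexical, sl_rules):
    """True if the surface string can be partitioned into SL-Rule LHSs whose
    RHSs concatenate to the lexical string."""
    if len(surface) != len(lexical):
        return False
    n = len(surface)
    reachable = [True] + [False] * n  # reachable[i]: first i characters are covered
    for i in range(n):
        if not reachable[i]:
            continue
        for lhs, rhs in sl_rules:
            j = i + len(lhs)
            if surface[i:j] == lhs and lexical[i:j] == rhs:
                reachable[j] = True
    return reachable[n]


def ls_ok(surface, lexical, ls_rules):
    """True if every lexical occurrence of an LS-Rule RHS has the rule's LHS
    at the same position on the surface."""
    for lhs, rhs in ls_rules:
        start = lexical.find(rhs)
        while start != -1:
            if surface[start:start + len(rhs)] != lhs:
                return False
            start = lexical.find(rhs, start + 1)
    return True


def matches(surface, lexical, sl_rules, ls_rules=()):
    return sl_match(surface, lexical, sl_rules) and ls_ok(surface, lexical, ls_rules)


# Identity rules plus the null-to-boundary rule, as described in the text.
BASE = [(c, c) for c in string.ascii_lowercase] + [("0", "+")]
# Assumed epenthesis rules, used both as SL-Rules and as LS-Rules.
EPENTHESIS = [("xes", "x+s"), ("ches", "ch+s")]

print(matches("boxes", "box+s", BASE + EPENTHESIS))              # True
print(matches("box0s", "box+s", BASE + EPENTHESIS))              # True: optional
print(matches("box0s", "box+s", BASE + EPENTHESIS, EPENTHESIS))  # False: mandatory
```

With only SL-Rules the e-insertion is optional; supplying the same pairs as LS-Rules makes it mandatory, which is the asymmetry the text describes.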
    <Paragraph position="5"> However, if some change is optional and not mandatory we need only write the SL-Rule with no corresponding LS-Rule. For example, assuming the word hoof has the alternative plurals hooves or hoofs, we can describe this optional change by writing the SL-Rule ves → f+s. The main difference between this form of rules and the Koskenniemi rules is that now one rule can be written for multiple changes where the Koskenniemi Formalism would require one for each character change. For example, consider the double change described above for matching application with apply+ation. This required two distinct rules in the Koskenniemi Formalism, while in the revised formalism only two clearly related rules are required: icat → y+at and icat ← y+at. One problem which the formalism as it stands does suffer from is that it requires multiple rules to describe different &quot;cases&quot; of changes, e.g. each case of epenthesis requires a rule - one each for words ending in ch, sh, s, x and z. In our implementation rules may be specified with sets instead of just simple characters, thus allowing the rules to be more general. Unfortunately this is not sufficient, as the user really requires to specify the left and right hand sides of rules as regular expressions, thus allowing rules such as: &lt; { &lt;{s c} h&gt; x z s } e s &gt; → &lt; { &lt;{s c} h&gt; x z s } + s &gt; but this seems to significantly reduce the readability of the formalism.</Paragraph>
    <Paragraph position="6"> One useful modification to this formalism could be the collapsing of the two types of rule (LS and SL). It appears that an LS-Rule is never required without a corresponding SL-Rule, so we could change the formalism so that we have two operators: → for the simple SL-Rule for optional changes and ↔ to represent the corresponding SL- and LS-Rules for mandatory changes.</Paragraph>
    <Paragraph position="7"> So far we have implemented an interpreter for this alternative formalism and written a description of English. Its coverage is comparable with our English description in the Koskenniemi Formalism but the alternative description is possibly easier to understand. The implementation of these rules is again in the form of special automata which check for valid and invalid patterns, like that of the Koskenniemi rules. This is not surprising as both formalisms are designed for licensing matches between surface and lexical strings. The time for compilation and interpretation is comparable with that for the Koskenniemi rules. Comparison of the two formalisms. It is interesting to note that if we extended the Koskenniemi formalism to allow regular expressions of pairs on the left hand side of rules rather than just simple pairs, we get a formalism that is very similar to our alternative proposal. The main difference then is the lack of contexts in which the rules apply - in the alternative formalism the rules are also specifying the correspondences for what would be contexts in the Koskenniemi formalism. Because SL-Rules do not overlap, phenomena which are physically close together or overlapping have to be described in one rule; thus it may be the case that changes have to be declared in more than one place. For example, one could argue that there is e-deletion in the matching of reduction to reduce+ation (thus following the Koskenniemi Formalism) or that the change is a double change in that the e-deletion and the a-deletion are the same phenomenon (as in this new formalism). But there may also be cases where the morphologist identifies two separate phenomena which can occur together in some circumstances. In this new formalism rules would be required for each phenomenon and also where the two overlap.</Paragraph>
    <Paragraph position="8"> One example of this in English may be quizzes, where both consonant doubling and e-insertion apply. In this formalism a rule would need to be written for the combined phenomena as well as each individual case. Ideally, a rule formalism should not require information to be duplicated, so that phenomena are only described in one place. In English this does not occur often so seems not to be a problem, but this is probably not true for languages with richer morphographemics such as Finnish and Japanese.</Paragraph>
    <Paragraph position="9"> Interaction between rules, however, can in a sense still exist, but in the formalism's current form it is significantly easier for a compiler to detect it. SL-Rules do not cause interaction, since different possible partitions of the surface string represent different analyses (not conflicting analyses). Interaction can happen only with LS-Rules, which in principle may have overlapping matches and hence may stipulate conflicting surface sequences for a single lexical sequence. Interaction will occur if any RHS of a rule is a substring of a RHS of any other rule (or concatenation of rules) and has a different corresponding LHS. With the formalism only allowing simple strings in rules this would be relatively easy to detect, but if regular expressions were allowed the problem of detection would be the same as in the Koskenniemi Formalism. Double coercion in the new formalism is actually only a special case of interaction.</Paragraph>
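The substring criterion for interaction between LS-Rules can be checked directly when rules are simple strings. The sketch below is an assumption-laden illustration: the function name find_ls_conflicts and the example rules are hypothetical, rules are (surface, lexical) tuples with 0 for null, and the case of an RHS spanning a concatenation of several rules is deliberately not covered.

```python
def find_ls_conflicts(ls_rules):
    """Report rule pairs where one RHS occurs inside another RHS but the
    surface text at that position differs from the smaller rule's LHS."""
    conflicts = []
    for lhs_a, rhs_a in ls_rules:
        for lhs_b, rhs_b in ls_rules:
            if (lhs_a, rhs_a) == (lhs_b, rhs_b):
                continue
            start = rhs_b.find(rhs_a)
            while start != -1:
                if lhs_b[start:start + len(rhs_a)] != lhs_a:
                    conflicts.append(((lhs_a, rhs_a), (lhs_b, rhs_b), start))
                start = rhs_b.find(rhs_a, start + 1)
    return conflicts


# Hypothetical rule sets: the second mixes a mandatory null suffix boundary
# with mandatory e-insertion after x, which cannot both hold for "x+s".
compatible = [("xes", "x+s"), ("ches", "ch+s")]
clashing = [("xes", "x+s"), ("0s", "+s")]

print(find_ls_conflicts(compatible))  # []
print(find_ls_conflicts(clashing))
```

Because the check is a plain pairwise substring search, its cost grows with the square of the number of rules, which is in line with the remark that detection is relatively easy for simple strings.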
    <Paragraph position="10"> The interpretation of symbols representing sets of characters has been changed so that adding and deleting rules does not affect the other rules already in the rule set. This seems to be an advantage, as each rule may be understood in isolation from others.</Paragraph>
    <Paragraph position="11"> One main advantage of the new formalism is that changes can be optional or mandatory. If some change (say e-deletion) is sometimes mandatory and sometimes optional there will be distinct rules that describe the different cases.</Paragraph>
    <Paragraph position="12"> As regards the computational power of the formalism, no detailed analysis has been made, but intuitively it is suspected to be equivalent to the Koskenniemi Formalism. That is, for every set of these rules there is a set of Koskenniemi rules that accepts/rejects the same surface and lexical matches and vice versa. The formal power seems an independent issue here as neither formalism has particular advantages.</Paragraph>
    <Paragraph position="13"> It may be worth noting that both formalisms are suitable for generation as well as recognition. This is due to the use of the two-level model (surface and lexical strings), rather than the formalism notations.</Paragraph>
    <Section position="1" start_page="16" end_page="16" type="sub_section">
      <SectionTitle>
Future Work
</SectionTitle>
      <Paragraph position="0"> Although this alternative formalism seems to have some advantages over the Koskenniemi Formalism (optional and mandatory changes, set notation and multiple character changes), there is still much work to be done on the development of the new formalism. The actual surface syntax of this new formalism requires some experimentation to find the most suitable form for easy specification of the rules. Both the Koskenniemi Formalism and the new one seem adequate for specification of English morphographemics (which is comparatively simple) but the real issue appears to be which of them allows the writer to describe the phenomena in the most succinct form.</Paragraph>
      <Paragraph position="1"> One of the major problems we have found in our work is that although formalisms appear simple when described and initially implemented, actual use often shows them to be complex and difficult to use. There is a useful analogy here with computer programming languages. New programming languages offer different and sometimes better facilities, but in spite of their help, effective programming is still a difficult task. To continue the analogy, both these morphographemic formalisms require a form of debugger to allow the writer to test the rule set quickly and find its shortcomings. Hence we have implemented a debugger for the Koskenniemi Formalism. This debugger acts on user-given surface and lexical strings and allows step or diagnosis modes. The step mode describes the current match step by step in terms of the user-written rules, and explains the reason for any failures (rule blocking, no rule licensing a pair, etc). The diagnosis mode runs the match to completion and summarises the rules used and any failures if they occur. The important point is that the debugger describes the problems in terms of the user-written rules rather than some low-level automata. In earlier versions of our system debugging of our spelling rules was very difficult and time consuming. We do not yet have a similar debugger for our new formalism, but if it is fully incorporated into our system we see a debugger as a necessary part of the system to make it useful.</Paragraph>
      <Paragraph position="2"> Another aspect of our work is that of testing our new formalism with other languages. English has a somewhat simple morphographemics and is probably not the best language to test our formalism on. The Koskenniemi Formalism has been used to describe a number of different languages (see Gazdar (1985) for a list) and seems adequate for many languages. Semitic languages, like Arabic, which have discontinuous changes have been posed as problems to this framework. Koskenniemi (personal communication) has shown that in fact his formalism is adequate for describing such languages. We have not yet used our new formalism for describing languages other than English, but we feel that it should be at least as suitable as the Koskenniemi Formalism.</Paragraph>
      <Paragraph position="3"> Conclusion. This paper has described the Koskenniemi Formalism, which can be used for describing morphographemic changes at morpheme boundaries. It has pointed out some problems with the basic formalism as it stands and proposes a possible alternative. This alternative is at least as adequate for describing English morphographemics and may be suitable for at least the languages which the Koskenniemi Formalism can describe.</Paragraph>
      <Paragraph position="4"> The new formalism is possibly better, as initially it appears to be more intuitive and simple to write, but from experience this cannot be said with certainty until the formalism has been significantly used.</Paragraph>
    </Section>
  </Section>
</Paper>