<?xml version="1.0" standalone="yes"?>
<Paper uid="J87-3008">
  <Title>A COMPUTATIONAL FRAMEWORK FOR LEXICAL DESCRIPTION</Title>
  <Section position="4" start_page="0" end_page="0" type="intro">
    <SectionTitle>
3. SPELLING RULES
</SectionTitle>
    <Paragraph position="0"> These rules (called &quot;morphographemic rules&quot;) are concerned with undoing spelling or phonological changes to recover the form of a word that corresponds to some morpheme entry in the lexicon. For example, moved can be viewed as move+ed, but with the deletion of the extra e; provability can be viewed as prove+able+ity, with adjustments occurring at both internal boundary points.</Paragraph>
    <Paragraph position="1"> The formalism used within this system is based on the work of Koskenniemi (1983a, 1983b) and Karttunen (1983). In earlier versions of this formalism, the linguist had to specify the spelling rules in a low-level notation similar to that for finite state automata, but Koskenniemi (1985) outlined a more perspicuous high-level notation, and we have adopted a variant of that, with a compilation technique inspired by the work of Bear (1986).</Paragraph>
    <Paragraph position="2"> The first point to understand about the rule formalism is that the rules describe relationships between the surface form, that is, the actual word as it appears in a sentence, and the lexical form, as it appears in the citation forms of the lexical entries. For example, moved is the surface form while move and +ed are the lexical forms. What is required is a rule that allows the deletion of an e from the lexical form. Note that the rule should refer to the context where the e can be deleted and not just allow arbitrary deletions of e in the lexical form, as then the surface form red would match reed in the lexicon. The format for the Spelling Rules includes initial declarations and definitions of the associated entities (character sets, etc.) needed to support the actual rule definitions, as follows. The surface alphabet is the set of acceptable symbols in a string being looked up, the lexical alphabet is the set of acceptable symbols within citation forms in lexical entries, and named subsets of these alphabets can be declared. [Computational Linguistics, Volume 13, Numbers 3-4, July-December 1987, p. 292. Graeme D. Ritchie, Stephen G. Pulman, Alan W. Black, and Graham J. Russell, A Framework for Lexical Description.]</Paragraph>
    <Paragraph position="3"> Each spelling rule is specified as a pair (lexical symbol : surface symbol) together with the context in which that pair is acceptable. A lexical symbol can be one of three types: a lexical character from the declared lexical alphabet; a lexical set, declared over a range of lexical characters; or the symbol 0 (zero), which represents the null symbol. Similarly, there are three possibilities for the surface symbol.</Paragraph>
    <Paragraph position="4"> Before a more detailed description of the formalism is given, a simple example may help to explain the notation. The following example describes the phenomenon of adding an e when pluralising some nouns (and also when making some verbs into their third person singular form), e.g. boys as boy+s but boxes as box+s. This phenomenon is known as &quot;epenthesis&quot;:</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Epenthesis
</SectionTitle>
      <Paragraph position="0"> +:e &lt;=&gt; { &lt; { s:s c:c } h:h &gt; s:s x:x z:z } --- s:s The left and right contexts are basically regular expressions, with angle brackets indicating sequences of items, curly braces indicating disjunctive choices, and ordinary parentheses enclosing optional items. This rule assumes that the morpheme +s (see below for comments on the + character) is in the lexicon and represents the plural morpheme. (Let us exclude for the time being its use as the third person singular morpheme.)</Paragraph>
      <Paragraph position="1"> Roughly speaking, the epenthesis rule states that e can be added at a morpheme boundary when and only when the boundary has sh, ch, s, x, or z on the left side and s on the right. The &quot;---&quot; can be thought of as marking the position of the symbol pair +:e.</Paragraph>
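      <Paragraph> As a rough illustration of the contexts in the epenthesis rule, the following Python sketch realises the boundary between a stem and a suffix. It is not the system's implementation: the function name realise_boundary is hypothetical, the left context is approximated by an ordinary regular expression over the surface spelling of the stem, and the boundary is assumed to surface as nothing (+:0) whenever the epenthesis context does not hold. No other spelling rule is modelled.</Paragraph>
      <Paragraph><![CDATA[
```python
import re

# Left context of the epenthesis rule: sh, ch, s, x or z directly
# before the morpheme boundary.
LEFT_CONTEXT = re.compile(r"(sh|ch|s|x|z)$")


def realise_boundary(stem: str, suffix: str) -> str:
    """Join stem and suffix, applying only the epenthesis rule.

    `suffix` uses the lexical form with a leading '+', e.g. '+s'.
    Assumption: outside the epenthesis context the boundary is
    realised as the null symbol (+:0).
    """
    if not suffix.startswith("+"):
        raise ValueError("suffix must start with the boundary marker '+'")
    rest = suffix[1:]
    # +:e holds when and only when both contexts are satisfied.
    if LEFT_CONTEXT.search(stem) and rest.startswith("s"):
        return stem + "e" + rest
    return stem + rest
```
]]></Paragraph>
      <Paragraph> Under these assumptions box and +s give boxes, while boy and +s give boys, matching the examples above.</Paragraph>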
      <Paragraph position="2"> Within our formalism there are no built-in conventions concerning morpheme boundaries. However, it is often necessary to state a rule which stipulates the presence of a morpheme boundary in the context. One way to do this is to add a marker (some special character) to the lexical form of the morphemes involved. Rules would then be able to refer indirectly to morpheme boundaries by means of this special character in the context statement. This means we have morphemes of the lexical form +ed, move, +ing, +ation, etc.</Paragraph>
      <Paragraph position="3"> Another example in our English description is the &quot;E-deletion&quot; rule: E-Deletion:</Paragraph>
      <Paragraph position="5"> where V, C and C2 represent particular subsets of the alphabets, and the = sign matches any symbol (roughly speaking). Although alternatives can be specified within a left or right context using the disjunctive construct, we also need the ability to allow alternatives for full contexts. If separate rules were given for each alternative left and right context, there would be the undesirable effect of each one blocking the other, since rules are treated as conjoined; that is, all rules must match for a sequence of symbol pairs to be acceptable. Hence, to achieve a disjunctive choice for contexts, there is the &quot;or&quot; connective, as used in &quot;E-deletion&quot; above. (This is not fully general, as a rule pair can only have one operator type.) Each context in the above rule covers particular cases: the first allows words like moved as move+ed; the second allows argued as argue+ed; the third allows encouraging as encourage+ing but also copes with courageous as courage+ous; the fourth context deals with e-deletion in words like readability as read+able+ity; and the last context allows e-deletion in reduction as reduce+ation.</Paragraph>
      <Paragraph position="6"> The three possible rule operators are &lt;=, =&gt; and &lt;=&gt;, which represent forms of implication, in the following manner.</Paragraph>
      <Paragraph position="7"> Context Restriction: a:b =&gt; LC --- RC This means that the lexical character a can match the surface character b only when it is in the context of LC and RC, and hence a:b cannot appear in any other context.</Paragraph>
      <Paragraph position="8"> Surface Coercion: a:b &lt;= LC --- RC</Paragraph>
      <Paragraph position="10"> This means that in the context LC and RC a lexical a can only be matched with a surface b and nothing else; for example, a:c is disallowed in this context.</Paragraph>
      <Paragraph position="11"> Combined Rule: a:b &lt;=&gt; LC --- RC This is equivalent to the combination of the context restriction and surface coercion rules. It means that a matches b only in the context LC and RC, and a:b is the only pair possible in that context.</Paragraph>
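      <Paragraph> The three operators can be read as checks on a sequence of lexical:surface pairs. The Python sketch below gives one such reading and is not the compiled form used by the system. It assumes an ASCII rendering of the operators as =&gt;, &lt;= and &lt;=&gt;, simplifies each context to a fixed sequence of pairs directly adjacent to the rule pair (no sets, optional items or disjunctions), and uses the hypothetical names context_holds and rule_accepts.</Paragraph>
      <Paragraph><![CDATA[
```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (lexical symbol, surface symbol)


def context_holds(pairs: List[Pair], i: int,
                  left: List[Pair], right: List[Pair]) -> bool:
    """True if `left` ends just before position i and `right` starts just after it."""
    before = pairs[max(0, i - len(left)):i]
    after = pairs[i + 1:i + 1 + len(right)]
    return before == left and after == right


def rule_accepts(pairs: List[Pair], op: str, pair: Pair,
                 left: List[Pair], right: List[Pair]) -> bool:
    """Check one rule `pair op left --- right` against a pair sequence."""
    if op not in ("=>", "<=", "<=>"):
        raise ValueError("unknown operator: " + op)
    lex, surf = pair
    for i, (l, s) in enumerate(pairs):
        in_ctx = context_holds(pairs, i, left, right)
        # Context restriction: the rule pair may occur only in the context.
        if op in ("=>", "<=>") and (l, s) == pair and not in_ctx:
            return False
        # Surface coercion: in the context, lexical `lex` must surface as `surf`.
        if op in ("<=", "<=>") and l == lex and s != surf and in_ctx:
            return False
    return True
```
]]></Paragraph>
      <Paragraph> With the simplified rule +:e &lt;=&gt; x:x --- s:s, pairing box+s with boxes is accepted. Pairing + with 0 in the same context is rejected by the surface coercion half, and +:e after y:y is rejected by the context restriction half.</Paragraph>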
      <Paragraph position="12"> An addition to the formalism, which is not formally necessary, is the introduction of a &quot;where&quot; clause. This saves the user from writing separate rules for similar phenomena. A good example can be seen in the rule for consonant doubling (gemination): Gemination:</Paragraph>
      <Paragraph position="14"> The rule is effectively duplicated with the variable X bound to each member of the set in turn. If a &quot;where&quot; clause were not used and X were declared as a set ranging over { b d f g l m n p r s t }, the value found for X in the rule pair +:X would not necessarily be the same as the value for X in the left context. There would be no point in giving sets this interpretation, as we do not want the V:V in the left context necessarily to be the same V:V in the right.</Paragraph>
      <Paragraph position="15"> The interpretation of pairs containing sets depends on the notion of feasible pairs. A pair consisting of a lexical symbol and a surface symbol is a feasible pair if either it is a concrete pair (see below) or it consists of two identical symbols from the intersection of the lexical and surface alphabets. Concrete pairs are those pairs appearing in the rules (assuming any &quot;where&quot; clauses are expanded into explicit enumeration) which are made up of characters in the alphabets or the null symbol only (i.e. containing no sets). Pairs containing sets, such as V:V where the lexical set V is { a e i o u y } and the surface set V is { a e i o u y }, are interpreted as all feasible pairs that match. If y:i is a feasible pair then it will match V:V. Rules will typically be written only for pairs a:b where a and b are different characters. It is built into the formalism that, unless otherwise restricted, all feasible pairs are accepted in any context.</Paragraph>
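      <Paragraph> The definition of feasible pairs and the interpretation of a set pair such as V:V can be sketched in Python as follows. This is an illustration under stated assumptions rather than the system's code: the function names feasible_pairs and expand_set_pair are hypothetical, alphabets and sets are modelled as Python sets of one-character strings, and the concrete pairs are assumed to have been collected from the rules already.</Paragraph>
      <Paragraph><![CDATA[
```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]  # (lexical symbol, surface symbol)


def feasible_pairs(lexical_alphabet: Set[str],
                   surface_alphabet: Set[str],
                   concrete_pairs: Iterable[Pair]) -> Set[Pair]:
    """Concrete pairs plus identity pairs over the shared part of the alphabets."""
    identity = {(c, c) for c in lexical_alphabet & surface_alphabet}
    return identity | set(concrete_pairs)


def expand_set_pair(lexical_set: Set[str],
                    surface_set: Set[str],
                    feasible: Set[Pair]) -> Set[Pair]:
    """All feasible pairs matched by a pair of sets such as V:V."""
    return {(l, s) for (l, s) in feasible
            if l in lexical_set and s in surface_set}
```
]]></Paragraph>
      <Paragraph> For instance, if y:i is among the concrete pairs, expanding V:V with V = { a e i o u y } yields the six identity pairs together with y:i, but not a pair such as a:e, which is not feasible.</Paragraph>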
      <Paragraph position="16"> In addition to the definition above for feasible pairs, there is the facility to declare explicitly that certain pairs are feasible. This may be useful where some pair in a rule contains a set and the user wishes it to stand for some concrete pair that does not actually exist in any of the currently specified rules. For example, the pair +:= may be used, where = can be thought of as a set containing the whole surface alphabet. The user may intend this pair to stand for, among others, +:l, although +:l does not actually appear in any of the rules.</Paragraph>
      <Paragraph position="17"> In this case, +:l should be declared as a default pair.</Paragraph>
      <Paragraph position="18"> Any number of spelling rules can be specified (our English description has 15; see Appendix 2 for an annotated list). These rules are applied in parallel to the matching of the surface form and the lexical forms. For a match to succeed, all rules must find it acceptable. All members of the set of feasible pairs not on the left-hand side of some rule (i.e. a:a, b:b, c:c, etc.) are accepted in any context.</Paragraph>
      <Paragraph position="19"> There are some problems with this form of rule.</Paragraph>
      <Paragraph position="20"> When a rule pair a:b from some rule A with the operator &lt;=&gt; or =&gt; also appears within a context of some other rule B, the user must take care to ensure that the context where a:b appears within rule B is catered for in rule A. An example will help to illustrate this point.</Paragraph>
      <Paragraph position="21"> Consider the following two rules: E-Deletion:  The e:0 in the left context of the &quot;A-deletion&quot; rule is in a context that is not catered for within the &quot;E-Deletion&quot; rule. This means that &quot;A-deletion&quot; will always fail. What is required is the addition of another context to the &quot;E-Deletion&quot; rule: or c:c --- &lt; +:0 a:0 t:t &gt; ;; A-deletion This rule-clashing is a significant factor that must be taken into consideration when specifying spelling rules (see Black et al. (1987) for further discussion). We have not yet investigated formal criteria for detecting clashes within a rule-set, and it may in principle be undecidable (or at least highly intractable). Another decision the linguist has to make is when to treat a given alternation as morphographemic, and when to treat it by writing distinct morpheme entries. For example, it seems ridiculous to go as far as writing the following rule: o:e &lt;=&gt; g:w --- &lt; +:0 e:n d:t &gt; which will match went to go+ed. This rule is in fact insufficient, as it introduces the pairs g:w, e:n and d:t into the feasible pairs set and thus allows wear to match gear, etc. If this rule were to be included, then three more would be needed to cope with these three extra pairs.</Paragraph>
      <Paragraph position="22"> But rules that match surface forms to such different lexical forms are not recommended. It seems wise to have went as a morpheme entry with the necessary past tense marking. Went is a clear example, but some others are not so clear. Should written match write+en? The question of when a change is to be taken as a different morpheme or just as a spelling change is a question of the overall adequacy and elegance of the description; there are no firm guidelines.</Paragraph>
    </Section>
  </Section>
</Paper>