<?xml version="1.0" standalone="yes"?>
<Paper uid="C86-1065">
  <Title>A MORPHOLOGICAL RECOGNIZER WITH SYNTACTIC AND PHONOLOGICAL RULES</Title>
  <Section position="4" start_page="272" end_page="272" type="metho">
    <SectionTitle>
DATA FOR CONSONANT DOUBLING
</SectionTitle>
    <Paragraph position="0"> travel+ed ==&gt; (travelled or traveled); both are allowed. In English, final consonants are doubled if they &quot;follow a single [orthographic] vowel and the vowel is stressed&quot; [from Karttunen and Wittenburg 1983]. So, for instance, in [hear+ing], the final [r] is preceded by two vowels, so there is no doubling. In [hack+ing], the final [k] is not preceded by a vowel, so there is no doubling.</Paragraph>
    <Paragraph position="1"> In [question+ing], the last syllable is not stressed, so again there is no doubling.</Paragraph>
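    <Paragraph> As a rough illustration of the condition quoted above, the following sketch tests whether a stem's final consonant is a candidate for doubling. It is not part of the system described here: the function name is hypothetical, only a, e, i, o, u are treated as orthographic vowels, and stress is passed in as a flag because it cannot be read off the spelling.

```python
VOWELS = set("aeiou")  # assumption: "y" is not treated as a vowel here


def doubles_final_consonant(stem: str, final_syllable_stressed: bool) -> bool:
    """Return True if the final consonant follows exactly one vowel letter
    and the caller says that the vowel is stressed."""
    if len(stem) == 0 or stem[-1] in VOWELS:
        return False  # there is no final consonant to double
    # Count the run of vowel letters immediately before the final consonant.
    run = 0
    for ch in reversed(stem[:-1]):
        if ch not in VOWELS:
            break
        run += 1
    return run == 1 and final_syllable_stressed


print(doubles_final_consonant("hear", True))       # two vowels before [r]
print(doubles_final_consonant("hack", True))       # no vowel directly before [k]
print(doubles_final_consonant("question", False))  # last syllable unstressed
```

The need for the stress flag is exactly the difficulty discussed next: the information has to come from somewhere other than the letters of the stem.</Paragraph>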
    <Paragraph position="2"> In Karttunen and Wittenburg [1983] there is a single rule listed to describe the data. However, the rule makes use of a diacritic (') for showing stress, and words in the lexicon must contain this diacritic in order for the rule to work. The same thing could be done in the system being described here, but it was deemed undesirable to allow words in the lexicon to contain diacritics encoding information such as stress. Instead, the following rules are used. Ultimately, the goal is to have some sort of general mechanism, perhaps negative rule features, for dealing with this sort of thing, but for now no such mechanism has been implemented.</Paragraph>
  </Section>
  <Section position="5" start_page="272" end_page="274" type="metho">
    <SectionTitle>
RULES FOR CONSONANT DOUBLING
</SectionTitle>
    <Paragraph position="0"> The allowed-type rules in the top set are those that license consonant doubling. The disallowed-type rules in the second set constrain the doubling so it does not occur in words like [eat+ing] ==&gt; [eating] and [hear+ing] ==&gt; [hearing]. The disallowed-type rules say that a morpheme boundary [+] may never correspond to a consonant when the [+] is followed by a vowel and preceded by that same consonant, which is itself preceded by two vowels.</Paragraph>
    <Paragraph position="1"> The rules given above suffer from the same problem as the previous rules, namely, overgeneration. Although they produce all the right answers and allow multiple forms for words like [travel+er] ==&gt; ([traveller] or [traveler]), which is certainly a positive result, they also allow multiple forms for words which do not allow them. For instance, they generate both [referred] and [refered]. As mentioned earlier, this problem will be tolerated for the time being.</Paragraph>
    <Section position="1" start_page="273" end_page="273" type="sub_section">
      <SectionTitle>
2.2 Comparison with Koskenniemi's Rules
</SectionTitle>
      <Paragraph position="0"> Koskenniemi [1983, 1984] describes three types of rules, as exemplified below: R4) a &gt; b =&gt; c/d e/f _ g/h i/j R5) a &gt; b &lt;= c/d e/f _ g/h i/j R6) a &gt; b &lt;=&gt; c/d e/f _ g/h i/j.</Paragraph>
      <Paragraph position="1">  Rule R4 says that if a lexical [a] corresponds to a surface [b], then it must be within the context given, i.e., it must be preceded by [c/d e/f] and followed by [g/h i/j]. This corresponds exactly to the rule given below: R7) a/b allowed in context c/d e/f _ g/h i/j. The rule introduced as R5 and repeated below says that if a lexical [a] occurs following [c/d e/f] and preceding [g/h i/j], then it must correspond to a surface [b]: R5) a &gt; b &lt;= c/d e/f _ g/h i/j.</Paragraph>
      <Paragraph position="2"> The corresponding rule in the formalism being proposed here would look approximately like this: R10) a/sS disallowed in context c/d e/f _ g/h i/j, where sS is some set of characters to which [a] can correspond that does not include [b].</Paragraph>
      <Paragraph position="3"> A comparison of each system's third type of rule involves composition of rules and is the subject of the next section.</Paragraph>
    </Section>
    <Section position="2" start_page="273" end_page="273" type="sub_section">
      <SectionTitle>
2.3 Rule Composition and Decomposition
</SectionTitle>
      <Paragraph position="0"> In Koskenniemi's systems, rule composition is fairly straightforward. Samples of the three types of rules are repeated here: R4) a &gt; b =&gt; c/d e/f _ g/h i/j R5) a &gt; b &lt;= c/d e/f _ g/h i/j R6) a &gt; b &lt;=&gt; c/d e/f _ g/h i/j.  If a grammar contains the two rules R4 and R5, they can be replaced by the single rule R6.</Paragraph>
      <Paragraph position="1"> In contrast, the composition of rules in the system proposed here is slightly more complicated. We need the notion of a default correspondence. The default correspondence for any alphabetic character is itself. In other words, in the absence of any rules, an alphabetic character will correspond to itself. There may also be characters that are not alphabetic, e.g., the [+] representing a morpheme boundary, currently the only non-alphabetic character in this system. Other conceivable non-alphabetic characters would be an accent mark for representing stress or, say, a hash mark for word boundaries. The default for these characters is that they correspond to 0 (zero). Zero is the name for the null character used in this system.</Paragraph>
      <Paragraph position="2"> Now it is easy to say how rules are composed in this system. If a grammar contains both R11 and R12 below, then R13 may be substituted for them with the same effect: R11) a/b allowed in context c/d e/f _ g/h i/j R12) a/&quot;a's default&quot; disallowed in context c/d e/f _ g/h i/j R13) a --&gt; b / c/d e/f _ g/h i/j. In fact, when a file of rules is read into the system, occurrences of rules like R13 are internalized as if the grammar really contained a rule like R11 and another like R12.</Paragraph>
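      <Paragraph> The decomposition just described can be sketched as follows. This is only an illustration under stated assumptions: the Rule record, the function names, and the choice of a lowercase alphabet are hypothetical and are not taken from the system's actual implementation.

```python
from typing import NamedTuple

ALPHABET = set("abcdefghijklmnopqrstuvwxyz")  # assumed alphabetic characters
NULL = "0"  # zero, the name of the null character in the text


class Rule(NamedTuple):
    kind: str     # "allowed" or "disallowed"
    lexical: str  # lexical character of the main correspondence
    surface: str  # surface character of the main correspondence
    left: tuple   # context to the left of the underscore
    right: tuple  # context to the right of the underscore


def default_surface(lexical: str) -> str:
    """Alphabetic characters default to themselves; all others default to zero."""
    return lexical if lexical in ALPHABET else NULL


def decompose(lexical, surface, left, right):
    """Split a composite rule (like R13) into an allowed-type rule (like R11)
    and a disallowed-type rule on the default correspondence (like R12)."""
    left, right = tuple(left), tuple(right)
    return [
        Rule("allowed", lexical, surface, left, right),
        Rule("disallowed", lexical, default_surface(lexical), left, right),
    ]


for rule in decompose("a", "b", ["c/d", "e/f"], ["g/h", "i/j"]):
    print(rule)
```

For a rule whose lexical character is the morpheme boundary [+], the second rule produced by this sketch forbids the pair +/0 in the given context, since zero is the default for non-alphabetic characters.</Paragraph>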
    </Section>
    <Section position="3" start_page="273" end_page="274" type="sub_section">
      <SectionTitle>
2.4 Using the Rules
</SectionTitle>
      <Paragraph position="0"> Consider again, as an example, the rule R1, repeated below.</Paragraph>
      <Paragraph position="1"> R1) + --&gt; e / {x | z | y/i | s (h) | ch} _ s When this rule is read in, it is expanded into a set of rules whose contexts do not contain disjunction or optionality. Rules</Paragraph>
      <Paragraph position="3"> The disallowed-type rules given here stipulate that a morpheme boundary, lexical [+], may never be paired with a null surface character, [0], in the environments indicated. Another way to describe what disallowed-type rules do, in general, is to say that they expressly rule out certain sequences of pairs of letters. For example, R20 R20) +/0 disallowed in context x _ s states that the sequence</Paragraph>
      <Paragraph position="5"> is never permitted to be a part of a mapping of a surface string to a lexical string.</Paragraph>
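      <Paragraph> The expansion of R1 into rules without disjunction or optionality, mentioned earlier in this section, might be sketched as below. The nested-list encoding of the context and the function name expand are assumptions made for this illustration; the rule reader in the actual system may work quite differently.

```python
from itertools import product


def expand(context):
    """Return every flat sequence denoted by a context that may contain
    ("alt", [branch, ...]) disjunctions and ("opt", branch) optional parts."""
    choices = []
    for item in context:
        if isinstance(item, str):
            choices.append([(item,)])
        elif item[0] == "opt":
            # An optional part is either absent or expanded normally.
            choices.append([()] + expand(item[1]))
        elif item[0] == "alt":
            alternatives = []
            for branch in item[1]:
                alternatives.extend(expand(branch))
            choices.append(alternatives)
        else:
            raise ValueError("unknown context item: %r" % (item,))
    # Pick one alternative per position and concatenate the pieces.
    return [sum(combo, ()) for combo in product(*choices)]


# Left context of R1: {x | z | y/i | s (h) | ch}
R1_LEFT = [("alt", [["x"], ["z"], ["y/i"], ["s", ("opt", ["h"])], ["c", "h"]])]

for flat in expand(R1_LEFT):
    print(flat)
```

Each flat left context produced this way, for example the one consisting of x alone, can then serve as the context of a primitive rule such as R20.</Paragraph>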
      <Paragraph position="6"> The allowed-type rules behave slightly differently from their disallowed-type counterparts. A rule such as R26) '+'/e allowed in context x _ s, says that lexical [+] is not normally allowed to correspond to surface [e]. It also affirms that lexical [+] may correspond to surface [e] between an [x] and an [s]. Other rules starting with the same pair say, in effect, &quot;here is another environment where this pair is acceptable.&quot; The way these rules are to be interpreted is that a rule's main correspondence, i.e., the character pair that corresponds to the underscore in the context, is forbidden except in contexts where it is expressly permitted by some rule.</Paragraph>
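      <Paragraph> The interpretation of the two rule types can be made concrete with a small checker over sequences of lexical/surface pairs. This is a sketch only: rules are encoded here as (main pair, left context, right context) tuples, contexts are written as explicit pairs such as x/x, and all names are hypothetical.

```python
def context_matches(pairs, i, left, right):
    """True if `left` ends just before position i and `right` starts just after it."""
    if len(left) > i:
        return False
    return (pairs[i - len(left):i] == list(left)
            and pairs[i + 1:i + 1 + len(right)] == list(right))


def is_permitted(pairs, allowed, disallowed):
    """Apply disallowed-type rules as filters, and forbid the main pair of any
    allowed-type rule unless some allowed-type rule licenses it in context."""
    restricted = {main for main, _, _ in allowed}
    for i, pair in enumerate(pairs):
        for main, left, right in disallowed:
            if pair == main and context_matches(pairs, i, left, right):
                return False
        if pair in restricted and not any(
            pair == main and context_matches(pairs, i, left, right)
            for main, left, right in allowed
        ):
            return False
    return True


# R26) '+'/e allowed in context x _ s; R20) +/0 disallowed in context x _ s
ALLOWED = [(("+", "e"), [("x", "x")], [("s", "s")])]
DISALLOWED = [(("+", "0"), [("x", "x")], [("s", "s")])]

print(is_permitted(list(zip("fox+s", "foxes")), ALLOWED, DISALLOWED))
print(is_permitted(list(zip("fox+s", "fox0s")), ALLOWED, DISALLOWED))
```

In this sketch the pairing of [fox+s] with [foxes] passes both rules, while pairing the boundary with zero after [x] is rejected by R20, and pairing it with [e] after any other letter is rejected because no allowed-type rule licenses it there.</Paragraph>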
      <Paragraph position="7"> Once the rules are broken into the more primitive allowed-type and disallowed-type rules, there are several ways in which one could try to match them against a string of surface characters in the recognition process. One way would be to wait until a pair of characters was encountered that was the main pair for a rule, and then look backwards to see if the left context of the rule matches the current analysis path. If it does, put the right context on hold to see whether it will ultimately be matched.</Paragraph>
      <Paragraph position="8"> Another possibility would be to continually keep track of the left contexts of rules that are matching the characters at hand, so that when the main character of a rule is encountered, the program already knows that the left context has been matched.</Paragraph>
      <Paragraph position="9"> The right context still needs to be put on hold and dealt with the same way as in the other scheme.</Paragraph>
      <Paragraph position="10"> The second of the two strategies is the one actually employed in this system, though it may very well turn out that the first one is more efficient for the current grammar of English.</Paragraph>
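      <Paragraph> The left-context bookkeeping of the second strategy might look roughly like the sketch below. The class and function names are hypothetical, only the left context is handled, and the point at which the right context would be put on hold is marked by a comment rather than implemented.

```python
class LeftContextTracker:
    """Incrementally records how much of one rule's left context has been
    matched by the pairs consumed so far."""

    def __init__(self, left):
        self.left = list(left)
        self.progress = {0}  # lengths of left-context prefixes matched so far

    def left_matched(self):
        return len(self.left) in self.progress

    def consume(self, pair):
        advanced = {k + 1 for k in self.progress
                    if k != len(self.left) and self.left[k] == pair}
        # A fresh match of the left context may start at any position.
        self.progress = advanced | {0}


def main_pair_positions(pairs, main, left):
    """Positions where the main pair occurs with its left context already matched."""
    tracker = LeftContextTracker(left)
    hits = []
    for i, pair in enumerate(pairs):
        if pair == main and tracker.left_matched():
            hits.append(i)  # here the right context would be put on hold
        tracker.consume(pair)
    return hits


print(main_pair_positions(list(zip("fox+s", "foxes")), ("+", "e"), [("x", "x")]))
```

Because the tracker is updated on every pair, no backward look is needed when the main pair arrives, at the cost of doing some work for rules whose main pair never appears.</Paragraph>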
    </Section>
    <Section position="4" start_page="274" end_page="274" type="sub_section">
      <SectionTitle>
2.5 Possible Correspondences
</SectionTitle>
      <Paragraph position="0"> The rules act as filters to weed out sequences of character pairs, but before a particular mapping can be weeded out, something needs to propose it as being possible. There is a list, called a list of possible correspondences or sometimes a list of feasible pairs, that tells which characters may correspond to which others. Using this list, the recognizer generates possible lexical forms to correspond to the input surface form. These can then be checked against the rules and against the lexicon. If the rules do not weed a form out, and it is also in the lexicon, we have successfully recognized a morpheme.</Paragraph>
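      <Paragraph> A brute-force sketch of this propose-and-filter scheme is given below. It is an illustration under several assumptions: the feasible pairs and the toy lexicon are invented for the example, every character is assumed to be able to correspond to itself, at most one null-paired lexical character is proposed between adjacent surface characters, and the lexicon is a plain set of lexical strings rather than the structure used by the actual recognizer.

```python
from itertools import product

NULL = "0"  # name of the null surface character, as in the text


def lexical_candidates(surface, feasible):
    """Yield every pair sequence whose surface side spells `surface`."""
    by_surface = {}
    for lex, surf in feasible:
        by_surface.setdefault(surf, []).append(lex)
    # Between surface characters: nothing, or one lexical character paired with zero.
    gap_options = [[]] + [[(lex, NULL)] for lex in sorted(by_surface.get(NULL, []))]
    slots = [gap_options]
    for ch in surface:
        lexical_chars = [ch] + sorted(by_surface.get(ch, []))
        slots.append([[(lex, ch)] for lex in lexical_chars])
        slots.append(gap_options)
    for combo in product(*slots):
        yield [pair for part in combo for pair in part]


def recognize(surface, feasible, lexicon, permitted=lambda pairs: True):
    """Keep the proposed lexical forms that pass the rule filter and are in the lexicon."""
    found = set()
    for pairs in lexical_candidates(surface, feasible):
        lexical = "".join(lex for lex, _ in pairs)
        if lexical in lexicon and permitted(pairs):
            found.add(lexical)
    return sorted(found)


FEASIBLE = {("+", "0"), ("+", "e"), ("y", "i")}  # hypothetical non-default pairs
LEXICON = {"fox+s", "cat+s", "spy+s"}            # toy lexicon for the example
print(recognize("foxes", FEASIBLE, LEXICON))
```

The permitted argument is where a rule filter would be plugged in; the default used here accepts every proposed mapping, so only the lexicon does any filtering in the example.</Paragraph>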
    </Section>
  </Section>
  <Section position="6" start_page="274" end_page="275" type="metho">
    <SectionTitle>
3 Syntax
</SectionTitle>
    <Paragraph position="0"> The goal of the work being described was an analyzer that would be easy to use. In the area of syntax, this entails two subgoals.</Paragraph>
    <Paragraph position="1"> First, it should be easy to specify which morphemes may combine with which, and second, when the recognition has been completed, the result should be something that can easily be used by a parser or some other program.</Paragraph>
    <Paragraph position="2"> Karttunen [1983] and Karttunen and Wittenburg [1983] have some suggestions for what a proper syntactic component for a morphological analyzer might contain. They mention using context-free rules and some sort of feature-handling system as possible extensions of both their and Koskenniemi's systems. In short, it has been acknowledged that any such system really ought to have some of the tools that have been used in syntax proper.</Paragraph>
    <Paragraph position="3"> The first course of action that was followed in building this analyzer was to implement a unification system for dags (directed acyclic graphs), and then to have the analyzer unify the dags of all the morphemes encountered in a single analysis. That scheme turned out to be too weak to be practical. The next step was to implement a PATR rule interpreter [Shieber, et al. 1983] so that selected paths of dags could be unified. Finally, when that turned out to be still less flexible than one would like, the capability of handling disjunction in the dags was added to the unification package and to the PATR rule interpreter [Karttunen 1984].</Paragraph>
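    <Paragraph> For readers unfamiliar with unification, the basic operation can be sketched on a much simpler data structure than the one used here. The sketch below treats a feature structure as a nested dictionary with atomic string values; it has no shared substructure, no paths, and no disjunction, so it should be read as a simplification and not as the unification package or the PATR interpreter described above.

```python
def unify(a, b):
    """Unify two feature structures given as nested dicts with atomic values.
    Returns the merged structure, or None if the two are incompatible."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, value in b.items():
            if key in result:
                merged = unify(result[key], value)
                if merged is None:
                    return None  # conflicting values under the same feature
                result[key] = merged
            else:
                result[key] = value
        return result
    # Atomic values unify only with themselves.
    return a if a == b else None


print(unify({"cat": "verb"}, {"form": "inf"}))
print(unify({"cat": "verb"}, {"cat": "noun"}))
```

Unifying every morpheme's structure with every other in this way is the first scheme mentioned above; restricting unification to selected paths and adding disjunction are the refinements that followed.</Paragraph>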
    <Paragraph position="4"> The rules look like PATR rules with the context-free skeleton.</Paragraph>
    <Paragraph position="5"> The first two lines of a rule are just a comment, however, and are not used in doing the analysis. The recognizer starts with the dag [cat: empty]. The rule below states that the &quot;empty&quot; dag may be combined with the dag from a verb stem to produce a dag for a verb.</Paragraph>
    <Paragraph position="6">  tense: pres pers: {1 2}]}.</Paragraph>
    <Paragraph position="7"> The resulting dag will be ambiguous between an infinitive verb, and a present tense verb that is in either the first or second person. (The braces in the rule are the indicators of disjunction.) The verb stem's value for the feature lex will be whatever spelling the stem has. This value will then be the value for the feature word in the new dag.</Paragraph>
    <Paragraph position="8"> The analyzer applies these rules in a very simple way. It always carries along a dag representing the results found thus far. Initially this dag is [cat: empty]. When a morpheme is found, the analyzer tries to combine it, via a rule, with the dag it has been carrying along. If the rule succeeds, a new dag is produced and becomes the dag carried along by the analyzer. In this way the information about which morphemes have been found is propagated. If an [ing] is encountered after a verb has been found, the following rule builds the new dag. It first makes sure that the verb is infinitive (form: inf) so that the suffix cannot be added onto the end of a past participle, for instance, and then makes the tense of the new dag be pres part for present participle. The category of the new dag is verb, and the value for word is the same as it was in the original verb's dag. The form of the input verb is a disjunction of inf (infinitive) with [tense: pres, pers: {1 2}], so the unification succeeds.</Paragraph>
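    <Paragraph> The way the carried-along dag and the disjunctive form value interact can be sketched as follows. This is an approximation under stated assumptions: dags are nested dictionaries, a Python tuple stands for a disjunction, the two rule functions and the value pres_part are hypothetical stand-ins for the PATR-style rules, and the category name verb_stem is invented for the example.

```python
def unify_value(a, b):
    """Unify two values; a tuple stands for a disjunction of alternatives."""
    alts_a = a if isinstance(a, tuple) else (a,)
    alts_b = b if isinstance(b, tuple) else (b,)
    survivors = []
    for x in alts_a:
        for y in alts_b:
            if isinstance(x, dict) and isinstance(y, dict):
                merged = unify_dag(x, y)
            else:
                merged = x if x == y else None
            if merged is not None and merged not in survivors:
                survivors.append(merged)
    if not survivors:
        return None
    return survivors[0] if len(survivors) == 1 else tuple(survivors)


def unify_dag(a, b):
    """Unify two dags (nested dicts); return None on failure."""
    result = dict(a)
    for key, value in b.items():
        if key in result:
            merged = unify_value(result[key], value)
            if merged is None:
                return None
            result[key] = merged
        else:
            result[key] = value
    return result


def apply_verb_stem(current, stem):
    """The "empty" dag plus a verb stem yields a verb with a disjunctive form."""
    if unify_dag(current, {"cat": "empty"}) is None or stem.get("cat") != "verb_stem":
        return None
    return {"cat": "verb", "word": stem["lex"],
            "form": ("inf", {"tense": "pres", "pers": ("1", "2")})}


def apply_ing(current):
    """A verb whose form unifies with inf, plus [ing], yields a present participle."""
    checked = unify_dag(current, {"cat": "verb", "form": "inf"})
    if checked is None:
        return None
    return {"cat": "verb", "word": checked["word"], "tense": "pres_part"}


carried = {"cat": "empty"}  # the dag the analyzer starts with
carried = apply_verb_stem(carried, {"cat": "verb_stem", "lex": "hear"})
carried = apply_ing(carried)
print(carried)
```

The check in apply_ing succeeds for the stem because inf is one of the alternatives in its form value; a dag whose form is some other atomic value would fail the same check.</Paragraph>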
    <Paragraph position="9">  The system also has a rule for combining an infinitive verb with the nominalizing [er] morpheme, e.g., swim : swimmer. This rule, given below, also checks the form of the input verb to verify that it is infinitive. It makes the resulting dag have category: noun, number: singular, and so on.</Paragraph>
    <Paragraph position="10">  The noun thus formed behaves just the same as other nouns.</Paragraph>
    <Paragraph position="11"> In particular, a pluralizing [s] may be added, or a possessive ['s], or any other affix that can be appended to a noun.</Paragraph>
    <Paragraph position="12"> There are other rules in the grammar for handling adjective endings, more verb endings, etc. Irregular forms are handled in a fairly reasonable way. The irregular nouns are listed in the lexicon with form: irregular. Other rules than the ones shown here refer to that feature; they prevent the addition of plural morphemes to words that are already plural. Irregular verbs are listed in the lexicon with an appropriate value for tense (not unifiable with inf) so that the test for infinitiveness will fail when it should. Irregular adjectives, e.g., good, better, best, are dealt with in an analogous manner.</Paragraph>
  </Section>
  <Section position="7" start_page="275" end_page="275" type="metho">
    <SectionTitle>
4 Further Work
</SectionTitle>
    <Paragraph position="0"> There are still some things that are not as straightforward as one would like. In particular, consider the following example. Let us suppose as a first approximation that one wanted to analyze the [un] prefix in English as combining with adjectives to yield new ones, e.g., unfair, unclear, unsafe. Suppose also that one wanted to be able to build past participles of transitive verbs (passives) into adjectives, so that they could combine with [un], as in uncovered, unbuilt, unseen.</Paragraph>
    <Paragraph position="1"> What we would need would be a rule to combine an &quot;empty&quot; with an [un] to make an [un], then a rule to combine an [un] with a verb stem to form a thing1, and finally a rule to combine a thing1 with a past participle marker to form a negative adjective.</Paragraph>
    <Paragraph position="2"> More rules would be needed for the case where [un] combines with an adjective stem like [fair]. In addition, rules would be needed for irregular passives, etc.</Paragraph>
    <Paragraph position="3"> In short, without a more sophisticated control strategy, the grammar would contain a fair amount of redundancy if one really attempted to handle English morphology in its entirety. However, on a more positive note, the rules do allow one to deal effectively and elegantly with a sufficient range of phenomena to make it quite acceptable as, for instance, an interface between a parser and its lexicon.</Paragraph>
  </Section>
</Paper>