File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/86/c86-1065_intro.xml
Size: 7,155 bytes
Last Modified: 2025-10-06 14:04:31
<?xml version="1.0" standalone="yes"?> <Paper uid="C86-1065"> <Title>A MORPHOLOGICAL RECOGNIZER WITH SYNTACTIC AND PHONOLOGICAL RULES</Title> <Section position="3" start_page="0" end_page="272" type="intro"> <SectionTitle> 2 Orthography </SectionTitle> <Paragraph position="0"> The researchers mentioned above use finite-state transducers for stipulating correspondences between surface segments and underlying segments. 1 I am indebted to Lauri Karttunen and Fernando Pereira for all their help. Lauri supplied the initial English automata on which the orthographic grammar was based, while Fernando furnished some of the Prolog code. Both provided many helpful suggestions and explanations as well. I would also like to thank Kimmo Koskenniemi for his comments on an earlier draft of this paper.</Paragraph> <Paragraph position="1"> This research was supported by the following grants: Naval Electronics</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Systems Command N00039-84-K-0078; Navelex N00039-84-C-0524 P00003; Office of Naval Research N00014-85-C-0013. </SectionTitle> <Paragraph position="0"> In contrast, the system described in this paper does not use finite-state machines. Instead, orthographic rules are interpreted directly, as constraints on pairings of surface strings with lexical strings.</Paragraph> <Paragraph position="1"> The rule notation employed, including conventions for expressing abbreviations, is based on that described in Koskenniemi \[1983,1984\]. The rules actually used in this system are based on the account of English in Karttunen and Wittenburg \[1983\].</Paragraph> </Section> <Section position="2" start_page="0" end_page="272" type="sub_section"> <SectionTitle> 2.1 Rules </SectionTitle> <Paragraph position="0"> What follows is an inductive introduction to the types of rules needed. Some pertinent data will be presented, then some potential rules for handling these data. 
We shall also discuss the reasons for needing a weaker form of rule and indicate what it might look like.</Paragraph> <Paragraph position="1"> Let us first consider some data regarding English /s/ morphemes. Below are presented two possible orthographic rules for describing the foregoing data: R1) + → e / {x | z | y/i | s (h) | c h} _ s R2) + → e / {x | z | y/i | s (h) | c h | o} _ s The first of these rules will be shown to be too weak; the second, in contrast, will be shown to be too strong. This fact will serve as an argument for introducing a second kind of rule. Before describing how the rules should be read, it is necessary to define two technical terms. In phonology, one speaks of underlying segments and surface segments; in orthography, characters making up the words in the lexicon contrast with characters in word forms that occur in texts. The term lexical character will be used here to refer to a character in a word or morpheme in the lexicon, i.e., the analog of a phonological underlying segment. The term surface character will be used to mean a character in a word that could appear in text. For example, \[l o v e + e d\] is a string of lexical characters, while \[l o v e d\] is a string of surface characters.</Paragraph> <Paragraph position="2"> We may now describe how the rules should be read. The first rule should be read roughly as, &quot;a morpheme boundary \[+\] at the lexical level corresponds to an \[e\] at the surface level whenever it is between an \[x\] and an \[s\], or between a \[z\] and an \[s\], or between a lexical \[y\] corresponding to a surface \[i\] and an \[s\], or between an \[s h\] and an \[s\], or between a \[c h\] and an \[s\].&quot; This means, for instance, that the string of lexical characters \[c h u r c h + s\] corresponds to the string of surface characters \[c h u r c h e s\] (forgetting for the moment about the possibility that other rules might also obtain). 
The second rule is identical to the first except for an added \[o\] in the left context.</Paragraph> <Paragraph position="3"> When we say \[+\] corresponds to \[e\] between an \[x\] and an \[s\], we mean between a lexical \[x\] corresponding to a surface \[x\] and a lexical \[s\] corresponding to a surface \[s\]. If we wanted to say that it does not matter what the lexical \[x\] corresponds to on the surface, we would use \[x/=\] instead of just \[x\].</Paragraph> <Paragraph position="4"> The rules given above get the facts right for the words that do not end in \[o\]. For those that do, however, Rule 1 misses on \[do+s\] ↔ \[does\], \[potato+s\] ↔ \[potatoes\]; Rule 2 misses on \[piano+s\] ↔ \[pianos\], \[solo+s\] ↔ \[solos\]. Furthermore, neither rule allows for the possibility of more than one acceptable form, as in \[banjo+s\] ↔ (\[banjoes\] or \[banjos\]), \[cargo+s\] ↔ (\[cargoes\] or \[cargos\]).</Paragraph> <Paragraph position="5"> The words ending in \[o\] can be divided into two classes: those that take an \[es\] in their plural and third-person singular forms, and those that just take an \[s\]. Most of the facts could be described correctly by adopting one of the two rules, e.g., the one stating that words ending in \[o\] take an \[es\] ending. In addition to adopting this rule, one would need to list all the words taking an \[s\] ending as being irregular. This approach has two problems. First, no matter which rule is chosen, a very large number of words would have to be listed in the lexicon; second, this approach does not account for the coexistence of two alternative forms for some words, e.g., \[banjoes\] or \[banjos\].</Paragraph> <Paragraph position="6"> The data and arguments just given suggest the need for a second type of rule. It would stipulate that such and such a correspondence is allowed but not required. 
An example of such a rule is given below: R3) +/e allowed in context o _ s.</Paragraph> <Paragraph position="7"> Rule 3 says that a morpheme boundary may correspond to an \[e\] between an \[o\] and an \[s\]. It also has the effect of saying that if a morpheme boundary ever corresponds to an \[e\], it must be in a context that is explicitly allowed by some rule.</Paragraph> <Paragraph position="8"> If we now have the two rules R1 and R3, R1) + → e / {x | z | y/i | s (h) | c h} _ s R3) +/e allowed in context o _ s, we can generate all the correct forms for the data given. Furthermore, for the words that have two acceptable forms for plural or third-person singular, we get both, just as we would like. The problem is that we generate both forms whether we want them or not. Clearly some sort of restriction on the rules, or &quot;fine tuning,&quot; is in order; for the time being, however, the problem of deriving both forms is not so serious that it cannot be tolerated. So far we have two kinds of rules, those stating that a correspondence always obtains in a certain environment, and those stating that a correspondence is allowed to obtain in some environment. The data below argue for one more type of rule, namely, a rule stipulating that a certain correspondence never obtains in a certain environment.</Paragraph> </Section> </Section> </Paper>
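The reading of rules R1 and R3 in Section 2.1, as constraints on pairings of lexical and surface characters, can be illustrated with a minimal Python sketch. It is not the system's Prolog implementation. The names `pair`, `accepts`, `in_context`, the context tables, and the use of "0" to write a morpheme boundary that has no surface character are all assumptions made for this illustration. Only the morpheme boundary is checked; every other character pair is assumed to be handled by rules not shown here.

```python
# Hypothetical sketch: all names and data structures here are illustrative
# assumptions, not the paper's Prolog implementation.

NULL = "0"  # assumed symbol for "no surface character"

# Left contexts of R1, each written as a sequence of (lexical, surface) pairs.
R1_LEFT = [
    [("x", "x")],
    [("z", "z")],
    [("y", "i")],
    [("s", "s")],
    [("s", "s"), ("h", "h")],
    [("c", "c"), ("h", "h")],
]
# Left context of R3.
R3_LEFT = [[("o", "o")]]
# Right context shared by R1 and R3.
RIGHT = ("s", "s")


def pair(lexical, surface):
    """Align two equal-length strings; write NULL where '+' has no surface character."""
    if len(lexical) != len(surface):
        raise ValueError("strings must be aligned character by character")
    return list(zip(lexical, surface))


def in_context(pairs, i, lefts):
    """True if position i is followed by s/s and preceded by one of the left contexts."""
    if i + 1 >= len(pairs) or pairs[i + 1] != RIGHT:
        return False
    return any(pairs[max(0, i - len(left)):i] == left for left in lefts)


def accepts(pairs):
    """Check every morpheme boundary against R1 (obligatory) and R3 (permissive)."""
    for i, (lex, surf) in enumerate(pairs):
        if lex != "+":
            continue
        if surf not in ("e", NULL):
            return False  # only +/e and +/0 are considered in this sketch
        required = in_context(pairs, i, R1_LEFT)
        allowed = required or in_context(pairs, i, R3_LEFT)
        if surf == "e" and not allowed:
            return False  # +/e must be licensed by some rule
        if surf != "e" and required:
            return False  # R1 demands +/e in this context
    return True


if __name__ == "__main__":
    for lexical, surface in [
        ("church+s", "churches"),
        ("church+s", "church0s"),
        ("piano+s", "piano0s"),
        ("piano+s", "pianoes"),
        ("cat+s", "cates"),
    ]:
        print(lexical, surface, accepts(pair(lexical, surface)))
```

Under these assumptions the sketch accepts \[church+s\] with \[churches\] and rejects it with the boundary deleted, rejects \[cat+s\] with \[cates\], and accepts both pairings for \[piano+s\]. The last case reproduces the overgeneration noted in the text: with only R1 and R3, both forms are derived for every word ending in \[o\].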