<?xml version="1.0" standalone="yes"?>
<Paper uid="C86-1066">
  <Title>A Dictionary and Morphological Analyser for English</Title>
  <Section position="2" start_page="0" end_page="277" type="metho">
    <SectionTitle>
2. Linguistic Assumptions
</SectionTitle>
    <Paragraph position="0"> The grammatical framework underlying the linguistic aspects of the system is that of Generalized Phrase Structure Grammar, as set out in Gazdar et al. (1985).</Paragraph>
    <Paragraph position="1"> Morphological categories employed here correspond to the syntactic categories in that work, and the type of syntactic information present in dictionary entries is intended to facilitate the use of the system as part of a more general GPSG-based program. In developing our prototype, we have adopted many of the proposals made in that work. To that extent, certain assumptions about a correct analysis of English sentence syntax are built into the lexical entries, but this should not preclude adaptation by users to suit different analyses.</Paragraph>
    <Paragraph position="2"> Following what has become a general assumption in syntactic theory, we take the major lexical categories to be partitioned into four classes by the two binary-valued features [±N] and [±V]. The major lexical categories have phrasal projections; these are distinguished from their lexical counterparts by their value for the feature BAR. Lexical categories have the value 0, and phrasal categories (including sentences) have the value 1 or 2.</Paragraph>
    <Paragraph position="3"> Thus, a Noun Phrase is of the category:</Paragraph>
    <Paragraph position="5"> In our analysis, 'bound morphemes', that is to say prefixes and suffixes, are distinguished from others by their BAR specification; the suffix ing is the sole member of the category ((V +) (N -) (VFORM ING) (BAR -1)). As in other GPSG-based work, our analysis encodes the subcategorizational properties of lexical items in the value of a feature SUBCAT. Transitive verbs such as devour are specified as (SUBCAT NP), and intransitives such as elapse as (SUBCAT NULL).</Paragraph>
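    <Paragraph> As a minimal illustration (not part of the original system), the categories discussed above can be modelled in Python as dictionaries of feature-value pairs. The noun phrase category is an assumption reconstructed from the surrounding text, taking nouns to be (N +) (V -) and full phrases to be (BAR 2); the helper names are hypothetical.
```python
# Categories as feature-value dictionaries (illustrative sketch only).

# Assumption: nouns are (N +) (V -), and a full phrase has (BAR 2).
NOUN_PHRASE = {"N": "+", "V": "-", "BAR": 2}

# From the text: the suffix "ing" is the sole member of this category.
ING_SUFFIX = {"V": "+", "N": "-", "VFORM": "ING", "BAR": -1}

# SUBCAT values are from the text; the N, V and BAR values for
# lexical verbs are assumptions made for this sketch.
DEVOUR = {"V": "+", "N": "-", "BAR": 0, "SUBCAT": "NP"}
ELAPSE = {"V": "+", "N": "-", "BAR": 0, "SUBCAT": "NULL"}


def is_bound_morpheme(category):
    """Bound morphemes (prefixes and suffixes) carry BAR -1 in this analysis."""
    return category.get("BAR") == -1


def is_lexical(category):
    """Lexical categories have BAR 0; phrasal categories have BAR 1 or 2."""
    return category.get("BAR") == 0


print(is_bound_morpheme(ING_SUFFIX))  # True
print(is_lexical(DEVOUR))             # True
print(is_lexical(NOUN_PHRASE))        # False
```
    </Paragraph>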
    <Paragraph position="6"> As an example from the current analysis of how the system can operate to produce well-formed words, consider the familiar fact of English morphology that no word may contain more than one inflection. The word grammar must permit both walked and walking, but not walkinged. This is achieved by restricting the distribution of inflectional suffixes so that they attach to non-inflected stems only. A general statement of this type of restriction is made in terms of a feature INFL: stems specified as (INFL +) may take an inflectional suffix, while those specified as (INFL -) may not. The STEM feature described in section 4 provides one means of enforcing correct stem-affix combinations; if the suffixes ed and ing are specified with (STEM ((INFL +))), they will attach only to categories which include the specification (INFL +). Walk, as a regular verb, is so specified; walked and walking are therefore accepted.</Paragraph>
    <Paragraph position="7"> Ed, ing, other inflectional suffixes, and irregular (i.e. uninflectable) words, however, are specified as (INFL -). Our grammar assigns a binary structure to the words in question. In order for this method to prevent e.g. walkinged, the stem walking must also bear the (INFL -) specification. This it does, since we regard suffixes as being the head of a word, and as contributing to the categorial content of the word as a whole. If the INFL specification of the suffix is copied into the mother category, the STEM specification of a further suffix will not be satisfied. See section 4 for more discussion of these matters.</Paragraph>
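    <Paragraph> The blocking of walkinged can be sketched in a few lines of Python. This is a simplified illustration rather than the system's own code: categories are flat dictionaries, the function names are hypothetical, the value (VFORM ED) is an assumed label for the ed suffix, and the mother is built by copying the suffix's features and setting BAR to 0.
```python
def is_extension(category, pattern):
    """True if category contains every feature-value pair in pattern."""
    return all(category.get(f) == v for f, v in pattern.items())


def attach_suffix(stem, suffix):
    """Combine stem and suffix, or return None if the STEM requirement fails.

    The suffix is treated as the head: the mother copies its features,
    apart from STEM and BAR. BAR is set to 0 for the resulting word
    (a simplification made for this sketch).
    """
    if not is_extension(stem, suffix.get("STEM", {})):
        return None
    mother = {f: v for f, v in suffix.items() if f not in ("STEM", "BAR")}
    mother["BAR"] = 0
    return mother


WALK = {"V": "+", "N": "-", "BAR": 0, "INFL": "+"}
ING = {"V": "+", "N": "-", "VFORM": "ING", "BAR": -1, "INFL": "-",
       "STEM": {"INFL": "+"}}
# VFORM ED is an assumed value, used here only for illustration.
ED = {"V": "+", "N": "-", "VFORM": "ED", "BAR": -1, "INFL": "-",
      "STEM": {"INFL": "+"}}

walking = attach_suffix(WALK, ING)
print(walking["INFL"])             # -
print(attach_suffix(walking, ED))  # None: "walkinged" is rejected
```
    Because walking inherits (INFL -) from its suffix, it no longer satisfies the STEM requirement of ed.</Paragraph>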
  </Section>
  <Section position="3" start_page="277" end_page="278" type="metho">
    <SectionTitle>
3. The Lexicon
</SectionTitle>
    <Paragraph position="0"> The lexicon itself consists of a sequence of entries, each in the form of a Lisp s-expression. An entry has five elements: (i) and (ii) the head word, in its written form and in a phonological transcription, (iii) a 'syntactic field', (iv) a 'semantic field', and (v) a 'user field'. The semantic field has been provided as a facility for users, and any Lisp s-expression can be inserted here. No significant semantic information is present in our entries, beyond the fact that e.g. better and best are related in meaning to good.</Paragraph>
    <Paragraph position="1"> Similarly, the user field is unexploited, being occupied in all cases by the atom 'nil'. It serves primarily as a place-holder, in that, while it is desirable to maintain the possibility for users to include in an entry whatever additional information they desire, the form which that information might take in practice is clearly not predictable. The syntax field consists of a syntactic category, as defined by Gazdar et al. (1985), i.e. a set of feature-value pairs. Some of these are relevant only to the workings of the word grammar, and may thus be ignored by other components in an integrated natural language processing system. Their purpose is to control the distribution of morphemes in complex words, as described in the following section.</Paragraph>
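    <Paragraph> For readers more familiar with Python than Lisp, the five-element entry can be pictured as a named tuple. This is only an analogy: the field names, the placeholder transcription, and the feature values chosen for better are assumptions, not entries from the actual lexicon.
```python
from typing import Any, NamedTuple


class Entry(NamedTuple):
    """One lexicon entry with the five elements listed in the text."""
    citation: str     # head word in its written form
    phonology: str    # phonological transcription (placeholder string here)
    syntax: dict      # syntactic category: a set of feature-value pairs
    semantics: Any    # free-form field; any expression may be stored
    user: Any = None  # unexploited in the prototype (the atom 'nil')


# Hypothetical entry: feature values and transcription are illustrative.
better = Entry(
    "better",
    "PHON-PLACEHOLDER",
    {"V": "+", "N": "+", "AFORM": "ER"},
    "good",
)

print(better.citation, better.semantics, better.user)  # better good None
```
    </Paragraph>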
    <Paragraph position="2"> The content of a syntax field is often at least partially predictable. This fact allows us to employ, as an aid to users wishing to write their own dictionary, rules which add information to the lexicon during the compilation process. Recall that, in our analysis of English, the inflectability of a word is governed by the value in that word's category for INFL. Completion Rules (CRs) can be written that will add the specification (INFL -) to any entry already including (PLU +) (for e.g. men), (AFORM ER) (for e.g. worse), (VFORM ING), etc., thus removing the need to state individually that a given word cannot be inflected.</Paragraph>
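    <Paragraph> The effect of such Completion Rules can be sketched as follows. The rule representation (a trigger pattern paired with features to add) and the policy of never overwriting an existing specification are assumptions made for this illustration; the original rule notation is not shown in this section.
```python
# Each hypothetical completion rule pairs a trigger pattern with
# the feature specifications to be added when the pattern matches.
COMPLETION_RULES = [
    ({"PLU": "+"}, {"INFL": "-"}),     # e.g. men
    ({"AFORM": "ER"}, {"INFL": "-"}),  # e.g. worse
    ({"VFORM": "ING"}, {"INFL": "-"}),
]


def apply_completion_rules(syntax, rules=COMPLETION_RULES):
    """Return a copy of syntax with features added by every matching rule.

    Existing specifications are never overwritten (an assumption).
    """
    completed = dict(syntax)
    for pattern, additions in rules:
        if all(completed.get(f) == v for f, v in pattern.items()):
            for feature, value in additions.items():
                completed.setdefault(feature, value)
    return completed


men = apply_completion_rules({"N": "+", "V": "-", "PLU": "+"})
print(men["INFL"])  # -
```
    </Paragraph>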
    <Paragraph position="3"> A second means of reducing the amount of preparatory work is provided in the form of Multiplication Rules (MRs). Whereas CRs add further specifications to a single entry, MRs have the effect of increasing the number of entries in some principled way. One application of MRs is to express the fact that nouns and adjectives do not subcategorize for obligatory complements. An MR can be written which, for each entry containing the specification (N +) and some non-NULL value for SUBCAT, produces a copy of that entry where the SUBCAT specification is replaced by (SUBCAT NULL).</Paragraph>
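    <Paragraph> The Multiplication Rule just described can be illustrated with a short Python function. Entries are reduced to flat dictionaries; the WORD key, the sample entries, and the SUBCAT value PP are hypothetical and serve only to show the copying behaviour.
```python
def multiply_null_subcat(entries):
    """For each (N +) entry with a non-NULL SUBCAT, add a (SUBCAT NULL) copy."""
    result = []
    for entry in entries:
        result.append(entry)
        subcat = entry.get("SUBCAT")
        if entry.get("N") == "+" and subcat is not None and subcat != "NULL":
            copy = dict(entry)
            copy["SUBCAT"] = "NULL"
            result.append(copy)
    return result


lexicon = [
    {"WORD": "fond", "N": "+", "V": "+", "SUBCAT": "PP"},    # hypothetical
    {"WORD": "devour", "N": "-", "V": "+", "SUBCAT": "NP"},
]
expanded = multiply_null_subcat(lexicon)
print(len(expanded))                    # 3
print([e["SUBCAT"] for e in expanded])  # ['PP', 'NULL', 'NP']
```
    The verb entry is left alone because it is (N -); only the adjective gains a second entry.</Paragraph>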
    <Paragraph position="4"> The lexicon compiles into two files, one holding morphemes stored in a tree-shaped structure (cf. Thorne et al. (1968)), and the other holding the expanded entries relating to them. The compilation of a lexicon can take a considerable amount of time; our prototype incorporates a lexicon with approximately 3500 entries, which compiles in approximately ninety minutes.</Paragraph>
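    <Paragraph> A tree-shaped morpheme store of this general kind can be sketched as a letter tree built from nested dictionaries. This is an assumed, simplified design for illustration, not the file format actually produced by the compiler; the function names and the end marker are hypothetical.
```python
END = "$"  # hypothetical marker for "a morpheme ends at this node"


def build_letter_tree(morphemes):
    """Store morphemes in a nested-dictionary letter tree."""
    root = {}
    for morpheme in morphemes:
        node = root
        for letter in morpheme:
            node = node.setdefault(letter, {})
        node[END] = morpheme
    return root


def morphemes_at_start(tree, text):
    """Return every stored morpheme that begins text, shortest first."""
    found, node = [], tree
    for letter in text:
        if letter not in node:
            break
        node = node[letter]
        if END in node:
            found.append(node[END])
    return found


tree = build_letter_tree(["be", "bed", "bedroom", "ed", "ing"])
print(morphemes_at_start(tree, "bedrooms"))  # ['be', 'bed', 'bedroom']
```
    A single left-to-right pass over the input finds all candidate morphemes sharing a common beginning, which is the usual motivation for this kind of structure.</Paragraph>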
  </Section>
  <Section position="4" start_page="278" end_page="278" type="metho">
    <SectionTitle>
4. The Word Grammar
</SectionTitle>
    <Paragraph position="0"> The internal structure of words is handled by a unification feature grammar with rules of the form: mother -&gt; daughter1 daughter2 ...</Paragraph>
    <Paragraph position="1"> where 'mother', 'daughter1', etc. are categories. A rule which adds the plural morpheme to a noun might be given as shown below:</Paragraph>
    <Paragraph position="3"> The system provides two methods of writing rules in a more general form: variables and feature-passing conventions. In our grammar, the category and inflectability of a suffixed word are determined by the category and inflectability of the suffix; in the rule below, ALPHA, BETA, and GAMMA are variables ranging over the set of values {+, -}:</Paragraph>
    <Paragraph position="5"> Since variables are interpreted consistently throughout a rule, the mother category and suffix will be identical in their specifications for N, V and INFL.</Paragraph>
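    <Paragraph> The consistent interpretation of variables can be illustrated with a small matcher. The rule patterns below are a hypothetical reconstruction of the general shape described in the text (mother and suffix sharing N, V and INFL), not the rule as originally printed, and the sample suffix category is an assumption.
```python
VARIABLES = {"ALPHA", "BETA", "GAMMA"}


def match(pattern, category, bindings):
    """Match category against pattern, extending bindings; None on failure."""
    bindings = dict(bindings)
    for feature, wanted in pattern.items():
        if feature not in category:
            return None
        actual = category[feature]
        if wanted in VARIABLES:
            # A variable must take the same value everywhere in the rule.
            if bindings.setdefault(wanted, actual) != actual:
                return None
        elif wanted != actual:
            return None
    return bindings


def instantiate(pattern, bindings):
    """Replace variables in pattern by their bound values."""
    return {f: bindings.get(v, v) for f, v in pattern.items()}


# Hypothetical rule shape: mother and suffix agree on N, V and INFL.
MOTHER = {"N": "ALPHA", "V": "BETA", "INFL": "GAMMA", "BAR": 0}
SUFFIX = {"N": "ALPHA", "V": "BETA", "INFL": "GAMMA", "BAR": -1}

ity = {"N": "+", "V": "-", "INFL": "+", "BAR": -1}  # assumed category
bound = match(SUFFIX, ity, {})
print(instantiate(MOTHER, bound))
# {'N': '+', 'V': '-', 'INFL': '+', 'BAR': 0}
```
    </Paragraph>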
    <Paragraph position="6"> As an alternative to variables, feature-passing conventions are also available. These relate categories in what Gazdar et al. (1985) term 'local trees', i.e. sections of morphological structure consisting of a mother category and all of its immediate daughters. The conventions refer to 'pre-instantiation' features; these are features present in the categories mentioned in the relevant rule.</Paragraph>
    <Paragraph position="7"> 'Extension' and 'unification' are meant in the sense of Gazdar et al. (1985), q.v.</Paragraph>
    <Paragraph position="8"> The Word-Head Convention: After instantiation, the set of WHead features in the mother is the unification of the pre-instantiation WHead features of the Mother with the pre-instantiation WHead features of the Rightdaughter.</Paragraph>
    <Paragraph position="9"> This convention is analogous to the simplest case of the Head Feature Convention in Gazdar et al. (1985).</Paragraph>
    <Paragraph position="10"> Although there is no formal notion of 'head' in the system, this convention embodies the implicit claim that the head in a local tree is always the right daughter. If the daughters are a prefix and a stem (as in e.g. re-apply), the WHead features of the stem are passed up to the mother. Features encoding morphosyntactic category can be declared as members of the WHead set, and re-apply is then of the same category as, and shares various sentence-level syntactic properties with, apply. If the daughters are a stem and a suffix, the category of the mother is determined not by the stem, but rather by the suffix. For example, possible and ity may be combined to form possibility, whose 'nouniness' is due to the category of the suffix.</Paragraph>
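    <Paragraph> A minimal sketch of the Word-Head Convention follows, assuming flat feature dictionaries and a simple clash-detecting unification. The choice of N and V as the WHead set and the sample categories are assumptions; users of the system declare the set themselves.
```python
def unify(a, b):
    """Unify two feature dictionaries; None if they clash on any feature."""
    result = dict(a)
    for feature, value in b.items():
        if result.setdefault(feature, value) != value:
            return None
    return result


def word_head(mother, right_daughter, whead):
    """Unify the mother's WHead features with the right daughter's."""
    own = {f: v for f, v in mother.items() if f in whead}
    inherited = {f: v for f, v in right_daughter.items() if f in whead}
    merged = unify(own, inherited)
    if merged is None:
        return None
    return {**mother, **merged}


WHEAD = {"N", "V"}                      # assumed user declaration
ITY = {"N": "+", "V": "-", "BAR": -1}   # hypothetical suffix category
APPLY = {"N": "-", "V": "+", "BAR": 0}  # hypothetical stem category

print(word_head({"BAR": 0}, ITY, WHEAD))    # possible + ity: a noun
print(word_head({"BAR": 0}, APPLY, WHEAD))  # re + apply: a verb
```
    In both cases the mother takes its category from the right daughter, whether that daughter is a suffix or a stem.</Paragraph>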
    <Paragraph position="11"> The Word-Daughter Convention: (a) If any WDaughter features exist on the Rightdaughter then the WDaughter features on the Mother are the unification of the pre-instantiation WDaughter features on the Mother with the pre-instantiation WDaughter features on the Rightdaughter.</Paragraph>
    <Paragraph position="12"> (b) If no WDaughter features exist on the Rightdaughter then the WDaughter features on the Mother are the unification of the pre-instantiation WDaughter features on the Mother with the pre-instantiation WDaughter features on the Leftdaughter.</Paragraph>
    <Paragraph position="13"> The subcategorization class of a word remains constant under inflection, but is likely to be changed by the attachment of a derivational suffix. Moreover, the subcategorization of a prefixed word is the same as that of its stem. The WDaughter convention is designed to reflect these facts by enforcing a feature correspondence between one of the daughters and the mother. When the feature set WDaughter is defined as including the subcategorization feature SUBCAT, the convention results in configurations such as:</Paragraph>
    <Paragraph position="15"> which show the relevant feature specifications in local trees arising from suffixation of an adjective with +ize to produce a transitive verb and suffixation of a transitive verb with +ing to produce a present participle.</Paragraph>
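    <Paragraph> The two cases just mentioned can be reproduced with a small sketch of the Word-Daughter Convention. Categories are flat dictionaries, the function names are hypothetical, and the sample categories (including the adjective modern and its SUBCAT NULL value) are assumptions chosen to match the description in the text.
```python
def unify(a, b):
    """Unify two feature dictionaries; None if they clash on any feature."""
    result = dict(a)
    for feature, value in b.items():
        if result.setdefault(feature, value) != value:
            return None
    return result


def word_daughter(mother, left, right, wdaughter):
    """Pass WDaughter features up from the right daughter if it has any
    (clause a), otherwise from the left daughter (clause b)."""
    def pick(category):
        return {f: v for f, v in category.items() if f in wdaughter}

    source = pick(right) or pick(left)
    merged = unify(pick(mother), source)
    return None if merged is None else {**mother, **merged}


WDAUGHTER = {"SUBCAT"}
MODERN = {"N": "+", "V": "+", "SUBCAT": "NULL"}  # hypothetical adjective
IZE = {"N": "-", "V": "+", "SUBCAT": "NP"}       # suffix carries SUBCAT
DEVOUR = {"N": "-", "V": "+", "SUBCAT": "NP"}
ING = {"N": "-", "V": "+", "VFORM": "ING"}       # suffix has no SUBCAT

print(word_daughter({}, MODERN, IZE, WDAUGHTER))  # {'SUBCAT': 'NP'}
print(word_daughter({}, DEVOUR, ING, WDAUGHTER))  # {'SUBCAT': 'NP'}
```
    In the first call the value comes from the derivational suffix; in the second the inflectional suffix has no SUBCAT, so the stem's value survives.</Paragraph>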
    <Paragraph position="16"> The Word-Sister Convention: When one daughter is specified for STEM, the category of the other daughter must be an extension of the value of STEM.</Paragraph>
    <Paragraph position="17"> The purpose of this third convention is to allow the subcategorization of affixes with respect to the type of stem they may attach to. The behaviour of affixes that attach to more than one category can be handled naturally by giving them a suitable specification for STEM. If it is desired to have anti- attached to both nouns and adjectives, for example, the specification (STEM ((N +))) will have that effect, since both adjectives and nouns are extensions of the category ((N +)).</Paragraph>
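    <Paragraph> The anti- example can be checked with a short sketch of the Word-Sister Convention. 'Extension' is simplified here to containment of flat feature-value pairs; the function names and the N and V values assigned to nouns, adjectives and verbs are assumptions.
```python
def is_extension(category, pattern):
    """True if category contains every feature-value pair in pattern."""
    return all(category.get(f) == v for f, v in pattern.items())


def word_sister_ok(left, right):
    """If one daughter is specified for STEM, the other must extend it."""
    if "STEM" in left and not is_extension(right, left["STEM"]):
        return False
    if "STEM" in right and not is_extension(left, right["STEM"]):
        return False
    return True


ANTI = {"BAR": -1, "STEM": {"N": "+"}}
NOUN = {"N": "+", "V": "-", "BAR": 0}
ADJECTIVE = {"N": "+", "V": "+", "BAR": 0}
VERB = {"N": "-", "V": "+", "BAR": 0}

print([word_sister_ok(ANTI, c) for c in (NOUN, ADJECTIVE, VERB)])
# [True, True, False]
```
    </Paragraph>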
    <Paragraph position="18"> The user can define the sets WHead and WDaughter as he wishes, or, by leaving them undefined, avoid their effects altogether. The feature STEM is built in, and need not be defined. The effects of the Word-Sister Convention can be modified by changing the STEM specifications in the lexical entries, and avoided by omitting them.</Paragraph>
  </Section>
  <Section position="5" start_page="278" end_page="278" type="metho">
    <SectionTitle>
5. The Spelling Rules
</SectionTitle>
    <Paragraph position="0"> The rules are based on the work of Koskenniemi (1983a, 1983b; Karttunen 1983), though their application here is solely to the question of 'morphographemics'; the more general morphological effects of Koskenniemi's rules are produced differently. The current version of the system contains a compiler allowing the rules to be written in a high level notation based on Koskenniemi (1985). Any number of spelling rules can be employed, though our system has fifteen. They are compiled during the general dictionary pre-processing stage into deterministic finite state transducers, of which one tape represents the lexical form and the other the surface form.</Paragraph>
    <Paragraph position="1"> The following rule describes the process by which an additional e is inserted when some nouns are suffixed with the plural morpheme +s: Epenthesis +:e &lt;=&gt; { &lt; s:s h:h &gt; s:s x:x z:z } --- s:s or &lt; c:c h:h &gt; --- s:s The epenthesis rule states that e must be inserted at a morpheme boundary if and only if the boundary has to its left sh, s, x, z or ch and to its right s. The interpretation of the rule is simple; the character pair ('lexical character:surface character') to the left of the arrow specifies the change that takes place between the contexts (again stated in character pairs) given to the right of the arrow. Braces ('{','}') indicate disjunction and angled brackets indicate a sequence. Alternative contexts may be specified using the word 'or'. Lexical and surface strings of unequal length can be matched by using the null character '0', and special characters may be defined and used in rules, for example to cover the set of alphabetic characters representing vowels.</Paragraph>
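    <Paragraph> The behaviour described by the epenthesis rule can be imitated with a direct string function. This is not a finite state transducer and not the compiled form used by the system; it only reproduces the stated contexts. Realising every other boundary as nothing is an assumption about the default treatment of '+'.
```python
LEFT_CONTEXTS = ("sh", "s", "x", "z", "ch")


def surface_form(lexical):
    """Map a lexical string with '+' boundaries to a surface string.

    '+' is realised as 'e' when it stands between one of LEFT_CONTEXTS
    and a following 's' (the epenthesis rule). Any other '+' is realised
    as nothing, which is an assumption made for this sketch.
    """
    out = []
    for i, char in enumerate(lexical):
        if char != "+":
            out.append(char)
            continue
        left, right = lexical[:i], lexical[i + 1:]
        if left.endswith(LEFT_CONTEXTS) and right.startswith("s"):
            out.append("e")
    return "".join(out)


for word in ("fox+s", "church+s", "cat+s", "walk+ing"):
    print(word, "->", surface_form(word))
# fox+s -> foxes
# church+s -> churches
# cat+s -> cats
# walk+ing -> walking
```
    </Paragraph>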
    <Paragraph position="2"> The spelling rules are able to match any pair of character strings. It would for example be possible to analyse the suppletive went as a surface form corresponding to the lexical form go+ed. In this case, four rules would be needed to effect the change, and a better solution is to list went separately in the lexicon. In practice, the choice between treating this type of alternation dynamically, with morphological and spelling rules, and statically, by exploiting the lexicon directly, depends on the user's idea of which is the more elegant solution. While elegance may be in the eye of the beholder, computational efficiency is unfortunately not.</Paragraph>
    <Paragraph position="3"> It will generally be more efficient to list a word in the lexicon than to add spelling or morphological rules specific to a small number of cases.</Paragraph>
  </Section>
</Paper>