<?xml version="1.0" standalone="yes"?>
<Paper uid="C92-3145">
  <Title>A Freely Available Wide Coverage Morphological Analyzer for English*</Title>
  <Section position="3" start_page="0" end_page="317477" type="metho">
    <SectionTitle>
2 Lexicons for PC-KIMMO
</SectionTitle>
    <Paragraph position="0"> We used the set of morphological rules for English described by Karttunen and Wittenburg (1983). The rules handle the following phenomena, among others: epenthesis, y to i correspondences, s-deletion, elision, i to y correspondences, gemination, and hyphenation. (We refer the reader to Karttunen and Wittenburg (1983) or Antworth (1990) for more details on the morphological rules.) In addition to the set of rules, PC-KIMMO requires lexicons. We derived PC-KIMMO-style lexicons from the 1979 edition of the Collins Dictionary of the English Language. The 90000-odd roots in the lexicon yield over 317000 inflected forms. (Proper nouns were not included in the tables.)</Paragraph>
    <Paragraph position="1"> The lexicons use the following parts of speech: verb (V), pronoun (Pron), preposition (Prep), noun (N), determiner (D), conjunction (Conj), adverb (Adv), and adjective (A). Figure 1 shows the distribution of these parts of speech in the two formats: the first column is the distribution of the root forms in the PC-KIMMO lexicon files, and the second column is the distribution for the inflected forms derived from the lexicons and stored in the database. For each word, the lexicon lists its lexical form, a continuation class, and a parse. The continuation class specifies which inflections the lexical form can undergo. At most, a noun root engenders four inflections (singular, plural, singular genitive, plural genitive); an adjective root, three (base, comparative, superlative); and a verb root, five (infinitive, third-person singular present, simple past, past participle, progressive).</Paragraph>
    <Paragraph position="2"> [Actes de COLING-92, Nantes, 23-28 août 1992, p. 950; Proc. of COLING-92, Nantes, Aug. 23-28, 1992] The exact number generated by any given root depends on its continuation class.</Paragraph>
    <Section position="1" start_page="317477" end_page="317477" type="sub_section">
      <SectionTitle>
2.1 Adjectives
</SectionTitle>
      <Paragraph position="0"> The continuation classes for adjectives specify that the word can undergo the rules of comparative and superlative. For example, the lexicon entry for the adjective 'funky' is:
funky A_Root2 "A(funky)"
The entry consists of a word, funky, followed by the continuation class A_Root2 and a parse "A(funky)". The continuation class specifies that the word can undergo the normal rules of comparative and superlative, and the parse states that the word is an adjective with root 'funky'. The following is a sample run of PC-</Paragraph>
      <Paragraph position="2"> The output line contains the root form and any affixes, separated by '+'s. Thus, a '+' in the output indicates that a morphological rule was used; its absence means that no rule was used and the parse was returned as found in the lexicon. PC-KIMMO will automatically add attributes such as COMP and SUPER to the parse, depending on the morphological rule matched by the surface form. For irregularly inflected forms, however, special continuation classes indicate that the complete parse (viz., part of speech, root, and attributes) should be taken 'as is' from the lexicon entry. For example:
better A_Root1 "A(good) COMP"
best A_Root1 "A(good) SUPER"
good A_Root1 "A(good)"
The class A_Root1 tells PC-KIMMO not to apply the morphological rules to 'better', 'best', and 'good'. Thus, 'gooder' is not recognized as 'good+er'.</Paragraph>
      <Paragraph position="3"> The attributes (such as COMP) can later be translated into feature structures with the help of templates as in PATR (Shieber, 1986). The list of attributes is found in Appendix A.</Paragraph>
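The entry format described in this subsection (lexical form, continuation class, quoted parse) can be read mechanically. The following is a minimal sketch in Python, not part of the original system: the names LexEntry and parse_entry are hypothetical, and the underscore spelling of the class names (A_Root1, A_Root2) is an assumed normalization.

```python
import re
from typing import NamedTuple


class LexEntry(NamedTuple):
    """One lexicon line: lexical form, continuation class, and parse."""
    form: str          # lexical form as listed in the lexicon
    cont_class: str    # continuation class, e.g. A_Root2 (assumed spelling)
    pos: str           # part of speech taken from the parse, e.g. A
    root: str          # root taken from the parse, e.g. good
    attributes: tuple  # attributes listed in the parse, e.g. ('COMP',)


# form, class, then a quoted parse of the shape POS(root) followed by attributes
ENTRY_RE = re.compile(r'^(\S+)\s+(\S+)\s+"(\w+)\s*\(([^)]*)\)\s*([^"]*)"$')


def parse_entry(line):
    """Split one lexicon line into its fields (hypothetical helper)."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a lexicon entry: {line!r}")
    form, cont_class, pos, root, attrs = match.groups()
    return LexEntry(form, cont_class, pos, root, tuple(attrs.split()))


if __name__ == "__main__":
    print(parse_entry('funky A_Root2 "A(funky)"'))
    print(parse_entry('better A_Root1 "A(good) COMP"'))
```

For an irregular form such as 'better', the root and the attribute come straight from the parse, which matches the 'as is' behaviour described for A_Root1.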
    </Section>
    <Section position="2" start_page="317477" end_page="317477" type="sub_section">
      <SectionTitle>
2.2 Nouns
</SectionTitle>
      <Paragraph position="0"> Inflections of nouns, such as the formation of plural and genitive, are handled by morphological rules (unless the formation is idiosyncratic). In the lexicon for nouns, the continuation class N_Root1 indicates that the formation of the genitive applies regularly and that no other inflection applies. The continuation class N_Root2 indicates that the formation of the plural and of the genitive apply regularly.</Paragraph>
      <Paragraph position="1">
mice N_Root1 "N(mouse) PL"
mouse N_Root1 "N(mouse) SG"
ambassador N_Root2 "N(ambassador)"
Thus, the above lexicon entries are recognized as be-</Paragraph>
    </Section>
    <Section position="3" start_page="317477" end_page="317477" type="sub_section">
      <SectionTitle>
2.3 Verbs
</SectionTitle>
      <Paragraph position="0"> Given the infinitive form of a verb, the formation of the third person singular (+s), its past tense (+ed), its past participle (+ed), and its progressive form (+ing) is handled by morphological rules unless lexical idiosyncrasies apply. In order to encode all possible idiosyncrasies over the three verb endings, eight continuation classes are defined (see Figure 2). Each continuation class specifies the inflectional rules which can apply to the given lexical item.</Paragraph>
      <Paragraph position="1">  The attributes WE (for &amp;quot;weak&amp;quot;) and STR (for &amp;quot;strong&amp;quot;) mark whether the verb forms its past tense regularly or irregularly, respectively. The distinction enables unambiguous reference to homographs--words spelled identically but with different semantic and syntactic properties. For example, the verb 'lie' with the meaning 'to make an untrue statement' and the verb 'lie' with the meaning 'to be prostrate' have different syntactic and morphological behavior: the first one is regular, while the second one is irregular: He has lain on the floor.</Paragraph>
      <Paragraph position="2"> He has lied about everything.</Paragraph>
      <Paragraph position="3"> Usually, it suffices to index the syntactic properties of each verb by its root form alone. However, homographs require additional information. In English, the attributes WE and STR are sufficient to distinguish homographs with different morphological behavior.</Paragraph>
      <Paragraph position="5"/>
    </Section>
    <Section position="4" start_page="317477" end_page="317477" type="sub_section">
      <SectionTitle>
2.4 Other Parts of Speech
</SectionTitle>
      <Paragraph position="0"> Pronouns, prepositions, determiners, conjunctions, and adverbs are given continuation classes that inhibit the application of morphological rules. All of the morphological information is stored in the parse in the lexicon  the complete lexicon. Consequently, our large lexicons occupy more than 19 Mbytes of process memory. Further, the large size of the structure implies long search times as PC-KIMMO swaps pages in and out.</Paragraph>
      <Paragraph position="1"> Thus, to solve both the time and space problems simultaneously, we compiled all inflectional forms into a disk-based database using a UNIX hash table facility (Seltzer and Yigit, 1991).</Paragraph>
      <Paragraph position="2"> To compile the database, we used PC-KIMMO as a generator, inputting each root form and all the endings that it could take, as indicated by the continuation class. The resulting inflected form became the key, and the associated morphological information was then inserted into the database.</Paragraph>
      <Paragraph position="3"> For example, the PC-KIMMO lexicon file contains the entry:
saw N_Root2 "N(saw)"
The class N_Root2 indicates that the noun 'saw' forms its plural, singular genitive, and plural genitive regularly. Thus, we send to the generator three lexical forms, one for each inflection's suffix, and extract three inflected surface forms:
Lexical: saw+s, saw+'s, saw+s+'s
Surface: saws, saw's, saws'
The root form of a noun is identical with the singular inflection, so we have a total of four inflected forms. Since we know which suffix we added to the root, we also know the attributes for that inflection. The inflected form becomes the key, while the part of speech, root, and attributes are stored as the content in the database. Hence, the lexicon entry for the noun 'saw' produces four key-content pairs in the database: (saw, saw N SG), (saws, saw N PL), (saw's, saw N SG GEN), (saws', saw N PL GEN).</Paragraph>
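The expansion of the noun 'saw' into four key-content pairs can be illustrated as follows. This is only a sketch under stated assumptions: in the system described here the surface forms come from PC-KIMMO run as a generator, whereas toy_generate below is a hypothetical stand-in that handles just the regular suffixes of this example, and the table N_ROOT2_INFLECTIONS and the class spelling N_Root2 are likewise assumptions.

```python
# Suffix sequences licensed by the class written here as N_Root2, paired with
# the attributes recorded for each inflection (taken from the 'saw' example).
N_ROOT2_INFLECTIONS = [
    ("", "SG"),
    ("+s", "PL"),
    ("+'s", "SG GEN"),
    ("+s+'s", "PL GEN"),
]


def toy_generate(lexical):
    """Map a lexical form to a surface form.

    Hypothetical stand-in for the PC-KIMMO generator: it only drops the
    morpheme boundaries and deletes the genitive s after a plural s.
    """
    plural_genitive = "+s+'s"
    if lexical.endswith(plural_genitive):
        return lexical[: -len(plural_genitive)] + "s'"
    return lexical.replace("+", "")


def noun_pairs(root):
    """Return (key, content) pairs for a regular noun root."""
    return [
        (toy_generate(root + suffix), f"{root} N {attrs}")
        for suffix, attrs in N_ROOT2_INFLECTIONS
    ]


if __name__ == "__main__":
    for key, content in noun_pairs("saw"):
        print(key, "->", content)
```

Because the suffix is chosen before generation, the attributes of each surface form are known without any analysis step, which is the point made in the paragraph above.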
      <Paragraph position="4"> Likewise, the verb lexicon contains the entries:
saw V_Root8 "V(saw)"
saw V_Root1 "V(see) PAST STR"
The continuation class V_Root8 indicates four inflections besides the infinitive: third-person singular (+s), past (+ed), weak past participle (+ed), and present participle (+ing). Hence, the generator produces:
Lexical: saw+s, saw+ed, saw+ing
Surface: saws, sawed, sawing
The class V_Root1 allows no inflections, but builds the inflection-feature pair directly: (saw, see V PAST STR).</Paragraph>
      <Paragraph position="5"> Hence, morphological analysis is reduced to sending the surface forms to the database as keys and retrieving the returned strings. Figure 3 lists the database keys and content strings produced by the three lexicon lines given above. Note that distinct entries are separated by '#'. Since multiple lexical forms can map to the same surface form, the actual number of keys (ca. 292000) is less than the number of lexical forms (ca. 317000). Also, with the database residing on the disk, access times average 6 to 10 milliseconds, which greatly improves upon PC-KIMMO.</Paragraph>
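The lookup scheme just described, with one key per surface form and '#' separating distinct entries, can be sketched as below. An in-memory dict stands in for the disk-based hash table, and the function names build_database and analyze are hypothetical; the pairs are those produced by the noun and verb entries for 'saw', with the attribute spellings of Figure 3.

```python
# Key-content pairs for the three lexicon lines discussed above.
PAIRS = [
    ("saw", "saw N SG"),
    ("saws", "saw N PL"),
    ("saw's", "saw N SG GEN"),
    ("saws'", "saw N PL GEN"),
    ("saw", "saw V INF"),
    ("saws", "saw V 3SG PRES"),
    ("sawed", "saw V PAST WK"),
    ("sawed", "saw V PPART WK"),
    ("sawing", "saw V PROG"),
    ("saw", "see V PAST STR"),
]


def build_database(pairs):
    """Collect contents per key, joining distinct entries with '#'."""
    database = {}
    for key, content in pairs:
        if key in database:
            database[key] = database[key] + "#" + content
        else:
            database[key] = content
    return database


def analyze(database, surface):
    """Return (root, part of speech, attributes) triples for a surface form."""
    content = database.get(surface)
    if content is None:
        return []  # unknown surface form
    return [tuple(entry.split(" ", 2)) for entry in content.split("#")]


if __name__ == "__main__":
    db = build_database(PAIRS)
    print(len(PAIRS), "lexical forms,", len(db), "keys")
    print(analyze(db, "saw"))
```

Ten lexical forms collapse into six keys here, a small-scale instance of the gap between the number of lexical forms and the number of keys noted above.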
    </Section>
    <Section position="5" start_page="317477" end_page="317477" type="sub_section">
      <SectionTitle>
3.1 Implementation Considerations
</SectionTitle>
      <Paragraph position="0"> The large number of keys implies a very large disk file. To reduce the size of the file, we take advantage of the morphological similarity in English between an inflected form and its lexical root form. Indeed, the root is often contained intact within the inflected form.</Paragraph>
    </Section>
    <Section position="6" start_page="317477" end_page="317477" type="sub_section">
      <SectionTitle>
Key Contents
</SectionTitle>
      <Paragraph position="0">
saw: saw N SG#saw V INF#see V PAST STR
saws: saw N PL#saw V 3SG PRES
saw's: saw N SG GEN
sawing: saw V PROG
sawed: saw V PAST WK#saw V PPART WK
saws': saw N PL GEN
of shared characters along with any differing characters, and reassemble the root from the inflected form on each database query. Further, despite the large set of attributes, relatively few combinations (ca. 80) are meaningful, and can be encoded in a single byte. Since a large proportion of roots are wholly contained within the surface form, and since 92% of the keys have one lexical entry, the average content string is only three bytes long. Consequently, the total disk file is under 9 Mbytes. We anticipate further compaction in the near future.</Paragraph>
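The root compression mentioned above, storing a count of shared characters plus any differing characters and rebuilding the root at query time, might look like the following sketch. It assumes that the shared characters are a common prefix of the surface form and the root; the source does not spell out the exact encoding, and encode_root and decode_root are hypothetical names.

```python
def encode_root(surface, root):
    """Return (shared, tail): the number of leading characters the root
    shares with the surface form, and the remaining characters of the root."""
    shared = 0
    limit = min(len(surface), len(root))
    while shared != limit and surface[shared] == root[shared]:
        shared += 1
    return shared, root[shared:]


def decode_root(surface, shared, tail):
    """Reassemble the root from the queried surface form."""
    return surface[:shared] + tail


if __name__ == "__main__":
    for surface, root in [("sawing", "saw"), ("saw", "see"), ("mice", "mouse")]:
        shared, tail = encode_root(surface, root)
        print(surface, root, shared, repr(tail))
        assert decode_root(surface, shared, tail) == root
```

When the root is wholly contained in the surface form, as in 'sawing', the tail is empty and only the count needs to be stored, which is consistent with the short average content string reported above.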
    </Section>
    <Section position="7" start_page="317477" end_page="317477" type="sub_section">
      <SectionTitle>
3.2 Accompanying Utilities
</SectionTitle>
      <Paragraph position="0"> Besides the PC-KIMMO lexicons, we currently maintain the database file and an ASCII-character "flat" version for on-line database browsing. One program converts the lexicons into the database format, while others dump the database into the flat file or reconstruct the database from the flat file. We have also built an X Windows tool to perform maintenance on the database file (see Figure 4). This tool automatically maintains the consistency between the flat file and the database file. We have built hooks in C and Lisp (Lucid 4.0) to access either the database or PC-KIMMO from within a running process.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="317477" end_page="317477" type="metho">
    <SectionTitle>
4 Obtaining the Analyzer
</SectionTitle>
    <Paragraph position="0"> The PC-KIMMO lexicons, the database files, the Lisp and C access functions, programs for converting between formats, and the X Window maintenance tool are available without charge for research purposes. Please send e-mail to zaidel@cis.upenn.edu or write to either Yves Schabes, Martin Zaidel, or Dania Egedi.</Paragraph>
  </Section>
</Paper>