<?xml version="1.0" standalone="yes"?>
<Paper uid="P96-1005">
  <Title>From Submit to Submitted via Submission: On Lexical Rules in Large-Scale Lexicon Acquisition.</Title>
  <Section position="3" start_page="0" end_page="33" type="metho">
    <SectionTitle>
2 Nature of Lexical Rules
2.1 Ontological-Semantic Background
</SectionTitle>
    <Paragraph position="0"> Our approach to NLP can be characterized as ontology-driven semantics (see, e.g., (Nirenburg and Levin, 1992)). The lexicon for which our LRs are introduced is intended to support the computational specification and use of text meaning representations. The lexical entries are quite complex, as they must contain many different types of lexical knowledge that may be used by specialist processes for automatic text analysis or generation (see, e.g., (Onyshkevych and Nirenburg, 1995), for a detailed description). The acquisition of such a lexicon, with or without the assistance of LRs, involves a substantial investment of time and resources. The meaning of a lexical entry is encoded in a (lexical) semantic representation language (see, e.g., (Nirenburg et al., 1992)) whose primitives are predominantly terms in an independently motivated world model, or ontology (see, e.g., (Carlson and Nirenburg, 1990) and (Mahesh and Nirenburg, 1995)).</Paragraph>
    <Paragraph position="1"> The basic unit of the lexicon is a 'superentry,' one for each citation form, irrespective of its lexical class. Word senses are called 'entries.' The LR processor applies to all the word senses for a given superentry. For example, pronunciar has (at least) two entries (one could be translated as "articulate" and one as "declare"); the LR generator, when applied to the superentry, would produce (among others) two forms of pronunciación, one derived from each of those two senses/entries.</Paragraph>
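The superentry/entry organization described above can be sketched as a small data structure. This is only an illustrative sketch: the class names, field names, sense identifiers, and the toy nominalization rule are assumptions made for the example, not the format of the lexicon described in this paper.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    """One word sense of a citation form (field names are hypothetical)."""
    sense_id: str
    category: str
    gloss: str
    derived_from: Optional[str] = None


@dataclass
class Superentry:
    """All senses that share one citation form."""
    citation_form: str
    entries: list = field(default_factory=list)


def nominalize(superentry, derived_form):
    """Toy LR: derive one nominal sense from every verbal sense."""
    derived = Superentry(derived_form)
    for number, entry in enumerate(superentry.entries, start=1):
        if entry.category != "v":
            continue  # this toy rule only accepts verbal senses
        derived.entries.append(Entry(
            sense_id=f"{derived_form}-n{number}",
            category="n",
            gloss=f"act of: {entry.gloss}",
            derived_from=entry.sense_id,
        ))
    return derived


pronunciar = Superentry("pronunciar", [
    Entry("pronunciar-v1", "v", "articulate"),
    Entry("pronunciar-v2", "v", "declare"),
])
pronunciacion = nominalize(pronunciar, "pronunciación")
for entry in pronunciacion.entries:
    print(entry.sense_id, "from", entry.derived_from)
```

Because the rule iterates over every sense of the superentry, each verbal sense yields its own derived nominal sense, which mirrors the pronunciar example.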
    <Paragraph position="2"> The nature of the links in the lexicon to the ontology is critical to the entire issue of LRs. Representations of lexical meaning may be defined in terms of any number of ontological primitives, called concepts. Any of the concepts in the ontology may be used (singly or in combination) in a lexical meaning representation.</Paragraph>
    <Paragraph position="3"> No necessary correlation is expected between syntactic category and properties and semantic or ontological classification and properties (here we definitely part company with syntax-driven semantics, see, for example, (Levin, 1992), (Dorr, 1993), pretty much along the lines established in (Nirenburg and Levin, 1992)). For example, although meanings of many verbs are represented through reference to ontological EVENTs and a number of nouns are represented by concepts from the OBJECT sublattice, frequently nominal meanings refer to EVENTs and verbal meanings to OBJECTs. Many LRs produce entries in which the syntactic category of the input form is changed; however, in our model, the semantic category is preserved in many of these LRs. For example, the verb destroy may be represented by an EVENT, as will the noun destruction (naturally, with a different linking in the syntax-semantics interface). Similarly, destroyer (as a person) would be represented using the same event with the addition of a HUMAN as a filler of the agent case role. This built-in transcategoriality strongly facilitates applications such as interlingual MT, as it renders vacuous many problems connected with category mismatches (Kameyama et al., 1991) and misalignments or divergences (Dorr, 1995), (Held, 1993) that plague those paradigms in MT which do not rely on extracting language-neutral text meaning representations. This transcategoriality is supported by LRs.</Paragraph>
    <Section position="1" start_page="32" end_page="33" type="sub_section">
      <SectionTitle>
2.2 Approaches to LRs and Their Types
</SectionTitle>
      <Paragraph position="0"> In reviewing the theoretical and computational linguistics literature on LRs, one notices a number of different delimitations of LRs from morphology, syntax, lexicon, and processing. Below we list three parameters which highlight the possible differences among approaches to LRs.</Paragraph>
      <Paragraph position="1"> Depending on the paradigm or approach, there are phenomena which may be more or less appropriate for treatment by LRs than by syntactic transformations, lexical enumeration, or other mechanisms. LRs offer greater generality and productivity at the expense of overgeneration, i.e., suggesting inappropriate forms which need to be weeded out before actual inclusion in a lexicon. The following phenomena seem to be appropriate for treatment with LRs: * Inflected Forms - Specifically, those inflectional phenomena which accompany changes in subcategorization frame (passivization, dative alternation, etc.).</Paragraph>
      <Paragraph position="2"> * Word Formation - The production of derived forms by LR is illustrated in a case study below, and includes formation of deverbal nominals (destruction, running) and agentive nouns (catcher). Typically involving a shift in syntactic category, these LRs are often less productive than inflection-oriented ones. Consequently, derivational LRs are even more prone to overgeneration than inflectional LRs.</Paragraph>
      <Paragraph position="3"> * Regular Polysemy - This set of phenomena includes regular polysemies or regular non-metaphoric and non-metonymic alternations such as those described in (Apresjan, 1974), (Pustejovsky, 1991, 1995), (Ostler and Atkins, 1992) and others.</Paragraph>
      <Paragraph position="4"> Once LRs are defined in a computational scenario, a decision is required about the time of application of those rules. In a particular system, LRs can be applied at acquisition time, at lexicon load time, or at run time.</Paragraph>
      <Paragraph position="5"> * Acquisition Time - The major advantage of this strategy is that the results of any LR expansion can be checked by the lexicon acquirer, though at the cost of substantial additional time. Even with the best left-hand side (LHS) conditions (see below), the lexicon acquirer may be flooded by new lexical entries to validate. During the review process, the lexicographer can accept the generated form, reject it as inappropriate, or make minor modifications. If the LR is being used to build the lexicon up from scratch, then mechanisms used by (Ostler and Atkins, 1992) or (Briscoe et al., 1995), such as blocking or preemption, are not available as automatic mechanisms for avoiding overgeneration. * Lexicon Load Time - The LRs can be applied to the base lexicon at the time the lexicon is loaded into the computational system. As with run-time loading, the risk is that overgeneration will cause more degradation in accuracy than the missing (derived) forms would if the LRs were not applied in the first place. If the LR inventory approach is used or if the LHS constraints are very good (see below), then the overgeneration penalty is minimized, and the advantage of a large run-time lexicon is combined with efficiency in look-up and disk savings.</Paragraph>
      <Paragraph position="6"> * Run Time - Application of LRs at run time raises additional difficulties by not supporting an index of all the head forms to be used by the syntactic and semantic processes. For example, if there is an LR which produces abusive-adj2 from abuse-v1, the adjectival form will be unknown to the syntactic parser, and its production would only be triggered by failure recovery mechanisms, that is, if direct lookup failed and the reverse morphological process identified abuse-v1 as a potential source of the entry needed. A hybrid scenario of LR use is also plausible, where, for example, LRs apply at acquisition time to produce new lexical entries, but may also be available at run time as an error recovery strategy to attempt generation of a form or word sense not already found in the lexicon.</Paragraph>
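The hybrid scenario just described, with direct lookup first and LRs used only as a recovery step, can be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the rule triple (suffix, rule name, restored ending), the rule label, and the toy lexicon are all hypothetical and do not reproduce the reverse morphological process of any actual system.

```python
def lookup(form, lexicon, run_time_rules):
    """Return (entry, source); LRs are tried only after direct lookup fails."""
    if form in lexicon:
        return lexicon[form], "lexicon"
    # Failure recovery: strip a suffix, restore the base ending, retry lookup.
    for suffix, rule_name, restored_ending in run_time_rules:
        if form.endswith(suffix):
            base = form[: len(form) - len(suffix)] + restored_ending
            if base in lexicon:
                return {"derived_from": base, "rule": rule_name}, "run-time LR"
    return None, "unknown"


# Toy data: one base entry and one hypothetical adjective-forming rule.
lexicon = {"abuse": {"sense": "abuse-v1"}}
rules = [("ive", "LR-adjective (hypothetical)", "e")]

print(lookup("abuse", lexicon, rules))
print(lookup("abusive", lexicon, rules))
```

In this sketch the derived form is never indexed, so a parser would only see it after the fallback fires, which is the difficulty noted for pure run-time application.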
      <Paragraph position="7"> For any of the LR application opportunities itemized above, a methodology needs to be developed for the selection of the subset of LRs which are applicable to a given lexical entry (whether base or derived). Otherwise, the LRs will grossly overgenerate, resulting in inappropriate entries, computational inefficiency, and degradation of accuracy. Two approaches suggest themselves.</Paragraph>
      <Paragraph position="8"> * LR Itemization - The simplest mechanism of rule triggering is to include in each lexicon entry an explicit list of applicable rules. LR application can be chained, so that the rule chains are expanded, either statically, in the specification, or dynamically, at application time. This approach avoids any inappropriate application of the rules (overgeneration), though at the expense of tedious work at lexicon acquisition time. One drawback of this strategy is that if a new LR is added, each lexical entry needs to be revisited and possibly updated.</Paragraph>
      <Paragraph position="9"> * Rule LHS Constraints - The other approach is to maintain a bank of LRs, and rely on their LHSs to constrain the application of the rules to only the appropriate cases; in practice, however, it is difficult to set up the constraints in such a way as to avoid over- or undergeneration a priori. Additionally, this approach (at least, when applied after acquisition time) does not allow explicit ordering of word senses, a practice preferred by many lexicographers to indicate relative frequency or salience; this sort of information can be captured by other mechanisms (e.g., using frequency-of-occurrence statistics). This approach does, however, capture the paradigmatic generalization that is represented by the rule, and simplifies lexical acquisition.</Paragraph>
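The contrast between the two triggering approaches can be made concrete with a short sketch. The entry layout, the rule names, and the LHS conditions below are assumptions chosen for illustration, not rules from the actual LR bank.

```python
def applicable_by_itemization(entry):
    """Itemization: the entry itself lists the rules that may fire."""
    return list(entry.get("rules", []))


def applicable_by_lhs(entry, rule_bank):
    """LHS constraints: every rule whose condition accepts the entry fires."""
    return [name for name, condition in rule_bank.items() if condition(entry)]


# Hypothetical rule bank: each rule is paired with its LHS condition.
rule_bank = {
    "LR-event": lambda e: e["category"] == "v",
    "LR-agent-of": lambda e: e["category"] == "v" and "agent" in e["roles"],
}

# Hypothetical entry that itemizes only one rule.
comprar = {
    "form": "comprar",
    "category": "v",
    "roles": ["agent", "theme"],
    "rules": ["LR-event"],
}

print(applicable_by_itemization(comprar))
print(applicable_by_lhs(comprar, rule_bank))
```

Adding a rule to the bank changes the LHS result for every matching entry at once, whereas the itemized list stays the same until each entry is revisited; that is the trade-off stated in the two items above.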
    </Section>
  </Section>
  <Section position="4" start_page="33" end_page="35" type="metho">
    <SectionTitle>
3 Morpho-Semantics and
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="33" end_page="35" type="sub_section">
      <SectionTitle>
Constructive Derivational
Morphology: a Transcategorial
Approach to Lexical Rules
</SectionTitle>
      <Paragraph position="0"> In this section, we present a case study of LRs based on constructive derivational morphology. Such LRs automatically produce word forms which are polysemous, such as the Spanish generador 'generator,' either the artifact or someone who generates. The LRs have been tested in a real world application, involving the semi-automatic acquisition of a Spanish computational lexicon of about 35,000 word senses.</Paragraph>
      <Paragraph position="1"> We accelerated the process of lexical acquisition (footnote 1) by developing morpho-semantic LRs which, when applied to a lexeme, produced an average of 25 new candidate entries. Figure 1 below illustrates the overall process of generating new entries from a citation form, by applying morpho-semantic LRs.</Paragraph>
      <Paragraph position="2"> Generation of new entries usually starts with verbs. Each verb found in the corpora is submitted to the morpho-semantic generator, which produces all its morphological derivations and, based on a detailed set of tested heuristics, attaches to each form an appropriate semantic LR label. For instance, the nominal form comprador will be among the ones generated from the verb comprar, and the semantic LR "agent-of" is attached to it. The mechanism of rule application is illustrated below.</Paragraph>
      <Paragraph position="3"> The form list generated by the morpho-semantic generator is checked against three MRDs (Collins Spanish-English, Simon and Schuster Spanish-English, and Larousse Spanish) and the forms found in them are submitted to the acquisition process.</Paragraph>
      <Paragraph position="4"> However, forms not found in the dictionaries are not discarded outright, because the MRDs cannot be assumed to be complete and some of these "rejected" forms can, in fact, be found in corpora or in the input text of an application system. This mechanism works because we rely on linguistic clues and (Footnote 1: See (Viegas and Nirenburg, 1995) for the details on the acquisition process used to build the core Spanish lexicon, and (Viegas and Beale, 1996) for the details on the conceptual and technological tools used to check the quality of the lexicon.)</Paragraph>
      <Paragraph position="5"> The Lexical Rule Processor is an engine which produces a new entry from an existing one, such as the new entry compra (Figure 3) produced from the verb entry comprar (Figure 2) after applying the LR2event rule. The acquirer must check the definition and enter an example, but the rest of the information is simply retained. The LEXICAL-RULES zone specifies the morpho-semantic rule which was applied to produce this new entry and the verb it has been applied to.</Paragraph>
      <Paragraph position="6"> The morpho-semantic generator produces all predictable morphonological derivations with their morpho-lexico-semantic associations, using three major sources of clues: 1) word-forms with their corresponding morpho-semantic classification; 2) stem alternations and 3) construction mechanisms. The patterns of attachment include unification, concatenation and output rules. For instance, beber can be derived into beb[e]dero, beb[e]dor, beb[i]do, beb[i]da, volver into vuelto, and communicar into telecommunicación, etc. All affixes are assigned semantic features. For instance, the morpho-semantic rule LRpolarity_negative is at least attached to all verbs belonging to the -ar class of Spanish verbs, whose initial stem is of the form 'con', 'tra', or 'fir' with the corresponding allomorph in- attached to it (incontrolable, intratable, ... ).</Paragraph>
      <Paragraph position="7"> Figure 4 below shows the derivational morphology output for comprar, with the associated lexical rules which are later used to actually generate the entries. Lexical rules were applied to 1056 verb citation forms with 1263 senses among them. The rules helped acquire an average of 25 candidate new entries per verb sense, thus producing a total of 31,680 candidate entries.</Paragraph>
      <Paragraph position="8"> From the 26 different citation forms shown in Figure 4, only 9 forms (see Figure 5), featuring 16 new entries, have been accepted after checking. For instance, comprable, adj, LR3feasibilityattribute1, is morphologically derived from comprar, and adds to the semantics of comprar the shade of meaning of possibility.</Paragraph>
      <Paragraph position="9"> In this example no forms rejected by the dictionaries were found in the corpora, and therefore there was no reason to generate these new entries.</Paragraph>
      <Paragraph position="10"> However, the citation forms supercompra, precompra, precomprado, autocomprar actually appeared in other corpora, so that entries for them could be generated automatically at run time.</Paragraph>
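The checking step in this case study (candidates are checked against the MRDs, and unattested forms are held back rather than discarded) can be sketched as a simple triage function. The function name and the dictionary and corpus contents below are invented for illustration; only the word forms themselves come from the text above.

```python
def triage(candidates, dictionaries, corpus_forms):
    """Split candidate forms by dictionary attestation, then by corpus evidence."""
    attested = set().union(*dictionaries)
    submitted = [form for form in candidates if form in attested]
    held = [form for form in candidates if form not in attested]
    # Held forms are not discarded: corpus evidence can still revive them.
    revived = [form for form in held if form in corpus_forms]
    return submitted, held, revived


# Toy stand-ins for the three MRDs and for a corpus word list.
dictionaries = [{"compra"}, {"compra", "comprable"}, set()]
corpus_forms = {"supercompra"}
candidates = ["compra", "comprable", "supercompra", "autocomprar"]

submitted, held, revived = triage(candidates, dictionaries, corpus_forms)
print(submitted)
print(held)
print(revived)
```

A form that stays in the held list is a candidate for the run-time generation mentioned above, since its entry is only needed once the form actually turns up in some text.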
    </Section>
  </Section>
  <Section position="5" start_page="35" end_page="36" type="metho">
    <SectionTitle>
4 The Cost of Lexical Rules
</SectionTitle>
    <Paragraph position="0"> It is clear by now that LRs are most useful in large-scale acquisition. In the process of Spanish acquisition, 20% of all entries were created from scratch by H-level lexicographers and 80% were generated by LRs and checked by research associates. It should be made equally clear, however, that the use of LRs is not cost-free. Besides the effort of discovering and implementing them, there is also the significant time and effort spent on semi-automatic checking of the results of applying LRs to the basic entries, such as those for the verbs.</Paragraph>
    <Paragraph position="1"> The shifts and modulations studied in the literature in connection with the LRs and generative lexicon have also been shown to be not problem-free: sometimes the generation processes are blocked, or preempted, for a variety of lexical, semantic and other reasons (see (Ostler and Atkins, 1992)). In fact, the study of blocking processes, and the view of them as systemic rather than just a bunch of exceptions, is by itself an interesting enterprise (see (Briscoe et al., 1995)).</Paragraph>
    <Paragraph position="2"> Obviously, similar problems occur in real-life large-scale lexical rules as well. Even the most seemingly regular processes do not typically go through in 100% of all cases. This makes the LR-affected entries not generable fully automatically and this is why each application of an LR to a qualifying phenomenon must be checked manually in the process of acquisition.</Paragraph>
    <Paragraph position="3"> Adjectives provide a good case study for that. The acquisition of adjectives in general (see (Raskin and Nirenburg, 1995)) results in the discovery and application of several large-scope lexical rules, and it appears that no exceptions should be expected. Table 1 illustrates examples of LRs discovered and used in adjective entries.</Paragraph>
    <Paragraph position="4"> The first three and the last rule are truly large-scope rules. Out of these, the -able rule seems to be the most homogeneous and 'error-proof.' Around 300 English adjectives out of the 6,000 or so, which occur in the intersection of LDOCE and the 1987-89 Wall Street Journal corpora, end in -able.</Paragraph>
    <Paragraph position="5"> About 87% of all the -able adjectives are like readable: they mean, basically, something that can be read. In other words, they typically modify the noun which is the theme (or beneficiary, if animate) of the verb from which the adjective is derived: One can read the book. - The book is readable.</Paragraph>
    <Paragraph position="6"> The temptation to mark all the verbs as capable of assuming the suffix -able (or -ible) and forming adjectives with this type of meaning is strong, but it cannot be done because of various forms of blocking or preemption. Verbs like kill, relate, or necessitate do not form such adjectives comfortably or at all.</Paragraph>
    <Paragraph position="7"> Adjectives like audible or legible do conform to the formula above, but they are derived, as it were, from suppletive verbs, hear and read, respectively. More distressingly, however, a complete acquisition process for these adjectives uncovers 17 different combinations of semantic roles for the nouns modified by the -ble adjectives, involving, besides the "standard" theme or beneficiary roles, the agent, experiencer, location, and even the entire event expressed by the verb. It is true that some of these combinations are extremely rare (e.g., perishable), and all together they account for under 40 adjectives. The point remains, however, that each case has to be checked manually (well, semi-automatically, because the same tools that we have developed for acquisition are used in checking), so that the exact meaning of the derived adjective with regard to that of the verb itself is determined. It turns out also that, for a polysemous verb, the adjective does not necessarily inherit all its meanings (e.g., perishable again).</Paragraph>
  </Section>
</Paper>