
<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0303">
  <Title>Controlling the application of lexical rules</Title>
  <Section position="3" start_page="0" end_page="11" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Lexicalist linguistic theories, such as HPSG, LFG and categorial grammar, rely heavily on lexical rules. Recently, techniques have been described which address the efficiency issues this raises for fully productive rules, such as inflectional rules and 'syntactic rules' (such as the HPSG complement extraction lexical rule). For example, Bouma &amp; van Noord (1994) and Johnson &amp; Dorre (1995) propose techniques for delayed evaluation of lexical rules so that they apply 'on demand' at parse time. Meurers &amp; Minnen (1995) present a covariation approach, in which a finite-state machine for the application of lexical rules is derived by computing possible follow relations between the set of rules; pruned FSMs are then associated with classes of actual lexical entries, representing the restricted set of rules which can apply to those entries. Finally, entries themselves are extended with information common to all their derived variants. These techniques achieve most of the advantages of lexicon expansion in the face of recursive rules and cyclic rule interactions which preclude a full off-line expansion.</Paragraph>
    <Paragraph position="1"> Although these treatments allow for the efficient use of productive lexical rules, they do not address the issue of semi-productivity of derivational morphological and sense extension rules, which causes considerable problems in the construction of broad coverage lexical knowledge bases (LKBs) (see, for example, Climent and Martí, 1995; Pirelli et al., 1994). The standard formalization of lexical rules entails that derived entries will exist without exception for any basic entry which is compatible with the lexical rule input description. Formal accounts of some classes of exceptions, such as preemption by synonymy, have been developed (e.g. Briscoe et al., 1995), but these suffer from the disadvantage that detailed lexical semantic information must be available to detect potential synonyms. The search for a fully productive statement of verb alternations has led to an increasingly semantic perspective on such rules. Pinker (1989) argues that so-called broad semantic classes (e.g. creation or transfer verbs) provide necessary conditions for lexical rule application, but that narrow class lexical rules should be specified, breaking down such rules into a number of fully-productive subcases. But, in the attempt to define such subcases, Pinker is forced to make subtle and often unintuitive distinctions. Similarly, Levin (1992) delimits classes of verbs to which particular sets of alternations apply, but some of her classes are very small and do not have straightforward semantic criteria for membership. Thus, even if the narrow class approach is correct, its implementation is problematic.</Paragraph>
    <Paragraph position="2"> From a computational perspective, an equally acute problem is the proliferation of senses that results when lexical rules are encoded as fully productive. For instance, the result of applying the vehicle-name -&gt; verb-of-motion lexical rule can be input to several other lexical rules. The forms which would arise if the alternations given by Levin (1992:267) are applied to helicopter are illustrated in (1):
(1) a The pilot helicoptered
    b The pilot helicoptered over the forests
    c Mrs Clinton was helicoptered to the base
    d The pilot helicoptered the forests
    e The pilot helicoptered his passengers sick
Judgements of the grammaticality of such examples differ (though (1c) is an attested example), but even when such senses are plausible and attested, they are rare for the great majority of nouns which could in principle undergo the conversion.</Paragraph>
    <Paragraph position="3"> Jackendoff (1975) and others have proposed that lexical rules be interpreted as redundancy statements which abbreviate the statement of the lexicon but which are not applied generatively. This conception of lexical rules has been utilized in computational lexical knowledge bases, for example by Sanfilippo (1993). However, this approach cannot account for the semi-productive nature of such rules, illustrated with respect to the dative alternation in (2):
(2) John faxed / xeroxed / emailed his colleagues a copy of the report
And for practical LKB building, there is the problem of acquiring the information about which lexical entries a rule applies to. Machine readable dictionaries (MRDs) were used for this purpose by Sanfilippo, but the absence of a sense in an MRD does not mean it is unknown to the lexicographer: dictionaries have space limitations and senses may be omitted if they are rare or specialized, and also if they are 'obvious' -- i.e. the result of a highly productive process (Kilgarriff, 1992). Furthermore, if broad coverage is attempted, the polysemy problem is still acute. Finally, theories of the lexicon in which the consequences of lexical rules are precomputed cannot be correct in the limit because of the presence of recursive lexical rules such as re-, anti- or great- prefixation (e.g. rereprogram, anti-anti-missile or great-great-grandfather).</Paragraph>
    <Paragraph position="4"> Thus neither the interpretation of lexical rules as fully generative nor as purely abbreviatory is adequate linguistically or as the basis for LKBs. Although many lexical rules are subject to exceptions, gaps and variable degrees of conventionalization, most are semi-productive in the sense that they play a role in the production and interpretation of nonce forms and errors. In the remainder of this paper, we illustrate how the linguistically-motivated probabilistic framework for lexical rule application described in Copestake and Briscoe (1995) and Briscoe and Copestake (1995) might be utilized to address these practical problems.</Paragraph>
    <Paragraph position="5"> 2 Probabilistic lexical rules
Copestake and Briscoe (1995) and Briscoe and Copestake (1995) argue that lexical rules are sensitive to both type and token frequency effects, which determine language users' assessments of the degree of acceptability of a given derived form and also their willingness to apply a rule in producing or interpreting a novel form. Arguments for a treatment of semi-productivity along these lines have been advanced by Goldberg (1995) and Bauer (1983) (though not with respect to lexical rules). We regard our use of probabilities as being consistent with Bauer's claim that accounting for semi-productivity is an issue of performance, not competence (Bauer 1983:71f).</Paragraph>
    <Paragraph position="6"> The frequency with which a given word form is associated with a particular lexical entry (i.e. sense or grammatical realization) is often highly skewed; Church (1988) points out that a model of part-of-speech assignment in context will be 90% accurate (for English) if it simply chooses the lexically most frequent part-of-speech for a given word. Briscoe and Carroll (1995) found in one corpus that there were about 18 times as many instances of believe in the most common subcategorization class as in the 4 least common classes combined. In the absence of other factors, it seems very likely that language users utilize frequency information to resolve indeterminacies in both generation and interpretation. Such a strategy is compatible with and may well underlie the Gricean Maxim of Manner, in that ambiguities in language will be more easily interpretable if there is a tacit agreement not to utilize abnormal or rare means of conveying particular messages. We can model this aspect of language use as a conditional probability that a word form will be associated with a specific lexical entry:</Paragraph>
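Written out, assuming the relative-frequency estimate implied by the worked example later in this section (observations of an entry over total observations of the word form), the probability in question is simply:

$$P(\mathrm{entry}_i \mid \mathrm{wordform}_j) \approx \frac{\mathrm{freq}(\mathrm{entry}_i,\ \mathrm{wordform}_j)}{\mathrm{freq}(\mathrm{wordform}_j)}$$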
    <Paragraph position="8"> This proposal is not novel and is the analogue of proposals to associate probabilities with initial trees in a Lexicalized Tree Adjoining Grammar (Resnik, 1992; Schabes, 1992). The derivation probability, which gives the probability of a particular sentence interpretation, will depend on the product of the lexical probabilities (rule probabilities might also play a role, but can be ignored in the categorial framework we adopt here).</Paragraph>
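Under the simplifying assumption that lexical choices are treated as independent, the lexical contribution to the probability of a derivation $d$ over word forms $w_1 \ldots w_m$ with chosen entries $e_1 \ldots e_m$ is:

$$P(d) \propto \prod_{k=1}^{m} P(e_k \mid w_k)$$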
    <Paragraph position="9"> Lexical probabilities are acquired for both basic and derived lexical entries independently of the lexical rules used to create derived entries, so a derived entry might be more frequent than a basic one. Basic entries are augmented with a representation of the attested lexical rules which have applied to them, and of any chains of such rules, where both the basic entry and these 'abbreviated' derived entries are associated with a probability. One way of implementing this approach is to adopt the covariation technique of Meurers &amp; Minnen (1995) discussed above. If we assume a precompiled representation of this form, conditional probabilities that a word form will be associated with a particular (basic or derived) entry can be associated with states in the FSM, as illustrated in Figure 1. (The feature structure itself is based on the verb representation scheme developed by Sanfilippo (1993), though the details are unimportant for current purposes.) In this representation, the states of the FSM, which have been given mnemonic names corresponding to their types, are each associated with a probability representing the relative likelihood that fax will be associated with the derived entry which results from applying the rule to the source entry (the probabilities shown here are purely for illustrative purposes). We call this representation the lexeme for a given word. Figure 2 shows part of the corresponding FSM explicitly. Note that there are states with no associated probabilities, reflecting possible but unattested usages. The topology of the FSM associated with a given word may be shared with other words, but the specific probabilities associated with the states representing lexical entries will be idiosyncratic, so that each lexeme representation must minimally encode the unique name of the relevant FSM and a probability for each attested state / lexical entry, as shown in Figure 1. If the derived form is irregular in some way, then the exceptional information can be stipulated at the relevant state, and the feature structure calculated by default-unifying the specified information with the productive output of the lexical rule. For example, if beggar is treated as derived by the agentive -er rule (which</Paragraph>
    <Paragraph position="11"> [Figure 1: the lexeme representation for fax, with per-state probabilities. Figure 2: part of the corresponding FSM.] </Paragraph>
    <Paragraph position="13"> is reasonable synchronically), then the irregular morphology can be stipulated and will override the predicted begger.</Paragraph>
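As an illustration of how such a lexeme representation might be encoded, the following sketch (Python, with hypothetical names; the FSM topology, rule names and counts are illustrative, echoing the fax example later in the section rather than reproducing Figure 1) stores a shared FSM topology once and, per lexeme, only the probabilities of attested states, leaving unattested states unvalued:

from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

# A shared FSM topology: states are mnemonic names for (basic or derived)
# lexical entry types; transitions are labelled with lexical rule names.
@dataclass(frozen=True)
class LexicalRuleFSM:
    name: str
    states: Set[str]
    transitions: Set[Tuple[str, str, str]]  # (source state, lexical rule, target state)

@dataclass
class Lexeme:
    """A word's lexeme: the name of its FSM plus probabilities for attested states.

    States with no probability are possible but unattested usages."""
    form: str
    fsm: LexicalRuleFSM
    state_probs: Dict[str, float] = field(default_factory=dict)

    def prob(self, state: str) -> Optional[float]:
        if state not in self.fsm.states:
            raise KeyError(f"{state} is not a state of {self.fsm.name}")
        return self.state_probs.get(state)  # None = possible but unattested

# Hypothetical topology loosely modelled on the fax discussion in the text.
TRANSFER_FSM = LexicalRuleFSM(
    name="create/transfer-lexeme-fsm",
    states={"trans", "for-ditrans", "recip-dative", "benef-dative"},
    transitions={
        ("trans", "recipient-dative", "recip-dative"),
        ("for-ditrans", "benefactive-dative", "benef-dative"),
    },
)

# Illustrative probabilities only (not the values shown in Figure 1).
fax = Lexeme("fax", TRANSFER_FSM, {"trans": 20 / 52, "for-ditrans": 30 / 52})

print(fax.prob("trans"))         # attested entry: 0.3846...
print(fax.prob("recip-dative"))  # possible but unattested entry: None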
    <Paragraph position="14"> The resulting FSM is not a Markov model, because probabilities on states represent output probabilities and not transition probabilities in the machine. In addition, since the probabilities encode the relative likelihood that a given word form will associate with a particular lexical entry, the set of probabilities on states of an FSM will not be globally normalized. One FSM will represent the application of both rules of conversion (zero affixation) and rules of derivation to a given lexeme; the latter will change the form of the word, and thus participate in a different distribution. See, for example, Figure 3, which is intended to cover the noun and verb lacquer, plus the derived form lacquerer (with agentive and instrument readings taken as distinct).</Paragraph>
    <Paragraph position="15"> One problem with the acquisition of reliable estimates of such probabilities is that many possibilities will remain unseen and will, therefore, be unattested. There are a variety of well-known techniques for smoothing probability distributions which avoid assigning zero probability to unseen events. Church &amp; Gale (1994) discuss the applicability of these to linguistic problems and emphasize the need for differential estimation of the probability of different unseen events in typical linguistic applications. For instance, one standard approach to smoothing involves assigning a hypothetical single observation to each unseen event in a distribution before normalizing frequencies to obtain probabilities. This captures the intuition that the more frequent the observation of some events in a distribution, the less likely it is that the unseen possibilities will occur. Thus, a rare word with only a few observations may be more likely to be seen in an alternative realization than a very frequent word which has been observed many times in some subset of the possible realizations licensed by the grammar. However, all unseen events will be assigned the same probability within each distinct distribution, and this is at best a gross estimate of the actual distribution.</Paragraph>
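A minimal sketch of this basic smoothing step, assuming a lexeme is represented simply as a map from entry names to observed counts (unseen entries having a count of zero):

from typing import Dict

def add_one_smooth(counts: Dict[str, int]) -> Dict[str, float]:
    """Assign one hypothetical observation to each unseen entry, then normalize.

    Note that every unseen entry in the same distribution ends up with the same probability."""
    adjusted = {entry: (c if c > 0 else 1) for entry, c in counts.items()}
    total = sum(adjusted.values())
    return {entry: c / total for entry, c in adjusted.items()}

# fax: two attested and two unseen entries (counts as assumed in the example below)
fax_counts = {"trans": 20, "for-ditrans": 30, "recip-dative": 0, "benef-dative": 0}
print(add_one_smooth(fax_counts))  # both unseen entries receive the same probability, 1/52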
    <Paragraph position="16"> For unattested derived lexical entries for a given word form, the relative productivity of the lexical rule(s) required to produce the derived entry can be used to allow differential estimation. We estimate the productivity of each lexical rule by calculating the ratio of possible to attested outputs for each rule (cf. Aronoff, 1976):</Paragraph>
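Given the definitions of N and M immediately below, a natural way to write this productivity ratio (this reconstruction assumes attested outputs are counted over possible inputs) is:

$$\mathrm{Prod}(lr) = \frac{M}{N}$$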
    <Paragraph position="18"> (where N is the number of attested lexical entries which match the lexical rule input and M is the number of attested output entries). We discuss some more elaborate measurements for productivity in section 4.</Paragraph>
    <Paragraph position="19"> This information concerning the degree of productivity of a rule can be combined with a smoothing technique to obtain a variant of the enhanced smoothing methods discussed by Church &amp; Gale (1994), capable of assigning distinct probabilities to unseen events within the same distribution. This can be achieved by estimating the held-back probability mass to be distributed between the unseen entries using the basic smoothing method, and then distributing this mass differentially by multiplying the total mass for unseen entries (expressed as a ratio of the total observations for a given word) by a different ratio for each lexical rule. This ratio is obtained by dividing the ratio representing the productivity of the lexical rule by the sum of the productivity ratios of the lexical rules required to construct all the unseen entries.</Paragraph>
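Putting the pieces together, and writing $\mathrm{UnseenMass}(j)$ for the held-back mass for word-form $j$ expressed as a ratio of its total observations (a reconstruction from the description above, not a verbatim formula), the estimate for an unattested entry derived by rule $lr_i$ is:

$$\mathrm{Est\text{-}freq}(lr_i, j) = \mathrm{UnseenMass}(j) \times \frac{\mathrm{Prod}(lr_i)}{\sum_{k=1}^{n} \mathrm{Prod}(lr_k)}$$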
    <Paragraph position="21"> (where lr1 ... lrn are the n lexical rules needed to derive the n unattested entries for word-form j). This will yield revised ratios for each given word, which can then be normalized to probabilities.</Paragraph>
    <Paragraph position="22"> To make this clearer, consider the use of the probabilities to drive interpretation in the case of a nonce usage. Consider the lexical entry for the verb fax given in Figure 1 and assume the verb is unattested in a dative construction, such as fax me the minutes of the last meeting, but that it may undergo either the benefactive-dative or recipient-dative rules to yield a dative realization. These rules would produce either a deputive reading, where although the speaker is a beneficiary of the action the recipient is unspecified, or a reading where the speaker is also the recipient of the transfer action. Choosing between these rules in the absence of clear contextual information could be achieved by choosing the derivation (and thus interpretation) with highest probability. This would depend solely on the relative probability of the unseen derived entries created by applying these two rules to fax. This would be (pre)computed by applying the formulae above to a representation of the lexeme for fax in which ratios represent the number of observations of an entry for a given word form over the total number of observations of that word form, and unattested entries are noted and assigned one observation each:
create/transfer-lexeme-fsm (trans(20/52), for-ditrans(30/52), recip-dative(1/52), benef-dative(1/52), ...)
Now if we assume that the recipient dative rule can apply to 100 source entries and the resulting derived entries are attested in 60 cases, whilst the benefactive dative can apply to 1000 entries and the derived entries are attested in 100 cases, we can compute the revised estimates of the probabilities for the unseen entries for fax by instantiating the formula for estimated frequency as follows:
Est-freq(fax with recipient-dative) = 2/52 x ((60/100) / (60/100 + 100/1000))
and similarly for the benefactive-dative case. The resulting ratios can then be converted to probabilities by normalizing them along with those for the attested entries for fax. In this case, the recipient reading will be preferred as the recipient dative rule is more productive.</Paragraph>
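The same computation can be sketched in a few lines of Python (the entry counts and rule statistics are those assumed in the example above; the variable names are illustrative only):

# Distribute the held-back mass for unseen entries of "fax" in proportion to
# the productivity of the deriving rules, then normalize with the attested entries.

counts = {"trans": 20, "for-ditrans": 30}           # attested entries for fax
unseen = {"recip-dative": "recipient-dative",        # unseen entry -> deriving rule
          "benef-dative": "benefactive-dative"}
productivity = {"recipient-dative": 60 / 100,        # attested / possible outputs
                "benefactive-dative": 100 / 1000}

total = sum(counts.values()) + len(unseen)           # one hypothetical obs per unseen entry
unseen_mass = len(unseen) / total
prod_sum = sum(productivity[rule] for rule in unseen.values())

ratios = {entry: c / total for entry, c in counts.items()}
for entry, rule in unseen.items():
    ratios[entry] = unseen_mass * productivity[rule] / prod_sum

norm = sum(ratios.values())
probs = {entry: r / norm for entry, r in ratios.items()}
print(probs)  # recip-dative receives a higher estimate than benef-dative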
    <Paragraph position="23"> This general approach handles the possibility of specialized subcases of more general rules. For example, we could factor the computation of productivity between subtypes of the input type of a rule and derive more fine-grained measures of productivity for each narrow class a rule applies to. In the case of specialized subcases of lexical rules which apply to a narrower range of lexical items but yield a more specific interpretation (such as the rules of Meat or Fur grinding as opposed to Grinding proposed in Copestake &amp; Briscoe, 1995), the relative productivity of each rule will be estimated in the manner described above, but the more specialized rule is likely to be more productive since it will apply to fewer entries than the more general rule. Similarly, in Figure 3, we assumed a use-substance lexical rule, but a more accurate estimation of probabilities is obtained by considering specialized subclasses, as we will see in the next section.</Paragraph>
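One way to make this factoring by narrow class concrete, assuming the counts are simply restricted to a subtype $C$ of the rule's input type, is:

$$\mathrm{Prod}(lr, C) = \frac{M_C}{N_C}$$

where $N_C$ is the number of attested entries of subtype $C$ matching the input of $lr$ and $M_C$ is the number of attested outputs derived from them.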
  </Section>
</Paper>