<?xml version="1.0" standalone="yes"?>
<Paper uid="J96-2001">
  <Title>Psycholinguistics</Title>
  <Section position="4" start_page="162" end_page="165" type="intro">
    <SectionTitle>
3. Discussion
</SectionTitle>
    <Paragraph position="0"> As we have seen in the four examples discussed above, the MLE computed over hapax legomena yields a better prediction of lexical prior probabilities for unseen cases than does an MLE computed over the entire training corpus. We now have to consider why this result holds. As we shall see, the reasons are different from case to case, but nonetheless share a commonality: in all four cases, idiosyncratic lexical properties of high-frequency words dominate the statistical properties of the high-frequency ranges, thus making the overall MLE a less reliable predictor of the properties of the low-frequency and unseen cases.</Paragraph>
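The contrast between the two estimators can be made concrete with a minimal sketch. It assumes a corpus represented as (word form, tag) token pairs; the function names overall_mle and hapax_mle, the tags, and the toy sample are illustrative assumptions and do not come from the paper.

```python
from collections import Counter

def overall_mle(tokens, tag):
    """Token-based MLE: share of all tokens in the sample that carry `tag`."""
    return sum(1 for _, t in tokens if t == tag) / len(tokens)

def hapax_mle(tokens, tag):
    """Hapax-based MLE: share of once-occurring word forms that carry `tag`."""
    word_freq = Counter(word for word, _ in tokens)
    hapaxes = [(w, t) for w, t in tokens if word_freq[w] == 1]
    if not hapaxes:
        raise ValueError("no hapax legomena in the sample")
    return sum(1 for _, t in hapaxes if t == tag) / len(hapaxes)

# Toy sample (invented for illustration): one very frequent verb form
# and a tail of rare forms that are mostly nouns.
tokens = [("hebben", "verb")] * 6 + [
    ("boeken", "noun"),
    ("stoelen", "noun"),
    ("ramen", "noun"),
    ("afzeggen", "verb"),
]

print(overall_mle(tokens, "verb"))  # 0.7
print(hapax_mle(tokens, "verb"))    # 0.25
```

In this toy sample a single high-frequency verb pulls the overall estimate up to 0.7, while the hapaxes alone suggest 0.25, which is the kind of divergence the discussion attributes to idiosyncratic high-frequency words.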
    <Paragraph position="1"> First let us discuss the final case, that of -er ambiguity in Dutch, beginning with the derived and underived nouns. The hapax-based MLE for derived nouns in -er is somewhat higher than the overall MLE; for underived nouns, the hapax-based MLE is significantly lower, at half the overall MLE. This can be explained by the observation that a good many of the underived nouns in -er are high-frequency words such as moeder 'mother' and vader 'father'. Such words contribute to the overall proportional mass of the underived nouns, thus boosting the overall MLE for this class. A similar argument holds for the derived and underived adjectives.</Paragraph>
    <!-- Computational Linguistics Volume 22, Number 2 -->
    <Paragraph position="2"> Turning to proper names, we see that the hapax-based MLE is much larger than the overall MLE. Proper names differ from ordinary words in that there are relatively few proper names that are highly frequent, in comparison with words in general, but there are large numbers of types of names that occur rarely. Thus, we expect an imbalance of the kind we observe.</Paragraph>
    <Paragraph position="3"> Consider next the ambiguity in Dutch between -en verb forms and -en plural nouns. Ceteris paribus, plural nouns are less frequent than singular nouns; on the other hand, -en for verbs serves both the function of marking plurality and of marking the infinitive. High-frequency verbs include some very common word forms, such as the auxiliaries hebben 'have', zullen 'will', kunnen 'can', and moeten 'must'. Thus, for the high-frequency ranges, the data is weighted heavily towards verbs. On the other hand, while both nouns and verbs are open classes, nouns are far more productive as a class than are verbs (Baayen and Lieber 1991), and this pattern becomes predominant in the low-frequency ranges: among low-frequency types, most tokens are nouns. Hence, for the low-frequency ranges, the data is weighted towards nouns. These two opposing forces conspire to yield a downward trend in the percentage of verbs as we proceed from the high- to the low-frequency ranges.</Paragraph>
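The downward trend described here can be sketched by grouping tokens according to the frequency of their word form and computing the share of verb tokens in each group. This is only an illustration under stated assumptions: the function name verb_share_by_frequency and the toy sample are hypothetical, and real data would need frequency ranges rather than exact frequencies.

```python
from collections import Counter, defaultdict

def verb_share_by_frequency(tokens):
    """Map each word-form frequency to the share of verb tokens at that frequency."""
    word_freq = Counter(word for word, _ in tokens)
    totals = defaultdict(int)
    verbs = defaultdict(int)
    for word, tag in tokens:
        f = word_freq[word]
        totals[f] += 1
        if tag == "verb":
            verbs[f] += 1
    # Order from the highest to the lowest frequency, as in the discussion.
    return {f: verbs[f] / totals[f] for f in sorted(totals, reverse=True)}

# Toy sample (invented for illustration).
tokens = (
    [("hebben", "verb")] * 6
    + [("boeken", "noun"), ("boeken", "verb")]
    + [("stoelen", "noun"), ("ramen", "noun"), ("deuren", "noun"), ("afzeggen", "verb")]
)

shares = verb_share_by_frequency(tokens)
print(shares)  # {6: 1.0, 2: 0.5, 1: 0.25}
```

The verb share falls from the high-frequency end to the hapaxes in this toy sample, mirroring the pattern the text attributes to auxiliaries at the top and productive noun formation at the bottom.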
    <Paragraph position="4"> Next, consider the English past tense versus past participle ambiguity. One of the important functions of the past participle form is as an adjectival modifier or predicate; for example, the parked car. In this function the past participle has a passive meaning with transitive verbs, and a perfective meaning with unaccusative intransitive verbs; see Levin (1993, 86-88) for details. For reasons that are not clear to us, a predominant number of the high-frequency verbs cannot felicitously be used as prenominal adjectives. These verbs include unergative intransitives like walk, for which one would not expect to find the adjectival usage, given the above characterization; but they also include clear transitives like move, try, and ask, and unaccusative intransitives like appear, which are not generally felicitous in this usage. Consider: ?a moved car, ?a tried approach, ?an asked question, ?an appeared ad; but contrast: an oft-tried approach, a frequently asked question, a recently appeared ad, where an adverbial modifier renders the examples felicitous (see footnote 5). Among the low-frequency verbs, including accentuate, bottle, and incense, the predominant types are those in which the past participle usage is preferred. What is clear from the plot in the top panel of Figure 2 is that the downward trend in the regression curve to the right of the plot is due to the lexical properties of a relatively small number of high-frequency verbs. For the greater part of the frequency range, there is a relatively stable proportion of participles to finite past forms. Thus, the hapax-based MLE yields an estimate that is uncontaminated by the lexical properties of individual high-frequency forms.</Paragraph>
    <Paragraph position="5"> Finally, consider the Dutch verb forms in -en that we started with. In Figure 1 the strong downward trend in the regression curve at the right of the figure is due in large measure to the inclusion of high-frequency auxiliary verbs, examples of which have already been given. These verbs, while possible in the infinitival form, occur predominantly in the finite form. Hence, a form such as hebben 'have' is much more likely to be a plural finite form than it is to be an infinitive. At the low end of the frequency spectrum, we find a great many verbs derived with separable particles, such as afzeggen 'cancel'; note that separable prefixation is the most productive verb-forming process in Dutch. In the infinitival form, the particle is always attached to the verb. However, in the finite forms in main clauses, the particle must be separated: for example, wij zeggen onze afspraak af 'we are cancelling our appointment'. These properties of Dutch separable verbs boost the likelihood of infinitival forms for the low-frequency ranges, but they also boost the likelihood of (higher-frequency) finite plural forms such as zeggen: since the separated finite plural form zeggen is identical to the finite plural of the underived verb zeggen 'say', any separated finite forms will accrue to the frequency of the generally much more common derivational base.</Paragraph>
    <!-- Baayen and Sproat Lexical Priors for Low-Frequency Forms -->
    <Paragraph position="6"> Footnote 5: One reviewer has suggested that the infelicity of many adjectival passives relates to the fact that the action denoted by the base verb is not regarded as producing an enduring result that affects the object denoted by the (deep) internal argument: contrast a broken vase, where the vase is enduringly affected by the breaking, with ?a seen movie, where the movie is not affected. However, this cannot be the whole story since the object denoted by the internal argument of kill is presumably enduringly affected by the killing, yet ?a killed man seems about as odd as ?a seen movie.</Paragraph>
    <Paragraph position="7"> What all of these cases share is that the statistical properties of the high-frequency ranges are dominated by lexical properties of particular sets of high-frequency words.</Paragraph>
    <Paragraph position="8"> This in turn biases the overall MLE and makes it a poor predictor of novel cases.</Paragraph>
    <Paragraph position="9"> For example, auxiliaries such as hebben 'have' are among the most common verbs in Dutch, but they have rather different syntactic, and hence morphological, properties from other verbs; these properties in turn contaminate the high-frequency ranges and thus the overall MLE. In contrast, words in the low-frequency ranges, and particularly hapaxes, are heavily populated with (necessarily non-idiosyncratic) neologisms derived via productive morphological processes (Baayen 1989; Baayen and Renouf 1996). Any lexical biases that are inherent in these morphological processes (for example, the fact that a low-frequency Dutch word ending in -en is more likely to be a noun than a verb) are well estimated by the hapaxes. Now, for a sufficiently large training corpus, we can be very confident that an unseen complex word is non-idiosyncratic and formed via a productive morphological process, and this confidence increases as the corpus size increases (Baayen and Renouf 1996). Since the hapaxes of a particular morphological process mostly consist of non-idiosyncratic formations from that process, it makes sense that the distribution of a property among the hapaxes is the least contaminated estimate available for the distribution of that property among the unseen cases.</Paragraph>
    <Paragraph position="10"> The hapax-based MLE that we have proposed is not only observationally preferable to the overall MLE, it is also firmly grounded in probability theory. The probability of encountering an unseen word given that this word is a word in -en is estimated by:
      $$\Pr(\text{unseen type} \mid \text{-en}) \approx \frac{N_{1,N(\text{-en})}}{N(\text{-en})} \qquad (1)$$</Paragraph>
    <Paragraph position="12"> where $N_{1,N(\text{-en})}$ denotes the number of hapax legomena in -en among the $N(\text{-en})$ tokens in -en in the training sample; see Baayen (1989), Baayen and Lieber (1991), Good (1953), and Church and Gale (1991). Of course, this estimate is heavily influenced by the highest-frequency words in -en, as these words contribute many tokens to $N(\text{-en})$. In our example, high-frequency auxiliaries such as hebben cause the probability of sampling unseen types in -en to be low: newly sampled tokens have a high probability of being an auxiliary rather than some previously unseen word. Interestingly, (1) can be used to derive an expression for the conditional probability that a word is, say, a noun, given that it is an unseen type in -en (Baayen 1993):
      $$\Pr(\text{noun} \mid \text{unseen -en type}) \approx \frac{N_{1,N(\text{-en, noun})}}{N_{1,N(\text{-en})}} \qquad (2)$$
    Here (1) is applied twice: once (in the denominator) to the distribution of all -en words; and once (in the numerator) to the distribution of the -en nouns, after reclassifying all verbal tokens in -en as representing one (very high-frequency) noun type in the frequency distribution. Similarly, the probability that an unseen word in -en is a verb is given by
      $$\Pr(\text{verb} \mid \text{unseen -en type}) \approx \frac{N_{1,N(\text{-en, verb})}}{N_{1,N(\text{-en})}} \qquad (3)$$
    Thus the proportion of verbal hapaxes in -en that we have suggested as an adjusted MLE on the basis of the curve shown in Figure 2 is in fact an estimate of the conditional probability that a word is a verb, given that it is an unseen type in -en.
    The results of the analyses presented in this paper are of potential importance in various applications that require lexical disambiguation and where an estimate of lexical priors is required. For high-frequency words, one can obtain fairly reliable estimates of the lexical priors by tagging a corpus that gives good coverage to words of various frequency ranges. For predicting the lexical priors for the much larger mass of very low-frequency types, most of which would not occur in any such corpus, the results we have presented suggest that one should concentrate on tagging a good representative sample of the hapaxes, rather than extensively tagging words of all frequency ranges.</Paragraph>
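The estimate in (1) and the conditional probabilities derived from it can be sketched as follows. This is a minimal illustration, assuming tokens are (word form, tag) pairs all ending in -en; the function names unseen_probability and class_given_unseen and the toy samples are hypothetical and are not taken from the paper.

```python
from collections import Counter

def unseen_probability(tokens):
    """Estimate (1): number of hapax word forms divided by number of tokens."""
    word_freq = Counter(word for word, _ in tokens)
    n_hapax = sum(1 for n in word_freq.values() if n == 1)
    return n_hapax / len(tokens)

def class_given_unseen(tokens, tag):
    """Ratio of two applications of (1): hapaxes carrying `tag` over all hapaxes."""
    marginal = unseen_probability(tokens)
    if marginal == 0:
        raise ValueError("no hapax legomena in the sample")
    word_freq = Counter(word for word, _ in tokens)
    # Numerator: hapaxes with `tag`, still measured against all tokens in the sample.
    joint = sum(1 for w, t in tokens if word_freq[w] == 1 and t == tag) / len(tokens)
    return joint / marginal

# Toy samples (invented for illustration): the same tail of hapaxes,
# with a high-frequency auxiliary contributing 6 or 16 tokens.
tail = [("boeken", "noun"), ("stoelen", "noun"), ("ramen", "noun"), ("afzeggen", "verb")]
small = [("hebben", "verb")] * 6 + tail
large = [("hebben", "verb")] * 16 + tail

print(unseen_probability(small))  # 0.4
print(unseen_probability(large))  # 0.2
print(class_given_unseen(small, "noun"), class_given_unseen(large, "noun"))
```

Adding tokens of the auxiliary halves the estimated probability of meeting an unseen type, while the conditional probability of a noun given an unseen type stays at about 0.75 (up to floating-point rounding) in both samples, because the token count cancels in the ratio.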
  </Section>
</Paper>