<?xml version="1.0" standalone="yes"?> <Paper uid="J93-2002"> <Title>From Grammar to Lexicon: Unsupervised Learning of Lexical Syntax</Title> <Section position="3" start_page="244" end_page="245" type="metho"> <SectionTitle> 2. Collecting Observations </SectionTitle> <Paragraph position="0"> This section describes the local morpho-syntactic cues that Lerner uses to identify likely examples of particular syntactic frames. These cues must address two problems: finding verbs in the input and identifying phrases that represent arguments to the verb. The next two subsections present cues for these tasks. The cues presented here are not intended to be the last word on local cues to structure in English; they are merely intended to illustrate the feasibility of such cues and demonstrate how the statistical model accommodates their probabilistic correspondence to the true syntactic structure of sentences. Variants of these cues are presented in Brent (1991a, 1991b). The final subsection summarizes the procedure for collecting observations and discusses a sample of the observations table collected from the Brown corpus.</Paragraph> <Section position="1" start_page="244" end_page="245" type="sub_section"> <SectionTitle> 2.1 Finding Verbs </SectionTitle> <Paragraph position="0"> Lerner identifies verbs in two stages, each carried out on a separate pass through the corpus. First, strings that sometimes occur as verbs are identified. Second, occurrences of those strings in context are judged as likely or unlikely to be verbal occurrences.</Paragraph> <Paragraph position="1"> The second stage is necessary because of lexical ambiguity.</Paragraph> <Paragraph position="2"> The first stage uses the fact that all English verbs can occur both with and without the suffix -ing. Words are taken as potential verbs if and only if they display this alternation in the corpus. 
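As a rough illustration of this first stage, the pairing criterion can be sketched in Python. This is a minimal sketch, not Lerner's implementation: the function names `ing_stems` and `potential_verbs` and the three spelling rules (bare suffix, restored final e, undoubled final consonant) are assumptions made for illustration, standing in for the morphological adjustment rules the system actually uses.

```python
def ing_stems(word):
    """Candidate stems for a word ending in -ing (simplified, assumed rules)."""
    base = word[:-3]
    stems = [base, base + "e"]            # talking -> talk, making -> make
    if len(base) >= 2 and base[-1] == base[-2]:
        stems.append(base[:-1])           # running -> run
    return stems


def potential_verbs(tokens):
    """Return every stem that occurs in the corpus both with and without -ing."""
    vocab = {t.lower() for t in tokens}
    verbs = set()
    for word in vocab:
        # Require at least two letters before the suffix (an assumed cutoff).
        if word.endswith("ing") and len(word) >= 5:
            verbs.update(s for s in ing_stems(word) if s in vocab)
    return verbs
```

On a toy token list containing talk, talking, make, making, run, running, ear, and earring, the sketch returns talk, make, and run, but also ear, which shows why a purely string-based criterion needs the later filtering stages.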
2 There are a few words that meet this criterion but do not occur as verbs, including income~incoming (*incame/incomed), ear~earring, her~herring, and middle~middling. However, the second stage of verb detection, combined with the statistical criteria, prevents these pairs from introducing errors.</Paragraph> <Paragraph position="3"> adjustment rules (Karttunen 1983). The system described here uses rules similar to those of Karttunen and Wittenburg (1983), but it resolves the ambiguities using only the contents of the corpus. This technique will be described in a subsequent paper.</Paragraph> <Paragraph position="4"> Computational Linguistics Volume 19, Number 2 Lerner assumes that a potential verb is functioning as a verb unless the context suggests otherwise. In particular, an occurrence of a potential verb is taken as a non-verbal occurrence only if it follows a determiner or a preposition other than to. For example, was talking would be taken as a verb, but a talk would not. This precaution reduces the likelihood that a singular count noun will be mistaken for a verb, since singular count nouns are frequently preceded by a determiner.</Paragraph> <Paragraph position="5"> Finally, the only morphological forms that are used for learning syntactic frames are the stem form and the -ing form. There are several reasons for this. First, forms ending in -s are potentially ambiguous between third person singular present verbs and plural nouns. Since plural nouns are not necessarily preceded by determiners (I like to take walks), they could pose a significant ambiguity problem. Second, past participles do not generally take direct objects: knows me and knew me are OK, but not *is known me. Further, the past tense and past participle forms of some verbs are identical, while those of others are distinct. As a result, using the -ed forms would have complicated the statistical model substantially. 
Since the availability of raw text is not generally a limiting factor, it makes sense to wait for the simpler cases.</Paragraph> </Section> <Section position="2" start_page="245" end_page="245" type="sub_section"> <SectionTitle> 2.2 Identifying Argument Phrases </SectionTitle> <Paragraph position="0"> When a putative occurrence of a verb is found, the next step is to identify the syntactic types of nearby phrases and determine whether or not they are likely to be arguments of the verb.</Paragraph> <Paragraph position="1"> First, assume that a phrase P and a verb V have been identified in some sentence. Lerner's strategy for determining whether P is an argument to V has two components:</Paragraph> <Paragraph position="3"> 1. If P is a noun phrase (NP), take it as an argument only if there is evidence that it is not the subject of another clause.</Paragraph> <Paragraph position="4"> 2. Regardless of P's category, take it as an argument only if it occurs to the right of V and there are no potential attachment points for P between V and P.</Paragraph> <Paragraph position="5"> For example, suppose that the sequence that the were identified as the left boundary of a clause in the sentence I want to tell him that the idea won't fly. Because pronouns like him almost never take relative clauses, and because pronouns are known at the outset, Lerner concludes that the clause beginning with that the is probably an argument of the verb tell. 3 It is always possible that it could be an argument of the previous verb want, but Lerner treats that as unlikely. On the other hand, if the sentence were I want to tell the boss that the idea won't fly, then Lerner cannot determine whether the clause beginning with that the is an argument to tell or is instead related to boss, as in I want to fire the boss that the workers don't trust.</Paragraph> <Paragraph position="6"> Now consider specific cues for identifying argument phrases. 
The phrase types for which data are reported here are noun phrases, infinitive verb phrases (VPs), and tensed clauses. These phrase types yield three syntactic frames with a single argument and three with two arguments, as shown in Table 1. The cues used for identifying these frames are shown in Tables 2 and 3. Table 2 defines lexical categories that are referred to in Table 3. The category V in Table 3 starts out empty and is filled as verbs are detected on the first pass. &quot;cap&quot; stands for any capitalized word and &quot;cap+&quot; for any sequence of capitalized words. These cues are applied by matching them against the string of words immediately to the right of each verb. For example, a verb V is recorded as having occurred with a direct object and no other argument phrase if V is followed by a pronoun of ambiguous case and then a coordinating conjunction, as in I'll see you when you return from Mexico. The coordinating conjunction makes it unlikely that the pronoun is the subject of another clause, as in I see you like champagne. It also makes it unlikely that the verb has an additional NP argument, as in I'll tell you my secret recipe.</Paragraph> </Section> </Section> <Section position="4" start_page="245" end_page="248" type="metho"> <SectionTitle> 3 Thanks to Don Hindle for this observation (personal communication). </SectionTitle> <Paragraph position="0"> Michael R. Brent From Grammar to Lexicon Table 1. The six syntactic frames studied in this paper. Columns: SF Description | Good Example | Bad Example. Rows: NP only | greet them | *arrive them; tensed clause | hope he'll attend | *want he'll attend; infinitive | hope to attend | *greet to attend. Table 3. Cues for syntactic frames. The category V is initially empty and is filled out during the first pass. &quot;cap&quot; stands for any capitalized word and &quot;cap+&quot; stands for any sequence of capitalized words.</Paragraph> <Paragraph position="1"> Table 4. Observations (each entry gives the verb, its count V, then its nonzero frame counts; zeros are omitted): recall 42 3 4; recede 5; receive 106 4; reckon 10; recognize 71 6 6; recommend 32 2 1; reconcile 5; record 97 2 2; recount 5; recover 14 4 1; recreate 2; recruit 11; recur 5; redeem 3; redirect 2; rediscover 2; reduce 85 2; reek 2; reel 2; refer 43 1; refine 4; refresh 4; refuel 3; refuse 22 1.</Paragraph> <Section position="1" start_page="246" end_page="248" type="sub_section"> <SectionTitle> 2.3 Summary and Sample Data </SectionTitle> <Paragraph position="0"> To summarize, the procedure for collecting observations from a corpus is as follows: 1. Go through the corpus once finding pairs of words such that one is the result of adding the suffix -ing to the other, applying appropriate morphological adjustment rules. List members of such pairs as verbs. 2. Go through the corpus again. At each word w that is on the list of verbs, (a) If w is not preceded by a preposition (other than to) or a determiner, increment the number of times that w appears as a verb.</Paragraph> <Paragraph position="1"> (b) If any of the cues listed in Table 3 match the words immediately following w, increment the number of times that w appears to occur in the corresponding frame.</Paragraph> <Paragraph position="2"> 3. Combine the data for the stem form and the -ing form.</Paragraph> <Paragraph position="3"> Table 4 shows an alphabetically contiguous portion of the observations table that results from applying this procedure to the Brown Corpus (untagged). Each row represents data collected from one pair of words, including both the -ing form and the stem form. The first column, titled V, represents the total number of times the word occurs in positions where it could be functioning as a verb. Each subsequent column represents a single frame. The number appearing in each row and column represents the number of times that the row's verb cooccurred with cues for the column's frame. Zeros are omitted. Thus recall and recalling occurred a combined total of 42 times, excluding those occurrences that followed determiners or prepositions. 
Three of those occurrences were followed by a cue for a single NP argument and four were followed by cues for a tensed clause argument.</Paragraph> <Paragraph position="4"> Table 5. Judgments based on the observations in Table 4, made by the method of Section 3.</Paragraph> <Paragraph position="5"> recall: NP, cl; recognize: NP, cl; recover: NP; refuse: inf. The cues are fairly rare, so verbs in Table 4 that occur fewer than 15 times tend not to occur with these cues at all. Further, these cues occur fairly often in structures other than those they are designed to detect. For example, record, recover, and refer all occurred with cues for an infinitive, although none of them in fact takes an infinitive argument. The sentences responsible for these erroneous observations are: (2) (a) But I shall campaign on the Meyner record to meet the needs of the years ahead.</Paragraph> <Paragraph position="6"> (b) Sposato needed a front, some labor stiff with a clean record to act as business agent of the Redhook local.</Paragraph> <Paragraph position="7"> (c) Then last season the Birds tumbled as low as 11-18 on May 19 before recovering to make a race of it and total 86 victories. (d) But I suspect that the old Roman was referring to change made under military occupation--the sort of change which Tacitus was talking about when ....</Paragraph> <Paragraph position="8"> In (2a,b) record occurs as a noun. In (2c) recover is a verb but the infinitive VP, to make a race of it ..., does not appear to be an argument. In any case, it does not bear the same relation to the verb as the infinitive arguments of verbs like try, want, hope, ask, and refuse. In (2d) refer is a verb but to change is a PP rather than an infinitive. The remainder of this paper describes and evaluates a method for making judgments about the ability of verbs to appear in particular syntactic frames on the basis of noisy data like that of Table 4. 
Given the data in Table 4, that method yields the judgments in Table 5.</Paragraph> </Section> </Section> <Section position="5" start_page="248" end_page="255" type="metho"> <SectionTitle> 3. Statistical Modeling </SectionTitle> <Paragraph position="0"> As noted above, the correspondence between syntactic structure and the cues that Lerner uses is not perfect. Mismatches between cue and structure are problematic because naturally occurring language provides no negative evidence. If a verb V is followed by a cue for some syntactic frame S, that provides evidence that V does occur in frame S, but there is no analogous source of evidence that V does not occur in frame S.</Paragraph> <Paragraph position="1"> The occurrence of mismatches between cue and structure can be thought of as a random process where each occurrence of a verb V has some non-zero probability of being followed by a cue for a frame S, even if V cannot in fact occur in S. If this model is accurate, the more times V occurs, the more likely it is to occur at least once with a cue for S. The intransitive verb arrive, for example, will eventually occur with a cue for an NP argument, if enough text is considered. A learner that considers a single occurrence of a verb followed by a cue to be conclusive evidence will eventually come to the false conclusion that arrive is transitive. In other words, the information provided by the cues will eventually be washed out by the noise. This problem is inherent in learning from naturally occurring language, since infallible parsing is not possible. The only way to prevent it is to consider the frequency with which each verb occurs with cues for each frame, that is, to treat each occurrence of V without a cue for S as a small bit of evidence against V being able to occur in frame S. 
This section describes a statistical technique for weighing such evidence.</Paragraph> <Paragraph position="2"> Given a syntactic frame S, the statistical model treats each verb V as analogous to a biased coin and each occurrence of V as analogous to a flip of that coin. An occurrence that is followed by a cue for S corresponds to one outcome of the coin flip, say heads; an occurrence without a cue for S corresponds to tails. 4 If the cues were perfect predictors of syntactic structure then a verb V that does not in fact occur in frame S would never appear with cues for S; the coin would never come up heads.</Paragraph> <Paragraph position="3"> Since the cues are not perfect, such verbs do occur with cues for S. The problem is to determine when a verb occurs with cues for S often enough that all those occurrences are unlikely to be errors.</Paragraph> <Paragraph position="4"> In the following discussion, a verb that in fact occurs in frame S in the input is described as a +S verb; one that does not is described as a -S verb. The statistical model is based on the following approximation: for fixed S, all -S verbs have equal probability of being followed by a cue for S. Let π-s stand for that probability. π-s may vary from frame to frame, but not from verb to verb. Thus, errors might be more common for tensed clauses than for NPs, but the working hypothesis is that all intransitives, such as saunter and arrive, are about equally likely to be followed by a cue for an NP argument. If the error probability π-s were known, then we could use the standard hypothesis testing method for binomial frequency data. For example, suppose π-s = .05, so that on average one in twenty occurrences of a -S verb is followed by a cue for S. If some verb V occurs 200 times in the corpus, and 20 of those occurrences are followed by cues for S, that ought to suggest that V is unlikely to have probability .05 of being followed by a cue for S, and hence V is unlikely to be -S. 
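The reasoning in this example can be checked numerically. The following is a minimal Python sketch, assuming the standard binomial tail probability; the names `binomial_tail` and `looks_plus_s` and the default threshold of .02 are illustrative choices, not part of Lerner.

```python
from math import comb


def binomial_tail(m, n, p):
    """Probability of m or more heads in n flips of a coin with bias p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m, n + 1))


def looks_plus_s(m, n, error_rate, threshold=0.02):
    """Reject the -S hypothesis when m or more cues would be too improbable under it."""
    return threshold > binomial_tail(m, n, error_rate)


# 20 cues in 200 occurrences is very unlikely under a 5% error rate,
# whereas 2 cues in 20 occurrences is unremarkable.
print(looks_plus_s(20, 200, 0.05), looks_plus_s(2, 20, 0.05))  # prints: True False
```

The same proportion of cues (one in ten) therefore leads to opposite conclusions depending on how often the verb occurs, which is why raw cooccurrence rates alone are not used.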
Specifically, the chance of flipping 20 or more heads out of 200 tosses of a coin with a five percent chance of coming up heads each time is less than three in 1000. On the other hand, it is not all that unusual to flip 2 or more heads out of 20 on such a coin; it happens about one time in four. If a verb occurs 20 times in the corpus and 2 of those occurrences are followed by cues for S, it is quite possible that V is -S and that the 2 occurrences with cues for S are explained by the five percent error rate on -S verbs.</Paragraph> <Paragraph position="5"> The next section reviews the hypothesis-testing method and gives the formulas for computing the probabilities of various outcomes of coin tosses, given the coin's bias.</Paragraph> <Paragraph position="6"> It also provides empirical evidence that, for some values of π-s, hypothesis-testing does a good job of distinguishing +S verbs from -S verbs that occur with cues for S because of mismatches between cue and structure. The following section proposes a method for estimating π-s and provides empirical evidence that its estimates are nearly optimal.</Paragraph> <Section position="1" start_page="249" end_page="252" type="sub_section"> <SectionTitle> 3.1 Hypothesis Testing </SectionTitle> <Paragraph position="0"> The statistical component of Lerner is designed to prevent the information provided by the cues from being washed out by the noise. The basic approach is hypothesis testing on binomial frequency data (Kalbfleisch 1985). Specifically, a verb V is shown to be +S by assuming that it is -S and then showing that if this were true, the observed pattern of cooccurrence of V with cues for S would be extremely unlikely.</Paragraph> <Paragraph position="1"> 4 Given a verb V, the outcomes of the coins for different S's are treated as approximately independent, even though they cannot be perfectly independent. Their dependence could be modeled using a multinomial rather than a binomial model, but the experimental data suggest that this is unnecessary.</Paragraph> <Paragraph position="2"> We need to estimate the probability π-s that an occurrence of a verb V will be followed by a cue for S if V is -S. In this section it is assumed that π-s is known. The next section suggests a means of estimating π-s. In both sections it is also assumed that for each +S verb, V, the probability that V will be followed by a cue for S is greater than π-s. Other than that, no assumptions are made about the probability that a +S verb will be followed by a cue for S. For example, two verbs with transitive senses, such as cut and walk, may have quite different frequencies of cooccurrence with cues for NP. It does not matter what these frequencies are as long as they are greater than π-NP. If a coin has probability p of flipping heads, and if it is flipped n times, the probability of its coming up heads exactly m times is given by the binomial distribution: $P(m, n, p) = \frac{n!}{m!\,(n-m)!}\, p^{m} (1-p)^{n-m}$ (1)</Paragraph> <Paragraph position="4"> The probability of coming up heads m or more times is given by the obvious sum: $P(m+, n, p) = \sum_{i=m}^{n} P(i, n, p)$ (2)</Paragraph> <Paragraph position="6"> Analogously, P(m+, n, π-s) gives the probability that m or more occurrences of a -S verb V will be followed by a cue for S out of n occurrences total.</Paragraph> <Paragraph position="7"> If m out of n occurrences of V are followed by cues for S, and if P(m+, n, π-s) is quite small, then it is unlikely that V is -S. Traditionally, a threshold less than or equal to .05 is set such that a hypothesis is rejected if, assuming the hypothesis were true, the probability of outcomes as extreme as the observed outcome would be below the threshold. The confidence attached to this conclusion increases as the threshold decreases.</Paragraph> <Paragraph position="8"> 3.1.2 Experiment. 
The experiment presented in this section is aimed at determining how well the method presented above can distinguish +S verbs from -S verbs. Let p-s be an estimate of π-s. It is conceivable that P(m+, n, p-s) might not be a good predictor of whether or not a verb is +S, regardless of the estimate p-s. For example, if the correspondence between the cues and the structures they are designed to detect were quite weak, then many -S verbs might have lower P(m+, n, p-s) than many +S verbs. This experiment measures the accuracy of binomial hypothesis testing on the data collected by Lerner's cues as a function of p-s. In addition to showing that P(m+, n, p-s) is good for distinguishing +S and -S verbs, these data provide a baseline against which to compare methods for estimating the error rate π-s.</Paragraph> <Paragraph position="9"> Method. The cues described in Section 2 were applied to the Brown Corpus (untagged version). Equation 2 was applied to the resulting data with a cutoff of P(m+, n, p-s) &lt; .02 and p-s varying between 2^-5 (1 error in every 32 occurrences) and 2^-13 (1 error in every 8192 occurrences). The resulting judgments were compared to the blind judgments of a single judge. One hundred ninety-three distinct verbs were chosen at random from the tagged version of the Brown Corpus for comparison.</Paragraph> <Paragraph position="10"> Common verbs are more likely to be included in the test sample than rare verbs, but no verb is included more than once. Each verb was scored for a given frame only if it cooccurs with a cue for that frame at least once. Thus, although 193 verbs were randomly selected from the corpus for scoring, only the 63 that cooccur with a cue for tensed clause at least once were scored for the tensed-clause frame. This procedure makes it possible to evaluate the hypothesis-testing method on data collected by the cues, rather than evaluating the cues per se. 
It also makes the judgment task much easier: it is not necessary to determine whether a verb can appear in a frame in principle, only whether it does so in particular sentences. There were, however, five cases where the judgments were unclear. These five were not scored. See Appendix C for details.</Paragraph> <Paragraph position="11"> Results. The results of these comparisons are summarized in Table 6 (tensed clause) and Table 7 (infinitive). Each row shows the performance of the hypothesis-testing procedure for a different estimate p-s of the error rate π-s. The first column shows the negative base-2 logarithm of p-s, which is varied from 5 (1 error in 32 occurrences) to 13 (1 error in 8192 occurrences). The second column shows p-s in decimal notation.</Paragraph> <Paragraph position="12"> The next four columns show the number of true positives (TP), verbs judged +S both by machine and by hand; false positives (FP), verbs judged +S by machine but -S by hand; true negatives (TN), verbs judged -S both by machine and by hand; and false negatives (FN), verbs judged -S by machine but +S by hand. The numbers represent distinct verbs, not occurrences. The seventh column shows the number of verbs that were misclassified (MC), the sum of false positives and false negatives. The eighth column shows the percentage of verbs that were misclassified (%MC). The next-to-last column shows the precision (PRE), the true positives divided by all verbs that Lerner judged to be +S. The final column shows the recall (REC), the true positives divided by all verbs that were judged +S by hand.</Paragraph> <Paragraph position="13"> Discussion. For verbs taking just a tensed clause argument, Table 6 shows that, given the right estimate p-s of π-s, it is possible to classify these 63 verbs with only 1 false positive and 8 false negatives. If the error rate were ignored or approximated as zero then the false positives would go up to 19. 
On the other hand, if the error rate were taken to be as high as 1 in 2^5 then the false negatives would go up to 20. In this case, the sum of both error types is minimized with p-cl between 2^-8 and 2^-10. Table 7 shows similar results for verbs taking just an infinitive argument, where misclassifications are minimized with p-inf = 2^-7.</Paragraph> </Section> <Section position="2" start_page="252" end_page="255" type="sub_section"> <SectionTitle> 3.2 Estimating the Error Rate </SectionTitle> <Paragraph position="0"> As before, assume that an occurrence of a -S verb is followed by a cue for S with probability π-s. Also as before, assume that for each +S verb V, the probability that an occurrence of V is followed by a cue for S is greater than π-s.</Paragraph> <Paragraph position="1"> It is useful to think of the verbs in the corpus as analogous to a large bag of coins with various biases, or probabilities of coming up heads. The only assumption about the distribution of biases is that there is some definite but unknown minimum bias π-s. 5 Determining whether or not a verb appears in frame S is analogous to determining, for some randomly selected coin, whether its bias is greater than π-s.</Paragraph> <Paragraph position="2"> The only available evidence comes from selecting a number of coins at random and flipping them. The previous section showed how this can be done given an estimate of π-s.</Paragraph> <Paragraph position="3"> Suppose a series of coins is drawn at random from the bag. Each coin is flipped N times. It is then assigned to a histogram bin representing the number of times it came up heads. At the end of this sampling procedure bin i contains the number of coins that came up heads exactly i times out of N. Such a histogram is shown in Figure 1, where N = 40. If N is large enough and enough coins are flipped N times, one would expect the following: 1. 
The coins whose probability of turning up heads is π-s (the minimum) should cluster at the low-heads end of the histogram. That is, there should be some 0 ≤ j0 ≤ N such that most of the coins that turn up j0 heads or fewer have probability π-s, and, conversely, most coins with probability π-s turn up j0 heads or fewer.</Paragraph> <Paragraph position="4"> 2. Suppose j0 were known. Then the portion of the histogram below j0 should have a roughly binomial shape. In Figure 1, for example, the first eight bins have roughly the shape one would expect if j0 were 8. In contrast, the first 16 bins do not have the shape one would expect if j0 were 16: their height drops to zero for two stretches before rising significantly above zero again. Specifically, the height of the ith histogram bin should be roughly proportional to P(i, N, p-s), with N the fixed sample size and p-s an estimate of π-s.</Paragraph> <Paragraph position="5"> Figure 1. A histogram illustrating a binomially shaped distribution in the first eight bins.</Paragraph> <Paragraph position="6"> 5 If the number of coins is taken to be infinite, then the biases must be not only greater than π-s but bounded above π-s.</Paragraph> <Paragraph position="7"> 3. Suppose again that j0 were known. Then the average rate at which the coins in bins j0 or lower flip heads is a good estimate of π-s.</Paragraph> <Paragraph position="8"> The estimation procedure tries out each bin as a possible estimate of j0. Each estimate of j0 leads to an estimate of π-s and hence to an expected shape for the first j0 histogram bins. Each estimate j of j0 is evaluated by comparing the predicted distribution in the first j bins to the observed distribution; the better the fit, the better the estimate. Moving from coins to verbs, the procedure works as follows. For some fixed N, consider the first N occurrences of each verb that occurs at least N times in the input. (A uniform sample size N is needed only for estimating π-s. 
Given an estimate of π-s, verbs with any number of occurrences can be classified.) Let S be some syntactic frame and let H[i] be the number of distinct verbs that were followed by cues for S exactly i times out of N, that is, the height of the ith histogram bin. Assume that there is some 0 ≤ j0 ≤ N such that most -S verbs are followed by cues for S j0 times or fewer, and</Paragraph> <Paragraph position="10"> conversely most verbs that are followed by cues for S j0 times or fewer are -S verbs.</Paragraph> <Paragraph position="11"> For each possible estimate j of j0 there is a corresponding estimate of π-s; namely, the average rate at which verbs in the first j bins are followed by cues for S. Choosing the most plausible estimate of π-s thus comes down to choosing the most plausible estimate of j0, the boundary between the -S verbs and the rest of the histogram. To evaluate the plausibility of each possible estimate j of j0, measure the fit between the predicted distribution of -S verbs, assuming j is the boundary of the -S cluster, and the observed distribution of the -S verbs, also assuming j is the boundary of the -S cluster. Given j, let p-s stand for the average rate at which verbs in bins j or lower are followed by cues for S. The predicted distribution for -S verbs is proportional to P(i, N, p-s) for 0 ≤ i ≤ N. The observed distribution of -S verbs, assuming j is the boundary of the -S cluster, is H[i] for 0 ≤ i ≤ j and 0 for j &lt; i ≤ N. Measure the fit between the predicted and observed distributions by normalizing both to have unit area and taking the sum over 0 ≤ i ≤ N of the squares of the differences between the two distributions at each bin i.</Paragraph> <Paragraph position="12"> Table 8. Comparison of automatic classification using the Brown Corpus to hand judgments. The estimate p-s is made with N = 100. 
The probability threshold is .02.</Paragraph> <Paragraph position="13"> In pseudo-code, the procedure is as follows:</Paragraph> <Paragraph position="15"> Estimate π-s by the average cooccurrence rate for the first j bins, those presumed to hold -S verbs.</Paragraph> <Paragraph position="17"> Check the fit, assuming j is the ±S boundary. sum-of-squares := 0. For i from 0 to N: compute the predicted distribution for bin i, P := N!/(i!(N-i)!) · p-s^i · (1 - p-s)^(N-i). Verbs in bins j and below are presumed -S: if i ≤ j [...] the results for each of the six frames. Varying N between 50 and 150 results in no significant change in the estimated error rates. One way to judge the value of the estimation and hypothesis-testing methods is to examine the false positives. Three of the five false positives result from errors in verb detection that are not distributed uniformly across verbs. In particular, shock, board, and near are used more often as nonverbs than as verbs. This creates many opportunities for nonverbal occurrences of these words to be mistaken for verbal occurrences. Other verbs, like know, are unambiguous and thus are not subject to this type of error. As a result, these errors violate the model's assumption that errors are distributed uniformly across verbs and highlight the limitations of the model. The remaining false positives were touch and belong, both mistaken as taking an NP followed by a tensed clause. The belong error was caused by the capitalization of the first word of a line of poetry: I knew not what did to a friend belong / Till I stood up, true friend, by thy true side; Till was mistaken for a proper name. The touch error was caused by mistaking a matrix clause for an argument in: With the blue flesh of night touching him he stood under a gentle hill caressing the flageolet with his lips, making it whisper.</Paragraph> <Paragraph position="18"> It seems likely that such input would be much rarer in more mundane sources of text, such as newspapers of record, than in the diverse Brown Corpus.</Paragraph> <Paragraph position="19"> The results for infinitives and clauses can also be judged by comparison to the optimal classification rates from Tables 6 and 7. In both cases the classification appears to be right in the optimal range. In fact, the estimated error rate for infinitives produces a better classification than any of those shown in Table 7. (It falls at a value between those shown.) The classification of clauses and infinitives remains in the optimal range when the probability threshold is varied from .01 to .05.</Paragraph> <Paragraph position="20"> Overall the tradeoff between improved precision and reduced recall seems quite good, as compared to doing no noise reduction (p-s = 0). The only possible exception is the NP frame, where noise reduction causes 59 false negatives in exchange for preventing only 5 false positives. This is partly explained by the different prior probabilities of the different frames. Most verbs can take a direct object argument, whereas most verbs cannot take a direct object argument followed by a tensed clause argument. There is no way to know this in advance. There may be other factors as well. If the error rate for the NP cues is substantially lower than 1 out of 100, then it cannot be estimated accurately with sample size N = 100. On the other hand, if the sample size N is increased substantially there may not be enough verbs that occur N times or more in the corpus. 
So a larger corpus might improve the recall rate for NP.</Paragraph> </Section> </Section> <Section position="6" start_page="255" end_page="258" type="metho"> <SectionTitle> 4. General Discussion </SectionTitle> <Paragraph position="0"> This paper explores the possibility of using simple grammatical regularities to learn lexical syntax. The data presented in Tables 6, 7, and 8 provide evidence that it is possible to learn significant aspects of English lexical syntax in this way. Specifically, these data suggest that neither a large parser nor a large lexicon is needed to recover enough syntactic structure for learning lexical syntax. Rather, it seems that significant lexical syntactic information can be recovered using a few approximate cues along with statistical inference based on a simple model of the cues' error distributions. Michael R. Brent From Grammar to Lexicon</Paragraph> <Section position="1" start_page="256" end_page="256" type="sub_section"> <SectionTitle> 4.1 Other Syntactic Frames </SectionTitle> <Paragraph position="0"> The lexical entry of a verb can specify other syntactic frames in addition to the six studied here. In particular, many verbs take prepositional phrases (PPs) headed by a particular preposition or class of prepositions. For example, put requires a location as a second argument, and locations are often represented by PPs headed by locative prepositions.</Paragraph> <Paragraph position="1"> Extending Lerner to detect PPs is trivial. Since the set of prepositions in the language is essentially fixed, all prepositions can be included in the initial lexicon. Detecting a PP requires nothing more than detecting a preposition. The statistical model can, of course, be applied without modification.</Paragraph> <Paragraph position="2"> The problem, however, is determining which PPs are arguments and which are adjuncts. 
There are clearly cases where a prepositional phrase can occur in a clause not by virtue of the lexical entry of the verb but rather by virtue of nonlexical facts of English syntax. For instance, almost any verb can occur with a temporal PP headed by on, as in John arrived on Monday. Such PPs are called adjuncts. On the other hand, the sense of on in John sprayed water on the ceiling is quite different. This sense, it can be argued, is available only because the lexical entry of spray specifies a location argument that can be realized as a PP. If anything significant is to be learned about individual words, the nonspecific cooccurrences of verbs with PPs (adjuncts) must be separated from the specific ones (arguments). It is not clear how a machine learning system could do this, although frequency might provide some clue. Worse, however, there are many cases in which even trained linguists lack clear intuitions. Despite a number of attempts to formulate necessary and sufficient conditions for the argument/adjunct distinction (e.g., Jackendoff 1977), there are many cases for which the various criteria do not agree or the judgments are unclear (Adams and Macfarland 1991). Thus, the Penn Treebank does not make the argument/adjunct distinction because its judges do not agree often enough. Until a useful definition that trained humans can agree on is developed, it would seem fruitless to attempt machine learning experiments in this domain.</Paragraph> </Section> <Section position="2" start_page="256" end_page="257" type="sub_section"> <SectionTitle> 4.2 Limitations of the Statistical Model </SectionTitle> <Paragraph position="0"> Although the results of this study are generally encouraging, they also point to some limitations of the statistical model presented here. First, it does not take into account variation in the percentage of verbs that can appear in each frame. 
For example, most verbs can take an NP argument, while very few can take an NP followed by a tensed clause. This results in too few verbs being classified as +NP and too many being classified as +NPcl, as shown in Table 8. Second, it does not take into account the fact that for some words with verbal senses most of their occurrences are verbal, whereas for others most of their occurrences are nonverbal. For example, operate occurs exclusively as a verb while board occurs much more often as a noun than as a verb.</Paragraph> <Paragraph position="1"> Since the cues are based on the assumption that the word in question is a verb, board presents many more opportunities for error than operate. This violates the assumption that the probability of error for a given frame is approximately uniform across verbs.</Paragraph> <Paragraph position="2"> Table 9. Distribution of occurrences among morphological forms in the Brown Corpus. The ambiguous words board and project show a pattern of distribution distinct from that of the unambiguous verbs operate and follow.
project 52      board 111     operate 48      follow 76
projects 54     boards 31     operates 15     follows 72
projected 10    boarded 3     operated 26     followed 150
projecting 5    boarding 1    operating 55    following 97
These limitations do not constitute a major impediment to applications of the current results. For example, an applied system can be provided with the rough estimates that 80-95 percent of verbs take a direct object, while 1-2 percent take a direct object followed by a tensed clause. Such estimates can be expected to reduce misclassification significantly. Further, an existing dictionary could be used to &quot;train&quot; a statistical model on familiar verbs. A trained system would probably be more accurate in classifying new verbs. 
Finally, the lexical ambiguity problem could probably be reduced substantially in the applied context by using a statistical tagging program (Brill 1992; Church 1988).</Paragraph> <Paragraph position="3"> For addressing basic questions in machine learning of natural language, the solutions outlined above are not attractive. All of those solutions provide the learner with additional specific knowledge of English, whereas the goal for the machine learning effort should be to replace specific knowledge with general knowledge about the types of regularities to be found in natural language.</Paragraph> <Paragraph position="4"> There is one approach to the lexical ambiguity problem that does not require giving the learner additional specific knowledge. The problem is as follows: words that occur frequently as, say, nouns are likely to have a different error rate from unambiguous verbs. If it were known which words occur primarily as verbs and which occur primarily as nouns, then separate error rate estimates could be made for each. This would reduce the rate of false positive errors even without any further information about which particular occurrences are nominal and which are verbal. One way to distinguish primarily nominal words from primarily verbal words is by the relative frequencies of their various inflected forms. For example, Table 9 shows the contrast in the distribution of inflected forms between project and board on the one hand and operate and follow on the other. Project and board are two words whose frequent occurrence as nouns has caused Lerner to make false positive errors. In both cases, the stem and -s forms are much more common than the -ed and -ing forms. Compare this to the distribution for the unambiguous verbs operate and follow. In these cases the diversity of frequencies is much lower and does not display the characteristic pattern of a word that occurs primarily as a noun: -ing and -ed forms that are much rarer than the -s and stem forms. 
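The contrast in Table 9 can be made concrete with a small sketch. The Python fragment below computes, for each of the four words, the share of its occurrences that are -ed or -ing forms and flags a word as primarily nonverbal when that share is low. The function names, the dictionary layout, and the 0.25 cutoff are assumptions introduced here for illustration; they are not part of Lerner, and the cutoff was chosen only because it separates these four words.

```python
# Counts of inflected forms in the Brown Corpus, copied from Table 9.
TABLE_9 = {
    "project": {"stem": 52, "s": 54, "ed": 10, "ing": 5},
    "board": {"stem": 111, "s": 31, "ed": 3, "ing": 1},
    "operate": {"stem": 48, "s": 15, "ed": 26, "ing": 55},
    "follow": {"stem": 76, "s": 72, "ed": 150, "ing": 97},
}

# Hypothetical cutoff, picked only to separate the four words above.
CUTOFF = 0.25


def verbal_share(counts):
    """Return the fraction of occurrences that are -ed or -ing forms."""
    total = sum(counts.values())
    return (counts["ed"] + counts["ing"]) / total


def primarily_nonverbal(counts, cutoff=CUTOFF):
    """Flag words whose -ed and -ing forms are comparatively rare."""
    return cutoff > verbal_share(counts)


for word, counts in TABLE_9.items():
    label = "ambiguous" if primarily_nonverbal(counts) else "verbal"
    print(f"{word}: {verbal_share(counts):.2f} {label}")
```

On the Table 9 counts the share is about 0.12 for project and 0.03 for board, against about 0.56 for operate and 0.63 for follow, so any cutoff between the two groups separates them. A usable cutoff would have to be estimated from many more words than these four.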
Similar characteristic patterns exist for words that occur primarily as adjectives. Recognizing such ambiguity patterns automatically would allow a separate error rate to be estimated for the highly ambiguous words.</Paragraph> </Section> <Section position="3" start_page="257" end_page="258" type="sub_section"> <SectionTitle> 4.3 Future Work </SectionTitle> <Paragraph position="0"> From the perspective of computational language acquisition, a natural direction in which to extend this work is to develop algorithms for learning some of the specific knowledge that was programmed into the system described above. Consider the morphological adjustment rules according to which, for example, the final &quot;e&quot; of bite is deleted when the suffix -ing is added, yielding biting rather than &quot;biteing.&quot; Lerner needs to know such rules in order to determine whether or not a given word occurs both with and without the suffix -ing. Experiments are under way on an unsupervised procedure that learns such rules from English text, given only the list of English verbal suffixes. This work is being extended further in the direction of discovering the morphemic suffixes themselves and discovering the ways in which these suffixes alternate in paradigms. The short-term goal is to develop algorithms that can learn the rules of inflection in English starting from only a corpus and a general notion of the nature of morphological regularities.</Paragraph> <Paragraph position="1"> Ultimately, this line of inquiry may lead to algorithms that can learn much of the grammar of a language starting with only a corpus and a general theory of the kinds of formal regularities to be found in natural languages. Some elements of syntax may not be learnable in this way (Lightfoot 1991), but the lexicon, morphology, and phonology together make up a substantial portion of the grammar of a language. 
If it does not prove possible to learn these aspects of grammar starting from a general ontology of linguistic regularities and using distributional analysis, then that, too, is an interesting result. It would suggest that the task requires a more substantive initial theory of possible grammars, or some semantic information about input sentences, or both. In any case, this line of inquiry promises to shed light on the nature of language, learning, and language learning.</Paragraph> </Section> </Section> </Paper>