<?xml version="1.0" standalone="yes"?> <Paper uid="J94-4005"> <Title>Training and Scaling Preference Functions for Disambiguation Hiyan Alshawi * AT&T Bell Laboratories</Title> <Section position="3" start_page="0" end_page="638" type="metho"> <SectionTitle> 2. The Experimental Setup </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="636" type="sub_section"> <SectionTitle> Disambiguation Task </SectionTitle> <Paragraph position="0"> All the experiments we describe here were done with the Core Language Engine (CLE), a primarily rule-based natural language-processing system (Alshawi 1992).</Paragraph> <Paragraph position="1"> More specifically, the work on optimizing preference factors and semantic collocations was done as part of a project on spoken language translation in which the CLE was used for analysis and generation of both English and Swedish (Agnäs et al. 1993).</Paragraph> <Paragraph position="2"> The work presented here is all concerned with the English analysis side, though we see no reason why its conclusions should not be applicable to Swedish or other natural languages.</Paragraph> <Paragraph position="3"> In our experiments we made use of the Air Travel Information System (ATIS) corpus of transcribed speech sentences. This application was chosen because the proposed method for automatic derivation of scaling factors requires a corpus of sentences that are representative of the sublanguage, together with some independent measure of the correctness or plausibility of analyses of these sentences. In addition, we had access to a hand-parsed subcollection of the ATIS corpus built as part of the Penn Treebank project (Marcus, Santorini, and Marcinkiewicz 1993). 
Another reason for choosing ATIS was that it consists of several thousand sentences in a constrained discourse domain, which helped avoid sparseness problems in training collocation functions. In the various experiments, the alternatives we are choosing between are analyses expressed in the version of quasi logical form (QLF) described by Alshawi and Crouch (1992). QLFs express semantic content, but are derived compositionally from complete syntactic analyses of a sentence and therefore mirror much syntactic structure as well. However, the use of QLF analyses is not central to our method: the important thing is that the representation used is rich enough to support a variety of preference functions. We have experimented with combinations of around 30 different functions and use 20 of them in our spoken language translation system; the others contribute so little to overall performance that their computational cost cannot be justified. This default set of 20 was used throughout the scaling factor work described in Sections 3 and 4. It consists of 1 collocation-based function and 19 non-collocation-based ones. The work described in Section 6 involved substituting single alternative collocation-based functions for the single one in the set of 20.</Paragraph> <Paragraph position="4"> Many (unscaled) preference functions simply return integers corresponding to counts of particular constructs in the representation, such as the number of expressions corresponding to adjuncts, unresolved ellipses, particular attachment configurations, or balanced conjunctions. 
There are also some real-valued functions, including the semantic collocation functions discussed in Section 5.</Paragraph> <Paragraph position="5"> To illustrate how the system works, consider the ATIS sentence &quot;Do I get dinner on this flight?&quot; The CLE assigns two analyses to this sentence; in one of them, QH, &quot;on this flight&quot; attaches high to &quot;get,&quot; and in the other, QL, it attaches low to &quot;dinner.&quot; Four functions return non-zero scores on these analyses. Two of them, Low1 and Low2, prefer low attachment; the difference between them is an implementation detail that can be ignored here. A third, SynRules, returns an estimate of the log probability of the syntactic rules used to construct the analysis. A fourth, SemColl, is a semantic collocation function. The scores, after multiplying by scaling factors, are as shown in Table 1. Because the SemColl function has a relatively large scaling factor, it is able to override the other three, which all prefer QL for syntactic reasons.</Paragraph> </Section> <Section position="2" start_page="636" end_page="637" type="sub_section"> <SectionTitle> 2.1 Training Data </SectionTitle> <Paragraph position="0"> The Penn Treebank contains around 650 ATIS trees, which we used during initial development of training and optimization software. Some of the results in these initial trials were encouraging, but most appeared to be below reasonable thresholds of statistical significance. So, we concluded that it was worthwhile to produce more training data. 1 The hand-parsed sub-corpus was that on the ACL DCI CD-ROM 1 of September 1991. The larger corpus, used for the bulk of the work reported here, consisted of 4615 class A and D sentences from the ATIS-2 training corpus. These were all such sentences of up to 15 words that we had access to at the time, excluding a set of randomly selected sentences that were set aside for other testing purposes. 
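The selection step in the example above can be sketched as follows. The function names (Low1, Low2, SynRules, SemColl) come from the text, but every score and scaling factor below is invented for illustration; the paper's actual table values are not reproduced here.

```python
# Hedged sketch: pick the analysis with the highest weighted sum of
# (unscaled) preference-function scores. All numbers are hypothetical.

def preference_score(raw_scores, scaling_factors):
    """Weighted sum of raw preference-function scores."""
    return sum(scaling_factors[name] * value
               for name, value in raw_scores.items())

scaling = {"Low1": 1.0, "Low2": 1.0, "SynRules": 1.0, "SemColl": 5.0}
# Hypothetical raw scores for the two analyses of
# "Do I get dinner on this flight?"
q_high = {"Low1": 0.0, "Low2": 0.0, "SynRules": -4.0, "SemColl": 1.2}
q_low = {"Low1": 1.0, "Low2": 1.0, "SynRules": -3.5, "SemColl": 0.1}

best = max([("QH", q_high), ("QL", q_low)],
           key=lambda pair: preference_score(pair[1], scaling))
print(best[0])  # QH: the heavily scaled SemColl score outweighs the rest
```

Because SemColl carries a large scaling factor, its modest raw preference for the high attachment dominates the three syntactically motivated scores.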
For this purpose, we developed a semiautomatic mechanism for producing skeletal constituent structure trees directly from QLF analyses proposed by our analyser. To make these trees compatible with the treebank and to make them relatively insensitive to minor changes in semantic analysis, these QLF-induced trees consist simply of nested constituents with two categories, A (argument) and P (predication), corresponding to constituents induced by QLF term and form expressions, respectively. The tree for the example sentence used above is as follows:</Paragraph> <Paragraph position="2"> The interactive software for producing the trees proposes constituents for confirmation by a user and takes into account answers given, to minimize the number of interactive choices necessary. Of the 4615 sentences in our training set, the CLE produced an acceptable constituent structure for 4092 (about 89%). A skeletal tree for each of these 4092 sentences was created in this way and used in the various experiments whose results are described below. We do not directly address here the problems of applying preference functions to select the best analysis when none is completely correct; we assume, based on our experience with the spoken language translator, that functions and scaling factors trained on cases for which a completely correct analysis exists will also perform fairly well on cases for which one does not.</Paragraph> </Section> <Section position="3" start_page="637" end_page="638" type="sub_section"> <SectionTitle> 2.2 Training Score </SectionTitle> <Paragraph position="0"> Employing treebank analyses in the training process required defining a measure of the &quot;degree of correctness&quot; of a QLF analysis under the assumption that the phrase-structure analysis in the treebank is correct. At first sight this might appear difficult, in that QLF is a logical formalism, but in fact it preserves much of the geometry of constituent structure. 
Specifically, significant (typically BAR-2 level) constituents tend to give rise to term (roughly argument) or form (roughly predication) QLF subexpressions, though the details do not matter here. It is thus possible to associate segments of the input with such QLF subexpressions and to check whether such a segment is also present as a constituent in the treebank analysis. The issues raised by measuring QLF correctness in terms of agreement with structures containing less information than those QLFs are discussed further at the end of Section 4.</Paragraph> <Paragraph position="1"> The training score functions we considered for a QLF q with respect to a treebank tree t were functions of the form score(q, t) = a1|Q ∩ T| - a2|Q \ T| - a3|T \ Q|, where Q is the set of string segments induced by the term and form expressions of q; T is the set of constituents in t; a1, a2, and a3 are positive constants; and the &quot;\&quot; operator denotes set difference. The idea is to reward the QLF for constituents in common with the treebank and to penalize it for differences. Trial and error led us to choose a1=1, a2=10, a3=0, which penalizes hallucination of incorrect constituents (modeled by |Q \ T|) more heavily than a shortfall in completeness (modeled by |Q ∩ T|). These constants were fixed before we carried out the experiments whose results are presented below.</Paragraph> <Paragraph position="2"> The explanation for setting a3 to 0 was that trees in the Penn Treebank contain many constituents that do not correspond to QLF form or term expressions; we had to avoid penalizing QLF analyses simply because the treebank uses a different kind of linguistic representation. For QLF-induced trees, in which the correspondence is one to one, it is also reasonable to set a3 to 0 because when |T \ Q| is non-zero, |Q ∩ T| tends to be non-maximal. 
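A minimal sketch of this training score, assuming constituents are represented as (start, end) word spans (the span encoding is our assumption, not the paper's):

```python
# Training score: reward constituents shared with the treebank (a1),
# penalize hallucinated constituents (a2), and optionally penalize
# missed treebank constituents (a3, set to 0 as in the text).

def training_score(q_spans, t_spans, a1=1, a2=10, a3=0):
    Q, T = set(q_spans), set(t_spans)
    return a1 * len(Q & T) - a2 * len(Q - T) - a3 * len(T - Q)

tree = {(0, 2), (3, 5), (0, 5)}
qlf = {(0, 2), (2, 5), (0, 5)}  # hallucinates (2, 5), misses (3, 5)
print(training_score(qlf, tree))  # 2 shared - 10 * 1 hallucinated = -8
```

With a2 = 10, a single hallucinated constituent costs far more than a shared constituent earns, matching the paper's bias against incorrect structure.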
Among the 4092 sentences for which skeletal trees were derived, there were only 5 with alternative QLFs for which the training score value was the same with a3 = 0 but would be different if a3 were non-zero.</Paragraph> </Section> </Section> <Section position="4" start_page="638" end_page="641" type="metho"> <SectionTitle> 3. Computing Scaling Factors </SectionTitle> <Paragraph position="0"> When we first implemented a disambiguation mechanism of the kind described above, an initial set of scaling factors was chosen by hand according to knowledge of how the particular raw preference functions were computed and introspection about the &quot;strength&quot; of the functions as indicators of preference. These initial scaling factors were subsequently revised according to their observed behavior in ranking analyses, eventually leading to reasonably well-behaved rankings.</Paragraph> <Paragraph position="1"> However, as suggested earlier, there are a number of disadvantages to manual tuning of scaling factors. These include the effort spent in maintaining the parameters. This effort is greater for those with less knowledge of how the raw preference functions are computed, since this increases the effort for trial-and-error tuning. A point of diminishing returns is also reached, after which further attempts at improvement through hand tuning often turn out to be counterproductive. Another problem was that it became difficult to detect preference functions that were ineffective, or simply wrong, if they were given sufficiently low scaling factors. Probably a more serious problem is that the contributions of different preference functions to selecting the most plausible analyses seem to vary from one sublanguage to another. 
These disadvantages point to the need for automatic procedures to determine scaling factors that optimize preference function rankings for a particular sublanguage.</Paragraph> <Paragraph position="2"> In our framework, a numerical &quot;preference score&quot; is computed for each of the alternative analyses, and the analyses are ranked according to this score. As mentioned earlier, the preference score is a weighted sum of a set of preference functions: each preference function fj takes a complete QLF representation qi as input, returning a numerical score sij, the overall preference score being computed by summing the products of function scores with their associated scaling factors cj: c1si1 + ... + cmsim</Paragraph> <Section position="1" start_page="639" end_page="639" type="sub_section"> <SectionTitle> 3.1 Collection Procedure </SectionTitle> <Paragraph position="0"> The training process begins by analyzing the corpus sentences and computing, for each analysis of each sentence, the training score of the analysis with respect to the manually approved skeletal tree and the (unscaled) values of the preference functions applied to that analysis.</Paragraph> <Paragraph position="1"> One source of variation in the data that we want to ignore in order to derive scaling factors appropriate for selecting QLFs is the fact that preference function values for an analysis often reflect characteristics shared by all analyses of a sentence, as much as the differences between alternative analyses. For example, a function that counts the occurrences of certain constructs in a QLF will tend to give higher values for QLFs for longer sentences. In the limit, one can imagine a function φ that, for an N-word sentence, returned a value of N + G for a QLF with training score G with respect to the skeletal tree. 
Such a function, if it existed, would be extremely useful, but (if sentence length were not also considered) would not be a particularly accurate predictor of the QLF training score.</Paragraph> <Paragraph position="2"> To discount irrelevant intersentence variability, both the training score with respect to the skeletal tree and all the preference function scores are therefore relativized by subtracting from them the corresponding values for the analysis of that sentence which best matches the skeletal tree. If the best match is shared by several analyses, the average for those analyses is subtracted. The relativized training score is the distance function with respect to which the first stage of scaling factor calculation takes place. It can be seen that the relativized results of our hypothetical preference function φ are a perfect predictor of the relativized training score. Consider, for example, a six-word sentence with three QLFs, two of which, q1 and q2, have completely correct skeletal tree structures and the third of which, q3, does not. Suppose also that the training scores and the scores assigned by preference functions φ, f1, and f2 are as follows:</Paragraph> </Section> <Section position="2" start_page="639" end_page="639" type="sub_section"> <SectionTitle> 3.2 Least Squares Calculation </SectionTitle> <Paragraph position="0"> An initial set of scaling factors is calculated in a straightforward analytic way by approximating gi, the relativized training score of qi, by Σj cjzij, where cj is the scaling factor for preference function fj and zij is the relativized score assigned to qi by fj. 
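The relativization step described above can be sketched as follows, assuming each analysis is represented as a row of preference-function scores paired with a training score:

```python
# Subtract, from every analysis's scores, the corresponding values for
# the analysis that best matches the skeletal tree; when several
# analyses tie for best match, subtract their average instead.

def relativize(rows, training_scores):
    best = max(training_scores)
    tied = [i for i, g in enumerate(training_scores) if g == best]
    ref = [sum(rows[i][j] for i in tied) / len(tied)
           for j in range(len(rows[0]))]
    rel_rows = [[v - ref[j] for j, v in enumerate(row)] for row in rows]
    rel_scores = [g - best for g in training_scores]
    return rel_rows, rel_scores

rows = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
scores = [5, 5, 3]  # the first two analyses tie for best match
rel_rows, rel_scores = relativize(rows, scores)
print(rel_rows, rel_scores)
```

After relativization the best-matching analyses have a training score of 0 and all others are negative, so only within-sentence differences remain.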
We vary the values of cj to minimize the sum, over all QLFs for all training sentences, of the squares of the errors in the approximation, Σi (gi - Σj cjzij)2. Defining the error function as a sum of squares of differences in this way means that the minimum error is attained when the derivative with respect to each ck, -2 Σi zik(gi - Σj cjzij), is zero. These linear simultaneous equations, one for each of c1 ... cm, can be solved by Gaussian elimination. (For a full explanation of this standard technique, see Moore and McCabe 1989, pp. 174ff and 680ff.) This least squares set of scaling factors achieves quite good disambiguation performance (see Section 4) but is not truly optimal because of the inherent nonlinearity of the goal, which is to maximize the proportion of sentences for which a correct QLF is selected, rather than to approximate training scores (even relativized ones). Suppose that a function M has a tendency to give high scores to correct QLFs when the contributions of other functions do not clearly favor any QLF, but that M tends to perform much less well when other functions come up with a clear choice. Then increasing the scaling factor on M from the least squares value will tend to improve system performance even though the sum of squares of errors is increased; M's tendency to perform well just when it is important to do so should be rewarded.</Paragraph> </Section> <Section position="3" start_page="639" end_page="641" type="sub_section"> <SectionTitle> 3.3 Iterative Scaling Factor Adjustment </SectionTitle> <Paragraph position="0"> The least squares scaling factors are therefore adjusted iteratively by a hill-climbing procedure that directly examines the QLF choices they give rise to on the training corpus. 
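The least squares stage of Section 3.2 can be sketched as follows; Z holds the relativized function scores zij, g the relativized training scores gi, and the normal equations are solved by Gaussian elimination as in the text.

```python
# Solve the normal equations (Z^T Z) c = Z^T g for the scaling factors c.

def least_squares_factors(Z, g):
    m = len(Z[0])
    A = [[sum(row[j] * row[k] for row in Z) for k in range(m)]
         for j in range(m)]
    b = [sum(row[j] * gi for row, gi in zip(Z, g)) for j in range(m)]
    for col in range(m):  # forward elimination with partial pivoting
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for k in range(col, m):
                A[r][k] -= f * A[col][k]
            b[r] -= f * b[col]
    c = [0.0] * m
    for j in reversed(range(m)):  # back substitution
        c[j] = (b[j] - sum(A[j][k] * c[k] for k in range(j + 1, m))) / A[j][j]
    return c

Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
g = [2.0, -1.0, 1.0, 3.0]  # generated from true factors [2, -1]
print(least_squares_factors(Z, g))  # [2.0, -1.0]
```

On this noise-free toy data the solver recovers the generating factors exactly; on real relativized scores it minimizes the residual sum of squares instead.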
Scaling factors are altered one at a time in an attempt to locally optimize2 the number of correct disambiguation decisions, i.e., the number of training sentences for which a QLF with a correct skeletal tree receives the highest score.</Paragraph> <Paragraph position="1"> A step in the iteration involves calculating the effect of an alteration to each factor in turn.3 If factors ck, k ≠ j, are held constant, it is easy to find a set (possibly empty) of real-valued intervals [uij, vij] such that a correct choice will be made on sentence i if uij < cj ≤ vij. By collecting these intervals for all the functions and for all the sentences in the training corpus, one can determine the effect on the number of correct disambiguation decisions of any alteration to any single scaling factor. The alteration selected is the one that gives the biggest increase in the number of sentences for which a correct choice is made. When no increase is possible, the procedure terminates. We found that convergence tends to be fairly rapid, with the number of steps seldom exceeding the number of scaling factors involved (although the process does occasionally change a scaling factor it has previously altered, when intervening changes make this appropriate).</Paragraph> <Paragraph position="2"> One of the functions we used shows the limitations of least squares scaling factor optimization, alluded to above, in quite a dramatic way. The function in question returns the number of temporal modifiers in a QLF. Its intended purpose is to favor readings of utterances like &quot;Atlanta to Boston Tuesday,&quot; in which &quot;Tuesday&quot; is a temporal modifier of the (elliptical) sentence rather than a compound noun formed with &quot;Boston.&quot; Linear scaling always gives this function a negative weight, causing temporal modifications to be downgraded, and in fact the relativized training score of a QLF turns out to be negatively correlated with the number of temporal modifiers it contains. 
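One step of this adjustment can be sketched as below, with a simplification: instead of deriving the exact intervals [uij, vij], this sketch scans a small grid of candidate values for each factor and keeps the single change that most increases the number of correctly disambiguated sentences.

```python
# Each sentence is a list of analyses (function_scores, is_correct);
# a sentence counts as correct when a correct analysis scores highest.

def num_correct(sentences, factors):
    total = 0
    for analyses in sentences:
        best = max(analyses,
                   key=lambda a: sum(c * s for c, s in zip(factors, a[0])))
        total += best[1]
    return total

def hill_climb_step(sentences, factors, candidates):
    best_n, best_j, best_v = num_correct(sentences, factors), None, None
    for j in range(len(factors)):       # alter one factor at a time
        for v in candidates:
            trial = factors[:j] + [v] + factors[j + 1:]
            n = num_correct(sentences, trial)
            if n > best_n:
                best_n, best_j, best_v = n, j, v
    if best_j is not None:
        factors = factors[:best_j] + [best_v] + factors[best_j + 1:]
    return factors, best_n

sentences = [[([1.5, 0.0], 0), ([0.0, 1.0], 1)],
             [([1.8, 0.0], 0), ([0.0, 1.0], 1)]]
factors, n = hill_climb_step(sentences, [1.0, 1.0], [0.5, 2.0])
print(n)  # both sentences now disambiguated correctly
```

Iterating this step until no alteration helps gives the (locally optimal) procedure of the text.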
However, the intuitions that led to the introduction of the function do seem to hold for QLFs that are close to being correct, and therefore iterative adjustment makes the weight positive.</Paragraph> <Paragraph position="3"> 2 Finding a global optimum would of course be desirable. However, inspection of the results, over various conditions, of the iterative scheme presented here did not suggest that the introduction of a technique such as simulated annealing, which in general can improve the prospect of finding a more global optimum, would have had much effect on performance.</Paragraph> <Paragraph position="4"> 3 An algorithm based on gradient descent might appear preferable, on the grounds that it would alter all factors simultaneously and have a better chance of locating a global optimum. However, the objective function, the number of correct disambiguation decisions, varies discontinuously with the scaling factors, so no gradients can be calculated.</Paragraph> </Section> </Section> <Section position="5" start_page="641" end_page="642" type="metho"> <SectionTitle> 4. Comparing Scaling Factor Sets </SectionTitle> <Paragraph position="0"> The performance of the factors derived from least squares calculation and adjustment by hill climbing was compared with that of various other sets of factors. The factor sets considered, roughly in increasing order of their expected quality, were the following: * &quot;Normalized&quot; factors: the magnitude of each factor is the inverse of the standard deviation of the preference function in question, making each function contribute equally. 
A factor is positive if it correlates positively with training scores; otherwise it is negative.</Paragraph> <Paragraph position="1"> * Factors chosen and tuned by hand for ATIS sentences before the work described in this paper was done, or, for functions developed during the work described here, without reference to any automatically derived values.</Paragraph> <Paragraph position="2"> * Factors resulting from least squares calculation, as described in Section 3.2.</Paragraph> <Paragraph position="3"> * Factors resulting from least squares calculation followed by hill-climbing adjustment (Section 3.3).</Paragraph> <Paragraph position="4"> To provide a baseline, performance was also evaluated for the technique of randomly selecting a single QLF for each sentence.</Paragraph> <Paragraph position="5"> The performance of each set of factors was evaluated as follows. The set of 4092 sentences with skeletal trees was divided into five subsets of roughly equal size. Each subset was &quot;held out&quot; in turn: the functions and scaling factors were trained on the other four subsets, and the system was then evaluated on the held-out subset. The system was deemed to have correctly processed a sentence if the QLF to which it assigned the highest score agreed exactly with the corresponding skeletal tree. The numbers of correctly processed sentences (i.e., sentences whose selected QLFs had correct constituent structures) are shown in Table 2; because all the sentences involved were within coverage, the theoretical maximum achievable is 4092 (100%). We use a standard statistical method, the sign test (explained in, for example, Dixon and Massey 1968), to assess the significance of the difference between two factor sets, S1 and S2. Define Fi(x) to be the function that assigns 1 to a sentence x if Si makes the correct choice in disambiguating x and 0 if it makes the wrong choice. 
The null hypothesis is that F1(x) and F2(x), treated as random variables over x, have the same distribution, from which we would expect the difference between F1(x) and F2(x) to be positive as often as it is negative. Table 3 gives the number of cases in which this difference is positive or negative. As is usual for the sign test, the cases in which the difference is 0 do not need to be taken into account. The test is applied to compare six pairs of factor sets. The &quot;#SDs&quot; column in Table 3 shows the number of standard deviations represented by the difference between the &quot;+&quot; and &quot;-&quot; figures under the null hypothesis; a #SDs value of 1.95 is statistically significant at the 5% level (two tail), and a value of 3.3 is significant at the 0.1% level.</Paragraph> <Paragraph position="6"> Table 3 shows that, in terms of wrong QLF choices, both sets of machine-optimized factors perform significantly better than the hand-optimized factors, to which considerable skilled effort had been devoted. It is worth emphasizing that the process of determining the machine-optimized factors does not make use of the knowledge encoded by hand optimization. The hill-climbing factor set, in turn, performs significantly better than the least squares set from which it is derived.</Paragraph> <Paragraph position="7"> A possible objection to this analysis is that, because QLFs are much richer structures than constituent trees, it is possible for a QLF to match a tree perfectly but have some other characteristic that makes it incorrect. In general, the principal source of such discrepancies is a wrong choice of word sense, but pure sense ambiguity (i.e., different predicates for the same syntactic behavior of the same word) turns out to be extremely rare in the ATIS corpus. 
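The #SDs computation for the sign test above can be sketched as follows: under the null hypothesis the number of "+" cases among the n discordant sentences is binomial(n, 1/2), so the deviation from n/2 is measured in units of sqrt(n)/2.

```python
import math

def sign_test_sds(plus, minus):
    """#SDs of the observed +/- split from the 50/50 null hypothesis."""
    n = plus + minus  # ties are ignored, as is usual for the sign test
    return abs(plus - n / 2) / (math.sqrt(n) / 2)

# e.g. a 20 vs. 36 split over 56 discordant sentences:
print(round(sign_test_sds(20, 36), 2))  # 2.14
```

A value above about 1.95 corresponds to significance at the 5% level (two tail), matching the threshold quoted for the #SDs column.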
An examination of the selected QLFs for the 20 + 36 = 56 sentences making up the + and - values for the comparison between the least squares and hill-climbing factor sets showed that in no case did a QLF have a correct constituent structure but fail to be acceptable on other criteria. Thus although the absolute percent correctness figures for a set of scaling factors may be very slightly (perhaps up to 1%) overoptimistic, this has no noticeable effect on the differences between factor sets.</Paragraph> </Section> <Section position="6" start_page="642" end_page="644" type="metho"> <SectionTitle> 5. Lexical Semantic Collocations </SectionTitle> <Paragraph position="0"> In this section we move from the problem of calculating scaling factors to the other main topic of this paper, showing how our experimental framework can be used diagnostically to compare the utility of competing suggestions for preference functions.</Paragraph> <Paragraph position="1"> We refer to the variant of collocations we used as lexical semantic collocations because (i) they are collocations between word senses rather than lexical items, and (ii) the relations used are often deeper than syntactic relations (for example the relations between a verb and its subject are different for passive and active sentences).</Paragraph> <Paragraph position="2"> The semantic collocations extracted from QLF expressions take the form of (H1, R, H2) triples, in which H1 and H2 are the head predicates of phrases in a sentence and R indicates the relation (e.g., a preposition or an argument position) between the two phrases in the proposed analysis. For this purpose, the triple derivation software abstracted away from proper names and some noun and verb predicates when they appeared as heads of phrases, replacing them by hand-coded class predicates. 
For example, predicates for names of meals are mapped onto the class name cc_SpecificMeal on the grounds that their distributions in unseen sentences are likely to be very similar. Some of the triples for the high-attachment QLF for &quot;Do I get dinner on this flight?&quot; are as follows: (get_Acquire, 2, personal) (get_Acquire, 3, cc_SpecificMeal) (get_Acquire, on, flight_AirplaneTrip).</Paragraph> <Paragraph position="3"> The first two triples correspond to the agent and theme positions in the predicate for get, whereas the third expresses the vital prepositional phrase attachment. In the triple set for the other QLF, this triple is replaced by (cc_SpecificMeal, on, flight_AirplaneTrip).</Paragraph> <Paragraph position="4"> Data collection for the semantic collocation functions proceeds by deriving a set of triples from each QLF analysis of the sentences in the training set. This is followed by statistical analysis to produce the following functions of each triple in the observed triple population. The first two functions have been used in other work on collocation; some authors use simple pairs rather than triples (i.e., no relation, just two words) when computing collocation strengths, so direct comparisons are a little difficult. The third function is an original variant of the second; the fourth is original; and the fifth is prompted by the arguments of Dunning (1993).</Paragraph> <Paragraph position="5"> * Mutual information: this relates the probability P1(a)P2(b)P3(c) of the triple (a, b, c) assuming independence between its three fields, where Pp(x) is the probability of observing x in position p, to the probability λ of the triple estimated from actual observations of triples derived from analyses ranked highest (or joint highest) in training score. 
More specifically, we use ln{λ/[P1(a)P2(b)P3(c)]}.</Paragraph> <Paragraph position="6"> * χ2: compares the expected frequency E of a triple with the square of the difference between E and the observed frequency F of the triple. Here the observed frequency is in analyses ranked highest (or joint highest) in training score, and the &quot;expected&quot; frequency assumes independence between triple fields. More specifically we use |F - E| * (F - E)/E. This variant of χ2, in which the numerator is signed, is used so that the function is monotonic, making it more suitable in preference functions.</Paragraph> <Paragraph position="7"> * χ: as χ2, but the quantity used is (F - E)/√E, as large values of F - E have a tendency to swamp the χ2 function.</Paragraph> <Paragraph position="8"> * Mean distance: the average of the relativized training score for all QLF analyses (not necessarily the highest ranked ones) that include the semantic collocation corresponding to the triple. In other words, the mean distance value for a triple is the mean amount by which a QLF giving rise to that triple falls short of a perfect score.</Paragraph> <Paragraph position="9"> * Likelihood ratio: for each triple (H1, R, H2), the ratio of the maximum likelihood of the triple, given the distribution of triples in correct analyses of the training data, on the assumption that H1 and H2 are independent given R, to the maximum likelihood without that assumption. (See Dunning 1993, for a fuller explanation of the use of likelihood ratios.) Computation of the mutual information and χ2 functions for triples involves the simple smoothing technique, suggested by Ken Church, of adding 0.5 to actual counts. 
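The first two triple functions can be sketched as follows. Exactly where the 0.5 smoothing enters is our assumption (here it is added to both the observed and the expected counts), and the example counts are invented.

```python
# Score (H1, R, H2) triples from best-ranked analyses by mutual
# information and by the signed chi-square variant; expected
# frequencies assume independence between the three fields.

import math
from collections import Counter

def triple_scores(counts):
    n = sum(counts.values())
    fields = [Counter(), Counter(), Counter()]
    for triple, f in counts.items():
        for i, x in enumerate(triple):
            fields[i][x] += f
    scores = {}
    for triple, f in counts.items():
        p_indep = 1.0
        for i, x in enumerate(triple):
            p_indep *= fields[i][x] / n
        f_s = f + 0.5            # smoothed observed count
        e_s = n * p_indep + 0.5  # smoothed expected count
        mutual_info = math.log(f_s / e_s)
        chi2_signed = abs(f_s - e_s) * (f_s - e_s) / e_s  # monotonic variant
        scores[triple] = (mutual_info, chi2_signed)
    return scores

counts = {("get_Acquire", "on", "flight_AirplaneTrip"): 4,
          ("cc_SpecificMeal", "on", "flight_AirplaneTrip"): 1,
          ("get_Acquire", 2, "cc_SpecificMeal"): 1}
scores = triple_scores(counts)
```

Triples observed more often than the independence assumption predicts come out with positive scores under both measures.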
From these five functions on triples, we define five semantic collocation preference functions applied to QLFs, in each case by averaging over the result of applying the function to each triple derived from a QLF. We refer to these functions by the same names as their underlying functions on triples. The collocation functions are normalized by multiplying up by the number of words in the sentence to which the function is being applied. This normalization keeps scores for QLFs in the same sentence comparable, while at the same time ensuring that the triple function scores tend to grow with sentence length in the same way that the non-collocation functions tend to do. Thus the optimality of a set of scaling factors is relatively insensitive to sentence length.</Paragraph> <Paragraph position="10"> Our use of the mean distance function was motivated by the desire to take into account additional information from the training material that is not exploited by the other collocation functions. Specifically, it takes into account all analyses proposed by the system, as well as the magnitude of the training score. In contrast, the other collocation functions make use only of the training score to select the best analysis of a sentence, discarding the rest. Another way of putting this is that the mean distance function is making use of negative examples and a measure of the degree of unacceptability of an analysis.</Paragraph> </Section> <Section position="7" start_page="644" end_page="645" type="metho"> <SectionTitle> 6. Comparing Semantic Collocation Functions </SectionTitle> <Paragraph position="0"> An evaluation of each function acting alone on the five held-out sets of test data yielded the numbers of correctly processed sentences shown in Table 4. The figures for the random baseline are repeated from Table 2. 
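The lifting of a per-triple function to a per-QLF preference function described above can be sketched as follows: average the triple scores, then multiply by the sentence's word count. The triple scores in the table are invented for illustration.

```python
# Average the scores of a QLF's triples, then scale by sentence length
# so collocation scores grow with sentence length the way the
# count-based preference functions do.

def collocation_preference(triples, triple_score, num_words):
    if not triples:
        return 0.0
    return num_words * sum(triple_score(t) for t in triples) / len(triples)

table = {("get_Acquire", "on", "flight_AirplaneTrip"): 1.2,
         ("cc_SpecificMeal", "on", "flight_AirplaneTrip"): -0.4}
score = lambda t: table.get(t, 0.0)

high = [("get_Acquire", "on", "flight_AirplaneTrip")]
low = [("cc_SpecificMeal", "on", "flight_AirplaneTrip")]
print(collocation_preference(high, score, 7))  # favors high attachment
```

Because every QLF for the same sentence is multiplied by the same word count, within-sentence comparisons are unaffected by the normalization.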
We also show, for comparison, the results for a function that scores a QLF according to the sum of the logs of the estimated probabilities of the syntactic rules used in its construction.4 4 We estimate the probability of occurrence of a syntactic rule R as the number of occurrences of R leading to QLFs with correct skeletal trees, divided by the number of occurrences of all rules leading to such QLFs. In cases where a function judged N QLFs equally plausible, of which 0 < G < N were correct, we assigned a fractional count G/N to that sentence; a random choice among the N QLFs would pick a correct one with probability G/N. For significance tests, which require binary data, we took a function as performing correctly only if all the QLFs it selected were correct. Such ties did not occur at all for the other experiments reported in this paper.</Paragraph> <Paragraph position="1"> A pairwise comparison of the results shows that all the differences between collocational functions are statistically highly significant. The syntactic rule cost function is significantly worse than all the collocational functions except the mutual information one, for which the difference is not significant either way. (There may, of course, exist better syntactic functions than the one we have tried.) The mean distance function is much superior to all the others when acting alone. Presumably, this function has an edge over the other functions because it exploits the additional information from negative examples and degree of correctness.</Paragraph> <Paragraph position="2"> The difference in performance between our syntactic and semantic preference functions is broadly in line with the results presented by Chang, Luo, and Su (1992), who use probabilities of semantic category tuples. 
However, this similarity in the results should be taken with some caution, because our syntactic preference function is rather crude, and because our best semantic function (mean distance) uses the additional information mentioned above. This information is not normally taken into account by direct estimates of tuple probabilities.</Paragraph> <Paragraph position="3"> When 1 collocation function is selected to act together with the 19 non-collocation-based functions from the default set (the set defined in Section 2 and used in the experiments on scaling factor calculation), the picture changes slightly. In this context, when scaling factors are calculated in the usual way, by least squares followed by hill climbing, the results for the best 3 of the above functions are as shown in Table 5.</Paragraph> <Paragraph position="4"> The difference between the mean distance function and the other 2 functions is still highly significant; therefore this function is chosen to be the only collocational one to be included in the default set of 20 (hence the &quot;mean distance&quot; condition here is the same as the &quot;hill-climbing&quot; condition in Section 4). However, the difference between the χ and χ2 functions is no longer quite so clear cut, and the relative advantage of the mean distance function compared with the χ function is less. It may be that other preference functions make up for some shortfall of the χ function that is, at least in part, taken into account by the mean distance function.</Paragraph> </Section> </Paper>