<?xml version="1.0" standalone="yes"?>
<Paper uid="J94-4005">
<Title>Training and Scaling Preference Functions for Disambiguation Hiyan Alshawi * AT&T Bell Laboratories</Title>
* AT&T Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974, USA. E-mail: hiyan@research.att.com
† SRI International, Cambridge Computer Science Research Centre, 23 Miller Yard, Cambridge CB2 1RQ, UK. E-mail: dmc@cam.sri.com
© 1994 Association for Computational Linguistics
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle> 1. Introduction </SectionTitle>
<Paragraph position="0"> The importance of good preference functions for ranking competing analyses produced by language-processing systems grows as the coverage of these systems improves.</Paragraph>
<Paragraph position="1"> Increasing coverage usually also increases the number of analyses for sentences previously covered, bringing the danger of lower accuracy on those sentences. Large-scale rule-based analysis systems have therefore tended to employ a collection of functions to produce a score for sorting analyses in a preference order. In this paper we address two issues relating to the application of preference functions.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 1.1 Combining Multiple Preference Functions </SectionTitle>
<Paragraph position="0"> The first problem we address is that of combining different functions, each of which is supposed to offer some contribution to selecting the best among a set of analyses of a sentence. Although multiple functions have been used in other systems (for example, McCord 1990; Hobbs and Bear 1990), little is typically said about how the functions are combined to produce the overall score for an analysis, the weights presumably being determined by intuition or trial and error. McCord (1993) gives very specific information about the weights he uses to combine preference functions, though these weights are chosen by hand. Selecting weights by hand, however, is a task for experts, and one that needs to be redone every time the system is applied to a new domain or corpus.</Paragraph>
<Paragraph position="1"> Furthermore, there is no guarantee that the selected weights will achieve optimal or even near-optimal performance.</Paragraph>
<Paragraph position="2"> The speech-processing community, on the other hand, has a longer history of using numerical evaluation functions, and speech researchers have used schemes for scoring recognition hypotheses that are similar to the one proposed here for disambiguation.</Paragraph>
<Paragraph position="3"> For example, Ostendorf et al. (1991) improve recognition performance by using a linear combination of several scoring functions. In their work the weights for the linear combination are chosen to optimize a generalized mean of the rank of the correct word sequence.</Paragraph>
<Paragraph position="4"> In our case, the problem is formulated as follows. Each preference function is defined as a numerical (possibly real-valued) function on representations corresponding to the sentence analyses. A weighted sum of these functions is then used as the overall measure to rank the possible analyses of a particular sentence. We refer to the coefficients, or weights, used in this linear combination as the &quot;scaling factors&quot; for the functions.
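To make this formulation concrete, the short sketch below ranks the candidate analyses of a sentence by such a weighted sum; it is purely illustrative (it is not the implementation used in the system described here), and the function and variable names are arbitrary.

    # Illustrative sketch: rank analyses by a weighted sum of preference functions.
    # Each preference function maps an analysis to a number; each scaling factor
    # is the linear weight paired with one preference function.
    def rank_analyses(analyses, preference_functions, scaling_factors):
        def combined_score(analysis):
            return sum(w * f(analysis)
                       for w, f in zip(scaling_factors, preference_functions))
        # Most preferred analysis (highest combined score) first.
        return sorted(analyses, key=combined_score, reverse=True)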
We determine these scaling factors automatically, both to avoid the need for expert hand tuning and to achieve performance that is at least locally optimal.</Paragraph>
<Paragraph position="5"> We start with the solution obtained by minimizing a squared-error cost function, a well-known technique applied to many optimization and classification problems. This solution is then refined by applying a hill-climbing technique.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 1.2 Word Sense Collocation Functions </SectionTitle>
<Paragraph position="0"> Until recently, the choice of the various functions used in rule-based systems was made mainly according to anecdotal information about the effectiveness of, for example, different attachment preference strategies. There is now more empirical work comparing such functions, particularly in the case of functions based on statistical information about lexical or semantic collocations. Lexical collocation functions, especially those determined statistically, have recently attracted considerable attention in computational linguistics (Calzolari and Bindi 1990; Church and Hanks 1990; Sekine et al. 1992; Hindle and Rooth 1993), mainly, though not exclusively, for use in disambiguation. These functions are typically derived by observing occurrences of tuples (usually pairs or triples) that summarize relations present in an analysis of a text, or the surface occurrences of such tuples. For example, Hindle and Rooth (1993) and Resnik and Hearst (1993) give experimental results on the effectiveness of functions based on lexical associations, or lexical-class associations, at selecting appropriate prepositional phrase attachments.</Paragraph>
<Paragraph position="1"> We have experimented with a variety of specific functions that make use of collocations between word senses. The results we present show that these functions vary considerably in disambiguation accuracy, but that the best collocation functions are more effective than a function based on simple estimates of syntactic rule probabilities.</Paragraph>
<Paragraph position="2"> In particular, the best collocation function performs significantly better than a related function that defines collocation strength in terms of mutual information, reducing the error rate in a disambiguation task from approximately 30% to approximately 10%.</Paragraph>
<Paragraph position="3"> We start by describing our experimental context and training data in Section 2.</Paragraph>
<Paragraph position="4"> Then we address the issue of selecting scaling factors by presenting our optimization procedure in Section 3 and a comparison with manual scaling in Section 4. Finally, we take a close look at a set of semantic collocation functions, defining them in Section 5 and comparing their effectiveness at disambiguation in Section 6.</Paragraph>
</Section>
</Section>
</Paper>