2 Method

Let us consider a system in which there are $N$ solutions $s_1, \dots, s_N \in S$ to a problem and $M$ target functions $f_1, \dots, f_M$, where $f_k : S \rightarrow \mathbb{R}$, that assign a score to each of the solutions. The score $f_k(s_i)$ expresses the extent to which the solution $s_i$ satisfies the criterion implemented by the target function $f_k$. The overall score of a solution $s_i$,

\[
f(s_i) = \sum_{k=1}^{M} f_k(s_i) , \qquad (1)
\]

is the sum of the scores given by the individual target functions. The objective is to identify $\hat{s}$, the best among the $N$ possible solutions, that maximizes the overall score:

\[
\hat{s} = \operatorname*{argmax}_{s_i \in S} f(s_i) . \qquad (2)
\]

Suppose that the solutions are generated incrementally, so that each solution $s_i$ can be reached through a sequence of $F$ partial solutions $s_{i,1}, s_{i,2}, \dots, s_{i,F}$, where $s_{i,F} = s_i$. Let further $u : S \rightarrow (0, 1]$ be a measure of the degree of completion of a solution. For a complete solution $s_i$, $u(s_i) = 1$, and for a partial solution $s_{i,n}$, $u(s_{i,n}) < 1$. For instance, when assigning POS tags to the words of a sentence, the degree of completion could be defined as the number of words assigned a POS tag so far, divided by the total number of words in the sentence.

The score of a partial solution $s_{i,n}$ is, to a certain extent, a prediction of the score of the corresponding complete solution $s_i$. Intuitively, the accuracy of this prediction depends on the degree of completion: the score of a partial solution with a high degree of completion is generally closer to the final score than that of a largely incomplete partial solution.
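As a concrete illustration of this setup, the sketch below casts the POS-tagging example in code: a partial solution is the list of tags assigned so far, several target functions score it, the overall score is their sum as in Equation 1, and $u$ is the fraction of tagged words. This is a hypothetical rendering for illustration only; the names and types are not part of the original method.

    from typing import Callable, List, Sequence

    PartialSolution = List[str]                    # POS tags assigned to the first n words
    TargetFunction = Callable[[PartialSolution], float]

    def overall_score(s: PartialSolution, target_functions: Sequence[TargetFunction]) -> float:
        # Equation 1: the overall score is the sum of the individual target-function scores.
        return sum(f_k(s) for f_k in target_functions)

    def degree_of_completion(s: PartialSolution, sentence_length: int) -> float:
        # u(s): the fraction of words that have been assigned a POS tag so far.
        return len(s) / sentence_length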
Let

\[
d_k(s_{i,n}) = f_k(s_i) - f_k(s_{i,n}) \qquad (3)
\]

be the difference between the scores of $s_i$ and $s_{i,n}$. That is, $d_k(s_{i,n})$ is the error in score caused by the incompleteness of the partial solution $s_{i,n}$. Because the solutions are generated incrementally, the exact value of $d_k(s_{i,n})$ is not known at the moment of generating $s_{i,n}$, since the solution $s_i$ has not yet been completed. However, we can model the error based on the knowledge of $s_{i,n}$. We assume that, for a given $s_{i,n}$, the error $d_k(s_{i,n})$ is a random variable distributed according to a probability distribution with density function $\Delta_k$, denoted as

\[
d_k(s_{i,n}) \sim \Delta_k(d \,;\, s_{i,n}) . \qquad (4)
\]

The partial solution $s_{i,n}$ is a parameter of the distribution and, in theory, each partial solution gives rise to a different distribution of the same general shape. We assume that the error $d_k(s_{i,n})$ is distributed around a mean value and that, for a 'reasonably behaving' target function, the probability of a small error is higher than the probability of a large error. Ideally, the target function will not exhibit any systematic error, and the mean value would thus be zero (we will see in our evaluation experiments, however, that this is not the case, and the target functions may exhibit a systematic bias in the error $d$). For instance, a positive mean error indicates a systematic bias toward underestimating the score. The mean error should approach 0 as the degree of completion increases, and the error of a complete solution is always 0. We have further argued that the reliability of the prediction grows with the degree of completion. That is, the error of a partial solution with a high degree of completion should exhibit a smaller variance than that of a largely incomplete solution. The variance of the error for a complete solution is always 0.

Knowing the distribution $\Delta_k$ of the error $d_k$, the density $\delta_k(f \,;\, s_{i,n})$ of the distribution of the final score $f_k(s_i)$ is obtained by shifting the density of the error $d_k(s_{i,n})$ by $f_k(s_{i,n})$, that is,

\[
\delta_k(f \,;\, s_{i,n}) = \Delta_k\bigl(f - f_k(s_{i,n}) \,;\, s_{i,n}\bigr) .
\]

So far, we have discussed the case of a single target function $f_k$. Let us now consider the general case of $M$ target functions. Knowing the final-score density $\delta_k$ for each individual target function $f_k$, it is necessary to find the density of the overall score $f(s_i)$. By Equation 1, $f(s_i)$ is distributed as the sum of the random variables $f_1(s_i), \dots, f_M(s_i)$. Therefore, assuming independence, its density is the convolution of the densities of these variables; that is, given $s_{i,n}$,

\[
\delta(f \,;\, s_{i,n}) = (\delta_1 * \delta_2 * \dots * \delta_M)(f \,;\, s_{i,n}) . \qquad (7)
\]

We have assumed the independence of the target function scores. Further, we will assume that the error takes the form of the normal distribution, which is convolution-closed, a property necessary for efficient calculation of Equation 7. We thus have

\[
\delta(f \,;\, s_{i,n}) = n\bigl(f \,;\, \mu(s_{i,n}),\, \sigma^2(s_{i,n})\bigr) , \qquad (9)
\]

where $n$ is the normal density function. While it is unlikely that independence and normality hold strictly, this is a commonly used approximation, necessary for an analytical solution of (7). The notions introduced so far are illustrated in Figure 1.

[Figure 1: The distribution of the final score $f(s_i)$, given a partial solution $s_{i,n}$. The density is assumed normally distributed, with mean $\mu(s_{i,n})$ and variance $\sigma^2(s_{i,n})$. With probability $1 - \varepsilon$, the final score is less than $e(s_{i,n})$.]

2.1 The search algorithm

We will now apply the model introduced in the previous section to derive a probabilistic search algorithm.

Let us consider two partial solutions $s_{i,n}$ and $s_{j,m}$ with the objective of deciding which one of them is 'more promising', that is, more likely to lead to a complete solution with a higher score. The condition of 'more promising' can be defined in several ways. For instance, once again assuming independence, it is possible to directly compute the probability

\[
P\bigl(f(s_i) > f(s_j)\bigr) = \int_{-\infty}^{\infty} \delta_{s_{j,m}}(y) \int_{y}^{\infty} \delta_{s_{i,n}}(x) \, dx \, dy , \qquad (10)
\]

where $\delta_{s_{i,n}}$ refers to the function $\delta(f \,;\, s_{i,n})$. Since $\delta$ is the convolution-closed normal density, Equation 10 can be computed directly using the normal cumulative distribution. The disadvantage of this definition is that the cumulative distribution needs to be evaluated separately for each pair of partial solutions. Therefore, we adopt an alternative definition of 'more promising' in which the cumulative distribution is evaluated only once for each partial solution.
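Before turning to that alternative, the following sketch (a minimal illustration, not the authors' implementation) shows how the normality assumption makes Equations 7, 9, and 10 easy to compute: per-function predictions combine by adding means and variances, and the pairwise probability of Equation 10 reduces to a single normal CDF evaluation, because the difference of two independent normal variables is itself normal. SciPy is assumed; the class and function names are illustrative.

    from dataclasses import dataclass
    from scipy.stats import norm

    @dataclass
    class ScorePrediction:
        mean: float   # predicted final score for one target function, given s_{i,n}
        var: float    # variance of that prediction

    def combine(predictions):
        # Equations 7 and 9: the sum of independent normal variables is normal,
        # with mean and variance equal to the sums of the means and variances.
        return ScorePrediction(sum(p.mean for p in predictions),
                               sum(p.var for p in predictions))

    def prob_better(a: ScorePrediction, b: ScorePrediction) -> float:
        # Equation 10: P(final score of a > final score of b). The difference of two
        # independent normals is normal, so one CDF (survival function) call suffices.
        diff_mean = a.mean - b.mean
        diff_var = a.var + b.var
        return float(norm.sf(0.0, loc=diff_mean, scale=diff_var ** 0.5))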
Let $\varepsilon \in [0, 1]$ be a probability value and $e(s_{i,n})$ be the score such that $P\bigl(f(s_i) > e(s_{i,n})\bigr) = \varepsilon$. The value of $e(s_{i,n})$ can easily be computed from the inverse cumulative distribution function corresponding to the density $\delta(f \,;\, s_{i,n})$. The interpretation of $e(s_{i,n})$ is that, with probability $1 - \varepsilon$, the partial solution $s_{i,n}$, once completed, will lead to a score smaller than $e(s_{i,n})$. The constant $\varepsilon$ is a parameter, set to an appropriately small value. See Figure 1 for an illustration.

We will refer to $e(s_{i,n})$ as the maximal expected score of $s_{i,n}$. Of two partial solutions, we consider as 'more promising' the one whose maximal expected score is higher. As illustrated in Figure 2, it is possible for a partial solution $s_{i,n}$ to be more promising even though its score $f(s_{i,n})$ is lower than that of some other partial solution $s_{j,m}$.

Further, given a complete solution $s_i$ and a partial solution $s_{j,m}$, a related question is whether $s_{j,m}$ is a promising solution, that is, whether advancing it is likely to lead to a score higher than $f(s_i)$. Using the notion of maximal expected score, we say that a solution is promising if $e(s_{j,m}) > f(s_i)$.

[Figure 2: Although the score of the partial solution $s_{i,n}$ is lower than the score of $s_{j,m}$, the partial solution $s_{i,n}$ is more promising, since $e(s_{i,n}) > e(s_{j,m})$. Note that, for the sake of simplicity, a zero systematic bias of the error $d$ is assumed, that is, the densities are centered around the partial solution scores.]

With the definitions introduced so far, we are now able to perform two basic operations: compare two partial solutions, deciding which of them is more promising, and compare a partial solution with some complete solution, deciding whether the partial solution is still promising or can be disregarded. These two basic operations are sufficient to devise the following search algorithm, sketched in code below.

* Maintain a priority queue of partial solutions, ordered by their maximal expected score.
* In each step, remove from the queue the partial solution with the highest maximal expected score, advance it, and enqueue any resulting partial solutions.
* Iterate while the maximal expected score of the most promising partial solution remains higher than the score of the best complete solution discovered so far.

The parameter $\varepsilon$ primarily affects how early the algorithm stops; however, it also influences the order in which the solutions are considered. Low values of $\varepsilon$ result in higher maximal expected scores, and therefore partial solutions need to be advanced to a higher degree of completion before they can be disregarded as unpromising.

While there are no particular theoretical restrictions on the target functions, there is an important practical consideration. Since the target function is evaluated every time a partial solution $s_{i,n}$ is advanced into $s_{i,n+1}$, it is necessary to be able to use information about $s_{i,n}$ to compute $f_k(s_{i,n+1})$ efficiently.

The algorithm is to a large extent related to the A* search algorithm, which maintains a priority queue of partial solutions ordered according to the score $g(x) + h(x)$, where $g(x)$ is the score of $x$ and $h(x)$ is a heuristic overestimate of the final score of the goal reached from $x$ (in the usual application of A* to shortest-path search, $h(x)$ is a heuristic underestimate, since the objective is to minimize the score). Here, the maximal expected score of a partial solution is an overestimate with probability $1 - \varepsilon$ and can be viewed as a probabilistic counterpart of the A* heuristic component $h(x)$. Note that A* is guaranteed to find the best solution only if $h(x)$ never underestimates, which is not the case here.
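The following sketch renders the search loop described above; it is an illustrative implementation under the stated assumptions, not the authors' code. The maximal expected score is the $(1 - \varepsilon)$-quantile of the predicted final-score distribution, obtained from the inverse normal CDF; advance, is_complete, score, and predict are hypothetical problem-specific hooks supplying successor generation, completion testing, exact scoring, and the (mean, variance) prediction of Equation 9.

    import heapq, itertools
    from scipy.stats import norm

    def maximal_expected_score(mean, var, epsilon):
        # Score t such that P(final score > t) = epsilon, i.e. the (1 - epsilon)-quantile.
        return norm.ppf(1.0 - epsilon, loc=mean, scale=var ** 0.5)

    def search(initial_partials, advance, is_complete, score, predict, epsilon=0.05):
        # Best-first search over partial solutions, ordered by maximal expected score.
        best, best_score = None, float("-inf")
        tie = itertools.count()       # tiebreaker so the heap never compares solutions
        heap = []                     # heapq is a min-heap, so keys are negated
        for s in initial_partials:
            heapq.heappush(heap, (-maximal_expected_score(*predict(s), epsilon), next(tie), s))
        while heap:
            neg_e, _, s = heapq.heappop(heap)
            if -neg_e <= best_score:  # no remaining partial solution is promising: stop
                break
            for t in advance(s):      # advance the most promising partial solution
                if is_complete(t):
                    if score(t) > best_score:
                        best, best_score = t, score(t)
                else:
                    heapq.heappush(heap, (-maximal_expected_score(*predict(t), epsilon), next(tie), t))
        return best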
2.2 Estimation of $\mu_k(s_{i,n})$ and $\sigma^2_k(s_{i,n})$

So far, we have assumed that for each partial solution $s_{i,n}$ and each target function $f_k$, the density $\Delta_k(d \,;\, s_{i,n})$ is a normal density specified by the mean $\mu_k(s_{i,n})$ and variance $\sigma^2_k(s_{i,n})$. This density models the error $d_k(s_{i,n})$ that arises from the incompleteness of $s_{i,n}$. The parameters $\mu_k(s_{i,n})$ and $\sigma^2_k(s_{i,n})$ are, in theory, different for each $s_{i,n}$ and reflect the behavior of the target function $f_k$ as well as the degree of completion and possibly other attributes of $s_{i,n}$. It is thus necessary to estimate these two parameters from data.

Let us, for each target function $f_k$, consider a training set of observations $T_k \subset S \times \mathbb{R}$. Each training observation $t_j = \bigl(s_{j,n_j},\, d_k(s_{j,n_j})\bigr) \in T_k$ corresponds to a solution $s_{j,n_j}$ with a known error $d_k(s_{j,n_j})$.

Before we introduce the method for estimating the density $\Delta_k(d \,;\, s_{i,n})$ for a particular $s_{i,n}$, we discuss data normalization. The overall score $f(s_{i,n})$ is defined as the sum of the scores assigned by the individual target functions $f_k$. Naturally, it is desirable that these scores be of comparable magnitudes. Therefore, we normalize each target function using the z-normalization

\[
f_k(s) \leftarrow \frac{f_k(s) - \bar{f}_k}{\hat{\sigma}_{f_k}} ,
\]

where $\bar{f}_k$ and $\hat{\sigma}_{f_k}$ are the mean and standard deviation of the scores in the training set $T_k$. Each target function $f_k$ is normalized separately, based on the data in its training set $T_k$. Throughout our experiments, the values of the target functions are always z-normalized.

Let us now consider the estimation of the mean $\mu_k(s_{i,n})$ and variance $\sigma^2_k(s_{i,n})$ that define the density $\Delta_k(d \,;\, s_{i,n})$. Naturally, it is not possible to estimate the distribution parameters for each solution $s_{i,n}$ separately. Instead, we approximate the parameters based on the two most salient characteristics of each solution: the degree of completion $u(s_{i,n})$ and the score $f_k(s_{i,n})$. Thus,

\[
\mu_k(s_{i,n}) \approx \mu_k\bigl(u(s_{i,n}),\, f_k(s_{i,n})\bigr) \qquad (12)
\]
\[
\sigma^2_k(s_{i,n}) \approx \sigma^2_k\bigl(u(s_{i,n}),\, f_k(s_{i,n})\bigr) . \qquad (13)
\]

Let us assume the following notation: $u_i = u(s_{i,n})$ and $f_i = f_k(s_{i,n})$ for the solution at hand, and $u_j$, $f_j$, $d_j$ for the corresponding values of a training observation $t_j$. The two parameters are then estimated by kernel smoothing over the training observations,

\[
\mu_k(u_i, f_i) = \frac{\sum_j K \, d_j}{\sum_j K} , \qquad
\sigma^2_k(u_i, f_i) = \frac{\sum_j K \, \bigl(d_j - \mu_k(u_i, f_i)\bigr)^2}{\sum_j K} ,
\]

where $K$ stands for the kernel value $K_{u_i,f_i}(u_j, f_j)$. The kernel $K$ is the product of two Gaussians, centered at $u_i$ and $f_i$, respectively:

\[
K_{u_i,f_i}(u_j, f_j) = n\bigl(u_j \,;\, u_i,\, \sigma^2_u\bigr) \, n\bigl(f_j \,;\, f_i,\, \sigma^2_f\bigr) ,
\]

where $n\bigl(x \,;\, \mu,\, \sigma^2\bigr)$ is the normal density function. The variances $\sigma^2_u$ and $\sigma^2_f$ control the degree of smoothing along the $u$ and $f$ axes, respectively; a high variance results in stronger smoothing than a low variance. In our evaluation, we set the variances such that $\sigma_u$ and $\sigma_f$ equal 10% of the distance from $\min(u_j)$ to $\max(u_j)$ and from $\min(f_j)$ to $\max(f_j)$, respectively.
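A small sketch of this estimator, a standard kernel-smoothed weighted mean and variance, is given below; it assumes NumPy and SciPy and is not the authors' implementation. The arrays u_obs, f_obs, and d_obs are hypothetical names holding the degree of completion, z-normalized score, and observed error for the training observations in $T_k$, and the bandwidths follow the 10%-of-range setting described above.

    import numpy as np
    from scipy.stats import norm

    def estimate_error_moments(u_q, f_q, u_obs, f_obs, d_obs):
        # Kernel-smoothed mean and variance of the error at the query point (u_q, f_q).
        # Bandwidths: 10% of the observed range along each axis, as in the evaluation setup.
        sigma_u = 0.1 * (u_obs.max() - u_obs.min())
        sigma_f = 0.1 * (f_obs.max() - f_obs.min())
        # Product-of-Gaussians kernel centered at the query point.
        k = norm.pdf(u_obs, loc=u_q, scale=sigma_u) * norm.pdf(f_obs, loc=f_q, scale=sigma_f)
        mu = np.sum(k * d_obs) / np.sum(k)
        var = np.sum(k * (d_obs - mu) ** 2) / np.sum(k)
        return mu, var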
[Figure 3: By (12) and (13), the error is approximated as a function of the degree of completion $u(s_{i,n})$ and the score $f_k(s_{i,n})$. The degree of completion is on the horizontal axis and the score on the vertical axis. The estimates $(\mu_A, \sigma^2_A)$ and $(\mu_B, \sigma^2_B)$ correspond to the RLSC regressor and average link length target functions, respectively.]

The kernel-smoothed estimates of $\mu$ and $\sigma^2$ for two of the four target functions used in the evaluation experiments are illustrated in Figure 3. While for both target functions the estimated mean and variance decrease toward 0 as the degree of completion grows, the two target functions otherwise exhibit rather different behavior. Note that the values clearly depend on both the score and the degree of completion, indicating that the degree of completion alone is not sufficiently representative of the partial solutions. Ideally, both the mean and the variance should be strictly 0 for $u = 1$; due to the effect of smoothing, however, they remain non-zero.