<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3605"> <Title>Using semantic relations to refine coreference decisions. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods</Title>
<Section position="4" start_page="37" end_page="38" type="metho"> <SectionTitle> 3 Evaluation </SectionTitle>
<Paragraph position="0"> We test the proposed search algorithm on the problem of dependency parsing. We have previously developed a finite-state implementation (Ginter et al., 2006) of the Link Grammar (LG) parser (Sleator and Temperley, 1991), which generates the parse through the intersection of several finite-state automata. The resulting automaton encodes all candidate parses.</Paragraph>
<Paragraph position="1"> The parses are then generated from left to right, proceeding through the automaton from the initial to the final state. A partial parse is a sequence of n words from the beginning of the sentence, together with a string encoding of their dependencies. Advancing a partial parse corresponds to appending the next word to it. The degree of completion is then defined as the number of words currently generated in the parse, divided by the total number of words in the sentence.</Paragraph>
<Paragraph position="2"> To evaluate the ability of the proposed method to combine diverse criteria in the search, we use four target functions: a complex state-of-the-art parse re-ranker based on a regularized least-squares (RLSC) regressor (Tsivtsivadze et al., 2005), and three measures inspired by the simple heuristics applied by the LG parser. The criteria are the average length of a dependency, the average level of nesting of a dependency, and the average number of dependencies linking a word. The RLSC regressor, on the other hand, employs complex features and word n-gram statistics.</Paragraph>
<Paragraph position="3"> The dataset consists of 200 sentences randomly selected from the BioInfer corpus of dependency-parsed sentences extracted from abstracts of biomedical research articles (Pyysalo et al., 2006). For each sentence, we have randomly selected at most 100 parses. For sentences with fewer than 100 parses, all parses were selected.</Paragraph>
<Paragraph position="4"> The average number of parses per sentence is 62.</Paragraph>
<Paragraph position="5"> Further, we perform 5×2 cross-validation, that is, in each of five replications we randomly divide the data into two sets of 100 sentences and use one set to estimate the probability distributions and the other to measure the performance of the search algorithm. The RLSC regressor is trained once, using a different set of sentences from the BioInfer corpus.</Paragraph>
<Paragraph position="6"> The results presented here are averaged over the 10 folds. As a comparative baseline, we use a simple greedy search algorithm that always advances the partial solution with the highest score until all solutions have been generated.</Paragraph>
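As an illustration of the comparative baseline just described, the following is a minimal Python sketch of a greedy best-first search that always advances the highest-scoring partial solution until every candidate has been fully generated. The PartialParse interface (score(), is_complete(), advance()) is a hypothetical stand-in and is not part of the original finite-state implementation.

    # Hypothetical sketch of the greedy baseline: a priority queue keyed by
    # score, always advancing the best-scoring partial parse by one word.
    import heapq
    import itertools

    def greedy_baseline(initial_partials):
        """Advance the best partial parse until all candidate parses are complete."""
        tie = itertools.count()  # tie-breaker so heapq never compares parse objects
        heap = [(-p.score(), next(tie), p) for p in initial_partials]
        heapq.heapify(heap)
        completed = []           # parses in the order they were completed
        steps = 0
        while heap:
            _, _, best = heapq.heappop(heap)
            if best.is_complete():
                completed.append(best)
                continue
            steps += 1
            # Appending the next word may branch into several successor
            # partial parses, one per outgoing arc of the automaton.
            for successor in best.advance():
                heapq.heappush(heap, (-successor.score(), next(tie), successor))
        return completed, steps

Because this baseline never discards a partial solution, it eventually completes every candidate parse; the proposed algorithm differs in that its stopping criterion can terminate the search before the whole space is generated.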
<Section position="1" start_page="38" end_page="38" type="sub_section"> <SectionTitle> 3.1 Results </SectionTitle>
<Paragraph position="0"> For each sentence s with parses S = {s1, . . . , sN}, let SC ⊆ S be the subset of parses fully completed before the algorithm stops and SN = S \ SC the subset of parses not fully completed. Further, let TC be the number of iterations taken before the algorithm stops, and T the total number of steps needed to generate all parses in S. Thus, |S| is the size of the search space measured in the number of parses, and T is the size of the search space measured in the number of steps. For a single parse si, rank(si) is the number of parses in S with a score higher than f(si), plus 1. Thus, all solutions with the maximal score have rank 1. Finally, ord(si) is the order in which the parses were completed by the algorithm (disregarding the stopping criterion).</Paragraph>
<Paragraph position="1"> For example, if the parses were completed in the order s3, s8, s1, then ord(s3) = 1, ord(s8) = 2, and ord(s1) = 3. While two solutions have the same rank if their scores are equal, no two solutions have the same order. The best completed solution ŝC ∈ SC is the solution with the best (numerically lowest) rank in SC and the lowest order among solutions with that rank. The best solution ŝ is the solution with rank 1 and the lowest order among solutions with rank 1. If ŝ ∈ SC, then ŝC = ŝ and the algorithm fulfilled its objective of finding the best solution. We use the following measures of performance: rank(ŝC), ord(ŝ), |SC|/|S|, and TC/T. The most important criteria are rank(ŝC), which measures how good the best found solution is, and TC/T, which measures the proportion of the total number of steps needed to complete all candidate solutions that the algorithm actually takes. Further, ord(ŝ), the number of parses completed before the global optimum was reached (disregarding the stopping criterion), is indicative of the ability of the search to reach the global optimum early among the completed parses. Note that all measures except ord(ŝ) equal 1 for the baseline greedy search, since it lacks a stopping criterion.</Paragraph>
<Paragraph position="2"> The average performance values for four settings of the parameter ε are presented in Table 1. Clearly, the algorithm behaves as expected with respect to ε. While with the strictest setting ε = 0.01, 94% of the search space is explored, with the least strict setting ε = 0.2, only 73% is explored, thus pruning roughly one quarter of the search space. The proportion of completed parses is generally considerably lower than the proportion of explored search space. This indicates that the parses are generally advanced to a significant level of completion, but then ruled out. The behavior of the algorithm is thus closer to a breadth-first than to a depth-first search. We also notice that the average rank of the best completed solution is very low, indicating that although the algorithm does not necessarily identify the best solution, it generally identifies a very good one. In addition, the order of the best solution is also low, suggesting that good solutions are generally identified before low-score solutions. Further, compared to the baseline, the globally optimal solution is reached earlier among the completed parses, although this does not imply that it is reached earlier in the number of steps. Apart from the overall averages, we also consider the performance with respect to the number of alternative parses for each sentence (Table 2). Here we see that even with the least strict setting, the search finds a reasonably good solution while reducing the search space to 48%.</Paragraph>
</Section> </Section> </Paper>
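For concreteness, the following is a small Python sketch, under simplifying assumptions, of how the four measures defined in Section 3.1 (rank(ŝC), ord(ŝ), |SC|/|S|, and TC/T) can be computed from the parse scores and completion orders. The input representation and function names are illustrative and not taken from the original implementation.

    # Illustrative computation of the evaluation measures; the data layout
    # (score, order, completed) is an assumption made for this sketch.

    def rank(score, all_scores):
        """Number of parses in S scoring strictly higher than `score`, plus 1."""
        return sum(1 for s in all_scores if s > score) + 1

    def evaluate(parses, t_c, t_total):
        """parses  -- (score, order, completed) triples for every parse in S, where
                      `order` is the completion order when the stopping criterion is
                      disregarded and `completed` marks membership in S_C
           t_c     -- iterations taken before the algorithm stopped (T_C)
           t_total -- steps needed to generate all parses in S (T)"""
        scores = [score for score, _, _ in parses]
        s_c = [(score, order) for score, order, done in parses if done]

        # Best completed solution: best rank within S_C, earliest order among ties.
        best_completed = min(s_c, key=lambda p: (rank(p[0], scores), p[1]))
        # Best solution overall: rank 1, earliest completion order among rank-1 parses.
        best = min((p for p in parses if rank(p[0], scores) == 1), key=lambda p: p[1])

        return {
            "rank(best completed)": rank(best_completed[0], scores),
            "ord(best)": best[1],
            "|S_C|/|S|": len(s_c) / len(parses),
            "T_C/T": t_c / t_total,
        }

For the greedy baseline, every parse would be marked completed and t_c would equal t_total, so all measures except ord(ŝ) evaluate to 1, matching the observation in the text.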