<?xml version="1.0" standalone="yes"?> <Paper uid="N06-2024"> <Title>NER Systems that Suit User's Preferences: Adjusting the Recall-Precision Trade-off for Entity Extraction</Title> <Section position="3" start_page="93" end_page="93" type="metho"> <SectionTitle> 2 Extractor tweaking </SectionTitle> <Paragraph position="0"> Learning methods such as VP-HMM and CRFs optimize criteria such as margin separation (implicitly maximized by VP-HMMs) or log-likelihood (explicitly maximized by CRFs), which are at best indirectly related to precision and recall. Can such learning methods be modified to more directly reward a user-provided performance metric? In a non-sequential classifier, a threshold on confidence can be set to alter the precision-recall tradeoff. This is nontrivial to do for VP-HMMs and CRFs.</Paragraph> <Paragraph position="1"> Both learners use dynamic programming to find the label sequence y = (y1,...,yi,...,yN) for a word sequence x = (x1,...,xi,...,xN) that maximizes the function W * summationtextif(x,i,yi[?]1,yi) , where W is the learned weight vector and f is a vector of features computed from x, i, the label yi for xi, and the previous label yi[?]1. Dynamic programming finds the most likely state sequence, and does not output probability for a particular sub-sequence. (Culotta and McCallum, 2004) suggest several ways to generate confidence estimation in this framework. We propose a simpler approach for directly manipulating the learned extractor's precision-recall ratio. We will assume that the labels y include one label O for &quot;outside any named entity&quot;, and let w0 be the weight for the feature f0, defined as follows:</Paragraph> <Paragraph position="3"> If no such feature exists, then we will create one.</Paragraph> <Paragraph position="4"> The NER based on W will be sensitive to the value of w0: large negative values will force the dynamic programming method to label tokens as inside entities, and large positive values will force it to label fewer entities1.</Paragraph> <Paragraph position="5"> 1We clarify that w0 will refer to feature f0 only, and not to other features that may incorporate label information. We thus propose to &quot;tweak&quot; a learned NER by varying the single parameter w0 systematically so as to optimize some user-provided performance metric.</Paragraph> <Paragraph position="6"> Specifically, we tune w0 using a a Gauss-Newton line search, where the objective function is iteratively approximated by quadratics.2 We terminate the search when two adjacent evaluation results are within a 0.01% difference3.</Paragraph> <Paragraph position="7"> A variety of performance metrics might be imagined: for instance, one might wish to optimize recall, after applying some sort of penalty for precision below some fixed threshold. In this paper we will experiment with performance metrics based on the (complete) F-measure formula, which combines precision and recall into a single numeric value based on a user-provided parameter b:</Paragraph> <Paragraph position="9"> A value of b > 1 assigns higher importance to recall. In particular, F2 weights recall twice as much as precision. Similarly, F0.5 weights precision twice as much as recall.</Paragraph> <Paragraph position="10"> We consider optimizing both token- and entity-level Fb - awarding partial credit for partially extracted entities and no credit for incorrect entity boundaries, respectively. Performance is optimized over the dataset on which W was trained, and tested on a separate set. 
</Section>
<Section position="4" start_page="93" end_page="95" type="metho"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="93" end_page="94" type="sub_section"> <SectionTitle> 3.1 Experimental Settings </SectionTitle>
<Paragraph position="0"> We experiment with three datasets of both email and newswire text. Table 1 gives summary statistics for all datasets. The widely used MUC-6 dataset includes news articles drawn from the Wall Street Journal. The Enron dataset is a collection of emails extracted from the Enron corpus (Klimt and Yang, 2004), where we use a sub-collection of the messages located in folders named &quot;meetings&quot; or &quot;calendar&quot;. The Mgmt-Groups dataset is a second email collection, extracted from the CSpace email corpus, which contains email messages sent by MBA students taking a management course conducted at Carnegie Mellon University in 1997. This data was split such that its test set contains a different mix of entity names compared to the training examples. Further details about these datasets are available elsewhere (Minkov et al., 2005).</Paragraph>
<Paragraph position="1"> We used an implementation of Collins' voted-perceptron method for discriminatively training HMMs (henceforth, VP-HMM) (Collins, 2002) as well as CRFs (Lafferty et al., 2001) to learn an NER.</Paragraph>
<Paragraph position="2"> Both VP-HMM and CRF were trained for 20 epochs on every dataset, using a simple set of features such as word identity and capitalization patterns for a window of three words around each word being classified. Each word is classified as either inside or outside a person name. (Footnote 4: This problem encoding is basic; however, in this paper we focus on the precision-recall trade-off in the general case and avoid optimizing these settings.)</Paragraph>
</Section>
<Section position="2" start_page="94" end_page="95" type="sub_section"> <SectionTitle> 3.2 Extractor Tweaking Results </SectionTitle>
<Paragraph position="0"> Figure 1 evaluates the effectiveness of the optimization process used by &quot;extractor tweaking&quot; on the Enron dataset. We optimized models for Fb with different values of b, and also evaluated each optimized model with different Fb metrics. The top graph shows the results for token-level Fb, and the bottom graph shows entity-level Fb behavior. The graphs illustrate that the optimized model does indeed roughly maximize performance for the target b value: for example, the token-level Fb curve for the model optimized for b = 0.5 indeed peaks at b = 0.5 on the test-set data. The optimization is only roughly accurate (Footnote 5: e.g., the token-level F2 curve peaks at b = 5) for several possible reasons: first, there are differences between the train and test sets; in addition, the line search assumes that the performance metric is smooth and convex, which need not be true. Note that evaluation-metric optimization is less successful for entity-level performance, which behaves less smoothly than token-level performance.</Paragraph>
<Paragraph position="1"> [Figure 1: Token-level (top) and entity-level (bottom) optimization for varying values of b, for the Enron dataset, VP-HMM. The y-axis gives Fb; b (x-axis) is given on a logarithmic scale.]</Paragraph>
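For concreteness, here is a small Python sketch (our own illustration, assuming a simple inside/outside encoding with a single &quot;I&quot; label) of the difference between token-level and entity-level precision and recall: token-level scoring awards partial credit for partially extracted entities, whereas entity-level scoring counts an entity only when its boundaries match exactly, which is consistent with the less smooth entity-level behavior noted above.

def token_prf(gold, pred, inside="I"):
    # Token-level precision/recall: every correctly labeled inside-token counts.
    tp = sum(1 for g, p in zip(gold, pred) if g == inside and p == inside)
    pred_pos = sum(1 for p in pred if p == inside)
    gold_pos = sum(1 for g in gold if g == inside)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    return precision, recall

def spans(tags, inside="I"):
    # Extract (start, end) spans of maximal runs of inside-tags.
    out, start = [], None
    for i, t in enumerate(tags):
        if t == inside and start is None:
            start = i
        elif t != inside and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(tags)))
    return out

def entity_prf(gold, pred, inside="I"):
    # Entity-level precision/recall: only exact boundary matches count.
    g, p = set(spans(gold, inside)), set(spans(pred, inside))
    tp = len(g.intersection(p))
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    return precision, recall

# Example: gold tags for "John Smith met Ann"; the first entity is only partially extracted.
gold = ["I", "I", "O", "I"]
pred = ["I", "O", "O", "I"]
print(token_prf(gold, pred))   # (1.0, 0.667) -- partial credit at the token level
print(entity_prf(gold, pred))  # (0.5, 0.5)   -- the truncated span gets no credit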
<Paragraph position="2"> Similar results were obtained when optimizing baseline CRF classifiers. Sample results (for MUC-6 only, due to space limitations) are given in Table 2, optimizing a CRF baseline for entity-level Fb. Note that as b increases, recall monotonically increases and precision monotonically falls.</Paragraph>
<Paragraph position="3"> The graphs in Figure 2 present another set of results as more traditional recall-precision curves. The top three graphs are for token-level Fb optimization, and the bottom three are for entity-level optimization. The solid lines show the token-level and entity-level precision-recall tradeoff obtained by varying b and optimizing the relevant measure for Fb; the points labeled &quot;baseline&quot; show the token-level and entity-level precision and recall of the baseline model learned by VP-HMM. These graphs demonstrate that extractor &quot;tweaking&quot; gives approximately smooth precision-recall curves, as desired. Again, we note that the resulting recall-precision trade-off for entity-level optimization is generally less smooth.</Paragraph>
<Paragraph position="4"> [Figure 2: Each graph shows the baseline learned VP-HMM and evaluation-metric optimization for different values of b, in terms of both token-level and entity-level performance.]</Paragraph>
</Section> </Section> </Paper>