<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1062">
  <Title>Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Global features
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 The global-feature generator
</SectionTitle>
      <Paragraph position="0"> The module we describe in this section generates global features for each candidate tagged sequence.</Paragraph>
      <Paragraph position="1"> As input it takes a sentence, along with a proposed segmentation (i.e., an assignment of a tag for each word in the sentence). As output, it produces a set of feature strings. We will use the following tagged sentence as a running example in this section:</Paragraph>
      <Paragraph position="2"> ... nation/N ./N</Paragraph>
    </Section>
    <Paragraph position="0"> An example feature type is simply to list the full strings of entities that appear in the tagged input. In this example, this would give the three features WE=Gen Xer WE=The Day They Shot John Lennon WE=Dougherty Arts Center Here WE stands for &amp;quot;whole entity&amp;quot;. Throughout this section, we will write the features in this format. The start of the feature string indicates the feature type (in this case WE), followed by =. Following the type, there are generally 1 or more words or other symbols, which we will separate with the symbol . A seperate module in our implementation takes the strings produced by the global-feature generator, and hashes them to integers. For example, suppose the three strings WE=Gen Xer, WE=The Day They Shot John Lennon, WE=Dougherty Arts Center were hashed to 100, 250, and 500 respectively. Conceptually, the candidate a16 is represented by a large number of features a32a70a69 a22 a16a25a24 for a71a72a38 a2a48a47a4a47a4a47 a26 where a26 is the number of distinct feature strings in training data. In this example, only a32a73a34a36a35a37a35</Paragraph>
    <Paragraph position="2"> take the value a2 , all other features being zero.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Feature templates
</SectionTitle>
      <Paragraph position="0"> We now introduce some notation with which to describe the full set of global features. First, we assume the following primitives of an input candidate:  where the transformation is applied in the same way as the final feature type in the maximum entropy tagger. Each character in the word is mapped to its a17a19a18 a0a21a20 , but repeated consecutive character types are not repeated in the mapped string. For example, Animal would be mapped to Aa in this feature, G.M. would again be mapped to A.A..</Paragraph>
      <Paragraph position="2"> a2a48a47a4a47a4a47a33a79 is the same as a27a29a28 , but has an additional flag appended. The flag indicates whether or not the word appears in a dictionary of words which appeared more often lower-cased than capitalized in a large corpus of text. In our example, Animal appears in the lexicon, but G.M. does not, so the two values for a12a85a28 would be Aa1 and A.A.0 respectively. In addition, a17a86a28a19a30 a80 a28a86a30a33a27a29a28 and a12a85a28 are all defined to be NULL if a46a88a87 a2 or a46a88a89 a79 .</Paragraph>
      <Paragraph position="3"> Most of the features we describe are anchored on entity boundaries in the candidate segmentation. We will use &amp;quot;feature templates&amp;quot; to describe the features that we used. As an example, suppose that an entity Description Feature Template The whole entity string WE=a90a45a91 a90a48a92a91a94a93a96a95a98a97 a99a100a99a100a99 a90a45a101 The a102 a53 features within the entity FF=a102 a91 a102 a92a91a59a93a96a95a98a97 a99a100a99a37a99 a102 a101 The a103 a53 features within the entity GF=a103 a91 a103 a92a91a94a93a104a95a98a97 a99a37a99a19a99 a103 a101 The last word in the entity LW=a90a45a101 Indicates whether the last word is lower-cased LWLC=a105a101</Paragraph>
      <Paragraph position="5"> is seen from words a71 to a20 inclusive in a segmentation. Then the WE feature described in the previous section can be generated by the template WE=a80 a69 a80 a69a19a111a112a34 a47a4a47a4a47 a80a114a113 Applying this template to the three entities in the running example generates the three feature strings described in the previous section. As another example, consider the template FF=a27a5a69 a27a5a69a86a111a112a34 a47a4a47a4a47 a27 a113 . This will generate a feature string for each of the entities in a candidate, this time using the values a27a5a69</Paragraph>
      <Paragraph position="7"> rather than a80 a69 a47a4a47a4a47 a80 a113 . For the full set of feature templates that are anchored around entities, see figure 1.</Paragraph>
      <Paragraph position="8"> A second set of feature templates is anchored around quotation marks. In our corpus, entities (typically with long names) are often seen surrounded by quotes. For example, &amp;quot;The Day They Shot John Lennon&amp;quot;, the name of a band, appears in the running example. Define a71 to be the index of any double quotation marks in the candidate, a20 to be the index of the next (matching) double quotation marks if they appear in the candidate. Additionally, define a20a14a115 to be the index of the last word beginning with a lower case letter, upper case letter, or digit within the quotation marks. The first set of feature templates tracks the values of a12 a28 for the words within quotes:2</Paragraph>
      <Paragraph position="10"> 2We only included these features if a118a120a119a122a121a124a123a110a125a122a126a128a127 , to prevent an explosion in the length of feature strings.</Paragraph>
      <Paragraph position="11"> The next set of feature templates are sensitive to whether the entire sequence between quotes is tagged as a named entity. Define a129 a115 to be a2 if</Paragraph>
      <Paragraph position="13"> if the sequence of words within the quotes is tagged as a single entity). Also define a134 to be the number of upper cased words within the quotes, a135 to be the number of lower case words, and a129 to be a2 if a134a13a136a9a135 , a3 otherwise. Then two other templates are:</Paragraph>
      <Paragraph position="15"> In the &amp;quot;The Day They Shot John Lennon&amp;quot; example we would have a129 a115 a38 a2 provided that the entire sequence within quotes was tagged as an entity. Additionally, a134a137a38a139a138 , a135a140a38 a3 , and a129a141a38 a2 . The values for a12 a58a69a86a111a112a34 a64 and a12 a113 a74 would be a142a144a143 a2 and a142a144a143 a3 (these features are derived from The and Lennon, which respectively do and don't appear in the capitalization lexicon). This would give QF=a2 a138 a3 a142a144a143 a2 a142a144a143 a3 and QF2=a2 a2 a142a144a143 a2 a142a145a143 a3 .</Paragraph>
      <Paragraph position="16"> At this point, we have fully described the representation used as input to the reranking algorithms.</Paragraph>
      <Paragraph position="17"> The maximum-entropy tagger gives 20 proposed segmentations for each input sentence. Each candidate a16 is represented by the log probability a135</Paragraph>
      <Paragraph position="19"> from the tagger, as well as the values of the global features a32a75a69 a22 a16a25a24 for a71a146a38 a2a48a47a4a47a4a47 a26 . In the next section we describe algorithms which blend these two sources of information, the aim being to improve upon a strategy which just takes the candidate from the tagger with the highest score for a135 a22 a16a25a24 .</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Ranking Algorithms
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Notation
</SectionTitle>
      <Paragraph position="0"> This section introduces notation for the reranking task. The framework is derived by the transformation from ranking problems to a margin-based classification problem in (Freund et al. 1998). It is also related to the Markov Random Field methods for parsing suggested in (Johnson et al. 1999), and the boosting methods for parsing in (Collins 2000). We consider the following set-up: a15 Training data is a set of example input/output pairs. In tagging we would have training examples a147 a71 a28 a30a37a17 a28a19a148 where each a71 a28 is a sentence and each a17 a28 is the correct sequence of tags for that sentence.</Paragraph>
      <Paragraph position="1"> a15 We assume some way of enumerating a set of candidates for a particular sentence. We use a16a75a28a150a149 to denote the a151 'th candidate for the a46 'th sentence in training data, and a152 a22 a71 a28 a24a153a38 a147 a16 a28a98a34 a30a37a16 a28a66a74 a47a4a47a4a47a148 to denote the set of candidates for a71a14a28 . In this paper, the top a154 outputs from a maximum entropy tagger are used as the set of candidates.</Paragraph>
      <Paragraph position="2">  for a71a158a38 a2a48a47a4a47a4a47 a26 . The features could be arbitrary functions of the candidates; our hope is to include features which help in discriminating good candidates from bad ones.</Paragraph>
      <Paragraph position="3"> a15 Finally, the parameters of the model are a vector of a26 a132 a2 parameters, a161a162a38 a147 a80 a35a85a30 a80 a34 a47a4a47a4a47 a80a164a163 a148 . The ranking function is defined as  This function assigns a real-valued number to a candidate a16 . It will be taken to be a measure of the plausibility of a candidate, higher scores meaning higher plausibility. As such, it assigns a ranking to different candidate structures for the same sentence, 3In the event that multiple candidates get the same, highest score, the candidate with the highest value of log-likelihood a167 under the baseline model is taken as a168</Paragraph>
      <Paragraph position="5"> and in particular the output on a training or test example a71 is a170a85a171a110a160a173a172a124a170a29a174a107a175a85a176a52a177 a58a69 a64 a129 a22 a16a122a30a37a161a156a24 . In this paper we take the features a32a75a69 to be fixed, the learning problem being to choose a good setting for the parameters a161 .</Paragraph>
      <Paragraph position="6"> In some parts of this paper we will use vector notation. Define a178 a22 a16a73a24 to be the vector</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 The boosting algorithm
</SectionTitle>
      <Paragraph position="0"> The first algorithm we consider is the boosting algorithm for ranking described in (Collins 2000). The algorithm is a modification of the method in (Freund et al. 1998). The method can be considered to be a greedy algorithm for finding the parameters a161 that  retical motivation for this algorithm goes back to the PAC model of learning. Intuitively, it is useful to note that this loss function is an upper bound on the number of &amp;quot;ranking errors&amp;quot;, a ranking error being a case where an incorrect candidate gets a higher value for a129 than a correct candidate. This follows because for all a16 , a20 a116 a175 a136a193a192a70a194a16a75a195 , where we define a192a70a194a16a75a195 to be a2 for a16a197a196 a3 , and a3 otherwise. Hence  where a198a54a28 a62a149a216a38a217a129 a22 a16a75a28 a62a34a11a30a37a161a156a24a133a200a54a129 a22 a16a70a28 a62a149a5a30a37a161a156a24 . The boosting algorithm chooses the feature/update pair a209a96a218 a30a33a210 a218 which is optimal in terms of minimizing the loss function, i.e.,</Paragraph>
      <Paragraph position="2"> and then makes the update a80a114a211a11a219 a38 a80a114a211a11a219 a132a220a210 a218 .</Paragraph>
      <Paragraph position="3"> Figure 2 shows an algorithm which implements this greedy procedure. See (Collins 2000) for a full description of the method, including justification that the algorithm does in fact implement the update in Eq. 1 at each iteration.4 The algorithm relies on the following arrays:</Paragraph>
      <Paragraph position="5"> Thus a142 a111a211 is an index from features to correct/incorrect candidate pairs where the a209 'th feature takes value a2 on the correct candidate, and value a3 on the incorrect candidate. The array a142 a116a211 is a similar index from features to examples. The arrays a222 a111</Paragraph>
      <Paragraph position="7"> are reverse indices from training examples to features.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 The voted perceptron
</SectionTitle>
      <Paragraph position="0"> Figure 3 shows the training phase of the perceptron algorithm, originally introduced in (Rosenblatt 1958). The algorithm maintains a parameter vector a161 , which is initially set to be all zeros. The algorithm then makes a pass over the training set, at each training example storing a parameter vector a161 a28 for a46a144a38 a2a48a47a4a47a4a47a33a79 . The parameter vector is only modified when a mistake is made on an example. In this case the update is very simple, involving adding the difference of the offending examples' representations (a161 a28 a38a223a161 a28a57a116a73a34 a132a224a178 a22 a16a75a28a57a34a108a24a88a200a220a178 a22 a16a70a28a199a149a29a24 in the figure). See (Cristianini and Shawe-Taylor 2000) chapter 2 for discussion of the perceptron algorithm, and theory justifying this method for setting the parameters.</Paragraph>
      <Paragraph position="1"> In the most basic form of the perceptron, the parameter values a161a153a225 are taken as the final parameter settings, and the output on a new test example with a16a104a149 for a151a131a38 a2a48a47a4a47a4a47 a26 is simply the highest  of the perceptron, the voted perceptron. The training phase is identical to that in figure 3. Note, however, that all parameter vectors a161 a28 for a46a208a38 a2a48a47a4a47a4a47a110a79 are stored. Thus the training phase can be thought of as a way of constructing a79 different parameter settings. Each of these parameter settings will have its own highest ranking candidate, a16  a16 a149 a24 . The idea behind the voted perceptron is to take each of the a79 parameter settings to &amp;quot;vote&amp;quot; for a candidate, and the candidate which gets the most votes is returned as the most likely candidate. See figure 4 for the algorithm.5</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>