<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1034">
  <Title>New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete Structures, and the Voted Perceptron</Title>
  <Section position="7" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
6 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Parsing Wall Street Journal Text
</SectionTitle>
      <Paragraph position="0"> We used the same data set as that described in (Collins 2000). The Penn Wall Street Journal tree-bank (Marcus et al. 1993) was used as training and test data. Sections 2-21 inclusive (around 40,000 sentences) were used as training data, section 23 was used as the final test set. Of the 40,000 training sentences, the first 36,000 were used to train the perceptron. The remaining 4,000 sentences were used as development data, and for tuning parameters of the algorithm. Model 2 of (Collins 1999) was used to parse both the training and test data, producing multiple hypotheses for each sentence. In order to gain a representative set of training data, the 36,000 training sentences were parsed in 2,000 sentence chunks, each chunk being parsed with a model trained on the remaining 34,000 sentences (this prevented the initial model from being unrealistically &amp;quot;good&amp;quot; on the training sentences). The 4,000 development sentences were parsed with a model trained on the 36,000 training sentences. Section 23 was parsed with a model trained on all 40,000 sentences.</Paragraph>
      <Paragraph position="1"> The representation we use incorporates the probability from the original model, as well as the all-subtrees representation. We introduce a parameter a195 which controls the relative contribution of the two terms. If a196 a2a5a4a7a6 is the log probability of a tree a4 under the original probability model, and a1a3a2a5a4a7a6a197a19 a2  the feature vector under the all subtrees representation, then the new representation is a1  perceptron algorithm to use the probability from the original model as well as the subtrees information to rank trees. We would thus expect the model to do at least as well as the original probabilistic model.</Paragraph>
      <Paragraph position="2"> The algorithm in figure 1(b) was applied to the problem, with the inner product a1</Paragraph>
      <Paragraph position="4"> in the definition of a77 a2a5a4a7a6 . The algorithm in 1(b) runs in approximately quadratic time in the number of training examples. This made it somewhat expensive to run the algorithm over all 36,000 training sentences in one pass. Instead, we broke the training set into 6 chunks of roughly equal size, and trained 6 separate perceptrons on these data sets. This has the advantage of reducing training time, both because of the quadratic dependence on training set size, and also because it is easy to train the 6 models in parallel. The outputs from the 6 runs on test examples were combined through the voting procedure described in section 3.4.</Paragraph>
      <Paragraph position="5"> Figure 4 shows the results for the voted perceptron with the tree kernel. The parameters a195 and a165 were set to a85 a46a99 and a85 a46a193a201 respectively through tuning on the development set. The method shows a a85 a46a193a202a67a203 absolute improvement in average precision and recall (from 88.2% to 88.8% on sentences a166 a87a27a85a56a85 words), a 5.1% relative reduction in error. The boosting method of (Collins 2000) showed 89.6%/89.9% recall and precision on reranking approaches for the same datasets (sentences less than 100 words in length). (Charniak 2000) describes a different method which achieves very similar performance to (Collins 2000). (Bod 2001) describes experiments giving 90.6%/90.8% recall and precision for sentences of less than 40 words in length, using the all-subtrees representation, but using very different algorithms and parameter estimation methods from the perceptron algorithms in this paper (see section 7 for more discussion).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Named-Entity Extraction
</SectionTitle>
      <Paragraph position="0"> Over a period of a year or so we have had over one million words of named-entity data annotated. The data is drawn from web pages, the aim being to support a question-answering system over web data. A number of categories are annotated: the usual people, organization and location categories, as well as less frequent categories such as brand-names, scientific terms, event titles (such as concerts) and so on. As a result, we created a training set of 53,609 sentences (1,047,491 words), and a test set of 14,717 sentences (291,898 words).</Paragraph>
      <Paragraph position="1"> The task we consider is to recover named-entity boundaries. We leave the recovery of the categories of entities to a separate stage of processing. We evaluate different methods on the task through precision and recall.7 The problem can be framed as a tagging task - to tag each word as being either the start of an entity, a continuation of an entity, or not to be part of an entity at all. As a baseline model we used a maximum entropy tagger, very similar to the one described in (Ratnaparkhi 1996). Maximum entropy taggers have been shown to be highly competitive on a number of tagging tasks, such as part-of-speech tagging (Ratnaparkhi 1996), and named-entity recognition (Borthwick et. al 1998). Thus the maximum-entropy tagger we used represents a serious baseline for the task. We used a feature set which included the current, next, and previous word; the previous two tags; various capitalization and other features of the word being tagged (the full feature set is described in (Collins 2002a)).</Paragraph>
      <Paragraph position="2"> As a baseline we trained a model on the full 53,609 sentences of training data, and decoded the 14,717 sentences of test data using a beam search 7If a method proposes a204 entities on the test set, and a205 of these are correct then the precision of a method is a206a158a188a150a188a150a207a209a208a108a205a35a210a102a204 . Similarly, if a211 is the number of entities in the human annotated version of the test set, then the recall is a206a35a188a150a188a150a207a212a208a213a205a158a210a54a211 .  ods. &amp;quot;Imp.&amp;quot; is the relative error reduction given by using the perceptron. a214a119a215 precision, a216a194a215 recall, a217a119a215 F-measure. which keeps the top 20 hypotheses at each stage of a left-to-right search. In training the voted perceptron we split the training data into a 41,992 sentence training set, and a 11,617 sentence development set. The training set was split into 5 portions, and in each case the maximum-entropy tagger was trained on 4/5 of the data, then used to decode the remaining 1/5. In this way the whole training data was decoded. The top 20 hypotheses under a beam search, together with their log probabilities, were recovered for each training sentence. In a similar way, a model trained on the 41,992 sentence set was used to produce 20 hypotheses for each sentence in the development set.</Paragraph>
      <Paragraph position="3"> As in the parsing experiments, the final kernel incorporates the probability from the maximum entropy tagger, i.e. a1  under the tagging model, a1a3a2a5a4a7a6a83a16a59a1a14a2 a15 a6 is the tagging kernel described previously, and a195 is a parameter weighting the two terms. The other free parameter in the kernel is a165 , which determines how quickly larger structures are downweighted. In running several training runs with different parameter values, and then testing error rates on the development set, the best parameter values we found were a195 a19a220a85 a46a99 , a165a95a19a221a85 a46a193a192 . Figure 5 shows results on the test data for the baseline maximum-entropy tagger, and the voted perceptron. The results show a 15.6% relative improvement in F-measure.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>