<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1045">
  <Title>Applying Co-Training to Reference Resolution</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Experiments and Results
</SectionTitle>
    <Paragraph position="0"> For our experiments we implemented the standard Co-Training algorithm (as described in Section 3) in Java using the Weka machine learning library3. In contrast to other Co-Training approaches, we did not use Naive Bayes as base classifiers, but J48 decision trees, which are a Weka re-implementation of C4.5.</Paragraph>
    <Paragraph position="1"> The use of decision tree classifiers was motivated by the observation that they appeared to perform better on the task at hand.</Paragraph>
    <Paragraph position="2"> We conducted a number of experiments to investigate the question if Co-Training is beneficial for the task of training a classifier for coreference resolution. In previous work (Strube et al., 2002) we obtained quite different results for different types of anaphora, i.e. if we split the data according to the ana np feature into personal and possessive pronouns (PPER PPOS), proper names (NE), and definite NPs (def NP). Therefore we performed Co-Training experiments on subsets of our data defined by these NP forms, and on the whole data set.</Paragraph>
    <Paragraph position="3"> We determined the features for the two different views with the following procedure: We trained classifiers on each feature separately and chose the best one, adding the feature which produced it as the first feature of view 1. We then trained classifiers on all remaining features separately, again choosing the best one and adding its feature as the first feature of view 2. In the next step, we enhanced the first classifier by combining it with all remaining features separately. The classifier with the best performance was</Paragraph>
    <Paragraph position="5"> then chosen and its new feature added as the second feature of view 1. We then enhanced the second classifier in the same way by selecting from the remaining features the one that most improved it, adding this feature as the second one of view 2. This process was repeated until no features were left or no significant improvement was achieved, resulting in the views shown in Table 4 (features marked na were not available for the respective class). This way we determined two views which performed reasonably well separately.</Paragraph>
    <Paragraph position="6">  2. ante gram func X X X X 3. ante npform X X X X 4. ante agree X X X X 5. ante semanticc. X X X X 6. ana gram func X X X 7. ana npform na na X 8. ana agree X X X 9. ana semanticc. na X X na 10. wdist X X X X 11. ddist X X X X 12. mdist X X X X 13. syn par X X X 14. string ident X X X X 15. string match X X X X 16. ante med X X X X 17. ana med X X X X  For Co-Training, we committed ourselves to fixed parameter settings in order to reduce the complexity of the experiments. Settings are given in the relevant subsections, where the following abbreviations are used: L=size of labeled training set, P/N=number of positive/negative instances added per iteration. All reported Co-Training results are averaged over 5 runs utilizing randomized sequences of unlabeled instances.</Paragraph>
    <Paragraph position="7"> We compare the results we obtained with Co-Training with the initial result before the Co-Training process started (zero iterations, both views combined; denoted as XX 0its in the plots). For this, we used a conventional C4.5 decision tree classifier (J48 implementation, default settings) on labeled training data sets of the same size used for the respective Co-Training experiment. We did this in order to verify the quality of the training data and for obtaining reference values for comparison with the Co-Training classifiers.</Paragraph>
    <Paragraph position="8">  PPER PPOS. In Figure 1, three curves and three baselines are plotted: For 20 (L=20), 20 0its is the baseline, i.e. the initial result obtained by just combining the two initial classifiers. For 100, L=100, and for 200, L=200. The other settings were: P=1, N=1, Pool=10. As can be seen, the baselines slightly outperform the Co-Training curves (except for 100).  NE. Then we ran the Co-Training experiment with the NP form NE (i.e. proper names). Since the distribution of positive and negative examples in the labeled training data was quite different from the previous experiment, we used P=1, N=33, Pool=120. Since all results with La66 200 were equally poor, we started with L=200, where the results were closer to ones of classifiers using the whole data set. The resulting Co-Training curve degrades substantially. However, with a training size of 1000 and 2000 the Co-Training curves are above their baselines.  def NP. In the next experiment we tested the NP form def NP, a concept which can be expected to be far more difficult to learn than the previous two NP forms. Used settings were P=1, N=30, Pool=120. For La66 500, F-measure was near 0. With L=500 the Co-Training curve is way below the baseline. However, with L=1000 and L=2000 Co-Training does show some improvement.</Paragraph>
    <Paragraph position="9">  All. In the last experiment we trained our classifier on all NP forms, using P=1, N=33, Pool=120. With L=200 the baseline clearly outperforms Co-Training. Co-Training with L=1000 initially rises above the baselines, but then decreases after about 15 to 20 iterations. With L=2000 the Co-Training curve approximates its baseline and then degenerates. null</Paragraph>
  </Section>
class="xml-element"></Paper>