<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1056">
  <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. Semi-Supervised Learning of Partial Cognates using Bilingual Bootstrapping</Title>
  <Section position="8" start_page="445" end_page="447" type="evalu">
    <SectionTitle>
6 Evaluation and Results
</SectionTitle>
    <Paragraph position="0"> In this section we present the results that we obtained with the supervised and semi-supervised methods that we applied to disambiguate partial cognates.</Paragraph>
    <Paragraph position="1"> Due to space constraints, we show results only for the test sets and not for the 10-fold cross-validation experiments on the training data. For the same reason, we present only the results obtained with the French side of the parallel corpus, even though we also trained classifiers on the English sentences. The results for the 10-fold cross-validation and for the English sentences differ little from those in Table 5, which describes the supervised method. Table 6 and Table 7 present results for monolingual bootstrapping (MB) and bilingual bootstrapping (BB). We also performed further experiments that combined the MB and BB techniques; their results are presented in Table 9.</Paragraph>
    <Paragraph position="2"> Our goal is to disambiguate partial cognates in general, not only in the particular domain of Hansard and EuroParl. For this reason we used another set of automatically determined sentences from a multi-domain parallel corpus.</Paragraph>
    <Paragraph position="3"> The new set of (multi-domain) sentences was extracted in the same manner as the seeds from Hansard and EuroParl. The new parallel corpus is small, approximately 1.5 million words, but contains texts from different domains: magazine articles, modern fiction, texts from international organizations, and academic textbooks. We use this set of sentences in our experiments to show that our methods perform well on multi-domain corpora, in keeping with our aim of disambiguating partial cognates (PC) in different domains.</Paragraph>
    <Paragraph position="4"> From this parallel corpus we were able to extract the number of sentences shown in Table 8.</Paragraph>
    <Paragraph position="5"> With this new set of sentences we performed different experiments for both MB and BB. All results are described in Table 9. Due to space constraints, we report only the average results over all 10 pairs of partial cognates.</Paragraph>
    <Paragraph position="6"> The symbols used in Table 9 are: S - the seed training corpus; TS - the seed test set; BNC and LM - sentences extracted from the BNC and LeMonde, respectively (Table 4); and NC - the sentences extracted from the new multi-domain corpus. The + symbol means that we combined all the sentences extracted from the respective corpora.</Paragraph>
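    <Paragraph> The + notation of Table 9 can be read as plain concatenation of sentence sets. The following is a minimal sketch of that reading, not the code used in the experiments: the function name combine, the dictionary layout, and the placeholder sentences are all assumptions introduced for illustration.

```python
def combine(corpora, spec):
    """Concatenate the sentence lists named in a spec such as "S+BNC+LM".

    `corpora` maps a Table 9 symbol (S, TS, BNC, LM, NC) to a list of
    sentences. The helper and its arguments are hypothetical.
    """
    combined = []
    for name in spec.split("+"):
        name = name.strip()
        if name not in corpora:
            raise KeyError("unknown corpus symbol: " + name)
        combined.extend(corpora[name])
    return combined


# Placeholder sentences only; these are not data from the paper.
corpora = {
    "S": ["seed train 1", "seed train 2"],
    "TS": ["seed test 1"],
    "BNC": ["bnc 1"],
    "LM": ["lemonde 1"],
    "NC": ["new corpus 1", "new corpus 2"],
}

train = combine(corpora, "S+BNC+LM")  # seeds plus both monolingual sets
test = combine(corpora, "TS+NC")      # seed test set plus multi-domain set
```

    Under this reading, a setting such as TS+NC simply evaluates on the seed test sentences followed by the multi-domain sentences.</Paragraph>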
    <Section position="1" start_page="446" end_page="447" type="sub_section">
      <SectionTitle>
6.1 Discussion of the Results
</SectionTitle>
      <Paragraph position="0"> The results of the experiments and the methods that we propose show that we can successfully learn from unlabeled data, and that the noise introduced by the seed set collection is tolerated by the ML techniques that we use.</Paragraph>
      <Paragraph position="1"> Some of the results that we present in Table 9 are not as good as others. The important point is that every time we used MB, BB, or both, there was an improvement. For some experiments MB did better, for others BB improved the performance more, and for some combinations MB together with BB worked best.</Paragraph>
      <Paragraph position="2"> In Tables 5 and 7 we show that BB improved the results of the NB-K classifier by 3.24% compared with the supervised method (no bootstrapping) when we tested only on the test set (TS), which represents 1/3 of the initially collected parallel sentences. This improvement is not statistically significant according to a t-test. In Table 9 we show that our proposed methods bring improvements for different combinations of training and testing sets. Table 9, lines 1 and 2, shows that BB with NB-K brought an improvement of 1.95% over no bootstrapping when we tested on the multi-domain corpus NC. For the same setting, there was an improvement of 1.55% when we tested on TS (Table 9, lines 6 and 8). When we tested on the combination TS+NC, BB again brought an improvement, of 2.63% over no bootstrapping (Table 9, lines 10 and 12). The difference between MB and BB in this setting is 6.86% (Table 9, lines 11 and 12). According to a t-test, the 1.95% and 6.86% improvements are statistically significant.</Paragraph>
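      <Paragraph> The text does not state which variant of the t-test was applied. As one plausible reading, the sketch below computes a paired t statistic over per-word accuracies of two systems (for example, no bootstrapping versus BB on the 10 pairs of partial cognates). The function names, the choice of a paired test, and the two-sided 0.05 threshold are assumptions made for illustration; no scores from the paper are reproduced here.

```python
import math
import statistics

# Two-sided critical value of Student's t at alpha = 0.05 with 9 degrees
# of freedom, i.e. 10 paired observations (assumed setting).
T_CRITICAL_DF9 = 2.262


def paired_t_statistic(baseline, improved):
    """Return (t, degrees_of_freedom) for a paired t-test.

    `baseline` and `improved` hold one score per item, in the same order.
    """
    if len(baseline) != len(improved):
        raise ValueError("score lists must have the same length")
    diffs = [b - a for a, b in zip(baseline, improved)]
    n = len(diffs)
    if n == 1:
        raise ValueError("need at least two paired scores")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


def is_significant(t, critical):
    """Two-sided decision: reject the null hypothesis when |t| exceeds the critical value."""
    return abs(t) > critical
```

      With 10 paired accuracies the test has 9 degrees of freedom, so under these assumptions a difference would count as significant when the absolute t value exceeds 2.262.</Paragraph>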
      <Paragraph position="3"> The number of features extracted from the seeds more than doubled at each MB and BB experiment, showing that even though we started with seeds from a language-restricted domain, the method is able to capture knowledge from different domains as well. Besides the change in the number of features, the domain of the features also changed from the parliamentary one to other, more general ones, showing that the method will be able to disambiguate sentences where the partial cognates cover different types of context.</Paragraph>
      <Paragraph position="4"> Unlike previous work on monolingual or bilingual bootstrapping, we tried to disambiguate not only words whose senses are very different, e.g. plant, with the sense of a biological plant or the sense of a factory. In our set of partial cognates, the French word route is difficult to disambiguate even for humans: it has a cognate sense when it refers to a maritime or trade route and a false-friend sense when it means road. The same observation applies to client (the cognate sense is client, and the false-friend sense is customer, patron, or patient) and to circulation (cognate in air or blood circulation, false friend in street traffic).</Paragraph>
    </Section>
  </Section>
</Paper>