<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2117">
  <Title>Boosting Statistical Word Alignment Using Labeled and Unlabeled Data</Title>
  <Section position="6" start_page="917" end_page="918" type="evalu">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> With the data in section 5.1, we get the word alignment results shown in table 2. For all of the methods in this table, we perform bi-directional (source to target and target to source) word alignment, and obtain two alignment results on the testing set. Based on the two results, we get the &amp;quot;refined&amp;quot; combination as described in Och and Ney (2000). Thus, the results in table 2 are those of the &amp;quot;refined&amp;quot; combination. For EM training, we use the GIZA++ toolkit  .</Paragraph>
    <Paragraph position="1"> In this paper, we take English to Chinese word alignment as a case study.</Paragraph>
    <Section position="1" start_page="917" end_page="917" type="sub_section">
      <SectionTitle>
5.1 Data
</SectionTitle>
      <Paragraph position="0"> We have two kinds of training data from general domain: Labeled Data (LD) and Unlabeled Data (UD). The Chinese sentences in the data are automatically segmented into words. The statistics for the data is shown in Table 1. The labeled data is manually word aligned, including 156,421 alignment links.</Paragraph>
    </Section>
    <Section position="2" start_page="917" end_page="917" type="sub_section">
      <SectionTitle>
Results of Supervised Methods
</SectionTitle>
      <Paragraph position="0"> Using the labeled data, we use two methods to estimate the parameters in IBM model 4: one is to use the EM algorithm, and the other is to estimate the parameters directly from the labeled data as described in section 3. In table 2, the method &amp;quot;Labeled+EM&amp;quot; estimates the parameters with the EM algorithm, which is an unsupervised method without boosting. And the method &amp;quot;Labeled+Direct&amp;quot; estimates the parameters directly from the labeled data, which is a supervised method without boosting. &amp;quot;Labeled+EM+Boost&amp;quot; and &amp;quot;Labeled+Direct+Boost&amp;quot; represent the two supervised boosting methods for the above two parameter estimation methods.</Paragraph>
      <Paragraph position="1">  We use 1,000 sentence pairs as testing set, which are not included in LD or UD. The testing set is also manually word aligned, including 8,634 alignment links in the testing set  .</Paragraph>
    </Section>
    <Section position="3" start_page="917" end_page="918" type="sub_section">
      <SectionTitle>
5.2 Evaluation Metrics
</SectionTitle>
      <Paragraph position="0"> We use the same evaluation metrics as described in Wu et al. (2005), which is similar to those in (Och and Ney, 2000). The difference lies in that Wu et al. (2005) take all alignment links as sure links.</Paragraph>
      <Paragraph position="1"> Our methods that directly estimate parameters in IBM model 4 are better than that using the EM algorithm. &amp;quot;Labeled+Direct&amp;quot; is better than &amp;quot;Labeled+EM&amp;quot;, achieving a relative error rate reduction of 22.97%. And &amp;quot;Labeled+Direct+Boost&amp;quot; is better than &amp;quot;Labeled+EM+Boost&amp;quot;, achieving a relative error rate reduction of 22.98%. In addition, the two boosting methods perform better than their corresponding methods without If we use to represent the set of alignment links identified by the proposed method and to denote the reference alignment set, the meth- null For a non one-to-one link, if m source words are aligned to n target words, we take it as one alignment link instead of m[?]n alignment links.</Paragraph>
      <Paragraph position="2">  It is located at http://www.fjoch.com/ GIZA++.html.  boosting. For example, &amp;quot;Labeled+Direct+Boost&amp;quot; achieves an error rate reduction of 9.92% as compared with &amp;quot;Labeled+Direct&amp;quot;.</Paragraph>
    </Section>
    <Section position="4" start_page="918" end_page="918" type="sub_section">
      <SectionTitle>
Results of Unsupervised Methods
</SectionTitle>
      <Paragraph position="0"> With the unlabeled data, we use the EM algorithm to estimate the parameters in the model.</Paragraph>
      <Paragraph position="1"> The method &amp;quot;Unlabeled+EM&amp;quot; represents an unsupervised method without boosting. And the method &amp;quot;Unlabeled+EM+Boost&amp;quot; uses the same unsupervised Adaboost algorithm as described in Wu and Wang (2005).</Paragraph>
      <Paragraph position="2"> The boosting method &amp;quot;Unlabeled+EM+Boost&amp;quot; achieves a relative error rate reduction of 16.25% as compared with &amp;quot;Unlabeled+EM&amp;quot;. In addition, the unsupervised boosting method &amp;quot;Unlabeled+EM+Boost&amp;quot; performs better than the supervised boosting method &amp;quot;Labeled+Direct+ Boost&amp;quot;, achieving an error rate reduction of 10.90%. This is because the size of labeled data is too small to subject to data sparseness problem.</Paragraph>
    </Section>
    <Section position="5" start_page="918" end_page="918" type="sub_section">
      <SectionTitle>
Results of Semi-Supervised Methods
</SectionTitle>
      <Paragraph position="0"> By using both the labeled and the unlabeled data, we interpolate the models trained by &amp;quot;Labeled+Direct&amp;quot; and &amp;quot;Unlabeled+EM&amp;quot; to get an interpolated model. Here, we use &amp;quot;interpolated&amp;quot; to represent it. &amp;quot;Method 1&amp;quot; and &amp;quot;Method 2&amp;quot; represent the semi-supervised boosting methods described in section 4.2 and section 4.3, respectively. &amp;quot;Combination&amp;quot; denotes the method described in section 4.4, which combines &amp;quot;Method 1&amp;quot; and &amp;quot;Method 2&amp;quot;. Both of the weights  l in equation (11) are set to 0.5.</Paragraph>
      <Paragraph position="1"> &amp;quot;Interpolated&amp;quot; performs better than the methods using only labeled data or unlabeled data. It achieves relative error rate reductions of 12.61% and 8.82% as compared with &amp;quot;Labeled+Direct&amp;quot; and &amp;quot;Unlabeled+EM&amp;quot;, respectively.</Paragraph>
      <Paragraph position="2"> Using an interpolation model, the two semi-supervised boosting methods &amp;quot;Method 1&amp;quot; and &amp;quot;Method 2&amp;quot; outperform the supervised boosting method &amp;quot;Labeled+Direct+Boost&amp;quot;, achieving a relative error rate reduction of 12.34% and 17.32% respectively. In addition, the two semi-supervised boosting methods perform better than the unsupervised boosting method &amp;quot;Unlabeled+ EM+Boost&amp;quot;. &amp;quot;Method 1&amp;quot; performs slightly better than &amp;quot;Unlabeled+EM+Boost&amp;quot;. This is because we only change the distribution of the labeled data in &amp;quot;Method 1&amp;quot;. &amp;quot;Method 2&amp;quot; achieves an error rate reduction of 7.77% as compared with &amp;quot;Unlabeled+EM+Boost&amp;quot;. This is because we use the interpolated model in our semi-supervised boosting method, while &amp;quot;Unlabeled+EM+Boost&amp;quot; only uses the unsupervised model.</Paragraph>
      <Paragraph position="3"> Moreover, the combination of the two semi-supervised boosting methods further improves the results, achieving relative error rate reductions of 18.20% and 13.27% as compared with &amp;quot;Method 1&amp;quot; and &amp;quot;Method 2&amp;quot;, respectively. It also outperforms both the supervised boosting method &amp;quot;Labeled+Direct+Boost&amp;quot; and the unsupervised boosting method &amp;quot;Unlabeled+EM+ Boost&amp;quot;, achieving relative error rate reductions of 28.29% and 19.52% respectively.</Paragraph>
      <Paragraph position="4"> Summary of the Results From the above result, it can be seen that all boosting methods perform better than their corresponding methods without boosting. The semi-supervised boosting methods outperform the supervised boosting method and the unsupervised boosting method.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>