<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1004">
<Title>Segment Choice Models: Feature-Rich Models for Global Distortion in Statistical Machine Translation</Title>
<Section position="9" start_page="29" end_page="31" type="concl">
<SectionTitle> 6 Summary and Discussion </SectionTitle>
<Paragraph position="0"> In this paper, we presented a new class of probabilistic models for distortion, based on the choices made during translation. Unlike some recent distortion models (Kumar and Byrne, 2005; Tillmann and Zhang, 2005; Tillmann, 2004), these Segment Choice Models (SCMs) allow phrases to be moved globally, between any positions in the sentence.</Paragraph>
<Paragraph position="1"> They also lend themselves to quick offline comparison by means of a new metric called disperp.</Paragraph>
<Paragraph position="2"> We developed a decision-tree (DT) based SCM whose parameters were optimized on a "dev" corpus via disperp. Two variants of the DT system were compared experimentally with two systems using a distortion penalty on a Chinese-to-English task. In pairwise bootstrap comparisons, the systems with DT-based distortion outperformed the penalty-based systems more than 99% of the time.</Paragraph>
<Paragraph position="3"> The computational cost of training the DTs on large quantities of data is comparable to that of training phrase tables on the same data (large but manageable) and increases linearly with the amount of training data. However, DT training currently faces a major problem: only a low proportion of Chinese-English sentence pairs (about 27%) can be fully segment-aligned and thus used for DT training. This may result in selection bias that impairs performance. We plan to implement an alignment algorithm with smoothed phrase tables (Johnson et al., 2006) to achieve segment alignment on 100% of the training data.</Paragraph>
<Paragraph position="4"> Decoding time with the DT-based distortion model is roughly proportional to the square of the number of tokens in the source sentence. Thus, long sentences pose a challenge, particularly during the weight optimization step. In experiments on other language pairs reported elsewhere (Johnson et al., 2006), we applied a heuristic: DT training and decoding involved source sentences with 60 or fewer tokens, while longer sentences were handled with the distortion penalty. A more principled approach would be to divide long source sentences into chunks not exceeding 60 or so tokens, within each of which reordering is allowed, but which cannot themselves be reordered.</Paragraph>
<Paragraph position="5"> The experiments above used a segmentation model that was simply a count of the number of source segments (sometimes called a "phrase penalty"), but we are currently exploring more sophisticated models. Once we have found the best segmentation model, we will improve the system's current naive single-word segmentation of the remaining source sentence during decoding, and construct a more accurate future cost function for beam search. Another obvious system improvement would be to incorporate more advanced word-based features in the DTs, such as questions about word classes (Tillmann and Zhang, 2005; Tillmann, 2004).</Paragraph>
<Paragraph position="6"> We also plan to apply SCMs to rescoring N-best lists from the decoder. For rescoring, one could apply several SCMs, some with assumptions differing from those of the decoder. For example, one could apply right-to-left SCMs, or "distorted target" SCMs, which assume that a target hypothesis generated the source sentence rather than vice versa.</Paragraph>
<Paragraph position="7"> Finally, we are contemplating an entirely different approach to DT-based SCMs for decoding. In this approach, only one DT would be used, with only two output classes that could be called "C" and "N". The input to such a tree would be a particular segment in the remaining source sentence, together with contextual information (e.g., the sequence of segments already chosen). The DT would estimate the probability Pr(C) that the specified segment is "chosen" and the probability Pr(N) that it is "not chosen". This would eliminate the need to guess the segmentation of the remaining source sentence.</Paragraph>
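A minimal sketch of how this two-class formulation might be used during decoding, assuming a hypothetical predict_choice_prob() function standing in for the trained DT, an illustrative feature set, and the assumption (not specified above) that per-segment Pr(C) scores are renormalized over the remaining segments to yield a distribution over the next choice:

    # Illustrative sketch (not part of the original system): turning per-segment
    # Pr(C) scores from a single two-class "chosen"/"not chosen" DT into a
    # distribution over the next segment choice. The feature set and
    # predict_choice_prob() are hypothetical stand-ins for the trained tree.
    from typing import Dict, List, Sequence

    def segment_features(candidate: Sequence[str],
                         chosen_so_far: List[Sequence[str]],
                         remaining: List[Sequence[str]]) -> Dict[str, object]:
        """Contextual features for one candidate segment (illustrative only)."""
        return {
            "cand_length": len(candidate),
            "num_already_chosen": len(chosen_so_far),
            "num_remaining": len(remaining),
            "cand_first_word": candidate[0],
            "prev_choice_last_word": chosen_so_far[-1][-1] if chosen_so_far else "BOS",
        }

    def predict_choice_prob(features: Dict[str, object]) -> float:
        """Stand-in for the two-class DT: returns Pr(C) for one candidate."""
        # Dummy score so the sketch runs end to end; the real system would
        # query the trained decision tree here.
        return 1.0 / float(features["cand_length"])

    def next_choice_distribution(remaining: List[Sequence[str]],
                                 chosen_so_far: List[Sequence[str]]) -> List[float]:
        """Score every remaining segment with Pr(C), then renormalize so the
        distortion model assigns a proper distribution over the next choice
        (renormalization is our assumption about how the scores are combined)."""
        if not remaining:
            return []
        scores = [predict_choice_prob(segment_features(seg, chosen_so_far, remaining))
                  for seg in remaining]
        total = sum(scores)
        if total == 0.0:
            return [1.0 / len(remaining)] * len(remaining)  # uniform fallback
        return [s / total for s in scores]

    if __name__ == "__main__":
        # Toy example: three remaining source segments, one segment already chosen.
        remaining = [["zai", "beijing"], ["juxing"], ["le", "huiyi"]]
        chosen = [["zuotian"]]
        print(next_choice_distribution(remaining, chosen))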
</Section>
</Paper>