<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3114">
  <Title>Out-of-domain test set</Title>
  <Section position="3" start_page="103" end_page="103" type="intro">
    <SectionTitle>
ID Participant
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="103" end_page="103" type="sub_section">
      <SectionTitle>
1.2 Test Data
</SectionTitle>
      <Paragraph position="0"> The test data was again drawn from a segment of the Europarl corpus from the fourth quarter of 2000, which is excluded from the training data. Participants were also provided with two sets of 2,000 sentences of parallel text to be used for system development and tuning.</Paragraph>
      <Paragraph position="1"> In addition to the Europarl test set, we also collected 29 editorials from the Project Syndicate website, which are published in all four languages of the shared task. We aligned the texts at the sentence level across all four languages, resulting in 1,064 sentences per language. For statistics on this test set, refer to Figure 1.</Paragraph>
      <Paragraph position="2"> The out-of-domain test set differs from the Europarl data in various ways. The texts are editorials rather than speech transcripts. The domain is general politics, economics and science. However, it is also mostly political content (even if not focused on the internal workings of the European Union) and opinion.</Paragraph>
    </Section>
    <Section position="2" start_page="103" end_page="103" type="sub_section">
      <SectionTitle>
1.3 Participants
</SectionTitle>
      <Paragraph position="0"> We received submissions from 14 groups from 11 institutions, as listed in Figure 2. Most of these groups follow a phrase-based statistical approach to machine translation. Microsoft's approach uses dependency trees; others use hierarchical phrase models. Systran submitted their commercial rule-based system, which was not tuned to the Europarl corpus. About half of the participants of last year's shared task participated again. The other half was replaced by new participants, so we ended up with roughly the same number. Compared to last year's shared task, the participants represent more long-term research efforts. This may be the sign of a maturing research environment.</Paragraph>
      <Paragraph position="1"> While building a machine translation system is a serious undertaking, in the future we hope to attract more newcomers to the field by keeping the barrier to entry as low as possible.</Paragraph>
      <Paragraph position="2"> For more on the participating systems, please refer to the respective system descriptions in the proceedings of the workshop.</Paragraph>
    </Section>
  </Section>
</Paper>