<?xml version="1.0" standalone="yes"?> <Paper uid="P04-3005"> <Title>Customizing Parallel Corpora at the Document Level</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Evaluation Data </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Medical Domain Corpus: Springer </SectionTitle> <Paragraph position="0"> The Springer corpus consists of 9,640 documents (titles plus abstracts of medical journal articles), each in English and in German, with 25 queries in both languages and relevance judgments made by native German speakers who are medical experts and are fluent in English. We split this parallel corpus into two subsets, using the first (4,688 documents) for training and the second (4,952 documents) as the test set in all our experiments. This configuration allows us to experiment with CLIR in both directions (EN-DE and DE-EN). We applied an alignment algorithm to the training documents and obtained a sentence-aligned parallel corpus with about 30K sentences in each language.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Training Corpora </SectionTitle> <Paragraph position="0"> In addition to Springer, we used four other English-German parallel corpora for training: * NEWS is a collection of 59K sentence-aligned news stories, downloaded from the web (1996-2000), and available at</Paragraph> <Paragraph position="2"> mining the web (Nie et al., 2000), in no particular domain * EUROPARL is a parallel corpus provided by (Koehn). Its documents are sentence-aligned European Parliament proceedings.
This is a large collection that has been used successfully for CLEF, when the target corpora were collections of news stories (Rogati and Yang, 2003).</Paragraph> <Paragraph position="3"> * MEDTITLE is an English-German parallel corpus consisting of 549K paired titles of medical journal articles. These titles were gathered from the PubMed online database (http://www.ncbi.nlm.nih.gov/PubMed/).</Paragraph> <Paragraph position="4"> Table 1 presents a summary of the characteristics of the five training corpora.</Paragraph> </Section> </Section> </Paper>