<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1070">
  <Title>Exploiting Comparable Corpora and Bilingual Dictionaries for Cross-Language Text Categorization</Title>
  <Section position="7" start_page="557" end_page="559" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> The CLTC task has rarely been attempted in the literature, and standard evaluation benchmarks are not available. For this reason, we developed an evaluation task using a news corpus kindly put at our disposal by AdnKronos, an important Italian news provider. The corpus consists of 32,354 Italian and 27,821 English news stories, partitioned by AdnKronos into four fixed categories: QUALITY OF LIFE, MADE IN ITALY, TOURISM, CULTURE AND SCHOOL. The English and the Italian corpora are comparable, in the sense stated in Section 2, i.e. they cover the same topics and the same period of time. Some news stories are translated into the other language (though no alignment information is given), some are present only in the English set, and some only in the Italian set. The average length of the news stories is about 300 words. We randomly split both the English and the Italian parts into 75% training and 25% test (see Table 2). We processed the corpus with PoS taggers, keeping only nouns, verbs, adjectives and adverbs.</Paragraph>
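The preprocessing described above (random 75%/25% split, PoS-based filtering) can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names, the tag set, and the representation of a tagged document as `(lemma, pos)` pairs are all assumptions.

```python
import random

# Assumed coarse tag set standing in for whatever the PoS tagger emits.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}

def filter_content_words(doc):
    """Keep only nouns, verbs, adjectives and adverbs, as in the paper."""
    return [lemma for lemma, pos in doc if pos in CONTENT_POS]

def train_test_split(docs, train_frac=0.75, seed=0):
    """Randomly split the documents into 75% training and 25% test."""
    rng = random.Random(seed)
    shuffled = docs[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Toy tagged documents (one Italian sentence fragment, repeated).
docs = [[("castello", "NOUN"), ("visitare", "VERB"), ("il", "DET")]] * 100
train, test = train_test_split(docs)
```

The same split would be applied independently to the English and Italian parts, since the two sets are comparable but not aligned.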
    <Paragraph position="1"> Table 3 reports the vocabulary dimensions of the English and Italian training partitions, the vocabulary of the merged training, and how many common lemmata are present (about 14% of the total). Among the common lemmata, 97% are nouns, and most of them are proper nouns. Thus the initial term-by-document matrix is a 43,384 × 45,132 matrix, while the DLSA was acquired using 400 dimensions.</Paragraph>
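Acquiring a domain model from the term-by-document matrix amounts to a truncated SVD keeping the top-k dimensions (k = 400 in the paper). A minimal sketch, with a toy matrix in place of the real 43,384 × 45,132 one; the function name and the exact weighting are assumptions:

```python
import numpy as np

def build_domain_model(term_doc, k):
    """Truncated SVD of the term-by-document matrix: keep the top-k
    left singular vectors, scaled by their singular values, as term
    vectors in a k-dimensional latent (domain) space."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k] * s[:k]

# Toy 50-term x 30-document matrix standing in for the real one.
rng = np.random.default_rng(0)
X = rng.random((50, 30))
D = build_domain_model(X, k=5)  # one 5-dimensional vector per term
```

On the real matrix one would use a sparse, randomized SVD rather than a dense one, but the mapping from terms to latent dimensions is the same.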
    <Paragraph position="2"> As far as the CLTC task is concerned, we tried several possible options. In all cases we both trained on the English part and classified the Italian part, and trained on the Italian part and classified the English part. When used, the MDM was acquired by running the SVD only on the joint (English and Italian) training parts.</Paragraph>
    <Paragraph position="3"> Using only comparable corpora. Figure 2 reports the performance without any use of bilingual dictionaries. Each graph shows the learning curves obtained with a BoW kernel (considered here as a baseline) and with the Multilingual Domain Kernel. We can observe that the latter largely outperforms the standard BoW approach. Analyzing the learning curves, it is worth noting that as the amount of training data increases, the performance of the Multilingual Domain Kernel keeps improving, suggesting that more training data could further improve the results. Using bilingual dictionaries. Figure 3 reports the learning curves obtained by adding the synset-ids of the monosemic words in the corpus.</Paragraph>
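The contrast between the two kernels can be made concrete. A BoW kernel is the inner product of term vectors, so it sees almost no overlap across languages; a domain kernel first maps both documents through a term-by-domain matrix learnt from the joint corpus, so translated terms that co-occur in the same latent domains become comparable. A minimal sketch under those assumptions (the function names are ours, not the paper's):

```python
import numpy as np

def bow_kernel(x, y):
    """Baseline: inner product of raw term-frequency vectors."""
    return float(x @ y)

def domain_kernel(x, y, D):
    """Project both documents through the term-by-domain matrix D
    (one row per term, one column per latent domain) before taking
    the inner product, as a domain/LSA-style kernel would."""
    return float((x @ D) @ (y @ D))

# Toy example: 3-term vocabulary, a single shared latent domain.
x = np.array([1.0, 0.0, 2.0])
y = np.array([0.0, 1.0, 1.0])
D = np.ones((3, 1))
```

With such kernels, the SVM itself is unchanged; only the implicit feature space differs.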
    <Paragraph position="4"> As expected, the use of a multilingual repository improves the classification results. Note that the MDM outperforms the BoW kernel.</Paragraph>
    <Paragraph position="5"> Figure 4 shows the results of adding to the English and Italian parts of the corpus all the synset-ids (i.e. monosemic and polysemic) and all the translations found in the Collins dictionary, respectively. These are the best results we obtained in our experiments. In these figures we also report the performance of the corresponding monolingual TC (we used the SVM with the BoW kernel), which can be considered as an upper bound. We can observe that the CLTC results are quite close to the performance obtained in the monolingual classification tasks.</Paragraph>
  </Section>
</Paper>