<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0802"> <Title>Cross language Text Categorization by acquiring Multilingual Domain Models from Comparable Corpora</Title> <Section position="9" start_page="13" end_page="15" type="evalu"> <SectionTitle> 6 Evaluation </SectionTitle> <Paragraph position="0"> In this section we present the data set (two comparable English and Italian corpora) used in the evaluation, and we show the results of the Cross Language TC tasks. In particular, we both trained the system on the English data set to classify Italian documents and trained it on the Italian data set to classify the English test set. We compare the learning curves of the Multilingual Domain Kernel with those of the standard BoW kernel, which serves as the baseline for this task.</Paragraph> <Section position="1" start_page="13" end_page="13" type="sub_section"> <SectionTitle> 6.1 Implementation details </SectionTitle> <Paragraph position="0"> As a supervised learning device, we used the SVM implementation described in (Joachims, 1999). The Multilingual Domain Kernel is implemented by defining an explicit feature mapping as explained above, and by normalizing each vector. All the experiments have been performed with the standard SVM parameter settings.</Paragraph> <Paragraph position="1"> We acquired a Multilingual Domain Model by performing the Singular Value Decomposition process on the term-by-document matrices representing the merged training partitions (i.e. English and Italian), and we considered only the first 400 dimensions.</Paragraph> </Section> <Section position="2" start_page="13" end_page="13" type="sub_section"> <SectionTitle> 6.2 Data set description </SectionTitle> <Paragraph position="0"> We used a news corpus kindly put at our disposal by ADNKRONOS, an important Italian news provider. 
The corpus consists of 32,354 Italian and 27,821 English news stories partitioned by ADNKRONOS into four fixed categories: Quality of Life, Made in Italy, Tourism, Culture and School. The corpus is comparable, in the sense stated in Section 2, i.e. the two parts cover the same topics and the same period of time. Some news stories are translated into the other language (but no alignment indication is given), others are present only in the English set, and others only in the Italian set. The average length of a news story is about 300 words. We randomly split both the English and the Italian part into 75% training and 25% test (see Table 3). In both data sets we POS-tagged the texts and considered only nouns, verbs, adjectives, and adverbs, representing each text by a vector containing the frequencies of each lemma with its part of speech.</Paragraph> </Section> <Section position="3" start_page="13" end_page="14" type="sub_section"> <SectionTitle> 6.3 Monolingual Results </SectionTitle> <Paragraph position="0"> Before moving to the cross-language TC task, we conducted two tests of classical monolingual TC by training and testing the system on Italian and English documents separately. For these tests we used the SVM with the BoW kernel. Figures 2 and 3 report the results.</Paragraph> </Section> <Section position="4" start_page="14" end_page="15" type="sub_section"> <SectionTitle> 6.4 A Cross Language Text Categorization task </SectionTitle> <Paragraph position="0"> As far as the cross language TC task is concerned, we tried the two possible options: we trained on the English part and classified the Italian part, and we trained on the Italian part and classified the English part. [...] the vocabulary of the merged training, and how many common lemmata are present (about 14% of the total). Among the common lemmata, 97% are nouns and most of them are proper nouns. Thus the initial term-by-document matrix is a 43,384 × 45,132 matrix, while the DLSA matrix is 43,384 × 400. 
For this task we consider the BoW kernel as a baseline.</Paragraph> <Paragraph position="1"> The results are reported in Figures 4 and 5. Analyzing the learning curves, it is worth noting that as the amount of training data increases, the performance of the Multilingual Domain Kernel improves steadily, suggesting that with more training data it could be possible to approach typical monolingual TC results.</Paragraph> </Section> </Section> </Paper>