<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1070"> <Title>Exploiting Comparable Corpora and Bilingual Dictionaries for Cross-Language Text Categorization</Title> <Section position="3" start_page="553" end_page="554" type="metho"> <SectionTitle> 2 Comparable Corpora </SectionTitle> <Paragraph position="0"> Comparable corpora are collections of texts in different languages regarding similar topics (e.g. a collection of news published by agencies in the same period). More restrictive requirements are expected for parallel corpora (i.e. corpora composed of texts which are mutual translations), while the class of the multilingual corpora (i.e.</Paragraph> <Paragraph position="1"> collection of texts expressed in different languages without any additional requirement) is the more general. Obviously parallel corpora are also comparable, while comparable corpora are also multilingual. null In a more precise way, let L = fL1,L2,...,Llg be a set of languages, let Ti = fti1,ti2,...,ting be a collection of texts expressed in the language Li 2 L, and let ps(tjh,tiz) be a function that returns 1 if tiz is the translation of tjh and 0 otherwise. A multilingual corpus is the collection of texts de ned by T[?] = uniontextiTi. If the function ps exists for every text tiz 2 T[?] and for every language Lj, and is known, then the corpus is parallel and aligned at document level.</Paragraph> <Paragraph position="2"> For the purpose of this paper it is enough to assume that two corpora are comparable, i.e. they are composed of documents about the same topics and produced in the same period (e.g. possibly from different news agencies), and it is not known if a function ps exists, even if in principle it could exist and return 1 for a strict subset of document pairs.</Paragraph> <Paragraph position="3"> The texts inside comparable corpora, being about the same topics, should refer to the same concepts by using various expressions in different languages. On the other hand, most of the proper nouns, relevant entities and words that are not yet lexicalized in the language, are expressed by using their original terms. As a consequence the same entities will be denoted with the same words in different languages, allowing us to automatically detect couples of translation pairs just by looking at the word shape (Koehn and Knight, 2002).</Paragraph> <Paragraph position="4"> Our hypothesis is that comparable corpora contain a large amount of such words, just because texts, referring to the same topics in different languages, will often adopt the same terms to denote the same However, the simple presence of these shared words is not enough to get signi cant results in CLTC tasks. As we will see, we need to exploit these common words to induce a second-order similarity for the other words in the lexicons.</Paragraph> <Section position="1" start_page="554" end_page="554" type="sub_section"> <SectionTitle> 2.1 The Multilingual Vector Space Model </SectionTitle> <Paragraph position="0"> Let T = ft1,t2,...,tng be a corpus, and V = fw1,w2,...,wkg be its vocabulary. In the mono-lingual settings, the Vector Space Model (VSM) is a k-dimensional space Rk, in which the text tj 2 T is represented by means of the vector vectortj such that the zth component of vectortj is the frequency of wz in tj. 
<Paragraph position="1"> Unfortunately, such a model cannot be adopted in the multilingual setting, because the VSMs of different languages are mostly disjoint, and the similarity between two texts in different languages would always turn out to be zero. This situation is represented in Figure 1, in which both the bottom-left and the upper-right regions of the matrix are entirely filled with zeros.</Paragraph> <Paragraph position="2"> On the other hand, the comparability assumption introduced in Section 2 implies the presence of a number of common words, represented by the central rows of the matrix in Figure 1.</Paragraph> <Paragraph position="3"> As we will show in Section 5, this model is rather poor because of its sparseness. In the next section, we will show how to use such words as seeds to induce a Multilingual Domain VSM, in which second-order relations among terms and documents in different languages are considered to improve the similarity estimation.</Paragraph> </Section> </Section> <Section position="4" start_page="554" end_page="554" type="metho"> <SectionTitle> 3 Exploiting Comparable Corpora </SectionTitle> <Paragraph position="0"> Looking at the multilingual term-by-document matrix in Figure 1, a first attempt to merge the subspaces associated with each language is to exploit the information provided by external knowledge sources, such as bilingual dictionaries, e.g. by collapsing all the rows representing translation pairs. In this setting, the similarity between texts in different languages could be estimated with the classical VSM just described. However, the main disadvantage of this approach to estimating inter-lingual text similarity is that it strongly relies on the availability of a multilingual lexical resource. For languages with scarce resources, a bilingual dictionary may not be easily available.</Paragraph> <Paragraph position="2"> Secondly, an important requirement of such a resource is its coverage (i.e. the number of possible translation pairs that it actually contains).</Paragraph> <Paragraph position="3"> Finally, another problem is that ambiguous terms could be translated in different ways, leading us to collapse together rows describing terms with very different meanings.</Paragraph>
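A rough sketch (ours, not the paper's implementation) of the row-collapsing approach just outlined: the counts of the two members of each known translation pair are merged into a single shared dimension before computing similarities. The tiny English-Italian dictionary below is invented, and the coverage and ambiguity problems just listed apply unchanged:

```python
from collections import Counter

# Hypothetical English-Italian translation pairs, standing in for a bilingual dictionary.
TRANSLATIONS = {"hospital": "ospedale", "doctor": "medico"}
ITALIAN_TO_ENGLISH = {it: en for en, it in TRANSLATIONS.items()}

def collapsed_vector(tokens, shared_vocabulary):
    """Count tokens after mapping each Italian word onto its English translation,
    so that the two members of a translation pair share a single row of the matrix."""
    mapped = [ITALIAN_TO_ENGLISH.get(tok, tok) for tok in tokens]
    counts = Counter(mapped)
    return [counts[w] for w in shared_vocabulary]

shared_vocabulary = ["hospital", "doctor", "virus"]
print(collapsed_vector("the doctor works in the hospital".split(), shared_vocabulary))  # [1, 1, 0]
print(collapsed_vector("il medico lavora in ospedale".split(), shared_vocabulary))      # [1, 1, 0]
```

After collapsing, the cosine of the previous sketch applies directly across languages; the coverage and ambiguity issues above are the price of this shortcut.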
<Paragraph position="4"> In Section 4 we will see how the availability of bilingual dictionaries influences the techniques and the performance. In the present section we explore the case in which such resources are assumed not to be available.</Paragraph> <Section position="1" start_page="554" end_page="554" type="sub_section"> <SectionTitle> 3.1 Multilingual Domain Model </SectionTitle> <Paragraph position="0"> An MDM is a multilingual extension of the concept of Domain Model. In the literature, Domain Models have been introduced to represent ambiguity and variability (Gliozzo et al., 2004) and successfully exploited in many NLP applications, such as Word Sense Disambiguation (Strapparava et al., 2004), Text Categorization and Term Categorization.</Paragraph> <Paragraph position="1"> A Domain Model is composed of soft clusters of terms. Each cluster represents a semantic domain, i.e. a set of terms that often co-occur in texts having similar topics. Such clusters identify groups of words belonging to the same semantic field, and thus highly paradigmatically related. MDMs are Domain Models containing terms in more than one language.</Paragraph> <Paragraph position="2"> An MDM is represented by a matrix D containing the degree of association among terms in all the languages and domains, as illustrated in Table 1. For example, the term virus is associated with both the domain COMPUTER SCIENCE and the domain MEDICINE, while the domain MEDICINE is associated with both the terms AIDS and HIV. [Table 1: Example of Domain Matrix, with the domains MEDICINE and COMPUTER SCIENCE as columns; w^e denotes English terms, w^i Italian terms and w^{e/i} the terms common to both languages.] Inter-lingual domain relations are captured by placing different terms of different languages in the same semantic field (as for example HIV^{e/i}, AIDS^{e/i}, hospital^e, and clinica^i). Most of the named entities, such as Microsoft and HIV, are expressed using the same string in both languages.</Paragraph> <Paragraph position="4"> Formally, let V^i = {w^i_1, w^i_2, ..., w^i_{k_i}} be the vocabulary of the corpus T^i composed of documents expressed in the language L_i, let V* = ∪_i V^i be the set of all the terms in all the languages, and let k* = |V*| be the cardinality of this set. Let D = {D_1, D_2, ..., D_d} be a set of domains. A DM is fully defined by a k* × d domain matrix D representing in each cell d_{i,z} the domain relevance of the i-th term of V* with respect to the domain D_z. The domain matrix D is used to define a function D : R^{k*} → R^d that maps the document vectors t_j, expressed in the multilingual classical VSM (see Section 2.1), into the vectors t'_j in the multilingual domain VSM. The function is defined by

D(t_j) = t'_j = t_j (I^IDF D)    (1)

where I^IDF is a diagonal matrix such that I^IDF_{i,i} = IDF(w^l_i), t_j is represented as a row vector, and IDF(w^l_i) is the Inverse Document Frequency of w^l_i evaluated in the corpus T^l. (In (Wong et al., 1985) formula 1 is used to define a Generalized Vector Space Model, of which the Domain VSM is a particular instance.)</Paragraph> <Paragraph position="8"> In this work we exploit Latent Semantic Analysis (LSA) (Deerwester et al., 1990) to automatically acquire an MDM from comparable corpora. LSA is an unsupervised technique for estimating the similarity among texts and terms in a large corpus. In the monolingual setting LSA is performed by means of a Singular Value Decomposition (SVD) of the term-by-document matrix T describing the corpus. SVD decomposes the term-by-document matrix T into three matrices, T ≃ V Σ_{k'} U^T, where Σ_{k'} is the diagonal k × k matrix containing the highest k' ≪ k eigenvalues of T, with all the remaining elements set to 0. The parameter k' is the dimensionality of the Domain VSM and can be fixed in advance (i.e. k' = d).</Paragraph>
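To make the acquisition concrete, the following sketch (ours, not the authors' code) builds a domain matrix from a toy joint term-by-document matrix via a truncated SVD and applies the mapping of equation 1. The matrix contents, the IDF weighting, and the choice of scaling the LSA dimensions by the square root of the singular values are illustrative assumptions, not the exact D_LSA construction of (Gliozzo and Strapparava, 2005):

```python
import numpy as np

def acquire_domain_matrix(T, k_prime):
    """Truncated SVD of the k* x n joint term-by-document matrix T.
    Rows of the returned k* x k' matrix are the domain vectors of the terms."""
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    # One possible weighting of the LSA dimensions (an assumption, not the paper's exact choice).
    return U[:, :k_prime] * np.sqrt(s[:k_prime])

def domain_mapping(t_j, idf, D):
    """Equation 1: t'_j = t_j (I^IDF D), with t_j a row vector of term frequencies."""
    return (t_j * idf) @ D

# Toy joint matrix: rows are terms of both languages, with shared words on common rows
# (the central rows of Figure 1); columns are English documents followed by Italian ones.
T = np.array([
    [2.0, 0.0, 1.0, 0.0],   # "HIV"      (shared)
    [1.0, 1.0, 1.0, 0.0],   # "virus"    (shared)
    [0.0, 2.0, 0.0, 1.0],   # "laptop"   (shared)
    [1.0, 0.0, 0.0, 0.0],   # "hospital" (English only)
    [0.0, 0.0, 2.0, 1.0],   # "clinica"  (Italian only)
])

idf = np.log(T.shape[1] / np.count_nonzero(T, axis=1))   # per-term IDF over the joint corpus
D = acquire_domain_matrix(T, k_prime=2)
doc_en = domain_mapping(T[:, 0], idf, D)   # an English document in the domain VSM
doc_it = domain_mapping(T[:, 2], idf, D)   # an Italian document in the domain VSM
similarity = doc_en @ doc_it / (np.linalg.norm(doc_en) * np.linalg.norm(doc_it))
print(similarity)   # topic similarity estimated in the shared, dense domain space
```

The last line corresponds to the kernel introduced below (equation 2); replacing D with the identity matrix recovers the plain bag-of-words similarity.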
<Paragraph position="10"> In the literature (Littman et al., 1998) LSA has been used in multilingual settings to define a multilingual space in which texts in different languages can be represented and compared. In that work LSA strongly relied on the availability of aligned parallel corpora: documents in all the languages are represented in a term-by-document matrix (see Figure 1) and then the columns corresponding to sets of translated documents are collapsed (i.e. they are substituted by their sum) before starting the LSA process. The effect of this step is to merge the subspaces (i.e. the right and the left sectors of the matrix in Figure 1) in which the documents were originally represented.</Paragraph> <Paragraph position="11"> In this paper we propose a variation of this strategy, performing a multilingual LSA in the case in which an aligned parallel corpus is not available. It exploits the presence of common words among different languages in the term-by-document matrix. The SVD process has the effect of creating an LSA space in which documents in both languages are represented. Of course, the higher the number of common words, the more information will be provided to the SVD algorithm to find common LSA dimensions for the two languages. The resulting LSA dimensions can be perceived as multilingual clusters of terms and documents. LSA can then be used to define a Multilingual Domain Matrix D_LSA. For further details see (Gliozzo and Strapparava, 2005).</Paragraph> <Paragraph position="13"> As Kernel Methods are the state-of-the-art supervised framework for learning and have been successfully adopted for the TC task (Joachims, 2002), we chose this framework, in particular Support Vector Machines, to perform all our experiments. Taking into account the external knowledge provided by an MDM, it is possible to estimate the topic similarity between two texts expressed in different languages with the following kernel:

K_D(t_i, t_j) = <D(t_i), D(t_j)> / sqrt(<D(t_i), D(t_i)> <D(t_j), D(t_j)>)    (2)

where D is defined as in equation 1. Note that when we want to estimate the similarity in the standard Multilingual VSM, as described in Section 2.1, we can use a simple bag-of-words kernel. The BoW kernel is a particular case of the Domain Kernel, in which D = I, the identity matrix. In the evaluation we typically consider the BoW Kernel as a baseline.</Paragraph> </Section> </Section> <Section position="6" start_page="556" end_page="557" type="metho"> <SectionTitle> 4 Exploiting Bilingual Dictionaries </SectionTitle> <Paragraph position="0"> When bilingual resources are available it is possible to augment the common portion of the matrix in Figure 1. In our experiments we exploit two alternative multilingual resources: MultiWordNet and the Collins English-Italian bilingual dictionary.</Paragraph> <Paragraph position="1"> MultiWordNet. MultiWordNet is a multilingual computational lexicon, conceived to be strictly aligned with the Princeton WordNet. The available languages are Italian, Spanish, Hebrew and Romanian. In our experiment we used the English and the Italian components. The latest version of the Italian WordNet contains around 58,000 Italian word senses and 41,500 lemmas organized into 32,700 synsets, aligned whenever possible with the English WordNet synsets. The Italian synsets are created in correspondence with the Princeton WordNet synsets, whenever possible, and semantic relations are imported from the corresponding English synsets. This implies that the synset index structure is the same for the two languages.</Paragraph> <Paragraph position="2"> Thus, for all the monosemic words, we augment each text in the dataset with the corresponding synset-id, which acts as an expansion of the common terms of the matrix in Figure 1.</Paragraph>
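A minimal sketch (ours) of this monosemic-word expansion; the synset inventory below is a tiny invented stand-in for MultiWordNet, whose shared synset-id structure is what makes the expansion language-independent:

```python
# Invented fragment of a sense-indexed bilingual lexicon standing in for MultiWordNet:
# each word is mapped to the synset-ids it can express (shared across the two languages).
SYNSETS_EN = {"hospital": ["n#03541"], "virus": ["n#01234", "n#05678"], "aids": ["n#09870"]}
SYNSETS_IT = {"ospedale": ["n#03541"], "aids": ["n#09870"]}

def augment_with_synsets(tokens, synsets):
    """Append the synset-id of every monosemic word, so that translation pairs such as
    hospital/ospedale end up sharing a row of the term-by-document matrix."""
    expansion = []
    for tok in tokens:
        ids = synsets.get(tok, [])
        if len(ids) == 1:            # monosemic words only; polysemic ones are skipped
            expansion.append(ids[0])
    return tokens + expansion

print(augment_with_synsets("the virus reached the hospital".split(), SYNSETS_EN))
# ['the', 'virus', 'reached', 'the', 'hospital', 'n#03541']
print(augment_with_synsets("il paziente arriva in ospedale".split(), SYNSETS_IT))
# ['il', 'paziente', 'arriva', 'in', 'ospedale', 'n#03541']
```

The expanded texts are then processed exactly as in Section 3.1, with the shared synset-ids playing the role of the common words.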
<Paragraph position="3"> Adopting the methodology described in Section 3.1, we exploit this common sense indexing to induce a second-order similarity for the other terms in the lexicons. We evaluate the performance of cross-lingual text categorization using both the BoW Kernel and the Multilingual Domain Kernel, observing that also in this case the leverage of the external knowledge brought by the MDM is effective.</Paragraph> <Paragraph position="4"> It is also possible to augment each text with the synset-ids of all the words (i.e. monosemic and polysemic) present in the dataset, hoping that the SVM learning device cuts off the noise due to the inevitable spurious senses introduced in the training examples. Obviously in this case, differently from the monosemic enrichment seen above, it does not make sense to apply the dimensionality reduction supplied by the Multilingual Domain Model (i.e. the resulting second-order relations among terms and documents produced on such an extended corpus would not be meaningful).</Paragraph> <Paragraph position="5"> Collins. The Collins machine-readable bilingual dictionary is a medium-sized dictionary including 37,727 headwords in the English section and 32,602 headwords in the Italian section. This is a traditional dictionary, without the sense indexing of the WordNet repository. In this case, for each text in one language, we augment all the words present with the translation words found in the dictionary. For the same reason, we chose not to exploit the MDM in this set of experiments.</Paragraph> </Section> </Paper>