<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0802"> <Title>Cross language Text Categorization by acquiring Multilingual Domain Models from Comparable Corpora</Title> <Section position="5" start_page="9" end_page="10" type="metho"> <SectionTitle> 3 The Multilingual Vector Space Model </SectionTitle>
<Paragraph position="0"> Let T = {t_1, t_2, ..., t_n} be a corpus, and V = {w_1, w_2, ..., w_k} be its vocabulary. In the monolingual setting, the Vector Space Model (VSM) is a k-dimensional space R^k, in which the text t_j ∈ T is represented by means of the vector t_j such that the z-th component of t_j is the frequency of w_z in t_j. The similarity between two texts in the VSM is then estimated by computing the cosine of their vectors.</Paragraph>
<Paragraph position="1"> [Footnote 1: According to our assumption, a possible additional criterion to decide whether two corpora are comparable is to estimate the percentage of terms in the intersection of their vocabularies.] Unfortunately, such a model cannot be adopted in the multilingual setting, because the VSMs of different languages are mainly disjoint, and the similarity between two texts in different languages would always turn out to be zero. This situation is represented in Figure 1, in which both the bottom-left and the upper-right regions of the matrix are entirely filled with zeros.</Paragraph>
<Paragraph position="2"> A first attempt to solve this problem is to exploit the information provided by external knowledge sources, such as bilingual dictionaries, and collapse all the rows representing translation pairs. In this setting, the similarity between texts in different languages could be estimated with the classical VSM just described. However, the main disadvantage of this approach to estimating inter-lingual text similarity is that it strongly relies on the availability of a multilingual lexical resource containing a list of translation pairs. For languages with scarce resources a bilingual dictionary may not be easily available. Secondly, an important requirement of such a resource is its coverage (i.e. the number of possible translation pairs actually contained in it). Finally, another problem is that ambiguous terms can be translated in different ways, causing rows describing terms with very different meanings to be collapsed together.</Paragraph>
<Paragraph position="3"> On the other hand, the assumption of corpora comparability discussed in Section 2 implies the presence of a number of common words, represented by the central rows of the matrix in Figure 1.</Paragraph>
<Paragraph position="4"> As we will show in Section 6, this model is rather poor because of its sparseness. In the next section, we will show how to use such words as seeds to induce a Multilingual Domain VSM, in which second-order relations among terms and documents in different languages are considered to improve the similarity estimation.</Paragraph>
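To make the VSM representation just described concrete, here is a minimal Python sketch (our illustration, not code from the paper): it builds frequency vectors over a fixed multilingual vocabulary and compares two texts with the cosine. The vocabulary, the example sentences and the whitespace tokenization are assumptions made only for this example; with fully disjoint vocabularies the cross-language cosine is always zero, while shared terms yield a small non-zero score.

    import numpy as np

    def build_vsm(texts, vocabulary):
        """One k-dimensional frequency vector per text (k = |vocabulary|)."""
        index = {w: i for i, w in enumerate(vocabulary)}
        vectors = np.zeros((len(texts), len(vocabulary)))
        for j, text in enumerate(texts):
            for token in text.lower().split():
                if token in index:
                    vectors[j, index[token]] += 1
        return vectors

    def cosine(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return 0.0 if denom == 0 else float(np.dot(u, v) / denom)

    # Toy multilingual vocabulary: English terms, Italian terms and a shared term ("test").
    vocabulary = ["hospital", "hiv", "test", "clinica", "ospedale"]
    english_doc = "I went to the hospital for an HIV test"
    italian_doc = "ieri ho fatto il test in clinica"

    V = build_vsm([english_doc, italian_doc], vocabulary)
    # The two documents share only the token "test": the similarity is small but
    # non-zero; without any shared term it would be exactly 0.
    print(cosine(V[0], V[1]))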
</Section> <Section position="6" start_page="10" end_page="12" type="metho"> <SectionTitle> 4 Multilingual Domain Models </SectionTitle>
<Paragraph position="0"> A MDM is a multilingual extension of the concept of Domain Model. In the literature, Domain Models have been introduced to represent ambiguity and variability (Gliozzo et al., 2004) and have been successfully exploited in many NLP applications, such as Word Sense Disambiguation (Strapparava et al., 2004), Text Categorization and Term Categorization.</Paragraph>
<Paragraph position="1"> A Domain Model is composed of soft clusters of terms. Each cluster represents a semantic domain, i.e. a set of terms that often co-occur in texts having similar topics. Such clusters identify groups of words belonging to the same semantic field, and thus highly paradigmatically related. MDMs are Domain Models containing terms in more than one language.</Paragraph>
<Paragraph position="2"> A MDM is represented by a matrix D, containing the degree of association among terms in all the languages and domains, as illustrated in Table 1.</Paragraph>
<Paragraph position="3">
                     MEDICINE   COMPUTER SCIENCE
    HIV^{e/i}          1          0
    AIDS^{e/i}         1          0
    virus^{e/i}        0.5        0.5
    hospital^{e}       1          0
    laptop^{e}         0          1
    Microsoft^{e/i}    0          1
    clinica^{i}        1          0
Table 1: Example of Domain Matrix. w^{e} denotes English terms, w^{i} Italian terms and w^{e/i} the terms common to both languages.</Paragraph>
<Paragraph position="4"> MDMs can be used to describe lexical ambiguity, variability and inter-lingual domain relations. Lexical ambiguity is represented by associating one term with more than one domain, while variability is represented by associating different terms with the same domain. For example, the term virus is associated with both the domain COMPUTER SCIENCE and the domain MEDICINE, while the domain MEDICINE is associated with both the terms AIDS and HIV. Inter-lingual domain relations are captured by placing terms of different languages in the same semantic field (as, for example, HIV^{e/i}, AIDS^{e/i}, hospital^{e}, and clinica^{i}). Most named entities, such as Microsoft and HIV, are expressed using the same string in both languages.</Paragraph>
<Paragraph position="5"> When similarity among texts in different languages has to be estimated, the information contained in the MDM is crucial. For example, the two sentences "I went to the hospital to make an HIV check" and "Ieri ho fatto il test dell'AIDS in clinica" (lit. "yesterday I did the AIDS test in a clinic") are highly related, even if they share no tokens. Having a priori knowledge about the inter-lingual domain similarity among AIDS, HIV, hospital and clinica is thus useful information for recognizing inter-lingual topic similarity. Obviously this relation is less restrictive than the stronger association between translation pairs. In this paper we will show that such a representation is sufficient for TC purposes, and easier to acquire.</Paragraph>
<Paragraph position="6"> In the rest of this section we provide a formal definition of the concept of MDM, and we define some similarity metrics that exploit it.</Paragraph>
<Paragraph position="7"> Formally, let V^i = {w^i_1, w^i_2, ..., w^i_{k_i}} be the vocabulary of the corpus T^i composed of documents expressed in the language L^i, let V* = ∪_i V^i be the set of all the terms in all the languages, and let k* = |V*| be the cardinality of this set. Let D = {D_1, D_2, ..., D_d} be a set of domains. A DM is fully defined by a k* × d domain matrix D representing in each cell d_{i,z} the domain relevance of the i-th term of V* with respect to the domain D_z. The domain matrix D is used to define a function D: R^{k*} → R^d, that maps the document vectors t_j expressed in the multilingual classical VSM into the vectors t'_j in the multilingual domain VSM. The function D is defined by

    D(t_j) = t_j (I^{IDF} D) = t'_j                    (1)

where I^{IDF} is a diagonal matrix such that I^{IDF}_{i,i} = IDF(w^l_i), t_j is represented as a row vector, and IDF(w^l_i) is the Inverse Document Frequency of w^l_i evaluated in the corpus T^l. [Footnote 2: In (Wong et al., 1985) formula 1 is used to define a Generalized Vector Space Model, of which the Domain VSM is a particular instance.]</Paragraph>
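The following Python sketch (our own illustration) instantiates the Domain Matrix of Table 1 and applies the mapping of Equation 1 to two toy documents, one English and one Italian, that share no tokens. The IDF weights and the document vectors are invented for this example only; the paper acquires D automatically (Section 4.1) rather than writing it by hand as done here.

    import numpy as np

    # Rows follow Table 1; columns are the domains MEDICINE and COMPUTER SCIENCE.
    terms = ["HIV", "AIDS", "virus", "hospital", "laptop", "Microsoft", "clinica"]
    D = np.array([
        [1.0, 0.0],   # HIV (common to English and Italian)
        [1.0, 0.0],   # AIDS
        [0.5, 0.5],   # virus: ambiguous, associated with both domains
        [1.0, 0.0],   # hospital (English)
        [0.0, 1.0],   # laptop (English)
        [0.0, 1.0],   # Microsoft
        [1.0, 0.0],   # clinica (Italian)
    ])
    I_IDF = np.diag([1.2, 1.2, 0.9, 1.0, 1.1, 0.8, 1.0])  # toy IDF weights

    def to_domain_space(t):
        """Equation 1: map a term-frequency row vector t in R^{k*} to t' in R^d."""
        return t @ I_IDF @ D

    english = np.array([1, 0, 0, 1, 0, 0, 0])  # "... hospital ... HIV ..."
    italian = np.array([0, 1, 0, 0, 0, 0, 1])  # "... AIDS ... clinica ..."
    e, i = to_domain_space(english), to_domain_space(italian)
    # Both texts project onto the MEDICINE domain, so their cosine in the domain
    # space is high even though it is zero in the plain multilingual VSM.
    print(np.dot(e, i) / (np.linalg.norm(e) * np.linalg.norm(i)))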
<Paragraph position="8"> The matrix D can be determined, for example, using hand-made lexical resources, such as WORDNET DOMAINS (Magnini and Cavaglià, 2000). In the present work we acquire D automatically from corpora, exploiting the technique described below.</Paragraph>
<Section position="1" start_page="11" end_page="12" type="sub_section"> <SectionTitle> 4.1 Automatic Acquisition of Multilingual Domain Models </SectionTitle>
<Paragraph position="0"> In this work we propose the use of Latent Semantic Analysis (LSA) (Deerwester et al., 1990) to induce a MDM from comparable corpora. LSA is an unsupervised technique for estimating the similarity among texts and terms in a large corpus. In the monolingual setting LSA is performed by means of a Singular Value Decomposition (SVD) of the term-by-document matrix T describing the corpus.</Paragraph>
<Paragraph position="1"> SVD decomposes the term-by-document matrix T into three matrices, T ≈ V S_{k'} U^T, where S_{k'} is the diagonal k × k matrix containing the highest k' ≤ k eigenvalues of T, with all the remaining elements set to 0. The parameter k' is the dimensionality of the Domain VSM and can be fixed in advance (i.e. k' = d).</Paragraph>
<Paragraph position="2"> In the literature (Littman et al., 1998) LSA has been used in multilingual settings to define a multilingual space in which texts in different languages can be represented and compared. That work strongly relied on the availability of aligned parallel corpora: documents in all the languages are represented in a term-by-document matrix (see Figure 1) and then the columns corresponding to sets of translated documents are collapsed (i.e. they are substituted by their sum) before starting the LSA process. The effect of this step is to merge the subspaces (i.e. the right and the left sectors of the matrix in Figure 1) in which the documents were originally represented. In this paper we propose a variation of this strategy, performing a multilingual LSA in the case in which an aligned parallel corpus is not available. It exploits the presence of common words among different languages in the term-by-document matrix.</Paragraph>
<Paragraph position="3"> The SVD process has the effect of creating an LSA space in which documents in both languages are represented. Of course, the higher the number of common words, the more information will be provided to the SVD algorithm to find common LSA dimensions for the two languages. The resulting LSA dimensions can be perceived as multilingual clusters of terms and documents. LSA can then be used to define a Multilingual Domain Matrix D_LSA:

    D_LSA = I^{N} V √S_{k'}

where I^{N} is a diagonal normalization matrix such that I^{N}_{i,i} = 1/√⟨w'_i, w'_i⟩, and w'_i is the i-th row of the matrix V √S_{k'}.</Paragraph>
<Paragraph position="4"> Thus D_LSA [Footnote 3] can be exploited to estimate the similarity among texts expressed in different languages (see Section 5).</Paragraph>
<Paragraph position="5"> [Footnote 3: When D_LSA is substituted in Equation 1 the Domain VSM is equivalent to a Latent Semantic Space (Deerwester et al., 1990). The only difference in our formulation is that the vectors representing the terms in the Domain VSM are normalized by the matrix I^{N}, and then rescaled, according to their IDF value, by the matrix I^{IDF}. Note the analogy with the tf-idf term weighting schema, widely adopted in Information Retrieval.]</Paragraph> </Section>
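As a concrete illustration of the acquisition step of Section 4.1, the sketch below (ours, not the paper's code) runs a truncated SVD on a term-by-document matrix that covers both languages and row-normalizes the result, in the spirit of D_LSA = I^N V √S_{k'}. The toy matrix and the choice of k' are assumptions; the paper acquires D_LSA from a large comparable corpus.

    import numpy as np

    def acquire_D_LSA(T, k_prime):
        """T: k* x n term-by-document matrix over both languages (common words
        occur in documents of either language). Returns the k* x k' matrix D_LSA."""
        U, s, _Vt = np.linalg.svd(T, full_matrices=False)  # numpy convention: T = U S V^T
        # The paper writes T ~ V S_k' U^T, so numpy's U plays the role of the
        # paper's V (the term space).
        W = U[:, :k_prime] * np.sqrt(s[:k_prime])          # rows w'_i of V sqrt(S_k')
        norms = np.linalg.norm(W, axis=1, keepdims=True)   # the I^N row normalization
        norms[norms == 0] = 1.0
        return W / norms

    # Toy data: 7 terms x 6 documents, random counts just to make the call runnable.
    rng = np.random.default_rng(0)
    T = rng.integers(0, 3, size=(7, 6)).astype(float)
    D_lsa = acquire_D_LSA(T, k_prime=2)
    print(D_lsa.shape)  # (7, 2): one k'-dimensional domain vector per term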
<Section position="2" start_page="12" end_page="12" type="sub_section"> <SectionTitle> 4.2 Similarity in the multilingual domain space </SectionTitle>
<Paragraph position="0"> As an example of the second-order similarity provided by this approach, we can see in Table 2 the five most similar terms to the lemma bank. The similarity among terms is calculated as the cosine between the rows of the matrix D_LSA, acquired from the data set used in our experiments (see Section 6.2). It is worth noting that the Italian lemma banca (i.e. bank in English) has a high similarity score to the English lemma bank. While this is not enough for precise term translation, it is sufficient to capture relevant aspects of topic similarity in a cross-language text categorization task.</Paragraph> </Section> </Section>
<Section position="8" start_page="12" end_page="13" type="metho"> <SectionTitle> 5 The Multilingual Domain Kernel </SectionTitle>
<Paragraph position="0"> Kernel Methods are the state-of-the-art supervised framework for learning, and they have been successfully adopted to approach the TC task (Joachims, 2002).</Paragraph>
<Paragraph position="1"> The basic idea behind kernel methods is to embed the data into a suitable feature space F via a mapping function φ: X → F, and then to use a linear algorithm for discovering nonlinear patterns. Kernel methods allow us to build a modular system, as the kernel function acts as an interface between the data and the learning algorithm. Thus the kernel function becomes the only domain-specific module of the system, while the learning algorithm is a general-purpose component. Potentially any kernel function can work with any kernel-based algorithm, for example Support Vector Machines (SVMs).</Paragraph>
<Paragraph position="2"> During the learning phase SVMs assign a weight λ_i ≥ 0 to each example x_i ∈ X. All the labeled instances x_i such that λ_i > 0 are called Support Vectors. Support Vectors lie close to the best separating hyperplane between positive and negative examples. New examples are then assigned to the class of the closest support vectors, according to the equation

    f(x) = sign( Σ_i λ_i y_i K(x_i, x) + b )

where y_i ∈ {-1, +1} is the label of the example x_i and b is a bias term.</Paragraph>
<Paragraph position="3"> The kernel function K(x_i, x) returns the similarity between two instances in the input space X, and can be designed just by taking care that some formal requirements are satisfied, as described in (Schölkopf and Smola, 2001).</Paragraph>
<Paragraph position="4"> In this section we define the Multilingual Domain Kernel, and we apply it to a cross-language TC task.</Paragraph>
<Paragraph position="5"> This kernel can be exploited to estimate the topic similarity between two texts expressed in different languages by taking into account the external knowledge provided by a MDM. It defines an explicit mapping D: R^k → R^{k'} from the Multilingual VSM into the Multilingual Domain VSM. The Multilingual Domain Kernel is specified by

    K_D(t_i, t_j) = ⟨D(t_i), D(t_j)⟩ / √( ⟨D(t_i), D(t_i)⟩ ⟨D(t_j), D(t_j)⟩ )

where D is the Domain Mapping defined in Equation 1. Thus the Multilingual Domain Kernel requires the Multilingual Domain Matrix D, in particular D_LSA, which can be acquired from comparable corpora, as explained in Section 4.1.</Paragraph>
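Below is a minimal sketch of how such a kernel can be computed and plugged into an SVM through a precomputed Gram matrix. The use of scikit-learn and all variable names are our assumptions; the paper does not prescribe a specific toolkit.

    import numpy as np

    def domain_kernel(t_i, t_j, I_idf, D):
        """K_D(t_i, t_j): cosine of the two document vectors after the domain mapping D."""
        di, dj = t_i @ I_idf @ D, t_j @ I_idf @ D
        return float(np.dot(di, dj) / np.sqrt(np.dot(di, di) * np.dot(dj, dj)))

    def gram_matrix(X, Y, I_idf, D):
        """Kernel matrix K[a, b] = K_D(X[a], Y[b]) for two sets of document vectors."""
        return np.array([[domain_kernel(x, y, I_idf, D) for y in Y] for x in X])

    # With scikit-learn, the matrix can be passed to an SVM as a precomputed kernel,
    # e.g. training on documents of one language and testing on the other:
    #   from sklearn.svm import SVC
    #   clf = SVC(kernel="precomputed")
    #   clf.fit(gram_matrix(X_train, X_train, I_idf, D_lsa), y_train)
    #   clf.predict(gram_matrix(X_test, X_train, I_idf, D_lsa))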
<Paragraph position="6"> To evaluate the Multilingual Domain Kernel we compared it to a baseline kernel function, namely the bag-of-words (BoW) kernel, which simply estimates the topic similarity in the Multilingual VSM, as described in Section 3. The BoW kernel is a particular case of the Domain Kernel, in which D = I, the identity matrix.</Paragraph> </Section> </Paper>