<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1099"> <Title>Query Translation by Text Categorization</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Our strategy </SectionTitle> <Paragraph position="0"> Soergel describes a general framework for the use of multilingual thesauri in CLIR (Soergel, 1997), noting that a number of operational European systems employ multilingual thesauri for indexing and searching. However, except for very early work (Salton, 1970), there has been little empirical evaluation of multilingual thesauri in the context of free-text based CLIR, particularly when compared to dictionary- and corpus-based methods. This may be due to the expense of constructing multilingual thesauri, but this expense is unlikely to be any more than that of creating bilingual dictionaries or even realistic parallel collections. In fact, it seems that multilingual thesauri can be built quite effectively by merging existing monolingual thesauri, as shown by the current development of the Unified Medical Language System (UMLS1).</Paragraph> <Paragraph position="1"> Our approach to CLIR in MedLine exploits the UMLS resources and their multilingual components. The core technical component of our cross-language engine is an automatic text categorizer, which associates a set of MeSH terms to any input text. The experimental design is the following: 1. original English OHSUMED (Hersh et al., 1994) queries have been translated into French queries by domain experts; 2. the OHSUMED document collection is indexed using a standard engine; 3. French queries are mapped to a set of French MeSH terms; 4. these French MeSH terms are translated into English MeSH terms, using MeSH unique identifiers as interlingua: different values of N terms are tested; 5.
these English MeSH terms are concatenated to query the OHSUMED document collection.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 MeSH-driven Text Categorization </SectionTitle> <Paragraph position="0"> Automatic text categorization has been studied extensively and has led to an impressive number of papers. A partial list2 of machine learning approaches applied to text categorization includes naive Bayes (McCallum and Nigam, 1998), k-nearest neighbors (Yang, 1999), boosting (Schapire and Singer, 2000), and rule-learning algorithms (Apté et al., 1994). However, most of these studies apply text classification to a small set of classes, usually a few hundred, as in the Reuters collection (Hayes and Weinstein, 1990). In comparison, our system is designed to handle large class sets (Ruch et al., 2003): the retrieval tools it relies on are limited only by the size of the inverted file, and 10^5-10^6 classes is still a modest range3.</Paragraph> <Paragraph position="1"> Our approach is data-poor because it only demands a small collection of annotated texts for fine tuning: instead of inducing a complex model from large training data, our categorizer indexes the collection of MeSH terms as if they were documents, and then treats the input as if it were a query to be ranked against each MeSH term. The classifier is tuned using English abstracts and English MeSH terms.</Paragraph> <Paragraph position="2"> Then, we apply the indexing system on the French MeSH to categorize French queries into French MeSH terms. The category set ranges from about 19 936 (if only unique canonic English MeSH terms are taken into account) up to 139 956 (if synonym strings are considered in addition to their canonic class).
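The categorization idea above can be sketched in a few lines: index the MeSH terms as if they were documents, then rank them against the input abstract treated as a query. This is a minimal illustration, not the production engine; the three MeSH identifiers and strings below are a hypothetical miniature collection.

```python
from collections import Counter
import math

# Hypothetical miniature "collection": each MeSH term, keyed by its unique
# identifier, is indexed as if it were a document.
mesh_terms = {
    "D006333": "heart failure congestive",
    "D003920": "diabetes mellitus",
    "D006973": "hypertension",
}

def tokenize(text):
    return text.lower().split()

# Document frequencies over the MeSH "collection".
df = Counter()
for text in mesh_terms.values():
    df.update(set(tokenize(text)))
N = len(mesh_terms)

def rank(abstract, top_k=2):
    """Treat the input text as a query and rank MeSH terms against it
    with a simple tf.idf-style score."""
    query = Counter(tokenize(abstract))
    scores = {}
    for uid, text in mesh_terms.items():
        score = sum(query[tok] * math.log(1 + N / df[tok])
                    for tok in tokenize(text) if tok in query)
        if score > 0:
            scores[uid] = score
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

print(rank("patients with congestive heart failure and hypertension"))
```

Because the "documents" are the class labels themselves, no training corpus is needed beyond the small tuning set, which is what makes the approach data-poor.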
For evaluating the categorizer, the top 15 returned terms are selected, because this is the average number of MeSH terms per abstract in the OHSUMED collection.</Paragraph> <Paragraph position="3"> Footnote 3: the scalability issue is twofold: it concerns both the ability of these data-driven systems to work with large concept sets, and their ability to learn and generalize regularities for rare events: (Larkey and Croft, 1996) show how the frequency of concepts in the collection is a major parameter for learning systems.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Collection and Metrics </SectionTitle> <Paragraph position="0"> The mean average precision (noted Av. Prec. in the following tables) is the main measure for evaluating ad hoc retrieval tasks, for both monolingual and bilingual runs.</Paragraph> <Paragraph position="1"> Following (Larkey and Croft, 1996), we also use this measure to tune the automatic text categorization system.</Paragraph> <Paragraph position="2"> Among the 348 566 MedLine citations of the OHSUMED collection4, we use the 233 445 records provided with an abstract and annotated with MeSH keywords. We tune the categorization system on a small set of OHSUMED abstracts: 1200 randomly selected abstracts were used to select the weighting parameters of the vector space classifier, and the best combination of these parameters with the regular expression-based classifier.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Methods </SectionTitle> <Paragraph position="0"> We first present the MeSH categorizer and its tuning, then the query translation system.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Categorization </SectionTitle> <Paragraph position="0"> In this section, we present the basic classifiers and their combination for the categorization task.
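The evaluation measure used throughout (mean average precision, section 2.2) can be sketched as follows; the document identifiers are hypothetical and this is only an illustrative implementation of the standard definition.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list: the mean of the precision
    values observed at the rank of each relevant item, divided by the
    total number of relevant items."""
    hits, precisions = 0, []
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of average precision over a set of (ranked list, relevant set)
    pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy example: relevant documents d1 and d3, retrieved at ranks 1 and 3.
print(round(average_precision(["d1", "d7", "d3"], {"d1", "d3"}), 3))
```

Because it averages precision over the full ordering, this measure rewards rankings that place relevant concepts early, which is why it is also used to tune the categorizer.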
Two main modules constitute the skeleton of our system: the regular expression (RegEx) component and the vector space (VS) component. Each of the basic classifiers implements a known approach to document retrieval. The first tool is based on a regular expression pattern matcher (Manber and Wu, 1994); it is expected to perform well when applied to very short documents such as keywords, since MeSH terms do not contain more than 5 tokens. The second classifier is based on a vector space engine5. This second tool is expected to provide high recall, in contrast with the regular expression-based tool, which should favor precision. The former component uses tokens as indexing units and can be merged with a thesaurus, while the latter uses stems (Porter). Table 1 shows the results of each of the basic classifiers.</Paragraph> <Paragraph position="1"> Table 1 caption: for each engine, tf.idf parameters are provided: the first triplet indicates the weighting applied to the "document", i.e. the concept, while the second is for the "query", i.e. the abstract. The total number of relevant terms is 15 193.</Paragraph> <Paragraph position="2"> Regular expressions and MeSH thesaurus. The regular expression search tool is applied on the canonic MeSH collection augmented with the MeSH thesaurus (120 020 synonyms). In this system, string normalization is mainly performed by the MeSH terminological resources when the thesaurus is used. Indeed, the MeSH provides a large set of related terms, which are mapped to a unique MeSH representative in the canonic collection. The related terms gather morpho-syntactic variants, strict synonyms, and a last class of related terms, which mixes up generic and specific terms: for example, Inhibition is mapped to Inhibition (Psychology).
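The thesaurus-based normalization just described amounts to a lookup from related surface strings to one canonic MeSH heading. A minimal sketch, with a hypothetical handful of entries standing in for the 120 020 synonyms:

```python
# Hypothetical fragment of the MeSH thesaurus: related strings
# (morpho-syntactic variants, strict synonyms, and some generic/specific
# mixes) all map to a single canonic MeSH representative.
canonic = {
    "inhibition": "Inhibition (Psychology)",
    "inhibition (psychology)": "Inhibition (Psychology)",
    "heart attack": "Myocardial Infarction",
    "myocardial infarct": "Myocardial Infarction",
    "myocardial infarction": "Myocardial Infarction",
}

def normalize(term):
    """Map a matched surface string to its canonic MeSH heading,
    or None if the string is not in the thesaurus."""
    return canonic.get(term.lower().strip())

print(normalize("Heart attack"))  # → Myocardial Infarction
```

String normalization thus becomes a property of the terminological resource itself rather than of the matching code.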
The system cuts the abstract into 5-token-long phrases and moves this window through the abstract: the edit distance is computed between each of these 5-token sequences and each MeSH term.</Paragraph> <Paragraph position="3"> Basically, the manually crafted finite-state automata allow two insertions or one deletion within a MeSH term, and rank the proposed candidate terms based on these basic edit operations: an insertion costs 1, while a deletion costs 2. The resulting pattern matcher behaves like a term proximity scoring system (Rasolofo and Savoy, 2003), but restricted to a 5-token matching window.</Paragraph> <Paragraph position="4"> Vector space classifier. The vector space module is based on a general IR engine with a tf.idf6 weighting schema. The engine uses a list of 544 stop words.</Paragraph> <Paragraph position="5"> Footnote 6: we use the SMART representation for expressing statistical weighting factors: a formal description can be found in (Ruch, 2002).</Paragraph> <Paragraph position="6"> As for setting the weighting factors, we observed that cosine normalization was especially effective for our task. This is not surprising, considering the fact that cosine normalization performs well when documents have a similar length (Singhal et al., 1996). As for the respective performance of each basic classifier, table 1 shows that the RegEx system performs better than any tf.idf schema used by the VS engine, so the pattern matcher provides better results than the vector space engine for automatic text categorization. However, we also observe in table 1 that the VS system gives better precision at high ranks (Precision at Recall = 0, or mean reciprocal rank) than the RegEx system: this difference suggests that merging the classifiers could be effective. The idf factor also seems to be an important parameter: as shown in table 1, the four best weighting schemas use the idf factor.
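The windowed matching described above (insertion cost 1, deletion cost 2, 5-token window) can be sketched as a small dynamic program. This is an illustrative re-implementation under stated assumptions, not the paper's hand-crafted automata: leading and trailing window tokens outside the matched span are assumed free, and the two-insertion/one-deletion limit is applied as a cost threshold by the caller.

```python
def window_cost(window, term):
    """Cost of aligning a MeSH term (token list) against one window of
    abstract tokens: skipping a spurious window token inside the term
    costs 1 (insertion), dropping a term token costs 2 (deletion)."""
    INF = float("inf")
    n, m = len(term), len(window)
    # dp[i][j]: minimal cost to align term[:i] using window[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        for i in range(n + 1):
            if i == 0:
                dp[0][j] = 0                  # the term may start anywhere
                continue
            best = dp[i - 1][j] + 2           # deletion of term[i-1]
            if j > 0:
                if term[i - 1] == window[j - 1]:
                    best = min(best, dp[i - 1][j - 1])   # exact token match
                best = min(best, dp[i][j - 1] + 1)       # insertion
            dp[i][j] = best
    return min(dp[n][j] for j in range(m + 1))           # term may end anywhere

def scan(abstract_tokens, term_tokens, width=5):
    """Slide a 5-token window through the abstract and keep the best cost;
    candidates within two insertions or one deletion (cost <= 2) would
    be accepted under the paper's constraints."""
    return min(window_cost(abstract_tokens[k:k + width], term_tokens)
               for k in range(max(1, len(abstract_tokens) - width + 1)))
```

An exact occurrence of the term inside some window therefore scores 0, one extra token between term tokens scores 1, and one missing term token scores 2, mirroring the proximity-like behavior noted above.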
This observation suggests that even in a controlled vocabulary, the idf factor is able to discriminate between content-bearing and non-content-bearing features (such as syndrome and disease).</Paragraph> <Paragraph position="7"> Classifiers' fusion. The hybrid system combines the regular expression classifier with the vector space classifier. Unlike (Larkey and Croft, 1996), we do not merge our classifiers by linear combination, because the RegEx module does not return a score consistent with the vector space system. Therefore the combination does not use the RegEx's edit distance; instead it uses the list returned by the vector space module as a reference list (RL), while the list returned by the regular expression module is used as a boosting list (BL), which serves to improve the ranking of terms listed in RL.</Paragraph> <Paragraph position="8"> A third factor takes into account the length of terms: both the number of characters (L1) and the number of tokens (L2, with L2 > 3) are computed, so that long and compound terms, which appear in both lists, are favored over single and short terms. We assume that the reference list has good recall, and we do not set any threshold on it. For each concept t listed in the RL, the combined Retrieval Status Value (cRSV, equation 1) is:</Paragraph> <Paragraph position="9"> cRSV(t) = RSV_VS(t) * ln(L1) * L2 * k if t is in BL, and cRSV(t) = RSV_VS(t) otherwise. (1)</Paragraph> <Paragraph position="10"> The value of the k parameter is set empirically. Table 2 shows that the optimal tf.idf parameters (lnc.atn) for the basic VS classifier do not provide the optimal combination with RegEx.</Paragraph> <Paragraph position="11"> Table 2 column headings: weighting function; relevant concepts.abstracts retrieved; Prec. at Rec. = 0; Av. Prec.</Paragraph> <Paragraph position="12"> Measured by mean average precision, the optimal combination is obtained with ltc.lnn settings (.1818)7, whereas atn.ntn maximizes the Precision at Recall = 0 (.9143).
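The fusion step can be sketched as a reranking of the reference list. This is a sketch under an assumption: since the source does not reproduce the exact cRSV formula, the code reads it as a multiplicative boost of the vector-space RSV by ln(L1) · L2 · k for terms that the RegEx module also returned; the scores and terms below are hypothetical.

```python
import math

def combine(reference, boosting, k=10.0):
    """Rerank the vector-space reference list (RL) using the RegEx
    boosting list (BL). Terms in BL have their RSV boosted by a factor
    growing with length in characters (L1) and in tokens (L2), scaled
    by the empirically set parameter k; other terms keep their RSV."""
    combined = {}
    for term, rsv in reference.items():
        if term in boosting:
            L1 = len(term)                    # length in characters
            L2 = len(term.split())            # length in tokens
            combined[term] = rsv * math.log(L1) * L2 * k
        else:
            combined[term] = rsv
    return sorted(combined.items(), key=lambda kv: -kv[1])

ranked = combine({"congestive heart failure": 0.5, "failure": 0.9},
                 {"congestive heart failure"})
print(ranked[0][0])
```

Note that the RegEx edit-distance score itself is never used; BL only signals which RL terms to promote, and the length factors ensure that long, compound terms confirmed by both classifiers rise above short single-token ones.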
For a general purpose system, we prefer to maximize average precision, since this is the only measure that summarizes the performance of the full ordering of concepts, so ltc.lnn factors will be used for the following CLIR experiments.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Translation </SectionTitle> <Paragraph position="0"> To translate user queries, we transform the English MeSH mapping tool described above, which attributes MeSH terms to English abstracts, into a French mapping tool for mapping French OHSUMED queries into French MeSH terms. The English version of the MeSH is simply replaced by the accented French version (Zweigenbaum and Grabar, 2002) of the MeSH.</Paragraph> <Paragraph position="1"> We use the weighting schema and system combination (ltc.lnn + RegEx) selected in the above experiments, so we assume that the best weighting schema regarding average precision for mapping abstracts to MeSH terms is also appropriate for categorizing OHSUMED queries.</Paragraph> <Paragraph position="2"> The only technical differences concern: 1) the thesaural resources, 2) the stemming algorithm.</Paragraph> <Paragraph position="3"> The former are provided by the Unified Medical Lexicon for French consortium (Zweigenbaum et al., 2003) and contain about 20 000 French medical lexemes, with synonyms, while the latter is based on Savoy's stemmer (Savoy, 1999). An additional parameter is used: in order to avoid translating too many irrelevant concepts, we try to take advantage of the concept ranking. Depending on the length of the query, a balance must be found between having a couple of high-precision concepts and missing an important one.
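The interlingua step of the pipeline, keeping the top-N categorized French MeSH terms and translating them into English through shared MeSH unique identifiers, can be sketched as follows. The two-entry French/English tables are hypothetical illustrations, not the actual multilingual MeSH.

```python
# Hypothetical fragment of the multilingual MeSH: each concept has a
# language-independent unique identifier linking its French and English strings.
fr_to_uid = {
    "insuffisance cardiaque": "D006333",
    "diabète": "D003920",
}
uid_to_en = {
    "D006333": "Heart Failure, Congestive",
    "D003920": "Diabetes Mellitus",
}

def translate(french_terms, n=2):
    """Keep the top-N categorized French MeSH terms and translate them
    into English MeSH terms via their shared unique identifiers; the
    English terms are concatenated to form the query string."""
    english = []
    for term in french_terms[:n]:
        uid = fr_to_uid.get(term)
        if uid and uid in uid_to_en:
            english.append(uid_to_en[uid])
    return " ".join(english)

print(translate(["insuffisance cardiaque", "diabète"]))
```

The threshold n plays the role of the varying cut-off discussed next: a small n yields a few high-precision concepts, a larger n reduces the risk of missing an important one.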
To evaluate this aspect, we do not select the top 15 terms, as in text categorization, but vary this number, allowing different thresholds: 1, 2, 3, 5, 10, and 25. Finally, by linear regression, we also attempt to determine a linear fit between the length of the query (in bytes) and the optimal threshold.</Paragraph> <Paragraph position="4"> Footnote 7: for the augmented term frequency factor (noted a), which is defined by the function α + β · (tf / max(tf)), the value of the parameters is α = β = 0.5.</Paragraph> </Section> </Section> </Paper>