<?xml version="1.0" standalone="yes"?>
<Paper uid="X96-1035">
<Title>ADVANCES IN MULTILINGUAL TEXT RETRIEVAL</Title>
<Section position="3" start_page="0" end_page="186" type="metho">
<SectionTitle> MLTR IN TREC </SectionTitle>
<Paragraph position="0"> Starting with TREC-3, Spanish corpora and query sets have been available for evaluating text retrieval engines. The queries and corpus are monolingual, however, so testing a multilingual system is only possible if the query set or the corpus is translated into another language. We chose to translate the queries since they were very short. Given hand-translated English versions of the original Spanish queries, a query translation system can produce new Spanish queries that are then compared against the originals. The differences between the two sets of results are a reasonable measure of how well the translation process preserves the characteristics of the original query that contribute to retrieval. Several of the Spanish TREC queries and their hand-translated versions are shown in Table 1, below.</Paragraph>
<Paragraph position="1"> The query translation methods that we applied to produce new Spanish queries were of two major types: methods that used a prepared lexicon and methods that used a parallel training corpus. A lexicon tends to produce translations that are shallow but comprehensive, covering all possible senses of a term but limited in the range of synonyms produced for each term, whereas corpus methods tend to produce translations that are deep but narrow, with heavy repetition of domain-related senses of terminology. This contrast justified an examination of the comparative merits of both approaches.</Paragraph>
<Paragraph position="2"> As is often the case, our parallel corpus was not precisely of the same domain as the TREC document collection used for the ultimate evaluation. The corpus itself was extremely large, however, which we hoped would offset the difficulties of using a distinctly different type of text. The corpus was 1.6 GB of Spanish and English translations from the United Nations, containing proceedings of meetings, policy documents and notes on UN activities in member countries. The documents were automatically aligned [1] at the sentence level using a procedure that is conservatively estimated to have 83% accuracy over grossly noisy document pairs (which the UN documents were not). This produced a parallel corpus of around 680,000 aligned sentence pairs.</Paragraph>
<Paragraph position="3"> Lexical Transfer The first method was to perform term-by-term translation with the Collins English-Spanish bilingual dictionary. Individual terms in the English query were reduced to their morphological roots and lookup was performed. The resulting set of Spanish terms became the Spanish query. Some repetition of terms is apparent in the resulting queries because all senses of each term were used, with no attempt to disambiguate the contextual usage of the English terms.</Paragraph>
<Paragraph position="4"> For example, Query 28 is transformed from "Indicators of economic and business relations between Mexico and Asian countries, such as Japan, China and Korea." to "indicador indicador ayuda expansión previsiones crecimiento comercio comercio narración relación parentesco México Ciudad gripe patria campo región amor semejante parecido tanto el laca China Mar té porcelana vitrina coalln Corea Corea Corea mexicana mexicano México". Note that "China" has been replaced with both "China" and "porcelana" as a result of this simple lexical substitution scheme, and that "relations" has picked up the familial sense "parentesco".</Paragraph>
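A minimal sketch of this term-by-term transfer, assuming a crude suffix-stripping stemmer and a tiny illustrative stand-in for the Collins lexicon (the entries below are examples, not the actual Collins data):

```python
# Term-by-term lexical transfer: stem each English term and emit every
# Spanish sense found in the bilingual lexicon, with no disambiguation.
# This toy lexicon merely illustrates the shape of the dictionary data.
LEXICON = {
    "indicator": ["indicador"],
    "relation": ["relación", "parentesco"],
    "mexico": ["México"],
    "china": ["China", "porcelana"],
    "country": ["país", "patria", "región"],
}

def stem(term: str) -> str:
    """Crude morphological reduction; a real system would use a proper stemmer."""
    if term.endswith("ies"):
        return term[:-3] + "y"
    for suffix in ("es", "ed", "ing", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def translate_query(english_query: str) -> list[str]:
    """Return all Spanish senses of all stemmed query terms, unfiltered."""
    spanish = []
    for term in english_query.lower().split():
        spanish.extend(LEXICON.get(stem(term), []))
    return spanish

print(translate_query("relations between Mexico and China"))
# -> ['relación', 'parentesco', 'México', 'China', 'porcelana']
```

Because every sense of every term is emitted, ambiguous words such as "china" expand into several Spanish candidates, which is exactly the repetition visible in Query 28 above.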
<Paragraph position="5"> The lexical-transfer approach produced Spanish queries rapidly, requiring only a simple database lookup procedure. This process is shown in Figure 1 (a).</Paragraph>
<Paragraph position="6"> High-Frequency Terms from Parallel Text In text, the terms that occur with the highest frequency are rarely of statistical significance and are more often than not merely redundant, yet the terms that occur with moderate frequency are sometimes significant. In order to evaluate other corpus-based methods, we wanted to establish a baseline for queries formed from these moderate-frequency term sets. Using a vector-based text retrieval system with no term spreading or other modifications, the English queries were translated by performing a lookup on the English side of the parallel corpus, collecting the Spanish sentences that were parallels to the top 100 retrieved documents, filtering the pooled terms to eliminate the 500 most frequent Spanish terms, and collecting the next 100 most frequent Spanish terms to create the new query. This process is shown in Figure 1 (b). Several of the resulting queries are given in Table 2.</Paragraph>
<Paragraph position="7"> Some formatting codes from the UN documents have been eliminated in some of the queries, reducing the count to below 100 terms in those queries. For brevity, only the first two queries are shown in Table 2.</Paragraph>
<Paragraph position="8"> Whereas the high-frequency terms extracted by the previous method provide a baseline for examining improved methods, high-frequency terms are not necessarily the best terms for discriminating the significant features involved in text retrieval. A better approach is to extract the terms that are statistically significant in the retrieved segments of parallel text in comparison to the corpus as a whole. Various methods are possible for testing statistical significance; the method we applied is based on a log-likelihood ratio test that assumes a χ² distribution is an accurate model of the term distributions in text [2].</Paragraph>
<Paragraph position="9"> The method begins by extracting all of the terms from the sentences that are parallels to the top 100 retrieved English sentences. The counts of the pooled terms are then compared with the counts for the entire UN training corpus to evaluate their statistical significance. The 100 most significant terms are then extracted and become the new Spanish query. Figure 1 (c) diagrams the process. The resulting queries are in Table 3.</Paragraph>
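The scoring step can be sketched with the standard two-sample log-likelihood ratio (G²); the inputs `pooled_tokens`, `corpus_counts`, and `corpus_size` are assumed to come from the retrieval and counting steps just described:

```python
import math
from collections import Counter

def g2(k1, n1, k2, n2):
    """Log-likelihood ratio (G^2) comparing a term's rate in the pooled
    parallel sentences (k1 of n1 tokens) with its rate in the rest of
    the training corpus (k2 of n2 tokens)."""
    def ll(k, n, p):
        # The k*log(p) and (n-k)*log(1-p) terms vanish when k = 0 or k = n.
        s = k * math.log(p) if k else 0.0
        s += (n - k) * math.log(1.0 - p) if n - k else 0.0
        return s
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    return 2.0 * (ll(k1, n1, p1) + ll(k2, n2, p2) - ll(k1, n1, p) - ll(k2, n2, p))

def significant_terms(pooled_tokens, corpus_counts, corpus_size, top_n=100):
    """Score every term in the pooled Spanish parallels against the full
    corpus and keep the top_n most significant as the new query."""
    pool = Counter(pooled_tokens)
    n1 = sum(pool.values())
    n2 = corpus_size - n1
    scored = []
    for term, k1 in pool.items():
        k2 = max(corpus_counts.get(term, 0) - k1, 0)  # occurrences outside the pool
        scored.append((g2(k1, n1, k2, n2), term))
    return [term for _, term in sorted(scored, reverse=True)[:top_n]]
```

Terms that are merely frequent everywhere receive low G² scores, while terms concentrated in the retrieved parallels score highly, which is what separates this method from the high-frequency baseline.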
<Section position="1" start_page="185" end_page="186" type="sub_section">
<SectionTitle> Evolutionary Optimization of Queries </SectionTitle>
<Paragraph position="0"> If a derived Spanish query can be made to retrieve documents over a training corpus in a manner similar to its English counterpart, then the Spanish query could conceivably produce similar results on a novel corpus. One way to change Spanish queries is to add and remove terms. The number of possible unique deletions that can be performed on a 70-word query is quite large, however, making the direct examination of all possible modified queries effectively impossible. We therefore applied an evolutionary programming (EP) [3] approach to modify a population of 50 queries. An EP approach requires an initial population of queries along with a mutation strategy for modifying them. Optimization then proceeds by evaluating the comparative fitnesses of the queries, mutating a selected sub-population to produce "offspring" solutions, and re-evaluating the queries iteratively until a suitable number of generations have passed. Our EP approach used the comparative evaluation of document score vectors as an objective measure of the relative fitness of a query to the collection. This process is diagrammed in Figure 1 (d).</Paragraph>
<Paragraph position="1"> The initial queries for this test were the queries from the high-frequency lookup strategy discussed above; previously, we had used a lexicon to generate initial queries [4]. The mutation strategy applied between one and ten modification operations to each of the 50 queries per generation and collected only the best 10% of the queries to propagate into the next generation. Optimization proceeded for 50 generations, resulting in a wide range of changes to each query.</Paragraph>
<Paragraph position="2"> The queries produced by this system typically showed repetition of key terminology combined with the elimination of irrelevant terms. The fitness judgment for a query was based on comparative retrieval results using a training corpus of only 80,000 aligned sentences. Table 4, below, shows two of the resulting queries from the EP method.</Paragraph>
</Section>
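A schematic of this optimization loop, using the parameters quoted above (population of 50, one to ten mutations per query, best 10% retained, 50 generations). The fitness function, which in the paper compares document score vectors against the English query's retrieval results, is passed in as an assumed callable, and `term_pool` is a hypothetical pool of candidate Spanish terms:

```python
import random

POP_SIZE = 50             # population size quoted in the text
GENERATIONS = 50          # number of generations quoted in the text
SURVIVOR_FRACTION = 0.10  # best 10% propagate into the next generation

def mutate(query, term_pool):
    """Apply between one and ten random add/remove operations to a copy
    of the query, mirroring the mutation strategy described above."""
    child = list(query)
    for _ in range(random.randint(1, 10)):
        if child and random.random() < 0.5:
            child.pop(random.randrange(len(child)))  # delete a term
        else:
            child.append(random.choice(term_pool))   # add a candidate term
    return child

def evolve(initial_queries, term_pool, fitness):
    """EP loop: rank queries by fitness, keep the best 10%, and refill
    the population with mutated offspring of the survivors."""
    population = list(initial_queries)
    for _ in range(GENERATIONS):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[:max(1, int(POP_SIZE * SURVIVOR_FRACTION))]
        population = list(survivors)
        while len(population) < POP_SIZE:
            population.append(mutate(random.choice(survivors), term_pool))
    return max(population, key=fitness)
```

Seeding the population with the high-frequency queries, as the paper does, gives the mutation operators a reasonable starting vocabulary rather than forcing them to discover useful Spanish terms from scratch.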
<Section position="2" start_page="186" end_page="186" type="sub_section">
<SectionTitle> Singular Value Decomposition and the Translation Matrix </SectionTitle>
<Paragraph position="0"> The final query translation method was a radical departure from the others, but derives from earlier work by [5] and [6]. This method is at heart a numerical approach that derives a translation matrix from parallel texts.</Paragraph>
<Paragraph position="1"> In this effort, we applied a QR-decomposition technique to reduce the complexity of calculating the singular value decomposition, resulting in query translation that took only a matter of seconds on a SPARC 10. Several of the generated queries are given in Table 6. Figure 1 (e) diagrams the process.</Paragraph>
</Section>
</Section>
<Section position="4" start_page="186" end_page="186" type="metho">
<SectionTitle> OVERVIEW OF RESULTS </SectionTitle>
<Paragraph position="0"> The resulting queries were given to the University of Massachusetts, Amherst, which ran them against the Spanish TREC document collection using Spanish Inquery. The original Spanish TREC queries were also evaluated to establish a reference baseline. The results were as follows:
1. On average, the dictionary-based queries produced performance about 50% worse than the reference queries.
2. The EP-derived queries produced performance 60-70% worse than the reference queries, except at higher recall levels (0.6-1.0), at which they performed better than the Method 1 queries.
3. The other methods performed even more poorly.
4. On at least two queries, the performance of the lexical methods was as good as or better than the reference queries.
5. On two queries, the performance of the EP approach was as good as the reference queries, and these queries tended to have better precision at higher recall.
These modest results demonstrate that lexical and corpus methods can be applied to query translation in a large-scale multilingual text retrieval scenario, although at a fair penalty in performance. Each of these methods was purposely limited to as simple a scheme as possible, however, so there is plenty of room for improvement and further experimentation. The average precision-recall curve for all 25 queries is shown in Figure 2.</Paragraph>
</Section>
<Section position="5" start_page="186" end_page="187" type="metho">
<SectionTitle> RECENT AND ONGOING WORK </SectionTitle>
<Paragraph position="0"> Current work is focused on improving the performance of MLTR methods, applying the methods to new languages, and making use of new retrieval engines.</Paragraph>
<Paragraph position="1"> An example of the latter is shown in Figure 3. Mundial is a query interface to Infoseek and Yahoo that takes queries in English, translates them to Spanish, and submits the resulting queries to the Infoseek and Yahoo search engines directly. Figure 4 shows the completed search for Spanish documents on Infoseek. The Mundial demo uses a bilingual dictionary combined with several heuristics to limit the terminological expansion of the input query. Limiting query size is important because most search engines, like Infoseek, restrict the size of a query to around 80 characters. Mundial handles overgeneration in the translation process by keeping only the longest terms (in character count). Although this may sometimes be in error, the hope is that automatic stemming of query terms at the search engine will reduce long terms to stems common to many of the keywords that might have been substituted if the entire definition were transferred. A second motivation is that long terms tend to be more precise than short terms, and content words should be as precise as possible.</Paragraph>
<Paragraph position="2"> Mundial may be accessed at: http://crl.nmsu.edu/ANG/ML/ml.html.</Paragraph>
</Section>
</Paper>
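A closing sketch of the longest-term heuristic described for Mundial: each English term keeps only its longest Spanish candidate, and the assembled query stops growing at the roughly 80-character engine limit mentioned above (the lexicon shown is illustrative, not Mundial's actual dictionary):

```python
QUERY_CHAR_LIMIT = 80  # typical search-engine cap noted in the text

def mundial_translate(english_terms, lexicon):
    """Translate term by term, keeping only the longest Spanish candidate
    (in character count) to curb overgeneration, then stop adding terms
    once the engine's query-size cap would be exceeded."""
    query = ""
    for term in english_terms:
        candidates = lexicon.get(term)
        if not candidates:
            continue
        longest = max(candidates, key=len)
        extended = (query + " " + longest).strip()
        if len(extended) > QUERY_CHAR_LIMIT:
            break  # respect the roughly 80-character engine limit
        query = extended
    return query

# Example: "relations" keeps "parentesco" (10 chars) over "relación" (8).
print(mundial_translate(["relations", "china"],
                        {"relations": ["relación", "parentesco"],
                         "china": ["China", "porcelana"]}))
# -> "parentesco porcelana"
```

Keeping the longest candidate is a gamble that the engine's own stemming will map the long form onto the same stem as the shorter synonyms it displaced, trading occasional misses for a more precise query overall.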