<?xml version="1.0" standalone="yes"?> <Paper uid="X96-1031"> <Title>RECENT ADVANCES IN HNC'S CONTEXT VECTOR INFORMATION RETRIEVAL TECHNOLOGY</Title> <Section position="3" start_page="0" end_page="149" type="intro"> <SectionTitle> 1. INTRODUCTION </SectionTitle> <Paragraph position="0"> While the current MatchPlus learning law has proven to be effective in encoding relationships between words, it is computationally intensive and requires multiple passes through the training corpus.</Paragraph> <Paragraph position="1"> The purpose of the one step learning law is to approximate the behavior of the original learning law while performing only a single pass through the training corpus. The one step learning law uses a single pass through the training corpus to obtain desired dot product values for the set of trained context vectors. The desired dot product values are determined on the basis of information theoretic statistical relationships between co-occurring word stems found in the training corpus. Desired dot products are found such that words that tend to co-occur will have context vectors that point in similar directions while words that do not co-occur will have context vectors that tend to be orthogonal. These desired dot products are used to perform a quasi-linear transformation on an initial set of quasi-orthogonal, high dimensional vectors. This vector transformation and subsequent renormalization results in a set of context vectors that represents the relationships between word stems in a near-optimal fashion. 
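As a rough illustration of the single-pass idea described above, the following Python sketch starts from random unit vectors (which are nearly orthogonal in high dimension), adds to each stem's vector the initial vectors of its related stems weighted by the desired dot products, and then renormalizes. The update rule, the function names, the example stems, and the weight 0.8 are all assumptions made for illustration; the text does not specify the exact form of the quasi-linear transformation or how the desired dot products are computed, so they are simply taken as input here. The resulting dot products move toward the targets but are not guaranteed to equal them.

```python
import math
import random


def random_unit_vectors(stems, dim, seed=0):
    """Random Gaussian vectors scaled to unit length; nearly orthogonal when dim is large."""
    rng = random.Random(seed)
    vectors = {}
    for stem in stems:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        vectors[stem] = [x / norm for x in v]
    return vectors


def one_pass_update(initial, desired):
    """Hypothetical one-step update (an assumption, not the published learning law).

    `desired` maps a stem to {related_stem: desired_dot_product}.
    Each stem is visited once and only its related stems are touched,
    so the work is proportional to N * n.
    """
    trained = {}
    for stem, base in initial.items():
        v = list(base)
        for other, weight in desired.get(stem, {}).items():
            for k, x in enumerate(initial[other]):
                v[k] += weight * x
        norm = math.sqrt(sum(x * x for x in v))
        trained[stem] = [x / norm for x in v]  # renormalize to unit length
    return trained


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy vocabulary: "bank" and "loan" co-occur, the other stems do not.
stems = ["bank", "loan", "river", "tree"]
initial = random_unit_vectors(stems, dim=256)
desired = {"bank": {"loan": 0.8}, "loan": {"bank": 0.8}}
trained = one_pass_update(initial, desired)

print("bank . loan:", round(dot(trained["bank"], trained["loan"]), 3))
print("bank . tree:", round(dot(trained["bank"], trained["tree"]), 3))
```

In this toy run the two co-occurring stems end up pointing in similar directions, while unrelated stems stay close to orthogonal, which is the qualitative behavior the paragraph describes.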
The time requirements for training this set of vectors scale as O(Nn), where N is the number of word stems in the vocabulary and n is the average number of word stems found to co-occur with (and/or be related to) any given word stem (usually on the order of several hundred).</Paragraph> <Paragraph position="2"> This new learning law reduces the training time by a factor on the order of 100 over the original context vector learning law with little or no degradation in performance. Results can be improved even further by adjusting the dimension of the context vectors.</Paragraph> <Paragraph position="3"> HNC has also developed an approach to learning stem-level relationships across multiple languages and has used this approach to develop a prototype multilingual retrieval system. This technique, called &quot;symmetric learning&quot;, is based upon the use of tie words, which provide connectivity between each language's portion of the context vector space. In the symmetric approach, learning is conducted using both languages simultaneously, thus removing any donor language biases. Tie words are used to connect the context vector space for multiple languages through a &quot;unified hash table&quot;. The unified hash table provides the mechanism to translate a stem into an associated context vector. In the English-only MatchPlus system, this is a straightforward process.</Paragraph> <Paragraph position="4"> The stem in question is fed to the hashing function, which produces an index. The resulting index is the offset in the stem hash table. The content of that location in the hash table is a pointer to the context vector data structure. Using this approach (hash function collisions notwithstanding), each unique stem results in a unique entry and thus a unique context vector. In the multilingual system, a tie word list is used to provide multiple references, one word stem from each language, for common context vectors. 
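The lookup scheme just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MatchPlus implementation: a Python dict stands in for the stem hash table, object references stand in for pointers, keying entries by a (language, stem) pair is an assumption, and the class names, method names, and example stems are hypothetical.

```python
class ContextVector:
    """Stand-in for the context vector data structure (hypothetical)."""

    def __init__(self, dim):
        self.values = [0.0] * dim


class UnifiedStemTable:
    """Maps (language, stem) keys to context vectors; tie words share one vector."""

    def __init__(self, dim):
        self.dim = dim
        self.table = {}  # the dict hashes each key and stores a reference to a vector

    def add_stem(self, language, stem):
        """Ordinary stem: one unique entry and one unique context vector."""
        key = (language, stem)
        if key not in self.table:
            self.table[key] = ContextVector(self.dim)
        return self.table[key]

    def add_tie(self, tie):
        """`tie` maps each language to its stem; all entries reference one shared vector."""
        shared = ContextVector(self.dim)
        for language, stem in tie.items():
            self.table[(language, stem)] = shared
        return shared

    def lookup(self, language, stem):
        return self.table[(language, stem)]


# Illustrative stems only; they are not taken from the actual tie word list.
table = UnifiedStemTable(dim=4)
table.add_tie({"en": "govern", "es": "gobern"})
table.add_stem("en", "river")

english = table.lookup("en", "govern")
spanish = table.lookup("es", "gobern")
english.values[0] = 1.0  # an update made through one language's stem...
print(spanish.values[0])  # ...is visible through the other language's stem
```

Because both tie word entries reference the same object, any training update applied through one language's stem is immediately shared with the other, which is one way to picture how tie words connect the two portions of the context vector space.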
Context vector learning is performed in multiple languages simultaneously using multilingual training corpora.</Paragraph> <Paragraph position="5"> HNC has performed a preliminary evaluation of an English-Spanish version of this system by examining stem trees for tie words and non-tie words. Results indicate that the English-Spanish MatchPlus prototype is able to learn reasonable word stem interrelationships for tie words and non-tie words, thereby demonstrating the suitability of this concept for further development.</Paragraph> </Section> </Paper>