File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/96/x96-1031_metho.xml
Size: 24,100 bytes
Last Modified: 2025-10-06 14:14:25
<?xml version="1.0" standalone="yes"?> <Paper uid="X96-1031"> <Title>RECENT ADVANCES IN HNC'S CONTEXT VECTOR INFORMATION RETRIEVAL TECHNOLOGY</Title> <Section position="4" start_page="149" end_page="150" type="metho"> <SectionTitle> 2. TECHNICAL BACKGROUND </SectionTitle> <Paragraph position="0"> The HNC MatchPlus system was developed as part of the ARPA-sponsored TIPSTER text program. MatchPlus uses an information representation scheme called context vectors to encode similarity of usage. Other vector space approaches to text retrieval exist, but none embody the ability to learn word-level relationships \[1-5\]. Key attributes of the context vector approach are as follows: During this effort, the initially proposed context vector approach using human defined coordinates and initial conditions was extended and refmed to allow fully automatic generation of context vectors for text symbols (stems) based upon their demonstrated context of usage in training text. The MatchPlus system learns relationships at the stem level and then uses those relationships to construct a context vector representation for sets of symbols. For the text case, these sets of symbols are paragraphs, documents and queries.</Paragraph> <Paragraph position="1"> To start the learning process, each stem is associated with a random vector in the context vector space. Random unit vectors in high dimensional floating point spaces have a property that is referred to a &quot;quasi-orthogonality&quot;\[6\]. That is, the expected value of the dot product between any pair of random context vectors selected fi'om the set is approximately equal to zero (i.e. all vectors are approximately perpendicular to one another). This property of quasi-orthogonality is important because it serves as the initial condition for the context vector learning algorithm. The usage of the context vector technique is predicated upon the rule that symbols (stems) that are used in a similar context (exhibit proximate co-occurrence behavior) will have trained vectors that point in similar directions. Conversely, stems that never appear in a similar context will have context vectors that are approximately orthogonal.</Paragraph> <Paragraph position="2"> To achieve the desired representation, the context vector learning algorithm must take the context vectors for symbols that co-occur and move them toward each other. Symbols that do not co-occur are left in their quasi-orthogonal original condition. It is a basic tenet of the MatchPlus approach that &quot;words that are used in a similar context convey similar meaning&quot;. Since the learning is driven by proximate co-occurrence of words, the learning results in a vector set where closeness in the space is equivalent to closeness in subject content. To perform learning, a learning window is used to identify local context. The window is &quot;slid&quot; through each document in the corpus. The window has 1 target stem and multiple neighbor stems. Once the context window has been determined, the learning rule of &quot;Move context vector for target in the direction of the context vector of the neighbors&quot; is applied. Once the correction is made, we move the learning window to next location and the learning operation is repeated. The equation for this learning is shown in Figure 1.</Paragraph> <Paragraph position="4"> Several points should be noted: * All stem vectors are of length 1 (unit vectors). 
<Paragraph position="4"> Several points should be noted: * All stem vectors are of length 1 (unit vectors). In this paradigm, only the direction of the vector carries information.</Paragraph> <Paragraph position="5"> * Fully trained vectors have the property that words that are used in a similar context will have vectors that point in similar directions as measured by the dot product.</Paragraph> <Paragraph position="6"> * Words that are never used in a similar context will retain their initial condition of quasi-orthogonality. That is, they remain approximately orthogonal, with a dot product of approximately zero.</Paragraph> <Paragraph position="7"> * Trained context vectors result in a concept space where similarity of direction corresponds to similarity of meaning.</Paragraph> <Paragraph position="8"> * No human knowledge is required for training to occur. Only free text examples are needed.</Paragraph> <Paragraph position="9"> * The algorithm determines the coordinate space of the context vectors.</Paragraph> <Paragraph position="10"> When the training is complete, &quot;words that are used in a similar context will have their associated vectors point in similar directions&quot;. Conversely, words that are never used in a similar context will have vectors that are approximately orthogonal. At the summary level, the MatchPlus system translates free text into a mathematical representation in a meaningful way. Note that the MatchPlus approach does not use any external dictionaries, thesauri or knowledge bases to determine word vector relationships. These relationships are learned automatically using only the text examples provided for learning. The result of the learning procedure is a vocabulary of stem context vectors that can be used for a variety of applications including document retrieval \[7\], routing \[8\], document clustering and other text processing tasks.</Paragraph> <Paragraph position="11"> Once the stem learning is complete, it is possible to &quot;query&quot; the vector set to determine the nature of the learned relationships. To perform this operation, the user selects a &quot;root&quot; word, and the trained context vector for that word is found by a table lookup in the context vector vocabulary.</Paragraph> <Paragraph position="12"> MatchPlus then computes the dot product between the vector for the selected word and every other word vector in the vocabulary. The resulting dot products are sorted by magnitude, where larger means closer in usage.</Paragraph> <Paragraph position="13"> Sets of words (text passages and queries) and documents can also be represented by context vectors in the same information space. Document context vectors are derived as the inverse-document-frequency-weighted sum of the context vectors associated with the words in the document. Document context vectors are normalized to prevent long documents from being favored over short documents. The resulting document context vectors have the property that documents that discuss similar themes will have context vectors that point in similar directions. It is this property that translates the problem of assessing similarity of content for text into a geometry problem. Documents that are similar are close in the space, and dissimilar documents are far away. Additionally, it should be noted that all document vectors are unit length. This prevents retrieval biases due to document length.</Paragraph>
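As a concrete illustration of the representation just described, the sketch below (ours, not the MatchPlus implementation; the helper names and the exact IDF weighting are assumptions) builds unit-length, inverse-document-frequency-weighted document and query context vectors from trained stem vectors and ranks documents by dot product:

```python
# Illustrative sketch: document/query context vectors as IDF-weighted sums of
# trained stem vectors, normalized to unit length, with documents ranked by
# dot product against the query vector. Weighting details are assumptions.
import math
import numpy as np

def idf(stem, documents):
    df = sum(1 for doc in documents if stem in doc)
    return math.log(len(documents) / df) if df else 0.0

def context_vector(stems, stem_vectors, documents):
    dim = len(next(iter(stem_vectors.values())))
    v = np.zeros(dim)
    for s in stems:
        if s in stem_vectors:
            v += idf(s, documents) * stem_vectors[s]
    n = np.linalg.norm(v)
    return v / n if n > 0 else v          # unit length: long documents are not favored

def rank_documents(query_stems, documents, stem_vectors):
    q = context_vector(query_stems, stem_vectors, documents)
    scores = [float(q @ context_vector(d, stem_vectors, documents)) for d in documents]
    # A higher dot product means the document is closer in theme to the query.
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
```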
</Section> <Section position="5" start_page="150" end_page="152" type="metho"> <SectionTitle> 3. ONE STEP CONTEXT VECTOR LEARNING </SectionTitle> <Paragraph position="0"> The sections below describe an approach to context vector learning that greatly reduces the amount of computer time and resources required to obtain a trained set of stem context vectors. This approach uses a single pass through the training corpus (or corpora) to obtain desired dot product values for the set of trained context vectors. These desired dot products are used in a single pass through the vocabulary of word stems to expand a starting set of quasi-orthogonal, high-dimensional vectors. This vector expansion and subsequent renormalization results in a set of context vectors that represents the relationships between word stems in a near-optimal fashion. The time requirements for training this set of vectors scale as O(Nn), where N is the number of word stems in the vocabulary and n is the average number of word stems found to co-occur with (and/or be related to) any given word stem (usually on the order of several hundred). Using a near-worst-case estimate of n = 1000 word stems, a vocabulary size of 50,000 words, and assuming that at least ten iterations of the original learning law are required for convergence (more often at least one hundred iterations are required), this new learning law reduces the training time by a factor of between 10 and 500 (depending on whether or not the non-co-occurring terms are explicitly considered in the current learning law).</Paragraph> <Section position="1" start_page="150" end_page="151" type="sub_section"> <SectionTitle> 3.1 Current MatchPlus Context Vector Learning Law </SectionTitle> <Paragraph position="0"> The current MatchPlus context vector learning law is presented in Figure 1 and discussed in Section 2. This learning law can be derived as a stochastic gradient descent procedure for minimizing the cost function</Paragraph> <Paragraph position="2"/> <Paragraph position="4"> The factors a_ij are the desired dot products for the trained set of context vectors. These desired dot products are found as a function of co-occurrence statistics for word stems i and j. In most cases the number of words for which a_ij is non-zero (i.e., the co-occurring words) is several orders of magnitude smaller than the size of the vocabulary. In theory, the summation on the right-hand side extends over all word stems in the vocabulary. In practice, however, this summation is performed only over words that co-occur with the target word stem i. Since n (the number of co-occurring words) is usually much less than N (the number of vocabulary word stems), summing only over co-occurring words represents a considerable time savings. Non-co-occurring word stem context vectors are adjusted by subtracting the mean context vector at the end of each update iteration. This has the effect of spreading out the context vectors, hopefully driving the context vectors of non-co-occurring words closer to orthogonality. With this approximation, the time requirements for the current learning law scale as O(kNn), where k is the number of iterations required for convergence.</Paragraph>
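The equations of Figures 1 through 3 do not survive in this copy of the paper; the following is a reconstruction inferred from the surrounding text (the notation v_i for the context vector of stem i and a_ij for the desired dot product is ours), not a reproduction of the figures themselves:

```latex
% Reconstructed cost function (Figure 2) and constraints (Figure 3), inferred
% from the discussion above: a_{ij} are the desired dot products derived from
% co-occurrence statistics, and each stem vector v_i is constrained to unit length.
\[
  J(v_1,\dots,v_N) \;=\; \sum_{i=1}^{N} \sum_{j} \bigl( a_{ij} - v_i \cdot v_j \bigr)^2,
  \qquad \text{subject to } \lVert v_i \rVert = 1 \ \text{for all } i,
\]
% where, in practice, the inner sum is taken only over stems j that co-occur with stem i.
```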
</Section> <Section position="2" start_page="151" end_page="151" type="sub_section"> <SectionTitle> 3.2 Approach to One Step Learning </SectionTitle> <Paragraph position="0"> The objective of any learning law used to train context vectors is to minimize the cost function specified in Figure 2 subject to the constraints in Figure 3. In order to avoid the requirement for multiple iterations, HNC proposes to evaluate the performance of the following one step learning law:</Paragraph> <Paragraph position="2"> Here the summation is over word stems j that co-occur with word stem i, η is a design parameter chosen to optimize performance, and a_ij is the desired context vector dot product for target stem i and co-occurring stem j. Figure 4. One Step Learning Law Equations.</Paragraph> <Paragraph position="3"> Note that the summation in Figure 4 is over co-occurring word stems. This learning law is motivated by the following observation. Suppose there exists a cost function of two variables x_1 and x_2, where J(x_1, x_2) = Σ_{i,j=1,2} (a_ij - x_i · x_j)^2. Suppose further that we wish to choose δx_1 and δx_2 such that replacing x_1 and x_2 with the quantities x_1 + δx_1 and x_2 + δx_2 minimizes the cost function. For the situation in which ||x_1|| = ||x_2|| = 1 and δx_1 and δx_2 are assumed to be small, it is easily demonstrated that the solutions for δx_1 and δx_2 are δx_1 = (a_12/2) x_2 and δx_2 = (a_12/2) x_1. Adding these solutions to x_1 and x_2 yields an expression similar to that of Figure 4. Of course, the fact that the resulting vectors must be normalized makes the analogy only approximate.</Paragraph> <Paragraph position="4"> However, Figure 4 can be viewed as a one step approximation to the optimal solution. The value of this approximate solution is that it provides adequate performance with only a fraction of the computational requirements. This one step learning law scales as O(Nn), so that it is faster than the current learning law by a factor of k (the number of original learning law iterations) and is faster than the theoretically derived learning law by a factor of kN/n. For reasonable values of k, N, and n, this translates into a time savings of a factor of 10 to 1000.</Paragraph> </Section> <Section position="3" start_page="151" end_page="152" type="sub_section"> <SectionTitle> 3.3 Summary of One Step Learning </SectionTitle> <Paragraph position="0"> The successful development and testing of the one step learning law offers the possibility of much faster context vector training. The performance of the system using this law can be optimized through parameter sweeps on the context vector dimension and the free parameter η (see Figure 4).</Paragraph> </Section> </Section>
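A minimal sketch of the one step learning law of Section 3.2 follows (the exact equations of Figure 4 are not reproduced in this copy, so this is our reading of the text rather than HNC's formulation; the vector dimension and the value of η are illustrative): each stem vector is expanded once by an η-weighted, a_ij-scaled sum of its co-occurring stems' initial vectors and then renormalized, giving the single O(Nn) pass over the vocabulary discussed above.

```python
# Hedged sketch of the one step learning law: starting from quasi-orthogonal
# random unit vectors, each stem vector is expanded once using the desired dot
# products a_ij of its co-occurring stems, then renormalized. A single pass
# over the vocabulary gives the O(Nn) scaling discussed in Section 3.
import numpy as np

def one_step_train(desired, dim=512, eta=0.5, seed=0):
    """desired: dict mapping stem -> {co-occurring stem: desired dot product a_ij}."""
    rng = np.random.default_rng(seed)
    stems = set(desired)
    for co in desired.values():
        stems.update(co)
    # Initial condition: random unit vectors are approximately orthogonal.
    vectors = {s: rng.standard_normal(dim) for s in stems}
    for s in vectors:
        vectors[s] /= np.linalg.norm(vectors[s])
    trained = {}
    for i, co in desired.items():            # one pass over the vocabulary
        v = vectors[i].copy()
        for j, a_ij in co.items():            # sum only over co-occurring stems
            v += eta * a_ij * vectors[j]
        trained[i] = v / np.linalg.norm(v)    # renormalize to unit length
    return trained

# Toy usage with made-up desired dot products.
vecs = one_step_train({"attack": {"rebel": 0.6, "kill": 0.4}, "rebel": {"attack": 0.6}})
print(round(float(vecs["attack"] @ vecs["rebel"]), 3))
```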
<Section position="6" start_page="152" end_page="154" type="metho"> <SectionTitle> 4. APPROACH TO MULTILINGUAL INFORMATION RETRIEVAL (MIR) </SectionTitle> <Paragraph position="0"> The objective of solving the MIR problem is to provide the analyst/user with a flexible, high-performance tool that allows retrieval of relevant information from multilingual corpora without the need for prior translation of large volumes of text.</Paragraph> <Paragraph position="1"> The key issue is prior translation of the foreign-language material. Clearly, if all material were translated to a uniform representation, say English, the problem would be solved. However, translation is time-consuming, costly and subjective. Additionally, the current volumes of information would overwhelm any organization that attempted to perform bulk translation. Machine translation efforts have been partially successful, but these techniques frequently ignore subtleties in the translation process.</Paragraph> <Paragraph position="2"> Additionally, the cost of development, tuning and validation of this approach is a hindrance to widespread use.</Paragraph> <Paragraph position="3"> HNC has developed an approach to the MIR problem that leverages the context vector technology. It is called symmetric learning, and its attributes, as well as their implications, are discussed in the section below. Explanations of the approach are given from the frame of reference of two simultaneous languages; however, it should be noted that the approach is extensible to many languages being processed simultaneously. These discussions assume that language 1 is English and language 2 is Spanish. It should also be noted that HNC has implemented a minimal subset of the symmetric approach as a proof of concept. The preliminary results are extremely encouraging. A description of this system, the training corpus and the preliminary results is provided in Section 5.</Paragraph> <Section position="1" start_page="152" end_page="154" type="sub_section"> <SectionTitle> 4.1 Symmetric Learning </SectionTitle> <Paragraph position="0"> HNC has developed an approach to learning stem-level relationships across multiple languages.</Paragraph> <Paragraph position="1"> This technique, called &quot;symmetric learning&quot;, is based upon the use of tie words. These tie words provide connectivity between each language's portion of the context vector space. However, learning is conducted using both languages simultaneously, thus removing any donor-language biases.</Paragraph> <Paragraph position="2"> The symmetric approach is based upon the use of a &quot;unified hash table&quot;. The unified hash table provides the mechanism to translate a stem into an associated context vector. In the English-only MatchPlus system, this is a straightforward process.</Paragraph> <Paragraph position="3"> The stem in question is fed to the hashing function and an index is produced. The resulting index is the offset into the stem hash table. The entry at that location in the hash table is a pointer to the context vector data structure. Using this approach (hash function collisions notwithstanding), each unique stem results in a unique entry and thus a unique context vector.</Paragraph> <Paragraph position="4"> What is proposed is to use the tie word list to provide references to common context vectors. An example is shown in Figure 5. Assume that &quot;attack&quot; and &quot;ataque&quot; have been chosen as a tie word pair. Since these words should have the same context vector, some form of connection must be made between them. Figure 5 shows four words in the unified hash table: &quot;rebel&quot;, &quot;attack&quot;, &quot;ataque&quot;, and &quot;contra&quot;. Without hash table unification based upon the tie word list, all four words would have unique and independent context vectors. However, as can be seen in the figure, the hash table entries for the tie words have been forced to point to a common context vector entry.</Paragraph> <Paragraph position="5"> This very simple approach allows multiple references to the same context vector entity.</Paragraph>
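The following sketch (illustrative only; the data structures and names are assumptions, not HNC's implementation) shows the effect of unifying the hash table on a tie word pair: both members reference a single shared context vector entry, so an update made through either stem is seen through the other.

```python
# Toy "unified hash table": every stem maps to a context vector entry, and the
# two members of a tie word pair are forced to reference the same entry.
import numpy as np

def build_unified_table(stems, tie_word_pairs, dim=512, seed=0):
    rng = np.random.default_rng(seed)
    table = {}
    for s in stems:
        v = rng.standard_normal(dim)
        table[s] = v / np.linalg.norm(v)      # one entry per unique stem
    for word, tied_word in tie_word_pairs:
        table[tied_word] = table[word]        # alias: both stems share one vector
    return table

table = build_unified_table(["rebel", "attack", "ataque", "contra"], [("attack", "ataque")])
table["attack"] += 0.1                        # an in-place update made through "attack"...
print(table["ataque"] is table["attack"])     # ...is shared with "ataque": True
```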
<Paragraph position="6"> Once the mechanism for multiple references has been established, the next step is to consider the actual training algorithm. Example training text for English and Spanish is shown in Figure 6. For this example, it is assumed that &quot;attack&quot; and &quot;ataque&quot; are a tie word pair. Note that in this example, the text chosen is a near-literal translation.</Paragraph> <Paragraph position="7"> There is no requirement for parallel text for the symmetric learning algorithm. The English text in Figure 6 comes from the passage, &quot;Four people were killed in the attack by the rebel group Shining Path&quot;. The corresponding Spanish text is &quot;Quatro personas fueron matadas en el ataque por el groupo contras Sendero Luminoso&quot;. Figure 6 shows the context window for the stemmed text centered on the tie word attack.</Paragraph> <Paragraph position="8"> Like the standard MatchPlus context vector learning algorithm, the symmetric learning approach will utilize a convolutional &quot;context window&quot; with a center and neighbors. The stem at the center of the window is called the &quot;target&quot;. The context vector for the target stem is adjusted in the direction of its neighbors' context vectors.</Paragraph> <Paragraph position="9"> (Figure 6 context windows over the stemmed text. English: peopl, kill, attack, rebel, group; Spanish: person, mat, ataq, group, contr. In each window the center stem is the target and the other stems are neighbors.) The steps that will occur during learning given the text example shown in Figure 6 are as follows: * The convolutional window location is chosen and the target and neighbor stems are identified. In the English portion of this example, the window is centered on the word &quot;attack&quot;. The neighbor words are &quot;people&quot;, &quot;killed&quot;, &quot;rebel&quot;, and &quot;group&quot;.</Paragraph> <Paragraph position="10"> * The context vector for &quot;attack&quot; is moved in the direction of its neighbors. When the update is completed, the window is moved and the process is repeated.</Paragraph> <Paragraph position="11"> * Spanish text is processed using the same approach. In the Spanish portion of this example, the window has as its center the word &quot;ataque&quot;. Neighbors for &quot;ataque&quot; are &quot;personas&quot;, &quot;matado&quot;, &quot;groupo&quot;, and &quot;contra&quot;. The context vector for &quot;ataque&quot; is moved in the direction of its neighbors. When the update is completed, the window is moved and the process is repeated.</Paragraph> <Paragraph position="12"> * Note that &quot;attack&quot; and &quot;ataque&quot; are a tie word pair. As a consequence, they share a common context vector, and that shared vector has been influenced by the words that have occurred in a similar context in both languages. Specifically, the attack-ataque tie word pair has been influenced by &quot;people&quot;, &quot;kill&quot;, &quot;rebel&quot;, &quot;group&quot;, &quot;personas&quot;, &quot;matado&quot;, &quot;groupo&quot;, and &quot;contra&quot;.</Paragraph> <Paragraph position="13"> * Since all context vectors are in the same information space, the symmetric learning technique will result in a unified information space for both languages. Because of the &quot;second order&quot; learning effects of the context vector approach, not only will &quot;attack&quot; be related to &quot;people&quot; and &quot;personas&quot;, but &quot;people&quot; will be related to &quot;personas&quot;, &quot;matado&quot;, &quot;groupo&quot;, etc. The block diagram for generation of a system using the symmetric approach is shown in Figure 7. As can be seen in this figure, the symmetric system build uses the unified hash table as the basis for combining the stem sets from both languages. Once this process has taken place, all stem context vectors are stored in a single dataset. This unified set of context vectors is the basis for formation of document context vectors.</Paragraph>
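To tie the pieces together, here is a small self-contained run (our toy illustration, not HNC's system; the dimensionality, learning rate, number of passes, and stem lists are assumptions based on the Figure 6 example) in which the unified table and the standard window update are applied to the English and Spanish stem windows above. After training, the non-tie stems "peopl" and "person" have a positive dot product, which is the "second order" cross-language effect described in the last bullet.

```python
# Toy symmetric learning run: a shared vector for the attack/ataque tie word
# pair plus the ordinary window update over English and Spanish stem windows
# yields a cross-language relationship between non-tie stems.
import numpy as np

DIM, RATE, EPOCHS = 256, 0.2, 20
english = ["peopl", "kill", "attack", "rebel", "group"]       # stemmed English window (Figure 6)
spanish = ["person", "mat", "ataq", "group", "contr"]         # stemmed Spanish window (Figure 6)

rng = np.random.default_rng(1)
table = {}
for stem in set(english) | set(spanish):
    v = rng.standard_normal(DIM)
    table[stem] = v / np.linalg.norm(v)
table["ataq"] = table["attack"]                               # tie word pair shares one vector

def window_update(stems):
    for t, target in enumerate(stems):
        neighbors = [s for i, s in enumerate(stems) if i != t and abs(i - t) <= 2]
        direction = np.sum([table[n] for n in neighbors], axis=0)
        v = table[target]
        v += RATE * direction                                 # in-place, so aliases stay shared
        v /= np.linalg.norm(v)

for _ in range(EPOCHS):                                       # language order does not matter
    window_update(english)
    window_update(spanish)

print(round(float(table["peopl"] @ table["person"]), 2))      # positive: a cross-language relation
```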
<Paragraph position="14"> When the system generation is complete, MIR is ready for query processing. The block diagram for this process is shown in Figure 8.</Paragraph> <Paragraph position="15"> Attributes of the symmetric learning approach are as follows: 1. Once tie word pairs (or n-tuples) have been selected, all subsequent processing is fully automated. No other external knowledge sources are required.</Paragraph> <Paragraph position="16"> 2. Training text can be presented in any order. All of language 1 can be presented, followed by language 2. Alternately, documents from the two languages can be presented in intermixed order.</Paragraph> <Paragraph position="17"> 3. The context vector approach will learn &quot;second order&quot; relationships between the languages used for training. The resulting unified context vector set can be used to identify relationships between words in the two languages.</Paragraph> <Paragraph position="18"> 4. The user can enter multilingual queries based on tie words as well as non-tie words. Because all of the text is used during training, second order relationships will be formed between non-tie words in different languages. As an extreme example, if &quot;white&quot; is used only in &quot;white house&quot; and &quot;blanca&quot; is used only in &quot;casa blanca&quot;, the user will be able to query using only &quot;white&quot;, and Spanish documents about &quot;casa blanca&quot; will be retrieved. This is not so in the previous approach, where the user is limited to using only tie words as query terms.</Paragraph> <Paragraph position="19"> 5. The basic approach described here is extensible and capable of processing more than two languages at once. Additionally, this approach can be utilized for ideographic languages such as Japanese, Chinese and Korean.</Paragraph> <Paragraph position="20"> The key benefit of the MatchPlus context vector approach is its ability to learn the relationships between words. Simply disregarding the relationships contained in the foreign-language data does not make sense. The Symmetric Learning approach exploits the learned relationships without the need to translate the foreign text. It requires only the translation of a limited number of words (the tie words). Furthermore, this operation need only be done once.</Paragraph> <Paragraph position="21"> The benefits of a multilingual approach to text processing extend well beyond text retrieval.</Paragraph> <Paragraph position="22"> Obviously, text routing and index term assignment could benefit from multilingual technology. Language learning tools could exploit the technology to analyze the relationships between word usages across languages. Finally, as innovative text visualization techniques are found, multilingual text processing will surely enhance the value of such technology.</Paragraph> </Section> </Section> </Paper>