<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1013">
  <Title>Generalized Hebbian Algorithm for Incremental Singular Value Decomposition in Natural Language Processing</Title>
  <Section position="3" start_page="98" end_page="100" type="metho">
    <SectionTitle>
2 The Algorithm
</SectionTitle>
    <Paragraph position="0"> This section introduces the Generalized Hebbian Algorithm, and shows how the technique can be adapted to the rectangular matrix form of singular value decomposition. Eigen decomposition requires as input a square diagonallysymmetrical matrix, that is to say, one in which the cell value at row x, column y is the same as that at row y, column x. The kind of data described by such a matrix is the correlation between data in a particular space with other data in the same space. For example, we might wish to describe how often a particular word appears with a particular other word. The data therefore are symmetrical relations between items in the same space; word a appears with word b exactly as often as word b appears with word a. In singular value decomposition, rectangular input matrices are handled. Ordered word bigrams are an example of this; imagine a matrix in which rows correspond to the first word in a bigram, and columns to the second. The number of times that word b appears after word a is by no means the same as the number of times that word a appears after word b.</Paragraph>
    <Paragraph position="1"> Rows and columns are different spaces; rows are the space of first words in the bigrams, and columns are the space of second words.</Paragraph>
    <Paragraph position="2"> The singular value decomposition of a rectangular data matrix, A, can be presented as;</Paragraph>
    <Paragraph position="4"> where U and V are matrices of orthogonal left and right singular vectors (columns) respectively, and S is a diagonal matrix of the corresponding singular values. The U and V matrices can be seen as a matched set of orthogonal basis vectors in their corresponding spaces, while the singular values specify the effective magnitude of each vector pair. By convention, these matrices are sorted such that the diagonal of S is monotonically decreasing, and it is a property of SVD that preserving only the first (largest) N of these (and hence also only the first N columns of U and V) provides a least-squared error, rank-N approximation to the original matrix A.</Paragraph>
    <Paragraph position="5"> Singular Value Decomposition is intimately related to eigenvalue decomposition in that the singular vectors, U and V, of the data matrix, A, are simply the eigenvectors of A[?]AT and AT [?]A, respectively, and the singular values, S, are the square-roots of the corresponding eigenvalues.</Paragraph>
    <Section position="1" start_page="98" end_page="99" type="sub_section">
      <SectionTitle>
2.1 Generalised Hebbian Algorithm
</SectionTitle>
      <Paragraph position="0"> Oja and Karhunen (Oja and Karhunen, 1985) demonstrated an incremental solution to finding the first eigenvector from data arriving in the form of serial data items presented as vectors, and Sanger (Sanger, 1989) later generalized this to finding the first N eigenvectors with the Generalized Hebbian Algorithm. The algorithm converges on the exact eigen decomposition of the data with a probability of one.</Paragraph>
      <Paragraph position="1"> The essence of these algorithms is a simple Hebbian learning rule:</Paragraph>
      <Paragraph position="3"> Un is the n'th column of U (i.e., the n'th eigenvector, see equation 1), l is the learning rate and Aj is the j'th column of training matrix A. t is the timestep. The only modification to this required in order to extend it to multiple eigenvectors is that each Un needs to shadow any lower-ranked Um(m &gt; n) by removing its projection from the input Aj in order to assure both orthogonality and an ordered ranking of  the resulting eigenvectors. Sanger's final formulation (Sanger, 1989) is:</Paragraph>
      <Paragraph position="5"> In the above, cij is an individual element in the current eigenvector, xj is the input vector and yi is the activation (that is to say, ci.xj, the dot product of the input vector with the ith eigenvector). g is the learning rate.</Paragraph>
      <Paragraph position="6"> To summarise, the formula updates the current eigenvector by adding to it the input vector multiplied by the activation minus the projection of the input vector on all the eigenvectors so far including the current eigenvector, multiplied by the activation. Including the current eigenvector in the projection subtraction step has the effect of keeping the eigenvectors normalised. Note that Sanger includes an explicit learning rate, g. The formula can be varied slightly by not including the current eigenvector in the projection subtraction step.</Paragraph>
      <Paragraph position="7"> In the absence of the autonormalisation influence, the vector is allowed to grow long. This has the effect of introducing an implicit learning rate, since the vector only begins to grow long when it settles in the right direction, and since further learning has less impact once the vector has become long. Weng et al. (Weng et al., 2003) demonstrate the efficacy of this approach. So, in vector form, assuming C to be the eigenvector currently being trained, expanding y out and using the implicit learning</Paragraph>
      <Paragraph position="9"> Delta notation is used to describe the update here, for further readability. The subtracted element is responsible for removing from the training update any projection on previous singular vectors, thereby ensuring orthgonality. Let us assume for the moment that we are calculating only the first eigenvector. The training update, that is, the vector to be added to the eigenvector, can then be more simply described as follows, making the next steps</Paragraph>
      <Paragraph position="11"/>
    </Section>
    <Section position="2" start_page="99" end_page="100" type="sub_section">
      <SectionTitle>
2.2 Extension to Paired Data
</SectionTitle>
      <Paragraph position="0"> Let us begin with a simplification of 5:</Paragraph>
      <Paragraph position="2"> Here, the upper case X is the entire data matrix. n is the number of training items. The simplification is valid in the case that c is stabilised; a simplification that in our case will become more valid with time. Extension to paired data initially appears to present a problem. As mentioned earlier, the singular vectors of a rectangular matrix are the eigenvectors of the matrix multiplied by its transpose, and the eigenvectors of the transpose of the matrix multiplied by itself. Running GHA on a nonsquare non-symmetrical matrix M, ie. paired data, would therefore beachievable using standard GHA as follows:</Paragraph>
      <Paragraph position="4"> In the above, ca and cb are left and right singular vectors. However, to be able to feed the algorithm with rows of the matrices MMT and MTM, we would need to have the entire training corpus available simultaneously, and square it, which we hoped to avoid. This makes it impossible to use GHA for singular value decomposition of serially-presented paired input in this way without some further transformation. Equation 1, however, gives:</Paragraph>
      <Paragraph position="6"> Here, s is the singular value and a and b are left and right data vectors. The above is valid in the case that left and right singular vectors ca and cb have settled (which will become more accurate over time) and that data vectors a and b outer-product and sum to M.</Paragraph>
      <Paragraph position="7">  Inserting 9 and 10 into 7 and 8 allows them to be reduced as follows:</Paragraph>
      <Paragraph position="9"> This element can then be reinserted into GHA.</Paragraph>
      <Paragraph position="10"> To summarise, where GHA dotted the input with the eigenvector and multiplied the result by the input vector to form the training update (thereby adding the input vector to the eigenvector with a length proportional to the extent to which it reflects the current direction of the eigenvector) our formulation dots the right input vector with the right singular vector and multiplies the left input vector by this quantity before adding it to the left singular vector, and vice versa. In this way, the two sides cross-train each other. Below is the final modification of GHA extended to cover multiple vector pairs. The original GHA is given beneath it for comparison.</Paragraph>
      <Paragraph position="12"> In equations 6 and 9/10 we introduced approximations that become accurate as the direction of the singular vectors settles. These approximations will therefore not interfere with the accuracy of the final result, though they might interfere with the rate of convergence. The constant s3 has been dropped in 19 and 20.</Paragraph>
      <Paragraph position="13"> Its relevance is purely with respect to the calculation of the singular value. Recall that in (Weng et al., 2003) the eigenvalue is calculable as the average magnitude of the training update trianglec. In our formulation, according to 17 and 18, the singular value would be trianglec divided by s3. Dropping the s3 in 19 and 20 achieves that implicitly; the singular value is once more the average length of the training update.</Paragraph>
      <Paragraph position="14"> The next section discusses practical aspects of implementation. The following section illustrates usage, with English language word and letter bigram data as test domains.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="100" end_page="101" type="metho">
    <SectionTitle>
3 Implementation
</SectionTitle>
    <Paragraph position="0"> Within the framework of the algorithm outlined above, there is still room for some implementation decisions to be made. The naive implementation can be summarised as follows: the first datum is used to train the first singular vector pair; theprojection of thefirstsingular vector pair onto this datum is subtracted from the datum; the datum is then used to train the second singular vector pair and so on for all the vector pairs; ensuing data items are processed similarly. The main problem with this approach is as follows. At the beginning of the training process, the singular vectors are close to the values they were initialised with, and far away from the values they will settle on. The second singular vector pair is trained on the datum minus its projection onto the first singular vector pair in order to prevent the second singular vector pair from becoming the same as the first. But if the first pair is far away from its eventual direction, then the second has a chance to move in the direction that the first will eventually take on. In fact, all the vectors, such as they can whilst remaining orthogonal to each other, will move in the strongest direction. Then, when the first pair eventually takes on the right direction, the others have difficulty recovering, since they start to receive data that they have very little projection on, meaning that they learn very  slowly. The problem can be addressed by waiting until each singular vector pair is relatively stable before beginning to train the next. By &amp;quot;stable&amp;quot;, we mean that the vector is changing little in its direction, such as to suggest it is very close to its target. Measures of stability might include the average variation in position of the endpoint of the (normalised) vector over a number of training iterations, or simply length of the (unnormalised) vector, since a long vector is one that is being reinforced by the training data, such as it would be if it was settled on the dominant feature. Termination criteria might include that a target number of singular vector pairs have been reached, or that the last vector is increasing in length only very slowly.</Paragraph>
  </Section>
  <Section position="5" start_page="101" end_page="102" type="metho">
    <SectionTitle>
4 Application
</SectionTitle>
    <Paragraph position="0"> The task of relating linguistic bigrams to each other, as mentioned earlier, is an example of a task appropriate to singular value decomposition, in that the data is paired data, in which each item is in a different space to the other. Consider word bigrams, for example.</Paragraph>
    <Paragraph position="1"> First word space is in a non-symmetrical relationship to second word space; indeed, the spaces are not even necessarily of the same dimensionality, since there could conceivably be words in the corpus that never appear in the first word slot (they might never appear at the start of a sentence) or in the second word slot (they might never appear at the end.) So a matrix containing word counts, in which each unique first word forms a row and each unique second word forms a column, will not be a square symmetrical matrix; the value at row a, column b, will not be the same as the value at row b column a, except by coincidence.</Paragraph>
    <Paragraph position="2"> The significance of performing dimensionality reduction on word bigrams could be thought of as follows. Language clearly adheres to some extent to a rule system less rich than the individual instances that form its surface manifestation. Those rules govern which words might follow which other words; although the rule system is more complex and of a longer range that word bigrams can hope to illustrate, nonetheless the rule system governs the surface form of word bigrams, and we might hope that it would be possible to discern from word bigrams something of the nature of the rules. In performing dimensionality reduction on word bigram data, we force the rules to describe themselves through a more impoverished form than via the collection of instances that form the training corpus. The hope is that the resulting simplified description will be a generalisable system that applies even to instances not encountered at training time.</Paragraph>
    <Paragraph position="3"> On a practical level, the outcome has applications in automatic language acquisition.</Paragraph>
    <Paragraph position="4"> For example, the result might be applicable in language modelling. Use of the learning algorithm presented in this paper is appropriate given the very large dimensions of any realistic corpus of language; The corpus chosen for this demonstration is Margaret Mitchell's &amp;quot;Gone with the Wind&amp;quot;, which contains 19,296 unique words (421,373 in total), which fully realized as a correlation matrix with, for example, 4-byte floats would consume 1.5 gigabytes, and which in any case, within natural language processing, would not be considered a particularly large corpus. Results on the word bigram task are presented in the next section.</Paragraph>
    <Paragraph position="5"> Letter bigrams provide a useful contrasting illustration in this context; an input dimensionality of 26 allows the result to be more easily visualised. Practical applications might include automatic handwriting recognition, where an estimate of the likelihood of a particular letter following another would be useful information. The fact that there are only twenty-something letters in most western alphabets though makes the usefulness of the incremental approach, and indeed, dimensionality reduction techniques in general, less obvious in this domain. However, extending the space to letter trigrams and even four-grams would change the requirements. Section 4.2 discusses results on a letter bigram task.</Paragraph>
    <Section position="1" start_page="101" end_page="102" type="sub_section">
      <SectionTitle>
4.1 Word Bigram Task
</SectionTitle>
      <Paragraph position="0"> &amp;quot;Gone with the Wind&amp;quot; was presented to the algorithm as word bigrams. Each word was mapped to a vector containing all zeros but for a one in the slot corresponding to the unique word index assigned to that word. This had the effect of making input to the algorithm a normalised vector, and of making word vectors orthogonal to each other. The singular vector pair's reaching a combined Euclidean  magnitude of 2000 was given as the criterion for beginning to train the next vector pair, the reasoning being that since the singular vectors only start to grow long when they settle in the approximate right direction and the data starts to reinforce them, length forms a reasonable heuristic for deciding if they are settled enough to begin training the next vector pair.</Paragraph>
      <Paragraph position="1"> 2000 was chosen ad hoc based on observation of the behaviour of the algorithm during training. null The data presented are the words most representative of the top two singular vectors, that is to say, the directions these singular vectors mostly point in. Table 1 shows the words with highest scores in the top two vector pairs. It says that in this vector pair, the normalised left hand vector projected by 0.513 onto the vector for the word &amp;quot;of&amp;quot; (or in other words, these vectors have a dot product of 0.513.) The normalised right hand vector has a projection of 0.876 onto the word &amp;quot;the&amp;quot; etc. This first table shows a left side dominated by prepositions, with a right side in which &amp;quot;the&amp;quot; is by far the most important word, but which also contains many pronouns. The fact that the first singular vector pair is effectively about &amp;quot;the&amp;quot; (the right hand side points far more in the direction of &amp;quot;the&amp;quot; than any other word) reflects its status as the most common word in the English language. What thisresult is saying is that were we to be allowed only one feature with which to describe word English bigrams, a feature describing words appearing before &amp;quot;the&amp;quot; and words behaving similarly to &amp;quot;the&amp;quot; would be the best we could choose. Other very common words in English are also prominent in this feature.</Paragraph>
    </Section>
    <Section position="2" start_page="102" end_page="102" type="sub_section">
      <SectionTitle>
4.2 Letter Bigram Task
</SectionTitle>
      <Paragraph position="0"> Running the algorithm on letter bigrams illustrates different properties. Because there are only 26 letters in the English alphabet, it is meaningful to examine the entire singular vector pair. Figure 1 shows the third singular vector pair derived by running the algorithm on letter bigrams. The y axis gives the projection of the vector for the given letter onto the singular vector. The left singular vector is given on the left, and the right on the right, that is to say, the first letter in the bigram is on the left and the second on the right. The first two singular vector pairs are dominated by letter frequency effects, but the third is interesting because it clearly shows that the method has identified vowels. It means that the third most useful feature for determining the likelihood of letter b following letter a is whether letter a is a vowel. If letter b is a vowel, letter a is less likely to be (vowels dominate the negative end of the right singular vector). (Later features could introduce subcases where a particular vowel is likely to follow another particular vowel, but this result suggests that the most dominant case is that this does not happen.) Interestingly, the letter 'h' also appears at the negative end of the right singular vector, suggesting that 'h' for the most part does not follow a vowel in English. Items near zero ('k', 'z' etc.) are not strongly represented in this singular vector pair; it tells us little about them.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>