<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1310">
  <Title>Nonlocal Language Modeling based on Context Co-occurrence Vectors</Title>
  <Section position="4" start_page="80" end_page="81" type="metho">
    <SectionTitle>
2 Word Co-occurrence Vector
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="80" end_page="80" type="sub_section">
      <SectionTitle>
2.1 Word-Document Co-occurrence
Matrix
</SectionTitle>
      <Paragraph position="0"> Word co-occurrences are directly represented in a matrix whose rows correspond to words and whose columns correspond to documents (e.g. a newspaper article). The element of the matrix is 1 if the word of the row appears in the document of the colunm (Figure 1). Wre call such a matrix a word-document co-occurrence matrix.</Paragraph>
      <Paragraph position="1"> The row-vectors of a word-document co-occurrence matrix represent the co-occurrence information of words. If two words tend to appear in the same documents, that is: tend to co-occur, their row-vectors are similar, that is, they point in sinfilar directions.</Paragraph>
      <Paragraph position="2"> The more document is considered, the more reliable and realistic the co-occurrence information will be. Then, the row size of a word-document co-occurrence matrix may become very large. Since enormous amounts of online text are available these days, row size can become more than a million documents. Then, it is not practical to use a word-docmnent co-occurrence matrix as it is. It is necessary to reduce row size and to simulate the tendency in the original matrix by a reduced matrix.</Paragraph>
    </Section>
    <Section position="2" start_page="80" end_page="81" type="sub_section">
      <SectionTitle>
2.2 Reduction of Word-Document
Co-occurrence Matrix
</SectionTitle>
      <Paragraph position="0"> The aim of a word-document co-occurrence matrix is to measure co-occurrence of two words by the angle of the two row-vectors.</Paragraph>
      <Paragraph position="1"> In the reduction of a matrix, angles of two row-vectors in the original matrLx should be maintained in the reduced matrLx.</Paragraph>
      <Paragraph position="2">  As such a matrix reduction, we utilized a learning method developed by HNC Software (Ilgen and Rushall, 1996). 1  1. Not the word-docmnent co-occurrence matrix is constructed from tile learning corpus, but a word-word co-occurrence matrix. In this matrix: the rows and colunms correspond to words and the i-th diagonal element denotes the number of documents in which the word wl appears, F(wi). The i:j-th element denotes the number of documents in which both  words w,: and wj appear, F(wi, wj) (Figure 2).</Paragraph>
      <Paragraph position="3"> The importmlt information in a word-document co-occurrence matrix is the cosine of the angle of the row-vector of wi and that of wj, which can be calculated by the word-word co-occurrence matrix as follows: F(w,:, wj) (2) This is because x/F(wi) corresponds to the magnitude of the row-vector of wl, and F(wl, wi) corresponds to the dot product of the row-vector of wl and that of wj in the word-docmnent co-occurrence matrix.</Paragraph>
      <Paragraph position="4"> 2. Given a reduced row size, a matrix is initialized as follows: matrix elements are chosen from a normal distribution randomly, then each row-vector is normalized to magnitude 1.0. The random refit row-vector of the word wl is denoted as  semantic representation of words and used to represent documents and queries.</Paragraph>
      <Paragraph position="5">  property that is referred to a &amp;quot;qnasiorthogonality'. That is; the expected ~C/alue of the dot product between an3&amp;quot; pair of random row-vectors, wci Rand and wet and, is approximately equal to zero (i.e. all vectors are approximately orthogonal). null  3. The trained row-vector, wai is calculated as follows:</Paragraph>
      <Paragraph position="7"> The procedure iterates the following calculation: null</Paragraph>
      <Paragraph position="9"> The learning method by HNC is a rather simple approximation of the procedure, doing just one step of it. Note that wci.wcj is approximately zero for the initialized random vectors.</Paragraph>
      <Paragraph position="10"> aij corresponds to the degree of the co-occurrence of two words. By adding wc~ and to wet a'd depending on aij, th.e learning formula (3) achieves that two words that, tend to co-occur will have trained vectors that point in shnilar directions, r/is a design parameter chosen to optimize performance. The formula (4) is to normalize vectors to magnitude 1.0.</Paragraph>
      <Paragraph position="11"> We call the trained row-vector we/of the word wi a word co-occurrence vector.</Paragraph>
      <Paragraph position="12"> The background of the above method is a stochastic gradient descent procedure for minimizing the cost function:</Paragraph>
      <Paragraph position="14"> subject to the constraints \[\[we/I\[ = 1.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="81" end_page="82" type="metho">
    <SectionTitle>
3 Context Co-occurrence Vector
</SectionTitle>
    <Paragraph position="0"> The next question is how to represent the context of a document based on word co-occurrence vectors. We propose a simple model which represents the context as the sum of the word co-occurrence vectors associated with content words ill a document so far. It should be noted that the vector is normalized to unit length. V~re call the resulting vector a context co-occurrence vector.</Paragraph>
    <Paragraph position="1"> W'ord co-occurrence vectors have the prop-erty that words which tend to co-occur have vectors that. point in similar directions. Context co-occurrence vectors are expected to have the sinfilar property. That is, if a word tends to appear in a given context, the word co-occurrence vector of the word and the context co-occurrence vector of the context will point in similar directions ......</Paragraph>
    <Paragraph position="2"> Such a context co-occurrence vector can be seen to predict the occurrence of words in a</Paragraph>
    <Paragraph position="4"> given context, mad is utilized as a component of statistical language modeling, as shown in the next section.</Paragraph>
  </Section>
  <Section position="6" start_page="82" end_page="84" type="metho">
    <SectionTitle>
4 Language Modeling using
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="82" end_page="82" type="sub_section">
      <SectionTitle>
Context Co-occurrence Vector
4.1 Context Co-occurrence
Probability
</SectionTitle>
      <Paragraph position="0"> The dot product of a context co-occurrence vector and a word co-occurrence vector shows the degree of affinity of the context m:d the word. The probability of a content word based on such dot products, called a context co-occurrence probability, can be calculated as follows:</Paragraph>
      <Paragraph position="2"> where cc~ -1 denotes the context co-occurrence vector of the left context, Wl ... wi-1, and Cc denotes a content word class. Pc(wilw~-lcc) metals the conditional probability of wi given that a content word follows wj-:.</Paragraph>
      <Paragraph position="3"> One choice for the function .f(x) is the identity. However, a linear contribution of dot products to the probability results in poorer estimates, since the differences of dot products of related words (tend to co-occur) and unrelated words are not so large. Experiments showed that x 2 or x 3 is a better estimate.</Paragraph>
      <Paragraph position="4"> An example of context co-occurrence probabilities is shown in Figure 3.</Paragraph>
    </Section>
    <Section position="2" start_page="82" end_page="83" type="sub_section">
      <SectionTitle>
4.2 Language Modeling using Context
Co-occurrence Probability
</SectionTitle>
      <Paragraph position="0"> Context co-occurrence probabilities can ham dle long-distance lexical dependencies while a standard trigram model can handle local contexts more clearly: in this way they complement each other. Therefore, language modeling of their linear interpolation is employed.</Paragraph>
      <Paragraph position="1"> Note that tile linear interpolation of unigram, bigram and trigram models is simply referred to 'trigxan: model' in this paper.</Paragraph>
      <Paragraph position="2"> The proposed language model, called a context language model, computes probabilities as shown in Figure 4. Since context co-occurrence probabilities are considered only for content words (Cc), probabilities are calculated separately for content words (Co) and function words (C/).</Paragraph>
      <Paragraph position="3"> P(Cc\[w~ -1) denotes the probability that a content word follows w~-:, which is approximated by a trigrmn nmdel. P(.wi\[w~-lcc) denotes the probability that wi follows w~-: given that a content word follows w~-:, which is a linear interpolation of a standard trigram model and the context co-occurrence probabilities. null In the case of a function word, since the context co-occurrence probability is not considered, P(wdw~-lCi) is just a standard trigranl model.</Paragraph>
      <Paragraph position="4"> X's adapt using an EM re-estimation procedure on the held-out data.</Paragraph>
      <Paragraph position="5">  shijyo no ~ wo ~ ni Wall-gai ga kakkyou wo teishi, bei kabushiki 'US' 'stock' 'market' 'sudden rise' 'background' %Vall Street' 'activity' 'show' wagayonoharu wo ~a~ shire iru. \[shoukenl kaisha, ~h~ ginkou wa 1996 nen ni 'prosperity' 'enjoy' 'do' 'stock' 'company' 'investment' 'bank' 'year' halite ka o saiko l ko shi \] '96 ne, I k b shiki l so.ha '95 'enter' 'past' 'maximum' 'profit' 'renew' 'year' ni I .tsuzuki\] kyushin . mata \] kab.uka\] kyushin wo 'continue' 'rapid increase' 'stock price' 'rapidly increase' I shinkabul hakkou ga ~ saikou to natta.</Paragraph>
      <Paragraph position="6"> 'new stock' 'issue' 'past' 'maximum' 'become' 'stock' 'market' 'year'</Paragraph>
      <Paragraph position="8"> model. (Note that wa, ga, wo, ni; to and no are Japanese postpositions.)</Paragraph>
    </Section>
    <Section position="3" start_page="83" end_page="84" type="sub_section">
      <SectionTitle>
4.3 Test Set Perplexity
</SectionTitle>
      <Paragraph position="0"> By using the Mainichi Newspaper corpus (from 1991 to 1997, 440,000 articles), test set perplexities of a standard trigrmn/bigram model and the proposed context language model are compared. The articles of six years were used for the leanfing of word co-occurrence vectors, unigrams, bigrmns and trigrams; the articles of half a year were used as a held-out data for EM re-estimation of A's; the remaining articles (half a year) for computing test set perplexities.</Paragraph>
      <Paragraph position="1"> Word co-occurrence vectors were computed for the top 50,000 frequent content words (excluding pronouns, numerals, temporal nouns, mad light verbs) in the corpus, and unigrmn: bigrmn and trigrmn were computed for the top 60,000 frequent words.</Paragraph>
      <Paragraph position="2"> The upper part of Table 1 shows thecomparison results of the stmldard trigram model and the context language model. For the best parameters (marked by *), the overall perplexity decreased 5.0% and the perplexity on target vocabulary (50,000 content words) decreased 27.270 relative to the standard trigram model. For the best parameters, A's were adapted as follows:</Paragraph>
      <Paragraph position="4"> As for parazneter settings, note that performance is decreased by using shorter word co-occurrence vector size. The vaxiation of ~/does not change the performance so much.</Paragraph>
      <Paragraph position="6"> The lower part of Table 1 shows the comparison results of the standard bigram model and the context language model. Here, the context language model is based on the bigrana model, that is, the terms concerning trigrmn in Figure 4 were eliminated. The result was similar, but the perplexity decreased a bit more; 5.7% overall and 28.9% on target vocabulary.</Paragraph>
      <Paragraph position="7"> Figure 5 shows a test article in which the probabilities of content words by the trigram lnodel aald the context model are compared. If that by the context model is bigger (i.e. the context model predicts better), the word is boxed; if not, the word is underlined.</Paragraph>
      <Paragraph position="8"> The figure shows that the context model usually performs better after a function word, where the trigram model usually has little prediction. On the other hand, the trigram model performs better after a content word (i.e. in a compound noun) because a clear prediction by the trigram model is reduced by paying attention to the relatively vague context co-occurrence probability (Acc is 0.17).</Paragraph>
      <Paragraph position="9"> The proposed model is a constant interpolation of a trigram model and the context co.0ccurrence probabilities. More adaptive interpolation depending on the N-gram probability distribution may improve the performance.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="84" end_page="84" type="metho">
    <SectionTitle>
5 Related Work
</SectionTitle>
    <Paragraph position="0"> Cache language models (Kuhn mad de Mori, 1990) boost the probability of the words already seen in the history.</Paragraph>
    <Paragraph position="1"> Trigger models (Lau et al., 1993), even more general, try to capture the co-occurrences between words. While the basic idea of our model is similar to trigger models, they handle co-occurrences of word pairs independently and do not use a representation of the whole context. This omission is also done in applications such as word sense dismnbiguation (Yarowsky: 1994; FUNG et al., 1999).</Paragraph>
    <Paragraph position="2"> Our model is the most related to Coccaro mad Jurafsky (1998), in that a reduced vector space approach was taken and context is represented by the accumulation of word co-occurrence vectors. Their model was reported to decrease the test set perplexity by 12%, compared to the bigram nmdel. The major differences are:</Paragraph>
  </Section>
  <Section position="8" start_page="84" end_page="85" type="metho">
    <SectionTitle>
1. SVD (Singular Value Decomposition)
</SectionTitle>
    <Paragraph position="0"> was used to reduce the matrix which is common in the Latent Semaaltic Analysis (Deerwester et ai.; 1990), and 2. context co-occurrence probabilities were computed for all words, and the degree of combination of context co-occurrence probabilities and N-gram probabilities was computed for each word, depending on its distribution over the set of doculnents. null As for the first point, we utilized the computationally-light, iteration-based procedure. One reason for this is that the computational cost of SVD is very high when millions or more documents are processed.</Paragraph>
    <Paragraph position="1"> Furthermore, considering an extension of our nmdel with a cognitive viewpoint, we believe an iteration-based model seems more reasonable than an algebraic model such as SVD. As for the second point, we doubt the appropriateness to use the word's distribution as a measure of combination of two models.</Paragraph>
    <Paragraph position="2"> What we need to do is to distinguish words to which semantics should be considered and other words. We judged the distinction of content words and function words is good enough for that purpose, and developed their trigram-based distinction as shown in Figure 4. Several topic-based models have been proposed based on the observation that certain words tend to have different probability distributions in different topics. For example, Florian and Yarowsky (1999) proposed the following model:</Paragraph>
    <Paragraph position="4"> where t denotes a topic id. Topics are obtained by hierarchical clustering from a training corpus, and a topic-specific language model, Pt, is learned from the clustered documents. Reductions in perplexity relative to a bigrmn model were 10.5% for the entire text and 33.5% for the target vocabulary.</Paragraph>
    <Paragraph position="5"> Topic-based models capture long-distance lexical dependencies via intermediate topics.</Paragraph>
    <Paragraph position="6"> In other words, the estimated distribution of topics, P(t\]w~), is the representation of a context. Our model does not use such intermediate topics, but accesses word cg-occurrence information directly aald represents a context as the accumulation of this information.</Paragraph>
  </Section>
class="xml-element"></Paper>