
<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1805">
  <Title>A Language Model Approach to Keyphrase Extraction</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Related work
</SectionTitle>
    <Paragraph position="0"> Word collocation Various collocation metrics have been proposed, including mean and variance (Smadja, 1994), the t-test (Church et al., 1991), the chi-square test, pointwise mutual information (MI) (Church and Hanks, 1990), and binomial log-likelihood ratio test (BLRT) (Dunning, 1993).</Paragraph>
    <Paragraph position="1"> According to (Manning and Sch&amp;quot;utze, 1999), BLRT is one of the most stable methods for collocation discovery. (Pantel and Lin, 2001) reports, however, that BLRT score can be also high for two frequent terms that are rarely adjacent, such as the word pair &amp;quot;the the,&amp;quot; and uses a hybrid of MI and BLRT.</Paragraph>
    <Paragraph position="2"> Keyphrase extraction Damerau (1993) uses the relative frequency ratio between two corpora to extract domain-specific keyphrases. One problem of using relative frequency is that it tends to assign too high a score for words whose frequency in the background corpus is small (or even zero).</Paragraph>
    <Paragraph position="3"> Some work has been done in extracting keyphrases from technical documents treating keyphrase extraction as a supervised learning problem (Frank et al., 1999; Turney, 2000). The portability of a learned classifier across various unstructured/structured text is not clear, however, and the agreement between classifier and human judges is not high.1 We would like to have the ability to extract keyphrases from a totally new domain of text without building a training corpus.</Paragraph>
    <Paragraph position="4"> Combining keyphrase and collocation Yamamoto and Church (2001) compare two metrics, MI and Residual IDF (RIDF), and observed that MI is suitable for finding collocation and RIDF is suitable for finding informative phrases. They took the intersection of each top 10% of phrases identified by MI and RIDF, but did not extend the approach to combining the two metrics into a unified score.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Baseline method based on binomial log-likelihood ratio test
</SectionTitle>
    <Paragraph position="0"> log-likelihood ratio test We can use various statistics as a measure for phraseness and informativeness. For our baseline, we have selected the method based on binomial log-likelihood ratio test (BLRT) described in (Dunning, 1993).</Paragraph>
    <Paragraph position="1"> The basic idea of using BLRT for text analysis is to consider a word sequence as a repeated sequence of binary trials comparing each word in a corpus to a target word, and use the likelihood ratio of two hypotheses that (i) two events, observed a0a2a1 times out of a3a4a1 total tokens and a0a6a5 times out of a3a7a5 total tokens respectively, are drawn from different distributions and (ii) from the same distribution.</Paragraph>
    <Paragraph position="2"> 1e.g. Turney reports 62% &amp;quot;good&amp;quot;, 18% &amp;quot;bad&amp;quot;, 20% &amp;quot;no opinion&amp;quot; from human judges.</Paragraph>
    <Paragraph position="3"> The BLRT score is calculated with</Paragraph>
    <Paragraph position="5"> In the case of calculating the phraseness score of an adjacent word pair (a43 a13a45a44 ), the null hypothesis is that a43 and a44 are independent, which can be expressed as a11a2a9 a44a25a46a43 a15 a21a47a11a29a9 a44a22a46a49a48 a43 a15 . We can use Equation (1) to calculate phraseness by setting:</Paragraph>
    <Paragraph position="7"> where a51a60a9a54a43 a15 is the frequency of the word a43 and a51a53a9a54a43 a13a45a44a59a15 is the frequency of a44 following a43 .</Paragraph>
    <Paragraph position="8"> For calculating informativeness of a word a66 ,</Paragraph>
    <Paragraph position="10"> where a51a68a67a70a69a42a9a54a66 a15 and a51a68a71a72a69a42a9a54a66 a15 are the frequency of a66 in the foreground and background corpus, respectively.</Paragraph>
    <Paragraph position="11"> Combining a phraseness score a73a25a74 and an informativeness score a73a29a19 into a single score value is not a trivial task since the the BLRT scores vary a lot between phraseness and informativeness and also depending on data (c.f. Figure 6 (a)).</Paragraph>
    <Paragraph position="12"> One way to combine those scores is to use an exponential model. We experimented with the following logistic function:</Paragraph>
    <Paragraph position="14"> whose parameters a81 ,a82 , and a84 are estimated on a held-out data set, given feedback from users (i.e. supervised). null Figure 2 shows some example phrases extracted with this method from the data set described in Section 6.1, where the parameters, a81 , a82 , a84 , are manually optimized on the test data.</Paragraph>
    <Paragraph position="15"> Although it is possible to rank keyphrases using this approach, there are a couple of drawbacks.</Paragraph>
    <Paragraph position="16">  b=0.000005, c=8) Necessity of tuning parameters the existence of parameters in the combining function requires human labeling, which is sometimes an expensive task to do, and the robustness of learned weight across domains is unknown. We would like to have a parameter-free and robust way of combining scores.</Paragraph>
    <Paragraph position="17"> Inappropriate symmetry BLRT tests to see if two random variables are independent or not. This sometimes leads to unwanted phrases getting a high score. For example, when the background corpus happens to have many occurrences of phrase al jazeera which is an unusual phrase in the foreground corpus, then the phrase still gets high score of informativeness because the distribution is so different. What we would like to have instead is asymmetric scoring function to test the loss of the action of not taking the target phrase as a keyphrase.</Paragraph>
    <Paragraph position="18"> In the next section, we propose a new method trying to address these issues.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Proposed method
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Language models and expected loss
</SectionTitle>
      <Paragraph position="0"> A language model assigns a probability value to every sequence of words a86a87a21a64a66 a1a88a66 a5a2a89a70a89a70a89a88a66  . The probability a90a60a9a54a86 a15 can be decomposed as</Paragraph>
      <Paragraph position="2"> Assuming a66a38a19 only depends on the previous a95 words, N-gram language models are commonly used. The following is the trigram language model case.</Paragraph>
      <Paragraph position="4"> Here each word only depends on the previous two words. Please refer to (Jelinek, 1990) and (Chen and Goodman, 1996) for more about N-gram models and associated smoothing methods.</Paragraph>
      <Paragraph position="5"> Now suppose we have a foreground corpus and a background corpus and have created a language model for each corpus. The simplest language model is a unigram model, which assumes each word of a given word sequence is drawn independently. We denote the unigram model for the foreground corpus as a7 a1 a1fg and for the background corpus as a7 a1 a1bg. We can also train higher order models  Among those four models, a7 a1 a2fg will be the best model to describe the foreground corpus in the sense that it has the smallest cross-entropy or perplexity value over the corpus.</Paragraph>
      <Paragraph position="6"> If we use one of the other three models instead, then we have some inefficiency or loss to describe the corpus. We expect the amount of loss between using a7 a1a3a2fg and a7 a1 a1fg is related to phraseness and the loss between a7 a1a13a2fg and a7 a1a3a2bg is related to informativeness. Figure 3 illustrates these relationships. null</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Pointwise KL-divergence between models
</SectionTitle>
      <Paragraph position="0"> One natural metric to measure the loss between two language models is the Kullback-Leibler (KL) divergence. The KL divergence (also called relative entropy) between two probability mass function a11a29a9a54a43 a15 and a14a59a9a54a43 a15 is defined as</Paragraph>
      <Paragraph position="2"> of assuming that the distribution is a14 when the true distribution is a11 .&amp;quot; (Cover and Thomas, 1991) You can see this by the following relationship:</Paragraph>
      <Paragraph position="4"> The first term a62 a21 a11a29a9a54a43 a15 a1a27a29a28a31a30a33a32a35a34 a21a37a36 is the cross entropy and the second term a24a85a9a26a25 a15 is the entropy of the random variable a25 , which is how much we could compress symbols if we know the true distribution a11 .</Paragraph>
      <Paragraph position="5"> We define pointwise KL divergence a38a40a39 a9a12a11a41a16a42a14 a15 to be the term inside of the summation of Equation (6):</Paragraph>
      <Paragraph position="7"> Intuitively, this is the contribution of the phrase a86 to the expected loss of the entire distribution.</Paragraph>
      <Paragraph position="8"> We can now quantify phraseness and informativeness as follows: Phraseness of a86 is how much we lose information by assuming independence of each word by applying the unigram model, instead of the a95 gram model.</Paragraph>
      <Paragraph position="10"> Informativeness of a86 is how much we lose information by assuming the phrase is drawn from the background model instead of the foreground model.</Paragraph>
      <Paragraph position="12"> Combined The following is considered to be a mixture of phraseness and informativeness.</Paragraph>
      <Paragraph position="14"> Note that the KL divergence is always nonnegative2, but the pointwise KL divergence can be a negative value. An example is the phraseness of the bigram &amp;quot;the the&amp;quot;.</Paragraph>
      <Paragraph position="16"> since a11a2a9 thea13 thea15a4a3 a11a29a9 thea15 a11a29a9 thea15 .</Paragraph>
      <Paragraph position="17"> Also note that in the case of phraseness of a bigram, the equation looks similar to pointwise mutual information (Church and Hanks, 1990) , but they are different. Their relationship is as follows.</Paragraph>
      <Paragraph position="19"> The pointwise KL divergence does not assign a high score to a rare phrase, whose contribution of loss is small by definition, unlike pointwise mutual information, which is known to have problems (as described in (Manning and Sch&amp;quot;utze, 1999), e.g.).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Combining phraseness and informativeness
</SectionTitle>
      <Paragraph position="0"> One way of getting a unified score of phraseness and informativeness is using equation (11). We can also calculate phraseness and informativeness separately and then combine them.</Paragraph>
      <Paragraph position="1"> We combine the phraseness score a73a25a74 and informativeness score a73a80a19 by simply adding them into a single score a73 .</Paragraph>
      <Paragraph position="2"> a73a75a21a57a73a20a74 a28a47a73 a19 (12) Intuitively, this can be thought of as the total loss. We will show some empirical results to justify this scoring in the next section.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Experimental results
</SectionTitle>
    <Paragraph position="0"> In this section, we show some preliminary experimental results of applying our method on real data.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Data set
</SectionTitle>
      <Paragraph position="0"> We used the 20 newsgroups data set3, which contains 20,000 messages (7.4 million words) between February and June 1993 taken from 20  Usenet newsgroups, as the background data set, and another 20,000 messages (4 million words) between June and September 2002 taken from rec.arts.movies.current-films newsgroup as the foreground data set. Each message's subject header and the body of the message (including quoted text) is tokenized into lowercase tokens on both data set. No stemming is applied.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Finding key-bigrams
</SectionTitle>
      <Paragraph position="0"> The first experiment we show is to find key-bigrams, which is the simplest case requiring combination of phraseness and informativeness scores. Figure 4 outlines the extraction procedure.</Paragraph>
      <Paragraph position="1">  a10 Inputs: foreground and background corpus. 1. create background language model from the background corpus.</Paragraph>
      <Paragraph position="2"> 2. count all adjacent word pairs in the foreground cor- null pus, skipping pre-annotated boundaries (such as HTML tag boundaries) and stopwords.</Paragraph>
      <Paragraph position="3"> 3. for each pair of words (x,y) in the count, calculate phraseness froma11a13a12a15a14a17a16a19a18a21a20 fg and a11a13a12a15a14a22a20 fga11a17a12a15a18a23a20 fg and informativeness from a11a17a12a15a14a17a16a24a18a21a20 fg and a11a17a12a15a14a17a16a24a18a21a20 bg. Add the two score values as the unified score.</Paragraph>
      <Paragraph position="4">  4. sort the results by the unified score.</Paragraph>
      <Paragraph position="5"> a10 Output: a list of key-bigrams ranked by unified score.  For this experiment we used unsmoothed count for calculating phraseness a11a29a9a54a43 a13a45a44a59a15 a21 a51a53a9a54a43 a13a45a44a59a15 a23 a95 ,</Paragraph>
      <Paragraph position="7"> a51a60a9a54a43 a13a45a44a55a15 , and used the unigram model for calculating informativeness with Katz smoothing (Chen and Goodman, 1996)4 to handle zero occurrences. null Figure 5 shows the extracted key-bigrams using this method. Comparing to Figure 2, you can see that those two methods extract almost identical ranked phrases. Note that we needed to tune three parameters to combine phraseness and informativeness in BLRT, but no parameter tuning was required in this method.</Paragraph>
      <Paragraph position="8"> The reason why &amp;quot;message news&amp;quot; becomes the top phrase in both methods is that it appears frequently enough in message citation headers such  as John Smith a0 js@foo.coma1 wrote in message news:1pk0a@foo.com, which was not common in the 20 newsgroup dataset.5 A more sophisticated document analysis tool to remove citation headers is required to improve the quality further.</Paragraph>
      <Paragraph position="9"> Figure 6 shows the distribution of phraseness and informativeness scores of bigrams extracted using the BLRT and pointwise KL methods. One can see that there is little correlation between phraseness and informativeness in both ranking methods. Also note that the range of x and y axis is very different in BLRT, but in the pointwise KL method they are comparable ranges. That makes combining two scores easy in the pointwise KL approach.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.3 Ranking n-length phrases
</SectionTitle>
      <Paragraph position="0"> The next example is ranking a3 -length phrases. We applied a phrase extension algorithm based on the APriori algorithm (Agrawal and Srikant, 1994) to the output of the key-bigram finder in the previous example to generate a3 -length candidates whose frequency is greater than 5, then applied a linguistic filter which rejects phrases that do not occur in valid noun-phrase contexts (e.g. following articles or possessives) at least once in the corpus. We ranked resulting phrases using pointwise KL score, using the  extracted from the same movie corpus. We can see that bigrams and trigrams are interleaved in natural order (although not many long phrases are extracted from the dataset, since longer NP did not occur more than five times). Figure 1 was another example of the result of the same pipeline of methods.</Paragraph>
      <Paragraph position="1">  One question that might be asked is &amp;quot;what if we just sort by frequency?&amp;quot;. If we sort by frequency, &amp;quot;blair witch project&amp;quot; is 92nd and &amp;quot;empire strikes back&amp;quot; is 110th on the ranked list. Since the longer the phrase becomes, the lower the frequency of the phrase is, frequency is not an appropriate method for ranking phrases.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.4 Revisiting unigram informativeness
</SectionTitle>
      <Paragraph position="0"> An alternative approach to calculate informativeness from the foreground LM and the background LM is just to take the ratio of likelihood scores, a11 fga9a54a86 a15 a23 a11 bga9a54a86 a15 . This is a smoothed version of relative frequency ratio which is commonly used to find subject-specific terms (Damerau, 1993).</Paragraph>
      <Paragraph position="1"> Figure 8 compares extracted keywords ranked with pointwise KL and likelihood ratio scores, both of which use the same foreground and background unigram language model. We used messages retrieved from the query Infiniti G35 as the foreground corpus and the same 20 newsgroup data as the background corpus. Katz smoothing is applied to both language models.</Paragraph>
      <Paragraph position="2"> As we can see, those two methods return very different ranked lists. We think the pointwise KL returns a set of keywords closer to human judgment.</Paragraph>
      <Paragraph position="3"> One example is the word &amp;quot;infiniti&amp;quot;, which we expected to be one of the informative words since it is the query word. The pointwise KL score picked the word as the third informative word, but the likelihood score missed it. Whereas &amp;quot;6mt&amp;quot;, picked up by the likelihood ratio, which occurs 37 times in the  likelihood ratio (after stopwords removed) from messages retrieved from the query &amp;quot;Infiniti G35&amp;quot; foreground corpus and none in the background corpus does not seem to be a good keyword.</Paragraph>
      <Paragraph position="4"> The following table shows statistics of those two  Since the likelihood of &amp;quot;6mt&amp;quot; with respect to the background LM is so small, the likelihood ratio of the word becomes very large. But the pointwise KL score discounts the score appropriately by consider6&amp;quot;infiniti&amp;quot; occurs 34 times in the &amp;quot;rec.autos&amp;quot; section of the 20 newsgroup data set.</Paragraph>
      <Paragraph position="5"> ing that the frequency of the word is low. Likelihood ratio (or relative frequency ratio) has a tendency to pick up rare words as informative. Pointwise KL seems more robust in sparse data situations.</Paragraph>
      <Paragraph position="6"> One disadvantage of the pointwise KL statistic might be that it also picks up stopwords or punctuation, when there is a significant difference in style of writing, etc., since these words have significantly high frequency. But stopwords are easy to define or can be generated automatically from corpora, and we don't consider this to be a significant drawback.</Paragraph>
      <Paragraph position="7"> We also expect a better background model and better smoothing mechanism could reduce the necessity of the stopword list.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Discussion
</SectionTitle>
    <Paragraph position="0"> Necessity of both phraseness and informativeness Although phraseness itself is domain-dependent to some extent (Smadja, 1994), we have shown that there is little correlation between informativeness and phraseness scores.</Paragraph>
    <Paragraph position="1"> Combining method One way to calculate a combined score is directly comparing a7 a1 a2fg and a7 a1 a1bg in Figure 3. We have tried both approaches and got a better result from combining separate phraseness and informativeness scores. We think this is due to data sparseness of the higher order ngram in the background corpus. Further investigation is required to make a conclusion.</Paragraph>
    <Paragraph position="2"> We have used the simplest method of combining two scores by adding them. We have also tried harmonic mean and geometric mean but they did not improve the result. We could also apply linear interpolation to put more weight on one score value, or use an exponential model to combine score, but this will require tuning parameters.</Paragraph>
    <Paragraph position="3"> Benefits of using a language model One benefit of using a language model approach is that one can take advantage of various smoothing techniques.</Paragraph>
    <Paragraph position="4"> For example, by interpolating with a character-based n-gram model, we can make the LM more robust with respect to spelling errors and variations. Consider the following variations, which we need to treat as a single entity: al-Qaida, al Qaida, al Qaeda, al Queda, al-Qaeda, al-Qa'ida, al Qa'ida (found in online sources). Since these are such unique spellings in English, character n-gram is expected to be able to give enough likelihood score to different spellings as well.</Paragraph>
    <Paragraph position="5"> It is also easy to incorporate other models such as topic or discourse model, use a cache LM to capture local context, and a class-based LM for the shared concept. It is also possible to add a phrase length prior probability in the model for better likelihood estimation.</Paragraph>
    <Paragraph position="6"> Another useful smoothing technique is linear interpolation of the foreground and background language models, when the foreground and background corpus are disjoint.</Paragraph>
  </Section>
class="xml-element"></Paper>