A Language Model Approach to Keyphrase Extraction
Takashi Tomokiyo and Matthew Hurst
Applied Research Center
Intelliseek, Inc.
Pittsburgh, PA 15213
{ttomokiyo,mhurst}@intelliseek.com
Abstract
We present a new approach to extract-
ing keyphrases based on statistical lan-
guage models. Our approach is to use
pointwise KL-divergence between mul-
tiple language models for scoring both
phraseness and informativeness, which
can be unified into a single score to rank
extracted phrases.
1 Introduction
In many real world deployments of text mining tech-
nologies, analysts are required to deal with large col-
lections of documents from unfamiliar domains. Fa-
miliarity with the domain is necessary in order to
get full leverage from text analysis tools. However,
browsing data is not an efficient way to get an un-
derstanding of the topics and events which are par-
ticular to a domain.
For example, an analyst concerned with the area
of hybrid cars may harvest messages from online fo-
rums. They may then want to rapidly construct a hi-
erarchy of topics based on the content of these mes-
sages. In addition, in cases where these messages
are harvested via a search of some sort, there is a re-
quirement to obtain a rich and effective set of search
terms.
The technology described in this paper is an ex-
ample of a phrase finder capable of delivering a set
of indicative phrases given a particular set of docu-
ments from a target domain.
In the hybrid car example, the result of this pro-
cess is a set of phrases like that shown in Figure 1.
1 civic hybrid
2 honda civic hybrid
3 toyota prius
4 electric motor
5 honda civic
6 fuel cell
7 hybrid cars
8 honda insight
9 battery pack
10 sports car
11 civic si
12 hybrid car
13 civic lx
14 focus fcv
15 fuel cells
16 hybrid vehicles
17 tour de sol
18 years ago
19 daily driver
20 jetta tdi
21 mustang gt
22 ford escape
23 steering wheel
24 toyota prius today
25 electric motors
26 gasoline engine
27 internal combustion engine
28 gas engine
29 front wheels
30 key sense wire
31 civic type r
32 test drive
33 street race
34 united states
35 hybrid powertrain
36 rear bumper
37 ford focus
38 detroit auto show
39 parking lot
40 rear wheels
Figure 1: Top 40 keyphrases automatically extracted from
messages relevant to “civic hybrid” using our system
In order to capture domain-specific terms efficiently
in limited time, the extraction result should be
ranked with the more indicative and well-formed
phrases first, as shown in this example.
2 Phraseness and informativeness
The word keyphrase implies two features: phrase-
ness and informativeness.
Phraseness is a somewhat abstract notion which
describes the degree to which a given word sequence
is considered to be a phrase. In general, phraseness
is defined by the user, who has his own criteria for
the target application. For instance, one user might
want only noun phrases while another user might be
interested only in phrases describing a certain set of
products. Although there is no single definition of
the term phrase, in this paper, we focus on colloca-
tion or cohesion of consecutive words.
Informativeness refers to how well a phrase cap-
tures or illustrates the key ideas in a set of docu-
ments. Because informativeness is defined with re-
spect to background information and new knowl-
edge, users will have different perceptions of infor-
mativeness. In our calculations, we make use of
the relationship between foreground and background
corpora to formalize the notion of informativeness.
The target document set from which representa-
tive keyphrases are extracted is called the foreground
corpus. The document set to which this target set
is compared is called the background corpus. For
example, a foreground corpus of the current week’s
news would be compared to a background corpus of
an entire news article archive to determine that cer-
tain phrases, like “press conference” are typical of
news stories in general and do not capture the par-
ticulars of current events in the way that “national
museum of antiquities” does.
Other examples of foreground and background
corpora include: a web site for a certain company
and web data in general; a newsgroup and the whole
Usenet archive; and research papers of a certain con-
ference and research papers in general.
In order to get a ranked keyphrase list, we need to
combine both phraseness and informativeness into a
single score. A sequence of words can be a good
phrase but not an informative one, like the expres-
sion “in spite of.” A word sequence can be informa-
tive for a particular domain but not a phrase; “toy-
ota, honda, ford” is an example of a non-phrase se-
quence of informative words in a hybrid car domain.
The algorithm we propose for keyphrase finding re-
quires that the keyphrase score well for both phrase-
ness and informativeness.
3 Related work
Word collocation Various collocation metrics
have been proposed, including mean and variance
(Smadja, 1994), the t-test (Church et al., 1991),
the chi-square test, pointwise mutual information
(MI) (Church and Hanks, 1990), and binomial log-
likelihood ratio test (BLRT) (Dunning, 1993).
According to Manning and Schütze (1999), BLRT is
one of the most stable methods for collocation
discovery. Pantel and Lin (2001) report, however,
that the BLRT score can also be high for two
frequent terms that are rarely adjacent, such as the
word pair “the the,” and use a hybrid of MI and
BLRT.
Keyphrase extraction Damerau (1993) uses the
relative frequency ratio between two corpora to ex-
tract domain-specific keyphrases. One problem of
using relative frequency is that it tends to assign too
high a score for words whose frequency in the back-
ground corpus is small (or even zero).
Some work has been done in extracting
keyphrases from technical documents treating
keyphrase extraction as a supervised learning
problem (Frank et al., 1999; Turney, 2000). The
portability of a learned classifier across various
unstructured/structured text is not clear, however,
and the agreement between classifier and human
judges is not high.¹
We would like to have the ability to extract
keyphrases from a totally new domain of text with-
out building a training corpus.
Combining keyphrase and collocation Yamamoto
and Church (2001) compared two metrics,
MI and Residual IDF (RIDF), and observed that
MI is suitable for finding collocations and RIDF
is suitable for finding informative phrases. They
took the intersection of each top 10% of phrases
identified by MI and RIDF, but did not extend
the approach to combining the two metrics into a
unified score.
4 Baseline method based on binomial
log-likelihood ratio test
We can use various statistics as a measure for
phraseness and informativeness. For our baseline,
we have selected the method based on binomial log-
likelihood ratio test (BLRT) described in (Dunning,
1993).
The basic idea of using BLRT for text analysis is
to consider a word sequence as a repeated sequence
of binary trials comparing each word in a corpus to
a target word, and use the likelihood ratio of two
hypotheses that (i) two events, observed $k_1$ times out
of $n_1$ total tokens and $k_2$ times out of $n_2$ total tokens
respectively, are drawn from different distributions
and (ii) from the same distribution.
¹ E.g., Turney reports 62% “good”, 18% “bad”, and 20% “no
opinion” from human judges.
The BLRT score is calculated with
\[
2 \log \frac{L(p_1, k_1, n_1)\, L(p_2, k_2, n_2)}{L(p, k_1, n_1)\, L(p, k_2, n_2)}
\qquad (1)
\]
where $p_i = k_i / n_i$, $p = (k_1 + k_2) / (n_1 + n_2)$, and
\[
L(p, k, n) = p^k (1 - p)^{n - k}
\qquad (2)
\]
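As an illustration, Equations (1) and (2) can be sketched in Python as follows. This is a minimal sketch rather than the implementation used in the experiments: the function names are hypothetical, the computation is done in log space to avoid underflow, and the convention 0 log 0 = 0 is assumed.

```python
import math

def log_likelihood(p: float, k: int, n: int) -> float:
    """log L(p, k, n) = k log p + (n - k) log(1 - p), taking 0 * log 0 as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total

def blrt(k1: int, n1: int, k2: int, n2: int) -> float:
    """Equation (1): twice the log of the likelihood ratio (n1, n2 must be > 0)."""
    p1 = k1 / n1
    p2 = k2 / n2
    p = (k1 + k2) / (n1 + n2)  # pooled estimate under the null hypothesis
    return 2.0 * (
        log_likelihood(p1, k1, n1) + log_likelihood(p2, k2, n2)
        - log_likelihood(p, k1, n1) - log_likelihood(p, k2, n2)
    )

# Identical proportions give a score of (numerically) zero;
# very different proportions give a large positive score.
print(blrt(10, 100, 20, 200))
print(blrt(50, 100, 5, 100))
```

Note that swapping the two samples leaves the score unchanged, since the formula treats both samples alike.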
In the case of calculating the phraseness score
of an adjacent word pair $(x, y)$, the null hypothesis
is that $x$ and $y$ are independent, which can be
expressed as $p(y \mid x) = p(y \mid \neg x)$. We can use Equation
(1) to calculate phraseness by setting:
\[
\begin{aligned}
k_1 &= C(x, y), \\
n_1 &= C(x), \\
k_2 &= C(\neg x, y) = C(y) - C(x, y), \\
n_2 &= C(\neg x) = \sum_w C(w) - C(x)
\end{aligned}
\qquad (3)
\]
where $C(x)$ is the frequency of the word $x$ and
$C(x, y)$ is the frequency of $y$ following $x$.
For calculating informativeness of a word $w$,
\[
\begin{aligned}
k_1 &= C_{fg}(w), \\
n_1 &= \sum_w C_{fg}(w), \\
k_2 &= C_{bg}(w), \\
n_2 &= \sum_w C_{bg}(w)
\end{aligned}
\qquad (4)
\]
where $C_{fg}(w)$ and $C_{bg}(w)$ are the frequency of $w$ in
the foreground and background corpus, respectively.
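The count settings in Equations (3) and (4) could be derived from tokenized corpora roughly as follows. This is a sketch under the assumption that each corpus is a flat list of tokens; the helper names are hypothetical and each function returns the tuple (k1, n1, k2, n2) expected by Equation (1).

```python
from collections import Counter

def phraseness_args(tokens, x, y):
    """Equation (3): counts for testing whether y depends on a preceding x."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    c_xy = bigrams[(x, y)]
    k1, n1 = c_xy, unigrams[x]
    k2 = unigrams[y] - c_xy                    # y not preceded by x
    n2 = sum(unigrams.values()) - unigrams[x]  # all tokens other than x
    return k1, n1, k2, n2

def informativeness_args(fg_tokens, bg_tokens, w):
    """Equation (4): counts of w in the foreground and background corpora."""
    fg, bg = Counter(fg_tokens), Counter(bg_tokens)
    return fg[w], sum(fg.values()), bg[w], sum(bg.values())

tokens = "star wars is a star of star wars".split()
print(phraseness_args(tokens, "star", "wars"))
```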
Combining a phraseness score $\varphi_p$ and an
informativeness score $\varphi_i$ into a single score value is not a
trivial task since the BLRT scores vary a lot between
phraseness and informativeness and also depending
on data (cf. Figure 6 (a)).
One way to combine those scores is to use an
exponential model. We experimented with the following
logistic function:
\[
\varphi = \frac{1}{1 + \exp(-a \varphi_p - b \varphi_i + c)}
\qquad (5)
\]
whose parameters $a$, $b$, and $c$ are estimated on a
held-out data set, given feedback from users (i.e.
supervised).
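Equation (5) is straightforward to express in code. The sketch below uses a hypothetical function name, and the example parameter values are the ones quoted in the caption of Figure 2; the input scores are made-up numbers chosen only to show the scale of typical BLRT values.

```python
import math

def logistic_combine(phi_p: float, phi_i: float, a: float, b: float, c: float) -> float:
    """Equation (5): squash a weighted sum of the two BLRT scores into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a * phi_p - b * phi_i + c))

# Parameter values from the Figure 2 caption; the two scores are illustrative.
score = logistic_combine(phi_p=15000.0, phi_i=400000.0, a=0.0003, b=0.000005, c=8)
print(score)
```

The very different magnitudes of a and b reflect the very different ranges of the two BLRT scores, which is part of why the parameters need tuning.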
Figure 2 shows some example phrases extracted
with this method from the data set described in
Section 6.1, where the parameters $a$, $b$, and $c$ are manually
optimized on the test data.
Although it is possible to rank keyphrases using
this approach, there are a couple of drawbacks.
1 message news
2 minority report
3 star wars
4 john harkness
5 derek janssen
6 robert frenchu
7 sean o’hara
8 box office
9 dawn taylor
10 anthony gaza
11 star trek
12 ancient race
13 scooby doo
14 austin powers
15 home.attbi.com hey
16 sixth sense
17 hey kids
18 gaza man
19 lee harrison
20 years ago
21 julia roberts
22 national guard
23 bourne identity
24 metrotoday www.zap2it.com
25 starweek magazine
26 eric chomko
27 wilner starweek
28 tim gueguen
29 jodie foster
30 johnnie kendricks
Figure 2: Keyphrases extracted with BLRT (a=0.0003,
b=0.000005, c=8)
Necessity of tuning parameters: The existence of
parameters in the combining function requires
human labeling, which is sometimes an expensive
task, and the robustness of the learned
weights across domains is unknown. We would
like to have a parameter-free and robust way of
combining scores.
Inappropriate symmetry: BLRT tests whether two
random variables are independent or not. This
sometimes leads to unwanted phrases getting a
high score. For example, when the background
corpus happens to have many occurrences of the
phrase al jazeera, which is an unusual phrase in
the foreground corpus, the phrase still gets a
high informativeness score because the
distributions are so different. What we would like to
have instead is an asymmetric scoring function that
tests the loss incurred by not taking the target
phrase as a keyphrase.
In the next section, we propose a new method try-
ing to address these issues.
5 Proposed method
5.1 Language models and expected loss
A language model assigns a probability value to
every sequence of words $\mathbf{w} = w_1 w_2 \ldots w_n$. The
probability $P(\mathbf{w})$ can be decomposed as
\[
P(\mathbf{w}) = \prod_{i=1}^{n} P(w_i \mid w_1 w_2 \ldots w_{i-1})
\]
Assuming $w_i$ only depends on the previous $N-1$
words, N-gram language models are commonly
used. The following is the trigram language model
case.
\[
P(\mathbf{w}) = \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})
\]
Here each word only depends on the previous two
words. Please refer to (Jelinek, 1990) and (Chen and
Goodman, 1996) for more about N-gram models and
associated smoothing methods.
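To make the trigram decomposition concrete, here is a minimal sketch of an unsmoothed maximum-likelihood trigram model. The function names and the `<s>` padding symbol are assumptions made for illustration; a practical model would add one of the smoothing methods cited above.

```python
from collections import Counter

def train_trigram(tokens):
    """Collect trigram counts and the counts of their two-word histories."""
    padded = ["<s>", "<s>"] + list(tokens)
    trigrams = Counter(zip(padded, padded[1:], padded[2:]))
    histories = Counter()
    for (u, v, _), count in trigrams.items():
        histories[(u, v)] += count
    return trigrams, histories

def sequence_probability(tokens, trigrams, histories):
    """P(w) as the product of P(w_i | w_{i-2}, w_{i-1}), without smoothing."""
    padded = ["<s>", "<s>"] + list(tokens)
    prob = 1.0
    for u, v, w in zip(padded, padded[1:], padded[2:]):
        if histories[(u, v)] == 0:
            return 0.0  # unseen history: the unsmoothed model assigns zero
        prob *= trigrams[(u, v, w)] / histories[(u, v)]
    return prob

trigrams, histories = train_trigram("a b a b a c".split())
print(sequence_probability("a b a c".split(), trigrams, histories))
```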
Now suppose we have a foreground corpus and
a background corpus and have created a language
model for each corpus. The simplest language
model is a unigram model, which assumes each
word of a given word sequence is drawn
independently. We denote the unigram model for the
foreground corpus as $LM^1_{fg}$ and for the background
corpus as $LM^1_{bg}$. We can also train higher order models
$LM^N_{fg}$ and $LM^N_{bg}$ for each corpus, each of which is
an $N$-gram model, where $N$ ($> 1$) is the order.
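[Figure 3 diagram: the four models arranged in a grid, with $LM^N_{fg}$ and
$LM^N_{bg}$ in the top row and $LM^1_{fg}$ and $LM^1_{bg}$ in the bottom row. The
vertical direction (N-gram versus unigram) is labeled phraseness and the
horizontal direction (foreground versus background) is labeled informativeness.]
Figure 3: Phraseness and informativeness as loss between
language models.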
Among those four models, $LM^N_{fg}$ will be the best
model to describe the foreground corpus in the sense
that it has the smallest cross-entropy or perplexity
value over the corpus.
If we use one of the other three models instead,
then we have some inefficiency or loss to describe
the corpus. We expect the amount of loss between
using $LM^N_{fg}$ and $LM^1_{fg}$ is related to phraseness and
the loss between $LM^N_{fg}$ and $LM^N_{bg}$ is related to
informativeness. Figure 3 illustrates these
relationships.
5.2 Pointwise KL-divergence between models
One natural metric to measure the loss between two
language models is the Kullback-Leibler (KL)
divergence. The KL divergence (also called relative
entropy) between two probability mass functions $p(x)$
and $q(x)$ is defined as
\[
D(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}
\qquad (6)
\]
KL divergence is “a measure of the inefficiency
of assuming that the distribution is $q$ when the true
distribution is $p$.” (Cover and Thomas, 1991)
You can see this by the following relationship:
\[
\begin{aligned}
D(p \,\|\, q) &= \sum_x p(x) \log p(x) - \sum_x p(x) \log q(x) \\
              &= \sum_x p(x) \log \frac{1}{q(x)} - H(X)
\end{aligned}
\]
The first term $\sum_x p(x) \log \frac{1}{q(x)}$ is the cross entropy
and the second term $H(X)$ is the entropy of the
random variable $X$, which is how much we could
compress symbols if we know the true distribution $p$.
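The relationship between KL divergence, cross entropy, and entropy can be checked numerically on a toy example. In this sketch the two distributions are made up for illustration, and base-2 logarithms are an assumption (the identity holds for any base).

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2(p(x) / q(x))."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

def cross_entropy(p, q):
    """sum_x p(x) log2(1 / q(x))."""
    return sum(px * math.log2(1.0 / q[x]) for x, px in p.items() if px > 0)

def entropy(p):
    """H(X) = sum_x p(x) log2(1 / p(x))."""
    return sum(px * math.log2(1.0 / px) for px in p.values() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}   # "true" distribution
q = {"a": 0.25, "b": 0.25, "c": 0.5}   # assumed distribution

print(kl_divergence(p, q))                # 0.25
print(cross_entropy(p, q) - entropy(p))   # 1.75 - 1.5 = 0.25
```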
We define pointwise KL divergence $\delta_{\mathbf{w}}(p \,\|\, q)$ to
be the term inside of the summation of Equation (6):
\[
\delta_{\mathbf{w}}(p \,\|\, q) \overset{\mathrm{def}}{=} p(\mathbf{w}) \log \frac{p(\mathbf{w})}{q(\mathbf{w})}
\qquad (7)
\]
Intuitively, this is the contribution of the phrase $\mathbf{w}$ to
the expected loss of the entire distribution.
We can now quantify phraseness and informativeness
as follows:
Phraseness of $\mathbf{w}$ is how much we lose information
by assuming independence of each word by
applying the unigram model, instead of the
$N$-gram model.
\[
\delta_{\mathbf{w}}(LM^N_{fg} \,\|\, LM^1_{fg})
\qquad (8)
\]
Informativeness of $\mathbf{w}$ is how much we lose
information by assuming the phrase is drawn
from the background model instead of the
foreground model.
\[
\delta_{\mathbf{w}}(LM^N_{fg} \,\|\, LM^N_{bg}), \text{ or}
\qquad (9)
\]
\[
\delta_{\mathbf{w}}(LM^1_{fg} \,\|\, LM^1_{bg})
\qquad (10)
\]
Combined The following is considered to be a
mixture of phraseness and informativeness.
\[
\delta_{\mathbf{w}}(LM^N_{fg} \,\|\, LM^1_{bg})
\qquad (11)
\]
Note that the KL divergence is always
non-negative,² but the pointwise KL divergence can be
a negative value. An example is the phraseness of
the bigram “the the”.
\[
p(\text{the}, \text{the}) \log \frac{p(\text{the}, \text{the})}{p(\text{the})\, p(\text{the})} < 0
\]
since $p(\text{the}, \text{the}) \ll p(\text{the})\, p(\text{the})$.
Also note that in the case of phraseness of a
bigram, the equation looks similar to pointwise mutual
information (Church and Hanks, 1990), but they are
different. Their relationship is as follows.
\[
\delta_{\mathbf{w}}(p(x, y) \,\|\, p(x)\, p(y)) = p(x, y)\, \underbrace{\log \frac{p(x, y)}{p(x)\, p(y)}}_{\text{pointwise MI}}
\]
The pointwise KL divergence does not assign a high
score to a rare phrase, whose contribution of loss is
small by definition, unlike pointwise mutual
information, which is known to have problems (as
described, e.g., in Manning and Schütze (1999)).
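The difference between the two measures is easy to see on made-up numbers. The sketch below compares a frequent collocation, a rare pair whose words only ever occur together, and a "the the"-style pair of frequent words that are rarely adjacent; all probabilities are illustrative assumptions, not values from the corpus, and base-2 logarithms are assumed.

```python
import math

def pointwise_mi(p_xy: float, p_x: float, p_y: float) -> float:
    """log2 of the ratio between joint probability and independence."""
    return math.log2(p_xy / (p_x * p_y))

def pointwise_kl_phraseness(p_xy: float, p_x: float, p_y: float) -> float:
    """Pointwise KL: the pointwise MI weighted by the joint probability."""
    return p_xy * pointwise_mi(p_xy, p_x, p_y)

frequent = (1e-2, 2e-2, 2e-2)   # common collocation
rare = (1e-6, 1e-6, 1e-6)       # rare pair, words always co-occur
the_the = (1e-6, 5e-2, 5e-2)    # frequent words that are rarely adjacent

for name, probs in [("frequent", frequent), ("rare", rare), ("the the", the_the)]:
    print(name, pointwise_mi(*probs), pointwise_kl_phraseness(*probs))
```

With these numbers pointwise MI ranks the rare pair above the frequent collocation, while pointwise KL reverses the order and gives the "the the"-style pair a negative score.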
5.3 Combining phraseness and informativeness
One way of getting a unified score of phraseness and
informativeness is using equation (11). We can also
calculate phraseness and informativeness separately
and then combine them.
We combine the phraseness score $\varphi_p$ and
informativeness score $\varphi_i$ by simply adding them into a
single score $\varphi$.
\[
\varphi = \varphi_p + \varphi_i
\qquad (12)
\]
Intuitively, this can be thought of as the total loss.
We will show some empirical results to justify this
scoring in the next section.
6 Experimental results
In this section, we show some preliminary experi-
mental results of applying our method on real data.
6.1 Data set
We used the 20 newsgroups data set,³ which
contains 20,000 messages (7.4 million words)
between February and June 1993 taken from 20
Usenet newsgroups, as the background data set,
and another 20,000 messages (4 million words)
between June and September 2002 taken from the
rec.arts.movies.current-films newsgroup as
the foreground data set. Each message’s subject
header and the body of the message (including
quoted text) are tokenized into lowercase tokens in
both data sets. No stemming is applied.
² From Jensen’s inequality.
³ http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html
6.2 Finding key-bigrams
The first experiment we show is to find key-bigrams,
which is the simplest case requiring combination
of phraseness and informativeness scores. Figure 4
outlines the extraction procedure.
• Inputs: foreground and background corpus.
1. Create the background language model from the
background corpus.
2. Count all adjacent word pairs in the foreground
corpus, skipping pre-annotated boundaries (such as
HTML tag boundaries) and stopwords.
3. For each pair of words (x, y) in the count, calculate
phraseness from $p(x, y)_{fg}$ and $p(x)_{fg}\, p(y)_{fg}$, and
informativeness from $p(x, y)_{fg}$ and $p(x, y)_{bg}$. Add
the two score values as the unified score.
4. Sort the results by the unified score.
• Output: a list of key-bigrams ranked by unified score.
Figure 4: Procedure to find key-bigrams
For this experiment we used unsmoothed counts
for calculating phraseness, $p(x, y) = C(x, y) / N$ and
$p(w) = C(w) / N$, where $N = \sum_x C(x) = \sum_{x,y} C(x, y)$,
and used the unigram model for
calculating informativeness with Katz smoothing
(Chen and Goodman, 1996)⁴ to handle zero
occurrences.
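The procedure of Figure 4 can be sketched end to end on a toy corpus. This is an illustrative sketch, not the system used in the experiments: the function name is hypothetical, add-one smoothing stands in for Katz smoothing, base-2 logarithms are assumed, and informativeness is taken to be the unigram pointwise KL of Equation (10) with the pair probability written as the product of its word probabilities.

```python
import math
from collections import Counter

def key_bigrams(fg_tokens, bg_tokens, stopwords=frozenset()):
    """Rank adjacent foreground word pairs by phraseness + informativeness."""
    fg_counts = Counter(fg_tokens)
    bg_counts = Counter(bg_tokens)
    n = len(fg_tokens)

    # Step 1: background unigram model. Add-one smoothing is an assumption,
    # a simple stand-in for the Katz smoothing mentioned in the text.
    vocab_size = len(set(fg_counts) | set(bg_counts))
    bg_total = sum(bg_counts.values()) + vocab_size

    def p_bg(w):
        return (bg_counts[w] + 1) / bg_total

    def p_fg(w):
        return fg_counts[w] / n  # unsmoothed foreground estimate

    # Step 2: count adjacent pairs, skipping pairs that contain a stopword.
    pair_counts = Counter(
        (x, y) for x, y in zip(fg_tokens, fg_tokens[1:])
        if x not in stopwords and y not in stopwords
    )

    # Step 3: score each pair with pointwise KL and add the two scores.
    scored = []
    for (x, y), count in pair_counts.items():
        p_xy = count / n
        independent = p_fg(x) * p_fg(y)
        phraseness = p_xy * math.log2(p_xy / independent)
        informativeness = independent * math.log2(independent / (p_bg(x) * p_bg(y)))
        scored.append(((x, y), phraseness + informativeness))

    # Step 4: sort by the unified score, best first.
    return sorted(scored, key=lambda item: item[1], reverse=True)

fg = "the minority report was great and minority report won".split()
bg = "the weather was great and the game was great".split()
for pair, score in key_bigrams(fg, bg, stopwords={"the", "was", "and"}):
    print(pair, round(score, 4))
```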
Figure 5 shows the extracted key-bigrams us-
ing this method. Comparing to Figure 2, you can
see that those two methods extract almost identical
ranked phrases. Note that we needed to tune three
parameters to combine phraseness and informative-
ness in BLRT, but no parameter tuning was required
in this method.
The reason why “message news” becomes the
top phrase in both methods is that it appears fre-
quently enough in message citation headers such
4with cutoff
a27a29a28a31a30
1 message news
2 minority report
3 star wars
4 john harkness
5 robert frenchu
6 derek janssen
7 box office
8 sean o’hara
9 dawn taylor
10 anthony gaza
11 star trek
12 ancient race
13 home.attbi.com hey
14 scooby doo
15 austin powers
16 hey kids
17 years ago
18 gaza man
19 sixth sense
20 lee harrison
21 julia roberts
22 national guard
23 bourne identity
24 metrotoday www.zap2it.com
25 starweek magazine
26 eric chomko
27 wilner starweek
28 tim gueguen
29 jodie foster
30 kevin filmnutboy
Figure 5: Key-bigrams extracted with pointwise KL
as John Smith <js@foo.com> wrote in message
news:1pk0a@foo.com, which was not common in
the 20 newsgroup dataset.⁵ A more sophisticated
document analysis tool to remove citation headers
is required to improve the quality further.
Figure 6 shows the distribution of phraseness and
informativeness scores of bigrams extracted using
the BLRT and pointwise KL methods. One can
see that there is little correlation between phraseness
and informativeness in both ranking methods. Also
note that the ranges of the x and y axes are very
different in BLRT, but in the pointwise KL method they
are comparable. That makes combining the two
scores easy in the pointwise KL approach.
6.3 Ranking n-length phrases
The next example is ranking $n$-length phrases. We
applied a phrase extension algorithm based on the
APriori algorithm (Agrawal and Srikant, 1994) to
the output of the key-bigram finder in the previous
example to generate $n$-length candidates whose
frequency is greater than 5, then applied a linguistic
filter which rejects phrases that do not occur in valid
noun-phrase contexts (e.g. following articles or
possessives) at least once in the corpus. We ranked the
resulting phrases using the pointwise KL score, using the
same smoothing method as in the bigram case.
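The text does not spell out the phrase extension algorithm, so the following is only one plausible Apriori-style sketch of the candidate generation step. The function name, the `max_len` cap, and the pruning rule (an n-gram is counted only if both of its (n-1)-gram sub-phrases survived the previous level) are assumptions, and the linguistic filter is not modeled.

```python
from collections import Counter

def extend_phrases(tokens, seed_bigrams, min_freq=5, max_len=4):
    """Grow longer candidates level by level from a set of seed bigrams."""
    frequent = {2: set(seed_bigrams)}
    for n in range(3, max_len + 1):
        previous = frequent[n - 1]
        counts = Counter()
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            # Apriori pruning: both (n-1)-gram sub-phrases must have survived.
            if gram[:-1] in previous and gram[1:] in previous:
                counts[gram] += 1
        # Keep candidates whose frequency is greater than the threshold.
        frequent[n] = {g for g, c in counts.items() if c > min_freq}
        if not frequent[n]:
            break
    return frequent

tokens = ("air national guard base " * 6).split()
seeds = {("air", "national"), ("national", "guard"), ("guard", "base")}
print(extend_phrases(tokens, seeds))
```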
Figure 7 shows the result of re-ranking keyphrases
extracted from the same movie corpus. We can see
that bigrams and trigrams are interleaved in natural
order (although not many long phrases are extracted
from the dataset, since longer NP did not occur more
than five times). Figure 1 was another example of
the result of the same pipeline of methods.
⁵ A popular citation pattern in 1993 was “In article
<1pk0a@foo.com>, js@foo.com (John Smith) writes:”
One question that might be asked is “what if we
just sort by frequency?”. If we sort by frequency,
“blair witch project” is 92nd and “empire strikes
back” is 110th on the ranked list. Since the longer
the phrase becomes, the lower the frequency of the
phrase is, frequency is not an appropriate method for
ranking phrases.
1 minority report
2 box office
3 scooby doo
4 sixth sense
5 national guard
6 bourne identity
7 air national guard
8 united states
9 phantom menace
10 special effects
11 hotel room
12 comic book
13 blair witch project
14 short story
15 real life
16 jude law
17 iron giant
18 bin laden
19 black people
20 opening weekend
21 bad guy
22 country bears
23 man’s man
24 long time
25 spoiler space
26 empire strikes back
27 top ten
28 politically correct
29 white people
30 tv show
31 bad guys
32 freddie prinze jr
33 monster’s ball
34 good thing
35 evil minions
36 big screen
37 political correctness
38 martial arts
39 supreme court
40 beautiful mind
Figure 7: Result of re-ranking output from the phrase exten-
sion module
6.4 Revisiting unigram informativeness
An alternative approach to calculate informativeness
from the foreground LM and the background
LM is just to take the ratio of likelihood scores,
$p_{fg}(w) / p_{bg}(w)$. This is a smoothed version of the
relative frequency ratio which is commonly used to find
subject-specific terms (Damerau, 1993).
Figure 8 compares extracted keywords ranked
with pointwise KL and likelihood ratio scores, both
of which use the same foreground and background
unigram language model. We used messages re-
trieved from the query Infiniti G35 as the foreground
corpus and the same 20 newsgroup data as the back-
ground corpus. Katz smoothing is applied to both
language models.
As we can see, those two methods return very dif-
ferent ranked lists. We think the pointwise KL re-
turns a set of keywords closer to human judgment.
One example is the word “infiniti”, which we ex-
pected to be one of the informative words since it
is the query word. The pointwise KL score picked
the word as the third informative word, but the like-
lihood score missed it. Whereas “6mt”, picked up
by the likelihood ratio, which occurs 37 times in the
[Figure 6 plots: two scatter plots with axes phraseness and informativeness.
(a) BLRT: tick ranges 0 to 20,000 and 0 to 800,000.
(b) LM-pointKL: tick ranges -0.0005 to 0.002 and -0.001 to 0.002.]
Figure 6: Phraseness and informativeness score of bigrams extracted with BLRT (a) and pointwise KL divergence between LMs
(b).
        pointwise KL          likelihood ratio
rank    freq   term           freq   term
1       1599   g35            1599   g35
2       1145   car            156    330i
3       450    infiniti       117    350z
4       299    coupe          113    doo
5       299    nissan         90     wrx
6       383    bmw            76     is300
7       156    330i           47     willow
8       441    cars           39     rsx
9       248    sedan          37     6mt
10      331    originally     35     scooby
11      201    altima         35     s2000
12      117    350z           33     gt-r
13      113    doo            32     lol
14      235    sport          30     heatwave
15      172    maxima         28     g22
16      90     wrx            26     gtr
17      111    skyline        23     g21
18      76     is300          23     g17
19      186    honda          23     nsx
20      221    engine         22     tl-s
Figure 8: Top 20 keywords extracted using pointwise-KL and
likelihood ratio (after stopwords removed) from messages
retrieved from the query “Infiniti G35”
foreground corpus and none in the background cor-
pus does not seem to be a good keyword.
The following table shows statistics of those two
words:⁶
token      p_fg(w)     p_bg(w)     PKL      LR
6mt        1.837E-4    8.705E-8    0.0020   2110
infiniti   2.269E-3    4.475E-6    0.0204   506
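The PKL and LR columns of the table above can be recomputed from the two probability columns. In this sketch the function names are hypothetical, and the base-2 logarithm is an inference: it is the base that reproduces the tabulated PKL values. The likelihood ratios come out slightly different from the table because the tabulated probabilities are rounded.

```python
import math

def pointwise_kl(p_fg: float, p_bg: float) -> float:
    """Pointwise KL divergence of a word under the fg and bg unigram models."""
    return p_fg * math.log2(p_fg / p_bg)

def likelihood_ratio(p_fg: float, p_bg: float) -> float:
    """Ratio of foreground to background likelihood."""
    return p_fg / p_bg

table = {
    "6mt": (1.837e-4, 8.705e-8),
    "infiniti": (2.269e-3, 4.475e-6),
}
for token, (p_fg, p_bg) in table.items():
    print(token, round(pointwise_kl(p_fg, p_bg), 4), round(likelihood_ratio(p_fg, p_bg)))
```

The ratio favors "6mt" by roughly a factor of four, while the pointwise KL score favors "infiniti" by roughly a factor of ten, because the log ratio is weighted by the foreground probability.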
Since the likelihood of “6mt” with respect to the
background LM is so small, the likelihood ratio of
the word becomes very large. But the pointwise KL
score discounts the score appropriately by considering
that the frequency of the word is low. Likelihood
ratio (or relative frequency ratio) has a tendency to
pick up rare words as informative. Pointwise KL
seems more robust in sparse data situations.
⁶ “infiniti” occurs 34 times in the “rec.autos” section of the
20 newsgroup data set.
One disadvantage of the pointwise KL statistic
might be that it also picks up stopwords or punctu-
ation, when there is a significant difference in style
of writing, etc., since these words have significantly
high frequency. But stopwords are easy to define
or can be generated automatically from corpora, and
we don’t consider this to be a significant drawback.
We also expect a better background model and better
smoothing mechanism could reduce the necessity of
the stopword list.
7 Discussion
Necessity of both phraseness and informativeness
Although phraseness itself is domain-dependent to
some extent (Smadja, 1994), we have shown that
there is little correlation between informativeness
and phraseness scores.
Combining method One way to calculate a combined
score is directly comparing $LM^N_{fg}$ and $LM^1_{bg}$
in Figure 3. We have tried both approaches and got
a better result from combining separate phraseness
and informativeness scores. We think this is due
to data sparseness of the higher order n-gram in the
background corpus. Further investigation is required
to draw a conclusion.
We have used the simplest method of combining
two scores by adding them. We have also tried har-
monic mean and geometric mean but they did not
improve the result. We could also apply linear inter-
polation to put more weight on one score value, or
use an exponential model to combine score, but this
will require tuning parameters.
Benefits of using a language model One bene-
fit of using a language model approach is that one
can take advantage of various smoothing techniques.
For example, by interpolating with a character-based
n-gram model, we can make the LM more robust
with respect to spelling errors and variations. Con-
sider the following variations, which we need to treat
as a single entity: al-Qaida, al Qaida, al Qaeda,
al Queda, al-Qaeda, al-Qa’ida, al Qa’ida (found
in online sources). Since these are such unique
spellings in English, character n-gram is expected to
be able to give enough likelihood score to different
spellings as well.
It is also easy to incorporate other models such as
topic or discourse model, use a cache LM to capture
local context, and a class-based LM for the shared
concept. It is also possible to add a phrase length
prior probability in the model for better likelihood
estimation.
Another useful smoothing technique is linear in-
terpolation of the foreground and background lan-
guage models, when the foreground and background
corpus are disjoint.
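One possible reading of this interpolation is to mix the foreground model into the background model so that words seen only in the foreground never receive zero background probability. The sketch below follows that reading; the function name, the mixing weight, and the toy distributions are assumptions made for illustration.

```python
def interpolate(p_main, p_other, lam=0.5):
    """Linear interpolation lam * p_main + (1 - lam) * p_other over the joint vocabulary."""
    vocab = set(p_main) | set(p_other)
    return {
        w: lam * p_main.get(w, 0.0) + (1.0 - lam) * p_other.get(w, 0.0)
        for w in vocab
    }

p_fg = {"g35": 0.5, "car": 0.5}    # toy foreground unigram model
p_bg = {"car": 0.25, "the": 0.75}  # toy background model that never saw "g35"

smoothed_bg = interpolate(p_bg, p_fg, lam=0.5)
print(smoothed_bg["g35"])  # nonzero, so the log ratio against p_fg is finite
```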
8 Conclusion
We have explained that phraseness and informativeness
should be unified into a single score to return
useful ranked keyphrases for analysts. Our proposed
approach calculates both scores based on language
models and unifies them into a single score. The phrases
generated by this method are intuitively very useful,
but the results are difficult to evaluate quantitatively.
In future work we would like to further explore
evaluation of keyphrases, as well as investigate dif-
ferent smoothing techniques. Further extensions in-
clude developing a phrase boundary segmentation
algorithm based on this framework and exploring
applicability to other languages.

References
Rakesh Agrawal and Ramakrishnan Srikant. 1994. Fast
algorithms for mining association rules. In Jorge B.
Bocca, Matthias Jarke, and Carlo Zaniolo, editors,
Proc. 20th Int. Conf. Very Large Data Bases, VLDB,
pages 487–499. Morgan Kaufmann, 12–15 .
Stanley F. Chen and Joshua T. Goodman. 1996. An
empirical study of smoothing techniques for language
modeling. In Proceedings of the 34th Annual Meeting
of the ACL, pages 310–318, Santa Cruz, California,
June.
Kenneth W. Church and Patrick Hanks. 1990. Word
association norms, mutual information, and lexicog-
raphy. In Computational Linguistics, volume 16.
K. Church, P. Hanks, D. Hindle, and W. Gale, 1991.
Using Statistics in Lexical Analysis, pages 115–164.
Lawrence Erlbaum.
Thomas M. Cover and Joy A. Thomas. 1991. Elements
of Information Theory. John Wiley.
Fred J. Damerau. 1993. Generating and evaluating
domain-oriented multi-word terms from texts. Infor-
mation Processing and Management, 29(4):433–447.
Ted E. Dunning. 1993. Accurate methods for the statis-
tics of surprise and coincidence. Computational Lin-
guistics, 19(1):61–74.
Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl
Gutwin, and Craig G. Nevill-Manning. 1999.
Domain-specific keyphrase extraction. In IJCAI,
pages 668–673.
Frederick Jelinek. 1990. Self-organized language
modeling for speech recognition. In Alex Waibel and
Kai-Fu Lee, editors, Readings in Speech Recognition,
pages 450–506. Morgan Kaufmann Publishers, Inc.,
San Mateo, California.
Christopher D. Manning and Hinrich Schütze. 1999.
Foundations of Statistical Natural Language
Processing. The MIT Press, Cambridge, Massachusetts.
Patrick Pantel and Dekang Lin. 2001. A statisti-
cal corpus-based term extractor. In E. Stroulia and
S. Matwin, editors, Lecture Notes in Artificial Intel-
ligence, pages 36–46. Springer-Verlag.
Frank Z. Smadja. 1994. Retrieving collocations from
text: Xtract. Computational Linguistics, 19(1):143–
177.
Peter D. Turney. 2000. Learning algorithms
for keyphrase extraction. Information Retrieval,
2(4):303–336.
Mikio Yamamoto and Kenneth W. Church. 2001. Us-
ing suffix arrays to compute term frequency and docu-
ment frequency for all substrings in a corpus. Compu-
tational Linguistics, 27(1):1–30.
