Quantitative Portraits of Lexical Elements
Kyo Kageura
Human and Social Information Research Division
National Institute of Informatics,
2-1-2 Hitotsubashi, Chiyoda-ku,
Tokyo, 101-8430, Japan
kyo@nii.ac.jp
Abstract
This paper clarifies the basic concepts and theoretical perspectives
by and from which quantitative “weighting” of lexical elements is
defined, and then draws quantitative portraits of a few lexical
elements in order to exemplify the relevance of the concepts and
perspectives examined.
1 Introduction
Since Luhn’s pioneering work (Luhn, 1958) in automatic term
weighting, many methods have been proposed in the fields of IR
(e.g. Sparck-Jones, 1973; Harter, 1975) and NLP (e.g. Church and
Hanks, 1990). Some “standard” methods of term weighting such as
tfidf have been established (Aizawa, 2003; [author name in Chinese],
1999) and the application range has widened; term weighting has
become a mature technology.
Despite this, what has been technically proposed
has not been examined from a theoretical point
of view, i.e. what kind of weighting scheme re-
flects what kind of lexical nature within what kind
of framework of interpretations in language. We
will clarify this and then illustrate the relevance of
this clarification by drawing quantitative portraits of
some lexical items using the quantitative measures.
2 Texts and lexica
Automatic term weighting starts from
texts/documents. To what spheres the weights
are attributed can differ. Figure 1 shows the lin-
guistic spheres of lexica and texts (Kageura, 2002);
there are both concrete data spheres and abstract
spheres on both the lexical and textual sides.
Within this scheme, three types of relations be-
tween lexica and texts can be identified: concrete
terms attributed to concrete texts, concrete terms
corresponding to discourse, and abstract lexica cor-
responding to abstract discourse. We will show be-
low that three major types of automatic term weight-
ing methods correspond to these three types of rela-
tions between lexica and texts.
[Figure 1 diagram: a set of actual texts (targets of IR) within the textual sphere / theoretical sphere of discourse; terms as attributes of a concrete set of documents within the lexicological sphere / theoretical sphere of lexica.]
Figure 1: Textual sphere and lexicological sphere.
3 Methods of term weighting
3.1 Tfidf
Tfidf is defined as:
$$tfidf = f_{t_i} \log \frac{N}{N_i} \qquad (1)$$
where $f_{t_i}$ is the total frequency of a term $t_i$, $N$ is the
total number of documents, and $N_i$ is the number of documents in
which the term $t_i$ occurs.
Aizawa (2003) has shown that this can be derived from an information
theoretic measure. Let $\mathcal{D}$ and $\mathcal{T}$ be random
variables defined over events in a set of documents
$D = \{d_1, d_2, \ldots, d_j, \ldots, d_N\}$ and a set of different
terms $T = \{t_1, t_2, \ldots, t_i, \ldots, t_M\}$ in $D$. Let
$f_{ij}$ denote the frequency of $t_i$ in $d_j$, $f_{t_i}$ the total
frequency of $t_i$, $f_{d_j}$ the total number of running terms in
$d_j$, and $F$ the total number of term tokens in $D$. The “weight”
of a term $t_i$ can be given by:
$$I(t_i; \mathcal{D}) = P(t_i)\, KL\big(P(\mathcal{D} \mid t_i) \,\|\, P(\mathcal{D})\big) = P(t_i) \sum_{d_j \in D} P(d_j \mid t_i) \log \frac{P(d_j \mid t_i)}{P(d_j)}$$
Giving probabilities by relative frequencies, and assuming that all
the documents have equal size and the frequency of $t_i$ in the
documents that contain $t_i$ is equal, this measure becomes tfidf
(up to the constant factor $1/F$); tfidf has an information
theoretic meaning within the given set of documents (Figure 2).
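The reduction can be checked numerically. The following is a minimal sketch, not Aizawa's own formulation: it computes the information theoretic weight from relative frequencies and compares it with tfidf divided by the total number of tokens. The function names and the toy counts are hypothetical and chosen only so that the two simplifying assumptions (equal document size, equal frequency in the documents containing the term) hold.

```python
import math


def tfidf(total_freq, n_docs, doc_freq):
    """Equation (1): f_{t_i} * log(N / N_i)."""
    return total_freq * math.log(n_docs / doc_freq)


def info_weight(term_counts, doc_sizes):
    """P(t_i) * sum_j P(d_j | t_i) * log(P(d_j | t_i) / P(d_j)).

    term_counts[j] is the frequency of the term in document d_j;
    doc_sizes[j] is the number of running terms in d_j.
    Probabilities are given by relative frequencies.
    """
    total_tokens = sum(doc_sizes)
    term_total = sum(term_counts)
    p_term = term_total / total_tokens
    divergence = 0.0
    for f_ij, size in zip(term_counts, doc_sizes):
        if f_ij == 0:
            continue  # a zero-probability document contributes nothing
        p_doc_given_term = f_ij / term_total
        p_doc = size / total_tokens
        divergence += p_doc_given_term * math.log(p_doc_given_term / p_doc)
    return p_term * divergence


# Hypothetical toy data: 8 documents of 100 tokens each; the term occurs
# 5 times in each of 2 documents and nowhere else.
doc_sizes = [100] * 8
term_counts = [5, 5, 0, 0, 0, 0, 0, 0]

weight = info_weight(term_counts, doc_sizes)
scaled_tfidf = tfidf(sum(term_counts), len(doc_sizes), 2) / sum(doc_sizes)
```

Under these assumptions `weight` and `scaled_tfidf` coincide, so ranking terms by the information theoretic weight and ranking them by tfidf give the same order.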
3.2 Term representativeness
Hisamitsu et al. (2000) proposed a measure of
“term representativeness”, in order to overcome the
CompuTerm 2004 Poster Session  -  3rd International Workshop on Computational Terminology 75
[Figure 2 diagram: a set of actual texts (targets of IR), with terms as attributes of a concrete set of documents.]
Figure 2: The position of tfidf.
[Figure 3 diagram: a set of actual texts (a manifestation of discourse) within the textual sphere / theoretical sphere of discourse, with terms as attributes of the theoretical discourse represented by the given set of documents.]
Figure 3: The position of term representativeness.
excessive sensitivity of weighting measures to token frequencies.
They hypothesised that, if a term $t$ is representative, $D_t$ (the
set of all documents containing $t$) has some specific
characteristic. They define a measure which calculates the distance
between a distributional characteristic of words around $t$ and the
same distributional characteristic in the whole document set.
In order to remove the factor of data size dependency, Hisamitsu et
al. (2000) define the “baseline function,” which indicates the
distance between the distribution of words in the original document
set and the distribution of words in randomly selected document
subsets for each size. The distance between the distribution of
words in the original document set and the distribution of words in
the documents which accompany the focal term $t$ is normalised by
the “baseline function.”
Formally,
$$Rep(t) = \frac{Dist(P_t, P)}{Dist(P_{R_t}, P)} \qquad (2)$$
where $D$ denotes the set of all documents; $P$ the distribution of
words in $D$; $t$ a focal term; $D_t$ the set of all documents
containing $t$; $P_t$ the distribution of words in $D_t$; $P_{R_t}$
the distribution of words in randomly selected documents whose size
equals that of $D_t$; and $Dist(P_i, P_j)$ the distance between two
distributions of words $P_i$ and $P_j$. Log-likelihood ratio was
used to measure the distance.
This measure observes the centripetal force of a term vis-à-vis
discourse, i.e. it captures the characteristic of terms in the
general discourse as represented by the given set of documents
(Figure 3).
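The normalisation by a random baseline can be illustrated with a minimal sketch. It is not the implementation of Hisamitsu et al.: the particular form of the log-likelihood ratio, the use of the number of documents as the “size” of a subset, the Monte Carlo estimate of the baseline, and all names and toy documents are assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def word_counts(docs):
    """Pool the word frequencies of a list of tokenised documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    return counts


def llr_distance(sub_counts, all_counts):
    """Log-likelihood ratio of a subset's own word distribution against
    the distribution of the whole set (an assumed form of Dist)."""
    n_sub = sum(sub_counts.values())
    n_all = sum(all_counts.values())
    return sum(
        k * math.log((k / n_sub) / (all_counts[w] / n_all))
        for w, k in sub_counts.items()
    )


def representativeness(term, docs, trials=200, seed=0):
    """Distance of the documents containing `term` from the whole set,
    divided by the mean distance of random subsets with the same
    number of documents (assumed stand-in for the baseline function)."""
    all_counts = word_counts(docs)
    docs_with_term = [doc for doc in docs if term in doc]
    if not docs_with_term:
        raise ValueError(f"{term!r} does not occur in the document set")
    dist_term = llr_distance(word_counts(docs_with_term), all_counts)
    rng = random.Random(seed)
    baseline = sum(
        llr_distance(word_counts(rng.sample(docs, len(docs_with_term))), all_counts)
        for _ in range(trials)
    ) / trials
    return dist_term / baseline if baseline > 0 else math.nan


# Hypothetical toy corpus: two topics of four documents each; "system"
# is spread over both topics, "learning" is confined to one of them.
topic_a = ["learning", "reward", "policy", "agent"]
topic_b = ["logic", "proof", "axiom", "theorem"]
docs = []
for i in range(4):
    extra = ["system"] if i < 3 else []
    docs.append(topic_a + extra)
    docs.append(topic_b + extra)

rep_learning = representativeness("learning", docs)
rep_system = representativeness("system", docs)
```

In this toy corpus the topic-bound element receives a clearly higher value than the element that is spread evenly over both topics, which is the behaviour the measure is meant to capture.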
3.3 Lexical productivity
Nakagawa (2000) incorporates a factor of lexical
productivity of constituent elements of compound
[Figure 4 diagram: a set of actual texts (a ladder to be discarded) within the textual sphere / theoretical sphere of discourse; terms as an attribute of the autonomous lexicological sphere within the lexicological sphere / theoretical sphere of lexica.]
Figure 4: The position of lexical productivity.
units for complex term extraction. The method observes in how many
different compounds an element $t_i$ is used in a given document set
(let us denote this as $d(i, N)$, where $N$ indicates the size of
the overall document set as counted by the number of word tokens),
and uses that in the weighting of compounds containing $t_i$, by
taking a weighted average. By explicitly limiting the syntagmatic
range of observation of cooccurrence to the unit of compounds, he
focused on the lexical productivity as manifested in texts.
This measure depends on the token occurrence, but we can also think
of the theoretical lexical productivity in the lexicological sphere:
how many compounds $t_i$ can potentially make (let us denote this by
$d(i)$). For that, it is necessary to remove the factor of token
occurrence. This can be done by:
$$d(i) = d(i, \lambda N) \quad (\lambda \to \infty).$$
This has so far been unexplored. Potential lexical productivity of
an element can be estimated from textual data: Letting $p_{t_i}$ be
the occurrence probability of $t_i$ in texts, $f(i, N)$ be the token
occurrence of $t_i$ in texts, and $\Omega_i$ be the sample space
$\{i_1, i_2, i_3, \ldots, i_{d(i)}\}$ of the distribution of
compounds (and the simplex word) that contain $t_i$, with
probability $p_{i_k}$ given to each compound $i_k$, and assuming the
combination of binomial distributions, we have:
$$E[f(i, N)] = p_{t_i} \cdot N$$
$$E[d(i, N)] = \sum_{m=1}^{p_{t_i} \cdot N} \sum_{k=1}^{d(i)} \binom{p_{t_i} \cdot N}{m} p_{i_k}^{m} (1 - p_{i_k})^{p_{t_i} \cdot N - m}.$$
What is given in the data is the empirical value for $d(i, N)$, with
the empirical distributions of what actually occurs in the document
set among $\Omega_i$. $d(i)$ can be estimated by LNRE methods
(Baayen, 2001).
Being a measure representing the potential power of a lexical
element $t_i$ for constructing compounds, $d(i)$ indicates the
lexical productivity in the lexicological sphere, which corresponds
to the theoretical sphere of discourse as represented by the given
document set (Figure 4).
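The relation between the observed $d(i, N)$ and the potential $d(i)$ can be illustrated with a minimal sketch under a binomial sampling assumption: among $n$ tokens of an element, a compound with probability $p$ is seen at least once with probability $1 - (1 - p)^n$, which is the same as summing the binomial probabilities of seeing it $m = 1, \ldots, n$ times. The Zipf-like probabilities, the potential of 500 compounds and the occurrence probability 0.005 are hypothetical. In real data the probabilities are unknown, which is why LNRE methods are needed; this sketch does not implement them.

```python
import math


def expected_types(n_tokens, compound_probs):
    """Expected number of distinct compounds among n_tokens tokens of
    the element: sum over compounds of 1 - (1 - p)^n."""
    return sum(1.0 - (1.0 - p) ** n_tokens for p in compound_probs)


def expected_types_binomial(n_tokens, compound_probs):
    """The same expectation written as a sum of binomial probabilities
    of each compound occurring m = 1 .. n_tokens times."""
    return sum(
        math.comb(n_tokens, m) * p ** m * (1.0 - p) ** (n_tokens - m)
        for p in compound_probs
        for m in range(1, n_tokens + 1)
    )


def zipf_probs(n_types):
    """Hypothetical Zipf-like probabilities over n_types compounds."""
    weights = [1.0 / rank for rank in range(1, n_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]


potential = 500      # hypothetical d(i)
p_element = 0.005    # hypothetical occurrence probability of the element
probs = zipf_probs(potential)

# Expected d(i, N) for growing corpus sizes N (in word tokens).
growth = [
    (n, expected_types(p_element * n, probs))
    for n in (10**4, 10**5, 10**6, 10**7, 10**9)
]
```

The values in `growth` increase with the corpus size and approach the potential number of compounds from below, which is the sense in which $d(i)$ removes the factor of token occurrence from $d(i, N)$.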
[Figure 5 plot: the six lexical elements positioned by tfidf and term representativeness.]
Figure 5: tfidf and term representativeness.
4 Portraits of lexical elements
As the three different measures capture three different aspects of
lexical elements, they are not in competition with one another (see
footnote 1). We here use these measures to illustrate
characteristics of a few lexical elements.
We used the NII morphologically tagged corpus for observation (Okada
et al., 2001), which consists of Japanese abstracts in the field of
artificial intelligence. Table 1 shows the basic quantitative
information.
No. of abstracts   Word tokens (simplex/compound)   Word types (simplex/compound)
1816               299846/230708                    8764/23243
Table 1: The basic data for NII corpus.
We chose the six most frequently occurring nominal elements for
observation, i.e. システム (system), 知識 (knowledge), 学習
(learning), 問題 (problem), モデル (model), and 情報 (information).
Intuitively, “system” and “model” are rather general with respect to
the domain of artificial intelligence, “knowledge” and “learning”
are domain specific, and “information” and “problem” are in between.
Table 2 shows the basic quantitative information for
these six lexical elements.
Figure 5 plots tfidf and term representativeness for the six
elements. Table 3 shows the estimated value of lexical productivity.
Element       p      d(i)
system        0.96   273402688337
knowledge     0.88   689
learning      0.39   2251563675
problem       0.70   1951
model         0.47   3676671255
information   0.84   667
Table 3: Lexical productivity for the six elements.
Figure 5 shows that “learning” and “knowledge”, intuitively the
domain-dependent elements, take high tfidf values, while
“information” takes the lowest value. Term representativeness gives
“learning” a high value but the value of “knowledge” is much lower,
and about the same as that of “information”. Interestingly, the
lexical productivities of “knowledge” and “information” are also
very close to each other.
Footnote 1: It is thus simplistic to evaluate which measures work
better in an application, unless the conceptual status of the
applications is sufficiently clarified.
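The tfidf ordering can be retraced from the tabulated data. The sketch below is a hedged illustration, not the computation behind Figure 5: it assumes that the TF and DF columns of Table 2 and the 1816 abstracts of Table 1 are the quantities entering Equation (1), and that the natural logarithm is used (the base does not affect the ranking). The variable names are hypothetical.

```python
import math

N_ABSTRACTS = 1816  # number of abstracts, from Table 1

# (TF, DF) for the six elements, from Table 2
TABLE_2 = {
    "system": (2659, 989),
    "knowledge": (2183, 669),
    "learning": (1776, 462),
    "problem": (1758, 660),
    "model": (1480, 550),
    "information": (1038, 460),
}

# Equation (1): TF * log(N / DF)
tfidf_values = {
    element: tf * math.log(N_ABSTRACTS / df)
    for element, (tf, df) in TABLE_2.items()
}
ranking = sorted(tfidf_values, key=tfidf_values.get, reverse=True)
```

Under these assumptions `ranking` places “learning” and “knowledge” first and “information” last, below “model” and “system”, which agrees with the ordering described in the text.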
It is possible to infer from these values of term representativeness
and lexical productivity that both “information” and “knowledge” do
not, within the discourse of artificial intelligence, have high
centripetal value, as both are rather “base” concepts of the domain.
If we observe Table 2, “knowledge” is more often used as it is,
while “information” tends to occur as compounds. From this we might
be able to hypothesise that “knowledge” is in itself the “base”
concept of artificial intelligence while “information” becomes the
“base” concept in combination with other lexical items. This fits
our intuition, as “information” in itself is more a “base” concept
of information and computer science, which is a broader domain of
which artificial intelligence is a subdomain. The low tfidf value of
“information” comes from the low token frequency coupled with
relatively high DF, which shows that “information”, as long as it is
used, tends to scatter across documents. This is in accordance with
the interpretation that “information” tends to occur in compounds.
Still, however, it is difficult to interpret sensibly the fact that
the tfidf value of “information” is lower than those of “model” and
“system”. Perhaps it is more sensible to interpret tfidf among
elements which take values of term representativeness higher than a
certain threshold. Then we can say that “learning” and “knowledge”
represent concepts more “central” to the domain of artificial
intelligence than “information”.
The element “learning”, which takes the highest values both in tfidf
and in term representativeness, is conspicuous in its lexical
productivity. Compared with “knowledge”, whose tfidf value is also
high, and with the three elements “problem”, “information” and
“knowledge”, whose term representativeness values are relatively
high, the lexical productivity of “learning” is of the order of a
million times higher (and similar to that of “model” or “system”).
Table 2 shows that “learning” does not occur much as it is, nor does
it occur much as the head of compounds. This indicates that
“learning” represents an important concept of the given data and in
the discourse of artificial intelligence, but only “indirectly”, in
combination with other elements in compounds, to which “learning”
tends to contribute as a modifier rather than as a head.
The two “general” lexical elements, i.e. “model”
Element      TF    DF   Comp(A)  Comp(H)  Simp  d(i,N)(A)  d(i,N)(H)
system       2659  989  1922     1247     737   937        502
knowledge    2183  669  1399     443      784   424        137
learning     1776  462  1513     208      263   375        73
problem      1758  660  1197     558      561   334        152
model        1480  550  1144     687      343   447        263
information  1038  460  656      268      382   207        155
Note: Comp(A) indicates the number of compounds that contain the lexical element; Comp(H) indicates the number of compounds that contain the lexical element as the head; d(i,N)(A) indicates the number of different compounds (plus one simplex) that contain the lexical element; d(i,N)(H) indicates the number of different compounds (plus one simplex) that contain the lexical element as the head.
Table 2: The basic data for the six lexical elements.
and “system”, take low term representativeness values (see footnote
2). This is in accordance with our intuition. The lexical
productivity of these two elements is extremely high (practically
infinite). This indicates that these two elements can be widely used
in varieties of discoursal contexts, without in themselves
contributing much to consolidating the content of discourse. This
fits nicely with our intuitive interpretation of the meanings of
these two elements, i.e. they are orthogonal to such
domain-dependent elements as “knowledge” or “learning”.
This leaves us with the final element “problem”. The value of term
representativeness is high, second only to “learning” and in between
“learning” and “information”/“knowledge”. The lexical productivity
is much closer to “information” and “knowledge” than to the other
three. As such, “problem” can be interpreted as a kind of “base”
concept, though it retains stronger centripetal force than
“information” and “knowledge”. If we ignore the tfidf values of
“model” and “system” and only compare “information”, “problem”,
“learning” and “knowledge”, it is also sensible to see that
“problem” represents a concept more central to the domain than
“information” but less central than “learning” and “knowledge”.
5 Conclusions
We have shown that different term weighting measures have different
spheres of interpretation; on the basis of that we have illustrated
that the combination of these can illustrate complex aspects of
lexical nature. Though it can be argued that the present study does
not show ways for applications nor “empirical” evaluations within
applications, we believe that “empirical” evaluations should be
properly founded on the framework of interpretation in order for the
results to be generalised in a scientific way; the history of
science has shown that reliance on “empirical” evaluations often
correlates with the lack of theory or scientific wholesomeness.
Footnote 2: This is in accordance with the observation by Hisamitsu
et al. (2000), which says that the measure of term
representativeness is particularly useful to exclude general
elements.

References
Akiko N. Aizawa. 2003. An information-theoretic
perspective of tf-idf measures. Information Pro-
cessing and Management, 39(1): 45–65.
Harald Baayen. 2001. Word Frequency Distribu-
tions. Dordrecht: Kluwer.
Kenneth W. Church and Patrick Hanks. 1990. Word
association norms, mutual information and lexi-
cography. Computational Linguistics, 16(1): 22–
29.
S. P. Harter. 1975. A probabilistic approach to au-
tomatic keyword indexing. Journal of the Ameri-
can Society for Information Science, 26(4): 197–
206.
Toru Hisamitsu et al. 2000. A method of measuring
term representativeness. COLING 2000,
320–326.
Kyo Kageura. 2002. The Dynamics of Terminology.
Amsterdam: John Benjamins.
Hans P. Luhn. 1958. The automatic creation of lit-
erature abstracts. IBM Journal of Research and
Development, 2(2): 159–165.
Hiroshi Nakagawa. 2000. Automatic term recogni-
tion based on statistics of compound nouns. Ter-
minology, 6(2): 195–210.
Maho Okada et al. 2001. Defining principled but
practically manageable lexical units in Japanese
textual corpora. NLPRS’01 Workshop on Language
Resources in Asia, 47–53.
Karen Sparck-Jones. 1973. Index term weighting.
Information Storage and Retrieval, 9(11): 619–
633.
In Chinese. 1999. In Chinese.
