Word Extraction from Corpora and Its Part-of-Speech 
Estimation Using Distributional Analysis 
Shinsuke Mori and Makoto Nagao 
Dept. of Electrical Engineering, Kyoto University
Yoshida-honmachi, Sakyo, Kyoto, 606-01 Japan 
{mori, nagao}@kuee.kyoto-u.ac.jp
Abstract 
Unknown words are inevitable at any step of analysis in natural language
processing. We propose a method to extract words from a corpus and estimate
the probability that each word belongs to given parts of speech (POSs), using
distributional analysis. Our experiments have shown that this method is
effective for inferring the POS of unknown words.
1 Introduction 
Dictionaries are indispensable in NLP for determining the grammatical
functions and meanings of words, but the continuous increase of new words
and technical terms makes unknown words an ongoing problem. A good deal of
research has been directed to finding efficient and effective ways of
expanding the lexicon. With agglutinative languages like Japanese, the
problem is even greater, since even word boundaries are ambiguous. To solve
these problems, we propose a method that uses distributional analysis to
extract words from a corpus and estimate the probability distribution of
their use as different parts of speech.
Distributional analysis was originally proposed by Harris (1951), a
structural linguist, as a technique to uncover the structure of a language.
Harris intended it as a substitute for what he perceived as unscientific
information-gathering by linguists doing field work at that time. In this
approach, linguists determine whether two words belong to the same class by
observing the environments in which the words occur. Recently, this
technique has been mathematically refined and used to discover phrase
structure from a corpus annotated with POS tags (Brill and Marcus, 1992;
Mori and Nagao, 1995). Schütze (1995) used the technique to induce POSs.
However, in these studies, the problem of categorial ambiguity (the fact
that some words or POS sequences can belong to more than one category) has
been ignored.
In this paper, we propose a method that assumes that a word may belong to
more than one POS, and provides estimates of the relative probability that
it belongs to each of a number of POSs. Our method decomposes an observed
probability distribution into a linear summation of a given set of model
probability distributions. The resulting set of coefficients represents the
probability that the observed event belongs to each model event. The
application discussed here is word extraction from a Japanese corpus. First,
we calculate the model probability distribution of each POS by observing the
context of each occurrence in a tagged corpus. Then, for each unknown word,
we similarly calculate its environment by collecting all occurrences from a
raw corpus. Finally, we compute the probability distribution of POSs for a
word by comparing its observed environment with the model environments
represented by the set of POS distributions.
In subsequent sections, we first discuss the hypothesis, then describe the
algorithm, then present the results of experiments on the EDR corpus and
journal articles, and finally conclude.
2 Hypothesis 
In this section, we first define the environment of a string occurring in a
corpus. Next, we propose a hypothesis which provides the foundation for our
word extraction method.
2.1 Environment of a String in a Corpus 
We define the "environment" of a type (a character string, a group of
morphemes, or a POS) as the probability distribution of the elements
preceding and following occurrences of that type in a corpus. The elements
which precede the type are described by the left probability distribution,
and those which follow it, by the right probability distribution. For
instance, Table 1 shows the one-character environment of the string ".~-L"
in the EDR corpus (Jap, 1993). This string occurs 181 times, with 12
different characters appearing to its left and 10 to its right.
In general, a probability distribution can be regarded as a vector, and the
concatenation of two vectors is also a vector. Thus, we call the
concatenation of the left and right probability distributions for a type the
"environment" of that type, and we denote it by D in the rest of this paper.
Table 1: Environment of the string "~-1_." 
freq. prob. str. str. freq. prob. 
13 7.2% , :~b v,, 16 8.9% 
6 3.3% o < 3 1.6% 
13 7.2% ~ ~ 8 4.4% 
10 5.6% ~ ~: 10 5.6% 
8 4.4% <" J\[ 7 3.8% 
14 7.8% if- & 41 22.6% 
19 10.4% m ~ 38 21.0% 
4 2.2% ?& ab 16 8.9% 
7 3.8% ~ % 4 2.2% 
4 2.2% ¢0 L/ 38 21.0% 
83 45.9% 
181 100.0% total 181 100.0% 
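
The definition of a one-character environment can be illustrated with a minimal Python sketch, assuming the raw corpus is held in a single string. The function name `char_environment` and the boundary symbols "^" and "$" are illustrative assumptions and do not come from the original method.

```python
from collections import Counter

def char_environment(corpus: str, target: str):
    """Return (left, right) probability distributions over the single
    characters adjacent to each occurrence of `target` in `corpus`."""
    left, right = Counter(), Counter()
    occurrences = 0
    start = corpus.find(target)
    while start != -1:
        end = start + len(target)
        # "^" and "$" stand for the corpus boundaries (an assumption).
        left[corpus[start - 1] if start > 0 else "^"] += 1
        right[corpus[end] if end < len(corpus) else "$"] += 1
        occurrences += 1
        start = corpus.find(target, start + 1)  # overlapping matches count
    if occurrences == 0:
        return {}, {}
    return ({c: k / occurrences for c, k in left.items()},
            {c: k / occurrences for c, k in right.items()})

# Toy corpus: "ab" occurs three times.
left, right = char_environment("abcabdabc", "ab")
print(left)
print(right)
```

Concatenating the two dictionaries (with keys marked as left or right) gives the vector D used in the rest of the paper.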
2.2 Hypothesis Concerning Environment 
In general, if a string α is a word which belongs to a POS, it is expected
that the environment D(α) of the string in a particular corpus will be
similar to the environment D(pos) of that POS. Since a word can belong to
more than one POS, it is expected that the environment of the string will
be similar to the summation across all POSs of the environment of each POS
multiplied by the probability that the string occurs as that POS.
Therefore, we obtain the following formula:
$$D(\alpha) \approx \sum_k p(pos_k \mid \alpha)\, D(pos_k) \tag{1}$$
where p(pos_k|α) is the probability that the string α belongs to pos_k, and
D(pos_k) is the environment of pos_k. The summation is taken over the set
of POSs under consideration.
As an example, let us take the string "~.1_.", which is used in the corpus
only as a verb and an adjective. Writing α for this string, if p(Adj|α) and
p(Verb|α) are the probabilities that a particular instance of the string is
used as an adjective and as a verb respectively, then the environment of
the string is described by the following formula:
$$D(\alpha) \approx p(Adj \mid \alpha)\, D(Adj) + p(Verb \mid \alpha)\, D(Verb)$$
In most cases, however, formula (1) cannot be solved as a linear equation,
since the dimension of the probability distribution vector D is greater
than the number of independent variables. In addition, we need to minimize
the effects of sample bias inherent in statistical estimates of this sort.
We therefore seek the set of p(pos_k|α) which minimizes the difference
between the two sides of formula (1) in terms of some measure. As this
measure, we use the square of the Euclidean distance between the vectors.
The problem is thus formalized as a minimization problem. The decision
variables are the elements of the probability distribution vector p, which
expresses the likelihood that the string is used as each POS:
$$F(p) = \Bigl| D(\alpha) - \sum_k p_k\, D(pos_k) \Bigr|^2 \tag{2}$$
where p = (p_1, p_2, ..., p_n), p_k = p(pos_k|α), and n is the number of
POSs under consideration. Since each element of p represents a probability,
the feasible region V is given as follows:
$$V = \Bigl\{\, p \;\Bigm|\; 0 \le p_k \le 1,\ \sum_k p_k = 1 \,\Bigr\} \tag{3}$$
The minimum value of F(p) will be relatively small when the environment of
the string can be decomposed into a linear summation of some POS
environments, while it will be relatively large when such a decomposition
does not exist. Since all true words must belong to one or more POSs, the
minimum value of F(p) can be used to decide whether a string is a word or
not. We call this value the "word measure," and accept as words all strings
with a word measure less than a certain threshold.
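
To make the word measure concrete, the following Python sketch evaluates F(p) from formula (2) on toy vectors and finds its minimum for the special case of two POSs by a simple grid search over p = (t, 1 - t). The vectors, the function names, and the grid search itself are illustrative assumptions; they are not the optimization procedure or data used in the experiments.

```python
def objective(p, d_str, d_pos):
    """F(p) = |D(alpha) - sum_k p_k D(pos_k)|^2 on plain Python lists."""
    mix = [sum(pk * d[i] for pk, d in zip(p, d_pos))
           for i in range(len(d_str))]
    return sum((a - b) ** 2 for a, b in zip(d_str, mix))

def word_measure_two_pos(d_str, d_pos, steps=1000):
    """Minimize F over p = (t, 1 - t), t in [0, 1]; two POSs only."""
    best = min(range(steps + 1),
               key=lambda i: objective((i / steps, 1 - i / steps),
                                       d_str, d_pos))
    t = best / steps
    return objective((t, 1 - t), d_str, d_pos), (t, 1 - t)

# Hypothetical three-dimensional environments of two POSs.
d_adj = [0.8, 0.2, 0.0]
d_verb = [0.1, 0.1, 0.8]

# A string whose environment is an exact 30/70 mixture of the two POSs.
d_word = [0.3 * a + 0.7 * v for a, v in zip(d_adj, d_verb)]
# A string whose environment resembles neither POS.
d_junk = [0.0, 1.0, 0.0]

f_word, p_word = word_measure_two_pos(d_word, [d_adj, d_verb])
f_junk, p_junk = word_measure_two_pos(d_junk, [d_adj, d_verb])
print(f_word, p_word)
print(f_junk, p_junk)
```

The mixture yields a word measure close to zero and recovers the mixing weights, whereas the second string leaves a large residual and would be rejected by any small threshold.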
3 Algorithm 
In this section we describe the algorithm used to calculate the word
measure of an arbitrary string and the probabilities that the string
belongs to each of a set of POSs. We used observations from the EDR corpus,
which is divided into words and tagged as to POS, to calculate the POS
environments, and then used a raw corpus (no indication of word or morpheme
boundaries, and no POS tags) for calculating the string environments.
3.1 Calculating POS Environments 
The environment of each POS is obtained by cal- 
culating statistics on all contexts that precede and 
follow the POS in a tagged corpus, as follows: 
1. Let all elements of left and right probability 
vectors be 0. 
2. For each occurrence of the POS in the corpus, increment the left vector
element corresponding to the context preceding this occurrence of the POS,
and increment the right vector element corresponding to the context
following the POS.
3. Divide each vector element by the total number of occurrences of the POS.
Figure 1 shows a sample sentence from the EDR corpus, and Table 2 shows the
computation of the one-character environment of Noun in the tiny corpus
consisting of this single sentence.
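
The three steps can be sketched in Python as follows, assuming the tagged corpus is available as a list of (string, tag) pairs. The function name, the sentence-boundary symbols "<s>" and "</s>", and the romanized toy sentence are hypothetical and serve only as illustration; here the context is the neighbouring tagged string rather than a single character.

```python
from collections import Counter

def pos_environment(tagged, pos):
    """tagged: list of (string, tag) pairs in corpus order.
    Returns (left, right) distributions of the neighbouring strings."""
    left, right = Counter(), Counter()      # step 1: all elements are 0
    occurrences = 0
    for i, (_, tag) in enumerate(tagged):
        if tag != pos:
            continue
        occurrences += 1
        # step 2: count the preceding and following contexts
        left[tagged[i - 1][0] if i > 0 else "<s>"] += 1
        right[tagged[i + 1][0] if i + 1 < len(tagged) else "</s>"] += 1
    if occurrences == 0:
        return {}, {}
    # step 3: divide by the number of occurrences of the POS
    return ({s: c / occurrences for s, c in left.items()},
            {s: c / occurrences for s, c in right.items()})

# Hypothetical tagged sentence (romanized for readability).
sentence = [("neko", "noun"), ("ga", "pp"), ("sakana", "noun"),
            ("wo", "pp"), ("tabe", "verb"), ("ru", "infl"), (".", "sign")]
left, right = pos_environment(sentence, "noun")
print(left)
print(right)
```

Using dictionaries keyed by context string stores only the contexts that actually occur, which suits very sparse vectors.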
In practice, instead of a single character, we used as contexts the
preceding or following POS-tagged string (a morpheme or word). Thus the
probability vectors, which consisted of all the contexts found for any POS,
were so sparse that we used a hash algorithm.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
L:b,t , 5~H 03 ~IN tJ: ~iL w q~. --- © ~N ¢g k~ J- q~Y~@ tff o
conj. sign noun pp noun pp adj. infl. noun pp pp noun pp verb infl. noun aux. sign
Figure 1: An Example of EDR Corpus
Table 2: Environment of the Noun 
freq. prob. str. str. freq. prob. 
1 20% , noun td 1 20% 
1 20% m a.) 1 20% 
1 20% ~J- a 1 20% 
2 40% a) ~. 1 20% 
1 20% 
5 100% total 5 100% 
3.2 Calculating String Environments 
The calculation of the environment of an arbitrary string (possible word)
in a corpus is basically identical to the POS algorithm above, except that
the extent of the environment is ambiguous, because Japanese has no blank
space between words and a raw (unsegmented) corpus is used. There are two
ways to determine the extent of the left and right environment: one is to
specify a fixed number of characters, and the other is to use a
look-up-and-match procedure to identify specific morphemes. We adopted the
second method, and used as a morpheme lexicon the set of hash keys
representing the POS environments. Where there was a conflict between two
or more possible matches of a string context with the POS hash keys, the
longest match was selected. For instance, although a right context から
'kara' could match either the postposition 'ka' or the postposition 'kara',
the longer match 'kara' would always be chosen.
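
The longest-match rule for a right context might look like the following Python sketch. The function name `longest_match`, the tiny key set, and the romanized text are assumptions made for illustration; in the setting described above the keys would be the hash keys of the POS environments.

```python
def longest_match(text, pos, keys, max_len=None):
    """Return the longest key in `keys` that starts at text[pos],
    or None if no key matches there."""
    if max_len is None:
        max_len = max(map(len, keys), default=0)
    # Try the longest candidate first, so 'kara' wins over 'ka'.
    for length in range(min(max_len, len(text) - pos), 0, -1):
        candidate = text[pos:pos + length]
        if candidate in keys:
            return candidate
    return None

keys = {"ka", "kara", "wo"}          # hypothetical morpheme lexicon
text = "gakkoukarakaeru"             # romanized, unsegmented
print(longest_match(text, 6, keys))  # context right after "gakkou"
```

A left context can be handled symmetrically by testing substrings that end at the start of the candidate string.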
3.3 Optimization 
The environments of a string and of each POS it may represent become the
parameters of the objective function defined by formula (2), and the
optimization of this function then yields the probabilities that the string
belongs to each POS. The problem can be solved easily by the optimal
gradient method because both the objective function and the feasible region
are convex.
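
As a rough illustration of how such a convex problem can be solved, the sketch below minimizes formula (2) over the feasible region (3) with a fixed-step projected gradient iteration. This is a substitute chosen for simplicity and is not claimed to be the optimal gradient method used in the experiments; the function names, step-size rule, iteration count, and toy vectors are all assumptions.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {p : p_k >= 0, sum_k p_k = 1}."""
    u = sorted(v, reverse=True)
    running, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        running += u_j
        candidate = (running - 1.0) / j
        if u_j - candidate > 0:      # holds for a prefix of the sorted list
            theta = candidate
    return [max(x - theta, 0.0) for x in v]

def decompose(d_str, d_pos, iters=2000):
    """Return (p, F(p)) approximately minimizing formula (2) on V."""
    n, m = len(d_pos), len(d_str)
    p = [1.0 / n] * n                # start from the uniform distribution
    # 2 * (sum of squared entries) bounds the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * sum(x * x for d in d_pos for x in d))

    def residual(q):
        return [d_str[i] - sum(q[k] * d_pos[k][i] for k in range(n))
                for i in range(m)]

    for _ in range(iters):
        r = residual(p)
        grad = [-2.0 * sum(r[i] * d_pos[k][i] for i in range(m))
                for k in range(n)]
        p = project_to_simplex([p[k] - step * grad[k] for k in range(n)])
    return p, sum(x * x for x in residual(p))

# Hypothetical POS environments and a string that mixes the first and third.
d_pos = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
d_str = [0.31 * a + 0.69 * c for a, c in zip(d_pos[0], d_pos[2])]
p, f_min = decompose(d_str, d_pos)
print([round(x, 3) for x in p], f_min)
```

The returned p plays the role of the POS probabilities and the returned minimum plays the role of the word measure.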
4 Results 
We conducted two experiments, each using a range of different thresholds
for the word measure. One experiment used the EDR corpus as a raw corpus
(ignoring the POS tags) in order to calculate recall and precision. The
other experiment used articles from one year of the Japanese version of
Scientific American in order to test whether we could increase the accuracy
of the morphological analyzer (tagger) by this method.

Table 3: Recall and precision on EDR corpus
threshold        recall            precision
of F_min     tokens   types     tokens   types
0.10         26.2%    12.6%     96.8%    86.2%
0.15         46.4%    28.3%     94.0%    80.4%
0.20         61.8%    44.0%     90.7%    74.9%
0.25         73.2%    57.1%     87.4%    69.1%
0.30         79.8%    66.7%       -        -
0.35         84.4%    73.7%       -        -
4.1 Conditions of the Experiments
For both experiments, we considered the five POSs to which almost all
unknown words in Japanese belong:
1. verbal noun, e.g. 勉強(する) 'benkyou(suru)' "to study"
2. noun, e.g. 学校 'gakkou' "school"
3. ru-type verb, e.g. 食べ(る) 'tabe(ru)' "to eat"
4. i-type adjective, e.g. 寒(い) 'samu(i)' "cold"
5. na-type adjective, e.g. きれい(な) 'kirei(na)' "clean"
POS environments were defined as one POS-tagged string (assumed to be one
morpheme), and were limited to strings made up only of hiragana characters
plus comma and period. The aim of this limitation was to reduce
computation time during matching, and it was felt that morphemes using
kanji and katakana characters are too infrequent as contexts to exert much
influence on the results.
Candidates for unknown words were limited to strings of two or more
characters appearing in the corpus at least ten times and not containing
any symbols such as parentheses. Since very few unknown words consist of
only one character, this limitation will not have much effect on recall.
4.2 Experiment 1: Word Extraction 
For evaluation purposes, we conducted a word extraction experiment using
the EDR corpus as a raw corpus, and calculated recall and precision for
each threshold value (see Table 3). First, we calculated F_min and p for
all character n-grams, 2 < n < 20, excluding strings which consisted only
of hiragana characters. Then, for each threshold level, our algorithm
decided which of the candidate strings were words, and assigned a POS to
each instance of the word-strings.

Table 4: Examples of extracted words from "Science"
string          Action noun  Noun  Ru-type verb  I-type adj.  Na-type adj.  F_min  freq.
自然            0.00         0.31  0.00          0.00         0.69          0.04   115
J~Y'}           0.00         0.00  0.00          0.00         1.00          0.05   179
:-'~'~ ~ ~ *    0.00         1.00  0.00          0.00         0.00          0.08   103
MH C 3r)-f~ *   0.00         1.00  0.00          0.00         0.00          0.11   63
* means unknown word.
Recall was computed as the percentage of all POS-tagged strings in the EDR
corpus that were successfully identified by our algorithm as words and as
belonging to the correct POS. In calculating recall and precision, both the
POS and the string are distinguished. Precision was calculated using the
estimated frequency f(α, pos) = p(pos|α) · f(α), where f(α) is the
frequency of the string α in the corpus, and p(pos|α) is the estimated
probability that α belongs to pos. Whether the string α belongs to pos or
not was judged by hand. Recall was calculated only for pairs with an
estimated probability of 0.1 or more, because the amount of output is too
large to check by hand. For the same reason, we did not calculate precision
for thresholds above 0.25 in Table 3. The table shows that the lower the
threshold, the higher the precision. This result is consistent with the
hypothesis described in section 2.2. In addition, precision tends to rise
as frequency increases.
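
The role of the estimated frequency in the precision figures can be sketched as follows. The exact aggregation is not spelled out in the text, so this sketch assumes that token precision is the share of estimated frequency carried by pairs judged correct, and that type precision is the share of correct pairs; the function names and the judged data are hypothetical.

```python
def estimated_frequency(p_pos_given_str, freq):
    """f(alpha, pos) = p(pos | alpha) * f(alpha)."""
    return p_pos_given_str * freq

def precision(judged):
    """judged: list of (p(pos|alpha), f(alpha), is_correct) triples,
    one per accepted (string, POS) pair, labelled by hand.
    Returns (token_precision, type_precision) under the stated assumption."""
    if not judged:
        return 0.0, 0.0
    total = sum(estimated_frequency(p, f) for p, f, _ in judged)
    correct = sum(estimated_frequency(p, f) for p, f, ok in judged if ok)
    token_precision = correct / total if total else 0.0
    type_precision = sum(1 for _, _, ok in judged if ok) / len(judged)
    return token_precision, type_precision

# Hypothetical hand-judged output for three (string, POS) pairs.
judged = [(0.69, 115, True), (0.31, 115, True), (1.00, 40, False)]
print(precision(judged))
```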
4.3 Experiment 2: Improvement of 
Stochastic Tagging 
In order to test how much the accuracy of a tagger could be improved by
adding extracted words to its dictionary, we developed a tagger based on a
simple Markov model and analyzed one journal article¹. Using statistical
parameters estimated from the EDR corpus, and an unknown word model based
on character set heuristics (any kanji sequence is a noun, etc.), tagging
accuracy was 95.9% (the percentage of output morphemes which were correctly
segmented and tagged).
Next, we extracted words from the Japanese version of Scientific American
(1990; 617,837 characters) using a threshold of 0.25. Unknown words were
considered to be those which could not be divided into morphemes appearing
in the learning corpus of the Markov model. Table 4 shows examples of
extracted words, with unknown words starred. Notice that some extracted
words consist of more than one type of character, such as "タンパク質
(protein)." This is one of the advantages of our method over heuristics
based on character type, which can never recognize mixed-character words.
Another advantage is that our method is applicable to words belonging to
more than one POS. For example, in Table 4 "自然 (nature)" is both a noun
and the stem of a na-type adjective.

¹ "Progress in Gallium Arsenide Semiconductors" (Scientific American;
February, 1990)
We added the extracted unknown words to the dictionary of the stochastic
tagger, where they are recorded with a frequency calculated by the
following formula: (size_E / size_S) f(α, pos), where size_E and size_S are
the size of the EDR corpus and the size of the Scientific American corpus
respectively. Using this expanded dictionary, the tagger's accuracy
improved to 98.2%. This result shows that our method is useful as a
preprocessor for a tagger.
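
The rescaling of extracted-word frequencies to the scale of the tagged corpus amounts to a one-line computation, sketched below. The function and parameter names are illustrative, and the size of the EDR corpus used in the example is a made-up placeholder, since that figure is not given in the text; only the 617,837-character size of the Scientific American corpus is taken from the paper.

```python
def dictionary_frequency(p_pos_given_str, freq_in_raw, size_tagged, size_raw):
    """(size_tagged / size_raw) * f(alpha, pos), where
    f(alpha, pos) = p(pos | alpha) * f(alpha) is measured on the raw corpus."""
    return (size_tagged / size_raw) * p_pos_given_str * freq_in_raw

# Third row of Table 4: p(Noun | alpha) = 1.00, freq. = 103.
# size_tagged = 5_000_000 is a hypothetical placeholder value.
print(dictionary_frequency(1.00, 103, size_tagged=5_000_000, size_raw=617_837))
```

The scaling keeps the new entries comparable with the frequencies of words already counted in the tagged corpus.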
5 Conclusion 
We have described a new method to extract words from a corpus and estimate
their POSs using distributional analysis. Our method is based on the
hypothesis that the sets of strings preceding or following two arbitrary
words belonging to the same POS are similar to each other. We have proposed
a mathematically well-founded method to compute the probability
distribution with which a string belongs to given POSs. The results of the
word extraction experiments supported our hypothesis. Adding the extracted
words to the dictionary considerably improved the accuracy of a
morphological analyzer.
References 
Eric Brill and Mitchell Marcus. 1992. Automatically acquiring phrase
structure using distributional analysis. In Proc. of the DARPA Speech and
Natural Language Workshop.
Zellig Harris. 1951. Structural Linguistics. University of Chicago Press.
Japan Electronic Dictionary Research Institute, Ltd. 1993. EDR Electronic
Dictionary Technical Guide.
Shinsuke Mori and Makoto Nagao. 1995. Parsing without grammar. In Proc. of
the IWPT95.
Hinrich Schütze. 1995. Distributional part-of-speech tagging. In Proc. of
the EACL95.
