Distributional Part-of-Speech Tagging 
Hinrich Schütze
CSLI, Ventura Hall 
Stanford, CA 94305-4115 , USA 
email: schuetze@csli.stanford.edu
URL: ftp://csli.stanford.edu/pub/prosit/DisPosTag.ps 
Abstract 
This paper presents an algorithm for tag- 
ging words whose part-of-speech proper- 
ties are unknown. Unlike previous work, 
the algorithm categorizes word tokens in
context instead of word types. The algorithm
is evaluated on the Brown Corpus.
1 Introduction 
Since online text is becoming available in ever-increasing
volumes and in an ever-increasing number of languages,
there is a growing need for robust processing
techniques that can analyze text without
expensive and time-consuming adaptation to new 
domains and genres. This need motivates research 
on fully automatic text processing that may rely 
on general principles of linguistics and computa- 
tion, but does not depend on knowledge about 
individual words. 
In this paper, we describe an experiment on 
fully automatic derivation of the knowledge nec- 
essary for part-of-speech tagging. Part-of-speech 
tagging is of interest for a number of applications, 
for example access to text data bases (Kupiec, 
1993), robust parsing (Abney, 1991), and general 
parsing (deMarcken, 1990; Charniak et al., 1994). 
The goal is to find an unsupervised method for 
tagging that relies on general distributional prop- 
erties of text, properties that are invariant across 
languages and sublanguages. While the proposed 
algorithm is not successful for all grammatical cat- 
egories, it does show that fully automatic tagging 
is possible when demands on accuracy are modest. 
The following sections discuss related work, de- 
scribe the learning procedure and evaluate it on 
the Brown Corpus (Francis and Kučera, 1982).
2 Related Work 
The simplest part-of-speech taggers are bigram 
or trigram models (Church, 1989; Charniak et 
al., 1993). They require a relatively large tagged 
training text. Transformation-based tagging as 
introduced by Brill (1993) also requires a hand- 
tagged text for training. No pretagged text is nec- 
essary for Hidden Markov Models (Jelinek, 1985; 
Cutting et al., 1991; Kupiec, 1992). Still, a lexi- 
con is needed that specifies the possible parts of 
speech for every word. Brill and Marcus (1992a) 
have shown that the effort necessary to construct 
the part-of-speech lexicon can be considerably re- 
duced by combining learning procedures and a 
partial part-of-speech categorization elicited from 
an informant. 
The present paper is concerned with tagging 
languages and sublanguages for which no a priori 
knowledge about grammatical categories is avail- 
able, a situation that occurs often in practice 
(Brill and Marcus, 1992a). 
Several researchers have worked on learning 
grammatical properties of words. Elman (1990) 
trains a connectionist net to predict words, a pro- 
cess that generates internal representations that 
reflect grammatical category. Brill et al. (1990) 
try to infer grammatical category from bi- 
gram statistics. Finch and Chater (1992) and 
Finch (1993) use vector models in which words are 
clustered according to the similarity of their close 
neighbors in a corpus. Kneser and Ney (1993) 
present a probabilistic model for entropy maxi- 
mization that also relies on the immediate neigh- 
bors of words in a corpus. Biber (1993) ap- 
plies factor analysis to collocations of two target 
words ("certain" and "right") with their immedi- 
ate neighbors. 
What these approaches have in common is that 
they classify words instead of individual occurrences.
Given the widespread part-of-speech ambiguity
of words, this is problematic.¹ How should
a word like "plant" be categorized if it has uses 
both as a verb and as a noun? How can a cate- 
gorization be considered meaningful if the infini- 
tive marker "to" is not distinguished from the ho- 
mophonous preposition? 
In a previous paper (Schütze, 1993), we trained
a neural network to disambiguate part-of-speech 
¹Although Biber (1993) classifies collocations,
these can also be ambiguous. For example, "for cer- 
tain" has both senses of "certain": "particular" and 
"sure". 
word    side   nearest neighbors
onto    left   into toward away off together against beside around down
onto    right  reduce among regarding against towards plus toward using unlike
seemed  left   appeared might would remained had became could must should
seemed  right  seem seems wanted want going meant tried expect likely
Table 1: Words with most similar left and right neighbors for "onto" and "seemed".
using context; however, no information about the 
word that is to be categorized was used. This 
scheme fails for cases like "The soldiers rarely 
come home." vs. "The soldiers will come home." 
where the context is identical and information 
about the lexical item in question ("rarely" vs. 
"will") is needed in combination with context for 
correct classification. In this paper, we will com- 
pare two tagging algorithms, one based on clas- 
sifying word types, and one based on classifying 
words-plus-context. 
3 Tag induction 
We start by constructing representations of the 
syntactic behavior of a word with respect to its 
left and right context. Our working hypothe- 
sis is that syntactic behavior is reflected in co- 
occurrence patterns. Therefore, we will measure 
the similarity between two words with respect to 
their syntactic behavior to, say, their left side by 
the degree to which they share the same neighbors 
on the left. If the counts of neighbors are assem- 
bled into a vector (with one dimension for each 
neighbor), the cosine can be employed to measure 
similarity. It will assign a value close to 1.0 if two 
words share many neighbors, and 0.0 if they share 
none. We refer to the vector of left neighbors of 
a word as its left context vector, and to the vector
of right neighbors as its right context vector.
The unreduced context vectors in the experiment 
described here have 250 entries, corresponding to 
the 250 most frequent words in the Brown corpus. 
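The construction just described can be illustrated with a minimal sketch. The function names, the toy sentence and the use of plain Python lists are assumptions made for this illustration; they are not taken from the experiment itself, which used 250 feature words over the full Brown corpus.

```python
from collections import Counter
from math import sqrt

def context_vectors(tokens, num_features=250):
    """Return (left, right): dicts mapping each word to a neighbor-count vector."""
    # One dimension per feature word, i.e. per most frequent word in the corpus.
    features = [w for w, _ in Counter(tokens).most_common(num_features)]
    index = {w: i for i, w in enumerate(features)}
    vocab = set(tokens)
    left = {w: [0] * len(features) for w in vocab}
    right = {w: [0] * len(features) for w in vocab}
    for prev, cur in zip(tokens, tokens[1:]):
        if prev in index:
            left[cur][index[prev]] += 1    # prev is a left neighbor of cur
        if cur in index:
            right[prev][index[cur]] += 1   # cur is a right neighbor of prev
    return left, right

def cosine(u, v):
    """Cosine of two count vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy corpus (hypothetical); the real experiment uses the Brown corpus.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
left, right = context_vectors(tokens, num_features=5)
same_left = cosine(left["cat"], left["dog"])    # both are preceded only by "the"
diff_left = cosine(left["cat"], left["sat"])    # no shared left neighbors
```

In this toy corpus "cat" and "dog" share their only left neighbor, so their left similarity is maximal, whereas "cat" and "sat" share none.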
This basic idea of measuring distributional sim- 
ilarity in terms of shared neighbors must be mod- 
ified because of the sparseness of the data. Con- 
sider two infrequent adjectives that happen to 
modify different nouns in the corpus. Their right 
similarity according to the cosine measure would 
be zero. This is clearly undesirable. But even with 
high-frequency words, the simple vector model can 
yield misleading similarity measurements. A case 
in point is "a" vs. "an". These two articles do not 
share any right neighbors since the former is only 
used before consonants and the latter only before 
vowels. Yet intuitively, they are similar with re- 
spect to their right syntactic context despite the 
lack of common right neighbors. 
Our solution to these problems is the applica- 
tion of a singular value decomposition. We can 
represent the left vectors of all words in the cor- 
pus as a matrix C with n rows, one for each word 
whose left neighbors are to be represented, and k 
columns, one for each of the possible neighbors. 
SVD can be used to approximate the row and col- 
umn vectors of C in a low-dimensional space. In 
more detail, SVD decomposes a matrix $C$, the matrix
of left vectors in our case, into three matrices
$T_0$, $S_0$, and $D_0$ such that:
$$C = T_0 S_0 D_0'$$
$S_0$ is a diagonal k-by-k matrix that contains the
singular values of $C$ in descending order. The ith
singular value can be interpreted as indicating the
strength of the ith principal component of $C$. $T_0$
and $D_0$ are orthonormal matrices that approximate
the rows and columns of $C$, respectively. By
restricting the matrices $T_0$, $S_0$, and $D_0$ to their
first $m < k$ columns (= principal components)
one obtains the matrices $T$, $S$, and $D$. Their product
$\hat{C}$ is the best least-squares approximation of $C$
by a matrix of rank $m$: $\hat{C} = TSD'$. We chose
$m = 50$ (reduction to a 50-dimensional space) for
the SVDs described in this paper.
SVD addresses the problems of generalization 
and sparseness because broad and stable general- 
izations are represented on dimensions with large 
values which will be retained in the dimensionality 
reduction. In contrast, dimensions corresponding 
to small singular values represent idiosyncrasies, 
like the phonological constraint on the usage of 
"an" vs. "a", and will be dropped. We also gain 
efficiency since we can manipulate smaller vectors, 
reduced to 50 dimensions. We used SVDPACK 
to compute the singular value decompositions de- 
scribed in this paper (Berry, 1992). 
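The rank reduction can be sketched with a dense SVD routine. This is an illustration only: it uses NumPy's `numpy.linalg.svd` rather than SVDPACK, the matrix is random toy data, and taking the rows of TS as the reduced word vectors is an assumption of this sketch.

```python
import numpy as np

def reduce_rows(C, m):
    """Return m-dimensional row vectors and the rank-m approximation of C."""
    # Thin SVD: C = T0 @ diag(s0) @ D0t, singular values in descending order.
    T0, s0, D0t = np.linalg.svd(C, full_matrices=False)
    T, S, Dt = T0[:, :m], np.diag(s0[:m]), D0t[:m, :]
    reduced = T @ S        # one m-dimensional vector per row (word) of C
    C_hat = T @ S @ Dt     # best least-squares approximation of C of rank m
    return reduced, C_hat

# Toy count matrix: 40 "words" by 12 "neighbors" (hypothetical sizes).
rng = np.random.default_rng(0)
C = rng.poisson(1.0, size=(40, 12)).astype(float)
reduced, C_hat = reduce_rows(C, m=5)
```

The approximation error of the truncated product is determined entirely by the singular values that were dropped, which is why dimensions with small singular values can be discarded at little cost.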
Table 1 shows the nearest neighbors of two 
words (ordered according to closeness to the head 
word) after the dimensionality reduction. Neigh- 
bors with highest similarity according to both 
left and right context are listed. One can see 
clear differences between the nearest neighbors in 
the two spaces. The right-context neighbors of 
"onto" contain verbs because both prepositions 
and verbs govern noun phrases to their right. 
The left-context neighborhood of "onto" reflects 
the fact that prepositional phrases are used in 
the same position as adverbs like "away" and 
"together", thus making their left context sim- 
ilar. For "seemed", left-context neighbors are 
words that have similar types of noun phrases in 
subject position (mainly auxiliaries). The right- 
context neighbors all take "to"-infinitives as com- 
plements. An adjective like "likely" is very sim- 
ilar to "seemed" in this respect although its left 
context is quite different from that of "seemed". 
Similarly, the generalization that prepositions and 
transitive verbs are very similar if not identical 
in the way they govern noun phrases would be 
lost if "left" and "right" properties of words were 
lumped together in one representation. These ex- 
amples demonstrate the importance of represent- 
ing generalizations about left and right context 
separately. 
The left and right context vectors are the basis 
for four different tag induction experiments, which 
are described in detail below: 
• induction based on word type only 
• induction based on word type and context 
• induction based on word type and context, 
restricted to "natural" contexts 
• induction based on word type and context, 
using generalized left and right context vec- 
tors 
3.1 Induction based on word type only 
The two context vectors of a word characterize the 
distribution of neighboring words to its left and
right. The concatenation of left and right context 
vector can therefore serve as a representation of a 
word's distributional behavior (Finch and Chater, 
1992; Schütze, 1993). We formed such concatenated
vectors for all 47,025 words (surface forms)
in the Brown corpus. Here, we use the raw 250- 
dimensional context vectors and apply the SVD 
to the 47,025-by-500 matrix (47,025 words with 
two 250-dimensional context vectors each). We 
obtained 47,025 50-dimensional reduced vectors 
from the SVD and clustered them into 200 classes 
using the fast clustering algorithm Buckshot (Cut- 
ting et al., 1992) (group average agglomeration ap- 
plied to a sample). This classification constitutes 
the baseline performance for distributional part- 
of-speech tagging. All occurrences of a word are 
assigned to one class. As pointed out above, such 
a procedure is problematic for ambiguous words. 
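The clustering step can be sketched as follows. This is not the Buckshot implementation used in the experiment: it is a simplified variant under stated assumptions, namely a random sample of size sqrt(kn), naive group-average agglomeration that merges the pair of clusters with the highest average cross-cluster cosine, and a single assignment pass to the nearest centroid with no further refinement. The function names and the toy data are hypothetical.

```python
import numpy as np

def group_average_clusters(X, k):
    """Naive group-average agglomeration of the rows of X into k clusters."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T                          # cosine similarity between items
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average similarity over all cross-cluster pairs
                s = sim[np.ix_(clusters[a], clusters[b])].mean()
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)     # merge the most similar pair
    return clusters

def buckshot_like(X, k, rng):
    """Cluster a sample, then assign every row to the nearest sample centroid."""
    n = len(X)
    size = min(n, int(np.ceil(np.sqrt(k * n))))
    sample = rng.choice(n, size=size, replace=False)
    groups = group_average_clusters(X[sample], k)
    centroids = np.stack([X[sample[g]].sum(axis=0) for g in groups])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return np.argmax(Xn @ centroids.T, axis=1)

# Toy data: two well-separated groups of reduced vectors (hypothetical).
rng = np.random.default_rng(0)
A = np.array([1.0, 0.0, 0.0]) + 0.05 * rng.normal(size=(100, 3))
B = np.array([0.0, 1.0, 0.0]) + 0.05 * rng.normal(size=(100, 3))
labels = buckshot_like(np.vstack([A, B]), k=2, rng=rng)
```

Because only the sample is clustered agglomeratively, the quadratic cost of pairwise comparison is paid on sqrt(kn) items rather than on all n, which is what makes clustering tens of thousands of word vectors feasible.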
3.2 Induction based on word type and 
context 
In order to exploit contextual information in the 
classification of a token, we simply use context 
vectors of the two words occurring next to the 
token. An occurrence of word w is represented by 
a concatenation of four context vectors: 
• The right context vector of the preceding 
word. 
• The left context vector of w. 
• The right context vector of w. 
• The left context vector of the following word. 
The motivation is that a word's syntactic role 
depends both on the syntactic properties of its 
neighbors and on its own potential for entering 
into syntactic relationships with these neighbors. 
The only properties of context that we consider 
are the right-context vector of the preceding word 
and the left-context vector of the following word 
because they seem to represent the contextual in- 
formation most important for the categorization 
of w. For example, for the disambiguation of 
"work" in "her work seemed to be important", 
only the fact that "seemed" expects noun phrases 
to its left is important, the right context vector of 
"seemed" does not contribute to disambiguation. 
That only the immediate neighbors are crucial for 
categorization is clearly a simplification, but as 
the results presented below show it seems to work 
surprisingly well. 
Again, an SVD is applied to address the prob- 
lems of sparseness and generalization. We ran- 
domly selected 20,000 word triplets from the cor- 
pus and formed concatenations of four context 
vectors as described above. The singular value decomposition
of the resulting 20,000-by-1,000 matrix
defines a mapping from the 1,000-dimensional
space of concatenated context vectors to a 50- 
dimensional reduced space. Our tag set was then 
induced by clustering the reduced vectors of the 
20,000 selected occurrences into 200 classes. Each 
of the 200 tags is defined by the centroid of the cor- 
responding class (the sum of its members). Dis- 
tributional tagging of an occurrence of a word 
w proceeds then by retrieving the four relevant 
context vectors (right context vector of previous 
word, left context vector of following word, both 
context vectors of w) concatenating them to one 
1000-component vector, mapping this vector to 50 
dimensions, computing the correlations with the 
200 cluster centroids and, finally, assigning the oc- 
currence to the closest cluster. This procedure was 
applied to all tokens of the Brown corpus. 
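The tagging step for a single occurrence can be sketched as below. The two-dimensional context vectors are made-up toy counts, the projection onto the right singular vectors is one plausible reading of "the mapping defined by the SVD", and cosine similarity stands in for the correlation with the centroids; all three are assumptions of this sketch.

```python
import numpy as np

def token_vector(prev_w, w, next_w, left, right):
    """Concatenate the four context vectors representing one occurrence of w."""
    return np.concatenate([right[prev_w], left[w], right[w], left[next_w]])

def assign_tag(vec, projection, centroids):
    """Map vec to the reduced space; return the index of the closest centroid."""
    reduced = vec @ projection
    norms = np.linalg.norm(centroids, axis=1) * np.linalg.norm(reduced)
    sims = (centroids @ reduced) / np.maximum(norms, 1e-12)   # cosine
    return int(np.argmax(sims))

# Toy 2-dimensional context vectors (hypothetical counts).
left = {"her": np.array([0.0, 1.0]), "work": np.array([3.0, 1.0]),
        "seemed": np.array([1.0, 4.0])}
right = {"her": np.array([2.0, 0.0]), "work": np.array([1.0, 2.0]),
         "seemed": np.array([0.0, 3.0])}

v1 = token_vector("her", "work", "seemed", left, right)   # "work" in context
v2 = token_vector("work", "seemed", "her", left, right)   # "seemed" in context
X = np.stack([v1, v2])

# The right singular vectors of the sample matrix define the reduction.
_, _, Dt = np.linalg.svd(X, full_matrices=False)
projection = Dt[:2].T            # 8-by-2 mapping to the reduced space
centroids = X @ projection       # toy case: each token is its own cluster

tag_of_work = assign_tag(v1, projection, centroids)
tag_of_seemed = assign_tag(v2, projection, centroids)
```

In the experiment the same three operations (concatenate, project, compare with centroids) are applied with 250-dimensional context vectors, a 50-dimensional reduced space and 200 centroids.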
We will see below that this method of distribu- 
tional tagging, although partially successful, fails 
for many tokens whose neighbors are punctuation 
marks. The context vectors of punctuation marks 
contribute little information about syntactic cate- 
gorization since there are no grammatical depen- 
dencies between words and punctuation marks, in 
contrast to strong dependencies between neigh- 
boring words. 
For this reason, a second induction on the ba- 
sis of word type and context was performed, but 
only for those tokens with informative contexts. 
Tokens next to punctuation marks and tokens
with rare words (fewer than ten occurrences) as
neighbors were not included. Rare words were
excluded for a similar reason: if a word occurs
only nine or fewer times, its left and right
context vectors capture little information for
syntactic categorization. In the experiment,
20,000 natural contexts were randomly selected,
processed by the SVD and clustered into
tag   description          Penn Treebank tags
ADN   adnominal modifier   ADN* $
CC    conjunction          CC
CD    cardinal             CD
DT    determiner           DT PDT PRP$
IN    preposition          IN
ING   "-ing" forms         VBG
MD    modal                MD
N     nominal              NNP(S) NN(S)
POS   possessive marker    POS
PRP   pronoun              PRP
RB    adverbial            RB RP RBR RBS
TO    infinitive marker    TO
VB    infinitive           VB
VBD   inflected verb form  VBD VBZ VBP
VBN   predicative          VBN PRD*
WDT   wh-word              WP($) WRB WDT
Table 2: Evaluation tag set. Structural tags derived from parse trees are marked with *.
200 classes. The classification was then applied to 
all natural contexts of the Brown corpus. 
3.3 Generalized context vectors 
The context vectors used so far only capture infor- 
mation about distributional interactions with the 
250 most frequent words. Intuitively, it should be 
possible to gain accuracy in tag induction by us- 
ing information from more words. One way to do 
this is to let the right context vector record which 
classes of left context vectors occur to the right of
a word. The rationale is that words with similar 
left context characterize words to their right in a 
similar way. For example, "seemed" and "would" 
have similar left contexts, and they characterize 
the right contexts of "he" and "the firefighter" 
as potentially containing an inflected verb form. 
Rather than having separate entries in its right 
context vector for "seemed", "would", and "likes", 
a word like "he" can now be characterized by a 
generalized entry for "inflected verb form occurs 
frequently to my right". 
This proposal was implemented by applying a 
singular value decomposition to the 47025-by-250 
matrix of left context vectors and clustering the 
resulting context vectors into 250 classes. A gen- 
eralized right context vector v for word w was 
then formed by counting how often words from 
these 250 classes occurred to the right of w. En- 
try vi counts the number of times that a word 
from class i occurs to the right of w in the cor- 
pus (as opposed to the number of times that the 
word with frequency rank i occurs to the right of 
w). Generalized left context vectors were derived 
by an analogous procedure using word-based right 
context vectors. Note that the information about 
left and right is kept separate in this computation. 
This differs from previous approaches (Finch and 
Chater, 1992; Schütze, 1993) in which left and
right context vectors of a word are always used 
in one concatenated vector. There are arguably 
fewer different types of right syntactic contexts 
than types of syntactic categories. For example, 
transitive verbs and prepositions belong to differ- 
ent syntactic categories, but their right contexts 
are virtually identical in that they require a noun 
phrase. This generalization could not be exploited 
if left and right context were not treated sepa- 
rately. 
Another argument for the two-step derivation 
is that many words don't have any of the 250 
most frequent words as their left or right neighbor. 
Hence, their vector would be zero in the word- 
based scheme. The class-based scheme makes it 
more likely that meaningful representations are 
formed for all words in the vocabulary. 
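The counting step for generalized right context vectors can be sketched as follows. The mapping `word_class` from words to class indices is assumed to be given (in the experiment it results from clustering the reduced left context vectors into 250 classes); the toy sentence and the two-class assignment are hypothetical.

```python
def generalized_right_vectors(tokens, word_class, num_classes):
    """Entry i of a word's vector counts right neighbors that belong to class i."""
    vectors = {w: [0] * num_classes for w in set(tokens)}
    for cur, nxt in zip(tokens, tokens[1:]):
        vectors[cur][word_class[nxt]] += 1
    return vectors

tokens = "he seemed tired . he would go . she seemed happy".split()
# Hypothetical classes: 0 = words with verb-like left contexts, 1 = everything else.
word_class = {w: 1 for w in set(tokens)}
word_class["seemed"] = 0
word_class["would"] = 0

vectors = generalized_right_vectors(tokens, word_class, num_classes=2)
he_vector = vectors["he"]     # "seemed" and "would" fall into the same entry
she_vector = vectors["she"]
```

Here "he" receives a single generalized entry for its two different right neighbors, which is the effect the class-based scheme is meant to achieve. Generalized left context vectors would be obtained analogously from classes of right context vectors.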
The generalized context vectors were input to 
the tag induction procedure described above for 
word-based context vectors: 20,000 word triplets 
were selected from the corpus, encoded as 1,000- 
dimensional vectors (consisting of four generalized 
context vectors), decomposed by a singular value 
decomposition and clustered into 200 classes. The 
resulting classification was applied to all tokens in 
the Brown corpus. 
4 Results 
The results of the four experiments were evalu- 
ated by forming 16 classes of tags from the Penn 
Treebank as shown in Table 2. Preliminary ex- 
periments showed that distributional methods dis- 
tinguish adnominal and predicative uses of adjec- 
tives (e.g. "the black cat" vs. "the cat is black"). 
Therefore the tag "ADN" was introduced for uses 
of adjectives, nouns, and participles as adnominal 
modifiers. The tag "PRD" stands for predicative 
uses of adjectives. The Penn Treebank parses of 
the Brown corpus were used to determine whether 
a token functions as an adnominal modifier. Punc- 
tuation marks, special symbols, interjections, for- 
eign words and tags with fewer than 100 instances 
were excluded from the evaluation. 
Tables 3 and 4 present results for word type- 
based induction and induction based on word type 
and context. For each tag t, the table lists the 
frequency of t in the corpus ("frequency")²; the
number of induced tags $i_0, i_1, \ldots, i_l$ that were
assigned to it ("# classes"); the number of times an
occurrence of t was correctly labeled as belonging
to one of $i_0, i_1, \ldots, i_l$ ("correct"); the number
of times that a token of a different tag $t'$ was
²The small difference in overall frequency in the
tables is due to the fact that some word-based context 
vectors consist entirely of zeros. There were about a 
hundred word triplets whose four context vectors did 
not have non-zero entries and could not be assigned a 
cluster. 
tag   frequency  # classes  correct  incorrect  precision  recall  F
ADN      108586        [?]      [?]      19528       0.66    0.35  0.46
CC        36808          0        0          0       0.00    0.00  0.00
CD        15085          4     3376       1431       0.70    0.22  0.34
DT       129626          2   125540      31783       0.80    0.97  0.87
IN       132079          3   118726      75829       0.61    0.90  0.73
ING       14753          5     2111       1016       0.68    0.14  0.24
MD        13498          2    13383      13016       0.51    0.99  0.67
N        231434         98   193838      79652       0.71    0.84  0.77
POS        5086          1     4641       1213       0.79    0.91  0.85
PRP       47686          3    43839      21723       0.67    0.92  0.77
RB        54525        [?]      [?]      56505       0.38    0.65  0.48
TO        25196          0        0          0       0.00    0.00  0.00
VB        35342          8    29138      17945       0.62    0.82  0.71
VBD       80058         12    36653       3855       0.90    0.46  0.61
VBN       41146        [?]      [?]       8841       0.47    0.19  0.27
WDT       14093          0        0          0       0.00    0.00  0.00
avg.                                                 0.53    0.52  0.49
Table 3: Precision and recall for induction based on word type.
miscategorized as being an instance of $i_0, i_1, \ldots, i_l$
("incorrect"); and precision and recall of the categorization
of t. Precision is the number of correct
tokens divided by the sum of correct and incorrect
tokens. Recall is the number of correct tokens divided
by the total number of tokens of t (in the
first column). The last column gives van Rijsbergen's
F measure, which computes an aggregate
score from precision and recall (van Rijsbergen, 1979):
$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$$
where $P$ is precision and $R$ is recall. We chose $\alpha = 0.5$ to give
equal weight to precision and recall.
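As a worked example of these definitions, the following sketch recomputes the three scores from the counts reported for the tag IN in Table 3 (frequency 132079, correct 118726, incorrect 75829). The function name and signature are chosen for this illustration only.

```python
def f_measure(correct, incorrect, frequency, alpha=0.5):
    """Return (precision, recall, F) for one evaluation tag."""
    precision = correct / (correct + incorrect)
    recall = correct / frequency
    # van Rijsbergen's F: weighted harmonic mean of precision and recall
    f = 1.0 / (alpha / precision + (1.0 - alpha) / recall)
    return precision, recall, f

p, r, f = f_measure(correct=118726, incorrect=75829, frequency=132079)
scores = (round(p, 2), round(r, 2), round(f, 2))   # (0.61, 0.9, 0.73)
```

With alpha = 0.5 the expression reduces to the familiar harmonic mean 2PR / (P + R).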
It is clear from the tables that incorporating 
context improves performance considerably. The 
F score increases for all tags except CD, with an 
average improvement of more than 0.20. The tag 
CD is probably better thought of as describing a 
word class. There is a wide range of heterogeneous 
syntactic functions of cardinals in particular con- 
texts: quantificational and adnominal uses, bare 
NP's ("is one of"), dates and ages ("Jan 1", "gave 
his age as 25"), and enumerations. In this light, it 
is not surprising that the word-type method does 
better on cardinals. 
Table 5 shows that performance for generalized 
context vectors is better than for word-based con- 
text vectors (0.74 vs. 0.72). However, since the 
number of tags with better and worse performance 
is about the same (7 and 5), one cannot con- 
clude with certainty that generalized context vec- 
tors induce tags of higher quality. Apparently, the 
250 most frequent words capture most of the rel- 
evant distributional information so that the addi- 
tional information from less frequent words avail- 
able from generalized vectors only has a small ef- 
fect. 
Table 6 looks at results for "natural" contexts, 
i.e. those not containing punctuation marks and 
rare words. Performance is consistently better 
than for the evaluation on all contexts, indicating 
that the low quality of the distributional informa- 
tion about punctuation marks and rare words is a 
difficulty for successful tag induction. 
Even for "natural" contexts, performance varies 
considerably. It is fairly good for prepositions, de- 
terminers, pronouns, conjunctions, the infinitive 
marker, modals, and the possessive marker. Tag 
induction fails for cardinals (for the reasons men- 
tioned above) and for "-ing" forms. Present par- 
ticiples and gerunds are difficult because they ex- 
hibit both verbal and nominal properties and oc- 
cur in a wide variety of different contexts whereas 
other parts of speech have a few typical and fre- 
quent contexts. 
It may seem worrying that some of the tags are 
assigned a high number of clusters (e.g., 49 for 
N, 36 for ADN). A closer look reveals that many 
clusters embody finer distinctions. Some examples:
Nouns in cluster 0 are heads of larger noun
phrases, whereas the nouns in cluster 1 are full- 
fledged NPs. The members of classes 29 and 111 
function as subjects. Class 49 consists of proper 
nouns. However, there are many pairs or triples 
of clusters that should be collapsed into one on 
linguistic grounds. They were separated on distri- 
butional criteria that don't have linguistic corre- 
lates. 
An analysis of the divergence between our clas- 
sification and the manually assigned tags revealed 
three main sources of errors: rare words and rare 
syntactic phenomena, indistinguishable distribu- 
tion, and non-local dependencies. 
Rare words are difficult because of lack of dis- 
tributional evidence. For example, "ties" is used 
as a verb only 2 times (out of 15 occurrences in 
the corpus). Both occurrences are miscategorized, 
since its context vectors do not provide enough 
evidence for the verbal use. Rare syntactic con- 
structions pose a related problem: There are not 
enough instances to justify the creation of a sepa- 
rate cluster. For example, verbs taking bare in- 
tag   frequency  # classes  correct  incorrect  precision  recall  F
ADN      108532         42      [?]      24743       0.78     [?]  0.79
CC        36808          2      [?]       1501       0.95     [?]  0.86
CD        15084          1      [?]        809       0.48     [?]  0.09
DT       129626          6      [?]       6178       0.95     [?]  0.94
IN       132079         11      [?]      25316       0.83     [?]  0.89
ING       14753          4      [?]       4876       0.39     [?]  0.27
MD        13498          2      [?]        936       0.93     [?]  0.95
N        231424         68      [?]      51695       0.80     [?]  0.85
POS        5086          2      [?]        533       0.90     [?]  0.90
PRP       47686          7      [?]      12759       0.78     [?]  0.85
RB        54524         16      [?]      17403       0.64     [?]  0.60
TO        25196          1      [?]         61       1.00     [?]  0.96
VB        35342          8      [?]       6152       0.83     [?]  0.83
VBD       80058         17      [?]       8663       0.88     [?]  0.84
VBN       41145         11      [?]      11972       0.68     [?]  0.65
WDT       14093          2      [?]       1017       0.61     [?]  0.19
avg.                                                 0.78     [?]  0.72
Table 4: Precision and recall for induction based on word type and context.
tag   frequency  # classes  correct  incorrect  precision  recall  F
ADN      108586         50      [?]      26790       0.77     [?]  [?]
CC        36808          4      [?]       6430       0.84     [?]  0.88
CD        15085          3     3707       1530       0.71     [?]  0.36
DT       129626         10   120968       5780       0.95     [?]  0.94
IN       132079          8   123516      22070       0.85     [?]  0.89
ING       14753          2     3798       7161       0.35     [?]  0.30
MD        13498          3    13175       1059       0.93     [?]  0.95
N        231434         70   201890      33206       0.86     [?]  0.87
POS        5086          2     4932       1636       0.75     [?]  0.85
PRP       47686          5    37535       9221       0.80     [?]  0.79
RB        54524          9    29892      18398       0.62     [?]  0.58
TO        25196          1    25181         27       1.00    1.00  1.00
VB        35342          7    28879       6560       0.81    0.82  0.82
VBD       80058         15    66457      12079       0.85    0.83  0.84
VBN       41145         10    26960      17356       0.61    0.66  0.63
WDT       14093          1      [?]        563       0.80     [?]  0.26
avg.                                                 0.78    0.73  0.74
Table 5: Precision and recall for induction based on generalized context vectors.
tag   frequency  # classes  correct  incorrect  precision  recall  F
ADN       63771         36      [?]      12203       0.82     [?]  0.83
CC        16148          4      [?]       1798       0.90    0.97  0.93
CD         7011          1      [?]        918       0.67    0.26  0.38
DT        87914          9      [?]       2664       0.97    0.94  0.95
IN        91950          9      [?]       6842       0.93    0.94  0.94
ING        7268          2      [?]       1412       0.47    0.17  0.25
MD        11244          3      [?]        476       0.96    0.92  0.94
N        111368         49      [?]      14452       0.87    0.90  0.89
POS        3202          1      [?]        255       0.92    0.91  0.91
PRP       23946          7      [?]       4062       0.85    0.96  0.90
RB        32331         16      [?]       9922       0.68    0.65  0.66
TO        19859          2      [?]         53       1.00    0.98  0.99
VB        26714         11      [?]       4119       0.85    0.90  0.88
VBD       56540         33      [?]       8488       0.86    0.90  0.88
VBN       24804         14      [?]       7448       0.72    0.76  0.74
WDT        8329          3      [?]        670       0.85     [?]  0.58
avg.                                                 0.83    0.78  0.79
Table 6: Precision and recall for induction for natural contexts.
finitives were classified as adverbs since this is 
too rare a phenomenon to provide strong distri- 
butional evidence ("we do not DARE speak of", 
"legislation could HELP remove"). 
The case of the tags "VBN" and "PRD" (past 
participles and predicative adjectives) demon- 
strates the difficulties of word classes with indis- 
tinguishable distributions. There are hardly any 
distributional clues for distinguishing "VBN" and
"PRD" since both are mainly used as complements
of "to be".³ A common tag class was created
for "VBN" and "PRD" to show that they
are reasonably well distinguished from other parts 
of speech, even if not from each other. Semantic 
understanding is necessary to distinguish between 
the states described by phrases of the form "to be 
adjective" and the processes described by phrases 
of the form "to be past participle". 
Finally, the method fails if there are no local 
dependencies that could be used for categoriza- 
tion and only non-local dependencies are informa- 
tive. For example, the adverb in "Mc*N. Hester, 
CURRENTLY Dean of..." and the conjunction 
in "to add that, IF United States policies ..." 
have similar immediate neighbors (comma, NP). 
The decision to consider only immediate neighbors 
is responsible for this type of error since taking 
a wider context into account would disambiguate 
the parts of speech in question. 
5 Future Work 
There are three avenues of future research we are 
interested in pursuing. First, we are planning to 
apply the algorithm to an as yet untagged lan- 
guage. Languages with a rich morphology may 
be more difficult than English since with fewer to- 
kens per type, there is less data on which to base 
a categorization decision. 
Secondly, the error analysis suggests that con- 
sidering non-local dependencies would improve re- 
sults. Categories that can be induced well (those 
characterized by local dependencies) could be in- 
put into procedures that learn phrase structure 
(e.g. (Brill and Marcus, 1992b; Finch, 1993)).
These phrase constraints could then be incorpo- 
rated into the distributional tagger to characterize 
non-local dependencies. 
Finally, our procedure induces a "hard" part-of- 
speech classification of occurrences in context, i.e., 
each occurrence is assigned to only one category. 
It is by no means generally accepted that such 
a classification is linguistically adequate. There 
is both synchronic (Ross, 1972) and diachronic 
(Tabor, 1994) evidence suggesting that words and 
their uses can inherit properties from several pro- 
totypical syntactic categories. For example, "fun" 
³Because of phrases like "I had sweet potatoes",
forms of "have" cannot serve as a reliable discrimina- 
tor either. 
in "It's a fun thing to do." has properties of both a 
noun and an adjective (superlative "funnest" pos- 
sible). We are planning to explore "soft" classifi- 
cation algorithms that can account for these phe- 
nomena. 
6 Conclusion 
In this paper, we have attempted to construct an 
algorithm for fully automatic distributional tag- 
ging, using unannotated corpora as the sole source 
of information. The main innovation is that the 
algorithm is able to deal with part-of-speech am- 
biguity, a pervasive phenomenon in natural lan- 
guage that was unaccounted for in previous work 
on learning categories from corpora. The method 
was systematically evaluated on the Brown cor- 
pus. Even if no automatic procedure can rival the 
accuracy of human tagging, we hope that the al- 
gorithm will facilitate the initial tagging of texts 
in new languages and sublanguages. 
7 Acknowledgments 
I am grateful for helpful comments to Steve Finch, 
Jan Pedersen and two anonymous reviewers (from 
ACL and EACL). I'm also indebted to Michael 
Berry for SVDPACK and to the Penn Treebank 
Project for the parsed Brown corpus. 
References 
Steven Abney. 1991. Parsing by chunks. In 
Berwick, Abney, and Tenny, editors, Principle- 
Based Parsing. Kluwer Academic Publishers. 
Michael W. Berry. 1992. Large-scale sparse 
singular value computations. The Interna- 
tional Journal of Supercomputer Applications, 
6(1):13-49. 
Douglas Biber. 1993. Co-occurrence patterns 
among collocations: A tool for corpus-based 
lexical knowledge acquisition. Computational 
Linguistics, 19(3):531-538. 
Eric Brill and Mitch Marcus. 1992a. Tagging 
an unfamiliar text with minimal human super- 
vision. In Robert Goldman, editor, Working 
Notes of the AAAI Fall Symposium on Proba- 
bilistic Approaches to Natural Language. AAAI 
Press. 
Eric Brill and Mitchell Marcus. 1992b. Au- 
tomatically acquiring phrase structure using 
distributional analysis. In Proceedings of the 
DARPA workshop "Speech and Natural Lan- 
guage", pages 155-159. 
Eric Brill, David Magerman, Mitch Marcus, and 
Beatrice Santorini. 1990. Deducing linguistic 
structure from the statistics of large corpora. In 
Proceedings of the DARPA Speech and Natural 
Language Workshop, pages 275-282. 
Eric Brill. 1993. Automatic grammar induction 
and parsing free text: A transformation-based 
approach. In Proceedings of ACL 31, Columbus 
OH. 
Eugene Charniak, Curtis Hendrickson, Neil Ja- 
cobson, and Mike Perkowitz. 1993. Equations 
for part-of-speech tagging. In Proceedings of the 
Eleventh National Conference on Artificial In- 
telligence, pages 784-789. 
Eugene Charniak, Glenn Carroll, John Adcock, 
Anthony Cassandra, Yoshihiko Gotoh, Jeremy 
Katz, Michael Littman, and John McCann. 
1994. Taggers for parsers. Technical Report
CS-94-06, Brown University. 
Kenneth W. Church. 1989. A stochastic parts 
program and noun phrase parser for unrestricted
text. In Proceedings of ICASSP-89,
Glasgow, Scotland. 
Doug Cutting, Julian Kupiec, Jan Pedersen, and 
Penelope Sibun. 1991. A practical part- 
of-speech tagger. In The 3rd Conference on 
Applied Natural Language Processing, Trento, 
Italy. 
Douglas R. Cutting, Jan O. Pedersen, David
Karger, and John W. Tukey. 1992. Scat- 
ter/gather: A cluster-based approach to brows- 
ing large document collections. In Proceedings 
of SIGIR '92, pages 318-329.
C. G. deMarcken. 1990. Parsing the LOB corpus. 
In Proceedings of the 28th Annual Meeting of 
the Association for Computational Linguistics, 
pages 243-259. 
Jeffrey L. Elman. 1990. Finding structure in time. 
Cognitive Science, 14:179-211. 
Steven Finch and Nick Chater. 1992. Bootstrap- 
ping syntactic categories using statistical meth- 
ods. In Walter Daelemans and David Powers, 
editors, Background and Experiments in Machine
Learning of Natural Language, pages 229-235,
Tilburg University. Institute for Language
Technology and AI. 
Steven Paul Finch. 1993. Finding Structure in 
Language. Ph.D. thesis, University of Edin- 
burgh. 
W.N. Francis and F. Kučera. 1982. Frequency
Analysis of English Usage. Houghton Mifflin, 
Boston. 
F. Jelinek. 1985. Robust part-of-speech tagging 
using a hidden markov model. Technical report, 
IBM, T.J. Watson Research Center. 
Reinhard Kneser and Hermann Ney. 1993. Forming
word classes by statistical clustering for statistical
language modelling. In Reinhard Köhler
and Burghard B. Rieger, editors, Contribu- 
tions to Quantitative Linguistics, pages 221- 
226. Kluwer Academic Publishers, Dordrecht, 
The Netherlands. 
Julian Kupiec. 1992. Robust part-of-speech tag- 
ging using a hidden markov model. Computer 
Speech and Language, 6:225-242. 
Julian Kupiec. 1993. Murax: A robust linguistic 
approach for question answering using an on- 
line encyclopedia. In Proceedings of SIGIR '93, 
pages 181-190. 
John R. Ross. 1972. The category squish: End- 
station Hauptwort. In Papers from the Eighth 
Regional Meeting. Chicago Linguistic Society. 
Hinrich Schütze. 1993. Part-of-speech induction
from scratch. In Proceedings of ACL 31, pages 
251-258, Columbus OH. 
Whitney Tabor. 1994. Syntactic Innovation: A 
Connectionist Model. Ph.D. thesis, Stanford 
University. 
C. J. van Rijsbergen. 1979. Information Retrieval.
Butterworths, London. Second Edition.
