PART-OF-SPEECH INDUCTION FROM SCRATCH 
Hinrich Schütze
Center for the Study of Language and Information 
Ventura Hall 
Stanford, CA 94305-4115 
schuetze@csli.stanford.edu
Abstract 
This paper presents a method for inducing 
the parts of speech of a language and part- 
of-speech labels for individual words from a 
large text corpus. Vector representations for 
the part-of-speech of a word are formed from 
entries of its near lexical neighbors. A dimen- 
sionality reduction creates a space represent- 
ing the syntactic categories of unambiguous 
words. A neural net trained on these spa- 
tial representations classifies individual con- 
texts of occurrence of ambiguous words. The 
method classifies both ambiguous and unam- 
biguous words correctly with high accuracy. 
INTRODUCTION 
Part-of-speech information about individual words 
is necessary for any kind of syntactic and higher 
level processing of natural language. While it is 
easy to obtain lists with part of speech labels for 
frequent English words, such information is not 
available for less common languages. Even for En- 
glish, a categorization of words that is tailored to a 
particular genre may be desired. Finally, there are 
rare words that need to be categorized even if fre- 
quent words are covered by an available electronic 
dictionary. 
This paper presents a method for inducing the 
parts of speech of a language and part-of-speech 
labels for individual words from a large text cor- 
pus. Little, if any, language-specific knowledge is 
used, so that it is applicable to any language in 
principle. Since the part-of-speech representations 
are derived from the corpus, the resulting catego- 
rization is highly text specific and doesn't contain 
categories that are inappropriate for the genre in 
question. The method is efficient enough for vo- 
cabularies of tens of thousands of words, thus ad-
dressing the problem of coverage. 
The problem of how syntactic categories can be 
induced is also of theoretical interest in language 
acquisition and learnability. Syntactic category 
information is part of the basic knowledge about 
language that children must learn before they can 
acquire more complicated structures. It has been 
claimed that "the properties that the child can 
detect in the input - such as the serial positions 
and adjacency and co-occurrence relations among 
words - are in general linguistically irrelevant." 
(Pinker 1984). It will be shown here that relative
position of words with respect to each other is suf- 
ficient for learning the major syntactic categories. 
In the first part of the derivation, two iterations 
of a massive linear approximation of cooccurrence 
counts categorize unambiguous words. Then a 
neural net trained on these words classifies indi- 
vidual contexts of occurrence of ambiguous words. 
An evaluation suggests that the method classi- 
fies both ambiguous and unambiguous words cor- 
rectly. It differs from previous work in its effi- 
ciency and applicability to large vocabularies; and 
in that linguistic knowledge is only used in the 
very last step so that theoretical assumptions that 
don't hold for a language or sublanguage have min- 
imal influence on the classification. 
The next two sections describe the linear ap- 
proximation and a birecurrent neural network for 
the classification of ambiguous words. The last 
section discusses the results. 
CATEGORY SPACE 
The goal of the first step of the induction is to com- 
pute a multidimensional real-valued space, called 
category space, in which the syntactic category of 
each word is represented by a vector. Proximity in 
the space is related to similarity of syntactic cat- 
egory. The vectors in this space will then be used 
as input and target vectors for the connectionist 
net. 
The vector space is bootstrapped by collecting 
relevant distributional information about words. 
The 5,000 most frequent words in five months of 
the New York Times News Service (June through 
October 1990) were selected for the experiments. 
For each pair of these words <w_i, w_j>, the num-
ber of occurrences of w_i immediately to the left of
w_j (b_{i,j}), the number of occurrences of w_i immedi-
ately to the right of w_j (c_{i,j}), the number of occur-
rences of w_i at a distance of one word to the left of
w_j (a_{i,j}), and the number of occurrences of w_i at a
distance of one word to the right of w_j (d_{i,j}) were
counted. The four sets of 25,000,000 counts were
collected in the 5,000-by-5,000 matrices B, C, A,
and D, respectively. Finally these four matrices
were combined into one large 5,000-by-20,000 ma-
trix as shown in Figure 1. The figure also shows
for two words where their four cooccurrence counts
are located in the 5,000-by-20,000 matrix. In the
experiments, w_3000 was resistance and w_4250 was
theaters. The four marks in the figure, the posi-
tions of the counts a_{3000,4250}, b_{3000,4250}, c_{3000,4250},
and d_{3000,4250}, indicate how often resistance oc-
curred at positions -2, -1, 1, and 2 with respect
to theaters.
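As an illustration of this counting step, the following minimal sketch fills and concatenates the four count matrices. It assumes a tokenized corpus given as a list of strings and a fixed vocabulary list; the function name and variables are illustrative, not taken from the original implementation.

import numpy as np

def cooccurrence_matrix(tokens, vocab):
    """Count, for every word pair, how often w_i occurs at offsets -2, -1,
    +1, +2 relative to w_j, and concatenate the four blocks as in Figure 1."""
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    # A, B, C, D collect the counts a_{i,j}, b_{i,j}, c_{i,j}, d_{i,j}.
    A, B, C, D = (np.zeros((n, n), dtype=np.int32) for _ in range(4))
    for pos, w in enumerate(tokens):
        j = index.get(w)
        if j is None:
            continue
        for offset, M in ((-2, A), (-1, B), (1, C), (2, D)):
            k = pos + offset
            if 0 <= k < len(tokens) and tokens[k] in index:
                M[index[tokens[k]], j] += 1
    return np.hstack([A, B, C, D])   # n-by-4n matrix whose rows are compared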
These 20,000-element rows of the matrix could 
be used directly to compute the syntactic similar- 
ity between individual words: The cosine of the
angle between the vectors of a pair of words is a
measure of their similarity. (The cosine between
two vectors v and w corresponds to the normalized
correlation coefficient: cos(α(v, w)) = Σ_i v_i w_i /
(sqrt(Σ_i v_i^2) sqrt(Σ_i w_i^2)).) However, computa-
tions with such large vectors are time-consuming. 
Therefore a singular value decomposition was per- 
formed on the matrix. Fifteen singular values were 
computed using a sparse matrix algorithm from 
SVDPACK (Berry 1992). As a result, each of the 
5,000 words is represented by a vector of real num- 
bers. Since the original 20,000-component vectors 
of two words (corresponding to rows in the ma- 
trix in Figure 1) are similar if their collocations 
are similar, the same holds for the reduced vectors 
because the singular value decomposition finds the 
best least square approximation for the 5,000 orig- 
inal vectors in a 15-dimensional space that pre- 
serves similarity between vectors. See (Deerwester 
et al. 1990) for a definition of SVD and an appli- 
cation to a similar problem. 
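The dimensionality reduction and the similarity computation can be sketched as follows, using scipy's sparse SVD in place of SVDPACK; this is only an illustration of the idea, not the original implementation.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def reduce_rows(M, dims=15):
    """Truncated SVD: one dims-dimensional vector per row (word) of M."""
    U, S, Vt = svds(csr_matrix(M, dtype=float), k=dims)
    return U * S

def cosine(v, w):
    """Cosine of the angle between two word vectors."""
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

# Example: nearest neighbor of word i in the reduced space.
# vectors = reduce_rows(M)
# neighbor = max((j for j in range(len(vectors)) if j != i),
#                key=lambda j: cosine(vectors[i], vectors[j]))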
Close neighbors in the 15-dimensional space 
generally have the same syntactic category as can 
be seen in Table 1. However, the problem with this 
method is that it will not scale up to a very large 
number of words. The singular value decomposi- 
tion has a time complexity quadratic in the rank 
of the matrix, so that one can only treat a small 
part of the total vocabulary of a large corpus. 
Therefore, an alternative set of features was con- 
sidered: classes of words in the 15-dimensional 
space. Instead of counting the number of occur- 
rences of individual words, we would now count 
the number of occurrences of members of word
classes (cf. Brown et al. 1992, where the same idea
of improving generalization and accuracy by look-
ing at word classes instead of individual words is
used). The space was clustered with Buckshot, a
linear-time clustering algorithm described in (Cut- 
ting et al. 1992). Buckshot applies a high-quality
quadratic clustering algorithm to a random sam-
ple of size √(kn), where k is the number of desired
cluster centers and n is the number of vectors to
be clustered. Each of the remaining n - √(kn) vec-
tors is assigned to the nearest cluster center. The 
high-quality quadratic clustering algorithm used 
was truncated group average agglomeration (Cut- 
ting et al. 1992). 
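A rough sketch of the Buckshot step follows, with scipy's group-average agglomerative clustering standing in for the truncated group average agglomeration of the paper; the seeding details are therefore an assumption, and the names are illustrative.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def buckshot(vectors, k, seed=0):
    """Cluster a random sample of size sqrt(k*n) with a quadratic algorithm,
    then assign every vector to the nearest of the resulting cluster centers."""
    rng = np.random.default_rng(seed)
    n = len(vectors)
    sample = vectors[rng.choice(n, size=int(np.sqrt(k * n)), replace=False)]
    # Group-average agglomeration of the sample into (at most) k groups.
    labels = fcluster(linkage(sample, method="average"), t=k, criterion="maxclust")
    centers = np.array([sample[labels == c].mean(axis=0) for c in np.unique(labels)])
    # Nearest-center assignment for all vectors.
    dists = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers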
Clustering algorithms generally do not con- 
struct groups with just one member. But there 
are many closed-class words such as auxiliaries and 
prepositions that shouldn't be thrown together 
with the open classes (verbs, nouns etc.). There- 
fore, a list of 278 closed-class words, essentially the 
words with the highest frequency, was set aside. 
The remaining 4722 words were classified into 222 
classes using Buckshot. 
The resulting 500 classes (278 high-frequency 
words, 222 clusters) were used as features in the 
matrix shown in Figure 2. Since the number of 
features has been greatly reduced, a larger num- 
ber of words can be considered. For the second 
matrix all 22,771 words that occurred at least 100 
times in 18 months of the New York Times News 
Service (May 1989 - October 1990) were selected. 
Again, there are four submatrices, corresponding 
to four relative positions. For example, the entries
a_{i,j} in the A part of the matrix count how often
a member of class i occurs at a distance of one 
word to the left of word j. Again, a singular value 
decomposition was performed on the matrix; this
time 10 singular values were computed. (Note that
in the first figure the 20,000-element rows of the 
matrix are reduced to 15 dimensions whereas in 
the second matrix the 2,000-element columns are 
reduced to 10 dimensions.) 
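A sketch of how the class-based counts of Figure 2 could be collected; word_class is assumed to map every member of the 500 feature classes (278 closed-class words plus 222 clusters) to an integer class id, and the names are again illustrative.

import numpy as np

def class_feature_matrix(tokens, word_class, vocab, n_classes=500):
    """Columns are the words to be categorized; the 4 * n_classes rows count
    class members at offsets -2, -1, +1, +2 relative to each word."""
    col = {w: j for j, w in enumerate(vocab)}
    M = np.zeros((4 * n_classes, len(vocab)), dtype=np.int32)
    for pos, w in enumerate(tokens):
        j = col.get(w)
        if j is None:
            continue
        for block, offset in enumerate((-2, -1, 1, 2)):
            k = pos + offset
            if 0 <= k < len(tokens) and tokens[k] in word_class:
                M[block * n_classes + word_class[tokens[k]], j] += 1
    return M   # its columns are then reduced to 10 dimensions by a second SVD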
Table 2 shows 20 randomly selected words and 
their nearest neighbors in category space (in order 
of proximity to the head word). As can be seen 
from the table, proximity in the space is a good 
predictor of similar syntactic category. The near- 
est neighbors of athlete, clerk, declaration, and 
dome are singular nouns, the nearest neighbors 
of bowers and gibbs are family names, the near- 
est neighbors of desirable and sole are adjectives, 
and the nearest neighbors of financings are plu- 
ral nouns, in each case without exception. The 
neighborhoods of armaments, cliches and luxuries 
(nouns), and b'nai and northwestern (NP-initial 
modifiers) fail to respect finer grained syntactic 
Figure 1: The setup of the matrix for the first singular value decomposition. [The four submatrices A, B, C, and D are concatenated into one 5,000-by-20,000 matrix; the marks indicate the positions of the four cooccurrence counts for resistance (w_3000) and theaters (w_4250).]
Table 1: Ten random and three selected words and their nearest neighbors in category space 1.

accompanied: submitted banned financed developed authorized headed canceled awarded barred
almost: virtually merely formally fully quite officially just nearly only less
causing: reflecting forcing providing creating producing becoming carrying particularly
classes: elections courses payments losses computers performances violations levels pictures
directors: professionals investigations materials competitors agreements papers transactions
goal: mood roof eye image tool song pool scene gap voice
japanese: chinese iraqi american western arab foreign european federal soviet indian
represent: reveal attend deliver reflect choose contain impose manage establish retain
think: believe wish know realize wonder assume feel say mean bet
york: angeles francisco sox rouge kong diego zone vegas inning layer
on: through in at over into with from for by across
we: you i he she nobody who it everybody there they
must: might would could cannot will should can may does helps
Figure 2: The setup of the matrix for the second singular value decomposition. [Four 500-feature submatrices A, B, C, and D are stacked over the 22,771 word columns.]
Table 2: Twenty random and four selected words and their neighborhoods in category space 2.

armaments: turmoil weaponry landmarks coordination prejudices secrecy brutality unrest harassment |
athlete: virus scenario | event audience disorder organism candidate procedure epidemic
b'nai: | suffolk sri allegheny cosmopolitan berkshire cuny broward multimedia bovine nytimes
bowers: jacobs levine cart hahn schwartz adams buckley dershowitz fitzpatrick peterson |
clerk: salesman | psychologist photographer preacher mechanic dancer lawyer trooper trainer
cliches: pests wrinkles outbursts streams icons endorsements | friction unease appraisals lifestyles
cruz: antonio | clara pont saud monica paulo rosa mae attorney palma
declaration: sequence mood profession marketplace concept facade populace downturn moratorium |
desirable: recognizable | frightening loyal devastating exciting troublesome awkward palpable
dome: blackout furnace temblor quartet citation chain countdown thermometer shaft |
equally: | somewhat progressively acutely enormously excessively unnecessarily largely scattered
financings: | endeavors monopolies raids patrols stalls offerings occupations philosophies religions
gibbs: adler reid webb jenkins stevens carr laurent dempsey hayes farrell |
luxuries: volatility insight hostility dissatisfaction stereotypes competence unease animosity residues |
northwestern: | baja rancho harvard westchester ubs humboldt laguna guinness vero granada
oh: gee gosh ah hey | appleton ashton dolly boldface baskin lo
sole: | lengthy vast monumental rudimentary nonviolent extramarital lingering meager gruesome
transports: | spokesman copyboy staffer barrios comptroller alloy stalks spokeswoman dal spokesperson
vividly: | skillfully frantically calmly confidently streaming relentlessly discreetly spontaneously
walks: floats | jumps collapsed sticks stares crumbled peaked disapproved runs crashed
claims: credits promises | forecasts shifts searches trades practices processes supplements controls
on: through from in | at by within with under against for
must: will might would cannot could can should won't | doesn't may
they: we | i you who nobody he it she everybody there
distinctions, but are reasonable representations of 
syntactic category. The neighbors of cruz (sec- 
ond components of names), and equally and vividly 
(adverbs) include words of the wrong category, but 
are correct for the most part. 
In order to give a rough idea of the density of 
the space in different locations, the symbol "|" is
placed before the first neighbor in Table 2 that 
has a correlation of 0.978 or less with the head 
word. As can be seen from the table, the re- 
gions occupied by nouns and proper names are 
dense, whereas adverbs and adjectives have more 
distant nearest neighbors. One could attempt to 
find a fixed threshold that would separate neigh- 
bors of the same category from syntactically dif- 
ferent ones. For instance, the neighbors of oh with 
a correlation higher than 0.978 are all interjections 
and the neighbors of cliches within the threshold 
region are all plural nouns. However, since the 
density in the space is different for different re- 
gions, it is unlikely that a general threshold for all 
syntactic categories can be found. 
The neighborhoods of transports and walks are 
not very homogeneous. These two words are 
ambiguous between third person singular present 
tense and plural noun. Ambiguity is a problem 
for the vector representation scheme used here, be- 
cause the two components of an ambiguous vector 
can add up in a way that makes it by chance simi- 
lar to an unambiguous word of a different syntactic 
category. If we call the distributional vector v_c of
words of category c the profile of category c, and
if a word w_1 is used with frequency α in category
c_1 and with frequency β in category c_2, then the
weighted sum of the profiles (which corresponds to
a column for word w_1 in Figure 2) may turn out
to be the same as the profile of an unrelated third
category c_3:

    α v_{c_1} + β v_{c_2} ≈ v_{c_3}
This is probably what happened in the cases of 
transports and walks. The neighbors of claims 
demonstrate that there are homogeneous "am- 
biguous" regions in the space if there are enough 
words with the same ambiguity and the same fre- 
quency ratio of the categories. Transports and
walks (together with floats, jumps, sticks, stares,
and runs) seem to have frequency ratios α/β dif-
ferent from claims, so that they ended up in dif- 
ferent regions. 
The last three lines of Table 2 indicate that func- 
tion words such as prepositions, auxiliaries, and 
nominative pronouns and quantifiers occupy their 
own regions, and are well separated from each 
other and from open classes. 
A BIRECURRENT NETWORK 
FOR PART-OF-SPEECH 
PREDICTION 
A straightforward way to take advantage of the 
vector representations for part of speech catego- 
rization is to cluster the space and to assign part- 
of-speech labels to the clusters. This was done 
with Buckshot. The resulting 200 clusters yielded 
good results for unambiguous words. However, for 
the reasons discussed above (linear combination of 
profiles of different categories) the clustering was 
not very successful for ambiguous words. There- 
fore, a different strategy was chosen for assigning 
category labels. In order to tease apart the differ- 
ent uses of ambiguous words, one has to go back to 
the individual contexts of use. The connectionist 
network in Figure 3 was used to analyze individual 
contexts. 
The idea of the network is similar to Elman's re- 
current networks (Elman 1990, Elman 1991): The 
network learns about the syntactic structure of the 
language by trying to predict the next word from
its own context units in the previous step and the 
current word. The network in Figure 3 has two 
novel features: It uses the vectors from the second 
singular value decomposition as input and target.
Note that distributed vector representations are 
ideal for connectionist nets, so that a connection- 
ist model seems most appropriate for the predic- 
tion task. The second innovation is that the net 
is birecurrent. It has recurrency to the left as well 
as to the right. 
In more detail, the network's input consists of
the word to the left t_{n-1}, its own left context in the
previous time step c-l_{n-1}, the word to the right
t_{n+1}, and its own right context c-r_{n+1} in the next
time step. The second layer has the context units
of the current time step. These feed into thirty
hidden units h_n which in turn produce the output
vector o_n. The target is the current word t_n. The
output units are linear, the hidden units are sigmoidal.
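A minimal sketch of the forward computation of this network (10-unit word and output vectors, 15-unit contexts, 30 hidden units, as in the paper); the weight-matrix names are invented for the illustration, and putting a sigmoid on the context units is an assumption, since the paper only specifies the hidden and output units.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BirecurrentNet:
    def __init__(self, word_dim=10, ctx_dim=15, hidden_dim=30, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda rows, cols: rng.normal(scale=0.1, size=(rows, cols))
        self.W_left = init(ctx_dim, word_dim + ctx_dim)    # (t_{n-1}, c-l_{n-1}) -> c-l_n
        self.W_right = init(ctx_dim, word_dim + ctx_dim)   # (t_{n+1}, c-r_{n+1}) -> c-r_n
        self.W_hidden = init(hidden_dim, 2 * ctx_dim)      # (c-l_n, c-r_n) -> h_n
        self.W_out = init(word_dim, hidden_dim)            # h_n -> o_n (linear output)

    def left_context(self, t_prev, cl_prev):
        return sigmoid(self.W_left @ np.concatenate([t_prev, cl_prev]))

    def right_context(self, t_next, cr_next):
        return sigmoid(self.W_right @ np.concatenate([t_next, cr_next]))

    def predict(self, cl_n, cr_n):
        h = sigmoid(self.W_hidden @ np.concatenate([cl_n, cr_n]))
        return self.W_out @ h           # linear output units; the target is t_n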
The network was trained stochastically with 
truncated backpropagation through time (BPTT, 
Rumelhart et al. 1986, Williams and Peng 1990). 
For this purpose, the left context units were un-
folded four time steps to the left and the right con-
text units four time steps to the right as shown
in Figure 4. The four blocks of weights on the
connections to c-l_{n-3}, c-l_{n-2}, c-l_{n-1}, and c-l_n
are linked to ensure identical mapping from one
"time step" to the next. The connections on the
right side are linked in the same way. The train-
ing set consisted of 8,000 words in the New York 
Times newswire (from June 1990). For each train- 
ing step, four words to the left of the target word
(t_{n-4}, t_{n-3}, t_{n-2}, and t_{n-1}) and four words to the
right of the target word (t_{n+1}, t_{n+2}, t_{n+3}, and t_{n+4})
were the input to the unfolded network. The tar-
get was the word t_n. A modification of bp from
the pdp package was used with a learning rate of
0.01 for recurrent units, 0.001 for other units, and
no momentum.

Figure 4: Unfolded birecurrent network in training.
After training, the network was applied to 
the category prediction tasks described below by 
choosing a part of the text without unknown 
words, computing all left contexts from left to 
right, computing all right contexts from right to 
left, and finally predicting the desired category of 
a word t_n by using the precomputed contexts c-l_n
and c-r_n.
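Using the network sketch given above, applying it to a stretch of text might look like this; word_vec is assumed to map every word to its 10-dimensional vector from the second singular value decomposition.

import numpy as np

def predict_categories(net, words, word_vec, ctx_dim=15):
    """One left-to-right pass for the c-l's, one right-to-left pass for the
    c-r's, then one output vector per position."""
    vecs = [word_vec[w] for w in words]
    n = len(vecs)
    cl = [np.zeros(ctx_dim)]                      # left context for position 0
    for i in range(1, n):
        cl.append(net.left_context(vecs[i - 1], cl[-1]))
    cr = [np.zeros(ctx_dim)]                      # right context for position n-1
    for i in range(n - 2, -1, -1):
        cr.append(net.right_context(vecs[i + 1], cr[-1]))
    cr.reverse()
    return [net.predict(cl[i], cr[i]) for i in range(n)]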
In order to tag the occurrence of a word, one 
could retrieve the word in category space whose 
vector is closest to the output vector computed by 
the network. However, this would give rise to too 
much variety in category labels. To illustrate, con- 
sider the prediction of the category NOUN. If the 
network categorizes occurrences of nouns correctly 
as being in the region around declaration, then the 
slightest variation in the output will change the 
nearest neighbor of the output vector from decla- 
ration to its nearest neighbors sequence or mood 
(see Table 2). This would be confusing to the hu- 
man user of the categorization program. 
Therefore, the first 5,000 output vectors of the 
network (from the first day of June 1990), were 
clustered into 200 output clusters with Buckshot. 
Each output cluster was labeled by the two words 
closest to its centroid. Table 3 lists labels of some 
of the output clusters that occurred in the ex- 
periment described below. They are easily in- 
terpretable for someone with minimal linguistic 
knowledge as the examples show. For some cat- 
egories such as HIS_THE one needs to look at a
couple of instances to get a "feel" for their meaning.
Figure 3: The architecture of the birecurrent network. [Word and output vectors t and o_n have 10 units, the context vectors c-l and c-r 15 units each, and the hidden layer h_n 30 units.]
Table 3: The labels of 10 output clusters (output cluster label: part of speech).

excel_depart: intransitive verb (base form)
prompt_select: transitive verb (base form)
cares_sounds: 3rd person sg. present tense
office_staff: noun
promotion_trauma: noun
famous_talented: adjective
publicly_badly: adverb
his_the: NP-initial
The syntactic distribution of an individual word
can now be more accurately determined by the
following algorithm (a sketch in code follows the
list):

• compute an output vector for each position in
the text at which the target word occurs.

• for each output vector j do the following:
  - determine the centroid of the cluster i which
    is closest
  - compute the correlation coefficient of the out-
    put vector j and the centroid of the output
    cluster i. This is the score s_{i,j} for cluster i
    and vector j. Assign zero to the scores of the
    other clusters for this vector: s_{k,j} := 0, k ≠ i

• for each cluster i, compute the final score f_i as
  the sum of the scores s_{i,j}: f_i := Σ_j s_{i,j}

• normalize the vector of 200 final scores to unit
  length
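In code, the scoring procedure might look as follows; output_vectors are the network outputs for all occurrences of one word, centroids are the 200 output-cluster centroids, and the correlation is computed as a cosine (following the earlier remark equating the cosine with the normalized correlation coefficient), which is an assumption about the exact formula.

import numpy as np

def category_scores(output_vectors, centroids):
    """Score the output clusters for one word, following the algorithm above."""
    def cosine(v, w):
        return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))
    final = np.zeros(len(centroids))
    for o in output_vectors:                    # one vector per occurrence
        sims = [cosine(o, c) for c in centroids]
        i = int(np.argmax(sims))                # nearest cluster i only
        final[i] += sims[i]                     # s_{k,j} = 0 for all k != i
    total = final.sum()                         # used for the cutoff of 30 below
    return final / np.linalg.norm(final), total # unit-length vector of final scores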
This algorithm was applied to June 1990. If for 
a given word, the sum of the unnormalized final 
scores was less than 30 (corresponding to roughly 
100 occurrences in June), then this word was dis- 
carded. Table 4 lists the highest scoring categories 
for 10 random words and 11 selected ambiguous 
words. (Only categories with a score of at least 
0.2 are listed.) 
The network failed to learn the distinctions be- 
tween adjectives, intransitive present participles 
and past participles in the frame "to-be + [ ] +
non-NP". For this reason, the adjective close, the
present participle beginning, and the past partici- 
ple shot are all classified as belonging to the cate- 
gory STRUGGLING_TRAVELING. (Present partici-
ples are successfully discriminated in the frame
"to-be + [ ] + NP": see winning in the table, which
is classified as the progressive form of a transitive 
verb: HOLDING_PROMISING.) This is the place 
where linguistic knowledge has to be injected in
the form of the following two rules (a code sketch
follows the list):
• If a word in STRUGGLING_TRAVELING is a mor- 
phological present participle or past participle 
assign it to that category, otherwise to the cat- 
egory ADJECTIVE_PREDICATIVE. 
• If a word in a noun category is a morpho-
logical plural assign it to NOUN_PLURAL, to 
NOUN_SINGULAR otherwise. 
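A sketch of how these two rules could be applied on top of the induced cluster labels; the suffix tests stand in for a real morphological analysis, and the specific label strings and the contents of NOUN_CATEGORIES are illustrative.

NOUN_CATEGORIES = {"OFFICE_STAFF", "PROMOTION_TRAUMA"}   # example noun-labeled clusters

def apply_rules(word, category):
    """Split the conflated categories as described by the two rules above."""
    if category == "STRUGGLING_TRAVELING":
        if word.endswith("ing"):                          # crude present-participle test
            return "PRESENT_PARTICIPLE"
        if word.endswith("ed") or word.endswith("en"):    # crude past-participle test
            return "PAST_PARTICIPLE"
        return "ADJECTIVE_PREDICATIVE"
    if category in NOUN_CATEGORIES:
        return "NOUN_PLURAL" if word.endswith("s") else "NOUN_SINGULAR"
    return category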
With these two rules, all major categories 
are among the first found by the algorithm; 
in particular the major categories of the am- 
biguous words better (adjective/adverb), close 
(verb/adjective), work (noun/base form of verb), 
hopes (noun/third person singular), beginning 
(noun/present-participle), shot (noun/past par- 
ticiple) and's ('s/is). There are two clear errors: 
GIVEN_TAKING for contain, and RICAN_ADVISORY 
for 's, both of rank three in the table. 
Table 4: The highest scoring categories for 10 random and 11 selected words.

word
adequate
admit
appoint
consensus
contain
dodgers
genes
language
legacy
thirds
good
better
close
work
hospital
buy
hopes
beginning
shot
's
winning

highest scoring categories
universal_martial (0.50)
excel_depart (0.88)
prompt_select (0.72)
office_staff (0.71)
gather_propose (0.76)
promotion_trauma (0.57)
office_staff (0.43)
promotion_trauma (0.65)
promotion_trauma (0.95)
hand_shooting (0.75)
famous_talented (0.86)
famous_talented (0.65)
gather_propose (0.43)
excel_depart (0.72)
promotion_trauma (0.75)
gather_propose (0.77)
promotion_trauma (0.56)
promotion_trauma (0.90)
hand_shooting (0.54)
's_facto (0.54)
famous_talented (0.71)
struggling_traveling (0.33)
gather_propose (0.30)
gather_propose (0.65)
promotion_trauma (0.43)
prompt_select (0.43)
yankees_paper (0.52)
promotion_trauma (0.75)
office_staff (0.57)
office_staff (0.22)
famous_talented (0.41)
his_the (0.34)
struggling_traveling (0.42)
promotion_trauma (0.51)
office_agent (0.40)
prompt_select (0.47)
cares_sounds (0.53)
struggling_traveling (0.34)
struggling_traveling (0.45)
makes_is (0.40)
holding_promising (0.33)
several_numerous (0.33)
prompt_select (0.20)
hand_shooting (0.39)
given_taking (0.24)
fantasy_ticket (0.48)
route_style (0.22)
office_agent (0.21)
iron_pickup (0.36)
publicly_badly (0.27)
famous_talented (0.36)
remain_want (0.27)
fantasy_ticket (0.24)
remain_want (0.22)
windows_pictures (0.21)
promotion_trauma (0.40)
rican_advisory (0.~7)
iron_pickup (0.29)
These results seem promising given the fact that 
the context vectors consist of only 15 units. It 
seems naive to believe that all syntactic informa- 
tion of the sequence of words to the left (or to the 
right) can be expressed in such a small number 
of units. A larger experiment with more hidden 
units for each context vector will hopefully yield 
better results. 
DISCUSSION AND CONCLUSION 
Brill and Marcus describe an approach with simi- 
lar goals in (Brill and Marcus 1992). Their method 
requires an initial consultation of a native speaker 
for a couple of hours. The method presented here
also requires a short consultation of a native
speaker, but it occurs at the end, as the last step
of category induction. This has the advantage
of avoiding bias in an initial a priori classification. 
Finch and Chater present an approach to cat- 
egory induction that also starts out with offset 
counts, proceeds by classifying words on the ba- 
sis of these counts, and then goes back to the lo- 
cal context for better results (Finch and Chater 
1992). But the mathematical and computational 
techniques used here seem to be more efficient and 
more accurate than Finch and Chater's, and hence 
applicable to vocabularies of a more realistic size. 
An important feature of the last step of the pro- 
cedure, the neural network, is that the lexicogra- 
pher or linguist can browse the space of output 
vectors for a given word to get a sense of its syn- 
tactic distribution (for instance uses of better as 
an adverb) or to improve the classification (for in- 
stance by splitting an induced category that is too 
coarse). The algorithm can also be used for cate- 
gorizing unseen words. This is possible as long as 
the words surrounding them are known.
The procedure for part-of-speech categorization 
introduced here may be of interest even for words 
whose part-of-speech labels are known. The di- 
mensionality reduction makes the global distribu- 
tional pattern of a word available in a profile con- 
sisting of a dozen or so real numbers. Because 
of its compactness, this profile can be used effi- 
ciently as an additional source of information for 
improving the performance of natural language 
processing systems. For example, adverbs may 
be lumped into one category in the lexicon of a 
processing system. But the category vectors of 
adverbs that are used in different positions such 
as completely (mainly pre-adjectival), normally
(mainly pre-verbal) and differently (mainly post- 
verbal) are different because of their different dis- 
tributional properties. This information can be 
exploited by a parser if the category vectors are 
available as an additional source of information. 
The model has also implications for language 
acquisition. (Maratsos and Chalkley 1981) pro- 
pose that the absolute position of words in sen- 
tences is important evidence in children's learn- 
ing of categories. The results presented here show 
that relative position is sufficient for learning the 
major syntactic categories. This suggests that rel- 
ative position could be important information for 
learning syntactic categories in child language ac- 
quisition. 
The basic idea of this paper is to collect a 
large amount of distributional information con- 
sisting of word cooccurrence counts and to com- 
pute a compact, low-rank approximation. The 
same approach was applied in (Schütze, forth-
coming) to the induction of vector representations 
for semantic information about words (a differ- 
ent source of distributional information was used 
there). Because of the graded information present 
in a multi-dimensional space, vector representa- 
tions are particularly well-suited for integrating 
different sources of information for disambigua- 
tion. 
In summary, the algorithm introduced here pro- 
vides a language-independent, largely automatic 
method for inducing highly text-specific syntactic 
categories for a large vocabulary. It is to be hoped 
that the method for distributional analysis pre- 
sented here will make it easier for computational 
and traditional lexicographers to build dictionar- 
ies that accurately reflect language use. 
ACKNOWLEDGMENTS 
I'm indebted to Mike Berry for SVDPACK and 
to Marti Hearst, Jan Pedersen and two anony- 
mous reviewers for very helpful comments. This 
work was partially supported by the National Cen- 
ter for Supercomputing Applications under grant 
BNS930000N. 

REFERENCES 
Berry, Michael W. 1992. Large-scale sparse singu- 
lar value computations. The International Jour- 
nal of Supercomputer Applications 6(1):13-49. 
Brill, Eric, and Mitch Marcus. 1992. Tagging an 
Unfamiliar Text with Minimal Human Supervi- 
sion. In Working Notes of the AAAI Fall Sym- 
posium on Probabilistic Approaches to Natural 
Language, ed. Robert Goldman. AAAI Press. 
Brown, Peter F., Vincent J. Della Pietra, Pe- 
ter V. deSouza, Jenifer C. Lai, and Robert L. 
Mercer. 1992. Class-Based n-gram Models of 
Natural Language. Computational Linguistics 
18(4):467-479. 
Cutting, Douglas R., Jan O. Pedersen, David 
Karger, and John W. Tukey. 1992. Scat- 
ter/Gather: A Cluster-based Approach to 
Browsing Large Document Collections. In Pro- 
ceedings of SIGIR '92. 
Deerwester, Scott, Susan T. Dumais, George W. 
Furnas, Thomas K. Landauer, and Richard 
Harshman. 1990. Indexing by latent semantic 
analysis. Journal of the American Society for 
Information Science 41(6):391-407. 
Elman, Jeffrey L. 1990. Finding Structure in 
Time. Cognitive Science 14:179-211. 
Elman, Jeffrey L. 1991. Distributed Repre- 
sentations, Simple Recurrent Networks, and 
Grammatical Structure. Machine Learning 
7(2/3):195-225. 
Finch, Steven, and Nick Chater. 1992. Boot- 
strapping Syntactic Categories Using Statisti- 
cal Methods. In Background and Experiments 
in Machine Learning of Natural Language, ed. 
Walter Daelemans and David Powers. Tilburg 
University. Institute for Language Technology 
and AI. 
Maratsos, M. P., and M. Chalkley. 1981. The inter- 
nal language of children's syntax: the ontogene- 
sis and representation of syntactic categories. In 
Children's language, ed. K. Nelson. New York: 
Gardner Press. 
Pinker, Steven. 1984. Language Learnability and 
Language Development. Cambridge MA: Har- 
vard University Press. 
Rumelhart, D. E., G. E. Hinton, and R. J. 
Williams. 1986. Learning Internal Representa- 
tions by Error Propagation. In Parallel Dis- 
tributed Processing. Explorations in the Mi- 
crostructure of Cognition. Volume I: Founda- 
tions, ed. David E. Rumelhart, James L. Mc- 
Clelland, and the PDP Research Group. Cam- 
bridge MA: The MIT Press. 
Schütze, Hinrich. Forthcoming. Word Space. In
Advances in Neural Information Processing Sys- 
tems 5, ed. Stephen J. Hanson, Jack D. Cowan, 
and C. Lee Giles. San Mateo CA: Morgan Kauf- 
mann. 
Williams, Ronald J., and Jing Peng. 1990. An Ef- 
ficient Gradient-Based Algorithm for On-Line 
Training of Recurrent Network Trajectories. 
Neural Computation 2:490-501. 
