Finding Parts in Very Large Corpora 
Matthew Berland, Eugene Charniak 
{mb, ec}@cs.brown.edu
Department of Computer Science 
Brown University, Box 1910 
Providence, RI 02912 
Abstract 
We present a method for extracting parts of objects 
from wholes (e.g. "speedometer" from "car"). Given 
a very large corpus our method finds part words with 
55% accuracy for the top 50 words as ranked by the 
system. The part list could be scanned by an end-user 
and added to an existing ontology (such as WordNet), 
or used as a part of a rough semantic lexicon. 
1 Introduction 
We present a method of extracting parts of objects 
from wholes (e.g. "speedometer" from "car"). To 
be more precise, given a single word denoting some 
entity that has recognizable parts, the system finds 
and rank-orders other words that may denote parts 
of the entity in question. Thus the relation found 
is strictly speaking between words, a relation Miller 
\[1\] calls "meronymy." In this paper we use the more 
colloquial "part-of" terminology. 
We produce words with 55% accuracy for the top
50 words ranked by the system, given a very large 
corpus. Lacking an objective definition of the part-of 
relation, we use the majority judgment of five human 
subjects to decide which proposed parts are correct. 
The program's output could be scanned by an end- 
user and added to an existing ontology (e.g., Word- 
Net), or used as a part of a rough semantic lexicon. 
To the best of our knowledge, there is no published 
work on automatically finding parts from unlabeled 
corpora. Casting our nets wider, the work most sim- 
ilar to what we present here is that by Hearst \[2\] on 
acquisition of hyponyms ("isa" relations). In that pa- 
per Hearst (a) finds lexical correlates to the hyponym 
relations by looking in text for cases where known hy- 
ponyms appear in proximity (e.g., in the construction 
(NP, NP and (NP other NN)) as in "boats, cars, and 
other vehicles"), (b) tests the proposed patterns for 
validity, and (c) uses them to extract relations from 
a corpus. In this paper we apply much the same 
methodology to the part-of relation. Indeed, in \[2\] 
Hearst states that she tried to apply this strategy to 
the part-of relation, but failed. We comment later on 
the differences in our approach that we believe were 
most important to our comparative success. 
Looking more widely still, there is an ever- 
growing literature on the use of statistical/corpus- 
based techniques in the automatic acquisition of 
lexical-semantic knowledge (\[3-8\]). We take it as ax- 
iomatic that such knowledge is tremendously useful 
in a wide variety of tasks, from lower-level tasks like 
noun-phrase reference and parsing to user-level tasks
such as web searches, question answering, and digest- 
ing. Certainly the large number of projects that use 
WordNet \[1\] would support this contention. And al- 
though WordNet is hand-built, there is general agree- 
ment that corpus-based methods have an advantage 
in the relative completeness of their coverage, partic- 
ularly when used as supplements to the more labor- 
intensive methods. 
2 Finding Parts 
2.1 Parts 
Webster's Dictionary defines "part" as "one of the 
often indefinite or unequal subdivisions into which 
something is or is regarded as divided and which to- 
gether constitute the whole." The vagueness of this 
definition translates into a lack of guidance on exactly 
what constitutes a part, which in turn translates into 
some doubts about evaluating the results of any pro- 
cedure that claims to find them. More specifically, 
note that the definition does not claim that parts 
must be physical objects. Thus, say, "novel" might 
have "plot" as a part. 
In this study we handle this problem by asking in- 
formants which words in a list are parts of some target 
word, and then declaring majority opinion to be cor- 
rect. We give more details on this aspect of the study 
later. Here we simply note that while our subjects 
often disagreed, there was fair consensus that what 
might count as a part depends on the nature of the 
word: a physical object yields physical parts, an in- 
stitution yields its members, and a concept yields its 
characteristics and processes. In other words, "floor" 
is part of "building" and "plot" is part of "book." 
2.2 Patterns 
Our first goal is to find lexical patterns that tend to 
indicate part-whole relations. Following Hearst \[2\], 
we find possible patterns by taking two words that 
are in a part-whole relation (e.g., basement and building)
and finding sentences in our corpus (we used the
North American News Corpus (NANC) from LDC) 
that have these words within close proximity. The 
first few such sentences are: 
... the basement of the building. 
... the basement in question is 
in a four-story apartment building ... 
... the basement of the apartment building. 
From the building's basement ... 
... the basement of a building ... 
... the basements of buildings ... 
From these examples we construct the five pat- 
terns shown in Table 1. We assume here that parts 
and wholes are represented by individual lexical items 
(more specifically, as head nouns of noun-phrases) as 
opposed to complete noun phrases, or as a sequence of 
"important" noun modifiers together with the head. 
This occasionally causes problems, e.g., "conditioner" 
was marked by our informants as not part of "car", 
whereas "air conditioner" probably would have made 
it into a part list. Nevertheless, in most cases head 
nouns have worked quite well on their own. 
We evaluated these patterns by observing how 
they performed in an experiment on a single example. 
Table 2 shows the 20 highest ranked part words (with 
the seed word "car") for each of the patterns A-E. 
(We discuss later how the rankings were obtained.) 
Table 2 shows patterns A and B clearly outper- 
form patterns C, D, and E. Although parts occur in 
all five patterns, the lists for A and B are predominately
parts-oriented. The relatively poor performance
of patterns C and E was anticipated, as many
things occur "in" cars (or buildings, etc.) other than 
their parts. Pattern D is not so obviously bad as it 
differs from the plural case of pattern B only in the 
lack of the determiner "the" or "a". However, this 
difference proves critical in that pattern D tends to 
pick up "counting" nouns such as "truckload." On 
the basis of this experiment we decided to proceed 
using only patterns A and B from Table 1. 
A. whole NN[-PL] 's POS part NN[-PL]
   ... building's basement ...
B. part NN[-PL] of PREP {the|a} DET mods [JJ|NN]* whole NN
   ... basement of a building ...
C. part NN in PREP {the|a} DET mods [JJ|NN]* whole NN
   ... basement in a building ...
D. parts NN-PL of PREP wholes NN-PL
   ... basements of buildings ...
E. parts NN-PL in PREP wholes NN-PL
   ... basements in buildings ...
Format: type_of_word TAG type_of_word TAG ...
NN = Noun, NN-PL = Plural Noun
DET = Determiner, PREP = Preposition
POS = Possessive, JJ = Adjective
Table 1: Patterns for partOf(basement, building)
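
The matching step for patterns A and B can be sketched as follows. This is a minimal illustration, not our implementation: it assumes the corpus has already been tagged with the tag names of Table 1, and the token representation, the function name `find_part_candidates`, and the naive plural stripping in `singular` are hypothetical choices made for the example.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, tag); tag names follow Table 1

NOUN_TAGS = {"NN", "NN-PL"}
MOD_TAGS = {"JJ", "NN"}


def singular(word: str) -> str:
    """Naive plural stripping (assumption; a real system would lemmatize)."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def find_part_candidates(tokens: List[Token], whole: str) -> List[Tuple[str, str]]:
    """Return (pattern, part) pairs for patterns A and B of Table 1."""
    hits = []
    n = len(tokens)
    for i, (word, tag) in enumerate(tokens):
        if tag not in NOUN_TAGS:
            continue
        # Pattern A: whole NN[-PL] 's POS part NN[-PL]
        if (singular(word) == whole and i + 2 < n
                and tokens[i + 1][1] == "POS"
                and tokens[i + 2][1] in NOUN_TAGS):
            hits.append(("A", singular(tokens[i + 2][0])))
        # Pattern B: part NN[-PL] of PREP {the|a} DET mods [JJ|NN]* whole NN
        if (i + 3 < n
                and tokens[i + 1] == ("of", "PREP")
                and tokens[i + 2][1] == "DET"
                and tokens[i + 2][0] in {"the", "a"}):
            j = i + 3
            # Skip modifiers so that j ends on the head (last) noun.
            while (j + 1 < n and tokens[j][1] in MOD_TAGS
                   and tokens[j + 1][1] in MOD_TAGS):
                j += 1
            if tokens[j] == (whole, "NN"):
                hits.append(("B", singular(word)))
    return hits
```

Because pattern B requires the whole to be the head noun, "the re-enactment of the car crash" yields no match when "crash" is tagged as a noun, but does yield a spurious match when "crash" is tagged as a verb.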
3 Algorithm 
3.1 Input 
We use the LDC North American News Corpus 
(NANC), which is a compilation of the wire output
of several US newspapers. The total corpus is about 
100,000,000 words. We ran our program on the whole 
data set, which takes roughly four hours on our net- 
work. The bulk of that time (around 90%) is spent 
tagging the corpus. 
As is typical in this sort of work, we assume that 
our evidence (occurrences of patterns A and B) is 
independently and identically distributed (iid). We
have found this assumption reasonable, but its break- 
down has led to a few errors. In particular, a draw- 
back of the NANC is the occurrence of repeated ar- 
ticles; since the corpus consists of all of the articles 
that come over the wire, some days include multiple, 
updated versions of the same story, containing iden- 
tical paragraphs or sentences. We wrote programs 
to weed out such cases, but ultimately found them 
of little use. First, "update" articles still have sub- 
stantial variation, so there is a continuum between 
these and articles that are simply on the same topic. 
Second, our data is so sparse that any such repeats 
are very unlikely to manifest themselves as repeated 
examples of part-type patterns. Nevertheless since 
two or three occurrences of a word can make it rank 
highly, our results have a few anomalies that stem 
from failure of the iid assumption (e.g., quite appro- 
priately, "clunker"). 
Pattern A
  headlight windshield ignition shifter dashboard radiator
  brake tailpipe pipe airbag speedometer converter hood
  trunk visor vent wheel occupant engine tyre
Pattern B
  trunk wheel driver hood occupant seat bumper backseat
  dashboard jalopy fender rear roof windshield back
  clunker window shipment reenactment axle
Pattern C
  passenger gunmen leaflet hop houseplant airbag gun
  koran cocaine getaway motorist phone men indecency
  person ride woman detonator kid key
Pattern D
  import caravan make dozen carcass shipment hundred
  thousand sale export model truckload queue million
  boatload inventory hood registration trunk ten
Pattern E
  airbag packet switch gem amateur device handgun
  passenger fire smuggler phone tag driver weapon meal
  compartment croatian defect refugee delay
Table 2: Grammatical Pattern Comparison
Our seeds are one word (such as "car") and its 
plural. We do not claim that all single words would 
fare as well as our seeds, as we picked highly probable 
words for our corpus (such as "building" and "hos- 
pital") that we thought would have parts that might 
also be mentioned therein. With enough text, one 
could probably get reasonable results with any noun 
that met these criteria. 
3.2 Statistical Methods 
The program has three phases. The first identifies 
and records all occurrences of patterns A and B in our 
corpus. The second filters out all words ending with
"ing", "ness", or "ity", since these suffixes typically
occur in words that denote a quality rather than a 
physical object. Finally we order the possible parts 
by the likelihood that they are true parts according 
to some appropriate metric. 
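
The three phases can be sketched as a small pipeline. This is only an outline under stated assumptions: phase one is taken to have produced a list of (whole, part) pairs already, and the function name `rank_parts` and the pluggable `score` argument are hypothetical, standing in for whichever ranking metric is chosen.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

QUALITY_SUFFIXES = ("ing", "ness", "ity")


def rank_parts(pairs: Iterable[Tuple[str, str]],
               whole: str,
               score: Callable[[str, str, Counter], float]) -> List[str]:
    """pairs: (whole, part) occurrences recorded in phase one."""
    counts = Counter(pairs)
    # Phase two: drop candidates whose suffix suggests a quality.
    candidates = {p for (w, p) in counts
                  if w == whole and not p.endswith(QUALITY_SUFFIXES)}
    # Phase three: order by the supplied metric, best first
    # (ties broken alphabetically so the output is deterministic).
    return sorted(candidates, key=lambda p: (-score(whole, p, counts), p))
```

For a quick check, a raw co-occurrence count can be passed as the metric, e.g. `rank_parts(pairs, "car", lambda w, p, c: c[(w, p)])`; the metrics discussed in the rest of this section would replace that lambda.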
We took some care in the selection of this metric.
At an intuitive level the metric should be something
like $p(w \mid p)$. (Here and in what follows $w$
denotes the outcome of the random variable generating
wholes, and $p$ the outcome for parts. $W(w)$
states that $w$ appears in the patterns AB as a whole,
while $P(p)$ states that $p$ appears as a part.) Metrics
of the form $p(w \mid p)$ have the desirable property
that they are invariant over $p$ with radically different
base frequencies, and for this reason have been widely
used in corpus-based lexical semantic research \[3,6,9\].
However, in making this intuitive idea somewhat more
precise we found two closely related versions:
$$p(w, W(w) \mid p)$$
$$p(w, W(w) \mid p, P(p))$$
We call metrics based on the first of these "loosely 
conditioned" and those based on the second "strongly 
conditioned". 
While invariance with respect to frequency is gen- 
erally a good property, such invariant metrics can 
lead to bad results when used with sparse data. In 
particular, if a part word p has occurred only once in 
the data in the AB patterns, then perforce $p(w \mid p) = 1$
for the entity $w$ with which it is paired. Thus
this metric must be tempered to take into account
the quantity of data that supports its conclusion. To
put this another way, we want to pick $(w, p)$ pairs
that have two properties: $p(w \mid p)$ is high and $|w, p|$
is large. We need a metric that combines these two
desiderata in a natural way. 
We tried two such metrics. The first is Dunning's
\[10\] log-likelihood metric, which measures how
"surprised" one would be to observe the data counts
$|w, p|$, $|\neg w, p|$, $|w, \neg p|$, and $|\neg w, \neg p|$ if one
assumes that $p(w \mid p) = p(w)$. Intuitively this will be
high when the observed $p(w \mid p) \gg p(w)$ and when
the counts supporting this calculation are large.
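
For concreteness, the log-likelihood statistic over the four counts can be computed as below. This is a sketch using the standard $G^2 = 2 \sum O \ln(O/E)$ form of the likelihood-ratio statistic for a 2x2 table; the function and argument names are hypothetical.

```python
import math


def log_likelihood(k_wp: int, k_nwp: int, k_wnp: int, k_nwnp: int) -> float:
    """G^2 statistic for a 2x2 table of co-occurrence counts.

    k_wp   = |w, p|        k_nwp  = |not-w, p|
    k_wnp  = |w, not-p|    k_nwnp = |not-w, not-p|
    """
    table = [[k_wp, k_nwp], [k_wnp, k_nwnp]]
    total = sum(map(sum, table))
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            if observed:  # the term 0 * log(0) is taken to be 0
                expected = row[i] * col[j] / total
                g2 += observed * math.log(observed / expected)
    return 2.0 * g2
```

The statistic is zero when the counts are exactly what independence predicts, and it grows linearly if all four counts are multiplied by the same factor, which is why it rewards frequent pairs. Note that it is two-sided: it is also large when $p(w \mid p)$ is far below $p(w)$, so a ranking would additionally need to check the direction of the difference.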
The second metric is proposed by Johnson (per- 
sonal communication). He suggests asking the ques- 
tion: how far apart can we be sure the distributions 
$p(w \mid p)$ and $p(w)$ are if we require a particular significance
level, say .05 or .01. We call this new test the
"significant-difference" test, or sigdiff. Johnson observes
that compared to sigdiff, log-likelihood tends
to overestimate the importance of data frequency at
the expense of the distance between $p(w \mid p)$ and
$p(w)$.
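
The exact formula behind sigdiff is not given here, so the following is only one possible reading of the idea, not the test as actually implemented. It assumes a one-sided Wilson score lower bound on $p(w \mid p)$ and treats the base rate $p(w)$ as a known constant; the function name, arguments, and default `z` are all hypothetical.

```python
import math


def sigdiff(k_wp: int, k_p: int, p_w: float, z: float = 1.645) -> float:
    """Lower confidence bound on p(w|p), minus the base rate p(w).

    k_wp: times part p was seen with whole w; k_p: times p was seen as a part.
    Uses a one-sided Wilson score bound (an assumption of this sketch);
    z = 1.645 corresponds roughly to the .05 level.
    """
    if k_p == 0:
        return -p_w
    p_hat = k_wp / k_p
    z2 = z * z
    centre = p_hat + z2 / (2 * k_p)
    spread = z * math.sqrt(p_hat * (1 - p_hat) / k_p + z2 / (4 * k_p * k_p))
    lower = (centre - spread) / (1 + z2 / k_p)
    return lower - p_w
```

Under this reading, a part seen once and only with the target whole has an observed $p(w \mid p) = 1$ but a guaranteed difference of well under 1, while ten such observations give a much larger guaranteed difference. This is the tempering of sparse counts that the discussion above calls for.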
3.3 Comparison 
Table 3 shows the 20 highest ranked words for each 
statistical method, using the seed word "car." The 
first group contains the words found for the method 
we perceive as the most accurate, sigdiff and strong 
conditioning. The other groups show the differences 
between them and the first group. The + category 
means that this method adds the word to its list, - 
means the opposite. For example, "back" is on the 
sigdiff-loose list but not the sigdiff-strong list. 
In general, sigdiff worked better than surprise and 
strong conditioning worked better than loose condi- 
tioning. In both cases the less favored methods tend 
to promote words that are less specific ("back" over
"airbag", "use" over "radiator"). Furthermore, the
combination of sigdiff and strong conditioning worked
better than either by itself. Thus all results in this
paper, unless explicitly noted otherwise, were gathered
using sigdiff and strong conditioning combined.

Sigdiff, Strong
  airbag brake bumper dashboard driver fender headlight
  hood ignition occupant pipe radiator seat shifter
  speedometer tailpipe trunk vent wheel windshield
Sigdiff, Loose
  + back backseat oversteer rear roof vehicle visor
  - airbag brake bumper pipe speedometer tailpipe vent
Surprise, Strong
  + back cost engine owner price rear roof use value window
  - airbag bumper fender ignition pipe radiator shifter
    speedometer tailpipe vent
Surprise, Loose
  + back cost engine front owner price rear roof side
    value version window
  - airbag brake bumper dashboard fender ignition pipe
    radiator shifter speedometer tailpipe vent
Table 3: Methods Comparison
4 Results 
4.1 Testing Humans 
We tested five subjects (all of whom were unaware 
of our goals) for their concept of a "part." We asked 
them to rate sets of 100 words, of which 50 were in our 
final results set. Tables 6 - 11 show the top 50 words 
for each of our six seed words along with the number
of subjects who marked the word as a part of the seed
concept. The scores of individual words vary greatly,
but there was relative consensus on most words. We
put an asterisk next to words that the majority of
subjects marked as correct. Lacking a formal definition
of part, we can only define those words as correct
and the rest as wrong. While the scoring is admittedly
not perfect (see footnote 1), it provides an adequate
reference result.

Top N   book  building  car  hospital  plant  school
  10      8       7       8      7        5     10
  20     14      12      17     16       10     14
  30     20      18      23     21       15     20
  40     24      21      26     23       20     26
  50     28      29      31     26       22     31
Table 4: Result Scores
Table 4 summarizes these results. There we show 
the number of correct part words in the top 10, 20, 
30, 40, and 50 parts for each seed (e.g., for "book", 8 
of the top 10 are parts, and 14 of the top 20). Over- 
all, about 55% of the top 50 words for each seed are 
parts, and about 70% of the top 20 for each seed. The 
reader should also note that we tried one ambiguous
word, "plant", to see what would happen. Our
program finds parts corresponding to both senses, 
though given the nature of our text, the industrial use 
is more common. Our subjects marked both kinds of 
parts as correct, but even so, this produced the weak- 
est part list of the six words we tried. 
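
The summary percentages can be checked directly from the counts in Table 4. The snippet below is just that arithmetic; the assignment of the "building" and "car" columns is read off the table layout, and the totals do not depend on it. The names `TABLE_4`, `CUTOFFS`, and `accuracy_at` are hypothetical.

```python
# Correct parts among the top 10/20/30/40/50 candidates, from Table 4.
TABLE_4 = {
    "book":     [8, 14, 20, 24, 28],
    "building": [7, 12, 18, 21, 29],
    "car":      [8, 17, 23, 26, 31],
    "hospital": [7, 16, 21, 23, 26],
    "plant":    [5, 10, 15, 20, 22],
    "school":   [10, 14, 20, 26, 31],
}
CUTOFFS = [10, 20, 30, 40, 50]


def accuracy_at(cutoff: int) -> float:
    """Fraction of correct parts among the top `cutoff` words, over all seeds."""
    idx = CUTOFFS.index(cutoff)
    correct = sum(scores[idx] for scores in TABLE_4.values())
    return correct / (cutoff * len(TABLE_4))


# Seed with the fewest correct parts in its top 50.
weakest = min(TABLE_4, key=lambda seed: TABLE_4[seed][-1])
```

Here `accuracy_at(50)` is 167/300 and `accuracy_at(20)` is 83/120, in line with the "about 55%" and "about 70%" figures, and `weakest` is "plant".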
As a baseline we also tried using as our "pattern" 
the head nouns that immediately surround our target 
word. We then applied the same "strong condition- 
ing, sigdiff" statistical test to rank the candidates. 
This performed quite poorly. Of the top 50 candi- 
dates for each target, only 8% were parts, as opposed 
to the 55% for our program. 
4.2 WordNet 
WordNet
  + door engine floorboard gear grille horn mirror roof
    tailfin window
  - brake bumper dashboard driver headlight ignition
    occupant pipe radiator seat shifter speedometer
    tailpipe vent wheel windshield
Table 5: WordNet Comparison
We also compared our parts list to those of WordNet.
Table 5 shows the parts of "car" in WordNet
that are not in our top 20 (+) and the words in our 
top 20 that are not in WordNet (-). There are defi- 
nite tradeoffs, although we would argue that our top- 
20 set is both more specific and more comprehensive. 
Two notable words our top 20 lack are "engine" and 
"door", both of which occur before 100. More gener- 
ally, all WordNet parts occur somewhere before 500, 
with the exception of "tailfin", which never occurs
with car. It would seem that our program would be
a good tool for expanding WordNet, as a person can
scan and mark the list of part words in a few minutes.

Footnote 1: For instance, "shifter" is undeniably part of a car, while
"production" is only arguably part of a plant.
5 Discussion and Conclusions 
The program presented here can find parts of objects 
given a word denoting the whole object and a large 
corpus of unmarked text. The program is about 55% 
accurate for the top 50 proposed parts for each of six 
examples upon which we tested it. There does not 
seem to be a single cause for the 45% of the cases 
that are mistakes. We present here a few problems 
that have caught our attention. 
Idiomatic phrases like "a jalopy of a car" or "the 
son of a gun" provide problems that are not easily 
weeded out. Depending on the data, these phrases 
can be as prevalent as the legitimate parts. 
In some cases problems arose because of tagger 
mistakes. For example, "re-enactment" would be 
found as part of a "car" using pattern B in the 
phrase "the re-enactment of the car crash" if "crash" 
is tagged as a verb. 
The program had some tendency to find qualities 
of objects. For example, "driveability" is strongly 
correlated with car. We try to weed out most of the 
qualities by removing words with the suffixes "ness",
"ing", and "ity."
The most persistent problem is sparse data, which 
is the source of most of the noise. More data would 
almost certainly allow us to produce better lists, 
both because the statistics we are currently collecting
would be more accurate and because larger numbers
would allow us to find other reliable indicators.
For example, idiomatic phrases might be recognized 
as such. So we see "jalopy of a car" (two times) but 
not, of course, "the car's jalopy". Words that appear 
in only one of the two patterns are suspect, but to use 
this rule we need sufficient counts on the good words 
to be sure we have a representative sample. At 100 
million words, the NANC is not exactly small, but 
we were able to process it in about four hours with 
the machines at our disposal, so still larger corpora 
would not be out of the question. 
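
The "only one of the two patterns" heuristic mentioned above could be sketched as follows. This is a hypothetical illustration of a rule we have not applied: the function name `suspect_candidates`, the input format of (pattern, part) pairs, and the `min_count` threshold are all assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def suspect_candidates(hits: Iterable[Tuple[str, str]],
                       min_count: int = 2) -> Set[str]:
    """hits: (pattern, part) pairs with pattern in {"A", "B"}.

    Flags parts seen at least `min_count` times but in only one pattern.
    Rarer words are left alone, since their counts are too small to judge.
    """
    per_part: Dict[str, Counter] = {}
    for pattern, part in hits:
        per_part.setdefault(part, Counter())[pattern] += 1
    return {part for part, seen in per_part.items()
            if sum(seen.values()) >= min_count and len(seen) == 1}
```

With "jalopy" seen twice in pattern B and never in pattern A, it is flagged, whereas a word seen once in each pattern is not. The threshold matters: with counts as sparse as ours, a low `min_count` would flag many genuine parts as well.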
Finally, as noted above, Hearst \[2\] tried to find 
parts in corpora but did not achieve good results. 
She does not say what procedures were used, but as- 
suming that the work closely paralleled her work on 
hyponyms, we suspect that our relative success was 
due to our very large corpus and the use of more re- 
fined statistical measures for ranking the output. 
6 Acknowledgments 
This research was funded in part by NSF grant IRI- 
9319516 and ONR Grant N0014-96-1-0549. Thanks 
to the entire statistical NLP group at Brown, and 
particularly to Mark Johnson, Brian Roark, Gideon 
Mann, and Ann-Maria Popescu who provided invalu- 
able help on the project. 

References 
\[1\] George Miller, Richard Beckwith, Cristiane Fell- 
baum, Derek Gross & Katherine J. Miller, "WordNet:
an on-line lexical database," International
Journal of Lexicography 3 (1990), 235-245. 
\[2\] Marti Hearst, "Automatic acquisition of hy- 
ponyms from large text corpora," in Proceed- 
ings of the Fourteenth International Conference 
on Computational Linguistics, 1992.
\[3\] Ellen Riloff & Jessica Shepherd, "A corpus-based 
approach for building semantic lexicons," in Pro- 
ceedings of the Second Conference on Empirical 
Methods in Natural Language Processing, 1997, 
117-124. 
\[4\] Dekang Lin, "Automatic retrieval and cluster- 
ing of similar words," in 36th Annual Meeting 
of the Association for Computational Linguistics 
and 17th International Conference on Computa- 
tional Linguistics, 1998, 768-774. 
\[5\] Gregory Grefenstette, "SEXTANT: extracting se- 
mantics from raw text implementation details," 
Heuristics: The Journal of Knowledge Engineer- 
ing (1993). 
\[6\] Brian Roark & Eugene Charniak, "Noun-phrase 
co-occurrence statistics for semi-automatic se- 
mantic lexicon construction," in 36th Annual 
Meeting of the Association for Computational 
Linguistics and 17th International Conference on 
Computational Linguistics, 1998, 1110-1116. 
\[7\] Vasileios Hatzivassiloglou & Kathleen R. McKe- 
own, "Predicting the semantic orientation of ad- 
jectives," in Proceedings of the 35th Annual Meet- 
ing of the ACL, 1997, 174-181. 
\[8\] Stephen D. Richardson, William B. Dolan & Lucy 
Vanderwende, "MindNet: acquiring and structur- 
ing semantic information from text," in 36th An- 
nual Meeting of the Association for Computa- 
tional Linguistics and 17th International Confer- 
ence on Computational Linguistics, 1998, 1098- 
1102. 
\[9\] William A. Gale, Kenneth W. Church & David 
Yarowsky, "A method for disambiguating word 
senses in a large corpus," Computers and the Hu- 
manities (1992). 
\[10\] Ted Dunning, "Accurate methods for the statis- 
tics of surprise and coincidence," Computational 
Linguistics 19 (1993), 61-74. 
