Fast Statistical Parsing of Noun Phrases 
for Document Indexing 
Chengxiang Zhai 
Laboratory for Computational Linguistics 
Carnegie Mellon University 
Pittsburgh, PA 15213 
{cz25@andrew.cmu.edu} 
Abstract 
Information Retrieval (IR) is an important 
application area of Natural Language Pro- 
cessing (NLP) where one encounters the 
genuine challenge of processing large quan- 
tities of unrestricted natural language text. 
While much effort has been made to apply 
NLP techniques to IR, very few NLP tech- 
niques have been evaluated on a document 
collection larger than several megabytes. 
Many NLP techniques are simply not ef- 
ficient enough, and not robust enough, to 
handle a large amount of text. This pa- 
per proposes a new probabilistic model for 
noun phrase parsing, and reports on the 
application of such a parsing technique to 
enhance document indexing. The effective- 
ness of using syntactic phrases provided by 
the parser to supplement single words for 
indexing is evaluated with a 250 megabytes 
document collection. The experiment's re- 
sults show that supplementing single words 
with syntactic phrases for indexing consis- 
tently and significantly improves retrieval 
performance. 
1 Introduction 
Information Retrieval (IR) is an increasingly impor- 
tant application area of Natural Language Process- 
ing (NLP). An IR task can be described as to find, 
from a given document collection, a subset of docu- 
ments whose content is relevant to the information 
need of a user as expressed by a query. As the doc- 
uments and query are often natural language texts, 
an IR task can usually be regarded as a special NLP 
task, where the document text and the query text 
need to be processed in order to judge the relevancy. 
A general strategy followed by most IR systems is 
to transform documents and the query into certain 
level of representation. A query representation can 
then be compared with a document representation 
to decide if the document is relevant to the query. In 
practice, the level of representation in an IR system 
is quite "shallow" -- often merely a set of word-like 
strings, or indexing terms. The process to extract in- 
dexing terms from each document in the collection 
is called indexing. A query is often subject to simi- 
lax processing, and the relevancy is judged based on 
the matching of query terms and document terms. 
In most systems, weights are assigned to terms to 
indicate how well they can be used to discriminate 
relevant documents from irrelevant ones. 
The challenge in applying NLP to IR is to deal 
with a large amount of unrestricted natural lan- 
guage text. The NLP techniques used must be very 
efficient and robust, since the amount of text in 
the databases accessed is typically measured in gi- 
gabytes. In the past, NLP techniques of different 
levels, including morphological, syntactic/semantic, 
and discourse processing, were exploited to enhance 
retrieval (Smeaton 92; Lewis and Spaxck Jones 96), 
but were rarely evaluated using collections of docu- 
ments larger than several megabytes. Many NLP 
techniques are simply not efficient enough or are 
too labor-intensive to successfully handle a large size 
document set. However, there are some exceptions. 
Evans et al. used selective NLP techniques, that are 
especially robust and efficient, for indexing (Evans 
et al. 91). Strzalkowski reported a fast and robust 
parser called TTP in (Strzalkowski 92; Strzalkowski 
and Vauthey 92). These NLP techniques have been 
successfully used to process quite large collections, 
as shown in a series of TREC conference reports by 
the CLARIT TM1 system group and the New York 
University (later GE/NYU) group (cf., for example, 
(Evans and Lefferts 95; Evans et al. 96), and (Strza- 
lkowski 95; Strzalkowski et al. 96)) These research 
efforts demonstrated the feasibility of using selec- 
tive NLP to handle large collections. A special NLP 
track emphasizing the evaluation of NLP techniques 
for IR is currently held in the context of TREC (Hat- 
man 96). 
In this paper, a fast probabilistic noun phrase 
parser is described. The parser can be exploited to 
1CLARIT is a registered trademark of CLARITECH 
Corporation. 
312 
automatically extract syntactic phrases from a large 
amount of documents for indexing. A 250-megabyte 
document set 2 is used to evaluate the effectiveness of 
indexing using the phrases extracted by the parser. 
The experiment's results show that using syntactic 
phrases to supplement single words for indexing im- 
proves the retrieval performance significantly. This 
is quite encouraging compared to earlier experiments 
on phrase indexing. The noun phrase parser pro- 
vides the possibility of combining different kinds of 
phrases with single words. 
The rest of the paper is organized as follows. Sec- 
tion 2 discusses document indexing, and argues for 
the rationality of using syntactic phrases for index- 
ing; Section 3 describes the fast noun phrase parser 
that we use to extract candidate phrases; Section 4 
describes how we use a commercial IR system to per- 
form the desired experiments; Section 5 reports and 
discusses the experiment results; Section 6 summa- 
rizes the conclusions. 
2 Phrases for Document Indexing 
In most current IR systems, documents are primarily 
indexed by single words, sometimes supplemented by 
phrases obtained with statistical approaches, such as 
frequency counting of adjacent word pairs. However, 
single words are often ambiguous and not specific 
enough for accurate discrimination of documents. 
For example, only using the word "baalS' and "ter- 
minology" for indexing is not enough to distinguish 
"bank terminology" from "terminology baalS'. More 
specific indexing units are needed. Syntactic phrases 
(i.e., phrases with certain syntactic relations) are al- 
most always more specific than single words and thus 
are intuitively attractive for indexing. For example, 
if "bank terminology" occurs in the document, then, 
we can use the phrase "bank terminology" as an ad- 
ditional unit to supplement the single words "banld' 
and "terminology" for indexing. In this way, a query 
with "terminology banlZ' will match better with the 
document than one with "bank terminology", since 
the indexing phrase "bank terminology" provides ex- 
tra discrimination. 
Despite the intuitive rationality of using phrases 
for indexing, syntactic phrases have been reported 
to show no significant improvement of retrieval per- 
formance (Lewis 91; Belkin and Croft 87; Fagan 
87). Moreover Fagan (Fagan 87) found that syn- 
tactic phrases are not superior to simple statistical 
phrases. Lewis discussed why the syntactic phrase 
indexing has not worked and concluded that the 
problems with syntactic phrases are for the most 
part statistical (Lewis 91). Indeed, many (perhaps 
most) syntactic phrases have very low frequency and 
tend to be over-weighted by the normal weighting 
method. However, the size of the collection used in 
2the Wall Street Journal database in Tipster Disk2 
(Harman 96) 
these early experiments is relatively small. We want 
to see if a much larger size of collection will make a 
difference. It is possible that a larger document col- 
lection might increase the frequency of most phrases, 
and thus alleviate the problem of low frequency. 
We only consider noun phrases and the sub- 
phrases derived from them. Specifically, we want to 
obtain the full modification structure of each noun 
phrase in the documents and query. From the view- 
point of NLP, the task is noun phrase parsing (i.e., 
the analysis of noun phrase structure). When the 
phrases are used only to supplement, not replace, 
the single words for indexing, some parsing errors 
may be tolerable. This means that the penalty for 
a parsing error may not be significant. The chal- 
lenge, however, is to be able to parse gigabytes of 
text in practically feasible time and as accurately 
as possible. The previous work taking on this chal- 
lenge includes (Evans et al. 91; Evans et al. 96; 
Evans and Zhal 96; Strzalkowski and Carballo 94; 
Strzalkowski et al. 95) among others. Evans et 
al. exploited the "attestedness" of subphrases to 
partially reveal the structure of long noun phrases 
(Evans et al. 91; Evans et al. 96). Strzalkowski et 
al. adopted a fast Tagged Text Parser (TTP) to ex- 
tract head modifier pairs including those in a noun 
phrase (Strzalkowski 92; Strzalkowski and Vauthey 
92; Strzalkowski and Carballo 94; Strzalkowski et 
al. 95). In (Strzalkowski et al. 95), the structure 
of a noun phrase is disambiguated based on certain 
statistical heuristics, but there seems to be no ef- 
fort to assign a full structure to every noun phrase. 
Furthermore, manual effort is needed in constructing 
grammar rules. Thus, the approach in (Strzalkowski 
et M. 95) does not address the special need of scal- 
ability and robustness along with speed. Evans and 
Zhai explored a hybrid noun phrase analysis method 
and used a quite rich set of phrases for document in- 
dexing (Evans and Zhai 96). The indexing method 
was evaluated using the Associated Press newswire 
89 (AP89) database in Tipster Diskl, and a general 
improvement of retrieval performance over the in- 
dexing with single words and full noun phrases was 
reported. However, the phrase extraction system 
as reported in (Evans and Zhal 96) is still not fast 
enough to deal with document collections measured 
by gigabytes. 3 
We propose here a probabilistic model of noun 
phrase parsing. A fast statistical noun phrase parser 
has been developed based on the probabilistic model. 
The parser works fast and can be scaled up to parse 
gigabytes text within acceptable time. 4 Our goal 
is to generate different kinds of candidate syntactic 
3It was reported to take about 3.5 hours to process 
20 MB documents 
4With a 133MH DEC alpha workstation, it is esti- 
mated to parse at a speed of 4 hours/gigabyte-text or 
8 hours/gigabyte-nps, after 20 hours of training with 1 
gigabyte text 
313 
phrases from the structure of a noun phrase so that 
the effectiveness of different combinations of phrases 
and single words can be tested. 
3 Fast Noun Phrase Parsing 
A fast and robust noun phrase parser is a key to 
the exploration of syntactic phrase indexing. Noun 
phrase parsing, or noun phrase structure analy- 
sis ( also known as compound noun analysisS), 
is itself an important research issue in computa- 
tional linguistics and natural language processing. 
Long noun phrases, especially long compound nouns 
such as "information retrieval technique", generally 
have ambiguous structures. For instance, "informa- 
tion retrieval technique" has two possible structures: 
"\[\[information retrieval\] technique\]' and "\[informa- 
tion \[retrieval technique\]\]'. A principal difficulty 
in noun phrase structure analysis is to resolve such 
structural ambiguity. When a large corpus is avail- 
able, which is true for an IR task, statistical prefer- 
ence of word combination or word modification can 
be a good clue for such disambiguation. As summa- 
rized in (Lauer 95), there are two different models 
for corpus-based parsing of noun phrases: the adja- 
cency model and the dependency model. The differ- 
ence between the two models can be illustrated by 
the example compound noun "informationsretrieval 
technique". In the adjacency model, the structure 
would be decided by looking at the adjacency as- 
sociation of "information retrievaF and "retrieval 
technique". "information retrievat' will be grouped 
first, if "information retrievaF has a stronger as- 
sociation than "retrieval technique", otherwise, "re- 
trieval technique" will be grouped first. In the de- 
pendency model, however, the structure would be 
decided by looking at the dependency between "in- 
formation" and "retrievaP (i.e., the tendency for 
"information" to modify "retrievat') and the depen- 
dency between "information" and "technique". If 
"information" has a stronger dependency associa- 
tion with "retrievaP than with "technique", "infor- 
mation retrievat' will be grouped first, otherwise, 
"retrieval technique" will be grouped first. The ad- 
jacency model dates at least from (Marcus 80) and 
has been explored recently in (Liberman and Sproat 
92; Pustejovsky et al. 93; Resnik and Hearst 93; 
Lauer 95; Strzalkowski et al. 95; Evans and Zhai 
96); The dependency model has mainly been stud- 
ied in (Lauer 94). Evans and Zhai (Evans and Zhai 
96) use primarily the adjacency model, but the as- 
sociation score also takes into account some degree 
of dependency. Lauer (Lauer 95) compared the ad- 
jacency model and the dependency model for com- 
pound noun disambiguation, and concluded that the 
SStrictly speaking, however, compound noun analysis 
is a special case of noun phrase analysis, but the same 
technique can oRen be used for both. 
314 
dependency model provides a substantial advantage 
over the adjacency model. 
We now propose a probabilistic model in which the 
dependency structure, or the modification structure, 
of a noun phrase is treated as "hidden", similar to 
the tree structure in the probabilistic context-free 
grammar (Jelinek et al. 90). The basic idea is as 
follows. 
A noun phrase can be assumed to be generated 
from a word modification structure (i.e., a depen- 
dency structure). Since noun phrases with more 
than two words are structurally ambiguous, if we 
only observe the noun phrase, then the actual struc- 
ture that generates the noun phrase is "hidden". We 
treat the noun phrases with their possible structures 
as the complete data and the noun phrases occur- 
ring in the corpus (without the structures) as the 
observed incomplete data. In the training phase, an 
Expectation Maximization (EM) algorithm (Demp- 
ster et al. 77) can be used to estimate the parame- 
ters of word modification probabilities by iteratively 
maximizing the conditional expectation of the likeli- 
hood of the complete data given the observed incom- 
plete data and a previous estimate of the parameters. 
In the parsing phase, a noun phrase is assigned the 
structure that has the maximum conditional proba- 
bility given the noun phrase. 
Formally, assume that each noun phrase is gener- 
ated using a word modification structure. For exam- 
ple, "information retrieval technique" may be gener- 
ated using either the structure "\[XI\[X2Xz\]\]" or the 
structure "\[\[X1X2\]X3\]". The log likelihood of gen- 
erating a noun phrase, given the set of noun phrases 
observed in a corpus NP = {npi} can be written as: 
L(¢) = ~\] c(npi)log ~ P¢(npi, sj) 
npiENP sjES 
where, S is the set of all the possible modification 
structures; c(npi) is the count of the noun phrase npi 
in the corpus; and P¢(npi, sj) gives the probability of 
deriving the noun phrase npi using the modification 
structure sj. 
With the simplification that generating a noun 
phrase from a modification structure is the same as 
generating all the corresponding word modification 
pairs in the noun phrase and with the assumption 
that each word modification pair in the noun phrase 
is generated independently, P¢(npi, sj) can further 
be written as 
P¢(npi, sj) = P¢(sj) H PC(u, v) c(u'v;'~p''sD 
(u,v)eM(np~,sj) 
where, M(npi, sj) is the set of all word pairs (u, v) 
in npi such that u modifies (i.e., depends on) v ac- 
cording to sj. 6 c(u, v; npi, sj) is the count of the 
~For example, if npl is "information retrieval tech- nique", 
and sj is "\[\[X1X~IX3\]", then, M(npi, sj) = {(information, retrieval), (retrieval, technique)}. 
modification pairs (u, v) being generated when npi 
is derived from sj. P¢(sj) is the probability of struc- 
ture sj; while Pc(u, v) is the probability of generat- 
ing the word pair (u, v) given any word modifica- 
tion relation. P¢(sj) and Pc(u, v) are subject to the 
constraint of summing up to 1 over all modification 
structures and over all possible word combinations 
respectively. 7 
The model is clearly a special case of the class of 
the algebraic language models, in which the proba- 
bilities are expressed as polynomials in the param- 
eters(Lafferty 95). For such models, the M-step in 
the EM algorithm can be carried out exactly, and 
the parameter update formulas are: 
P,+I(U, v) 
= A'{ 1 ~ c(npi) ~ P~(sjlnpi)c(u,v;np,,sj) 
npi6NP s16S 
= )~1 ~ c(npi)P,(sklnpi) 
npiENP 
where, A1 and A2 are the Lagrange multipliers cor- 
responding to the two constraints mentioned above, 
and are given by the following formulas: 
(u,v)EWP rtpi 6NP sj ES 
8kESnpi6NP 
where, WP is the set of all possible word pairs. 
Pn(sj Inpi) can be computed as: 
st) 
P.(np , st) 
,,) 
''j) 
;'One problem with such simplification is that the 
model may generate a set of word modification pairs that 
do not form a noun phrase, although such "illegal noun 
phrases" are never observed. A better model would be 
to write the probability of each word modification pair 
as the conditional probability of the modifier (i.e., the 
modifying word) given the head (i.e., the word being 
modified). That is, 
P,(npi, st) = 
P¢(si)P~(h(npi)JsJ) 1-\[ P*(ulv)¢(~'~;'"~J) 
(u,v)EM(npi,sj) 
where h(np,) is the head (i.e., the last word) of the noun 
phrase npi(Lafferty 96). 
The EM algorithm ensures that L(n+ 1) is greater 
than L(n). In other words, every step of parameter 
update increases the likelihood. Thus, at the time of 
training, the parser can first randomly initialize the 
parameters, and then, iteratively update the param- 
eters according to the update formulas until the in- 
crease of the likelihood is smaller than some pre-set 
threshold, s In the implementation described here, 
the maximum length of any noun phrase is limited 
to six. In practice, this is not a very tight limit, since 
simple noun phrases with more than six words are 
quite rare. Summing over all the possible structures 
for any noun phrase is computed by enumerating all 
the possible structures with an equal length as the 
noun phrase. For example, in the case of a three- 
word noun phrase, only two structures need to be 
enumerated. 
At the time of parsing noun phrases, the structure 
of any noun phrase np (S(np)) is determined by 
S(np) = argmaxsP(slnp) 
= argmax,P(np\[s)P(s) 
= argmaxs H P(u, v)P(s) 
(u,v)eM(np,s) 
We found that the parameters may easily be bi- 
ased owing to data sparseness. For example, the 
modification structure parameters naturally prefer 
left association to right association in the case of 
three-word noun phrases, when the data is sparse. 
Such bias in the parameters of the modification 
structure probability will be propagated to the word 
modification parameters when the parameters are 
iteratively updated using EM algorithm. In the ex- 
periments reported in this paper, an over-simplified 
solution is adopted. We simply fixed the modifica- 
tion structure parameter and assumed every depen- 
dency structure is equally likely. 
Fast training is achieved by reading all the noun 
phrase instances into memory. 9 This forces us to 
split the whole noun phrase corpus into small chunks 
for training. In the experiments reported in this 
paper, we split the corpus into chunks of a size of 
around 4 megabytes. Each chunk has about 170,000 
(or about 100,000 unique) raw multiple word noun 
phrases. The parameters estimated on each sub- 
corpus are then merged (averaged). We do not know 
how much the merging of parameters affects the pa- 
rameter estimation, but it seems that a majority of 
phrases are correctly parsed with the merged param- 
eter estimation, based on a rough check of the pars- 
ing results. With this approach, it takes a 133-MHz 
DEC Alpha workstation about 5 hours to train the 
parser over the noun phrases from a 250-megabyte 
SFor the experiments reported in this paper, the 
threshold is 2. 
9An alternative way would be to keep the corpus in 
the disk. In this way, it is not necessary to split the 
corpus, unless it is extremely large. 
315 
text corpus. Parsing is much faster, taking less than 
1 hour to parse all noun phrases in the corpus of 
a 250-megabyte text. The parsing speed can be 
scaled up to gigabytes of text, even when the parser 
needs to be re-trained over the noun phrases in the 
whole corpus. However, the speed has not taken into 
account the time required for extracting the noun 
phrases for training. In the experiments described 
in the following section, the CLARIT noun phrase 
extractor is used to extract all the noun phrases from 
the 250-megabyte text corpus. 
After the training on each chunk, the estimation 
of the parameter of word modifications is smoothed 
to account for the unseen word modification pairs. 
Smoothing is made by "dropping" a certain number 
of parameters that have the least probabilities, tak- 
ing out the probabilities of the dropped parameters, 
and evenly distributing these probabilities among 
all the unseen word pairs as well as those pairs of 
the dropped parameters. It is unnecessary to keep 
the dropped parameters after smoothing, thus this 
method of smoothing helps reduce the memory over- 
load when merging parameters. In the experiments 
reported in the paper, nearly half of the total num- 
ber of word pairs seen in the training chunk were 
dropped. Since, word pairs with the least probabil- 
ities generally occur quite rarely in the corpus and 
usually represent semantically illegal word combina- 
tions, dropping such word pairs does not affect the 
parsing output so significantly as it seems. In fact, it 
may not affect the parsing decisions for the majority 
of noun phrases in the corpus at all. 
The potential parameter space for the probabilis- 
tic model can be extremely large, when the size of 
the training corpus is getting larger. One solution 
to this problem is to use a class-based model similar 
to the one proposed in (Brown et al. 92) or use pa- 
rameters of conceptual association rather than word 
association, as discussed in (Lauer 94)(Lauer 95). 
4 Experiment Design 
We used the CLARIT commercial retrieval system 
as a retrieval engine to test the effectiveness of differ- 
ent indexing sets. The CLARIT system uses the vec- 
tor space retrieval model(Salton and McGill 83), in 
which documents and the query are all represented 
by a vector of weighted terms (either single words or 
phrases), and the relevancy judgment is based on the 
similarity (measured by the cosine measure) between 
the query vector and any document vector(Evans et 
al. 93; Evans and Lefferts 95; Evans et al. 96). The 
experiment procedure is described by Figure 1. 
First, the original database is parsed to form dif- 
ferent sets of indexing terms (say, using different 
combination of phrases). Then, each indexing set is 
passed to the CLARIT retrieval engine as a source 
document set. The CLARIT system is configured to 
accept the indexing set we passed as is to ensure that 
316 
Original Document Set ) 
t 
CLARIT NP Extractor I 
Raw Noun Phrases ) 
Statistical NP Parser 
Phrase Extractor 
Indexing Term Set ) 
CLARIT Retrieval Engine I 
Figure h Phrase indexing experiment procedure 
the actual indexing terms used inside the CLARIT 
system are exactly those generated. 
It is possible to generate three different 
kinds/levels of indexing units from a noun phrase: 
(1) single words; (2) head modifier pairs (i.e., any 
word pair in the noun phrase that has a linguis- 
tic modification relation); and (3) the full noun 
phrase. For example, from the phrase structure "\[\[~neavy=construction\]-industry\]\]-group\]" 
(a real 
example from WS390), it is possible to generate the 
following candidate terms: 
SINGLE WORDs : 
heavy, construction, industry, group 
HEAD MODIFIERS : 
construction industry, industry group, 
heavy construction 
FULL NP : 
heavy construction industry group 
Different combinations of the three kinds of terms 
can be selected for indexing. In particular, the in- 
dexing set formed solely of single words is used as a 
baseline to test the effect of using phrases. In the ex- 
periments reported here, we generated four different 
combinations of phrases: 
-- WD-SET : 
single word only (no phrases, baseline) 
-- WD-HM-SET: 
single word + head modifier pair 
-- ND-NP-SET : 
single word + full NP 
-- WD-HM-NP-SET : 
single word + head modifier + full NP 
The results from these different phrase sets are 
discussed in the next section. 
5 Results analysis 
We used, as our document set, the Wall Street Jour- 
nal database in Tipster Disk2 (Harman 96) the size 
of which is about 250 megabytes. We performed 
the experiments by using the TREC-5 ad hoc topics 
(i.e., TREC topics 251-300). Each run involves an 
automatic feedback with the top 10 documents re- 
turned from the initial retrieval. The CLARIT au- 
tomatic feedback is performed by adding terms from 
a query-specific thesaurus extracted from the top N 
documents returned from the initial retrieval(Evans 
and Lefferts 95). The results are evaluated using 
the standard measures of recall and precision. Re- 
call measures how many of the relevant documents 
have actually been retrieved. Precision measures 
how many of the retrieved documents are indeed rel- 
evant. They are calculated by the following simple 
formulas: 
number of relevant items retrieved Recall = 
total number of relevant items in collection 
number of relevant items retrieved Precision = 
total number of items retrieved 
We used the standard TREC evaluation package 
provided by Cornell University and used the judged- 
relevant documents from the TREC evaluations as 
the gold standard(Harman 94). 
In Table 1, we give a summary of the results and 
compare the three phrase combination runs with the 
corresponding baseline run. In the table, "Ret-rel" 
means "retrieved-relevant" and refers to the total 
number of relevant documents retrieved. "Init Prec" 
means "initial precision" and refers to the highest 
level of precision over all the points of recall. "Avg 
Prec" means "average precision" and is the average 
of all the precision values computed after each new 
relevant document is retrieved. 
It is clear that phrases help both recall and pre- 
cision when supplementing single words, as can be 
seen from the improvement of all phrase runs (WD- 
HM-SET, WD-NP-SET, WD-I-IM-NP-SET) over 
the single word run WD-SET. 
It can also be seen that when only one kind of 
phrase (either the full NPs or the head modifiers) is 
used to supplement the single words, each can lead 
to a great improvement in precision. However, when 
we combine the two kinds of phrases, the effect is a 
greater improvement in recall rather than precision. 
The fact that each kind of phrase can improve pre- 
cision significantly when used separately shows that 
Experiments 
WD-SET 
WD-HM-SET 
inc over WD-SET 
WD-NP-SET 
inc over WD-SET 
WD-HM-NP-SET 0.63(666) 
inc over WD-SET 13% 
Recall (l:tet-Rel) Init Prec Av~ Prec 
0.56(597) 0.4546 0.2208 
0.60( 638 ) 0.5162 0.2402 
7% 14% 9% 
0.56(613) 0.5373 0.2564 
4% 18% 16% 
0.4747 0.2285 
4% 3% 
Total relevant documents: 1064 
Table 1: Effects of Phrases with feedback and 
TREC-5 topics 
these phrases are indeed very useful for indexing. 
The combination of phrases results in only a smaller 
precision improvement but causes a much greater 
increase in recall. This may indicate that more ex- 
periments are needed to understand how to combine 
and weight different phrases effectively. 
The same parsing method has also been used 
to generate phrases from the same data for the 
CLARIT NLP track experiments in TREC-5(Zhai 
et al. 97), and similar results were obtained, al- 
though the WD-NP-SET was not tested. The results 
in (Zhai et al. 97) are not identical to the results 
here, because they are based on two separate train- 
ing processes. It is possible that different training 
processes may result in slightly different parameter 
estimations, because the corpus is arbitrarily seg- 
mented into chunks of only roughly 4 megabytes for 
training, and the chunks actually used in different 
training processes may vary slightly. 
6 Conclusions 
Information retrieval provides a good way to quanti- 
tatively (although indirectly) evaluate various NLP 
techniques. We explored the application of a fast 
statistical noun phrase parser to enhance document 
indexing in information retrieval. We proposed a 
new probabilistic model for noun phrase parsing and 
developed a fast noun phrase parser that can han- 
dle relatively large amounts of text efficiently. The 
effectiveness of enhancing document indexing with 
the syntactic phrases provided by the noun phrase 
parser was evaluated on the Wall Street Journal 
database in Tipster Disk2 using 50 TREC-5 ad hoc 
topics. Experiment results on this 250-megabyte 
document collection have shown that using differ- 
ent kinds of syntactic phrases provided by the noun 
phrase parser to supplement single words for index- 
ing can significantly improve the retrieval perfor- 
mance, which is more encouraging than many early 
experiments on syntactic phrase indexing. Thus, us- 
ing selective NLP, such as the noun phrase parsing 
technique we proposed, is not only feasible for use in 
information retrieval, but also effective in enhancing 
the retrieval performance./° 
1°Whether such syntactic phrases are more effective 
than simple statistical phrases (e.g., high frequency word 
317 
There are two lines of future work: 
First, the results from information retrieval ex- 
periments often show variances on different kinds 
of document collections and different sizes of collec- 
tions. It is thus desirable to test the noun phrase 
parsing technique in other and larger collections. 
More experiments and analyses are also needed to 
better understand how to more effectively combine 
different phrases with single words. In addition, it 
is very important to study how such phrase effects 
interact with other useful IR techniques such as rel- 
evancy feedback, query expansion, and term weight- 
ing. 
Second, it is desirable to study how the parsing 
quality (e.g., in terms of the ratio of phrases parsed 
correctly) would affect the retrieval performance. It 
is very interesting to try the conditional probabil- 
ity model as mentioned in a footnote in section 3 
The improvement of the probabilistic model of noun 
phrase parsing may result in phrases of higher qual- 
ity than the phrases produced by the current noun 
phrase parser. Intuitively, the use of higher qual- 
ity phrases might enhance document indexing more 
effectively, but this again needs to be tested. 
7 Acknowledgments 
The author is especially grateful to David A. Evans 
for his advising and supporting of this work. Thanks 
are also due to John Lafferty, Nata~a Milid-Frayling, 
Xiang Tong, and two anonymous reviewers for their 
useful comments. Naturally, the author alone is re- 
sponsible for all the errors. 

References 

Belkin, N., and Croft, B. 1987. Retrieval techniques. 
In: Williams, Martha E.(Ed.), Annual Review of 
Information Science Technology, Vol. 22. Amster- 
dam, NL: Elsevier Science Publishers. 1987. 110- 
145. 

Brown, P. et at. 1992. Class-based n-gram models 
of natural language. Computational Linguistics, 
18(4), December, 1992. 467-479. 

Dempster, A. P. et al. 1977. Maximum likelihood 
from incomplete data via the EM algorithm. Jour- 
nal of the Royal Statistical Society, 39 B, 1977. 
1-38. 

Evans, D. A., Ginther-Webster, K. , Hart, M., Lefferts, R., Monarch, I., 1991. Automatic indexing 
using selective NLP and first-order thesauri. In: 
A. Lichnerowicz (ed.), Intelligent Text and Image Handling. Proceedings of a Conference, RIAO 
'91. Amsterdam, NL: Elsevier. 1991. pp. 624-644. 

Evans, D. A., Lefferts, R. G., Grefenstette, G., Han- 
derson, S. H., Hersh, W. R., and Archbold, A. 
A. 1993. CLARIT TREC design, experiments, 
and results. In: Donna K. Harman (ed.), The 
First Text REtrieval Conference (TREC-1). NIST 
Special Publication 500-207. Washington, DC: 
U.S. Government Printing Office, 1993. pp. 251- 
286; 494-501. 

Evans, David A. and Lefferts, Robert G. 1995. 
CLARIT-TREC experiments, Information Pro- 
cessing and Management, Vol. 31, No. 3, 1995. 
385-395. 

Evans, D., Milid-Frayling, N., and Lefferts, R. 1996. 
CLARIT TREC-4 Experiments, in Donna K. Hat- 
man (Ed.), The Fourth Text REtrieval Confer- 
ence (TREC-). NIST Special Publication 500- 
236. Washington, DC: U.S. Government Printing 
Office, 1996. pp. 305-321. 

Evans, D. and Zhai, C. 1996. Noun-phrase analysis in unrestricted text for information retrieval. 
Proceedings of the 34th Annual meeting of Association for Computational Linguistics, Santa Cruz, 
University of California, June 24-28, 1996.17-24. 

Fagan, Joel L. 1987. Experiments in Automatic 
Phrase Indexing for Document Retrieval: A Com- 
parison of Syntactic and Non-syntactic methods, 
PhD thesis, Dept. of Computer Science, Cornell 
University, Sept. 1987. 

Harman, D. 1994. The Second Text REtrieval Con- 
ference (TREC-2), NIST Special publication 500- 
215. National Institute of Standards and Technol- 
ogy, 1994. 

Harman, D. 1996. TREC 5 Conference Notes, Nov. 
20-22, 1996. 

Jelinek, F., Lafferty, J.D., and Mercer, R. L. 1990. 
Basic methods of probabilistic context free gram- 
mars. Yorktown Heights,N.Y.: IBM T.J. Wat- 
son Research Center, 1990. Research report RC. 
16374. 

Lafferty, J. 1995. Notes on the EM Algorithm, In- 
formation Theory course notes, Carnegie Mellon 
University. 

Lafferty, J. 1996. Personal Communications. 
Lauer, Mark. 1994. Conceptual association for compound noun analysis. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Student Session, Las Cruces, 
NM, 1994. 337-339. 

Lauer, Mark. 1995. Corpus statistics meet with the 
noun compound: Some empirical results. Proceedings of the 33th Annual Meeting of the Association 
for Computational Linguistics, 1995. 

Lewis, D. 1991. Representation and Learning in Information Retrieval. Ph.D thesis, COINS Technical Report 91-93, Univ. of Massachusetts, 1991. 

Lewis, D. and Sparck Jones, K. 1996. Applications 
of natural language processing in information re- 
trieval. Communications of ACM, Vol. 39, No. 1, 
1996, 92-101. 

Liberman, M. and Sproat, R. 1992. The stress and 
structure of modified noun phrases in English. In: 
Sag, I. and Szabolcsi, A. (Eds.), Lexical Matters, 
CSLI Lecture Notes No. 24. University of Chicago 
Press, 1992. 131-181. 

Marcus, Mitchell. 1980. A Theory of Syntactic Rec 
ognition for Natural Language. MIT Press, Cam- 
bridge, MA, 1980. 

Pustejovsky, J., Bergler, S., and Anick, P. 1993. Lexical semantic techniques for corpus analysis. In: 
Computational Linguistics, Vol. 19 (2), Special Issue on Using Large Corpora II, 1993. 331-358. 

Resnik, P. and Hearst, M. 1993. Structural ambiguity and conceptual relations. In: Proceedings of 
the Workshop on Very Large Corpora: Academic 
and Industrial Perspectives, June 22, 1993. Ohio 
State University. 58-64. 

Salton, G. and McGill, M. 1983. Introduction to 
Modern Information Retrieval, New York, NY: 
McGraw-Hill, 1983. 

Smeaton, Alan F. 1992. Progress in application of 
natural language processing to information re- 
trieval. The Computer Journal, Vol. 35, No. 3, 
1992. 268-278. 

Strzalkowski, T. 1992. TTP: A fast and robust parser for natural language processing. Proceedings of the l~th International Conference on Computational Linguistics (COLING),Nantes, France, 
July, 1992. 198-204. 

Strzalkowski, T. and Vauthey, B. 1992. Information 
retrieval using robust natural language processing. 
Proceedings of the 30th ACL Meeting, Neward, 
DE, June-July, 1992. 104-111. 

Strzalkowski, T. and Carballo, J. 1994. Recent de- 
velopments in natural language text retrieval. In: 
Harman, D. (Ed.), The Second Text REtrieval 
Conference (TREC-2), NIST Special Publication 
500-215. 1994. 123-136. 

Strzalkowski, T. 1995. Natural language informa- 
tion retrieval. Information Processing and Man- 
agement. Vol. 31, No. 3, 1995. 397-417. 

Strzalkowski, T. et al. 1995. Natural language in- 
formation retrieval: TREC-3 report. In: Har- 
man, D. (Ed.), The Third Text REtrieval Con- 
ference (TREC-3), NIST Special Publication 500- 
225. 1995.39-53. 

Strzalkowski, T. et al. 1996. Natural language in- 
formation retrieval: TREC-4 report. In: Har- 
man, D. (Ed.), The Fourth Text REtrieval Con- 
ference (TREC-4). NIST Special Publication 500-236. Washington, DC: U.S. Government Printing 
Office, 1996. pp. 245-258. 

Zhai, C., Tong, X., Milid-Frayling, N., and Evans D. 
1997. Evaluation of syntactic phrase indexing - 
CLARIT TREC5 NLP track report, to appear in 
The Fifth Text REtrieval Conference (TREC-5), 
NIST special publication, 1997, forthcoming. 
