Exploiting Sophisticated Representations for Document 
Retrieval 
Steven Finch 
Language Technology Group, HCRC 
University of Edinburgh 
S. Finch~ed. ac. uk 
Abstract 
The use of NLP techniques for docu- 
ment classification has not produced signif- 
icant improvements in performance within 
the standard term weighting statistical as- 
signment paradigm (Fagan 1987; Lewis, 
1992bc; Buckley, 1993). This perplexing 
fact needs both an explanation and a so- 
lution if the power of recently developed 
NLP techniques are to be successfully ap- 
plied in IR. A novel method for adding lin- 
guistic annotation to corpora is presented 
which involves using a statistical POS tag- 
ger in conjunction with unsupervised struc- 
ture finding methods to derive notions of 
"noun group", "verb group", and so on 
which is inherently extensible to more so- 
phisticated annotation, and does not re- 
quire a pre-tagged corpus to fit. One of the 
distinguishing features of a more linguisti- 
cally sophisticated representation of docu- 
ments over a word set based representation 
of them is that linguistically sophisticated 
units are more frequently individually good 
predictors of document descriptors (key- 
words) than single words are. This leads 
us to consider the assignment of descriptors 
from individual phrases rather than from 
the weighted sum of a word set representa- 
tion. We investigate how sets of individu- 
ally high-precision rules can result in a low 
precision when used together, and develop 
some theory about these probably-correct 
rules. We then proceed to repeat results 
which show that standard statistical mod- 
els are not particularly suitable for exploit- 
ing linguistically sophisticated representa- 
tions, and show that a statistically fitted 
rule-based model provides significantly im- 
proved performance for sophisticated rep- 
resentations. It therefore shows that statis- 
tical systems can exploit sophisticated rep- 
resentations of documents, and lends some 
suppor t to the use of more linguistically 
65 
sophisticated representations for document 
classification. This paper reports on work 
done for the LRE project SmTA, which is 
creating a PC based tool to be used in the 
technical abstracting industry. 
1 Models and Representations 
First, I discuss the general paradigm for document 
classification, along with the conventions for nota- 
tion used throughout this document. We have a 
set of documents {zi}, and set of descriptors, {di}. 
Each document is represented in one or more ways 
in some domain, usually as a set. The elements of 
this set will be called diagnostic units or predicates, 
{wi} or {¢i). These diagnostic units might be the 
words comprising the document, or more linguisti- 
cally sophisticated annotations of parts of the doc- 
ument. They may, in general, be predicates over 
documents. The representation of the document by 
diagnostic units will be called the DU-representation 
of the document, and for a document z, will be de- 
noted T~(x). From the DU representation of the doc- 
uments, one or more descriptors are assigned to each 
of them by some automatic system. This paradigm 
of description is applicable to much of the work on 
text classification (and other fields in information 
retrieval). 
This paper assesses the utility of using linguisti- 
cally sophisticated diagnostic units together with a 
slightly non-standard statistical assignment model 
in order to assign descriptors to a document. 
2 The Corpus 
This paper reports work undertaken for the LRE 
project SISTA (Semi-automatic Indexing System for 
Technical Abstracts). This section briefly describes 
one of the corpora used by this project. 
The RAPRA corpus comprises some 212,000 tech- 
nical abstracts pertaining to research and commer- 
cial exploitation in the rubber and plastics industry. 
To each abstract, an average of 15 descriptors se- 
lected from a thesaurus of some 10,000 descriptors 
is assigned to each article. The frequency of assign- 
ment of descriptors varies roughly in the same way 
as the frequency of word use varies (the frequencies 
of descriptor tokens (very) approximately satisfies 
the Zipf-Mandelbrot law). Descriptors are assigned 
by expert indexers from the entire article and expert 
domain knowledge, not just from the abstract, so it 
is unlikely that any automatic system which analy- 
ses only the abstracts can assign all the descriptors 
which are manually assigned to the abstract. 
We show a fairly typical example below. It is clear 
that many of these descriptors must have been as- 
signed from the main text of the article, and not 
from the abstract alone. Moreover, this is common 
practice in the technical abstract indexing industry, 
so it seems unlikely that the situation will be better 
for other corpora. Nevertheless, we can hope to fol- 
low a strategy of assigning descriptors when there is 
enough information to do so. 
Macromolecular Deformation Model to Estimate 
Viscoelastic Flow Effects in Polymer Melts 
The elastic deformation of polymer macromolecules in a 
shear field is used as the basis for quantitative predic- 
tions of viscoelastic flow effects in a polymer melt. Non- 
Newtonian viscosity, capillary end correction factor, maxi- 
mum die swell, and die swell profile of a polymer melt arc 
predicted by the model. All these effects can be reduced 
to generic master curves, which are independent of polymer 
type. Macromolecular deformation also influences the brit- 
tle failure strength of a processed polymer glass. The model 
gives simple and accurate estimates of practically important 
processing effects, and uses fitting parameters with the clear 
physical identity of viscoelastic constants, which follow well 
established trends with respect to changes in polymcr com- 
position or processing conditions. 12 refs. 
Original assignment: BRITTLE FAILURE; COMPANY; 
DATA; DIE SWELL; ELASTIC DEFORMATION; EQUATION; 
GRAPH; MACROMOLECULE; MELT FLOW; MODEL; NON- 
NEWTONIAN; PLASTIC; POLYMERIC GLASS; PROCESSING; 
RHEOLOGICAL PROPERTIES; RHEOLOGY; TECHNICAL; THE- 
ORY; THERMOPLASTIC; VISCOELASTIC PROPERTIES; VIS- 
COELASTICITY; VISCOSITY 
3 Models 
Two classes of models for assessing descriptor appro- 
priateness were used. One class comprises variants 
of Salton's term-weighting models, and one is more 
allied to fuzzy or default logic in so much as it assigns 
descriptors due to the presence of certain diagnos- 
tic units. What is interesting for us is that term 
weighting models do not seem able to easily exploit 
the additional information provided by a more so- 
phisticated representation of a document, while an 
alternative statistical single term model can. 
3.1 Term weighting models 
The standard term weighting model is defined by 
chosing a set of parameters {c~ij } (one for each word- 
descriptor pair) and {fli} (one for each desc,'iptor) 
so that a likelihood or appropriateness function, /2, 
can be defined by 
C(alw) = (1) 
wEW 
This has been widely used, and is provably equiv- 
alent to a large class of probabilistic models (e.g. 
Van Risjbergen, 1979) which make various assump- 
tions about the independence between descriptors 
and diagnostic units (Fuhr & Buckley, 1993). Vari- 
ous strategies for estimating the parameters for this 
model have been proposed (e.g. Salton & Yang, 
1973, Buckley 1993, Fuhr & Buekley, 1993). Some 
of these concentrate on the need for re-estimating 
weights according to relevance feedback information, 
while some make use of various functions of term 
frequency, document frequency, maximum within- 
document frequency, and various other measure- 
ments of corpora. Nevertheless, the problem of esti- 
mating the huge number of parameters needed for 
such a model is statistically problematic, and as 
Buckley (1993) points out, the choice of weights has 
a large influence on the effectiveness of any model 
for classification or for retrieval. 
There are so many variations on the theme of term 
weighting models that it is impossible to try them 
all in one experiment, so this paper uses a variation 
of a model used by Lewis (1992e) in which he re- 
.ports the results of some experiments using phrases 
In a term weighting model (which has a probabilistic 
interpretation). Several term weighting models have 
been tried, but they all evaluate within 5 points of 
each other on both precision and recall (when suit- 
ably tweaked). 
The model eventually chosen for the tests reported 
here was a smoothed logistic model which gave the 
best results of all the probabilistically inspired term 
weighting models considered. 
3.2 Single term model 
In contrast to making assumptions of independence 
about the relationship between diagnostic units and 
words, the next model utilises only those diagnostic 
units which strongly predict descriptors (i.e. have 
frequently been associated with descriptors) with- 
out making assumptions about the independence of 
diagnostic units given descriptors. 
We shall investigate this class of models using 
probability theory. The main problem with using 
probability theory for problems in document classi- 
fication is that while it might be relatively easy to 
estimate probabilities such as P(dlw ) for some diag- 
nostic unit w and some descriptor d, it is not possible 
66 
to infer much about P(dIw~), where • is some ad- 
ditional information (e.g. the other DUs which rep- 
resent the document), since these probabilities have 
not been estimated, and would take a far larger cor- 
pus to reliably estimate in any case. The situation 
gets exponentially worse as the information we have 
about the document increases. The exception to this 
rule is when P(dlw ) is close to 1, in which case it is 
very unlikely that additional information changes its 
value much. This fact is further investigated now. 
The strategy explored here is to concentrate on 
finding "sure-fire" indicators of descriptors, in a 
somewhat similar manner to how Carnegie's TCS 
works, by exploiting the fact that with a pre- 
classified training corpus we can identify sure-fire 
indicators empirically and "trawl" in a large set of 
informative diagnostic units for those which identify 
descriptors with high precision. The basis of the 
model is the following: 
We consider a likelihood function, Z: defined by: 
Z(dlw) = gd~ N~ 
That is, the number of articles in the training cor- 
pus that d was observed to occur with w divided by 
the number of articles in which w occurred in the 
training corpus. This is an empirical estimate of the 
conditional probability, P(d\[w). We shall assume 
(for simplicity's sake) that we have a large enough 
corpus do reliably estimate these probabilities. 
The strategy for descriptor assignment we are in- 
vestigating is to assign a descriptor d if and only 
if one of a set of predicates over representations of 
documents is true. We define the rule ¢(x) ~ d 
to be Probably Correct do degree ¢ if and only if 
P(dl¢ ) > 1-¢. We wish to keep the precision result- 
ing from using this strategy high while increasing the 
number of rules to improve recall. The predicates ¢ 
we shall consider for this paper will be very simple 
(they will typically be true iff w E T~(x) for some 
diagnostic unit w), but in principle, they could be 
arbitrarily complex (as they are in Carnegie's TCS). 
The primary question of concern is whether the en- 
semble of rules {¢i --~ d} retains precision or not. 
Unfortunately, the answer to this question is that 
this is not necessarily the case unless we put some 
constraints on the predicates. 
Proposition 1 Let • be a set of predicates with the 
property that for some fixed descriptor d, ¢ E • ---+ 
P(d\]¢ ) > 1 - ¢. That is each of the rules ¢i --+ d is 
probably correct to degree c. 
The expected precision of the rule (V ¢i) --* d is 
_ ne where n is the cardinality, \](I)\]. at least 1 
Proof: 
\[Straight-forward and omitted\] 
This proposition asserts that one cannot be guar- 
anteed to be able to keep adding diagnostic units to 
improve recall without hurting precision, unless the 
67 
quality of those diagnostic units is also improved (i.e. 
c is decreased in proportion to the number of DUs 
which are considered). This is unfortunate, but nev- 
ertheless the question of how much adding diagnostic 
units to help recall will hurt precision is an entirely 
empirical matter dependent on the true nature of P; 
this proposition is a worst case, and gives us reason 
to be careful. Performance will be expected to be 
poorest if there are many rules which correspond to 
the same true positives, but different sets of false 
positives. If the predicates are disjoint, for example, 
then the precision of a disjunction is at least as great 
as the precision of applying any single rule. 
So if we design our predicates so that they are dis- 
joint, then we retain precision while increasing recall. 
In practice, this is infeasible, but it is feasible to look 
more carefully at frequently co-occurring predicates, 
since these will be most likely to reduce precision. 1 
The main moral we can draw from the above two 
propositions is that we must be careful about the 
case where diagnostic units are highly correlated. 
One situation which is relatively frequent as the 
sophistication of representation increases is that 
some diagnostic units always co-occur with others. 
For example, if the document were represented by 
sequences of words, then the sequence "olefin poly- 
merisation" always occurs whenever the sequence 
"high temperature olefin polymerisation" occurs. In 
this case, it might be thought to pay to look only 
at the most specific diagnostic units since we have 
if wl --* w2, then P(Z\]wlw2C) = P(XIwlC) for 
any distribution P whatsoever (here, C represents 
any other contextual information we have, for exam- 
ple the other diagnostic units representing the doc- 
ument). However, if wl is significantly less frequent 
than w2 estimation errors of P(d\[wl) will be larger 
for P(dlw2) for any descriptor d, so there may not be 
a significant advantage. However, it does give us a 
1 One classic example is the case of the "New Hamp- 
shire Yankee Power Plant". In a collection of New York 
Times articles studied by Jacobs & Rau (1990), the word 
"Yankee" was found to predict NUCLEAR POWER because 
of the frequent occurrence of articles about this plant. 
However, "Yankee" on its own without the other words in 
this phrase is a good predictor of articles about the New 
York Yankees, a baseball team. If highly mutually infor- 
mative words "are combined into conjunctive predicates 
(e.g. "Yankee" E x & "Plant" E x), and a document 
is represented by its most specific predicates only, then 
when "Yankee" appears alone, it will be a good predic- 
tor of the descriptor SPORT. This example can also show 
that the bound described above is tight. Imagine (sus- 
pending belief) that each of the five words in the phrase 
have the same number of occurrences, i, in the document 
collection without NUCLEAR POWER where they never oc- 
cur together palrwise, and always occur all together in 
j true positives of the descriptor. Then the precision of 
assigning NUCLEAR POWER if any one of them appears 
in a document is j+51-'-2--, and since e in this case is i+--~, the 
bound follows (for the case n = 5) with a little algebra. 
theoretical reason to believe that representing a doc- 
ument by its set of most specific predicates is worth 
investigating, and this shall be investigated below. 
If one considers a calculus similar to the one de- 
scribed here, but allows ~ to limit to 0, then a 
weak default logic ensues which has been studied 
by Adams (1975), and further investigated by Pearl 
(1988). 
4 Adding linguistic description 
The simplest way of representing a document is as 
a set or multi set of words. Many people (eg. Lewis 
1992bc; Jacobs & Rau 1990) have suggested that a 
more linguistically sophisticated representation of a 
document might be more effective for the purposes of 
statistical keyword assignment. Unfortunately, at- 
tempts to do this have not been found to reliably 
improve performance as measured by recall and pre- 
cision for the task of document classification. I shall 
present evidence that a more sophisticated repre- 
sentation makes better predictions from the Single 
Term model defined above than it does from stan- 
dard term weighting models. 
4.1 Linguistic description 
The simplest form of linguistic description of the 
content of a machine-readable document is in the 
form of a sequence (or a set) of words. More so- 
phisticated linguistic information comes in several 
forms, all of which may need to be represented if 
performance in an automatic categorisation exper- 
iment is to be improved. Typical examples of lin- 
guistically sophisticated annotation include tagging 
words with their syntactic category (although this 
has not been found to be effective for 1R), lemma of 
the word (e.g. "corpus" for "corpora"), phrasal in- 
formation (e.g. identifying noun groups and phrases 
(Lewis 1992c, Church 1988)), and subject-predicate 
identification (e.g. Hindle 1990). For the RAPRA 
corpus, we currently identify noun groups and ad- 
jective groups. 
This is achieved in a manner similar to Church's 
(1988) PARTS algorithm used by Lewis (1992bc), 
in the sense that its main properties are robustness 
and corpus sensitivity. All that is important for this 
paper is that the technique identifies various group- 
ings of words (for example, noun-groups, adjective 
groups, and so on) with a high level of accuracy. 
Major parts of the technique are described in detail 
in Finch, 1993. As an example, this is some of the 
linguistic markup which represents the title of the 
sample document shown earlier. 
• macromolecular deformation (NG); macromolecular defor- 
mation model (NG); deformation (NG); deformation model 
(NG); model (NG); viscoelastic flow (NG); viscoelastic flow 
effects (NGS); flow (NG); flow effects (NGS); effects (NGS); 
polymer (NG); polymer melts (NGS); melts (NGS) 
It is clear that the markup is far from sophisti- 
cated, and is very much a small variation on a sim- 
ple sequence-based representation. Nevertheless, it 
is fairly accurate in so much as well over 90% of 
what are claimed to be noun groups can be inter- 
preted as such. One very useful by-product of using 
a linguistically based representation is that Il~ can 
help in linguistic tasks such as terminological col- 
lection. I shall present some examples of diagnostic 
units which are highly associated with descriptors 
later. 
5 Predicting from sophisticated 
representations 
In what follows, we shall compare the relative per- 
formance of a term weighting model with the single 
term model as we vary the sophistication of repre- 
sentation. Proportional assignmen~ 
(Lewis 1992b) is used to 
assign the descriptors from statistical measurements 
of their appropriateness. This method ensures that 
roughly the same number of assignments of particu- 
lar descriptors are made as are actually made in the 
test corpus. The strategy is simply to assign descrip- 
tor d to the N documents which score highest for 
this descriptor, where N is chosen in proportion to 
the occurrence of d in the training corpus. For term 
weighting models, the score is simply the combined 
weight of the document; for the single term model, 
the score is sup~eT¢(~ ) P(dlw). The Rule Based as- 
signment strategy applies only to the single term 
model and the rule w --~ d is included just in case 
P(dlw )> 1-¢. 
Figure 1 shows a few of the rules. All of these 
entries share the property that P(d\]w) > 0.8. They 
were selected at random from the 85,500 associations 
which were found. 
5.1 Representations and models 
Five paradigms of representation of documents will 
be compared, and two term appropriateness models 
will be compared. This gives us ten combinations. 
The first representation paradigm is a baseline one: 
represent documents as the set of the words con- 
tained in them. The second paradigm is to repre- 
sent documents according to word sequences, and 
the third is to apply a noun-group and adjective- 
group recogniser. The fourth and fifth representa- 
tion modes consider representing documents by only 
their most specific diagnostic units. For example, if 
the sequence "thermoplastic elastomer compounds" 
68 
polymer materials Research/NG; 
EEC legislation/NGS; 
venture partners/NGS; 
Bergen op/NP 
sheet lines/NGS 
railroad/NG 
injection moulding fa~:ility/NG 
PHENOLPHTHALEIN/NP 
unsaturated polyester composites/NGS 
thermoplastic elastomer compounds/NGS 
properties features/NGS 
fiber Glass/NG 
comparative performance/NG 
automotive hose/NGS 
Bitruder/NP 
worldwide tyre/NG 
Victrex polyethersulphone/NP 
PS melts/NGS 
viscoelastic characteristics/NGS 
plastics waste/NG 
lattice relaxation/NG 
fatigue crack propagation/NG 
unidirectional composites/NGS 
Flory Huggins interaction/NG 
DATA 
"-* LEGISLATION 
--* JOINT VENTURE 
--* PLASTIC 
---* COMPANY 
--* COMPANY 
--* PLASTIC 
--* DATA 
--~ THERMOSET 
---* RUBBER 
---* PLASTIC 
--* GLASS FIBRE REINFORCED PLASTIC 
--* DATA 
--, RUBBER 
--, EXTRUDER 
COMPANIES 
-'~ COMPANIES 
---+ PLASTIC 
---* VISCOELASTIC PROPERTIES 
RECYCLING 
NUCLEAR MAGNETIC RESONANCE 
"--+ MECHANICAL PROPERTIES 
REINFORCED PLASTIC 
TECHNICAL 
Figure 1: This figure shows some probably correct rules for the RAPRA corpus. In all, there are over 85,000 
such rules. 
appeared in the abstract, then ordinarily this would 
include the sequence "elastomer compounds", which 
would be included in the representation. The results 
of section 3.2 might encourage us to believe that rep- 
resenting a document by only its most specific diag- 
nostic units will improve performance (or, at least, 
precision). Consequently, a sequence of words is de- 
fined to be most specific if (a) it is a diagnostic unit 
and (b) it is not properly contained in a token of any 
other diagnostic unit present in the document. 2 
The noun-groups are found by performing a sim- 
ple parse of the documents as described above, and 
identifying likely noun groups of length 3 or less. 
The contingency table of diagnostic units verses 
manually assigned descriptors on a training corpus of 
200,000 documents was collected, and this was used 
as the basis for two term appropriateness models. 
Probabilities were estimated by adding a constant 
(usually 0.02 was found fairly optimal) to each cell, 
and directly estimating from these slightly adjusted 
counts. 
The 50,000 most frequent diagnostic unit types 
were chosen, and terms which appeared in more than 
10% of documents were discarded. 
2If "elastomer compounds" appeared separately in 
the document from "thermoplastic elastomer com- 
pounds", then both of these sequences would be rep- 
resented in the experiments reported here. 
6 Results 
The results of the experiments on the RAPRA cor- 
pus are presented below. 3 
Despite the peculiarities of the corpus, the mes- 
sage is clear. The result that the standard model 
fares no better on word sequence sets than on word 
sets is repeated, and it is clear that the Single Term 
model fares much better than the Logit model on 
this data set. However, what is most interesting is 
that the Single Term models fares significantly bet- 
ter on the more sophisticated sequence based repre- 
sentations of the document than on the simpler word 
based representation. There is, however, no signifi- 
cant advantage identified by parsing the corpus into 
noun-groups over simply considering all word se- 
quences. The recall scores for the rule-based tagging 
strategy show that the improved performance of the 
sequence based representations can be explained by 
3All recall and precision scores are microaveraged 
(Lewis 1992c); they are the expected probability of as- 
signing or recalling correctly per tagging decision. The 
training set was a set of 200,000 abstracts, and the sep- 
arate test set had 10,000 abstracts. The experiments 
looked at only the 520 most common descriptors. In 
the table, TW means that a term-weighting model was 
used, while ST means that the single term model was 
used. 'Word' means the representation was a wordset, 
'Seq', the set of all sequences, and 'NG' the set of groups 
derived from the grammar. For the sequence represen- 
tations, either all the possible sequences or groups were 
used (denoted by 'all'), or just the most specific ones 
were used (denoted by 'spec'). 
69 
the presence of many more "good" descriptor indi- 
cators. 
Assignment 
Prop. 
Prop. 
Prop. 
Prop. 
Prop. 
Prop. 
Prop. 
Prop. 
Prop. 
Rule ¢ = .2 
Rule ¢ = .2 
Rule ¢ = .2 
Rule ¢ = .2 
Rule ¢ = .2 
Model 
TW 
TW 
TW 
TW 
TW 
ST 
ST 
ST 
ST 
ST 
ST 
ST 
ST 
ST 
Repn Prec Rec 
Word 33% 32% 
Seq all 32% 34% 
Seq spec 33% 34% 
NG all 31% 36% 
NG spec 32% 32% 
Word 54% 48% 
Seq all 57% 55% 
Seq spec 55% 55% 
NG 56% 60% 
Word 83% 7% 
Seq all 77% 42% 
Seq spec 80% 40% 
NG all 82% 42% 
NG spec 84% 37% 
7 Conclusion 
The significant theoretical result is that as the so- 
phistication of the representation of abstracts is in- 
creased, the performance of the single term model 
improves, while the performance of the term weight- 
ing models does not improve significantly. This 
has been a fairly universal experience among re- 
searchers working within the term weighting clas- 
sification paradigm. 
Although there is a very marginally significant 
improvement from using linguistically sophisticated 
representations over simple sequence representations 
if all of the sequences are represented, this largely 
(though not entirely) disappears when only most 
specific sequences are considered, so it might be a 
result of the effects discussed in section 3.2. 
The rule based assignment strategy exploits the 
Single Term model's estimates, and also performs 
much better on word sequence representations than 
on word set representations. This assignment strat- 
egy is promising because it can exploit more sophis- 
ticated representations well, has a sound theory be- 
hind it, and will assign descriptors only where it has 
enough information to do so. Some of the descrip- 
tors in the RAPRA corpus, for example, are only 
ever assigned from the entire article from which the 
abstract is taken, so no assignment strategy will ever 
do well on these. On the other hand this model also 
shows promise that IR techniques might be applied 
to help infer linguistic resources such as term banks 
from large classified corpora. 
The next stage is to add more sophisticated lin- 
guistic annotation to corpora, and to trawl for rules 
in boolean combinations of descriptors, thus ad- 
dressing the results of section 3.2. In this way this 
work can be considered similar in spirit to that un- 
dertaken by Apte et al (1994), but differs in the 
forms of representation which are being considered 
for documents. 

References 

Adams, E. (1975) The Logic of Conditionals: an appli- 
cation of probability to deductive logic Reidel. 

Apte, C, F. Demerau & S. Weiss (1994) Towards Lan- 
guage Independent Automated Learning of Text Cate- 
gorization Methods. the proceeding of the Seventeenth 
ACM-SIGIR Conference on Information Retrieval. 23- 
30, DCU, Dublin. 

Buckley, C. (1993) The Importance of Proper Weighting 
Methods. ARPA Workshop on Human Language Technology. 

Church, K. (1988) A stochastic parts program and noun 
phrase parser for unrestricted text. In Second conference 
on ANLP, pp 136-43. 

Church, K., W. Gale, P. Hanks & D. Hindle (1989) Pars- 
ing, Word Associations and Typical Predicate-Argument 
Relations. In International Parsing Technologies Work- 
shop. CMU, Pittsburgh. 

Fagan, J. (1987) Experiments in Automatic Phrase In- 
dexing for Document Retrieval: Comparison of Syntactic 
and Non-Syntactic Methods. PhD Thesis. Cornell Uni- 
versity, Dept. of Computer Science. 

Finch, S. P. & N. Chater (1991) A Hybrid Approach to 
the Automatic Learning of Linguistic Categorie s . Artifi- 
cial Intelligence and Simulated Behaviour Quarterly. 78 
16-24. 

Finch, S. (1993) Finding Structure in Language. Ph.D. 
thesis, Centre for Cognitive Science, University of Edin- 
burgh, Edinburgh. 

Fuhr, N. (1989) Models for retrieval with probabilistic in- 
dexing. Information processing and management. 25(1): 
55-72. 

Fuhr, N. & Buckley, C (1993) Optimizing Document In- 
dexing and Search Term Weighting Based on Probabilis- 
tic Models First TREC Conference. 

Hindle, D. (1990) Noun Classification from Predicate-Argument Structures. In Proceedings of the 22nd meeting of the Association of Computational Linguistics. 
268-75. 

Jacobs, P. & Rau, L. (1990) SCISOR: Extracting Infor- 
mation from On-line News Correspondence of the A CM 
33 11 88-97 

Kupiec, J. (1992) Robust part-of-speech tagging Using a 
hidden Markov model. Computer Speech and Language, 
6:3 225-42. 

Lewis, D. (1991) Evaluating text categorisation. In 
Speech and natural language workshop, pp 136-143. 

Lewis, D. (1992a) Representation and learning in infor- 
mation retrieval. Ph.D. thesis, Computer Science Dept., 
Univ. Mass., Amherst, Ma. 

Lewis, D. (1992b) An Evaluation of Phrasal and Clus- 
tered Representations on a Text categorization problem. 
Proceedings of SIGIR 92. 

Lewis, D. (1992c) Feature selection and feature extraction for text categorization. In Speech and Natural Language: Proceedings of a Workshop held at Harrimn, NY. 
pp 212-217. 

Lewis, D. & K. Sparck-Jones (1993) Natural language 
processing for information retrieval University of Cam- 
bridge Technical report 307, Cambridge. 

Pearl, J. (1988) Probabilistic Reasoning in Intelligent 
Systems: Networks of Plausible Inference Morgan Kauf- 
mann, San Mateo, Ca. 

van Rijsbergen, C. J. (1979) Information retrieval. But- 
terworths, London. 

Sacks-Davis, R. (1990) Using Syntactic Analysis in a 
Document Retrieval System that Uses Signature Files. 
A CM SIGIR- 9O. 

Salton, G. & McGill, M. J. (1983) Introduction to mod- 
ern information retrieval. McGraw-Hill, NY. 

Salton, G. & C. Buckley (1988) Term Weighting Ap- 
proaches in Automatic Text Retrieval Information Pro- 
cessing and Management 24 5 513-23 

Zadeh, L. (1965) Fuzzy Sets Information and control, bf 
8 338-53. 
