Towards a Workbench for Acquisition of 
Domain Knowledge from Natural Language 
Andrei Mikheev and Steven Finch 
HCRC Language Technology Group 
University of Edinburgh 
2 Buccleuch Place, Edinburgh EH8 9LW, Scotland, UK 
E-maih Andrei.MikheevQed.ac.uk 
Abstract 
In this paper we describe an architecture 
and functionality of main components of 
a workbench for an acquisition of do- 
main knowledge from large text corpora. 
The workbench supports an incremental 
process of corpus analysis starting from 
a rough automatic extraction and or- 
ganization of lexico-semantic regularities 
and ending with a computer supported 
analysis of extracted data and a semi- 
automatic refinement of obtained hypo- 
theses. For doing this the workbench em- 
ploys methods from computational lin- 
guistics, information retrieval and kno- 
wledge engineering. Although the work- 
bench is currently under implementation 
some of its components are already im- 
plemented and their performance is il- 
lustrated with samples from engineering 
for a medical domain. 
1 Introduction 
One of the standard methods for the extraction 
of domain knowledge (or domain schema in ano- 
ther terminology) from texts is known as Distribu- 
tional Analysis (Hirshman 1986). It is based on 
the identification of the sublanguage specific co- 
occurrence properties of words in the syntactic re- 
lations in which they occur in the texts. These co- 
occurrence properties indicate important seman- 
tic characteristics of the domain: classes of ob- 
jects and their hierarchical inclusion, properties 
of these classes, relations among them, lexico- 
semantic patterns for referring to certain concep- 
tual propositions, etc. This knowledge about do- 
main in the form it is extracted is not quite sui- 
table to be included into the knowledge base and 
require a post-processing of the linguistically trai- 
ned knowledge engineer. This is known as a con- 
ceptual analysis of the acquired lingistic data. In 
general all this is a time consuming process and 
often requires the help of a domain expert. Ho- 
wever, it seems to be possible to automate some 
tasks and facilitate human intervention in many 
parts using a combination of NLP and statistical 
techniques for data extraction, type oriented pat- 
terns for conceptual characterization of this data 
and an intuitive user interface. 
All these resources are to be put together into 
a Knowledge Acquisition Workbench (KAWB) 
which is under development at LTG of the Uni- 
versity of Edinburgh. The workbench supports 
an incremental process of corpus analysis star- 
ting from a rough automatic extraction and orga- 
nization of lexico-semantic regularities and ending 
with a computer supported analysis of extracted 
data and a refinement of obtained hypotheses. 
2 KAW Architecture 
The workbench we are aiming at integrates com- 
putational tools and a user interface to support 
phases of data extraction, data analysis and hypo- 
theses refinement. The target domain description 
consists of words grouped into domain-specific se- 
mantic categories which can be further refined 
into a conceptual type lattice (CTL) and lexico- 
semantic patterns further refined into conceptual 
structures as shown elsewhere in the paper. KAW 
architecture is displayed in figure 1. 
A data extraction module provides the kno- 
wledge engineer with manageable units of lexi- 
cal data (words, phrases etc.) grouped together 
according to certain semantically important pro- 
perties. The data extraction phase can be subdivi- 
ded into a stage of semantic category identification 
and a stage of lexico-semantic pattern extraction. 
Both of these stages complement each other: the 
discovery of semantic categories allows the system 
to look for patterns and discovered patterns serve 
as diagnostic units for further extraction of these 
categories. Thus both these activities can be ap- 
plied iteratively until a certain level of precision 
and coverage is achieved. 
194 
Data Extraction Module 
Term 
clustering 
tool 
---I 
Corpus 
I 
case attachement 
robust parser 
tagger 
Linguistic analysis tools 
Word Class Identifier 
f 
/ 
Clusters 
refinement 
tool 
External sources 
access 
Fuzzy I 
f 
Lexical Pattern Finder 
Collocation 
identification 
tool 
Cluster Refinement 
tool 
Generalisation 
tool 
f 
matcher 
Analysis 
support 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
Analysis/Refinement Module 
Target data structures 
;emantic Lexico- 
:ategodes Semantic 
patterns 
Concept Concept 
Type Struct 
Lattice 
These data structures are initally 
produced by the procedures 
described above, and refined by 
the analysis/refinement tool to 
become a conceptual type lattice 
and a set of frames. 
t~ 
Figure 1: This figure shows main KAWB components and modules and SGML marked data flow between 
them. 
195 
The word class identification component en- 
compasses tools for the linguistic annotation of 
texts, word clustering tools and tools for access 
to external linguistic and semantic sources like 
thesauri, machine-readable dictionaries and lexi- 
cal data bases. Statistical clustering can be au- 
tomatically checked and subcategorized with the 
help of external linguistic and semantic sources. 
The pattern finder component makes use of 
phrasal annotations of texts produced by a ge- 
neral robust partial parser. First, the corpus is 
checked for stable phrasal collocations for single 
words and entire semantic clusters by a special 
tool - a collocator. After collocations are collec- 
ted another tool - a generalizer tries automatically 
deduce regularities and contract multiple patterns 
into their general representations. Such patterns 
are then presented for a conceptual characteriza- 
tion to the knowledge engineer and some prede- 
fined generic conceptual structures are suggested 
for specialization. 
The main aim of the analysis and refinement 
module is to uncover and refine structural gene- 
ralities found in the previous phases. It matches 
in the text patterns which represent hypotheses 
of the knowledge engineer, groups together and 
generalizes the found cases and presents them to 
the knowledge engineer for a final decision. The 
matcher evaluates how good a given piece of text 
matches the pattern and returns matches at va- 
rious levels of exactness. 
If modules are to communicate flexibly then an 
inter-module information representation format 
needs to be specified. Standard Generalized Mar- 
kup Language (Goldfab 1990) is an international 
standard for marking up text. We use SGML as 
a way of exchanging information between modules 
in a knowledge acquisition system, and of storing 
that information in persistent store when it has 
been processed. 
In the rest of the paper we will embark on a more 
detailed characterization of tools themselves. We, 
however, will not present any technical details and 
suggestions on an actual implementation because 
the workbench should be able to incorporate dif- 
ferent implementations . Some of the tools axe 
already implemented while others still need im- 
plementation or reimplementation in terms of the 
open architecture of the workbench. For an illu- 
stration we have used samples from engineering 
for the cardiac-failure domain using OHSUMED 
(Hersh 1994) corpus and a corpus of patient di- 
scharge summaries ( PDS ) described in Mikheev 
1994. 
3 Linguistic Annotation 
The simplest form of linguistic description of the 
content of a machine-readable document is in the 
form of a sequence (or a set) of words. More so- 
phisticated linguistic information comes in several 
forms, all of which may need to be represented if 
performance in an automatic acquisition of lexical 
regularities is to be improved. The NLP module 
of the KAWB consists of a word tagger (e.g. Ku- 
piec 1993), a specialized partial robust parser and 
a case attachment module. 
The tagger assigns categorial features to words. 
This is not a straightforward process due to the 
general lexical ambiguity of any natural langu- 
age but state-of-the-art taggers do this quite well 
(more than 97% correctness) using different stra- 
tegies usually based on an application of Hidden 
Markov Models (HMMs). 
It is well-known that a general text parsing is very 
fragile and ambiguous by its nature. Syntactic 
ambiguity can lead to hundreds of parses even for 
fairly simple sentences. This is clearly inappro- 
priate. However, general and full scale parsing 
is not required for many tasks of knowledge ac- 
quisition but rather a robust identification of cer- 
tain text segments is needed. Among these seg- 
ments are compound noun phrases, verb phrases 
etc. To increase a precision of knowledge extrac- 
tion in some cases it is quite important to resolve 
references of pronominal anaphora. At the mo- 
ment in parsing sentences we are using a temporal 
expressions recognizer, a noun-phrase recognizer, 
a simple verb-phrase recognizer and a simple ana- 
phoric binder. This can be further extended to 
treat other phenomena of natural language, pro- 
viding that new components are robust and fast. 
The parser supplies information to a case atta- 
chement module. This module using semantically 
driven role filler expectations for verbs provides a 
more precise attachment of noun phrases to verbs. 
To do this we are using ESK - an event and state 
knowledge base (Whittemore 94) which for more 
that 700 verbs contains information on thematic 
roles, semantic types for arguments, expected ad- 
juncts, syntactic information, propositional types, 
WordNet concept types and sense indices. 
4 Automatic Precategorization 
Semantic clustering of words from an underlying 
corpora allows the knowledge engineer to find out 
main semantic categories or types which exist in 
the domain in question and sort out the lexicon 
in accordance with these types. It is important 
both that information about typology the know- 
ledge engineer adds to the system is accurate, and 
that enough information is added. In this regard, 
196 
the Zipf-Mandelbrot law, which states that the 
frequency of the nth most frequent word in a na- 
tural language is (roughly) inversely proportional 
to n. Thus the majority of word tokens appear in 
a small fraction of the possible word types. 
Finch & Chater (1991) show how it is possible to 
infer a syntactic and semantic classification of a 
set of words by analyzing how they are used in 
a very large corpus. This is useful because very 
large corpora frequently exist for many domains. 
For example, in the medical domain, the freely 
available OHSUMED corpus (Hersh 1994) contains 
some 40 million words of medical texts. We now 
describe this method for inferring a syntactic and 
semantic classification of words from scratch. 
Firstly, we measure the contexts in which words 
w E W occur, and define a statistically motiva- 
ted similarity measure between contexts of occur- 
rence of words to infer a similarity between words, 
d(wl, w2), wl, w2 E W. In our case the context is 
defined to be a vector of word bigram statistics 
across the corpus for one and two words to the 
left and right, thus representing each word to be 
classified by a vector of bigram statistics. Then 
we apply a classification procedure to produce a 
hierarchical single link clustering (or dendrogram) 
(Sokal &Sneath, 1963) of words which we use as 
a basis for further classification. If this technique 
(as more fully described in Finch 1993) is applied 
to the OHSUMED corpus, some of the structure 
which is uncovered is displayed in figure 2. This 
figure displays part of a 3,000 word dendrogram 
which can then be "cut" at an appropriate level to 
form a set of disjoint classes which can then be or- 
dered according to their frequency of occurrence. 
This gives the knowledge engineer a means to 
quickly and relatively accurately classify the most 
frequent vocabulary used in a particular domain. 
5 Lexico-Semantic Pattren 
Acquisition 
Lexico-semantic patterns are structures where lin- 
guistic entries, semantic types and entire lexico- 
semantic patterns can be used in combinations 
to denote certain conceptual propositions of the 
underlying domain and cover certain sequences of 
words in the text. Linguistic entries can be words, 
phrases and linguistic types, for example: "of" - 
word, <NP head = "infarction"> - a noun phrase 
with the head-word "infarction", <SYNT type = 
N> - a noun etc. Patterns themselves are the ba- 
sis for induction of conceptual structures. 
An example of a correspondence of many phra- 
ses to one lexico-semantic pattern is shown in fi- 
gure 3. This pattern covers all strings which have 
a reference to a person followed by one of the li- 
sted verbs in any form followed by a compound 
noun with the head "infarction" and followed by 
a date expression. In this pattern $PERSON and 
$DATE are patterns themselves and all other con- 
stituents are linguistic entries. If instead of "inf- 
arction" we use a type \[DISEASE\] we can achieve 
even broader coverage. Also note that the $DATE 
constituent is optional which is expressed by "?". 
A conceptual structure which corresponds to the 
pattern adds more implicit information to the pat- 
tern. For instance, it states explicitly that a body 
component which is a location of a disease belongs 
to the person who is an experiencer of that disease: 
\[@infarction:V\] 
--~ (is-a)--~\[@disease\] 
-+(expr)-~\[@person:y\] 
-+ (loc)--+\[@body-COMP\]+--(has) +-\[@person:*y\] 
--+ (cul)-+ \[@time-point\] 
From the NL Processing point of view lexico- 
semantic patterns provide a way for going about 
without the definition of a general semantics for 
every word in the corpus. Many commonsense 
words take their particular meaning only in a con- 
text of domain categories and this can be expres- 
sed by means of lexico-semantic patterns. 
5.1 Coltocator 
The collocator or multi-word term extraction mo- 
dule finds in the corpus significant co-occurrence 
of lexical items (words and phrases) which con- 
stitute terminology. Identified by the robust par- 
tial parser noun and verb groups which include 
domain semantic categories elicited at the preca- 
tegorization phase are collected together with fre- 
quencies of their appearance in the corpus. Phra- 
ses are filtered through a list of general purpose 
words which is constructed separately for every 
new domain. Phrases which occur more often 
than a threshold computed using Zipf-Mandelbrot 
law are saved for post-analysis. Other phrases are 
decomposed into constituents for recalculation of 
saved phrase weights as described in Mikheev 91. 
Many terms include other terms as their compo- 
nents. This surface lexical structure corresponds 
to semantic relations between concepts represen- 
ted by these terms. To uncover term inclusion 
the system scans the term bank and replaces each 
entry of a term which currently in focus with its 
number. Figure 4 displays an excerpt from collo- 
cations extracted from PDS corpus in the original 
form and after term inclusion checking. 
5.2 Inner Context Categorization 
The major part of the terminology is usually re- 
presented by nouns or nominalizations. Such 
197 
morphine 
~ indomethacin 
epinephrine 
propranolol 
. nifedipine 
_ verapamil 
_ diltiazem 
_. halothane 
-. isoflurane bupivacaine 
fentanyl 
lidocaine 
dexamethasone _~ amiodarone 
methotrexate 
t 
-41 
disease 
~is ndrome order 
dysfunction infection 
failure 
injury 
infarction 
obstruction 
fibrosis 
trauma 
illness 
deficiency 
sclerosis 
mellitus 
renal pulmonary 
cardiac myocardial 
cerebral I ventricular 
coronary aortic 
vascular gastric 
arterial I venous 
respiratory I gastrointestinal I 
S 
----t 
acute chronic 
primary 
long-term 
new 
major 
multiple 
various single 
small lar\[ge 
early 
late 
Figure 2: This figure shows four sub-clusters of our hierarchical cluster analysis of the 3,000 most 
frequent words in the OHSUMED corpus (Hersh 1994). It shows a subcluster of drugs (top left), disease- 
based nouns (top right), body-part adjectives (lower left), and condition modifying adjectives (lower 
right). 
He had suffered 
He had 
She had had 
Mr.Mcdool sustained 
She developed 
an acute myocardial infarction 
a true posterior myocardial infarction 
an interior infarction 
a small anterior myocardial infarction 
an extensive myocardial infarction 
in 1992. 
on 5th of November 1992... 
in 1985. 
in October 92. 
$PERSON <V head = {suffer, have, sustain, develop} > <NC head = "infarction"> {"on", "in"} $DATE? 
Figure 3: This figure shows a correspondence of many phrases to one lexico-semantic pattern. 
Num Freq 
$136 373 
$234 475 
$467 550 
$1109 17 
$1154 48 
$2574 21 
$2974 23 
$2980 46 
$3004 79 
Annotated Phrase 
myocardial//BODY-PART infarction//DISEASE 
anterior myocardial//BODY-PART infarction//DISEASE 
inferior myoeardial//BODY-PART infarction//DISEASE 
established inferior myocardial//BODY-PART infarction//DISEASE 
history//INFORMATION of ischwemic heart//BODY-PART disease//DISEASE 
history//INFOR.MATION of an anterior myocardial//BODY-PART infarction//DISEASE 
moderately severe stenosis//DISEASE 
aortic//BODY-PART valve//BODY-PART stenosis//DISEASE 
stenosis//DISEASE in the right coronary//BODY-PART artery//BODY-PART 
Figure 4: This figure shows an excerpt from collocations extracted from PDS corpus and the result of 
term inclusion checking. 
198 
terms usually have a particular set of modifiers 
which represent different properties. The inner 
context categorization is started with extraction 
of compound nouns from collected by the colloca- 
tor noun phrases. Semantic categories for many 
adjectieval modifiers extracted at the word clu- 
stering phase are too general if any, but collected 
collocations and external lexical sources as, for ex- 
ample, WordNet can be used. 
First, we can sort terms with the same head-word 
by length. For example, for the type INFARCTION 
the systems sorts terms as follows: 
myocardial infarction, old infarction, acute infarction 
acute myocardial infarction, anterior myocardial inf- 
arction... 
further anterior myocardial infarction... 
Then we separate pure adjectival modifiers from 
adjectivized nouns: 
infarction : inferior, old, acute, post, further, antero- 
lateral, lateral, infero-posterior, antero-septal, repea- 
ted, significant, large, limited // myocardial, dia- 
phragmatic, subendocardial 
myocardial infarction : anterior, first, extensive, 
minor, small, previous, posterior, suspected. 
Next we cluster pure adjectival modifiers into 
groups using synonym-antonym information avai- 
lable in WordNet. However, it is not necessarily 
the case that related adjectives are stated together 
in one WordNet entry. Sometimes there is an in- 
direct link between adjectives. Also, since quite 
often WordNet gives semantically unrelated (in a 
given domain) adjectives together we use a heu- 
ristic rule which says that if two adjectives are 
used together in one phrase they don't hold syno- 
nymy//antonymy relation. 
The system assumes that if there is at least one 
word in common in WordNet entries for two dif- 
ferent adjectives they can be clustered together. 
In our example for the type INFARCTION the fol- 
lowing clusters were automatically obtained: 
duster 1: chronic vs. acute; 
duster 2: major, extensive, significant, large, old vs. 
minor, small, limited; 
cluster 3: post vs. previous, ensuing; 
cluster 4: anterior vs. posterior; 
duster 5: inferior vs. superior; 
rest: suspected; lateral; recent; further; repeated; 
As we see all clusters look fairly plausible except 
the single adjective "old" which was misclassified; 
it stands for a temporal property of an infarction 
rather than its spreading at a myocardium. 
This algorithm is gradually applied to al\]~ entries 
from the term bank and the knowledge engineer is 
presented with the results. This method was fairly 
successfully used in our experiment, however, a 
large-scale evaluation of sense discrimination for 
constituent words is still needed to be done. 
5.3 Outer Context Generalizer 
The lexico-semantic generalizer is a tool which 
extracts general lexico-semantic patterns in an 
empirical, corpus-sensitive manner analogous to 
that used to automatically extract word class 
dendrograms. From the multi-word term bank 
collected by the collocation tool, we derive se- 
mantic frames by replacing each content word in 
each phrase by its semantic category, derived eit- 
her empirically from the word-level dendrogram 
in the case of frequent words, or derived from 
WordNet in the case of less frequent words (as 
described above). We also part of speech tag 
every word in the phrase. Therefore, the term 
"myocardial infarction" might become "BODY- 
PART<adj> DISEASE<noun/s>", as might "ga- 
strointestinal obstruction" or "respiratory fai- 
lure". Another example might be the assignment 
of "DISEASE<noun/s> of BODYPART<noun/pl>" 
to "obstruction of arteries" (function words such 
as "of" are usually not further subcategorized, 
since they convey structural information in them- 
selves). Thus we map the term bank to a set of 
paradigms, and we choose the set of paradigms 
which appear most frequently for clustering. 
Clustering proceeds by mapping words in the cor- 
pus to their semantic category (augmented with 
part-of-speech information), and clustering in the 
same way as we did for words, except that the 
context vectors are recorded for the set of frequent 
semantic paradigms. For infrequent words where 
the empirical method for finding semantic class 
can't be applied, the WordNet technique descri- 
bed above is used. When this is done, we get a 
clustering of short lexico-semantic paradigms. 
Once this is achieved, we can again apply the same 
methodology to find patterns of higher level which 
include patterns themselves. In our notation, we 
refer to singe word semantic categories as upper- 
case labels (which we choose as being descriptive 
of the class which has been discovered), simple se- 
quences of semantic categories by a preceding "$", 
and a sequence of sequences by a preceding "$$". 
These higher level patterns can be clustered in the 
same way to yield longer semantic sequence para- 
digms. Figure 5 illustrates generalizations for the 
types $BODY-PART and $$DISEASE. 
5.4 Analysis Support Tool 
Type oriented analysis is facilitated with generic 
conceptual structures which are different for diffe- 
199 
Pattern Structure for $BODY-PART 
BODY-PART< adj > BODY-PART< noun/s > 
LOCATION< adj > BODY-PART< noun/s > 
LOCATION< adj > LOCATION< adj > BODY-PART< noun/s > 
Examples 
aortic valve 
left heart 
left descending artery 
I Pattern Structure for $$DISEASE 
$BODY-PART DISEASE< noun/s > 
DISEASE< noun/s > "in" $DATE 
$BODY-PART DISEASE< noun/s > "in" SDATE 
DISEASE< noun/s > "of" $BODY-PART 
antero-septal myocardial infarction 
infarction in December 1987 
myocardial infarction in December 1987 
occlusion of artery 
Figure 5: This figure shows results of generalization for the types SBODY-PART and $$DISEASE. 
rent conceptual types (as more fully described in 
Mikheev & Moens 1994). For example, a type ori- 
ented structure for eventualities includes their the- 
matic roles (agent, theme ...), temporal links and 
properties while a type-oriented structure for ob- 
jects includes their components, parts, areas and 
properties. The system recognizes which structure 
should be used and presents it to the knowledge 
engineer with optional explanations or a question 
guided strategy for filling it up. 
6 Hypotheses Refinement 
A fuzzy matcher is a tool which uses a sophisti- 
cated pattern-matching language to extract text 
fragments at various levels of exactness. It mat- 
ches in the text patterns which represent hypo- 
theses of the knowledge engineer, groups together 
and generalizes cases which have been discovered 
and presents them to the knowledge engineer for 
a final decision. 
Patterns themselves can be quite complex con- 
structions which can include strings, words, ty- 
pes, precedence relations and distance specifiers. 
In the simpliest case the knowledge engineer can 
examine a context for occurrences for a word or 
a type provided that the type exists in the term 
bank as represented in figure 6. 
More complex patterns can be used for the 
description of complex groups. For instance, 
there a request can be made to find all co- 
occurrences of the type DISEASE with the type 
BODY-COMPONENT when they are at the same 
structural group (noun phrase or verb phrase) and 
the disease is a head of the group: 
{ \[disease\]. < > \[body-component\] } 
curly brackets impose a context of a structural 
group, the "." means that the words can be dis- 
tributed in the group, <> means that the compo- 
nent can be both to the left and to the right, and 
since the DISEASE is the first element of the pat- 
tern it is assumed to be the head. The program 
matches this pattern into the following entries: 
myocardial infarction, infarction of myocardium, ste- 
nosis at the origin of left coronary artery... 
To be powerful enough for our purposes this pat- 
tern language should be quite complex and it is 
important to provide an easy way for specification 
of such patterns with a question-guided process. 
7 External Sources Access 
Already existing lexical databases are an im- 
portant source of information about constituent 
words of domain texts. KAWB provides generic 
facilities for access to such linguistic sources. For 
each source a converter which transforms source 
information into SGML marked data, which then 
can be used in the workbench, should be written. 
For some domains there already exist terminolo- 
gical banks available on-line. These banks vary 
in their linguistic coverage- some list all possi- 
ble forms (singular, plural etc.) for terms while 
others just a canonical one, and in a conceptual 
coverage - some provide an extensive set of diffe- 
rent relations among terms (concepts) others just 
a subsumption hierarchical inclusion. In our 
implementation we used Unified Medical Langu- 
age System (UMLS) and WordNet (Beckwith et al 
1990) - a publicly available lexical database, ho- 
wever we haven't provided the generic support for 
an abstract thesaurus yet. 
8 Conclusion 
The workbench outlined in this paper encompas- 
ses a number of tools which facilitate different sta- 
ges of knowledge extraction, analysis and refine- 
ment based on corpus processing paradigm. These 
tools are integrated into a coherent workbench 
with a common inter-module data flow interface 
based on SGML. Thus the workbench can easily 
integrate new tools and upgrade existing ones. 
The general approach to knowledge acquisition 
supported by the workbench is a combination of 
methods used in knowledge engineering, informa- 
200 
developed an anterior myocardial 
an established inferior myocardial 
an acute inferior myocardial 
subsequent episodes of unstable 
he has experienced unstable 
infarction from which 
infarction . The 
infarction with CHB 
angina including an 
angina and was 
Figure 6: This figure shows an excerpt from a search for the type DISEASE with a distance four to the 
left and two to the right. 
tion retrieval and computational linguistics. 
References 
Beckwith, R., C. Fellbaum, D. Gross and G. 
A. Miller (1990) WordNet: A lexical data- 
base organized on psycholinguistic principles. 
CSL Report 42, Cognitive Science Labora- 
tory, Princeton University, Princeton. 
Cutting, D., J. Kupiec, J. Pedersen and P. Sibun 
(1993) Beta test version of the Xerox tagger. 
Xerox Palo Alto Reseach Center, Palo Alto, 
Ca. 
Finch, S. and N. Chater (1991) A hybrid approch 
to learning syntactic categories. AISB Quart- 
erly 8(4), 35-41. 
Finch, S. P. (1993) Finding Structure in Langu- 
age. PhD thesis, Centre for Cognitive Science, 
University of Edinburgh, Edinburgh. 
Goldfarb, C. F. (1990) The SGML Handbook. Ox- 
ford: Clarendon Press. 
Health, U S. D.of (1993) UMLS Knowledge Sour- 
ces. Washington: National Library of Medi- 
cine. 
Hersh, W. (1994) An interactive retrieval evalua- 
tion and a new large test collection for rese- 
arch. In W. B. Croft and C. J. van Rijsbergen, 
eds., Proceedings of the 17th Annual Interna- 
tional Conference onResearch and Develop- 
ment in Information Retrieval, pp. 192-202. 
Hirschman, L. (1986) Discovering sublanguage 
structures. In R. Grishman and R. Kittredge, 
eds., Analyzing Language in Restricted Do- 
mains: Sublanguage Description and Proces- 
sing, pp. 211-234. Hillsdale, N.J.: Lawrence 
Erlbaum Associates. 
Mikheev, A. and M. Moens (1994) Acquiring and 
Representing Background Knowledge for a 
NLP System. In Proceedings of the AAAI Fall 
Symposium. 
Mikheev, A. (1991) A cognitive system .for con- 
ceptual knowledge extraction from NL texts. 
PhD thesis, Computer Science, Moscow Insti- 
tute for Radio-Engineering and Automation, 
Moscow. 
Sokal, R. R. and P. H. A. Sneath (1963) Principles 
of Numerical Taxonomy. San Fransisco: W. 
H. Freeman. 
Whittemore, G. and J. Hicks (1994) ESK: Event 
and state knowledge base. In AAAI Fall Sym- 
posium. 
201 
