Mining metalinguistic activity in corpora to create lexical resources using 
Information Extraction techniques: the MOP system 
Carlos Rodríguez Penagos 
Language Engineering Group, Engineering Institute 
UNAM, Ciudad Universitaria A.P. 70-472  
Coyoacán 04510  Mexico City, México 
CRodriguezP@iingen.unam.mx 
Abstract 
This paper describes and evaluates MOP, an 
IE system for automatic extraction of 
metalinguistic information from technical and 
scientific documents. We claim that such a 
system can create special databases to boot-
strap compilation and facilitate update of the 
huge and dynamically changing glossaries, 
knowledge bases and ontologies that are vital 
to modern-day research. 
1 Introduction 
Availability of large-scale corpora has made it 
possible to mine specific knowledge from free or 
semi-structured text, resulting in what many con-
sider by now a reasonably mature NLP technolo-
gy. Extensive research in Information Extraction 
(IE) techniques, especially with the series of Mes-
sage Understanding Conferences of the nineties, 
has focused on tasks such as creating and updating 
databases of corporate joint ventures or terrorist 
and guerrilla attacks, while the ACQUILEX pro-
ject used similar methods for creating lexical da-
tabases using the highly structured environment of 
machine-readable dictionary entries and other re-
sources. Gathering knowledge from unstructured 
text often requires manually crafting
knowledge-engineering rules both complex and deeply
dependent on the domain at hand, although some 
successful experiences using learning algorithms 
have been reported (Fisher et al., 1995; Chieu et 
al., 2003). 
  Although mining specific semantic relations 
and subcategorization information from free-text 
has been successfully carried out in the past 
(Hearst, 1998; Manning, 1993), automatically
extracting lexical resources (including terminological
definitions) from text in special domains has
been less explored. Recent experiences
(Klavans and Muresan, 2001; Rodríguez, 2001; Cartier,
1998) show, however, that compiling the extensive
resources that modern scientific and technical
disciplines need in order to manage the explosive growth
of their knowledge is both feasible and practical. A 
good example of this NLP-based processing need 
is the MedLine abstract database maintained by 
the National Library of Medicine1 (NLM), which 
incorporates around 40,000 Health Sciences pa-
pers each month. Researchers depend on these 
electronic resources to keep abreast of their rapid-
ly changing field. In order to maintain and update 
vital indexing references such as the Unified Me-
dical Language System (UMLS) resources, the 
MeSH and SPECIALIST vocabularies, the NLM 
staff needs to review 400,000 highly-technical 
papers each year. Clearly, neology detection, ter-
minological information update and other tasks 
can benefit from applications that automatically 
search text for information, e.g., when a new term 
is introduced or an existing one is modified due to 
data or theory-driven concerns, or, in general, 
when new information about sublanguage usage is 
being put forward. But the usefulness of robust 
NLP applications for special-domain text goes 
beyond glossary updates. The kind of categoriza-
tion information implicit in many definitions can 
help improve anaphora resolution, semantic ty-
ping or acronym identification in these corpora, as 
well as enhance “semantic rerendering” of
special-domain ontologies and thesauri (Pustejovsky 
et al., 2002).  
In this paper we describe and evaluate the 
MOP2 IE system, implemented to automatically 
create Metalinguistic Information Databases 
(MIDs) from large collections of special-domain 
                                                      
1 http://www.nlm.nih.gov/ 
2 Metalinguistic Operation Processor 
research papers. Section 2 will lay out the theory, 
methodology and the empirical research groun-
ding the application, while Section 3 will describe 
the first phase of the MOP tasks: accurate location 
of good candidate metalinguistic sentences for 
further processing. We experimented both with 
manually coded rules and with learning algo-
rithms for this task. Section 4 focuses on the pro-
blem of identifying and organizing into a useful 
database structure the different linguistic consti-
tuents of the candidate predications, a phase simi-
lar to what are known in the IE literature as 
Named-Entity recognition, Element and Scenario 
template fill-up tasks. Finally, Section 5 discusses 
results and problems of our experiments, as well 
as future lines of research. 
2 Metalanguage and term evolution in scien-
tific disciplines 
2.1 Explicit Metalinguistic Operations 
Preliminary empirical work to explore how
researchers modify the terminological framework of 
their highly complex conceptual systems included a
manual review of a corpus of 19 sociology articles 
(138,183 words) published in various British, 
American and Canadian academic journals with 
strict peer-review policies. We looked at how term 
manipulation was done as well as how
metalinguistic activity was signaled in text, both by
lexical and paralinguistic means. Some of the 
indicators found included verbs and verbal
phrases like called, known as, defined as, termed,
coined, dubbed, and descriptors such as term and 
word. Other non-lexical markers included
quotation marks, apposition and text formatting. 
A collection of potential metalinguistic patterns 
identified in the exploratory Sociology corpus was 
expanded (using other verbal tenses and forms) to 
116 queries sent to the scientific and learned do-
mains of the British National Corpus. The resul-
ting 10,937 sentences were manually classified as 
metalinguistic or otherwise, with 5,407 (49.6% of 
total) found to be truly metalinguistic sentences. 
The presence of the three components described
below (autonym, informative segment and
markers/operators) was the criterion for classification. 
Reliability of human subjects for this task has not 
been reported in the literature, and was not eva-
luated in our experiments. 
Careful analysis of this extensive corpus presen-
ted some interesting facts about what we have 
termed “Explicit Metalinguistic Operations” (or 
EMOs) in specialized discourse: 
A) EMOs usually do not follow the genus-differentia
scheme of Aristotelian definitions, nor do they
conform to the rigid and artificial structure of
dictionary entries. More often than not, specific
information about language use and term definition 
is provided by sentences such as: (1) This means 
that they ingest oxygen from the air via fine 
hollow tubes, known as tracheae, in which the 
term trachea is linked to the description fine 
hollow tubes in the context of a globally
non-metalinguistic sentence. Partial and heterogeneous 
information, rather than a complete definition, is 
much more common. 
B) Introduction of metalinguistic information in 
discourse is highly regular, regardless of the
specific domain. This can be credited to the fact that 
the writer needs to mark these sentences for
special processing by the reader, as they cut 
across two different semiotic levels: a
metalanguage and its object language, to use the
terminology of logic where these concepts originate.3 Their 
constitutive markedness means that most of the 
time these sentences will have at least two
indicators present, for example a verb and a
descriptor, or quotation marks, or even have preceding 
sentences that announce them in some way. These 
formal and cognitive properties of EMOs facilitate 
the task of locating them accurately in text.  
C) EMOs can be further analyzed into 3 distinct 
components, each with its own properties and lin-
guistic realizations: 
i) An autonym (see note 3): One or more self-
referential lexical items that are the logical or 
grammatical subject of a predication that need 
not be a complete grammatical sentence.  
                                                      
3 At a very basic semiotic level natural language has 
to be split (at least methodologically) into two distinct 
systems that share the same rules and elements: a meta-
language, which is a language that is used to talk about 
another one, and an object language, which in turn can 
refer to and describe objects in the mind or in the 
physical world. The two are isomorphic and this ac-
counts for reflexivity, the property of referring to itself, 
as when linguistic items are mentioned instead of being 
used normally in an utterance. Rey-Debove (1978) and 
Carnap (1934) call this condition autonymy.   
ii) An informative segment: a contribution of 
relevant information about the meaning, status, 
coding or interpretation of a linguistic unit. In-
formative segments constitute what we state 
about the autonymical element. 
iii) Markers/Operators: Elements used to mark 
or make prominent the whole discourse operation, 
on account of its non-referential,
metalinguistic nature. They are usually lexical,
typographic or pragmatic elements that articulate 
autonyms and informative segments into a 
predication. 
Thus, in a sentence such as (2), the [autonym] is 
marked in square brackets, the {informational 
segment} in curly brackets and the
<marker-operators> in angle brackets: 
(2) {The bit sequences representing quanta of 
knowledge} <will be called “>[Kenes]<”>, {a 
neologism intentionally similar to 'genes'}. 
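The three-component analysis can be made concrete with a small sketch that reads the bracket notation of example (2) back into its parts. The function name `parse_emo` and the dictionary keys are illustrative assumptions, not part of the MOP system.

```python
import re


def parse_emo(annotated):
    """Split a bracket-annotated EMO into its three component lists.

    Assumes the notation of example (2): [autonym], {informational
    segment} and <marker-operator>. The key names are hypothetical.
    """
    return {
        "autonyms": re.findall(r"\[([^\]]+)\]", annotated),
        "informational_segments": re.findall(r"\{([^}]+)\}", annotated),
        "markers": [m.strip() for m in re.findall(r"<([^>]+)>", annotated)],
    }


# Example (2) from the text, kept in its annotated form.
EXAMPLE_2 = (
    "{The bit sequences representing quanta of knowledge} "
    "<will be called “>[Kenes]<”>, "
    "{a neologism intentionally similar to 'genes'}."
)

components = parse_emo(EXAMPLE_2)
```

Here `components["autonyms"]` holds the single autonym Kenes, while the two informational segments and the two marker-operators (the verbal phrase and the closing quotation mark) are kept as separate list items.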
2.2 Defaults, knowledge and knowledge of 
language 
The 5,400 metalinguistic sentences from our 
BNC-based test corpus (henceforth, the EMO 
corpus) reflect an important aspect of scientific 
sublanguages, and of the scientific enterprise in 
general. Whenever scientists and scholars advance 
the state of the art of a discipline, the language 
they use has to evolve and change, and this build-
up is carried out under metalinguistic control. 
Previous knowledge is transformed into new 
scientific common ground and ontological com-
mitments are introduced and defended when se-
mantic reference is established. That is why when 
we want to structure and acquire new knowledge 
we have to go through a resource-costly cognitive 
process that integrates, within coherent conceptual 
structures, a considerable amount of new and very 
complex lexical items and terms.    
It has to be pointed out that non-specialized 
language is not abundant4 in these kinds of
metalinguistic exchanges because (except in the
context of language acquisition) we usually rely on a 
lexical competence that, although subsequently 
modified and enhanced, reaches the plateau of a 
generalized lexicon relatively early in our adult 
life. Technical terms can be thought of as seman-
tic anomalies, in the sense that they are ad hoc 
                                                      
4 Our study shows that they represent between 1 and 
6% of all sentences across different domains. 
constructs strongly bound to a model, a domain 
or a context, and are not, by definition, part of the 
far larger linguistic competence from a first native 
language. The information provided by EMOs is 
not usually inferable from previous information
available to the speaker’s community or expert group, and 
does not depend on general language competence 
by itself, but nevertheless is judged important and 
relevant enough to warrant the additional
processing effort involved. 
Conventional resources like lexicons and dic-
tionaries compile established meaning definitions. 
They can be seen as repositories of the default, 
core lexical information of words or terms used by 
a community (that is, the information available to 
an average, idealized speaker). A Metalinguistic 
Information Database (MID), on the other hand, 
compiles the real-time data provided by metalan-
guage analysis of leading-edge research papers, 
and can be conceptualized as an anti-dictionary: a 
listing of exceptions, special contexts and specific 
usage, of instances where meaning, value or 
pragmatic conditions have been spotlighted by 
discourse for cognitive reasons. The non-default 
and highly relevant information from MIDs could 
provide the material for new interpretation rules in 
reasoning applications, when inferences won’t 
succeed because the states of the lexico-
conceptual system have changed. When interpre-
ting text, regular lexical information is applied by 
default under normal conditions, but more specific 
pragmatic or discursive information can override 
it if necessary, or if context demands so (Lascari-
des & Copestake, 1995). A neologism or a word 
in an unexpected technical sense could stump an 
NLP system that assumes it will be able to use 
default information from a machine-readable dic-
tionary.  
3 Locating metalinguistic information in 
text: two approaches  
When implementing an IE application to mine 
metalinguistic information from text, the first is-
sue to tackle is how to obtain a reliable set of can-
didate sentences from free text for input into the 
next phases of extraction. From our initial corpus 
analysis we selected 44 patterns that showed the 
best reliability for being EMO indicators. We start 
our processing5  by tokenizing  text, which then is 
                                                      
5 Our implementation is Python-based, using the  
run through a cascade of finite-state devices based 
on identification patterns that extract a candidate 
set for filtering. Our filtering strategies in effect 
distinguish useful results such as (3) 
from non-metalinguistic instances like (4): 
(3) Since the shame that was elicited by the co-
ding procedure was seldom explicitly mentio-
ned by the patient or the therapist, Lewis 
called it unacknowledged shame. 
(4) It was Lewis (1971;1976) who called attention 
to emotional elements in what until then had 
been construed as a perceptual phenomenon .  
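The candidate extraction step can be sketched as follows. This is a minimal illustration with a single regular expression standing in for the cascade of finite-state devices; the handful of patterns listed is a hypothetical subset, since the 44 patterns actually used are not enumerated here, and `extract_candidates` is an illustrative name.

```python
import re

# Hypothetical subset of marker patterns (assumption: the real system
# uses 44 patterns selected from corpus analysis).
MARKER_PATTERNS = [
    r"\bso[- ]called\b",
    r"\b(?:is|are|was|were|be|been)\s+(?:called|termed|dubbed|coined)\b",
    r"\bknown\s+as\b",
    r"\bdefined\s+as\b",
    r"\b(?:calls?|called)\b",
    r"\bthe\s+(?:term|word)\b",
]
MARKER_RE = re.compile("|".join(MARKER_PATTERNS), re.IGNORECASE)


def extract_candidates(sentences):
    """Return (sentence, matched marker) pairs for sentences with a marker."""
    candidates = []
    for sentence in sentences:
        match = MARKER_RE.search(sentence)
        if match:
            candidates.append((sentence, match.group(0).lower()))
    return candidates
```

Note that both (3) and (4) would be returned as candidates by such a pattern match, since each contains the marker called; separating them is the job of the filtering step.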
 For this task, we experimented with two strate-
gies: First, we used corpus-based collocations to 
discard non-metalinguistic instances, for example 
the presence of attention in sentence (4) next to 
the marker called. Since immediate co-text seems 
important for this classification task, we also im-
plemented learning algorithms that were trained 
on a subset from our EMO corpus, using as vec-
tors either POS tags or word forms, at 1, 2, and 3 
positions adjacent before and after our markers. 
These approaches are representative of wider
paradigmatic approaches to NLP: symbolic and
statistical techniques, each with their own advantages 
and limitations. Our evaluations of the MOP sys-
tem are based on test runs over 3 document sets: 
a) our original exploratory corpus of sociology 
research papers [5581 sentences, 243 EMOs]; b) 
an online histology textbook [5146 sentences, 69 
EMOs] ; and c) a small sample from the MedLine 
abstract database [1403 sentences, 10 EMOs]. 
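The collocation-based strategy can be sketched as a lookup on the word that immediately follows the marker. The collocate table below is a hypothetical illustration (only attention next to called is attested in the text), and `keep_candidate` is an assumed name.

```python
# Hypothetical table of collocates that signal a non-metalinguistic use
# of a marker. Only ("called", "attention") comes from the text; the
# other entries are illustrative assumptions.
NON_METALINGUISTIC_COLLOCATES = {
    "called": {"attention", "for", "upon", "off"},
    "call": {"attention", "for", "upon", "off"},
}


def keep_candidate(tokens, marker):
    """Return False if the marker is followed by a blocked collocate."""
    blocked = NON_METALINGUISTIC_COLLOCATES.get(marker, set())
    for i, token in enumerate(tokens):
        if token.lower() == marker and i + 1 < len(tokens):
            if tokens[i + 1].lower() in blocked:
                return False
    return True
```

With this table, the final clause of (3), "Lewis called it unacknowledged shame", is kept, while "who called attention to emotional elements" from (4) is discarded.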
Using collocational information, our first ap-
proach fared very well, presenting good precision 
numbers, but not so encouraging recall. The so-
ciology corpus, for example, gave 0.94 precision 
(P) and 0.68 recall (R), while the histology one 
presented 0.9 P and 0.5 R. These low recall num-
bers reflect the fact that we only selected a subset 
of the most reliable and common metalinguistic 
patterns, and our list is not exhaustive. Example 
(5) shows one kind of metalinguistic sentence 
(with a copulative structure) attested in corpora, 
                                                                                  
NLTK toolkit (nltk.sf.net) developed by E. Loper and 
S. Bird at the University of Pennsylvania, although we 
have replaced stochastic POS taggers with an
implementation of the Brill algorithm by Hugo Liu at MIT. 
Our output files follow XML standards to ensure 
transparency, portability and accessibility. 
but that the system does not attempt to extract or 
process: 
(5) “Intercursive” power , on the other hand , is 
power in Weber's sense of constraint by an ac-
tor or group of actors over others. 
In order to better compare our two strategies, 
we decided to also zoom in on a more limited sub-
set of verb forms for extraction (namely, calls, 
called, call), which presented ratios of metalin-
guistic relevance in our MOP corpus, ranging 
from 100% positives (for the pattern so called + 
quotation marks) to 77% (called, by itself) to 31% 
(call). Restricted to these verbs, our metrics show 
precision and recall rates of around 0.97, and an 
overall F-measure of 0.97.6 Of 5581 sentences (96 
of which were metalinguistic sentences signaled 
by our cluster of verbs), 83 were extracted, with 
13 (or 15.6% of candidates) filtered-out by collo-
cations.  
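For reference, the F-measure used for these figures can be computed as below. This is the standard weighted harmonic mean of precision and recall; the function name is an assumption, and the ß factor of 1.0 follows the setting stated in the footnote.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta)."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Precision/recall pairs reported in the text for the collocation approach.
sociology_f = f_measure(0.94, 0.68)
histology_f = f_measure(0.9, 0.5)
call_cluster_f = f_measure(0.97, 0.97)
```

With ß = 1.0, the sociology figures (0.94 P, 0.68 R) give an F-measure of about 0.79 and the histology figures (0.9 P, 0.5 R) about 0.64, while equal precision and recall of 0.97 give 0.97, in line with the value reported for the restricted verb cluster.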
For our learning experiments (an approach we 
have called contextual feature language models), 
we selected two well-known algorithms that sho-
wed promise for this classification task.7 The nai-
ve Bayes (NB) algorithm estimates the conditional 
probability of a set of features given a label, using 
the product of the probabilities of the individual 
features given that label. The Maximum Entropy 
model establishes a probability distribution that 
favors entropy, or uniformity, subject to the cons-
traints encoded in the feature-label correlation. 
When training our ME classifiers, Generalized 
(GISMax) and Improved Iterative Scaling (IIS-
Max) algorithms are used to estimate the optimal 
maximum entropy of a feature set, given a corpus.  
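The naive Bayes decision rule described above can be illustrated with a small self-contained classifier. This is a sketch under stated assumptions, not the NLTK implementation used in our experiments: the class name, the add-one smoothing and the feature strings (such as "R1=attention" for the word one position to the right of the marker) are all illustrative choices.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-k smoothing (illustrative sketch)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, examples):
        """examples: iterable of (list of feature strings, label)."""
        for features, label in examples:
            self.label_counts[label] += 1
            for feature in features:
                self.feature_counts[label][feature] += 1
                self.vocabulary.add(feature)

    def log_scores(self, features):
        """Log prior plus the sum of log P(feature | label) per label."""
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, count in self.label_counts.items():
            score = math.log(count / total)
            denom = (sum(self.feature_counts[label].values())
                     + self.smoothing * vocab_size)
            for feature in features:
                num = self.feature_counts[label][feature] + self.smoothing
                score += math.log(num / denom)
            scores[label] = score
        return scores

    def classify(self, features):
        scores = self.log_scores(features)
        return max(scores, key=scores.get)


# Toy training data with hypothetical one-position word features.
classifier = NaiveBayes()
classifier.train([
    (["L1=Lewis", "R1=it"], "YES"),
    (["L1=are", "R1=endothermic"], "YES"),
    (["L1=who", "R1=attention"], "NO"),
    (["L1=he", "R1=for"], "NO"),
])
```

Working in log space avoids numerical underflow when the product of many small feature probabilities is taken.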
1,371 training sentences were converted into la-
beled vectors, for example using 3 positions and 
POS tags: ('VB WP NNP', 'calls', 'DT NN NN') 
/'YES'@[102]. The different number of positions 
considered to the left and right of the markers in 
our training corpus, as well as the nature of the 
features selected (there are many more word-types 
than POS tags) ensured that our 3-part vector in-
troduced a wide range of features against our 2 
possible YES-NO labels for processing by our 
algorithms. Although our test runs using only co-
llocations showed initially that structural regulari-
                                                      
6 With a ß factor of 1.0, and within the sociology 
document set 
7 see Ratnaparkhi (1997) and Berger et al. (1996) for 
a formal description of these algorithms 
ties would perform well, both with our restricted 
lemma cluster and with our wider set of verbs and 
markers, our intuitions about improvement with 
more features (more positions to the right or left 
of the markers) or a more controlled and gramma-
tically restricted environment (a finite set of su-
rrounding POS tags), turned out to be overly 
optimistic. Nevertheless, stochastic approaches 
that used short range features did perform very 
well, in line with the hand-coded approach.  
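The 3-part vectors described above can be built with a sketch like the following. The function name and the tagged sentence are hypothetical; the sentence was chosen only so that the output reproduces the shape of the example vector ('VB WP NNP', 'calls', 'DT NN NN').

```python
def context_vector(tagged_tokens, marker_index, positions=3, use_tags=True):
    """Build a (left context, marker, right context) feature triple.

    tagged_tokens: list of (word, POS tag) pairs for one sentence.
    positions: how many tokens to take on each side of the marker.
    use_tags: take POS tags if True, word forms otherwise.
    """
    pick = 1 if use_tags else 0
    left = tagged_tokens[max(0, marker_index - positions):marker_index]
    right = tagged_tokens[marker_index + 1:marker_index + 1 + positions]
    return (
        " ".join(item[pick] for item in left),
        tagged_tokens[marker_index][0],
        " ".join(item[pick] for item in right),
    )


# Hypothetical tagged sentence fragment (not taken from the corpus).
TAGGED = [
    ("ask", "VB"), ("what", "WP"), ("Smith", "NNP"), ("calls", "VBZ"),
    ("the", "DT"), ("status", "NN"), ("group", "NN"), (".", "."),
]

pos_vector = context_vector(TAGGED, 3, positions=3, use_tags=True)
word_vector = context_vector(TAGGED, 3, positions=1, use_tags=False)
```

Switching `use_tags` and `positions` covers the six feature configurations compared in the experiments (POS tags or word forms, at 1, 2 or 3 positions).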
The results of the different algorithms, re-
stricted to the lexeme call, are presented in Table 
1, while Figures 1 and 2 present best results in the 
learning experiments for the complete set of pat-
terns used in the collocation approach, over two of 
our evaluation corpora. 
Type    Positions  Tags/Words  Features  Accuracy  Precision  Recall
GISMax  1          W           1254      0.97      0.96       0.98
IISMax  1          T           136       0.95      0.96       0.94
IISMax  1          W           1252      0.92      0.97       0.9
GISMax  1          T           138       0.91      0.9        0.96
GISMax  2          T           796       0.88      0.93       0.92
IISMax  2          T           794       0.86      0.95       0.89
IISMax  3          W           4290      0.87      0.85       0.98
GISMax  3          W           4292      0.87      0.85       0.98
IISMax  2          W           3186      0.86      0.87       0.95
GISMax  2          W           3188      0.86      0.87       0.95
NB      1          T           136       0.88      0.97       0.84
NB      2          T           794       0.87      0.96       0.84
NB      3          W           4290      0.73      0.86       0.77
Table 1. Best metrics for “call” lexeme,
sorted by F-measure and classifier accuracy 
[Figure 1. Best metrics for Sociology corpus: P, R and F (axis from 0.6 to 0.95) for NB (3/T), IIS (1/W) and GIS (1/W).]
[Figure 2. Best metrics for Histology corpus: P, R and F (axis from 0.6 to 0.95) for NB (3/W), IIS (3/W) and GIS (1/W).]
Figures 1 & 2. Best results for filtering algorithms.8  
Both Knowledge-Engineering and supervised 
learning approaches can be adequate for extrac-
tion of metalinguistic sentences, although learning 
algorithms can be helpful when procedural rules 
have not been compiled; they also allow easier 
transport of systems to new thematic domains. We 
plan further research into stochastic approaches to 
fine tune them for the task.  
One issue that merits special attention is why 
some of the algorithms and features work well 
with one corpus, but not so well with another. 
This fact is in line with observations in Nigam et 
al. (1999) that naive Bayes and Maximum Entro-
py do not show fundamental baseline superiori-
ties, but are dependent on other factors. A hybrid 
approach that combines hand-crafted collocations 
with classifiers customized to each pattern’s be-
havior and morpho-syntactic contexts in corpora 
might offer better results in future experiments. 
4 Processing EMOs to compile metalinguis-
tic information databases 
Once we have extracted candidate EMOs, the 
MOP system conforms to a general processing 
architecture shown in Figure 3. POS tagging is 
followed by shallow parsing that attempts limited 
PP-attachment. The resulting chunks are then tag-
ged semantically as Autonyms, Agents, Markers, 
Anaphoric elements or simply as Noun Chunks, 
                                                      
8 Legend: P: Precision; R: Recall; F:  F-Measure. NB: na-
ïve Bayes; IIS: Maximum Entropy trained with Improved 
Iterative Scaling; GIS: Maximum Entropy trained with Gen-
eralized Iterative Scaling. (Positions/Feature type) 
using heuristics based on syntactic, pragmatic and 
argument structure observation of the extraction 
patterns.  
Next, a predicate processing phase selects the 
most likely surface realization of informational 
segments, autonyms and markers-operators, and 
proceeds to fill the templates in our databases. 
This was done by following different processing 
routes customized for each pattern using corpus 
analysis as well as FrameNet data from Name 
conferral and Name bearing frames to establish 
relevant arguments and linguistic realizations.  
Figure 3. MOP Architecture 
As mentioned earlier, informational segments 
present many realizations that distance them from 
the clarity, completeness and conciseness of
lexicographic entries. In fact, they may show up as 
full-fledged clauses (6), as inter- or
intra-sentential anaphoric elements (7 and 8, the first 
one a relative clause), supply a categorization
descriptor (9), or even (10) restrict themselves
semantically to what we could call a
sententially-unrealized “existential variable” (with logical 
form ∃x) indicating only that a certain discourse 
entity is being introduced. 
(6) In 1965 the term soliton was coined to descri-
be waves with this remarkable behaviour.   
(7) This leap brings cultural citizenship in line 
with what has been called the politics of citi-
zenship . 
(8) They are called “endothermic compounds.”  
(9) One of the most enduring aspects of all social 
theories are those conceptual entities known 
as structures or groups. 
(10) A ∃x so called cell-type-specific TF can be 
used by closely related cells, e.g., in
erythrocytes and megakaryocytes. 
We have not included an anaphora-resolution 
module in our present system, so that instances 7, 
8 and 10 will only display in the output as unre-
solved surface element or as existential variable 
place-holders,9 but these issues will be explored in 
future versions of the system. Nevertheless, much 
more common occurrences as in (11) and (12) are 
enough to create MIDs quite useful for lexicogra-
phers and for NLP lexical resources.  
(11) The Jovian magnetic field exerts an influ-
ence out to near a surface, called the 
"magnetopause". 
(12) Here we report the discovery of a soluble 
decoy receptor, termed decoy receptor 3 
(DcR3)...  
The correct database entry for example 12 is 
presented in Table 4. 
Reference:          MedLine sample # 6 
Autonym:            decoy receptor 3 (DcR3) 
Information:        a soluble decoy receptor  
Markers/Operators:  termed  
Table 4. Sample entry of MID 
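A MID record such as the one above could be represented and serialized to XML along the following lines. This is only a sketch: the class name, the element names and the attribute layout are assumptions for illustration and do not reproduce the actual schema of the MOP output files.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class MIDEntry:
    """One record of a Metalinguistic Information Database (sketch)."""
    reference: str
    autonym: str
    information: str
    markers: str

    def to_xml(self):
        # Element and attribute names are hypothetical.
        entry = ET.Element("entry", reference=self.reference)
        ET.SubElement(entry, "autonym").text = self.autonym
        ET.SubElement(entry, "information").text = self.information
        ET.SubElement(entry, "markers").text = self.markers
        return ET.tostring(entry, encoding="unicode")


# The sample entry for example (12).
sample = MIDEntry(
    reference="MedLine sample # 6",
    autonym="decoy receptor 3 (DcR3)",
    information="a soluble decoy receptor",
    markers="termed",
)
sample_xml = sample.to_xml()
```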
The final processing stage presents metrics 
shown in Figure 4, using a ß factor of 1.0 to esti-
mate F-measures. To better reflect overall perfor-
mance in all template slots, we introduced a 
threshold of similarity of 65% for comparison 
between a gold standard slot entry and the one 
provided by the application. Thus, if the autonym 
or the informational segment is at least 2/3 of the 
correct response, it is counted as a positive, in 
many cases leveling the field for the expected 
errors in the prepositional phrase- or acronym- 
attachment algorithms, but accounting for a (basi-
cally) correct selection of superficial sentence 
segments. 
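The thresholded slot comparison can be sketched as below. The similarity measure shown (the share of gold standard tokens that the system output recovers) is an assumption made for illustration, as is the function name; only the 65% threshold comes from the evaluation described above.

```python
def slot_matches(gold, predicted, threshold=0.65):
    """Count a slot as correct if enough gold tokens are recovered.

    Assumption: similarity is the fraction of gold standard tokens
    that also appear in the predicted slot (case-insensitive).
    """
    gold_tokens = gold.lower().split()
    predicted_tokens = set(predicted.lower().split())
    if not gold_tokens:
        return not predicted_tokens
    covered = sum(1 for token in gold_tokens if token in predicted_tokens)
    return covered / len(gold_tokens) >= threshold
```

Under this measure, an autonym slot filled with "decoy receptor 3" against the gold entry "decoy receptor 3 (DcR3)" recovers three of four tokens and counts as a positive, which is the kind of acronym-attachment error the threshold is meant to tolerate. A stricter variant would also penalize over-long predictions.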
 
 
                                                      
9 For sentence (8) the system would retrieve a
previous sentence (“A few have positive enthalpies of
formation”) to define “endothermic compounds”. 
[Figure 3 diagram: Corpus tokenization -> Candidate extraction -> Candidate filtering (collocations or learning) -> POS tagging & partial parsing -> Semantic labeling -> Database template fill-up -> MID]
5 Results, comparisons and discussion 
The DEFINDER system (Klavans and Muresan, 2001) at 
Columbia University is, to my knowledge, the 
only one fully comparable with MOP, both in 
scope and goals, but some basic differences be-
tween them exist. First, DEFINDER examines 
user-oriented documents that are bound to contain 
fully-developed definitions for the layman, as the 
general goal of the PERSIVAL project is to pre-
sent medical information to patients in a less tech-
nical language than the one of reference literature. 
MOP focuses on leading-edge research papers that 
present the less predictable informational templa-
tes of highly technical language. Secondly, by the 
very nature of DEFINDER’s goals their qualitati-
ve evaluation criteria include readability, useful-
ness and completeness as judged by lay subjects, 
criteria which we have not adopted here. Neither 
have we determined coverage against existing on-
line dictionaries, as they have done. Taking into 
account the above-mentioned differences between 
the two systems’ methods and goals, MOP com-
pares well with the 0.8 Precision and 0.75 Recall 
of DEFINDER. While the resulting MOP “defini-
tions” generally do not present high readability or 
completeness, these informational segments are 
not meant to be read by laymen, but used by do-
main lexicographers reviewing existing glossaries 
for neological change, or, for example, in machi-
ne-readable form by applications that attempt au-
tomatic categorization for semantic rerendering of 
an expert ontology, since definitional contexts 
provide sortal information as a natural part of the 
process of precisely situating a term or concept 
against the meaning network of interrelated lexi-
cal items. The Metalinguistic Information Databa-
ses in their present form are not, in full justice, 
lexical knowledge bases comparable with the 
highly-structured and sophisticated resources that 
use inheritance and typed features, like LKB (Co-
pestake et al., 1993). MIDs are semi-structured 
resources (midway between raw corpora and 
structured lexical bases) that can be further pro-
cessed to convert them into usable data sources, 
along the lines suggested by Vossen and Copesta-
ke (1993) for the syntactic kernels of lexicograp-
hic definitions, or by Pustejovsky et al. (2002) 
using corpus analytics to increase the semantic 
type coverage of the NLM UMLS ontology. An-
other interesting possibility is to use a dynami-
cally-updated MID to trace the conceptual and 
terminological evolution of a discipline. 
We believe that low recall rates in our tests are 
in part due to the fact that we are dealing with the 
wider realm of metalinguistic information, as op-
posed to structured definitional sentences that 
have been distilled by an expert for consumer-
oriented documents. We have opted in favor of 
exploiting less standardized, non-default metalin-
guistic information that is being put forward in 
text because it can’t be assumed to be part of the 
collective expert-domain competence (Section 
2.1). In doing so, we have exposed our system to 
the less predictable and highly charged lexical 
environment of leading-edge research literature, 
the cauldron where knowledge and terminological 
systems are forged in real time, and where
scientific meaning and interpretation are constantly
debated, modified and agreed upon. We have not
performed major customization of the system (like 
enriching the tagging lexicon with medical terms), 
in order to preserve the ability to use the system 
across different domains. Domain customization 
may improve metrics, but at a cost to portability. 
[Figure 4. Metrics for 3 corpora (# of Records/Global F-Measure): Precision and Recall (axis from 0.6 to 1) for Global, Informational Segments and Autonyms. Series: Histology (35/0.71), Sociology (143/0.77), MedLine (10/0.78).]
The implementation we have described here 
undoubtedly shows room for improvement in so-
me areas, including: adding other patterns for bet-
ter overall recall rates, deeper parsing for more 
accurate semantic typing of sentence arguments, 
etc. Also, the issue of which learning algorithms 
can better perform the initial filtering of EMO 
candidates is still very much an open question. 
Applications that can turn MIDs into truly useful 
lexical resources by further processing them need 
to be written. We plan to continue development of 
our proof-of-concept system to explore those ar-
eas. DEFINDER and MOP both show great poten-
tial as robust lexical acquisition systems capable 
of handling the vast electronic resources available 
today to researchers and laymen alike, helping to 
make them more accessible and useful. In doing 
so, they are also fulfilling the promise of NLP 
techniques as mature and practical technologies.   
References 
ACQUILEX projects, final report available at: 
http://www.cl.cam.ac.uk/Research/NL/acquilex/ 
Berger, A., S. Della Pietra  et al., 1996. A Maxi-
mum Entropy Approach to Natural Language 
Processing. Computational Linguistics, vol. 22, 
no. 1. 
Carnap, R. 1934. The Logical Syntax of Language.
Routledge and Kegan Paul, London 1964. 
Cartier, E. 1998. Analyse Automatique des textes: 
l’exemple des informations définitoires. RIFRA 
1998. Sfax, Tunisia. 
Chieu, Hai Leong, Ng, Hwee Tou, & Lee, Yoong 
Keok. 2003. Closing the Gap: Learning-Based 
Information Extraction Rivaling Knowledge-
Engineering Methods. 41st ACL. Sapporo, Ja-
pan. 
Copestake, A., Sanfilippo, A., Briscoe, T. and de 
Paiva, V. 1993. The ACQUILEX LKB: An
introduction. In: Inheritance, Defaults and the 
Lexicon. Cambridge University Press. 
Fisher, D., S. Soderland, J. McCarthy, F. Feng, 
and W. Lehnert. 1995. Description of the 
UMass system as used for MUC-6. In Proceed-
ings of MUC-6 
Hearst, M. 1998. Automated discovery of wordnet 
relations. In Christiane Fellbaum, editor, 
WordNet: An Electronic Lexical Database. MIT 
Press, Cambridge, MA 
Klavans, J. and S. Muresan. 2001. Evaluation of 
the DEFINDER System for Fully Automatic 
Glossary Construction, proceedings of the 
American Medical Informatics Association 
Symposium 2001 
Lascarides, A. and Copestake A. 1995. The Prag-
matics of Word Meaning, Proceedings of the 
AAAI Spring Symposium Series: Representa-
tion and Acquisition of Lexical Knowledge: 
Polysemy, Ambiguity and Generativity, Stan-
ford CA. 
Manning, Ch. 1993. Automatic acquisition of a 
large subcategorization dictionary from cor-
pora, In Proceedings of the 31st ACL, Colum-
bus, OH. 
Nigam, K., Lafferty, J., and McCallum, A.  1999. 
Using Maximum Entropy for Text Classifica-
tion, IJCAI-99 Workshop on Machine Learning 
for Information Filtering, pp. 61-67 
Pustejovsky J., A. Rumshisky and J. Castaño. 
2002. Rerendering Semantic Ontologies: Auto-
matic Extensions to UMLS through Corpus 
Analytics. LREC 2002 Workshop on Ontologies 
and Lexical Knowledge Bases. Las Palmas, Ca-
nary Islands, Spain. 
Ratnaparkhi A. 1997.  A Simple Introduction to 
Maximum Entropy Models for Natural Lan-
guage Processing, TR 97-08, Institute for Re-
search in Cognitive Science, University of 
Pennsylvania 
Rey-Debove, J. 1978. Le Métalangage. Le Robert, 
Paris. 
Rodríguez, C. 2001. Parsing Metalinguistic 
Knowledge from Texts,  Selected papers from 
CICLING-2000 Collection in Computer Science 
(CCC); National Polytechnic Institute (IPN), 
Mexico. 
Vossen, P. and Copestake, A. 1993. Untangling 
Definition Structure into Knowledge Represen-
tation. In: Inheritance, Defaults and the Lexi-
con. 
