Corpus Based Statistical Generalization Tree in Rule Optimization * 
Joyce Yue Chai Alan W. Biermann
Department of Computer Science 
Box 90129, Duke University 
Durham, NC 27708-0129 
chai@cs.duke.edu awb@cs.duke.edu 
Abstract 
A corpus-based statistical Generalization Tree 
model is described to achieve rule optimization
for the information extraction task. First, the 
user creates specific rules for the target informa- 
tion from the sample articles through a training 
interface. Second, WordNet is applied to gener- 
alize noun entities in the specific rules. The de- 
gree of generalization is adjusted to fit the user's 
needs by use of the statistical Generalization Tree 
model. Finally, the optimally generalized rules
are applied to scan new information. The results 
of experiments demonstrate the applicability of 
our Generalization Tree method. 
Introduction 
Research on corpus-based natural language learning 
and processing is rapidly accelerating following the in- 
troduction of large on-line corpora, faster computers, 
and cheap storage devices. Recent work involves novel 
ways to employ annotated corpora in part-of-speech tagging
(Church 1988) (Derose 1988) and the application
of mutual information statistics to corpora to uncover
lexical information (Church 1989). The goal of
the research is the construction of robust and portable 
natural language processing systems. 
The wide range of topics available on the Internet
calls for an information extraction system that can be
adapted easily to different domains. Adapting an extraction
system to a new domain is a tedious process. In the
traditional customization process, the given corpus must be
studied carefully in order to find all the possible ways
of expressing the target information. Many research groups
are working on efficient customization of information
extraction systems, such as BBN (Weischedel
1995), NYU (Grishman 1995), SRI (Appelt, Hobbs,
et al. 1995), SRA (Krupka 1995), MITRE (Aberdeen,
Burger, et al. 1995), and UMass (Fisher, Soderland, et
al. 1995).
*This work has been supported by a Fellowship from
IBM Corporation.
We employ a rule optimization approach and implement
it in our trainable information extraction system.
The system allows the user to train on a small amount
of data in the domain and creates specific rules.
It then automatically extracts a generalization from
the training corpus and makes the rules general enough for
new information, depending on the user's needs. In
this way, rule generalization makes customization
for a new domain easier.
This paper specifically describes the automated rule
optimization method and the use of WordNet (Miller
1990). A Generalization Tree (GT) model based on the
training corpus and WordNet is presented, along with
how the GT model is used by our system to automatically
learn and control the degree of generalization
according to the user's needs.
System Overview 
The system contains three major subsystems which,
respectively, address training, rule optimization, and
the scanning of new information. The overall structure
of the system is shown in Figure 1. First, each
article is partially parsed and segmented into Noun
Phrases, Verb Phrases and Prepositional Phrases. An
IBM LanguageWare English Dictionary and Computing
Term Dictionary, a Partial Parser¹, a Tokenizer
and a Preprocessor are used in the parsing process.
The Tokenizer and the Preprocessor are designed to
identify some special categories such as e-mail addresses,
phone numbers, states and cities. In the training process,
the user, with the help of a graphical user interface
(GUI), scans a parsed sample article and indicates
a series of semantic net nodes and transitions
that he or she would like to create to represent the
information of interest. Specifically, the user designates
those noun phrases in the article that are of interest
and uses the interface commands to translate them
¹We wish to thank Jerry Hobbs of SRI for providing us
with the finite-state rules for the parser.
Figure 1: System Overview (diagram; components shown: Training Article, Rule Generator, Rule Optimization Process, Unseen Article, Rule Matching Routines)
into semantic net nodes. Furthermore, the user designates
verb phrases and prepositions that relate the
noun phrases and uses commands to translate them
into semantic net transitions between nodes. In the
process, the user indicates the desired translation of
the specific information of interest into a semantic net
form that can easily be processed by the machine. For
each headword in a noun phrase, WordNet is used to
provide sense information. Usually, 90% of the words in
the domain are used in sense one (the most frequently
used sense) as defined in WordNet. However, some
words are used in other senses. For example, "opening"
often appears in the job advertisement domain, but
instead of the first sense, {opening, gap}, it
takes the fourth sense, {opportunity, chance}. For this
reason, for headwords with senses other than
sense one, the user needs to identify the appropriate
senses, and the Sense Classifier keeps a record of
these headwords and their most frequently used senses.
When the user takes the action to create the semantic
transitions, a Rule Generator keeps track of the user's
moves and creates the rules automatically. These rules
are specific to the training articles, and they need to
be generalized before they can be applied to other unseen
articles in the domain. According to the different requirements
of the user, the Rule Optimization Engine,
based on WordNet, generalizes the specific rules created
in the training process and forms a set of optimized
rules for processing new information. This rule
optimization process is explained in later sections.
During the scanning of new information, with
the help of a rule matching routine, the system applies
the optimized rules to a large number of unseen articles
from the domain. Headwords in the phrases that
are not in the Sense Classifier table are assigned
sense one in WordNet; for the others, the
Sense Classifier provides the system with their most frequently
used senses in the domain. The output of the
system is a set of semantic transitions for each article
that specifically extract information of interest to the
user. Those transitions can then be used by a Postprocessor
to fill templates, answer queries, or generate
abstracts (Bagga, Chai 1997).
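The sense lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `SENSE_TABLE` and `lookup_sense` and the dictionary representation are hypothetical; only the "opening" entry (sense four) and the fallback to sense one come from the text.

```python
DEFAULT_SENSE = 1  # sense one: the most frequently used WordNet sense

# Hypothetical Sense Classifier table: headword -> most frequent sense in the domain.
# Only headwords whose domain sense differs from sense one are recorded.
SENSE_TABLE = {"opening": 4}  # {opportunity, chance} in the job advertisement domain


def lookup_sense(headword, table=SENSE_TABLE):
    """Return the recorded domain sense, falling back to sense one."""
    return table.get(headword.lower(), DEFAULT_SENSE)


print(lookup_sense("opening"))     # 4
print(lookup_sense("programmer"))  # 1
```

Storing only the exceptions keeps the table small, since most headwords in the domain are reported to use sense one.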
Rule Operations 
Our trainable information extraction system is a rule-based
system, which involves three aspects of rule operations:
rule creation, rule generalization and rule application.
Rule Creation
In a typical information extraction task, the most interesting
part is the events and the relationships holding
among the events (Appelt, Hobbs, et al. 1995). These
relationships are usually specified by verbs and prepositions.
Based on this observation, the left hand side
(LHS) of our meaning extraction rules is made up of
three entities. The first and the third entities are the
target objects in the form of noun phrases; the second
entity is the verb or prepositional phrase indicating
the relationship between the two objects. The right
hand side (RHS) of the rule consists of the operations
Training Sentence: DCR Inc. is looking for C programmers.
GUI Commands: ADD_NODE
Semantic Transition: DCR Inc. --look for--> programmers
Specific Rule Created by Rule Generator:
[DCR Inc., NG, 1, company], [look_for, VG, 1, other_type], [programmer, NG, 1, other_type]
--> ADD_NODE(DCR Inc.), ADD_NODE(programmer), ADD_RELATION(look_for, DCR Inc., programmer)
Figure 2: The Rule Creation Process
required to create a semantic transition: ADD_NODE and
ADD_RELATION.
For example, during the training process, as shown
in Figure 2, the user trains on the sentence "DCR Inc.
is looking for C programmers...", and would like to
designate the noun phrases (as found by the parser)
as semantic net nodes and the verb phrase as
a transition between them. The training interface
provides the user with the ADD_NODE and ADD_RELATION
GUI commands to accomplish this. ADD_NODE
adds an object to the semantic transition.
ADD_RELATION adds a relationship between two
objects. The specific rule is created automatically by
the Rule Generator according to the user's moves.
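As a rough illustration of the rule structure just described, the following Python sketch represents the LHS as three entities and derives the RHS operations from them. The class names `Entity` and `SpecificRule` and the string form of the operations are assumptions made for this example; the field values are those of the rule in Figure 2.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entity:
    headword: str  # e.g. "programmer"
    category: str  # phrase category tag, e.g. "NG" or "VG"
    sense: int     # WordNet sense number
    sem_type: str  # semantic type from the preprocessor, e.g. "company"


@dataclass(frozen=True)
class SpecificRule:
    lhs: Tuple[Entity, Entity, Entity]  # (first object, relation, third object)

    def rhs(self) -> List[str]:
        """Operations that create the semantic transition."""
        first, relation, third = self.lhs
        return [
            f"ADD_NODE({first.headword})",
            f"ADD_NODE({third.headword})",
            f"ADD_RELATION({relation.headword}, {first.headword}, {third.headword})",
        ]


rule = SpecificRule((
    Entity("DCR Inc.", "NG", 1, "company"),
    Entity("look_for", "VG", 1, "other_type"),
    Entity("programmer", "NG", 1, "other_type"),
))
print(rule.rhs())
```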
Rule Generalization 
The rule created by the rule generator as shown in Fig- 
ure 2 is very specific, and can only be activated by the 
training sentence. It will not be activated by other sen- 
tences such as "IBM Corporation seeks job candidates 
in Louisville...". Semantically speaking, these two sen- 
tences are very much alike. Both of them express that 
a company is looking for professional people. However, 
without generalization, the second sentence will not be 
processed. So the use of the specific rule is very limited.
In order to make the specific rules applicable to a
large number of unseen articles in the domain, a comprehensive
generalization mechanism is necessary. We
use the power of WordNet to achieve generalization.
Introduction to WordNet
WordNet is a large-scale
on-line dictionary developed by George Miller and
colleagues at Princeton University (Miller, et al. 1990a).
The most useful feature of WordNet to the Natural 
Language Processing community is its attempt to or- 
ganize lexical information in terms of word meanings, 
rather than word forms. Each entry in WordNet is a 
concept represented by the synset. A synset is a list 
of synonyms, such as {engineer, applied scientist, tech- 
nologist} . The information is encoded in the form of 
semantic networks. For instance, in the network for 
nouns, there are "part of", "is_a", "member of", ...
relationships between concepts. Philip Resnik wrote that
"...it is difficult to ground taxonomic representations
such as WordNet in precise formal terms, the use of
the WordNet taxonomy makes reasonably clear the nature
of the relationships being represented..." (Resnik
1993). The hierarchical organization of WordNet by
word meanings (Miller 1990) provides the opportunity 
for automated generalization. With the large amount 
of information in semantic classification and taxon- 
omy provided in WordNet, many ways of incorporat- 
ing WordNet semantic features with generalization are 
foreseeable. At this stage, we only concentrate on the 
Hypernym/Hyponym feature. 
A hyponym is defined in (Miller, et al. 1990a) as follows:
"A noun X is said to be a hyponym of a noun Y
if we can say that X is a kind of Y. This relation generates
a hierarchical tree structure, i.e., a taxonomy.
A hyponym anywhere in the hierarchy can be said to
be "a kind of" all of its superordinates...." If X is a
hyponym of Y, then Y is a hypernym of X.
Generalization
From the training process, the specific
rules contain three entities on the LHS, as shown in
Figure 3. Each entity (sp) is a quadruple of the form
(w, c, s, t), where w is the headword of the trained
phrase; c is the part of speech of the word; s is the
sense number representing the meaning of w; and t is the
semantic type identified by the preprocessor for w.
For each sp = (w,c,s,t), if w exists in WordNet, 
then there is a corresponding synset in WordNet. The 
hyponym/hypernym hierarchical structure provides a 
1. An Abstract Specific Rule:
(w1, c1, s1, t1), (w2, c2, s2, t2), (w3, c3, s3, t3)
--> ADD_NODE(w1), ADD_NODE(w3), ADD_RELATION(w2, w1, w3)
2. A Generalized Rule:
(W1, C1, S1, T1) ∈ Generalize(sp1, h1), (W2, C2, S2, T2) ∈ Generalize(sp2, h2),
(W3, C3, S3, T3) ∈ Generalize(sp3, h3)
--> ADD_NODE(W1), ADD_NODE(W3), ADD_RELATION(W2, W1, W3)
Figure 3: Sample Rules
sp = (programmer, NG, 1, other_type)
various generalization degrees:
Generalize(sp, 1) = {engineer, applied scientist, technologist}
Generalize(sp, 2) = {person, individual, someone, ...}
Generalize(sp, 3) = {life form, organism, being, ...}
Generalize(sp, 4) = {entity}
Figure 4: Generalization for a Specific Concept
way of locating the superordinate concepts of sp. By
following additional hypernym links, we get more and
more generalized concepts and eventually reach the
most general concept, such as {entity}. As a result,
for each concept, different degrees of generalization
can be achieved by adjusting the distance between this
concept and the most general concept in the WordNet
hierarchy (Chai, Biermann 1997). The function that
accomplishes this task is Generalize(sp, h), which returns
the hypernym h levels above the concept sp in the hierarchy.
Generalize(sp, 0) returns the synset of sp. For
example, in Figure 4, the concept {programmer} is generalized
at various levels based on the WordNet hierarchy.
WordNet is an acyclic graph rather than a strict tree, so
a synset might have more than one hypernym. However,
this situation does not happen often. When it
does, the system selects the first hypernym path.
The process of generalizing rules consists of replacing
each sp = (w,c,s,t) in the specific rules by a more
general superordinate synset from its hypernym hierarchy
in WordNet by performing the Generalize(sp, h)
function. The degree of generalization for rules varies
with the value of h in Generalize(sp, h).
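The behavior of Generalize(sp, h) can be illustrated with a small Python sketch. It does not query WordNet: the table `HYPERNYM_PATHS` is a hand-written stand-in holding the single path from Figure 4 (with "..." members omitted), and the function name `generalize` and the choice to clamp h at the top of the path are assumptions of this example.

```python
# Toy hypernym paths keyed by (headword, sense). Each path lists synsets from
# level 0 (the concept itself) upward, following the first hypernym path only.
HYPERNYM_PATHS = {
    ("programmer", 1): [
        ("programmer",),
        ("engineer", "applied scientist", "technologist"),
        ("person", "individual", "someone"),
        ("life form", "organism", "being"),
        ("entity",),
    ],
}


def generalize(sp, h):
    """Return the synset h levels above sp = (w, c, s, t) on its hypernym path."""
    w, _c, s, _t = sp
    path = HYPERNYM_PATHS[(w, s)]
    # Assumption: asking for a level above the top returns the most general concept.
    return path[min(h, len(path) - 1)]


sp = ("programmer", "NG", 1, "other_type")
print(generalize(sp, 0))  # ('programmer',)
print(generalize(sp, 2))  # ('person', 'individual', 'someone')
print(generalize(sp, 4))  # ('entity',)
```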
Figure 3 shows an abstract generalized rule. The ∈
symbol signifies the subsumption relationship. Therefore,
a ∈ b signifies that a is subsumed by b, or that concept
b is a superordinate concept of concept a.
Optimization
Rules with different degrees of generalization
on their different constituents will behave
differently when processing new texts. A set
of generalized rules might be sufficient for one domain
but not for another. Within a
particular rule, the user might expect one entity to be
relatively specific and another entity to be more general.
For example, if a user is interested in finding all
jobs related to DCR Inc., he or she might want to keep the
first entity as specific as it is in Figure 2 and generalize
the third entity. The rule optimization process
automatically controls the degree of generalization
in the generalized rules to meet the user's different needs.
Optimization is described in later sections.
Rule Application 
The optimally generalized rules are applied to unseen
articles to achieve information extraction in the form of
semantic transitions. The generalized rule states that
the RHS of the rule gets executed if all of the following
conditions are satisfied:
• A sentence contains three phrases (not necessarily
contiguous) with headwords W1, W2, and W3.
• The quadruples corresponding to these headwords
are (W1, C1, S1, T1), (W2, C2, S2, T2), and
(W3, C3, S3, T3).
• The synsets, in WordNet, corresponding to the
quadruples are subsumed by Generalize(sp1, h1),
Generalize(sp2, h2), and Generalize(sp3, h3), respectively.
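The three conditions above can be sketched as a small matching routine. This is a simplified illustration, not the system's rule matcher: it compares headwords only (ignoring category, sense and semantic type), assumes the three phrases must appear in sentence order, and uses a hand-written `HYPERNYM_CHAIN` table in place of WordNet. All names in it are hypothetical.

```python
from itertools import combinations

# Hypothetical hypernym chains: headword -> concepts that subsume it.
HYPERNYM_CHAIN = {
    "IBM": ["company", "group"],
    "seek": ["look_for"],
    "candidate": ["applicant", "person", "life form", "entity"],
    "Louisville": ["city", "location"],
}


def is_subsumed(headword, concept):
    """True if the concept is the headword itself or one of its hypernyms."""
    return headword == concept or concept in HYPERNYM_CHAIN.get(headword, [])


def apply_rule(headwords, rule):
    """Try a generalized rule (three concepts) on the phrase headwords of a sentence.

    Returns the RHS operations for the first matching triple, or [] if none match.
    """
    c1, c2, c3 = rule
    # combinations() keeps sentence order and allows non-contiguous phrases.
    for w1, w2, w3 in combinations(headwords, 3):
        if is_subsumed(w1, c1) and is_subsumed(w2, c2) and is_subsumed(w3, c3):
            return [("ADD_NODE", w1), ("ADD_NODE", w3), ("ADD_RELATION", w2, w1, w3)]
    return []


sentence = ["IBM", "seek", "candidate", "Louisville"]
print(apply_rule(sentence, ("group", "look_for", "entity")))
print(apply_rule(sentence, ("person", "look_for", "entity")))  # []
```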
Figure 5 shows an example of rule matching and cre- 
ating a semantic transition for the new information. In 
Specific Rule:
[DCR Inc., NG, 1, company], [look_for, VG, 1, other_type], [programmer, NG, 1, other_type]
--> ADD_NODE(DCR Inc.), ADD_NODE(programmer), ADD_RELATION(look_for, DCR Inc., programmer)
Generalized to the Most General Rule:
(W1, C1, S1, T1) ∈ {group, ...}, (W2, C2, S2, T2) ∈ {look_for, seek, search}, (W3, C3, S3, T3) ∈ {entity}
--> ADD_NODE(W1), ADD_NODE(W3), ADD_RELATION(W2, W1, W3)
Unseen Sentence:
The BiologyLab is searching for the missing frog.
"The BiologyLab" is subsumed to {group, ...}; "is searching for" is subsumed to {look_for, seek, search}; "the missing frog" is subsumed to {entity}.
Execute the RHS of the Rule: search for
Figure 5: The Application of Generalized Rules
the example, the most general rule is created by generalizing
the first and the third entities in the specific
rule to their top hypernyms in the hierarchy. Since
verbs usually have only one level in the hierarchy, they
are generalized to the synset at the same level.
Rule Optimization 
The specific rule can be generalized to the most general
rule as in Figure 5. When we apply this most
general rule again to the training corpus, a set of semantic
transitions is created. Some transitions are
relevant, while others are not. Users are expected
to select the relevant transitions through a user interface.
We need a mechanism to determine the level of
generalization that performs best at extracting the
relevant information and ignoring the irrelevant information.
The Generalization Tree (GT) is designed
to accomplish this task. While maintaining the
semantic relationships of objects as in WordNet, GTs
collect the relevancy information of all activating objects
and automatically find the optimal level of generalization
to fit the user's needs. A database is used to
maintain the relevancy information for all the objects
which activate each most general concept in the most
general rule. This database is transformed into the GT
structure, which keeps the statistical information of
relevancy for each activating object and the semantic
relations between the objects from WordNet. The system
automatically adjusts the generalization degrees
for each noun entity in the rules to match the desires
of the user. The idea of this optimization process is
to first keep recall as high as possible by applying the
most general rules, then adjust the precision by tuning
the rules based on the user's specific inputs.
An Example of GT 
Suppose we apply the most general rule in Figure 5 to
the training corpus, and entity three in the rule is
activated by the set of objects shown in Table 1. From
a user interface and a statistical classifier, the
relevancy_rate (rel_rate) for each object can be calculated:
$$\mathrm{rel\_rate}(obj) = \frac{\text{count of } obj \text{ being relevant}}{\text{total count of occurrences of } obj}$$
As shown in Table 1,
for example, rel_rate({analyst...}) = 80%, which indicates
that when {entity} in the most general rule is
activated by analyst, 80% of the time it hits relevant information
and 20% of the time it hits irrelevant information.
It also suggests that if {entity} is replaced
by the concept {analyst...}, a precision of roughly 80%
could be achieved in extracting the relevant information.
The corresponding GT for Table 1 is shown
in Figure 6.
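The rel_rate calculation amounts to counting, for each activating object, how many of its activations the user marked relevant. A minimal Python sketch follows; the function name `relevancy_rates` and the input format (one `(object, is_relevant)` pair per activation) are assumptions, and the sample judgements are made up so that analyst reproduces the count of 5 and rate of 80% quoted from Table 1.

```python
from collections import defaultdict


def relevancy_rates(judgements):
    """Map each object to (count, rel_rate) from (object, is_relevant) pairs."""
    total = defaultdict(int)
    relevant = defaultdict(int)
    for obj, is_relevant in judgements:
        total[obj] += 1
        if is_relevant:
            relevant[obj] += 1
    return {obj: (total[obj], relevant[obj] / total[obj]) for obj in total}


# Illustrative judgements: analyst activates the rule 5 times, 4 of them relevant.
judgements = [("analyst", True)] * 4 + [("analyst", False)] + [("software", False)]
rates = relevancy_rates(judgements)
print(rates["analyst"])   # (5, 0.8)
print(rates["software"])  # (1, 0.0)
```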
In a GT, each activating object is a leaf node in the
tree, with an edge to its immediate hypernym (parent).
For each hypernym list in the database, there
is a corresponding path in the GT. Besides the ordinary
hard edge, represented by a solid line, there is a soft
edge, represented by a dotted line. Concepts connected
by a soft edge are the same concept. The
only difference between them is that the leaf node is
the actual activating object, while the internal node is
the hypernym of other activating objects. Hard edges
and soft edges differ only in representation;
in the calculation, they are treated the same. Each
object        sense  hypernym list                                                        depth  rel_rate  count
analyst       1      {analyst} ⇒ {expert} ⇒ {individual} ⇒ {life form} ⇒ {entity}         4      80%       5
candidate     ...    {candidate} ⇒ {applicant} ⇒ {individual} ⇒ {life form} ⇒ {entity}    4      100%      5
individual    ...    {individual} ⇒ {life form} ⇒ {entity}                                2      100%      ...
participant   ...    ...                                                                  5      0%        ...
professional  1      ...                                                                  4      100%      2
software      1      ...                                                                  5      0%        ...
...           ...    ...                                                                  ...    ...       ...
Table 1: Sample Database for Objects Activating {entity}
Figure 6: An Example of Generalization Tree (tree diagram; root {entity}: c = 17, r = 82.3%; below it {life form, organism...}: c = 16, r = 87.5% and {person, individual...}: c = 16, r = 87.5%; leaves include {analyst}: count = 5, rel = 80%, {professional}: count = 2, rel = 100%, and {candidate}: count = 5, rel = 100%)
node has two other fields: counts of occurrence and
relevancy_rate. For the leaf nodes, those fields can be
filled from the database directly. For internal nodes,
the instantiation of these fields depends on their children
(hyponym) nodes. The calculation is described
later.
If the relevancy_rate for the root node {entity} is
82.3%, it indicates that, with probability 82.3%,
objects which activate {entity} are relevant. If the
user is satisfied with this rate, then it is not necessary
to alter the most general concept in the rule. If the
user feels the estimated precision is too low, the system
goes down the tree and checks the relevancy_rate
at the next level. For example, if 87.5% is good enough,
then the concept {life form, organism...} replaces
{entity} in the most general rule. If the precision
is still too low, the system goes further down the tree,
finds {adult..} and {applicant..}, and replaces the concept
{entity} in the most general rule with the union of
these two concepts.
Generalization Tree Model 
Suppose $z_n$ is a noun entity in the most general
rule, and $z_n$ is activated by $q$ concepts $e_1^0, e_2^0, \ldots, e_q^0$; the
number of activations for each $e_i^0$ is represented by $c_i$.
Since $e_i^0$ ($i \le q$) activates $z_n$, there exists a hypernym
list $e_i^0 \Rightarrow e_i^1 \Rightarrow \cdots \Rightarrow z_n$ in WordNet, where $e_i^j$ is
the immediate hypernym of $e_i^{j-1}$. The system maintains
a database of activation information as shown in
Table 2, and transforms the database to a GT model
automatically.
GT is an n-ary branching tree structure with the
following properties:
• Each node represents a concept, and each edge represents
the hypernym relationship between the concepts.
If $e_i$ is the immediate hypernym of $e_j$, then
there is an edge between nodes $e_i$ and $e_j$, and $e_i$ is one
level above $e_j$ in the tree.
• The root node $z_n$ is the most general concept from
the most general rule.
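activating objects  sense  counts  hypernym list                       depth  relevancy_rate
$e_1^0$             $s_1$  $c_1$   $e_1^0 \Rightarrow e_1^1 \Rightarrow \cdots \Rightarrow z_n$  $d_1$  $r_1$
$e_2^0$             $s_2$  $c_2$   $e_2^0 \Rightarrow e_2^1 \Rightarrow \cdots \Rightarrow z_n$  $d_2$  $r_2$
...                 ...    ...     ...                                 ...    ...
$e_q^0$             $s_q$  $c_q$   $e_q^0 \Rightarrow \cdots \Rightarrow z_n$                    $d_q$  $r_q$
Table 2: Database of activating concepts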
• The leaf nodes $e_1^0, e_2^0, \ldots, e_q^0$ are the concepts which
activate $z_n$. The internal nodes are the concepts $e_i^j$
($j \neq 0$ and $1 \le i \le q$) from the hypernym paths for
the activating concepts.
• For a leaf node $e_i^0$:
$$\mathrm{counts}(e_i^0) = c_i$$
$$\mathrm{relevancy\_rate}(e_i^0) = r_i$$
• For an internal node $e$, if it has $n$ hyponyms (i.e., children)
$e_1, \ldots, e_n$, then:
$$\mathrm{counts}(e) = \sum_{i=1}^{n} \mathrm{counts}(e_i)$$
$$\mathrm{relevancy\_rate}(e) = \sum_{i=1}^{n} P(e_i) \cdot \mathrm{relevancy\_rate}(e_i)$$
where
$$P(e_i) = \frac{\mathrm{counts}(e_i)}{\mathrm{counts}(e)}$$
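The two formulas for internal nodes can be computed bottom-up in one pass over the tree. The Python sketch below is an illustration under stated assumptions: the `GTNode` class and the `fill` function are hypothetical names, and the small tree is made up for the example (its shape is not that of Figure 6).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GTNode:
    concept: str
    count: int = 0     # counts of occurrence
    rate: float = 0.0  # relevancy_rate
    children: List["GTNode"] = field(default_factory=list)


def fill(node):
    """Fill counts and relevancy_rate of internal nodes from their children.

    Leaves keep the values taken from the activation database.
    """
    if not node.children:
        return node
    for child in node.children:
        fill(child)
    node.count = sum(child.count for child in node.children)
    if node.count:
        # Weighted average: P(e_i) = counts(e_i) / counts(e).
        node.rate = sum(child.count / node.count * child.rate
                        for child in node.children)
    return node


# Illustrative tree: leaf values are database entries, internal values are derived.
person = GTNode("person", children=[GTNode("analyst", 5, 0.8),
                                    GTNode("candidate", 5, 1.0)])
root = fill(GTNode("entity", children=[person, GTNode("software", 1, 0.0)]))
print(person.count, round(person.rate, 3))  # 10 0.9
print(root.count, round(root.rate, 3))      # 11 0.818
```

Because each child's rate is weighted by its share of the activations, the rate of an internal node equals the fraction of relevant activations among all leaves below it.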
Optimized Rule 
For each noun entity in the most general rule, the system
keeps a GT built from the training set. Depending on
the user's different needs, a threshold θ is pre-selected. For
each GT, the system starts from the root node,
goes down the tree, and finds all the nodes $e_i$ such that
relevancy_rate($e_i$) ≥ θ. If a node's relevancy_rate is
at least θ, its children nodes are ignored. In this
way, the system maintains a set of concepts whose
relevancy_rate is at least θ, which is called Optimal-Concepts.
By substituting Optimal-Concepts for $z_n$ in the most
general rule, an optimized rule is created
to meet the user's needs.
The searching algorithm is basically a top-down search of the tree,
as follows:
1. Initialize Optimal-Concepts to be the empty set. Pre-select
the threshold θ. If the user wants to get the
relevant information and particularly cares about
the precision, θ should be set high; if the user wants
to extract as much information as possible and does
not care about the precision, θ should be set low.
2. Starting from the root node z, perform the
Recursive-Search algorithm, which is defined as
follows:
Recursive-Search(concept z)
{ if (relevancy_rate(z) >= θ) {
    put z into Optimal-Concepts set;
    return; }
  else {
    let m denote the number of children nodes of z;
    let z_i denote the i-th child of z (0 < i <= m);
    for (i = 1; i <= m; i++)
      Recursive-Search(z_i);
  }
}
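The search above can be written as a short runnable Python function. This is a sketch, not the system's code: nodes are plain dictionaries, the names `recursive_search`, `leaf` and `TREE` are hypothetical, and the rates in the example tree are set by hand (loosely following the walk-through of Figure 6) rather than computed from data.

```python
def recursive_search(node, theta, optimal=None):
    """Collect the highest nodes whose relevancy_rate is at least theta."""
    if optimal is None:
        optimal = []
    if node["rate"] >= theta:
        optimal.append(node["concept"])  # keep this node, ignore its children
    else:
        for child in node.get("children", []):
            recursive_search(child, theta, optimal)
    return optimal


def leaf(concept, rate):
    return {"concept": concept, "rate": rate}


# Hand-set example tree; the rates are illustrative.
TREE = {
    "concept": "entity", "rate": 0.823,
    "children": [
        {"concept": "life form", "rate": 0.875,
         "children": [leaf("adult", 1.0), leaf("applicant", 1.0),
                      leaf("expert", 0.8), leaf("participant", 0.0)]},
        leaf("software", 0.0),
    ],
}

print(recursive_search(TREE, 0.8))   # ['entity']
print(recursive_search(TREE, 0.85))  # ['life form']
print(recursive_search(TREE, 0.9))   # ['adult', 'applicant']
```

Raising θ pushes the selected concepts further down the tree, which trades recall for precision as described in step 1.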
Experiment and Discussion 
In this section we present and discuss results from an
experiment. The experimental domain is the triangle.jobs
USENET newsgroup. We trained our system on 24
articles for the extraction of six facts of interest, as
follows:
• Company Name. Examples: IBM, Metro Information
Services, DCR Inc.
• Position/Title. Examples: programmer, financial
analyst, software engineer.
• Experience/Skill. Example: 5 years experience in
Oracle.
• Location. Examples: Winston-Salem, North Carolina.
• Benefit. Examples: company matching funds,
comprehensive health plan.
• Contact Info. Examples: Fax is 919-660-6519, e-mail
address.
The testing set contained 162 articles from the same
domain as the system was trained on. Out of the 162
articles, 21 articles were unrelated to the domain
because they had been misplaced by the people who
posted them. Those unrelated articles were about jobs
facts     company  position  experience  location  benefit  contact info
training  62.5%    83.3%     91.7%       66.7%     25.0%    95.8%
testing   63.1%    90.8%     90.8%       62.4%     23.4%    97.9%
Table 3: Percentage of Facts in Training and Testing
wanted, questions answered, ads for web sites, etc. First,
we compared some of the statistics from the training
set and the testing set. The percentage of representation
of each fact in the articles for both the training and testing
sets is shown in Table 3; it is the number of
articles containing each fact out of the total number of
articles. The distribution of the number of facts presented
in each article is shown in Figure 7.
The mean number of facts in each article from the
training set is 4.39, with a standard deviation of 1.2; the
mean number of facts in each article from the testing
set is 4.35, with a standard deviation of 1. Although these
statistics are not strong enough to indicate that the training
set is definitely a good training corpus for this
information extraction task, they suggest that, as far as
the facts of interest are concerned, the training set is
a reasonable set to train on and learn from.
Figure 7: Distribution of Number of Facts in Each Article (plot; x-axis: number of facts in each article; curves: training, testing)
The evaluation process consisted of the following
steps: first, each unseen article was studied to see if
any fact of interest was present; second, the
semantic transitions produced by the system were examined
to see if they correctly extracted the fact of interest.
Precision is the number of transitions correctly
extracting facts of interest out of the total number of
transitions produced by the system; recall is the number
of facts which have been correctly extracted out
of the total number of facts of interest. The overall
performance of recall and precision is defined by the
F-measurement (Chinchor 1992), which is
$$F = \frac{(\beta^2 + 1.0) \cdot P \cdot R}{\beta^2 \cdot P + R}$$
where P is precision, R is recall, and $\beta = 1$ if precision and
recall are equally important.
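As a quick check of the formula, the following Python sketch computes the F-measurement; the function name `f_measure` and the choice to return 0 when the denominator is zero are assumptions of this example. With β = 1, a precision of 84.7% and a recall of 51.6% (the position/title figures reported below for θ = 1.0), it gives about 64.1%.

```python
def f_measure(precision, recall, beta=1.0):
    """F = ((beta^2 + 1) * P * R) / (beta^2 * P + R)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # assumption: define F as 0 when both P and R are 0
    return (beta ** 2 + 1.0) * precision * recall / denominator


print(round(f_measure(0.847, 0.516), 3))  # 0.641
```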
First, we tested single fact extraction, for the position/title
fact. The purpose of this experiment was to
test whether different θ values lead to the expected
recall and precision statistics. For the
141 related testing articles, the recall, precision, and
F-measurement curves are shown in Figure 8. Recall
is 51.6% when θ = 1.0, which is lower than the 75% obtained at
θ = 0; however, precision is at its highest, 84.7%, when
θ = 1.0. The F-measurement achieves its highest value,
64.1%, when θ = 1.0.
Figure 8: Performance vs. GT Threshold (plot; x-axis: GT threshold; curves: precision, recall, F-measurement)
As mentioned earlier, 21 articles from the testing
corpus are unrelated to the job advertisement domain.
An interesting question arising here is whether we can use the GT
rule optimization method for information
retrieval, in this particular case, to identify those unrelated
articles. Certainly, we would hope that the optimized
rules do not produce any transitions from the unrelated
articles. The result is shown in Figure 9. The precision
for unrelated articles is the number of articles without
any transitions created out of the total of 21 articles. We can
see that, when θ = 0.8 or 1.0, precision is 95.7%. Only
one article out of the 21 is misidentified. But when
θ = 0 or 0.2, the precision is very low, only 28.6%
and 38.1%, respectively. If we used the traditional approach of keyword
matching for this information retrieval task, the precision
would not be as high as 95.7%, since a few resume and
job wanted postings would pass the keyword matching
and be misidentified as related articles.
Figure 9: Precision of Identifying Unrelated vs. GT Threshold (plot; x-axis: GT threshold)
The system performance on extracting the six facts is
shown in Figure 10. The overall performance, measured by the
F-measurement, reaches its peak of 70.2% when θ = 0.8.
When θ = 1.0, the precision does not reach what we
expected. One explanation is that, unlike the extraction
of the position/title fact, for extracting the six facts
from the domain the training data is quite small. It is
not sufficient to support the user's requirement
for a strong estimate of precision. θ = 0.8 is the best
choice when the training corpus is small.
Figure 10: Performance of Extracting Six Facts vs. GT Threshold (plot; x-axis: GT threshold; curves: precision, recall, F-measurement)
Some problems were also detected which prevent
better performance of the system. The current domain
is a newsgroup, where anyone can post anything
which he or she believes is relevant to the newsgroup. It
is inevitable that some typographical errors and some
abbreviations occur in the articles, and the format of
an article is sometimes unpredictable. If we
incorporate a spelling checker into the system and build
a database of commonly used abbreviations, the
system performance is expected to improve. Some
problems came from the use of WordNet as well. For
example, the sentence "DCR Inc. is looking for Q/A
people" won't activate the most general rule in Figure 5.
The reason is that people is subsumed by the concept
{group, grouping}, but not by the concept {entity}.
This problem can be fixed by adding one more rule
with {group, grouping} substituted for {entity} in the most
general rule in Figure 5. WordNet has very refined
senses for each concept, including some rarely used
ones, which sometimes causes problems too. This kind
of problem certainly hurts the performance, but it is
not easy to correct because of the nature of WordNet.
However, the use of WordNet generally provides a good
method to achieve generalization in the job
advertisement domain.
Conclusion and Future Work 
This paper describes a rule optimization approach
using the Generalization Tree and WordNet. Our information
extraction system learns the necessary knowledge
by analyzing sample corpora through a training process.
The rule optimization makes it easier for the
information extraction system to be customized to a
new domain. The Generalization Tree algorithm provides
a way to make the system adaptable to the user's
needs. The idea of first achieving the highest recall
with low precision, then adjusting precision to satisfy
the user's needs, has been successful. We are currently
studying how to enhance the system performance by
further refining the generalization approach.

References 
Aberdeen, John, John Burger, David Day, Lynette
Hirschman, Patricia Robinson, and Marc Vilain 1995.
MITRE: Description of the ALEMBIC System Used
for MUC-6, Proceedings of the Sixth Message Understanding
Conference (MUC-6), pp. 141-155, November 1995.
Appelt, Douglas E., Jerry R. Hobbs, John Bear,
David Israel, Megumi Kameyama, Andy Kehler,
David Martin, Karen Myers, and Mabry Tyson 1995.
SRI International: Description of the FASTUS System
Used for MUC-6, Proceedings of the Sixth Message
Understanding Conference (MUC-6), pp. 237-248,
November 1995.
Bagga, Amit, and Joyce Y. Chai 1997. A Trainable
Message Understanding System, Computational Natural
Language Learning (CoNLL97), pp. 1-8, July 1997.
Chai, Joyce Y. and Alan W. Biermann 1997. A WordNet
Based Rule Generalization Engine For Meaning
Extraction. To appear at Tenth International Symposium
On Methodologies For Intelligent Systems, 1997.
Chinchor, Nancy 1992. MUC-4 Evaluation Metrics,
Proceedings of the Fourth Message Understanding
Conference (MUC-4), June 1992, San Mateo: Morgan
Kaufmann.
Church, Kenneth 1988. A Stochastic Parts Program
and Noun Phrase Parser for Unrestricted Text, Proceedings
of the Second Conference on Applied Natural
Language Processing, ACL, 1988.
Church, Kenneth, William Gale, Patrick Hanks, and
Donald Hindle 1989. Parsing, Word Associations and
Typical Predicate-Argument Relations. Proceedings of
the International Workshop on Parsing Technologies,
1989.
Derose, S. 1988. Grammatical Category Disambiguation
by Statistical Optimization, Computational Linguistics,
14, 1988.
Fisher, David, Stephen Soderland, Joseph McCarthy,
Fangfang Feng and Wendy Lehnert 1995. Description
of the UMass System as Used for MUC-6, Proceedings
of the Sixth Message Understanding Conference
(MUC-6), pp. 127-140, November 1995.
Grishman, Ralph 1995. The NYU System for MUC-6
or Where's the Syntax? Proceedings of the Sixth Message
Understanding Conference (MUC-6), pp. 167-175,
November 1995.
Krupka, George R. 1995. Description of the SRA System
as Used for MUC-6, Proceedings of the Sixth Message
Understanding Conference (MUC-6), pp. 221-235,
November 1995.
Miller, George A. 1990. Introduction to WordNet: An
On-Line Lexical Database. WordNet Manuals, pp. 10-32,
August 1993.
Miller, George A., et al. 1990a. Five Papers on WordNet,
Cognitive Science Laboratory, Princeton University,
No. 43, July 1990.
Resnik, Philip 1995a. Using Information Content
to Evaluate Semantic Similarity in a Taxonomy.
Proceedings of IJCAI-95.
Resnik, Philip 1995b. Disambiguating noun groupings
with respect to WordNet senses, Third Workshop on
Very Large Corpora, 1995.
Resnik, Philip 1993. Selection and Information: A
Class Based Approach to Lexical Relationships, Ph.D.
Dissertation, University of Pennsylvania, 1993.
Weischedel, Ralph 1995. BBN: Description of the
PLUM System as Used for MUC-6, Proceedings of
the Sixth Message Understanding Conference (MUC-6),
pp. 55-69, November 1995.
