Finite-state phrase parsing by rule sequences 
Marc Vilain and David Day 
The MITRE Corporation 
202 Burlington Rd. 
Bedford, MA 01720 USA 
mbv@mitre.org, day@mitre.org 
Abstract 
We present a novel approach to parsing phrase 
grammars based on Eric Brill's notion of rule 
sequences. The basic framework we describe has 
somewhat less power than a finite-state machine, 
and yet achieves high accuracy on standard phrase 
parsing tasks. The rule language is simple, which 
makes it easy to write rules. Further, this simpli- 
city enables the automatic acquisition of phrase- 
parsing rules through an error-reduction strategy. 
This paper explores an approach to syntactic analysis 
that is unconventional in several respects. To begin 
with, we are concerned not so much with the tradi- 
tional goal of analyzing the comprehensive structure of 
complete sentences, as much as with assigning partial 
structure to parts of sentences. The fragment of interest 
here is demonstrably a subset of the regular sets, and 
while these languages are traditionally analyzed with 
finite-state automata, our approach relies instead on the 
rule sequence architecture defined by Eric Brill. 
Why restrict ourselves to the finite-state case? Some 
linguistic phenomena are easier to model with regular 
sets than context-free grammars. Proper names are a 
case in point, since their syntactic distribution partially 
overlaps that of noun phrases in general; as this overlap
is only partial, name analysis within a full context-free 
grammar is cumbersome, and some approaches have 
taken to include finite-state name parsers as a front-end 
to a principal context-free parsing stage (Jacobs et al. 
1991). Proper names are of further interest, since their
identification is independently motivated as valuable to
both information retrieval and extraction (Sundheim
1996). Further, several promising recent approaches to
information extraction rely on little more than finite- 
state machines to perform the entire extraction analysis 
(Appelt et al. 1993, Grishman 1995).
Why approach this problem with rule sequences? In 
this paper we make the case that rule sequences succeed
at this task through their simplicity and speed. Most 
important, they support mixed-mode acquisition: the 
rules are both easy for an engineer to write and easy to 
learn automatically. 
Rule sequences 
As part of our work in information extraction, we have 
been extensively exploring the use of rule sequences. 
Our information extraction prototype, Alembic, is in 
fact based on a pipeline of rule sequence processors that 
run the gamut from part-of-speech tagging, to phrase 
identification, to sentence parsing, to inference 
(Aberdeen et al. 1995). In each case, the underlying
method is identical. Processing takes place by 
sequentially relabeling the corpus under consideration. 
Each sequential step is driven by a rule that attempts to 
patch residual errors left in place in the preceding steps. 
The patching process as a whole is itself preceded by an 
initial labeling phase that provides an approximate 
labeling as a starting point for rule application. 
This patching architecture, illustrated in Fig. 1, was 
codified by Eric Brill, who first exploited it for
part-of-speech tagging (Brill 1993). In the part-of-speech
application, initial labeling is provided by lexicon lookup:
lexemes are initially tagged with the most common part 
of speech assigned to them in a training corpus. This 
initial labeling is refined by two sets of transformations. 
Morphological transformations relabel the initial 
(default) tagging of those words that failed to be found 
in the lexicon. The morphological rules are followed by
contextual transformations: these rules inspect lexical
context to relabel lexemes that are ambiguous with 
respect to part-of-speech. In effect, the morphological 
transformations patch errors that were due to gaps in 
the lexicon, and the contextual rules patch errors that 
were due to the initial assignment of a lexeme's most 
common tag. 
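The patching idea can be illustrated with a minimal Python sketch. It is not Brill's implementation: the function names, the toy training corpus, and the default tag for unknown words are all assumptions made for illustration, and the morphological rules for unknown words are omitted.

```python
from collections import Counter, defaultdict

def build_lexicon(tagged_corpus):
    """Map each word to the tag it carries most often in the training corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def initial_labeling(words, lexicon, default_tag="NN"):
    """Initial labeling: lexicon lookup, with an assumed default for unknown words."""
    return [lexicon.get(w, default_tag) for w in words]

def contextual_patch(tags, from_tag, to_tag, prev_tag):
    """Contextual transformation: retag from_tag as to_tag when preceded by prev_tag."""
    patched = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            patched[i] = to_tag
    return patched

# Toy training data (hypothetical): "run" is most often a noun.
training = [("to", "TO"), ("run", "NN"), ("run", "NN"), ("run", "VB"), ("the", "DT")]
lexicon = build_lexicon(training)
initial = initial_labeling(["to", "run"], lexicon)     # most-common-tag labeling
patched = contextual_patch(initial, "NN", "VB", "TO")  # patch the residual error
```

Here the initial labeling tags "run" with its most common tag, and the contextual rule then patches that choice in the context of a preceding "to".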
Phrase identification: some examples 
Sequencing, patching, and simplicity, the hallmarks of 
Brill's part-of-speech tagger, are also characteristic of 
our phrase parser. In our approach, phrases are initially 
built around word sequences that meet certain lexical or 
part-of-speech criteria. The sequenced phrase-finding 
rules then grow the boundaries of phrases or set their 
label, according to a repertory of simple lexical and 
contextual tests. For example, the following rule assigns 
a label of ORG to an unlabeled phrase just in case the
phrase is ended by the word "Inc."
(def-phraser
  label NONE               ; phrase is currently unlabelled
  right-wd-1 lexeme "inc." ; rightmost word in the phrase is "inc."
  label-action ORG)        ; change the phrase's label, but not its boundaries
Now, consider the following partially labelled string: 
<none>Donald F. DeScenza</none>, analyst with 
<none>Nomura Securities Inc.</none> 
[Figure 1 diagram: unprocessed text -> initial labelling (lexicon lookup) -> labelled text -> transformations (morphological rules) -> final text]
Figure 1: Brill's rule sequence architecture as applied to part-of-speech tagging.
The SGML markup delimits phrases whose boun- 
daries were identified by the initial phrase-finding pass. 
Of these phrases, the second successfully triggers the 
example rule, yielding the following relabeled string.
<none>Donald F. DeScenza</none>, analyst with
<org>Nomura Securities Inc.</org> 
The rule, which seems both obvious and fool-proof,
comes from the name-finding processor we developed for
our participation in the 6th Message Understanding
Conference (MUC-6). As it
turns out, though, the rule is in fact not error-proof, 
and causes both errors of omission (i.e. recall errors) 
and commission (i.e. precision errors). Consider the 
case of "Volkswagen of America Inc." Because the 
initial phrase labeling is only approximate, the string is 
broken into two sub-phrases separated by "of".
<none>Volkswagen</none> of <none>America Inc.</none>
The example rule designates the partial phrase
"America Inc." as an ORG, a precision error because of
its partiality, and fails to produce an ORG label spanning
the entire string (a recall error).
<none>Volkswagen</none> of <org>America Inc.</org>
This problem is patched by a subsequent name- 
finding rule, namely the following. 
(def-phraser
  label ORG                ; this is an organization
  left-wd-1 test country?  ; is the leftmost lexeme in the phrase on a list of country words?
  left-ctxt-1 lexeme "of"  ; to the left of the phrase is the word "of"
  left-ctxt-2 phrase NONE  ; to the left of that is an unlabelled phrase
  bounds-action MERGE      ; merge the entire left context into the ORG, phrase and all
  label-action ORG)
The first two clauses of the rule are antecedents that 
look for phrases such as "America inc." The next two 
clauses are further antecedents that look to the left of 
the phrase for contextual patterns of the form
"<none>...</none> of".
The final two clauses incorporate the left context 
wholesale into the triggering phrase, yielding:
<org>Volkswagen of America Inc.</org>
This rule effectively patches the errors caused by its
predecessor in the rule sequence, and simultaneously 
eliminates both a recall and a precision error. 
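The interplay of the two example rules can be sketched in Python as follows. This is an illustrative rendering only, not the Alembic implementation: the `Phrase` class, the function names, and the tiny `COUNTRY_WORDS` list are hypothetical stand-ins, and phrases are represented as token spans rather than SGML markup.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the country word list consulted by "test country?".
COUNTRY_WORDS = {"america", "canada"}

@dataclass
class Phrase:
    start: int   # index of the first token
    end: int     # index one past the last token
    label: str   # "NONE" until a rule labels the phrase

def label_inc_as_org(tokens, phrases):
    """First rule: an unlabelled phrase ending in "inc." becomes an ORG."""
    for p in phrases:
        if p.label == "NONE" and tokens[p.end - 1].lower() == "inc.":
            p.label = "ORG"          # label action only; boundaries unchanged
    return phrases

def merge_left_context(tokens, phrases):
    """Second rule: merge "<none>...</none> of" into a following ORG phrase
    whose first word is a country word."""
    merged = []
    for p in phrases:                # phrases are assumed sorted left to right
        prev = merged[-1] if merged else None
        if (p.label == "ORG"
                and tokens[p.start].lower() in COUNTRY_WORDS      # left-wd-1
                and p.start >= 1
                and tokens[p.start - 1].lower() == "of"           # left-ctxt-1
                and prev is not None and prev.label == "NONE"
                and prev.end == p.start - 1):                     # left-ctxt-2
            merged[-1] = Phrase(prev.start, p.end, "ORG")         # MERGE, then label
        else:
            merged.append(p)
    return merged

tokens = ["Volkswagen", "of", "America", "Inc."]
seeds = [Phrase(0, 1, "NONE"), Phrase(2, 4, "NONE")]   # fragmented initial labeling
after_rule_1 = label_inc_as_org(tokens, seeds)
after_rule_2 = merge_left_context(tokens, after_rule_1)
result = " ".join(tokens[after_rule_2[0].start:after_rule_2[0].end])
```

After the first rule only "America Inc." carries the ORG label; the second rule then absorbs the left context, leaving a single ORG phrase over the whole string.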
The phrase finder 
With these examples as background, we may now 
turn our attention to the technical details of the phrase 
finding process. As noted above, this process occurs in 
two main steps, an initial labeling pass followed by the 
application of a rule sequence. 
Initial phrase labeling 
The initial labeling process seeds the phrase-finder 
with candidate phrases. These candidate phrases need 
not be any more than approximations. In particular, it
is not necessary for these candidates to have wholly
accurate boundaries, as their left and right edges can be
adjusted later by means of patching rules. It is also not
necessary for these candidates to be unfragmented, as
fragments can be reassembled later, just as with
"Volkswagen of America Inc." Further, applications that
require multiple types of phrase labels need not choose
such a label during the initial phrase-finding pass.
What is important is that the initial phrase
identification find the cores of phrases reliably, even if
complete phrases are not identified. That is, it must
partially align some kind of candidate phrase $\kappa$ for
every phrase $\phi$ that is actually present in the input.
Extending a concept from information retrieval, this
amounts to maximizing what we might call initial recall, i.e.,
$$R_I = |\Phi_I| / |\Phi|,$$
where $\Phi$ is the set of actual phrases in a test set, $K$ is
the set of candidate phrases generated by the initial
phrasing pass, and $\Phi_I$ is the set of those $\phi \in \Phi$
that are partially aligned with some $\kappa \in K$.
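As a small worked example of initial recall, the following Python sketch computes the ratio for token spans. It assumes that "partially aligned" means that two spans share at least one token; that reading, the span representation, and the function names are assumptions made for illustration.

```python
def partially_aligned(a, b):
    """Assumed reading: two (start, end) spans, end exclusive, share a token."""
    return a[0] < b[1] and b[0] < a[1]

def initial_recall(actual, candidates):
    """Fraction of actual phrases that overlap at least one candidate phrase."""
    hits = [phi for phi in actual
            if any(partially_aligned(phi, kappa) for kappa in candidates)]
    return len(hits) / len(actual)

actual = [(0, 4), (7, 9), (12, 13)]      # actual phrases in a toy test set
candidates = [(0, 1), (2, 4), (8, 9)]    # seeds from the initial phrasing pass
r_i = initial_recall(actual, candidates)
```

In this toy case two of the three actual phrases are touched by some candidate, so the initial recall is 2/3, even though the first phrase is covered only by fragments.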
The general strategy we have adopted for finding 
initial phrase seeds is to look for either runs of lexemes
in a fixed word list or runs of lexemes that have been
tagged a certain way by our part-of-speech tagger.
Different instantiations of this general strategy for
initial phrase labeling naturally arise for different
phrase-finding tasks. For example, on the classic
"proper names" task in mixed-case text, we have
achieved good results starting from runs of lexemes
tagged with NNP or NNPS, the Penn Treebank proper
noun tags. This strategy achieves the desired high
initial recall $R_I$, as these tags are well-correlated with
bona fide proper names and are reliably produced in
mixed-case text by our part-of-speech tagger. This
strategy does not yield quite as good initial precision
(i.e., it yields false positives) for a number of reasons,
such as the fragmentation problems noted above, e.g.,
Volkswagen/NNP of/IN America/NNP Inc./NNP
Once again, though, these initial precision errors are
readily addressed by patching rules.
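The seeding strategy for proper names can be sketched as a scan for maximal runs of proper-noun tags. The function name and the span representation below are illustrative assumptions, not the authors' code.

```python
def seed_phrases(tagged, seed_tags=("NNP", "NNPS")):
    """Return (start, end) spans, end exclusive, for maximal runs of seed tags."""
    spans, start = [], None
    for i, (_, tag) in enumerate(tagged):
        if tag in seed_tags:
            if start is None:
                start = i                 # a new run begins here
        elif start is not None:
            spans.append((start, i))      # the run ended at the previous token
            start = None
    if start is not None:
        spans.append((start, len(tagged)))
    return spans

tagged = [("Volkswagen", "NNP"), ("of", "IN"), ("America", "NNP"), ("Inc.", "NNP")]
seeds = seed_phrases(tagged)
```

On the example above this yields two seeds, one for "Volkswagen" and one for "America Inc.", which reproduces the fragmentation that later patching rules repair.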
Clause type            Syntax                      Definition
Contextual tests       left-ctxt-1, left-ctxt-2    Test one place (resp. two places) to the left of the phrase
                       right-ctxt-1, right-ctxt-2  Test one place (resp. two places) to the right of the phrase
Phrase-internal tests  left-wd-1, left-wd-2        Test first (resp. second) word of phrase
                       right-wd-1, right-wd-2      Test last (resp. next-to-last) word of phrase
                       wd-any                      Test each word of phrase in succession. Succeeds if any word in the phrase passes the test.
                       wd-span                     Test entire string spanned by phrase
Label test             label                       Test phrase's label
Actions                label-action                Sets the label of the phrase
                       bounds-action               Modify the phrase's left or right boundaries
Table 1: Repertory of unary rule clauses.
Phrase-finding rules 
A phrase-finding rule in our framework is made up of 
several clauses. The core of the rule consists of clauses
that test the lexical context around a candidate phrase $\kappa$
or that test lexemes spanned by $\kappa$. The repertory of
these test loci is given in Table 1. At any given locus, a
test may either search for a particular lexeme, match a
lexeme against a closed word list, match a part of
speech, or match a phrase of a given type. Most rules
also test the label of the candidate phrase $\kappa$.
The unary contextual tests in the table may also be
combined to form binary or ternary tests. For example,
combining LEFT-CTXT-1 and LEFT-CTXT-2 clauses yields
a rule that tests for the left bigram context. This was
done in the ORG defragmentation rule described earlier.
A rule also contains at least one action clause, either
a clause that sets the label of the phrase, or one that
modifies the boundaries of the phrase. Finally, some
rule actions actually introduce new phrases that embed
the candidate and its test context; this allows one to
build non-recursive parse trees.
Phrase rule interpreter 
The phrase rule interpreter implements the rule 
language in a straightforward way. Given a document 
to be analyzed, it proceeds through a rule sequence one 
rule r at a time, and attempts to apply r to every phrase 
in every sentence in the document. The interpreter first 
attempts to match the test label of r to the label of the 
candidate phrase. If this test succeeds, then the 
interpreter attempts to satisfy the rule's contextual tests 
in the context of the candidate. If these tests succeed,
then the rule's bounds and label actions are executed. 
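The control flow just described can be sketched in a few lines of Python. The `Rule` and `Phrase` classes, the representation of tests and actions as plain functions, and the two sample rules are all hypothetical; the sketch only mirrors the order of operations (label test, then contextual tests, then actions), not the actual interpreter.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Phrase:
    start: int
    end: int          # one past the last token
    label: str

@dataclass
class Rule:
    label: str                                              # label test
    tests: List[Callable] = field(default_factory=list)     # (tokens, phrase) -> bool
    actions: List[Callable] = field(default_factory=list)   # (tokens, phrase) -> None

def run_sequence(rules, sentences):
    """Apply each rule, in order, to every phrase of every sentence."""
    for rule in rules:
        for tokens, phrases in sentences:
            for phrase in phrases:
                if phrase.label != rule.label:               # label test first
                    continue
                if all(test(tokens, phrase) for test in rule.tests):
                    for act in rule.actions:                 # bounds and label actions
                        act(tokens, phrase)
    return sentences

# Two hypothetical rules: label "... Inc." as ORG, then grow an ORG leftwards
# over a capitalized word.
def ends_with_inc(tokens, ph):
    return tokens[ph.end - 1].lower() == "inc."

def capitalized_to_left(tokens, ph):
    return ph.start >= 1 and tokens[ph.start - 1][:1].isupper()

def set_org(tokens, ph):
    ph.label = "ORG"

def grow_left(tokens, ph):
    ph.start -= 1

rules = [Rule("NONE", [ends_with_inc], [set_org]),
         Rule("ORG", [capitalized_to_left], [grow_left])]
tokens = ["Nomura", "Securities", "Inc."]
sentences = [(tokens, [Phrase(1, 3, "NONE")])]
run_sequence(rules, sentences)
final = sentences[0][1][0]
```

Because rules are applied strictly in sequence, the second rule fires only because the first has already set the ORG label; with the order reversed the phrase would be labelled but not extended.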
Beyond this, the only real complexity arises with 
phrase-finding tasks that require one to maintain a 
temporary lexicon. The clearest such example is proper 
name identification. Indeed, short name forms (e.g., 
"Detroit Diesel") can sometimes only be identified 
correctly once their component terms have been found 
as part of the complete name (e.g., "Detroit Diesel
Corp."). The converse is also true, as short forms of
person names (e.g., "Mr. Olatunji") can help identify
full name forms (e.g., "Babatunde Olatunji").
The interpreter maintains a temporary lexicon on a
document-by-document basis. Every time the
interpreter changes the label of a phrase $\phi$, pairs of the
form $\langle \lambda, \tau \rangle$ are added to the lexicon, where $\lambda$ is a
lexeme in $\phi$, and $\tau$ is the label with which $\phi$ is tagged. This
lexicon is then exploited to form the associations 
between short and long proper name forms (through an 
extension to the rule repertory defined above). 
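A per-document lexicon of this kind might look as follows in Python. The class name, its methods, the case-folding of lexemes, and the way a short form is checked against the lexicon are assumptions for illustration; the paper does not spell out the rule extension that consults the lexicon.

```python
from collections import defaultdict

class TempLexicon:
    """Per-document store of (lexeme, label) pairs gathered during relabeling."""

    def __init__(self):
        self._labels = defaultdict(set)

    def record(self, phrase_tokens, label):
        """Called whenever a phrase's label changes: add one pair per lexeme."""
        for lexeme in phrase_tokens:
            self._labels[lexeme.lower()].add(label)   # case-folding is an assumption

    def labels_for(self, lexeme):
        return self._labels.get(lexeme.lower(), set())

lexicon = TempLexicon()
lexicon.record(["Detroit", "Diesel", "Corp."], "ORG")   # full form labelled first

# One possible use: a short form whose every word was seen under the same label.
short_form = ["Detroit", "Diesel"]
shared = set.intersection(*(lexicon.labels_for(w) for w in short_form))
```

A fresh `TempLexicon` would be created for each document, so associations between short and long name forms do not leak across documents.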
Correspondence to the regular sets 
It is straightforward to prove that this approach 
recognizes a subset of the regular sets, so we will only 
sketch the outline of such a proof here. The proof 
proceeds inductively by constructing a finite state
machine $\mu$ that accepts exactly those strings which
receive a certain label in the phrase-finding process
under a given rule sequence $\Sigma$. We consider each rule $\rho$
in $\Sigma$ in order, and correspondingly elaborate the
machine so as to reproduce the rule's effect.
To begin with, consider that the initial phrase 
labeling proceeds by building phrases around lexemes 
$\lambda_1, \ldots, \lambda_n$ in a designated word list or by finding runs
of certain parts of speech $\pi_1, \ldots, \pi_m$. The machine that
reproduces this initial labeling is thus
[Diagram: machine with start state S whose arcs are labeled $p_1/\pi_1, \ldots, p_n/\pi_1$ through $p_1/\pi_m, \ldots, p_n/\pi_m$]
As usual, the node labeled "S" is the start state, and any
node drawn with two circles is an accepting state. The
$p_i/\pi_i$ arc labels stand for all lexemes in the lexicon that
may be labeled with the part of speech $\pi_i$.
The induction step in the construction proceeds
from $\mu_{i-1}$, the machine built to reproduce $\Sigma$ up through
rule $\rho_{i-1}$ in the sequence, and adds additional states and
arcs so as to reproduce $\Sigma$ up through rule $\rho_i$.
For example, say $\rho_i$ tests for the presence of a lexeme
$\lambda$ to the left of a phrase and extends the phrase's
boundaries to include $\lambda$. We extend the machine $\mu$ to
encode this rule by replacing $\mu$'s current start state S
with a new one S', and adding a $\lambda$ transition from S' to
the former start state S.
[Diagram: the machine before and after the new start state S' and its $\lambda$ arc into S are added]
For a rule $\rho_i$ that tests whether a phrase contains a
certain lexeme $\lambda_i$, we construct an "acceptor" machine
that accepts any string with $\lambda_i$ in its midst.
[Diagram: acceptor machine for strings containing $\lambda_i$]
Noting that the regular sets are closed under
intersection, we then proceed to build the machine that
"intersects" the acceptor with $\mu_{i-1}$.
Other rule patterns are handled with constructions
of a similar flavor; space considerations preclude their
description here. Note, however, that extending the
framework with a temporary lexicon makes it
trans-finite-state. Finally, as with all semi-parsers, the
machines we construct in this way must actually be
interpreted as transducers, not just acceptors.
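The left-extension step of the construction can be made concrete with a toy machine in Python. The encoding of a machine as a (start, accepting states, transition table) triple, the function names, and the state names are assumptions chosen for brevity; the sketch treats the machine as a plain acceptor and ignores the transducer aspect.

```python
def accepts(machine, tokens):
    """Simulate a nondeterministic machine given as (start, accepting, delta)."""
    start, accepting, delta = machine
    states = {start}
    for tok in tokens:
        states = {nxt for s in states for nxt in delta.get((s, tok), set())}
        if not states:
            return False
    return bool(states & accepting)

def extend_left(machine, lexeme, new_start="S'"):
    """Replace the start state with a new one that reaches the old start on `lexeme`."""
    start, accepting, delta = machine
    new_delta = dict(delta)
    new_delta[(new_start, lexeme)] = {start}
    return (new_start, accepting, new_delta)

# Toy machine accepting exactly the token string "America Inc."
mu = ("S", {"F"}, {("S", "America"): {"A"}, ("A", "Inc."): {"F"}})
# Encode a rule that extends the phrase leftwards over the lexeme "of".
mu_next = extend_left(mu, "of")
```

Following the construction literally, the extended machine accepts "of America Inc." and no longer accepts the unextended string, since its only way into the old start state is through the new arc.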
Learning rule sequences automatically 
Our experience with writing rule sequences by hand in
this approach has been very positive. The rule patterns
themselves are simple, and the fact that they are
sequenced localizes their effects and reduces the scope
of their interactions. These hand-engineering
advantages are also conferred upon learning programs
that attempt to acquire these rules automatically.
The approach we have taken towards discovering
phrase rule sequences automatically is a maximum
error-reduction scheme for selecting the next rule in a
sequence. This approach originated with Brill's work
on part-of-speech tagging and bracketing (Brill 1993).
Brill's rule learning algorithm
The search for a rule sequence in a given training
corpus begins by first applying the initial labeling
function, just as would be the case in running a
complete sequence. Following this, the learning
procedure needs to consider every rule that can possibly
apply at this juncture, which itself is a function of the
rule schema language. For each such applicable rule r,
the learner considers the possible improvement in
phrase labeling conferred by r in the current state. The
rule that most reduces the residual error in the training
data is selected as the next rule in the sequence.
This generate-and-test cycle is continued until a
stopping criterion is reached, which is usually taken as
the point where performance improvement falls below a
threshold, or ceases altogether. Other alternatives
include setting a strict limit on the number of rules
learned, or cross-testing the performance improvement
of a rule on a corpus distinct from the training set.
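The generate-and-test cycle can be sketched as a greedy loop in Python. This is a simplified illustration rather than Brill's or the authors' learner: rules act on a flat list of phrase labels, the error is a plain mismatch count, the candidate rules and toy data are hypothetical, and ties are broken arbitrarily by candidate order.

```python
def count_errors(labels, gold):
    return sum(1 for have, want in zip(labels, gold) if have != want)

def learn_sequence(initial, gold, candidates, min_gain=1, max_rules=100):
    """Greedily pick the rule that most reduces residual error, until gains dry up."""
    current, sequence = list(initial), []
    while len(sequence) < max_rules:                 # strict limit on rules learned
        base = count_errors(current, gold)
        best = None
        for name, rule in candidates.items():        # generate and test
            proposed = rule(current)
            gain = base - count_errors(proposed, gold)
            if best is None or gain > best[0]:
                best = (gain, name, proposed)
        if best is None or best[0] < min_gain:       # improvement below threshold
            break
        sequence.append(best[1])
        current = best[2]
    return sequence, current

phrases = ["Nomura Securities Inc.", "Donald F. DeScenza",
           "Mr. Olatunji", "Detroit Diesel Corp."]
gold = ["ORG", "PERSON", "PERSON", "ORG"]

def relabel(test, new_label):
    """Build a rule that relabels unlabelled phrases whose text passes `test`."""
    def rule(labels):
        return [new_label if lab == "NONE" and test(text) else lab
                for text, lab in zip(phrases, labels)]
    return rule

candidates = {
    "ends-in-Inc. -> ORG": relabel(lambda t: t.endswith("Inc."), "ORG"),
    "ends-in-Inc.-or-Corp. -> ORG": relabel(lambda t: t.endswith(("Inc.", "Corp.")), "ORG"),
    "any -> PERSON": relabel(lambda t: True, "PERSON"),
}
sequence, final = learn_sequence(["NONE"] * 4, gold, candidates)
```

On this toy data the learner first selects the broader corporate-suffix rule and then the catch-all PERSON rule, after which no candidate yields further improvement and the loop stops.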
The rule search space 
The language of phrase rules supports a large number of 
possible rules that the phrase rule learner might need to 
consider at any one time. Take one of our smaller
training sets, in which there are 191 sentences consisting
of 6,812 word tokens, with 2,077 unique word types.
Considering only lexical rules (those that look for
particular words), this means that there are as many as
18,693 possible unary lexical rules (2,077 × 9 rule
schemata), and 12,941,787 binary lexical rules (2,077² ×
3 simple bigram rule schemata) in the search space.
However, by inverting the process, and tabulating only 
those lexical contexts that actually appear in the 
training texts, this search space is reduced to 2,219
unary lexical rules and 854 binary lexical rules.
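The size of the unrestricted search space follows directly from the counts above, and the inverted tabulation can be sketched in the same breath. In the Python below, the arithmetic uses the figures given in the text, while `attested_left_contexts` is a hypothetical helper that tabulates only one kind of context (words to the left of a phrase) on a toy sentence.

```python
word_types = 2077
unary_schemata = 9
bigram_schemata = 3

unary_rules = word_types * unary_schemata           # every word in every unary schema
binary_rules = word_types ** 2 * bigram_schemata    # every word pair in every bigram schema

def attested_left_contexts(sentences):
    """Tabulate only the left contexts that actually occur next to a phrase."""
    unary, binary = set(), set()
    for tokens, spans in sentences:
        for start, _ in spans:
            if start >= 1:
                unary.add(tokens[start - 1])
            if start >= 2:
                binary.add((tokens[start - 2], tokens[start - 1]))
    return unary, binary

sentence = (["analyst", "with", "Nomura", "Securities", "Inc."], [(2, 5)])
unary_seen, binary_seen = attested_left_contexts([sentence])
```

The products reproduce the 18,693 and 12,941,787 figures, whereas the tabulation proposes only the single unary context and single bigram context that the toy sentence actually contains.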
There are two substantively different kinds of rules 
to acquire: rules that only change the label of a phrase, 
and those that change the boundary of a phrase. The 
latter present a problem for accurately estimating the
improvement of a rule, since sometimes the boundary
realignment necessary to fix a phrase problem exceeds
the amount by which a single rule can move a
boundary, namely two lexemes. For these phrases to
be fixed there will have to be more than one rule to
nudge the appropriate phrase boundaries over. We
handle this through a heuristic scoring function that
estimates the value of moving a boundary in such cases.
Error estimation methods 
A rule that fixes a problem in some cases might well 
introduce errors in some other cases. This kind of over- 
generalization can occur early in the learning process, as 
new rules need only improve over an approximate 
initial labeling. The extent to which a candidate rule is
rewarded for its specificity and penalized for its over- 
generalization can have a strong effect on the final 
performance of the rule sequences discovered. 
We explored the use of three different types of 
scoring metrics for use in selecting the "best" of the 
competing rules to add to the sequence. Initially we 
made use of a simple arithmetic difference metric, y - s,
where y (for yield) is the number of additional correct
phrase labelings that would be introduced if a rule were
to be added to the rule sequence, and s (for sacrifice) is
the number of new mistaken labelings that would be
introduced by the addition of the rule. This is Brill's
original metric, but note that it does not differentiate
between rules whose overall improvement is identical,
but whose rate of over-generalization is not. For
example, a rule whose yield is 100 and sacrifice is 70 is
treated as equally valuable as one whose yield is only 30
but which introduces no overgeneralization at all
(sacrifice = 0). This can lead to the selection of
low-precision rules, and while small numbers of precision
errors may be patched, wholesale precision problems 
make subsequent improvement more difficult. 
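The tie described above is easy to verify numerically. In the Python sketch below, `arithmetic_score` is the y - s metric from the text, while `rule_precision` is an assumed helper, introduced here only to make the difference in over-generalization visible.

```python
def arithmetic_score(y, s):
    """Yield minus sacrifice, the simple difference metric."""
    return y - s

def rule_precision(y, s):
    """Share of a rule's relabelings that are correct (illustrative helper)."""
    return y / (y + s)

broad = (100, 70)     # yield 100, sacrifice 70
careful = (30, 0)     # yield 30, no over-generalization

same_score = arithmetic_score(*broad) == arithmetic_score(*careful)
broad_precision = rule_precision(*broad)
careful_precision = rule_precision(*careful)
```

Both rules score 30 under the difference metric, yet fewer than six in ten of the broad rule's relabelings are correct, against all of the careful rule's.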
Scoring metric      Training                     Test
                    Recall  Precision  P&R       Recall  Precision  P&R
Arithmetic (y-s)    88.8    81.2       84.8      87.2    79.0       82.9
Log likelihood      81.9    85.7       78.4      81.0    73.4       77.0
F-measure, β=0.8    86.3    82.9       84.5      85.0    81.5       83.2
Table 2: Comparative contributions of three scoring measures after 100 learning epochs.
(Training on 1495 sentences from the MUC-6 named entities task).
The next measure we investigated was one 
advocated by Dunning (1993) which uses a log
likelihood measure for estimating the significance of rare
events in small populations. This measure did not
improve precision or recall in the learned sequences.
The third scoring measure we investigated was the 
F-measure (Van Rijsbergen 1979), which was introduced
in information retrieval to compute a weighted
combination of recall and precision. The F-measure is also
used extensively in evaluating information extraction
systems at MUC (Chinchor 1995). It is defined as:
$$F = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$$
This measure is conservative in the sense that its
value is closer to precision, P, or recall, R, depending on
which is lower. By manipulating the $\beta$ parameter one is
able to control for the relative importance of recall or
precision. Preliminary exploration shows that a $\beta$ of 0.8
seems to boost precision with no significant loss in the 
long-term recall or F-measure of the rule sequences. 
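The effect of the weighting parameter can be checked with a short Python function. The sketch assumes the usual MUC form of the measure, F = (β² + 1)PR / (β²P + R); the function name and the zero-denominator convention are illustrative choices.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted combination of precision and recall: (b^2 + 1)PR / (b^2 P + R)."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0                    # assumed convention when both scores are zero
    return (b2 + 1) * precision * recall / denom

balanced = f_measure(81.5, 85.0)                    # equal weighting (beta = 1)
precision_leaning = f_measure(81.5, 85.0, beta=0.8) # beta below 1
only_precision = f_measure(81.5, 85.0, beta=0.0)    # limiting case
```

With the weight at 1 the result lies between the two scores; lowering it below 1 pulls the result towards precision, and at 0 the measure reduces to precision alone, which is why a value of 0.8 rewards precise rules during learning.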
Table 2 summarizes the contributions of these three
error measures towards learning rule sequences for the 
MUC-6 named entities task (for task details, see below). 
Evaluation 
We have applied this rule sequence approach to a 
variety of realistic tasks. These largely arose as part of 
our information extraction efforts, and have been either 
directly or indirectly evaluated in the context of two
evaluation conferences: MUC-6 and MET (for Multilingual
Entity Tagging). In this paper, we will
primarily report on evaluation conducted in the context
of the MUC-6 named entities task (Sundheim 1995).¹
The named entities task attempts to measure the 
ability to identify the basic building blocks of most 
newswire analysis applications, e.g., named entities such 
as persons, organizations, and geographical locations. 
Also measured is the identification of some numeric 
expressions (money and percentiles), dates, and times. 
This task has become a classic application for finite- 
state pre-parsers, and indeed our work was in part 
motivated by the success that has been achieved by such 
systems in past information extraction evaluations. 
We have applied a variety of techniques towards this 
task. The easy cases of dates and times are identified by
a separate pre-processor, leaving numeric expressions
(also easy) and "proper names" (the interesting hard
part) to be treated by the rule sequence processor.
¹ We have also measured performance on several syntactic
constructs (e.g., the so-called noun group), and on semantic
subgrammars (e.g., person-title-organization appositions).
Hand-crafted Rules 
We first approached this task as an engineering 
problem, and wrote a rule sequence by hand to identify 
these named entities. The rule sequence comprises 145
named-entity rules, 12 rules for expressions of money
and percentiles, and 61 rules for geographical
complements (as in "Hyundai of Canada"). In addition, the
rules refer to a few morphological predicates and some
short word lists; one such list, for example, lists words
designating business subsidiaries, e.g., "unit". The
initial phrase labeling for the proper name cases is
implemented by accumulating runs of NNP- and
NNPS-tagged lexemes. A similar strategy is used for number
expressions, using numeric tags.
The performance of our hand-crafted rule sequence 
is summarized in Table 3, below, which gives component
scores on the MUC-6 blind test set. The most
interesting measures are those for the difficult proper 
name cases. Our performance here is high, especially 
for person names. Our lowest score is on organizational 
names, but note that the system lacks any extensive 
organization name list. Aside from ten hard-wired 
names, all names are found from first principles. On 
the easy numeric expressions, performance is almost
perfect; precision appears poor for percentiles, but this
is due to an artifact of the testing procedure.²
Machine-crafted Rules 
To evaluate the performance of our learning algorithm, 
we attempted to reproduce substantially the same 
environment as is used for the hand-crafted rules. The 
learner had access to the same predefined word lists, 
including the less-than-perfect TIPSTER gazetteer.
Further, we only acquired rules for the hardest cases,
namely the person, organization, and location phrases.
We cut off rule acquisition after the 100th rule.
The results for this acquired rule set are surprisingly 
encouraging. As Table 3 shows, these rules achieved 
higher recall on the very hardest phrase type 
(organization) than their hand-crafted counterparts, 
albeit at a cost in precision. Overall, however, the 
machine-crafted rules still lag behind. When we
incorporated them into our information extraction
system, the machine-learned rules achieved an overall
named entities F-score of 85.2, compared to the 91.2
achieved by the hand-crafted rules. It should be noted,
however, that the system loaded with these
machine-crafted rules still outperformed about a third of systems
participating in the MUC-6 evaluation.
² Our performance vis-a-vis other MUC-6 participants
placed us in the top third of participating systems. Except for
the absolute highest performer, all these top-tercile systems
were statistically not distinguishable from each other.
Phrase type    N      Hand-crafted rules    Machine-learned rules
                      Recall  Precision     Recall  Precision
Organization   419    85      87            87      79
Person         349    94      94            78      79
Location       109    94      87            ?       68
Money          74     99      97
Percent        16     100     67
All phrases    2150   91      92            88      83
                      Overall F = 91.2      Overall F = 85.2
Table 3: Performance on the MUC-6 named entities blind test.
Multilingual evaluation (MET)
After the MUC-6 evaluation, the named entity task was
extended in various ways to make it more applicable
cross-linguistically. Predictably, this was followed by a
new round of evaluations: MET. The target languages in
this case were Spanish, Chinese, and Japanese. We
applied our approach to all three.
The MET evaluation required actual system performance
results to be kept strictly anonymous, which
precludes our reporting here any scores as specific as we
have cited for English. What we may legitimately
report, however, is that we have effectively reproduced
or bettered our hand-engineered English results in the
Spanish and Japanese tasks, despite having no native
speakers of either language (and only the most
rudimentary reading skills in Kanji). In both cases, we were
able to exploit part-of-speech tagging and some existing
word lists for person names and locations.
For Chinese, although we had available a word
segmenter, we had neither part-of-speech tagger, nor
word lists, nor even the elementary reading skills we
had for Japanese. As a result, we had to rely almost
entirely on the learning procedure to acquire any rule
sequences. Despite these impediments, we came close
to reproducing our results with the English
machine-learned named entities rule sequence.
Discussion 
What is most encouraging about this approach is how 
well it performs on so many dimensions. We have only 
reported here on name-finding tasks, but early
investigations in other areas are encouraging as well. With
rule sequences that parse noun groups, for instance, we
hope to reproduce the utility of other rule-sequence
approaches to text chunking (Ramshaw & Marcus
1995). We are also excited by the promise of the
learning procedure, not just because it learns good
rules, but also because the rules it learns can be freely
intermixed with hand-engineered rules. This
mixed-mode acquisition is unique among natural language
learning procedures, and we put it to good use in
building our multilingual name-tagging sequences.
Despite results that compare favorably to those of
more mature systems, this work is still in its infancy.
We still have much to explore, especially with the
learning procedure. Indeed, while the learner induces
rule sequences that perform well in the aggregate,
individual rules clearly show their mechanical genesis.
For instance, when the learner must break ties between
identically-scoring rule candidates, it often does so in
linguistically clumsy ways. At times, the learner may
acquire a good contextual pattern, but may be unable to
extend it to closely-related cases that would occur
naturally to a linguist.
We believe these problems are solvable in the
near-term, and we have partial solutions in place already. As
our techniques mature, this validates not only our
particular approach to phrase-finding, but the whole
field of language processing through rule sequences.
References 
Aberdeen, J., Burger, J., Day, D., Hirschman, L.,
Robinson, P., & Vilain, M. 1995. "Description of the
Alembic system used for MUC-6". In Prcdgs. of MUC-6,
Columbia MD.
Appelt, D. E., Hobbs, J. R., Bear, J., Israel, D., &
Tyson, M. 1993. "FASTUS: A finite-state processor for
information extraction from real-world text." In Prcdgs.
of IJCAI-93, Chambéry, France.
Brill, E. 1993. A corpus-based approach to language
learning. Doctoral Dissertation, Univ. of Pennsylvania.
Chinchor, N. 1994. "MUC-5 evaluation metrics". In
Prcdgs. of MUC-5, Baltimore, MD.
Dunning, T. 1993. "Accurate methods for the
statistics of surprise and coincidence". Comput. Ling. 19.
Grishman, R. 1995. "The NYU system for MUC-6, or
where's the syntax?" In Prcdgs. of MUC-6, Columbia MD.
Jacobs, P. S., Krupka, G., & Rau, L. 1991.
"Lexico-semantic pattern-matching as a companion to parsing".
In Prcdgs. of the Fourth DARPA Speech and Nat. Lang.
Workshop, San Mateo, CA: Morgan Kaufmann.
Ramshaw, L. & Marcus, M. 1995. "Text chunking
using transformation-based learning". In Prcdgs. of 3rd
Wkshp on Very Large Corpora, Cambridge, MA.
Sundheim, B. 1995. "Named entity task definition".
In Prcdgs. of MUC-6, Columbia MD.
Van Rijsbergen, C. J. 1979. Information Retrieval.
London: Butterworths.
