THE COMMON PATTERN SPECIFICATION LANGUAGE 
Douglas E. Appelt 
Artificial Intelligence Center 
SRI International 
333 Ravenswood Ave, Menlo Park, CA 
Boyan Onyshkevych 
R525 
Department of Defense 
Ft. Meade MD 
ABSTRACT 
This paper describes the Common Pattern Specification Language (CPSL) that was developed during the TIPSTER program by a 
committee of researchers from the TIPSTER research sites. Many information extraction systems work by matching regular 
expressions over the lexical features of input symbols. CPSL was designed as a language for specifying such finite-state grammars 
for the purpose of specifying information extraction rules in a relatively system-independent way. The adoption of such a common 
language would enable the creation of shareable resources for the development of rule-based information extraction systems. 
1. THE NEED FOR CPSL 
As researchers have gained experience with information 
extraction systems, there has been some convergence of 
system architecture among those systems based on the 
knowledge engineering approach of developing sets of 
rules more or less by hand, targeted toward specific sub- 
jects. Some rule-based systems have achieved very high 
performance on such tasks as name identification. Ide- 
ally, developers of information extraction systems 
should be able to take advantage of the considerable 
effort that has gone into the development of such high- 
performance extraction system components. Unfortu- 
nately, this is usually impossible, in part because each 
system has a native formalism for rule specification, and 
the translation of rules from one native formalism to 
another is usually a slow, difficult, and error-prone pro- 
cess that ultimately discourages the sharing of system 
components or rule sets. 
Over the course of the TIPSTER program and other 
information extraction efforts, many systems have con- 
verged on an architecture based on matching regular 
expression patterns over the lexical features of words in 
the input texts. The Common Pattern Specification Lan- 
guage (CPSL) was designed to take advantage of this 
convergence in architecture by providing a common for- 
malism in which finite-state patterns could be repre- 
sented. This would then enable the development of 
shareable libraries of finite-state patterns directed 
toward specific extraction tasks, and hopefully remove 
one of the primary barriers to the fast development of 
high-performance information extraction systems. 
Together with common lexicon standards and annota- 
tion standards, a developer can exploit previous domain 
or scenario customization efforts and make use of the 
insights and the hard work of others in the extraction 
community. The CPSL was designed by a committee 
consisting of a number of researchers from the Govern- 
ment and all of the TIPSTER research sites involved in 
Information Extraction that are represented in this vol- 
ume. 
2. INTERPRETER ASSUMPTIONS 
A pattern language is intended to be interpreted. Indeed, 
the interpreter is what gives the syntax of the language 
its meaning. Therefore, CPSL was designed with a 
loosely specified reference interpreter in mind. It was 
realized that extraction systems may not work exactly 
like the reference interpreter, and it was certainly not the 
goal of the designers to stifle creativity in system design. 
However, it was hoped that any system that imple- 
mented at least the functionality of the reference inter- 
preter would, given appropriate lexicons, be able to used 
published sharable resources. 
23 
The functionality assumed to be implemented by the 
reference interpreter is as follows: 
The interpreter implements cascaded finite-state trans- 
ducers. 
Each transducer accepts as input a sequence of annota- 
tions conforming to the Annotation object spec- 
ification of the TIPSTER Architecture 
\[Grishman, this volume\]. The fundamental 
operation performed by the interpreter is to test 
whether the next annotation in sequence has an 
attribute with a value specified by the grammar 
being interpreted. 
Each transducer produces as output a sequence of 
annotations conforming to the Annotation 
object specification of the TIPSTER Architec- 
ture. 
The interpreter maintains a "cursor" marking the cur- 
rent position in the text. All possible rules are 
matched at each point. One of the matching 
rules is selected as a "best match" and is 
applied. The application of a rule results in the 
creation of new annotations, and in moving the 
"cursor" to a new position. 
The interpreter does an initial tokenization and lexical 
lookup on the input. Each lexical input item is 
marked with a Word annotation, and attributes 
from the lexicon are associated with each anno- 
tation. 
The interpreter provides an interface to any external 
functions to extend the functionality of the basic 
interpreter. Such functions should be used spar- 
ingly and be carefully documented. One exam- 
ple of a legitimate use would be to construct 
tables of information useful to subsequent 
coreference resolution. 
To date, one interpreter has been developed that closely 
conforms to the specifications of the reference inter- 
preter, namely the TextPro system, implemented by 
Appelt. The object code, together with a fairly compre- 
hensive English lexicon and gazetteer, and a sample 
grammar for doing name recognition on Wall Street 
Journal texts is freely downloadable over the web at the 
following URL: 
http://www.ai.sri.com/-appelt/TextPro/. 
3. A DESCRIPTION OF CPSL 
A CPSL grammar consists of three parts: a macro defini- 
tion part, a declarations part, and a rule definition part. 
The declaration section allows the user to declare the 
name of the grammar, since most extraction systems 
will employ multiple grammars to operate on the input 
in sequential phases. The grammar name is declared 
with the statement 
Phase: <grammar_name> 
The Input declaration follows the Phase declaration, and 
tells the interpreter which annotations are relevant for 
consideration by this phase. For example, a name recog- 
nizer will probably operate on Word annotations, while 
a parser may operate on Word and NamedEntity annota- 
tions. If there are multiple annotation types declared in 
the Input declaration, the first annotation in the list is 
considered the "default" annotation type. The impor- 
tance of the default annotation will be explained under 
the discussion of quoted strings. Any other annotations 
are invisible to the interpreter, as well as any text that 
might be annotated exclusively by annotations of 
ignored types. A typical Input declaration would be: 
Input: Word, NamedEntity 
Finally, the language supports an Options declaration, 
where the user can specify implementation-dependent 
interpreter options to be used when interpreting the 
grammar. 
3.1 The Rules Section 
The rules section is the core of the grammar. It consists 
of a sequence of rules, each with a name and an optional 
priority. The general syntax of a rule definition is 
Rule: <rule_name> 
Priority: <integer> 
<rule_pattern part> --> 
<rule_action_part> 
Rules have names primarily for the implementation 
dependent convenience of error printing and tracing 
modules. Priority values can be any integer, and indicate 
to the interpreter whether this rule is to take precedence 
over other rules. The implementation of priority in the 
reference interpreter is that the rule matching the most 
annotations in the input stream is preferred over any rule 
matching fewer annotations, and if two rules match the 
same number of annotations, the rule with the highest 
priority is preferred. If several rules match that have the 
24 
same priority, then the rule declared earlier in the file is 
preferred. Interpreters should adopt this priority seman- 
tics by default. If another priority semantics is imple- 
mented, the grammar writer can select it in the Options 
declaration. 
The reference interpreter is assumed to maintain a "cur- 
sor" pointer marking its position in the chunk of input 
currently being processed. The interpreter matches each 
rule pattern part against the sequence of annotations of 
the declared input type. If no rules match, then the cur- 
sor is moved past one input annotation. If one or more 
rule pattern parts match at the current cursor position, 
the interpreter selects the "best" match according to the 
priority criteria discussed above, and executes the rule 
action part for that rule. Finally, the interpreter moves 
the cursor past the text matched by the main body part 
of the rule pattern part. This process is repeated until the 
cursor is finally moved to the end of the current input 
chunk. 
The Rule Pattern Part. 
The pattern part of the rules consists of a prefix pattern, 
a body pattern, and a postfix pattern. The prefix and 
postfix patterns are both optional, but the body is man- 
datory. The syntax is as follows: 
< prefix_pattern > body_pattern 
< postfix_pattern > 
When pattern matching begins, the reference interpreter 
assumes that the initial cursor position is between the 
prefix pattern and the body pattern. If the annotations to 
the immediate left of the cursor match the prefix pattern, 
then the body pattern is matched. If that match is suc- 
cessful, then the postfix pattern is matched. If all three 
matches are successful, then the pattern is deemed a suc- 
cessful match. Following success and execution of the 
rule's action part, the cursor is moved to the point in the 
text after which the body pattern matched, but before the 
postfix pattern, if any. 
Each of the constituents in the above rule is defined the 
same way. They are grouped (and optionally labeled) 
sequences of pattern elements. Labels are only useful in 
the central body pattern, because the annotations 
matched in the body pattern can be operated on by the 
action part of the rule. When a new annotation is created 
from a label in the body pattern, the new annotation 
receives a span consisting of the first through last char- 
acters covered by the spans of the matched annotations. 
Groups of pattern elements are enclosed with parenthe- 
ses, and are optionally followed by a label. There are 
two types of label expressions, indicated by ":" and "+:" 
characters, respectively. When used in the pattern part of 
a rule, the ":" label references the last-matched annota- 
tion within its scope. The "+:" annotation, on the other 
hand, refers to the entire set of annotations matched 
within its scope. Here is an example of labels used in a 
pattern: 
((~douglas"):firstName ~appelt") 
+:wholeName 
In this example, the label "firstName" refers to the 
annotation spanning "douglas", and the label 
"wholeName" refers to the set of annotations { "dou- 
glas" "appelt" }. 
Pattern elements are constraints on the type, selected 
attributes and values of the next annotations in the input 
stream. The basic form of an attribute constraint is 
Annotation_type. attribute <rel> 
<value> 
The annotation_type must be one of the types listed on 
the Input declaration for this grammar. The attribute 
must be one of the attributes defined for that annotation 
type. The <rel> element is one of the relations appro- 
priate for the attribute type. Possible relations are equal 
(==), not equal (!=), greater than (>), less than (<), 
greater than or equal to (>=), less than or equal to (<=). 
The <value> element can be a constant of any type 
known to the interpreter, or it can refer to an annotation 
matched in the pattern part. The data types supported by 
the reference interpreter are integer, floating point num- 
ber, Boolean, string, symbol, a reference to another 
annotation, or sets of any of those types. The reference 
interpreter does not treat symbols and strings differently, 
except that if a symbol contains any non-alphanumeric 
characters, it must be enclosed in string quotes in order 
to be parsed correctly by the grammar compiler. 
A pattern element consists of constraints in the above 
form, enclosed in brace characters. For example: 
{Word.N == true, 
Word.number == singular} 
would match an annotation that has a Boolean "N" 
attribute with value true, and a character "number" 
attribute whose value is "singular." 
The reference interpreter assumes that if an attribute is 
not present on an annotation, it will be treated as though 
25 
it were a Boolean attribute with value false. Reasonable 
type coercion is done when comparing values of differ- 
ent types. 
An abbreviation allows an entire pattern element to be 
replaced by a quoted string. This is shorthand for con- 
straining the lemma attribute of the default input annota- 
tion for this grammar to be the specified string. For 
example, if annotation type Word were declared to be 
the default input type for the current grammar then the 
pattern element 
"howdy" 
would be exactly equivalent to typing 
{Word.lemma == ~howdy"}. 
The reference interpreter assumes that the value of the 
lemma attribute is the character sequence that is used to 
look the word up in the lexicon to obtain its other lexical 
attributes. 
In addition to being sequenced in groups, pattern ele- 
ments can be combined with one of several regular 
expression operators. Possible operations include 
Alternation: (argl I arg2 I "" I arg n) 
Iteration: (argl arg2 ... argn) * or (argl 
arg2 ... argn) + 
Optionality: (argl arg2 ... argn)? 
As you would expect, * matches zero or more occur- 
rences of its argument, + matches one or more occur- 
rences, and ? matches zero or one occurrences. 
Finally, a pattern element can be a call to an external 
function. An external function call is simply the name of 
the function followed by parameters enclosed in square 
brackets. The function must be defined to return a Bool- 
ean value, and it can take any number of arguments, 
which can be references to annotations and attributes 
bound by labels defined to the left of where the external 
function call appears in the pattern. If the function 
returns true, the pattern matching continues, and it fails 
if the function returns false. 
The Rule Action Part 
The rule action part tells the interpreter what to do when 
the rule pattern part matches the input at the current 
position, and consists of a comma-separated list of 
action specifications. The basic form of an action speci- 
fication is 
annotation/attribute 
<assignment_operator> <value> 
The annotation/attribute specification is an instruction to 
the interpreter to build a new annotation. The annota- 
tion/attribute specification has the following syntax: 
:<label>.<annotation_type>. 
<attribute> 
The label must be one of the labels defined in the pattern 
part of the rule. Also, the label must have been bound 
during the pattern-matching phase. For example, a label 
in an optional element that was not matched would be 
unbound, and generate a runtime error. The annotation 
type of the newly created annotation can be any annota- 
tion type. The attribute is optional. If present, it means 
to assign the value on the right hand of the assignment 
operator to the indicated attribute on the newly created 
annotation. If the attribute is not present, then the only 
legal value on the right hand of the assignment operator 
is "@", which tells the interpreter to create an annota- 
tion spanning the specified label, but which has no 
attributes. 
The binding and the type of the label determine the span 
set of the newly created annotation. If the label was 
defined with ":", the annotation has a single span, which 
is the first through the last character of the annotations 
in the group to which the label is attached. If the label 
was defined with "+:", the new annotation has a set of 
spans, where each span in the set is obtained from one of 
the annotations in the group to which the label is 
attached. 
When the reference interpreter is evaluating an assign- 
ment statement, it looks for an annotation of the type 
specified on the left -hand side that has the exact span 
specified by the label. If one exists, then that one is used 
to complete the assignment operation. Otherwise, a new 
annotation is created. This functionality allows one to 
assign values to multiple attributes on a single annota- 
tion by using a sequence of assignment actions with the 
same label and annotation type. 
CPSL includes two assignment operators: "=" and "+=". 
The former operator is the basic assignment operator. 
The latter operator assumes that the left hand operand 
represents a set, and the right hand element is added to 
the set by the assignment. 
In addition to assignment statements, the action part of a 
rule can contain simple conditional expressions. The 
conditional expression can refer to the attributes of 
26 
annotations bound during the pattern match. Simple 
conjunction and disjunction operators (& and D are pro- 
vided for multiple conditional clauses, however, the lan- 
guage does not define a full Boolean expression syntax 
with parentheses and operator precedence. The clauses 
are simply evaluated left to fight. The THEN and ELSE 
clauses of the conditional consist of a Here is an exam- 
ple of a conditional expression: 
(IF :l.Word.lemma != false 
THEN 
:rhs.DateTime.lemma = :l.Word.lemma) 
Action specifications can also be calls to external func- 
tions, invoked as before, by the name of the function fol- 
lowed by a list of parameters enclosed in square 
brackets. External functions can return a value or be 
defined as void. If the function returns a value, it can 
appear on the right-hand side of an assignment state- 
ment. Otherwise, the external function call appears as an 
entire action specification. 
CPSL does not specify how the interface between the 
interpreter and the external function should be imple- 
mented. Each implementation is free to define its own 
API. 
3.2 The Macro Definition Section 
The grammar writer can optionally define macros at the 
beginning of a grammar definition file. CPSL macros 
are pure text substitution macros with the following 
twist: each macro consists of a pattern part and an action 
part, just like a CPSL rule. The macro is invoked by 
writing its name followed by an argument list delimited 
by double angle brackets somewhere in the pattern part 
of a rule. When the compiler encounters a macro call in 
the pattern part of the rule, it binds the parameters in the 
call to the variables in the macro definition prototype. 
The parameter bindings are substituted for occurrences 
of the parameters in the macro's pattern part, and the 
expanded pattern part is then inserted into the rule's pat- 
tern part in place of the macro call. Then, parameter 
substitution is performed on the macro's action part, and 
the resulting action specification is then added to the 
beginning of the rule's action part. It is permitted for the 
pattern part of a macro definition to contain references 
to other macros, so this macro substitution process is 
iterated until no more macro substitutions are possible. 
Here is an example of a macro definition: 
Short_and_stupid\[X,lbl\] ==> 
{Word.X == true, Word.ADJ == false} 
:ibl. Item.X = true, ;; 
An invocation of the above macro: 
Rule: foo 
(Short_and_stupid<<N,myLabel>>) 
:myLabel 
:myLabel. Item.type = stupid 
would result in the following rule being compiled: 
Rule: foo 
({Word.N == true, 
Word.ADJ == false}):myLabel 
:myLabel. Item.N = true, 
:myLabel. Item.type = stupid 
Macros can be used to automatically generate some very 
complicated rules, and when used judiciously can con- 
siderably improve their readability and comprehensibil- 
ity. 
4. A FORMAL DESCRIPTION OF CPSL 
The following is a BNF description of the common pattern specification language: 
<GRAMMAR> ::= <MACROS> <DECLARATIONS> <RULES> 
.... Declarations 
<DECLARATIONS> ::= <DECL> (<DECLARATIONS>) 
<DECL> ::= <DECL_TYPE> : <SYMBOL_LIST> 
<DECL_TYPE> ::= Phase \[ Input I Options 
27 
<SYMBOL_LIST> ::= <SYMBOL> (, <SYMBOL_LIST>) 
- Macros 
<MACROS> ::= <MACRO> (<MACROS>) 
<MACRO> ::= <MACRO_HEADER> ==> 
<PAT_PART> --> <ACT_PART> ;; 
<MACRO_HEADER> ::= <SYMBOL> [ <PARAMLIST> ] 
<PAT_PART> ::= any characters except --> and ;; 
<ACT_PART> ::= any characters except --> and ;; 
<PARAMLIST> ::= <SYMBOL> ( , <PARAM_LIST> ) 
<MACRO_INVOCATION> ::= <SYMBOL> << <ARG_LIST> >> 
<ARG_LIST> ::= <ARG> (, <ARG_LIST>) 
<ARG> ::= any characters except ; and >> 
Rules 
<RULES> ::= <RULE> ( <RULES> ) 
<RULE> ::= <NAME DECL> (<PRIORITY_DECL>) <BODY> 
<NAME_DECL> ::= Rule : <SYMBOL> 
<PRIORITY DECL> ::= Priority : <NUMBER> 
<BODY> ::= <CONSTRAINTS> --> <ACTIONS> 
<CONSTRAINTS> ::= ( < <CONSTRAINT_GROUP> > ) 
<CONSTRAINT_GROUP> 
( < <CONSTRAINT_GROUP> > ) 
<CONSTRAINT_GROUP> ::= <PATTERN_ELEMENTS> 
( I CONSTRAINT GROUP) 
<PATTER~LELEMENTS> ::= <PATTER~EMEMENT> 
(<PATTER~LELEMENTS>) 
<PATTERN_ELEMENT> ::= <BASIC_PATTERN_ELEMENT> I 
~(~ <CONSTRAINT_GROUP> ~)" <KEENE_OP> <BINDING> I 
"(~ <CONSTRAINT_GROUP> ")" I 
~(~ CONSTRAINT_GROUP ~)" <KLEENE_OP> I 
~(~ <CONSTRAINT_GROUP> ~)" <BINDING> 
28 
<KLEENE_OP> ::= * I + I ? 
<BINDING> ::= <INDEX_OP> : <LABEL> 
<INDEX_OP> ::= : I +: 
<LABEL> ::= <SYMBOL> I <NUMBER> 
<BASIC_PATTER~ELEMENT> ::= { <C_EXPRESSION> } i 
<QUOTED_STRING> I 
<SYMBOL> i 
<FUNCTION_CALL> 
<FUNCTION_CALL> ::= <SYMBOL> "[" <FARG_LIST> "]" 
<FARG_LIST> ::= nil I <FARG> ("," <FARG_LIST> ) 
<FARG> ::= <VALUE> I (^) <INDEX_EXPRESSION> 
<C_EXPRESSION> ::= <CONSTRAINT> 
( ", " <C_EXPRESSION> ) 
<CONSTRAINT> ::= <ATTRSPEC> <TEST_OP> <VALUE> I 
<ANNOT_TYPE> 
<ATTRSPEC> ::= <ANNOT_TYPE> <SYMBOL> 
<ANNOT_TYPE> ::= <SYMBOL> I <ANY> 
TEST_OP ::= == I ~= I >= I <= I < I > 
<VALUE> ::= <NUMBER> I <QUOTED_STRING> I <SYMBOL> 
I true I false 
<ACTIONS> ::= <ACTION_EXP) ( , <ACTIONS>) 
<ACTION_EXP> ::= <IF_EXP> ] <SIMPLE_ACTION> 
<IF_EXP> ::= "(" IF <A C EXPRESSION> 
THEN <ACTIONS> ")" I 
"(" IF <A_C_EXPRESSION> THEN <ACTIONS> 
ELSE <ACTIONS> ")" 
<A_C_EXPRESSION> ::= <A_CONSTRAINT> 
(<BOOLEAI~OP> <A C EXPRESSION>) 
<BOOLEA~OP> ::= & I "J" 
<A_CONSTRANT> ::= <INDEX_EXPRESSION> 
29 
<TEST OP> <VALUE> 
<SIMPLE_ACTION> ::= <ASSIGNMENT> I 
<FUNCTION_CALL> 
<ASSIGNMENT> ::= <INDEX_EXPRESSION> = @ I 
<INDEX_EXPRESSION> < ASSIGN_OP > 
(<VALUE> I <INDEX_EXPRESSION> I 
<FUNCTION_CALL> ) 
<ASSIGN_OP> ::-- I += 
<INDEX_EXPRESSION> : := : <INDEX> <FIELD> 
<FIELD> ::= <ANNOT_TYPE> ( <SYMBOL> ) 

REFERENCES 

1. Grishman, Ralph et al. The TIPSTER Architecture (this volume) Tipster 1998 
