Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 492–500,
Sydney, July 2006. c©2006 Association for Computational Linguistics
Entity Annotation based on Inverse Index Operations
Ganesh Ramakrishnan, Sreeram Balakrishnan, Sachindra Joshi
IBM India Research Labs
IIT Delhi, Hauz Khas,
New Delhi, India
{ganramkr, sreevb, jsachind}@in.ibm.com
Abstract
Entity annotation involves attaching a la-
bel such as ‘name’ or ‘organization’ to a
sequence of tokens in a document. All the
current rule-based and machine learning-
based approaches for this task operate at
the document level. We present a new
and generic approach to entity annotation
which uses the inverse index typically cre-
ated for rapid key-word based searching
of a document collection. We define a set
of operations on the inverse index that al-
lows us to create annotations defined by
cascading regular expressions. The entity
annotations for an entire document cor-
pus can be created purely of the index
with no need to access the original docu-
ments. Experiments on two publicly avail-
able data sets show very significant perfor-
mance improvements over the document-
based annotators.
1 Introduction
Entity Annotation associates a well-defined label
such as ‘person name’, ‘organization’, ‘place’,
etc., with a sequence of tokens in unstructured
text. The dominant paradigm for annotating a
document collection is to annotate each document
separately. The computational complexity of an-
notating the collection in this paradigm, depends
linearly on the number of documents and the cost
of annotating each document. More precisely, it
depends on the total number of tokens in the doc-
ument collection. It is not uncommon to have mil-
lions of documents in a collection. Using this par-
adigm, it can take hours or days to annotate such
big collections even with highly parallel server
farms. Another drawback of this paradigm is that
the entire document collection needs to be re-
processed whenever new annotations are required.
In this paper, we propose an alternative para-
digm for entity annotation. We build an index for
the tokens in the document collection first. Us-
ing a set of operators on the index, we can gener-
ate new index entries for sequences of tokens that
match any given regular expression. Since a large
class of annotators (e.g., GATE (Cunningham et
al., 2002)) can be built using cascading regular ex-
pressions, this approach allows us to support anno-
tation of the document collection purely from the
index.
We show both theoretically and experimentally
that this approach can lead to substantial reduc-
tions in computational complexity, since the order
of computation is dependent on the size of the in-
dexes and not the number of tokens in the doc-
ument collection. In most cases, the index sizes
used for computing the annotations will be a small
fraction of the total number of tokens.
In (Cho and Rajagopalan, 2002) the authors de-
velop a method for speeding up the evaluation of
a regular expression ‘R’ on a large text corpus by
use of an optimally constructed multi-gram index
to filter documents that will match ‘R’. Unfortu-
nately, their method requires access to the docu-
ment collection for the final match of ‘R’ to the
filtered document set, which can be very time con-
suming. The other bodies of related prior work
concern indexing annotated data (Cooper et al.,
2001; Li and Moon, 2001) and methods for doc-
ument level annotation (Agichtein and Gravano,
2000; McCallum et al., 2000). The work on index-
ing annotated data is not directly relevant, since
our method creates the index to the annotations di-
rectly as part of the algorithm for computing the
annotation. (Eikvil, 1999) has a good survey of
existing document level IE methods. The rele-
vance to our work is that only a certain class of
annotators can be implemented using our method:
namely anything that can be implemented using
cascading weighted regular expressions. Fortu-
492
nately, this is still powerful enough to enable a
large class of highly effective entity annotators.
The rest of the paper is organized as follows. In
Section 2, we present an overview of the proposed
approach for entity annotation. In Section 3, we
construct an algorithm for implementing a deter-
ministic finite automaton (DFA) using an inverse
index of a document collection. We also compare
the complexity of this approach against the direct
approach of running the DFA over the document
collection, and show that under typical conditions,
the index-based approach will be an order of mag-
nitude faster. In Section 4, we develop an alter-
native algorithm which is based on translating the
original regular expression directly into an ordered
AND/OR graph with an associated set of index
level operators. This has the advantage of oper-
ating directly on the much more compact regular
expressions instead of the equivalent DFA (which
can become very large as a result of the NFA to
DFA conversion and epsilon removal steps). We
provide details of our experiments on two publicly
available data sets in Section 5. Finally we present
our conclusions in Section 6.
2 Overview
Figure 1 shows the process for entity annotation
presented in the paper. A given document collec-
tion D is tokenized and segmented into sentences.
The tokens are stored in an inverse index I. The
inverse index I has an ordered list U of the unique
tokens u1, u2, ..uW that occur in the collection,
where W is the number of tokens in I. Addition-
ally, for each unique token ui, I has a postings
list L(ui) =< l1,l2,...lcnt(ui) > of locations in
D at which ui occurs. cnt(ui) is the length of
L(ui). Each entry lk, in the postings list L(ui),
has three fields: (1) a sentence identifier, lk.sid,
(2) the begin position of the particular occurrence
of ui, lk.first and (3) the end position of the same
occurrence of ui, lk.last.
We require the input grammar to be the same
as that used for named entity annotations in GATE
(Cunningham et al., 2002). The GATE architec-
ture for text engineering uses the Java Annota-
tions Pattern Engine (JAPE) (Cunningham, 1999)
for its information extraction task. JAPE is a pat-
tern matching language. We support two classes
of properties for tokens that are required by gram-
mars such as JAPE: (1) orthographic properties
such as an uppercase character followed by lower
case characters, and (2) gazetteer (dictionary) con-
tainment properties of tokens and token sequences
such as ‘location’ and ‘person name’. The set of
tokens along with entity types specified by either
of these two properties are referred to as Basic
Entities. The instances of basic entities specified
by orthographic properties must be single tokens.
However, instances of basic entities specified us-
ing gazetteer containment properties can be token
sequences.
The module (1) of our system shown in Fig-
ure 1, identifies postings lists for each basic en-
tity type. These postings lists are entered as index
entries in I for the corresponding types. For ex-
ample, if the input rules require tokens/token se-
quences that satisfy Capsword or Location Dic-
tionary properties, a postings list is created for
each of these basic types. Constructing the post-
ings list for a basic entity type with some ortho-
graphic property is a fairly straightforward task;
the postings lists of tokens satisfying the ortho-
graphic properties are merged (while retaining the
sorted order of each postings list). The mecha-
nism for generating the postings list of basic en-
tities with gazetteer properties will be developed
in the following sections. A rule for NE an-
notation may require a token to satisfy multiple
properties such as Location Dictionary as well as
Capsword. The posting list for tokens that satisfy
multiple properties are determined by perform-
ing an operation parallelint(L,Lprime) over the post-
ing lists of the corresponding basic entities. The
parallelint(L,Lprime) operation returns a posting list
such that each entry in the returned list occurs in
both L as well as Lprime. The module (2) of our sys-
tem shown in Figure 1 identifies instances of each
annotation type, by performing index-based oper-
ations on the postings lists of basic entity types and
other tokens.
3 Annotation using Cascading Regular
Expressions
Regular expressions over basic entities have been
extensively used for NE annotations. The Com-
mon Pattern Specification Language (CSPL)1
specifies a standard for describing Annotators that
can be implemented by a series of cascading regu-
lar expression matches.
Consider a regular expression R over an al-
phabet Σ of basic entities, and a token sequence
1http://www.ai.sri.com/∼appelt/TextPro
493
Figure 1: Overview of the entity annotation process described in this paper
T = {t1,...,tW}. The annotation problem aims
at determining all matches of regular expression
R in the token sequence T. Additionally, NE an-
notations do not span multiple sentences. We will
therefore assume that the length of any annotated
token sequence is bounded by ∆, where ∆ can
be the maximum sentence length in the document
collection of interest. In practice, ∆ can be even
smaller.
3.1 Computing Annotations using a DFA
Given a regular expression R, we can convert it
into a deterministic finite automate (DFA) DR. A
DFA is a finite state machine, where for each pair
of state and input symbol, there is one and only
one transition to a next state. DR starts process-
ing of an input sequence from a start state sR, and
for each input symbol, it makes a transition to a
state given by a transition function ΦR. Whenever
DR lands in an accept state, the symbol sequence
till that point is accepted by DR. For simplicity of
the document and index algorithms, we will ignore
document and sentence boundaries in the follow-
ing analysis.
Let @ti,i+∆,1 ≤ i ≤ W −∆ be a subsequence
of T of length ∆. On a given input @ti,i+∆, DR
will determine all token sequences originating at ti
that are accepted by the regular expression gram-
mar specified through DR. Figure 2 outlines the
algorithm findAnnotations that locates all token
sequences in T that are accepted by DR.
Let DR have {S1,...,SN} states. We assume
that the states have been topologically ordered so
that S1 is the start state. Let α be the time taken
to consume a single token and advance the DFA
to the next state (this is typically implemented as
a table or hash look-up). The time taken by the al-
findAnnotations(T,DR)
Let T = {t1,...,tW}
for i = 1 to W −∆ do
let @ti,i+∆ be a subsequence of length ∆ starting
from ti in T
use DR to annotate @ti,i+∆
end for
Figure 2: The algorithm for finding all the occur-
rences of R in a token sequence T.
gorithm findAnnotations can be obtained by sum-
ming up the number of times each state is vis-
ited as the input tokens are consumed. Clearly,
the state S1 is visited W times, W being the total
number of symbols in the token sequence T. Let
cnt(Si) give the total number of times the state Si
has been visited. The complexity of this method
is:
CD = α
i=Nsummationdisplay
i=1
cnt(Si) = α
bracketleftBigg
W +
i=Nsummationdisplay
i=2
cnt(Si)
bracketrightBigg
(1)
3.2 Computing Regular Expression Matches
using Index
In this section, we present a new approach for find-
ing all matches of a regular expression R in a to-
ken sequence T, based on the inverse index I of T.
The structure of the inverse index was presented in
Section 2. We define two operations on postings
lists which find use in our annotation algorithm.
1. merge(L,Lprime): Returns a postings list such
that each entry in the returned list occurs either in
L or Lprime or both. This operation takes O(|L|+|Lprime|)
time.
2. consint(L,Lprime): Returns a postings list such
that each entry in the returned list points to a to-
ken sequence which consists of two consecutive
494
subsequences @sa and @sb within the same sen-
tence, such that, L has an entry for @sa and Lprime
has an entry for @sb. There are several meth-
ods for computing this depending on the relative
size of L and Lprime. If they are roughly equal in
size, a simple linear pass through L and Lprime, anal-
ogous to a merge, can be performed. If there is
a significant difference in sizes, a more efficient
modified binary search algorithm can be imple-
mented. The details are shown in Figure 3. The
consint(L,Lprime)
Let M elements of L be l1 ···lM
Let N elements of L’ be lprime1 ···lN
if M < N then
set j = 1
for i = 1 to M do
set k = 1, keep doubling k until
lprimej.first ≤ li.last < lprimej+k.first
binary search the Lprime in the interval j···k
to determine the value of p such that
lprimep.first ≤ li.last < lprimep+1.first
if lprimep.first = li.last a match exists, copy to output
set j = p+ 1
end for
else
Same as above except l and lprime are reversed
end if
Figure 3: The modified binary search algorithm
for consint
complexity of this algorithm is determined by the
size qi of the interval required to satisfy lprimej.first ≤
li.last < lprimej+qi.first (assuming |L| < |Lprime|). It
will take an average of log2(qi) operations to de-
termine the size of interval and log2(qi) opera-
tions to perform the binary search, giving a to-
tal of 2log2(qi). Let q1···qM be the sequence
of intervals. Since the intervals will be at most
two times larger than the actual interval between
the nearest matches in Lprime to L, we can see that
|Lprime| ≤ summationtextMi=1 qi ≤ 2 ∗ |Lprime|. Hence the worst case
will be reached when qi = 2|Lprime|/|L| with a time
complexity given by 2|L|(log2(|Lprime|/|L|) + 1), as-
suming |L| < |Lprime|.
To support annotation of a token sequence that
matches a regular expression only in the con-
text of some regular expression match on its left
and/or right, we implement simple extensions to
the consint(L1,L2) operator. Details of the ex-
tensions are left out from this paper owing to space
constraints.
3.3 Implementing a DFA using the Inverse
Index
In this section, we present a method that takes a
DFA DR and an inverse index I of a token se-
quence T, to compute a postings list of subse-
quences of length at most ∆, that match the regu-
lar expression R.
Let the set S = {S1,...,SN} denote the set
of states in DR, and let the states be topologi-
cally ordered with S1 as the start state. We as-
sociate an object lists,k with each state s ∈ S and
∀1 ≤ k ≤ ∆. The object lists,k is a posting list
of all token sequences of length exactly k that end
in state s. The lists,k is initialized to be empty
for all states and lengths. We iteratively compute
lists,k for all the states using the algorithm given
in Figure 4. The function dest(Si) returns a set
of states, such that for each s ∈ dest(Si), there
is an arc from state Si to state s. The function
label(Si,Sj) returns the token associated with the
edge (Si,Sj).
for k = 1 to ∆ do
for i = 1 to N do
for s ∈ dest(Si) do
if i == 1 then
t = L(label(Si,s))
else
t = consint(listSi,k−1,L(label(Si,s)))
end if
lists,k = merge(lists,k,t)
end for
end for
end for
Figure 4: The algorithm for building the index to
all token sequences in T that match R.
At the end of the algorithm, all token sequences
corresponding to postings lists lists,i,s ∈ S,1 ≤
i ≤ ∆ are sequences that are matched by the reg-
ular expression R.
3.4 Complexity Analysis for the Index-based
Approach
The complexity analysis of the algorithm given
in Figure 4 is based on the observation that,summationtext
k=∆
k=1 |listSi,k| = cnt(Si). This holds, since
listSi,k contains an entry for all sequences that
visit the state Si and are of length exactly k. Sum-
ming the length of these lists for a particular state
Si across all the values of k will yield the total
number of sequences of length at most ∆ that visit
the state Si.
For the algorithm in Figure 3, the time taken by
495
one consint operation is given by 2β(|listSi,k| ∗
(log(ρijk) + 1)) where β is a constant that varies
with the lower level implementation. ρijk =
|L(label(Si,Sj))|
|listSi,k| is the ratio of the postings list size
of the label associated with the arc from Si to
Sj to the list size of Si at step k. Note that
ρijk ≥ 1. Let prev(Si) be the list of pre-
decessor states to Si. The time taken by all
the merge operations for a state Si at step k
is given by γ(log(|prev(Si)|)|listSi,k|) Assum-
ing all the merges are performed simultaneously,
γ(log(|prev(Si)|) is the time taken to create each
entry in the final merged list, where γ is a con-
stant that varies with the lower level implementa-
tion. Note this scales as the log of the number of
lists that are being merged.
The total time taken by the algorithm given in
Figure 4 can be computed using the time spent on
merge and consint operations for all states and
all lengths. Setting ¯ρis = maxk ρisk, the total time
CI can be given as:
CI =
i=Nsummationdisplay
i=2

γlog(|prev(Si)|) + 2β summationdisplay
s∈dest(Si)
log( ¯ρis)

cnt(Si)
(2)
Note that in deriving Equation 2, we have ig-
nored the cost of merging list(Sa,k) for k =
1···∆ for the accept states.
3.5 Comparison of Complexities
To simplify further analysis, we can replace
cnt(Si) with fcnt(Si) where fcnt(Si) =
cnt(Si)/W. If we assume that the token distribu-
tion statistics of the document collection remain
constant as the number of documents increases,
we can also assume that fcnt(Si) is invariant to
W. Since ρijk is given by a ratio of list sizes, we
can also consider it to be invariant to W. We now
assume α ≈ β ≈ γ since these are implementa-
tion specific times for similar low level compute
operations. With this assumptions from Equations
1 and 2, the ratio CD/CI can be approximated by:
1 +summationtextNi=2 fcnt(Si)
summationtextN
i=2
bracketleftBigsummationtext
s∈dest(Si) 2log( ¯ρis) + log(|prev(Si)|)
bracketrightBig
fcnt(Si)
(3)
The overall ratio of CD to CI is invariant to W
and depends on two key factors fcnt(Si) andsummationtext
s∈dest(Si) log( ¯ρis). If fcnt(Si) lessmuch 1, the ratio
will be large and the index-based approach will be
much faster. However, if either fcnt(Si) starts ap-
proaching 1 or summationtexts∈dest(Si) log( ¯ρis) starts getting
very large (caused by a large fan out from Si), the
direct match using the DFA may be more efficient.
Intuitively, this makes sense since the main ben-
efit of the index is to eliminate unnecessary hash
lookups for tokens do not match the arcs of the
DFA. As fcnt(Si) approaches 1, this assumption
breaks down and hence the inherent efficiency of
the direct DFA approach, where only a single hash
lookup is required per state regardless of the num-
ber of destination states, becomes the dominant
factor.
3.6 Comparison of Complexities for Simple
Dictionary DFA
To illustrate the potential gains from the index-
based annotation, consider a simple DFA DR with
two states S1 and S2. Let the set of unique to-
kens A be {a,b,c···z}. Let E be the dictionary
{a,e,i,o,u}. Let DR have five arcs from S1 to S2
one for each element in E. The DFA DR is a sim-
ple acceptor for the dictionary E, and if run over
a token sequence T drawn from A, it will match
any single token that is in E. For this simple case
fcnt(S2) is just the fraction of tokens that occur
in E and hence by definition fcnt(S2) ≤ 1. Sub-
stituting into 3 we get
CD
CI =
1 + fcnt(S2)
2log(5)fcnt(S2) (4)
As long as fcnt(S2) < 0.27, this ratio will always
be greater than 1.
4 Inverse Index-based Annotation using
Regular Expressions
A DFA corresponding to a given regular expres-
sion can be used for annotation, using the inverse
index approach as described in Section 3.3. How-
ever, the NFA to DFA conversion step may result
in a DFA with a very large number of states. We
develop an alternative algorithm that translates the
original regular expression directly into an ordered
AND/OR graph. Associated with each node in the
graph is a regular expression and a postings list
that points to all the matches for the node’s regu-
lar expression in the document collection. There
are two node types: AND nodes where the output
list is computed from the consint of the postings
lists of two children nodes and OR nodes where
the output list is formed by merging the posting
496
lists of all the children nodes. Additionally, each
node has two binary properties: isOpt and self-
Loop. The first property is set if the regular ex-
pression being matched is of the form ‘R?’, where
‘?’ denotes that the regular expression R is op-
tional. The second property is set if the regular
expression is of the form ‘R+’, where ‘+’ is the
Kleen operator denoting one or more occurrences.
For the case of ‘R*’, both properties are set.
The AND/OR graph is recursively built by scan-
ning the regular expression from left to right and
identifying every sub-regular expression for which
a sub-graph can be built. We use capital letters
R,X to denote regular expressions and small let-
ters a, b, c, etc., to denote terminal symbols in
the symbol set Σ. Figure 5 details the algorithm
used to build the AND/OR graph. Effectively, the
AND/OR graph decomposes the computation of
the postings list for R into a ordered set of merge
and consint operations, such that the output L(v)
for node v become the input to its parents. The
graph specifies the ordering, and by evaluating all
the nodes in dependency order, the root node will
end up with a postings list that corresponds to the
desired regular expression.
if R is empty then
Return NULL
else if R is a symbol a ∈ Σ then
Return createNode(name = a)
else
Decompose R such that R → Rprime <regexp>
if <regexp> is empty then
if Rprime == (X) or X+ or X∗ or X? then
node = createGraph(X)
if Rprime == X+ or X∗ then
node.selfLoop = 1
end if
if Rprime == X? or X∗ then
node.isOpt = 1
end if
else if Rprime == (X1|X2|..|Xk) then
node = createNode(name = R)
node.nodetype = OR
for i = 1 to k do
node.children[i] = createGraph(Xi)
end for
end if
else
node = createNode(name = R)
node.nodetype = AND
node.children[1] = createGraph(Rprime)
node.children[2] = createGraph(<regexp>)
end if
Return node
end if
Figure 5: createGraph(R)
Figure 6: An example regular expression and cor-
responding AND/OR graph
4.1 Handling ‘?’ and Kleen Operators
The isOpt and selfLoop properties of a node are
set if the corresponding regular expression is of
the form R?, R+ or R∗. To handle the R? case
we associate a new property isOpt with the output
list L(v) from node v, such that L(v).isOpt = 1
if the v.isOpt = 1. We also define two operations
consintepsilon1 in Figure 7 and mergeepsilon1 which account
for the isOpt property of their argument lists. For
consintepsilon1, the generated list has its isOpt set to
1 if and only if both the argument lists have their
isOpt property set to 1. The mergeepsilon1 operation re-
mains the same as merge, except that the resultant
list has isOpt set to 1 if any of its argument lists
has isOpt set to 1. The worst case time taken by
consintepsilon1 is bounded by 1 consint and 2 merge
operations.
To handle the R+ case, we define a new oper-
ator consintepsilon1(L,+) which returns a postings list
Lprime, such that each entry in the returned list points
to a token sequence consisting of all k ∈ [1,∆]
consecutive subsequences @s1,@s2 ...@sk, each
@si,1 ≤ i ≤ k being an entry in L. A sim-
ple linear pass through L is sufficient to obtain
consint(L,+). The time complexity of this op-
eration is linear in the size of Lprime. The isOpt prop-
erty of the result list Lprime is set to the same value as
its argument list L.
Figure 6 shows an example regular expres-
sion and its corresponding AND/OR graph; AND
nodes are shown as circles whereas OR nodes are
shown as square boxes. Nodes having isOpt and
selfLoop properties are labeled with +, ∗ or ?.
Any AND/OR graph thus constructed is acyclic.
The edges in the graph represent dependency be-
tween computing nodes. The main regular expres-
sion is at the root node of the graph. The leaf
nodes correspond to symbols in Σ. Figure 8 out-
lines the algorithm for computing the postings list
of a regular expression by operating bottom-up on
the AND/OR graph.
497
consintepsilon1(L,Lprime)
if ((L.isOpt == 0) and (L’.isOpt == 0)) then
Return consint(L,Lprime)
end if
if ((L.isOpt == 0) and (L’.isOpt == 1)) then
Return merge(L,consint(L,Lprime))
end if
if ((L.isOpt == 1) and (L’.isOpt == 0)) then
Return merge(consint(L,Lprime),Lprime)
end if
if ((L.isOpt == 1) and (L’.isOpt == 1)) then
t = merge(consint(L,Lprime),Lprime)
Return merge(t,L)
end if
Figure 7: consintepsilon1
for Each node v in the reverse topological sorting of GR
do
if v.nodetype == AND then
Let v1 and v2 be the children of v
L(v) = consintepsilon1(L(v1),L(v2))
else if v.type == OR then
L(v) = mergeepsilon1(L(v.child1),···,L(v.childn))
end if
if v.selfLoop == 1 then
L(v) = consintepsilon1(L(v),+)
end if
if v.isOpt == 1 then
L(v).isOpt = 1
end if
end for
Figure 8: The algorithm for computing postings
list of a regular expression R using the inverse in-
dex I and the corresponding AND/OR graph GR
5 Experiments and Results
In this section, we present empirical compari-
son of performance of the index-based annotation
technique (Section 4) against annotation based on
the ‘document paradigm’ using GATE. The exper-
iments were performed on two data sets, viz., (i)
the enron email data set2 and (ii) a combination of
Reuters-21578 data set3 and the 20 Newsgroups
data set4. After cleaning, the former data set was
2.3 GB while the latter was 93 MB in size. Our
code is entirely in Java. The experiments were
performed on a dual 3.2GHz Xeon server with 4
GB RAM. The code for creation of the index was
custom-built in Java. Prior to indexing, the sen-
tence segmentation and tokenization of each data
set was performed using in-house Java versions of
2http://www.cs.cmu.edu/∼enron/
3http://www.daviddlewis.com/resources/
testcollections/reuters21578/
4http://people.csail.mit.edu/jrennie/
20Newsgroups/
standard tools5.
5.1 Rule Specification using JAPE
JAPE is a version of CPSL6 (Common Pattern
Specification Language). JAPE provides finite
state transduction over annotations based on reg-
ular expressions. The JAPE grammar requires in-
formation from two main resources: (i) a tokenizer
and (ii) a gazetteer.
(1) Tokenizer: The tokenizer splits the text into
very simple tokens such as numbers, punctuation
and words of different types. For example, one
might distinguish between words in uppercase and
lowercase, and between certain types of punctua-
tion. Although the tokenizer is ca pable of much
deeper analysis than this, the aim is to limit its
work to maximise efficiency, and enable greater
flexibility by placing the burden on the grammar
rules, which are more adaptable. A rule has a
left hand side (LHS) and a right hand side (RHS).
The LHS is a regular expression which has to be
matched on the input; the RHS describes the an-
notations to be added to the Annotation Set. The
LHS is separated from the RHS by ’>’. The fol-
lowing four operators can be used on the LHS: ’|’,
’?’, ’∗’ and ’+’. The RHS uses ’;’ as a separa-
tor between statements that set the values of the
different attributes. The following tokenizer rule
identifies each character sequence that begins with
a letter in upper case and is followed by 0 or more
letters in lower case:
"UPPERCASELETTER" "LOWERCASELETTER"*
>>> Token; orth=upperInitial; kind=word;
Each such character sequence will be annotated as
type “Token”. The attribute “orth” (orthography)
has the value “upperInitial”; the attribute “kind”
has the value “word”.
(2) Gazetteer: The gazetteer lists used are plain
text files, with one entry per line. Each list rep-
resents a set of names, such as names of cities,
organizations, days of the week, etc. An index file
is used to access these lists; for each list, a ma-
jor type is specified and, optionally, a minor type.
These lists are compiled into finite state machines.
Any text tokens that are matched by these ma-
chines will be annotated with features specifying
the major and minor types. JAPE grammar rules
5http://l2r.cs.uiuc.edu/∼cogcomp/
tools.php
6A good description of the original version of this lan-
guage is in Doug Appelt’s TextPro manual: http://www.
ai.sri.com/∼appelt/TextPro.
498
then specify the types to be identified in particular
circumstances.
The JAPE Rule: Each JAPE rule has two parts,
separated by “–>”. The LHS consists of an an-
notation pattern to be matched; the RHS describes
the annotation to be assigned. A basic rule is given
as:
Rule::=
<rule> <ident> ( <priority> <integer> )?
LeftHandSide ">>>" RightHandSide
(1) Left hand side: On the LHS, the pattern is
described in terms of the annotations already as-
signed by the tokenizer and gazetteer. The annota-
tion pattern may contain regular expression opera-
tors (e.g. ∗, ?, +). There are 3 main ways in which
the pattern can be specified:
1. value: specify a string of text, e.g.
{Token.string == “of”}
2. attribute: specify the attributes (and values)
of a token (or any other annotation), e.g.
{Token.kind == number}
3. annotation: specify an annotation type from
the gazetteer, e.g. {Lookup.minorType ==
month}
(2) Right hand side: The RHS consists of de-
tails of the annotations and optional features to be
created. Annotations matched on the LHS of a rule
may be referred to on the RHS by means of labels
that are attached to pattern elements. Finally, at-
tributes and their corresponding values are added
to the annotation. An example of a complete rule
is:
Rule: NumbersAndUnit
(({Token.kind=="number"})+:numbers
{Token.kind=="unit"})
>>>
:numbers.Name={rule="NumbersAndUnit"}
This says ‘match sequences of numbers followed
by a unit; create a Name annotation across the span
of the numbers, and attribute rule with value Num-
bersAndUnit’.
Use of context: Context can be dealt with in the
grammar rules in the following way. The pattern to
be annotated is always enclosed by a set of round
brackets. If preceding context is to be included in
the rule, this is placed before this set of brackets.
This context is described in exactly the same way
as the pattern to be matched. If context follow-
ing the pattern needs to be included, it is placed
Figure 9: An example JAPE rule used in the ex-
periments
after the label given to the annotation. Context is
used where a pattern should only be recognised if
it occurs in a certain situation, but the context itself
does not form part of the pattern to be annotated.
For example, the following rule for ‘email-id’s
(assuming an appropriate regular expression for
“EMAIL-ADD”) would mean that an email ad-
dress would only be recognized if it occurred in-
side angled brackets (which would not themselves
form part of the entity):
Rule: Emailaddress1
({Token.string=="<"})
(
{Token.kind==EMAIL-ADD}
)
:email
({Token.string==">"})
>>>
:email.Address={kind="email",
rule="Emailaddress1"}
5.2 Results
In our first experiment, we performed annotation
of the two corpora for 4 annotation types using 2
JAPE rules for each type. The 4 annotation types
were ‘Person name’, ‘Organization’, ‘Location’
and ‘Date’. A sample JAPE rule for identifying
person names is shown in Figure 9. This rule iden-
tifies a sequence of words as a person name when
each word in the sequence starts with an alpha-
bet in upper-case and when the sequence is imme-
diately preceded by a word from a dictionary of
‘INITIAL’s. Example words in the ‘INITIAL’ dic-
tionary are: ‘Mr.’, ‘Dr.’, ’Lt.’, etc.
499
Table 1 compares the time taken by the index-
based annotator against that taken by GATE for the
8 JAPE rules. The index-based annotator performs
8-13 times faster than GATE. Table 2 splits the
time mentioned for the index-based annotator in
Table 1 into the time taken for the task of comput-
ing postings lists for basic entities and derived en-
tities (c.f. Section 2) for each of the data sets. We
can also observe that a greater speedup is achieved
for the larger corpus.
Dataset GATE Index-based
Enron 4974343 374926
Reuters 752287 92238
Table 1: Time (in milliseconds) for computing an-
notations using the two techniques
Dataset Orthographic Gazetteer Derived
entitytypes entitytypes entitytypes
Enron 38285 105870 230771
Reuters 28493 21531 42214
Table 2: Time (in milliseconds) for computing
postings lists of entity types
An important advantage of performing annota-
tions over the inverse index is that index entries
for basic entity types can be preserved and reused
for annotation types as additional rules for anno-
tation are specified by users. For instance, the in-
dex entry for ‘Capsword’ might find reuse in sev-
eral annotation rules. As against this, a document-
based annotator has to process each document
from scratch for every newly introduced annota-
tion rule. To verify this, we introduced 1 addi-
tional rule for each of the 4 named entity types.
In Table 3, we compare the time required by
the index-based annotator against that required by
GATE for annotating the two corpora using the 4
additional rules. We achieve a greater speedup fac-
tor of 23-37 for incremental annotation.
Dataset GATE Index-based
Enron 1479954 62227
Reuters 661157 17929
Table 3: Time (in milliseconds) for computing an-
notations using the two techniques for the addi-
tional 4 rules
6 Conclusions
In this paper we demonstrated that a suitably con-
structed inverse index contains all the necessary
information to implement entity annotators that
use cascading regular expressions. The approach
has the key advantage of not requiring access to
the original unstructured data to compute the an-
notations. The method uses a basic set of opera-
tors on the inverse index to construct indexes to all
matches for a regular expression in the tokenized
data set. We showed theoretically, that for a DFA
implementation, the index approach can be much
faster if the index sizes corresponding to the labels
on the DFA are a small fraction of the total num-
ber of tokens in the data set. We also provided
a more efficient index-based implementation that
is directly computed from the regular expressions
without the need of a DFA conversion and experi-
mentally demonstrated the gains.

References
Eugene Agichtein and Luis Gravano. 2000. Snow-
ball: Extracting relations from large plain-text col-
lections. In Proceedings of the Fifth ACM Interna-
tional Conference on Digital Libraries.
Junghoo Cho and Sridhar Rajagopalan. 2002. A fast
regular expression indexing engine. In Proceedings
of the 18th International Conference on Data Engi-
neering.
Brian Cooper, Neal Sample, Michael J. Franklin,
G´ısli R. Hjaltason, and Moshe Shadmon. 2001. A
fast index for semistructured data. In The VLDB
Conference, pages 341–350.
H. Cunningham, D. Maynard, K. Bontcheva, and
V. Tablan. 2002. GATE: A framework and graph-
ical development environment for robust NLP tools
and applications.
H. Cunningham. 1999. Jape – a java annotation pat-
terns engine.
Line Eikvil. 1999. Information extraction from world
wide web - a survey. Technical Report 945, Nor-
weigan Computing Center.
Quanzhong Li and Bongki Moon. 2001. Indexing and
querying XML data for regular path expressions. In
The VLDB Journal, pages 361–370.
Andrew McCallum, Dayne Freitag, and Fernando
Pereira. 2000. Maximum entropy Markov mod-
els for information extraction and segmentation. In
Proc. 17th International Conf. on Machine Learn-
ing, pages 591–598. Morgan Kaufmann, San Fran-
cisco, CA.
