TAGARAB: A Fast, Accurate Arabic Name Recognizer 
Using High-Precision Morphological Analysis 
John Maloney and Michael Niv 
SRA International Corp. 
4300 Fair Lakes Court 
Fairfax, VA 22033 
{john_maloney, michael_niv}@sra.com
Abstract 
We describe a fast, high-performance name rec- 
ognizer for Arabic texts. It combines a pattern- 
matching engine and supporting data with a mor- 
phological analysis component. The role of the mor- 
phological analysis in accurate name recognition is 
discussed. We also provide evaluations of both morphological
analysis and name recognition.
1 Introduction 
1.1 Roadmap
Arabic named entity recognition in texts in Arabic
script is, to our knowledge, a little-researched topic.¹
In this paper we describe a system, TAGARAB,
that uses a generic pattern-matching engine, SRA's
NetOwl TurboTag™, combined with an integrated
morphological analysis process, and that recognizes
names at a high level of accuracy.² We first discuss
the factors involved in recognizing names in Arabic. 
We then present a system description, focussing on 
the morphological analysis and the name recogni- 
tion components. We also report the results of our 
evaluations of each component's performance. 
Finally, we discuss the contribution of the mor- 
phological analysis to the name recognition. 
1.2 Background to Arabic Name 
Recognition 
Name recognition has emerged as an NLP technol- 
ogy that is effective and can provide high value to 
several different kinds of applications, such as clus- 
tering and summarization (Aone and Maloney, 1997; 
Aone et al., 1997). Development of such a capability 
for Arabic involved meeting some new challenges. 
Like other Semitic writing systems, Arabic does not exhibit
differences in orthographic case. Unlike in English-language
mixed-case texts, therefore, there is no obvious clue such as
initial capitalized letters to indicate the presence of a name
constituent. In our experience with Thai and other non-case
writing systems, this seems to effectively impose a requirement
that there be some understanding of the morphological nature of
each token -- especially part-of-speech information. Having some
morphological information allows one to make distinctions between
likely and unlikely name constituents, which is particularly
important when deciding where a name ends and the non-name
context begins.

¹ "Named entity recognition" is meant in the sense familiar
from the Message Understanding Conferences (MUC) and the
Multilingual Entity Task (MET). It refers to the identification
and categorizing by type in unformatted text of names of
persons, entities, and locations, as well as of numerics such as
percentages and times/dates. In the following, we use "name"
loosely in the same sense.
² This work has been funded by the Department of Defense,
Contract No. MDA904-97-C-3065.
We partially motivate this for Arabic as follows:³
• Closed-class items are rarely, if ever, part of an
Arabic proper name: [Hasan fī], with a preposition
in second position, seems an unlikely Arabic name.⁴
• Inflected forms of verbs are rarely part of proper
names: [yata9allamu], "he learns," is not a
permissible name constituent.⁵
• Many lexical items, of course, are not used, or are
highly unlikely to be used, in proper names:
verbs of speaking or cognition, for example, do
not appear to occur in names and they do not
frequently overlap in appearance with proper
name constituents: [muHammad qultu] is an
unlikely name.
• Items with subject or object suffixes are rarely,
if ever, part of names: [yukaatibuhu], "he will
correspond with him."⁶

³ The Arabic source used in this work was the newspaper
Al-Hayat.
⁴ We will usually vocalize Arabic for the sake of ease of
comprehension (marked with square brackets), but will present it
in consonantal transliteration when appropriate (marked with
italics). Unobvious symbols: _t = tha; H = voiceless pharyngeal
fricative; 9 = voiced pharyngeal fricative; $ = alif without
hamza.
⁵ There are some ism names that contain, at least historically,
imperfect verb forms, such as [yazīd], "he increases."
Cf. Schimmel 1988 (p. 1).
⁶ Other Semitic naming traditions, e.g. Akkadian, Amorite,
Hebrew, do permit "sentence names" containing a finite
verb or a predication (cf. Schimmel 1988, p. 1).
In addition, there are many positive cues, such as 
titles, frequent given names, and so forth, that allow 
a system to identify names, but some morphological 
characterization of non-name portions is of critical 
importance. We will discuss this in more detail be- 
low in Section 4. 
2 System Description 
Figure 1 contains the architecture of TAGARAB. 
Our system has two major modules: a Morpho- 
logical Tokenizer and a Name Finder. The Mor- 
phological Tokenizer has the ability, in addition to 
performing lexical scanning that establishes word- 
level units, to add morphological features to tokens. 
Text encoded in ISO-8859-6 is first passed through 
this tokenizer and then the tokenized stream is pro- 
cessed by the Name Finder module which identifies 
names and other extraction targets and annotates 
the text with appropriate SGML tags for each ex- 
tracted item. 
2.1 Morphological Tokenizer
2.1.1 Description 
The Morphological Tokenizer's basic task is to iden- 
tify the sequences of words, punctuation symbols, 
numbers, existing SGML tags, etc., that comprise 
the input text. For each such "token," a descrip- 
tion of what it has found is returned as a vector 
of up to 32 application-definable bits (e.g., PUNC- 
TUATION, WORD, NUMBER). The Tokenizer is a 
very fast program, generated using the Flex scanner- 
generator from a tokenizer specification. 
We decided to augment the tokenizer's usual role. 
While it still finds numbers and punctuation tokens, 
it treats an Arabic word (a contiguous sequence of 
Arabic letters) as a collection of one or more morpheme
tokens, each with its own bit-vector of properties.
The properties include those listed above, as
well as morphology-specific properties whose nature
and linguistic motivation are discussed in Section 2.1.3.
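As a rough illustration of the bit-vector representation, the following Python sketch packs a few token properties into an integer flag set. The flag names follow properties mentioned in the text; the `Feature` class, the bit assignments, and the `describe` helper are assumptions made for this example and do not reflect TurboTag's actual interface.

```python
from enum import IntFlag
from typing import List


class Feature(IntFlag):
    """Hypothetical subset of token properties; bit positions are arbitrary."""
    WORD = 1 << 0
    PUNCTUATION = 1 << 1
    NUMBER = 1 << 2
    ARABIC = 1 << 3
    PERFECT = 1 << 4
    SUFF = 1 << 5


def describe(bits: Feature) -> List[str]:
    """List the names of the properties set in a token's bit-vector."""
    return [flag.name for flag in Feature if flag in bits]


# A morpheme token carrying three properties at once.
token_bits = Feature.ARABIC | Feature.PERFECT | Feature.SUFF
assert int(token_bits) < 2 ** 32  # fits in a 32-bit vector
print(describe(token_bits))
```

Keeping all properties in one machine word means a pattern can test for a feature with a single bitwise operation, which is consistent with the emphasis on speed in this section.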
Making the morphological analysis part of tok- 
enization has the advantage of maintaining the high 
speed of SRA's TurboTag. An external morphol- 
ogy module -- with a high computation overhead 
-- would degrade performance. 
Table 1 lists the features identified by the tokenizer.

Basic Feature            Specification
Token type (unique)
  ARABIC                 Arabic character string
  UPPER                  Upper-case English
  LOWER                  Lower-case English
  CAP                    Initial-cap English
  MIXED                  Mixed-case English
  PUNC                   Punctuation marks
  INT                    Arabic integer
  REAL                   Arabic number with decimal pt.
  WRITTEN                Arabic number with commas
  UNKNOWN                Unknown token
White space info (optional)
  NOWS                   No white space before token
  NL                     New line char before token
  DSPACE                 Double space before token
  TAB                    Tab before token
  BL                     Blank line before token
  SPACE                  Space before token

Morphological Feature    Specification
Part of Speech (if token type = ARABIC, unique)
  CLOSED                 Closed-class items; not prep.
  CONJ                   wa- and fa-
  PREP                   Prepositions
  NOUN                   Nouns, including Verbal Nouns
  IMPERFECT              Imperfects, including all moods
  ADJ                    Adjectives
  ADV                    Adverbs
  PERFECT                Perfects
  PART                   Participles (not used as noun or adjective)
Feature (optional)
  DEF_ART                Definite Article
  SUFF                   Verbal and Nominal Suffixes

Table 1: Tokenizer Features for Arabic

Each token in the text receives some subset of
the lexical types in Table 1. For example, a string
such as šrbh, phonetically [šaribahu], "he drank it,"
receives the tokenizer types ARABIC, PERFECT,
and SUFF. The first type means that the token
comprises Arabic letters, the second that it is a
Perfect verb form, and the third that there is a suffix
attached. Note that in this case, the string is not
broken up into pieces, such as stem and suffix. It
remains a single token with information being added
as to what the component pieces are. The only cases
where the tokenizer splits off pieces of a string are
where there is an attached conjunction (wa or fa)
or an attached preposition (la, ba, ka), or both. In
these cases, in place of an original string such as
orthographic wq$l, phonetic [waqāla], "and he said,"
there will be two separate tokens: [wa] with the type
information ARABIC and CONJ, and [qāla] with
the type information ARABIC and PERFECT.
Some of these tokenizer types are exclusive, such 
as PERFECT and IMPERFECT. A token cannot 
be both simultaneously. Others, however, such as 
NOUN and DEF_ART, can both be applied to a 
token. 
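The clitic-splitting behaviour described in this section can be sketched in Python as follows. This is a minimal illustration, not the Flex specification itself: the `STEMS` entries, the ASCII transliterations, and the guard that only splits when the remainder is a listed stem are all assumptions made for the example.

```python
from typing import Dict, FrozenSet, List, Tuple

Token = Tuple[str, FrozenSet[str]]

CONJUNCTIONS = ("w", "f")        # wa-, fa-
PREPOSITIONS = ("l", "b", "k")   # la-, ba-, ka-

# Toy stem list (hypothetical entries in consonantal transliteration).
STEMS: Dict[str, FrozenSet[str]] = {
    "q$l": frozenset({"ARABIC", "PERFECT"}),
    "byt": frozenset({"ARABIC", "NOUN"}),
}


def _known(rest: str) -> bool:
    """True if `rest` is a listed stem, optionally behind one preposition."""
    return rest in STEMS or (rest[:1] in PREPOSITIONS and rest[1:] in STEMS)


def split_clitics(word: str) -> List[Token]:
    """Split off an attached conjunction and/or preposition, if any."""
    tokens: List[Token] = []
    rest = word
    if rest[:1] in CONJUNCTIONS and _known(rest[1:]):
        tokens.append((rest[0], frozenset({"ARABIC", "CONJ"})))
        rest = rest[1:]
    if rest[:1] in PREPOSITIONS and rest[1:] in STEMS:
        tokens.append((rest[0], frozenset({"ARABIC", "PREP"})))
        rest = rest[1:]
    # Whatever remains stays a single token; unknown words get ARABIC only.
    tokens.append((rest, STEMS.get(rest, frozenset({"ARABIC"}))))
    return tokens


for word in ("wq$l", "wbbyt", "byt"):
    print(word, split_clitics(word))
```

The stem-list guard is one possible way to avoid cutting a word-initial w or b that belongs to the stem; the system described here may use different criteria.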
2.1.2 Implementation 
We initially developed the morphological analysis
module as a sequence of 31 patterns in Perl's regular
expression language. This allowed us to efficiently
develop and refine the patterns needed to recognize
the various morphological word-shapes. When we
plugged this version of the morphological analyzer
into the original tokenizer, however, processing was
quite slow due to the sequential nature of our
morphological patterns and the backtracking nature of
Perl's regular expression matcher. To compensate,
we incorporated the morphological functionality
directly into the Flex specification of the tokenizer.
Whenever the Flex-generated scanner identifies an
Arabic word, it dispatches the appropriate regular
expression to extract the separate morphemes from
that word -- a task that is beyond the capability of
Flex.

[Figure 1: TAGARAB Architecture. ISO 8859-6 text is read by the
Morphological Tokenizer (supported by Stem Lists and
Feature-extraction Rules), which passes tokens with features to
the Name Finder (the NetOwl TurboTag™ pattern engine, supported
by Word Lists and Pattern-Action Rules), which outputs
SGML-annotated text.]
The result is the fastest Arabic morphological an- 
alyzer we are aware of: The overall processing rate 
for TAGARAB is approximately 46 megabytes/hour 
on a Sun Ultra 1. Morphological processing by itself 
runs at about 190 megabytes/hour. 
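To put the two rates side by side, they can be converted into processing time per megabyte. The sketch below is a back-of-the-envelope calculation only: it assumes that the time spent in each stage simply adds up, which the text does not state, and under that assumption morphology would account for about a quarter of the total run time.

```python
OVERALL_MB_PER_HOUR = 46.0      # TAGARAB end to end (from the text)
MORPHOLOGY_MB_PER_HOUR = 190.0  # morphological processing alone (from the text)


def seconds_per_megabyte(mb_per_hour: float) -> float:
    """Convert a throughput in MB/hour into seconds needed per MB."""
    return 3600.0 / mb_per_hour


overall_s = seconds_per_megabyte(OVERALL_MB_PER_HOUR)
morphology_s = seconds_per_megabyte(MORPHOLOGY_MB_PER_HOUR)

# Assumption: stage times are additive, so the share is a ratio of times.
morphology_share = morphology_s / overall_s

print(f"overall:    {overall_s:.1f} s/MB")
print(f"morphology: {morphology_s:.1f} s/MB")
print(f"share:      {morphology_share:.1%}")
```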
2.1.3 Linguistic Design 
We had originally planned to develop a morpholog- 
ical capability that would be helpful in improving 
name recognition, as discussed in Section 1.2. In 
the following, we discuss the linguistic design of the 
morphological analysis. 
Arabic is a highly inflected language. We believed
that there are frequently enough surface cues
in the shape of an Arabic word⁷ to allow the
assignment of the kind of morphological information
described in Section 2.1.1. For example, inflected
forms of derived verb stems such as [yaftatiHu], "he
inaugurates," would seem to have an orthographic
"shape" that is fairly distinctive in an Arabic text. We
felt that this information could be exploited to
successfully identify tokens as nouns, verbs (perfects or
imperfects), etc., to a sufficiently reliable extent that
the later name-recognition patterns could effectively
make use of it.
The morphological analysis process consists of a
series of regular expressions partially supported by
lists of noun, verb, and adjective stems, as well as
closed-class items. The regular expressions cover all
allowable prefixes and suffixes for each stem type.
Infixation phenomena, however, such as the infixed t
of the Eighth Verbal Form,⁸ are handled as variant
forms in the verb stem list, e.g., '9tbr, "he considered,"
and 9br, "he crossed." No attempt is made
to handle co-occurrence constraints among prefixes
and suffixes, nor to assign voice. Likewise, no
attempt is made to include contextual information, as
is done with standard part-of-speech taggers. There
is no attempt to handle ambiguity: the regular
expression patterns are ordered, and the search for an
analysis of a word stops at the first match. The
token types are then assigned, and the form is not
submitted to any other regular expressions.
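The ordered, first-match strategy can be illustrated with a small Python sketch. The stem lists, affix inventories, and ASCII transliterations below are invented for the example and are far smaller than the lists described in this paper; only the control flow (try patterns in a fixed order, stop at the first hit) is meant to mirror the description.

```python
import re
from typing import List, Set

# Hypothetical mini-lexicon in ASCII transliteration.
CLOSED_ITEMS = ["lkn", "hw"]
NOUN_STEMS = ["byt", "qtl"]
VERB_STEMS = ["ktb", "qtl"]


def _alt(items: List[str]) -> str:
    """Build a regex alternation from a list of literal strings."""
    return "|".join(re.escape(item) for item in items)


# Order matters: closed class first, then nouns, perfects, imperfects.
PATTERNS = [
    ("CLOSED", re.compile(rf"(?:{_alt(CLOSED_ITEMS)})")),
    ("NOUN", re.compile(rf"(?P<art>Al)?(?:{_alt(NOUN_STEMS)})(?P<suff>hm|h)?")),
    ("PERFECT", re.compile(rf"(?:{_alt(VERB_STEMS)})(?:t|wA)?(?P<suff>hm|h)?")),
    ("IMPERFECT", re.compile(rf"[ytn](?:{_alt(VERB_STEMS)})(?:wn)?(?P<suff>hm|h)?")),
]


def analyze(word: str) -> Set[str]:
    """Return the feature set assigned by the first matching pattern."""
    features = {"ARABIC"}
    for pos, pattern in PATTERNS:
        match = pattern.fullmatch(word)
        if match:
            features.add(pos)
            groups = match.groupdict()
            if groups.get("art"):
                features.add("DEF_ART")
            if groups.get("suff"):
                features.add("SUFF")
            break  # first match wins; later patterns are never consulted
    return features


for word in ("qtl", "Albyt", "ktbh", "yktb", "zzz"):
    print(word, sorted(analyze(word)))
```

Because the noun pattern precedes the perfect pattern, an ambiguous form such as qtl receives the NOUN reading here, while a form with an unambiguous verbal shape is still picked up by a later pattern.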
Not all Arabic tokens are hit by one of these 
regular-expression patterns that provide morpholog- 
ical features. Although there is a mix of patterns 
supported by lexical information and patterns that 
operate entirely by rule (no supporting lexical data), 
the vast majority of matches appear to occur with 
the former set of patterns. In other words, the cover- 
age of the morphological analysis is crucially depen- 
dent on the lexical data. There are 1051 noun forms, 
813 verb forms, and 241 adjective forms. There 
is also a comprehensive list of closed-class items. 
The notion of "lexical item" in TAGARAB's lists
is somewhat similar to the listing principle found
in Landau (1959): broken plural forms for nouns and
adjectives receive an independent entry, much like
different stems of verbs as mentioned previously. We
make no effort to distinguish I and II forms of verbs,
as these are not usually distinguished orthographically
in Al-Hayat. In general, if there is no visible
orthographic distinction in normal Arabic prose, we
do not make a distinction in the lexical data.

⁷ TAGARAB deals exclusively with the written form, i.e.,
without indication of short vowels.
⁸ We use the usual terms for these forms found in Western
grammars.
We also entered forms both with and without
hamza, as in Al-Hayat any form that may receive
a hamza may also appear without it (even in the
same text!).
Another important feature is that we entered only 
what seemed to us to be frequent lexical items in the 
lexical lists, and tried to do it in such a manner that 
what seemed intuitively the most likely reading of a 
form would be the one selected. This makes sense in 
the case of such a highly deterministic morphology 
and also given our time and resource constraints. 
We wanted to ensure that we got the right readings 
for a large number of highly frequent items, as this 
would be the most useful way to constrain the name- 
recognition patterns. Many high-frequency common 
nouns, verbs, adjectives, and other parts of speech 
in Arabic do not usually form part of names. As 
it turned out (See Section 4 below), this strategy 
worked quite well with person names but was less 
significant for organization names. 
We also decided not to enter items that are used 
directly by the later name-recognition patterns, such 
as locations and given names, as these are accessed 
by the patterns through that module's own word 
lists and therefore having a morphological reading 
for them is not important (see Section 2.2). 
In addition to the supporting lexical data, the or- 
dering of the regular expressions also aided in de- 
termining the analysis selected. The regular ex- 
pressions are grouped functionally, and in general 
the ones pertaining to closed class items apply first, 
then nouns, adjectives, perfects, and imperfects in 
that order. This had the effect that items which are 
highly "marked" as belonging to one category (e.g., 
verbs with a double pronominal object) would be 
captured appropriately by a verb-recognizing regu- 
lar expression that looks for such suffixes, but that 
items that are not so highly marked (e.g., a simple 
third-person masculine perfect form) would be bi- 
ased towards a reading according to the order of the 
regular expressions. 
We were pleasantly surprised to learn that this 
kind of approach -- although obviously simplistic in 
some ways -- produces a very high level of precision 
in analysis (i.e., the parts of speech assigned tend 
strongly to be correct) and surprisingly good recall 
(i.e., there is good coverage of the corpus). We dis- 
cuss our empirical results in Section 3. 
In addition, the morphological information thus 
produced contributed substantially to the effective- 
ness of the name-recognition patterns. We discuss 
this in Section 4. 
2.2 Name Finder 
The Name Finder module of TAGARAB (see Figure 1)
takes as input the tokens found by the Morphological
Tokenizer with the basic and morphological
features attached. The pattern-matching engine
is SRA's NetOwl TurboTag™. It uses data consisting
of a set of Pattern-Action rules supported
by Word Lists. The latter consist of items such
as personal titles that are used by the patterns to
recognize names. The Pattern-Action rules use
contextual and structural information about names to
recognize them dynamically. They also make extensive
use of the feature information coming from the
Morphological Tokenizer. There is minimal permanent
storage of names.
The Pattern-Action rules are written in a conve- 
nient specification language. They are not compiled, 
but are read at run time as part of engine initializa- 
tion. 
3 Evaluation 
3.1 Morphology 
3.1.1 Preparation 
To evaluate the quality of the morphological analysis,
we used SRA's tagging tool, TagTool™, to
manually tag a set of documents for morphological
analysis and part of speech.⁹
For this test document set, we randomly selected
fourteen texts from the Al-Hayat CD-ROM not
belonging to the name recognition training or testing
sets. In addition to manually tagging them, we also
ran TAGARAB over these fourteen texts and used
a standard MUC-style scoring program to compare
the morphological output of TAGARAB with the
"answers" in the hand-tagged version.
3.1.2 Tagging Rules 
We hand-tagged every token in the text, except for:
• Punctuation, Digits
• Tokens within person names
• Person titles
• Locations
• Location adjectives (e.g., [kuwaytī], [lubnānī])
• Adverbs or accusative forms of nouns or adjectives
marked by fatHatayn.

⁹ Because of staffing constraints and the need for knowledge
of Arabic, the same person developed the morphology
component and the name patterns and also hand-tagged the
test set. To remove as far as possible any taint, we did not
change the system or any supporting data once the manual
tagging began.
These exceptions exist because the morphology 
component did not attach features to these items 
for the reasons given in Section 2.1.3. As a result of 
not hand-tagging them, the scoring program judged 
as spurious any morphological features found by the 
system for such items. 
We tagged the test set contextually, again in
accordance with the design of the morphological
component. The most important effect was on the
feature PART. We found participles which usually act
as a noun (e.g., [al-mašrū9], "the plan," [al-muwa.z.zaf],
"the employee"), usually as an adjective
(e.g., [al-bayt al-majhūl], "the unknown house"), or
seem to be freely used in either reading (e.g.,
[muslim], "Muslim"). We tagged these participles
contextually as nouns or adjectives. One effect of this
was that the number of items tagged as participles
was quite low (in effect, only when they are used
predicatively).
In all, the evaluation corpus contains 3214 tokens, 
of which 2324 are Arabic words. 1879 of the latter 
received morphological features when hand-tagged. 
3.1.3 Morphology Evaluation 
The scores for the morphology component are given
in Table 2.¹⁰
Since we did not have access to a morphologi- 
cal analyzer that produces all possible readings for 
forms based on a large lexicon, we do not have a pic- 
ture of the total morpho-lexical ambiguities in our 
evaluation texts. However, despite the small lexicon 
we manually built, the overall recall is reasonable 
(73.0%), and it also holds up well for most of the 
major open class items: perfects (72.7%), imperfects 
(60.2%), and nouns (66.8%). The low recall in ad- 
jectives (28.8%) is due to the fact that we did not 
make many lexical entries for adjectives. Since ad- 
jectives do not come first in the Arabic noun phrase, 
and since we use the morphological information to 
constrain the name patterns, tagging the head noun 
in a noun phrase is what is generally necessary, not 
tagging the adjective. 
What is striking in Table 2 is the high precision
across all the categories, with the exception
of adjectives and participles, the latter a very small
set for the reasons set out in Section 3.1.2. Precision
is consistently above 90%. We interpret this to
mean that a manually built system with a moderate
lexicon, having the capacity to select only one reading
for a given form and not paying any attention
to a word's context, is capable of a very significant
amount of morphological disambiguation in Arabic.

¹⁰ The column headings are the standard ones from MUC:
POS: possible number of points (one point for identifying a
constituent boundary, another for identifying its category);
ACT: actual responses given; COR: correct answers; PAR:
boundary errors; INC: category labelling errors; SPU:
responses given that are not in answer key; MIS: items in key
missing from response; REC: recall (COR/POS); PRE:
precision (COR/ACT); F-M: f-measure ((2 × PRE × REC)/(PRE +
REC)).
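The recall, precision, and f-measure definitions used in the score tables can be checked directly against the reported counts. The following Python sketch applies those definitions to the CLOSED and TOTAL rows of Table 2; the function name `muc_scores` is ours, and rounding to one decimal place is an assumption based on how the scores are printed.

```python
from typing import Tuple


def muc_scores(pos: int, act: int, cor: int) -> Tuple[float, float, float]:
    """Return (recall, precision, f-measure) in percent, rounded to 0.1."""
    recall = 100.0 * cor / pos        # REC = COR / POS
    precision = 100.0 * cor / act     # PRE = COR / ACT
    f_measure = 2 * precision * recall / (precision + recall)
    return round(recall, 1), round(precision, 1), round(f_measure, 1)


# Counts taken from the CLOSED and TOTAL rows of Table 2.
print(muc_scores(pos=336, act=326, cor=322))     # (95.8, 98.8, 97.3)
print(muc_scores(pos=3758, act=2936, cor=2745))  # (73.0, 93.5, 82.0)
```

Both results agree with the REC, PRE, and F-M values reported for those rows.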
Our results are also consistent with the results
of Levinger et al. (1995) for the structurally similar
Hebrew. Levinger et al. discovered that
non-context-based morphological analysis preferring the most
likely morpho-lexical analysis (generated using a
statistical algorithm) gives extremely good results.
Table 3 shows the collisions among the tags.
The most common confusions were between perfects
and nouns in both directions. The system
tagged the following tokens as nouns where the
human tagged them as perfects: nšr (2x), "he/it
published," b9_t, "he/it sent," Hd_t, "it happened,"
w.sflhm, "we described them,"¹¹ .sdrt, "was issued,
appeared," w$.sl, "he/it continued."¹²
Conversely, there were 16 cases where TAGARAB
considered a token as a perfect, and the human
tagged it as a noun. As with the previous case, the
great majority were confusions of the perfect with
a derivationally related noun or verbal noun (e.g.,
qtl, "he killed" or "killing"). Despite the small
numbers of such collisions in the sample, it seems to us
that this is the most difficult disambiguation task, at
least at the part of speech level, since verbal nouns
plus the semantic subject or object in an i.dāfa
construction can look much like a finite verb plus
subject or object. Clearly, context or a higher level of
syntactic/semantic understanding is required to
differentiate the two readings.
On the other hand, the other major confusion
revealed by this table, Noun/Adjective and
Adjective/Noun, is one that seems easily remedied by
building in some knowledge of short context into
the morphology component. For example, the
following three examples (and the rest resemble these)
are cases where the system selects a noun reading
for the adjective within the scope of a noun phrase:
[lāji'ūna muslimūn], "Muslim refugees," [bi.sūratin
xā.s.sa], "in a special form," [mu'tamaruhu al-.siHāfī],
"his news conference." In these cases, the adjectives
also have noun readings, but the local context shows
clearly which reading is correct.
These results identify specific areas of morpho- 
lexical ambiguity bringing into focus where addi- 
tional contextual cues are needed for better ambi- 
guity resolution. 
3.2 Evaluation of Name Patterns 
The scores for the name recognition in TAGARAB 
over the training set of texts are given in Table 4. 
The blind scores are given in Table 5. 
¹¹ In this case, TAGARAB had identified the initial w as
the conjunction wa and the rest of the string as a noun, .sflhm,
"their property."
¹² Similar to w.sflhm. TAGARAB took the initial w as the
conjunction and took the rest of the string as a noun.

Type           POS   ACT   COR   PAR   INC   SPU   MIS   REC   PRE   F-M
CLOSED         336   326   322     0     4     0    10  95.8  98.8  97.3
PERFECT        300   240   218     0    18     4    64  72.7  90.8  80.7
CONJ           286   306   284     0     0    22     2  99.3  92.8  95.9
PREP           772   696   674     0    22     0    76  87.3  96.8  91.8
ADJ            320   136    92     0    40     4   188  28.8  67.6  40.4
PARTICIPLE      12     8     6     0     2     0     4  50.0  75.0  60.0
IMPERFECT      186   124   112     0    10     2    64  60.2  90.3  72.3
ADV             32    28    26     0     2     0     4  81.2  92.9  86.7
NOUN          1514  1072  1011     1    50    10   452  66.8  94.3  78.2
TOTAL         3758  2936  2745     1   148    42   864  73.0  93.5  82.0

Table 2: Final Scores for Morphological Tokenizer (see footnote 10)
Resp\Keys 
NOUN 
PREP 
PERFECT 
CLOSED 
CONJ 
ADV 
PART 
IMPERFECT
ADJ
NOUN PREP 
2 
16 
6 
PERFECT CLOSED 
7 
3 
TOTAL 25 11 9
ADV PART IMPERFECT 
1 1 4 
1 
2 1 1 
ADJ 
17 
5 20 
TOTAL 
32 
0 
28 
0 
7 
1 
2 
1 
3 
74 
Table 3: Morphological Tag Collisions (Rows=System Responses; Columns=Keys) 
Pattern performance followed our experience with 
other languages, except for the recognition of time 
expressions. Usually, these scores run in the mid- 
to high-nineties on a test corpus, but the rich vari- 
ety of time and date formulas hindered scoring very 
high here. The scores for Arabic are consistent with 
scores for other languages (Thai, Chinese, Japanese) 
where there is no orthographic case information. 
4 Contribution of Morphology to 
Name Recognition 
As described in Section 2.2, name recognition is per- 
formed by a set of Pattern-Action rules ("patterns") 
supported by data in the form of word lists, and in 
the case of TAGARAB, reliance on morphological 
information. 
To investigate the role of the latter in supporting
name recognition, we performed an experiment
where we turned off the morphological analyzer por- 
tion of the tokenizer to see what the impact on the 
patterns would be. "Turning off" the morphology 
simply means that we substituted a tokenizer that 
does not cleave off clitics, and only generates the to- 
ken types under Basic Feature in Table 1 and none 
of those under Morphological Feature. This had the 
effect that name-finding patterns which had previ- 
ously accessed the morphological features provided 
by the tokenizer could no longer do so. The pat- 
terns themselves were not changed in any way. The 
results are contained in Table 6. It should be com- 
pared with Table 5. 
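Comparing the two tables amounts to subtracting the blind-set scores without morphology (Table 6) from those with morphology (Table 5). The Python sketch below does this for recall and precision per category; the dictionary layout and the `drops` helper are our own illustration, while the numbers are copied from the two tables.

```python
from typing import Dict, Tuple

Scores = Dict[str, Tuple[float, float]]  # category -> (recall, precision)

WITH_MORPHOLOGY: Scores = {       # Table 5 (blind set)
    "NUMBER": (97.0, 97.7), "ENTITY": (76.9, 82.8), "TIME": (80.7, 91.0),
    "LOCATION": (85.3, 94.5), "PERSON": (76.2, 86.2),
}
WITHOUT_MORPHOLOGY: Scores = {    # Table 6 (blind set, morphology off)
    "NUMBER": (96.6, 97.3), "ENTITY": (69.5, 75.6), "TIME": (78.6, 90.2),
    "LOCATION": (73.1, 90.5), "PERSON": (64.0, 62.1),
}


def drops(before: Scores, after: Scores) -> Scores:
    """Per-category loss in (recall, precision), in percentage points."""
    return {
        category: (round(before[category][0] - after[category][0], 1),
                   round(before[category][1] - after[category][1], 1))
        for category in before
    }


for category, (rec_drop, pre_drop) in drops(WITH_MORPHOLOGY,
                                            WITHOUT_MORPHOLOGY).items():
    print(f"{category:<9} recall -{rec_drop:4.1f}  precision -{pre_drop:4.1f}")
```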
NUMBER and TIME do not show much change 
with morphological token information removed. 
This is not surprising, since the patterns identifying 
these elements rely on word lists of month names, 
written-out numerals, etc. We handled the different
inflected forms of such items as words for time units
([9ām], "year," etc.) by simply listing all possibilities
(singular, dual, plural, masculine, feminine) as separate
entries. This is a reasonable strategy given the
relatively small number of inflected forms for these 
items. In addition, relying on "static" word lists 
rather than "dynamically" generated morphological 
information coming from the tokenizer reduces the 
risk of error. 
PERSON names are affected the most by not
having the morphological features available: Recall
drops 12.2 points and Precision 24.1. By contrast,
ENTITY names are less affected: Recall loses 7.4,
and Precision 7.2. This large difference in the drop in
Precision (24.1 points for Person vs. 7.2 for Entity)
is due to the difference in pattern-writing "styles" for
the two name types. For PERSON names in Arabic,
there are few structural or contextual clues: we had
available a list of titles and given names which served
as the major indicators of the presence of a personal
name. For the remainder of the name, consisting of
an uncertain number of tokens extending out beyond
the given name, there was not much to distinguish
where the name ended and the non-name context
began, except the morphological information provided
by the tokenizer.

Type           POS   ACT   COR   PAR   INC   SPU   MIS   REC   PRE   F-M
NUMBER         314   314   307     5     0     2     2  97.8  97.8  97.8
ENTITY        2144  2070  1740    58    44   228   302  81.2  84.1  82.6
TIME          1612  1574  1467    41     2    64   102  91.0  93.2  92.1
LOCATION      2640  2680  2534    30    10   106    66  96.0  94.6  95.3
PERSON        1926  1946  1778    36     4   128   108  92.3  91.4  91.8
TOTAL         8636  8584  7826   170    60   528   580  90.6  91.2  90.9

Table 4: Name Recognition Scores for Arabic: Training Set

Type           POS   ACT   COR   PAR   INC   SPU   MIS   REC   PRE   F-M
NUMBER         264   262   256     4     0     2     4  97.0  97.7  97.3
ENTITY        2024  1880  1557    77    84   162   306  76.9  82.8  79.8
TIME          1596  1416  1288    60     4    64   244  80.7  91.0  85.5
LOCATION      3468  3128  2957    61    20    90   430  85.3  94.5  89.7
PERSON        2466  2180  1879   157    16   128   414  76.2  86.2  80.9
TOTAL         9818  8866  7937   359   124   446  1398  80.8  89.5  85.0

Table 5: Name Recognition Scores for Arabic: Blind Set

Islamic names involving elements such as [ibn],
[bin], [abu], etc., were exploited as much as possible,
but names with these elements were surprisingly
rare in the Al-Hayat articles.¹³

¹³ We do not have direct statistics on different name types
in the source (e.g., Islamic, Russian). However, we do have
statistics on the relative yields of the patterns, which are
themselves organized in terms of name types.

In effect, the patterns recognizing person names
relied heavily on the presence of any morphological
feature to rule out a given token as being part of the
personal name. Without this information, there was,
for example, a huge increase in the number of spurious
names (names tagged by TAGARAB but with
no equivalent in the human-tagged keys; under the
SPU column in the score reports). One typical example
is the vocalized string [mudīr al-9amaliyāt
al-siyāsīya], "Director of Political Operations," which
the system took as a name, but which is actually an
appositive to a preceding actual person name. The
word [mudīr] is on the title list signalling the onset
of a name and there is nothing to constrain the next
two tokens from being consumed. If morphological
type information were available signalling that
[al-9amaliyāt] is not a likely name constituent, then the
pattern would not have succeeded.
There was also an increase in boundary errors (under
the PAR column in the score reports). These
were those cases where the name pattern "didn't
know where to stop," as in [al-sayyid muHammad
al-baGdādī ka'annahu], where the last element would
have received the feature CLOSED from the
Morphological Tokenizer since it is a subordinating
conjunction with a person suffix. A further side-effect
of such a boundary error, as well as of the spurious
names, is that an element such as [ka'annahu], once
it is considered part of a name by the system, will be
used to recognize variant short forms of that name.
This has the result that any [ka'annahu] in the article
is subject to being classified as a person name.
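The "didn't know where to stop" behaviour can be made concrete with a toy person-name pattern in Python. Everything in this sketch is hypothetical: the rule (title, then given name, then up to two further tokens), the word-list entries, and the ASCII transliterations are invented for illustration and are not the actual Pattern-Action rules.

```python
from typing import List, Optional, Set, Tuple

Token = Tuple[str, Set[str]]

TITLES = {"Alsyd"}        # hypothetical entry for [al-sayyid]
GIVEN_NAMES = {"mHmd"}    # hypothetical entry for [muHammad]
MORPH_FEATURES = {"CLOSED", "CONJ", "PREP", "NOUN", "IMPERFECT",
                  "ADJ", "ADV", "PERFECT", "PART"}


def person_span(tokens: List[Token], start: int,
                max_extra: int = 2) -> Optional[Tuple[int, int]]:
    """Return (start, end) of title + given name + trailing name tokens."""
    if start + 1 >= len(tokens):
        return None
    if (tokens[start][0] not in TITLES
            or tokens[start + 1][0] not in GIVEN_NAMES):
        return None
    end = start + 2
    while end < len(tokens) and end - (start + 2) < max_extra:
        features = tokens[end][1]
        if "ARABIC" not in features or features & MORPH_FEATURES:
            break  # a morphologically analyzed token cannot extend the name
        end += 1
    return start, end


def strip_morphology(tokens: List[Token]) -> List[Token]:
    """Simulate the experiment: keep only the basic (non-morphological) types."""
    removable = MORPH_FEATURES | {"SUFF", "DEF_ART"}
    return [(text, features - removable) for text, features in tokens]


sentence: List[Token] = [
    ("Alsyd", {"ARABIC"}), ("mHmd", {"ARABIC"}),
    ("AlbGdAdy", {"ARABIC"}), ("kAnh", {"ARABIC", "CLOSED", "SUFF"}),
]
print(person_span(sentence, 0))                    # stops before the CLOSED token
print(person_span(strip_morphology(sentence), 0))  # runs on into the next token
```

With the features present, the span ends after the third token; with them stripped, the closed-class item is absorbed into the name, which corresponds to the boundary error discussed above.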
By contrast with person names, the patterns gen- 
erating entity names were less affected by the lack 
of morphological information since they were able to 
exploit more specific name structure. For example, 
a pattern for a specific class of entity names might 
include both the initial word [majlis], "council," and
have as final boundary marker an element such as a 
location adjective. The latter is a separate word list 
independent of the morphological component of the 
system. In effect, there are more specific structural 
indicators of entity names than there are of person 
names, so the patterns are written differently, with 
less reliance on morphological information. 
Type           POS   ACT   COR   PAR   INC   SPU   MIS   REC   PRE   F-M
NUMBER         264   262   255     5     0     2     4  96.6  97.3  97.0
ENTITY        2024  1862  1407   105   148   202   364  69.5  75.6  72.4
TIME          1596  1392  1255    67     6    64   268  78.6  90.2  84.0
LOCATION      3468  2802  2536    82    36   148   814  73.1  90.5  80.9
PERSON        2466  2542  1579   289     8   666   590  64.0  62.1  63.1
TOTAL         9818  8860  7032   548   198  1082  2040  71.6  79.4  75.3

Table 6: Name Recognition Scores for Arabic Without Morphological Features: Blind Set

The final category of names, LOCATION, shows
a large increase in missing names (under the column
MIS in the score reports). The reason for this might
not be obvious, since location names do not receive
any morphological features (see Section 2.1.3). However,
the morphological capability within TAGARAB
identifies clitic items such as the conjunctions [wa] and
[fa], as well as the prepositions [ba], [la], and [ka], as
described above in Section 2.1.1. If there is no
morphological capability, then these clitics are not
recognized anywhere. This has the effect that location
names with a clitic attached will not be recognized,
since the system no longer tokenizes the location as
a separate item.
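The effect on location lookup can be sketched in a few lines of Python. The word-list entry, the transliteration, and the simple prefix-peeling in `candidate_tokens` are assumptions made for this illustration; they only show why a word-list match fails when an attached clitic is not separated.

```python
from typing import List

LOCATIONS = {"byrwt"}            # hypothetical word-list entry (Beirut)
CONJUNCTIONS = ("w", "f")        # wa-, fa-
PREPOSITIONS = ("b", "l", "k")   # ba-, la-, ka-


def candidate_tokens(word: str, split_clitics: bool) -> List[str]:
    """Return the strings that would be looked up in the location list."""
    candidates = [word]
    if not split_clitics:
        return candidates
    rest = word
    if rest[:1] in CONJUNCTIONS and len(rest) > 1:
        rest = rest[1:]
        candidates.append(rest)
    if rest[:1] in PREPOSITIONS and len(rest) > 1:
        rest = rest[1:]
        candidates.append(rest)
    return candidates


def is_location(word: str, split_clitics: bool) -> bool:
    """True if the word, or the word minus clitics, is a listed location."""
    return any(c in LOCATIONS for c in candidate_tokens(word, split_clitics))


print(is_location("wbyrwt", split_clitics=True))   # conjunction peeled off
print(is_location("wbyrwt", split_clitics=False))  # whole string looked up
```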
5 Summary
We have described a system for recognizing names,
dates, times, and numerics in Arabic-language text
through a combination of high-precision morphological
analysis and a subsequent component that
recognizes the named entities. Although highly
deterministic and not taking account of context, the
morphological analysis component removes a great deal
of morpho-lexical ambiguity and has the side-effect
of demonstrating that the true difficulties in Arabic
morphological ambiguity might be limited to specific
contexts.
In addition, we have shown that named entity
recognition in Arabic can be performed at levels
consistent with other non-case languages despite great
differences in structure between them. Finally, we
have shown that morphological information is
crucially important to effective Arabic name recognition.

References
Chinatsu Aone and John Maloney. "Reuse of a
Proper Noun Recognition System in Commercial
and Operational NLP Applications." In Proceedings
of ACL'97 Workshop on From Research to
Commercial Applications: Making NLP Technology
Work in Practice, Madrid, Spain, 1997.
Chinatsu Aone, Mary Ellen Okurowski, James Gorlinsky,
and Bjornar Larsen. "A Scalable Summarization
System Using Robust NLP." In Proceedings
of ACL'97 Workshop on Intelligent Scalable
Text Summarization, Madrid, Spain, 1997.
Jacob Landau. A Word Count of Modern Arabic
Prose. American Council of Learned Societies,
New York, 1959.
Moshe Levinger et al. "Learning Morpho-Lexical
Probabilities from an Untagged Corpus with an
Application to Hebrew." Computational Linguistics,
Vol. 21(3), 1995.
Annemarie Schimmel. Islamic Names. Edinburgh
University Press, Edinburgh, 1989.
