Japanese Named Entity Extraction Evaluation 
- Analysis of Results - 
Satoshi Sekine 
Computer Science Department 
New York University 
715 Broadway, 7th floor 
New York, NY 10003, USA 
sekine@cs, nyu. edu 
Yoshio Eriguchi 
Research and Development Headquarters 
NTT Data Corporation 
1-21-2, Shinkawa, Chuo-ku, 
Tokyo, 104-0033, Japan 
eriguchi~rd, nttdata, co. jp 
Abstract 
We will report on one of the two tasks in the 
IREX (Information Retrieval and Extraction 
Exercise) project, an ew~luation-based project 
fbr Infbrmation Retriewfl and Intbrmation Ex- 
traction in Japanese (Sekine and Isahara, 2000) 
(IREX Committee, 1999). The project stm'ted 
in 1998 and concluded in September 1999 with 
many participants and collaborators (45 groups 
in total from Japan and the US). In this paper, 
the Nmned Entity (NE) task is reported. It is 
a task to extract NE's, such as names of orga- 
nizations, persons, locations and artifacts, time 
expressions and numeric expressions from news- 
p~t)er articles. First, we will explain the task 
and the definition, as well as the data we cre- 
ated and the results. Second, the analyses of the 
results will be described, which include analysis 
of task diifieulty across the NE types and sys- 
tern types, amflysis of dolnain dependency and 
comparison to hmnan per~brmance. 
1 Introduction 
The need for II1 and IE technologies is getting 
larger because of the improvements ill computer 
technology and the appearance of the Internet. 
Many researchers in the field feel that the ew~l- 
uation based projects in the USA, MUC (MUC 
Homepage, 1999) and TREC (TREC Home- 
page, 2000), have played a very important role 
in each field. In Japan, however, while there has 
been good research, we have had some dilficul- 
ties comparing systems based on the same plat- 
form, since our researdt is conducted at many 
different; Ulfiversities, companies, and laborato- 
ries using different data and evaluation lnea- 
sures. Our goal is to have a common platform in 
order to evaluate systems with the stone stan- 
dard. We believe such projects are nseflfl not 
only for comparing system performance lint also 
to address the following issues: 
1) To share and exchange problems among re- 
searchers, 
2) To aeeulntll~,te large quantities of data, 
3) To let other people know the importance 
and the quality of IR and IE techniques. 
Finishing the project, we believe we achieved 
these goals. 
In this paper we will describe one of the two 
tasks in the IR.EX project, the Nmned Entity 
task. 
2 IREX NE 
2.1 Task 
Named Entity extraction involves finding 
Named Entities, such as names of organizations, 
persons, locations, and artifimts, time expres- 
sions, and numeric expressions, such as money 
and percentage expressions. It is one of the ha- 
sic techniques used in IR and IE. At the ewfl- 
uation, participants were asked to identit\[y NE 
expressions as correctly as possible. In order 
to avoid a copyright I)robleIn, we made a tool 
to convert a tagged text to a set of tag off'set 
information and wc only exchanged tag ott;et 
intbrlnation. 
2.2 Definition 
The definition of NE's is given in an 18-page 
document, which is available through the II1EX 
homepage (IREX Homepage, 1999). There 
are 8 kinds of NE's shown in 'lhfl)le 1. In or- 
der to avoid requiring a unique decision ~br 
ambiguous cases where even a lnnnan could 
not tag unambiguously, we introduced a tag 
"OPTIONAL ''1. If a system tags an expression 
1This tag is newly introduced in IREX and does not 
exist in MUC. The tag accounts for 5.7% of all NE occur- 
1106 
NE Examl)le 
ORGANIZATION The Diet, 1REX Commit;tee 
PERSON Sekine, \¥akanohana 
LOCATION Japan, '\]bkyo, Mt.Fuji 
ARTIFACT Pentiuln II, Nobel Prize 
DATE March 5, 1.965; Yesterday 
TIME 11 PM, nfidnight 
MONEY 100 yen, $12,345 
PERCENT 1.0%, a half 
~i~ble 1: NE Classes 
within the OPTIONAL tag, it; is just ignored for 
th.e scoring. The defilfition was created 1)ased on 
the MUC/MET definition; however, the process 
of lnaking the definition was not easy. In par- 
ticulaI', the definition of the newly introduced 
NE tyl)e "artifact" was Colitroversial. W'e ad- 
mit that more consideration is needed to make 
a clem'er definition of the NE typos. 
Comparing the NE task in Japanese to that 
in English, one of the ditIiculties comes from the 
fact that there is no word delinfiter in Japanese. 
Sysl;elns have to identity the })oundaries of ex- 
pressions. This will 1)ecome complicated when 
we want to tag a sul)string of what ix gener- 
ally considered a ,Japanes(~ wor(t, l/or (~xaml)le , 
il.t .Jal)allese there is a word "Ratnich±" which 
means "Visil; 3apa.n" and consists of two Chi- 
nese eh.aracters, "Ra±" (Visit;)and "Nichi" (ab- 
breviation of .Japan). Although mmly word seg- 
reenters identif~y it as a single, word, we expect 
to extrtmt only "Nichi" as a local;ion. '\]'his is 
a tricky prol)lem, as opposed to the ease in En- 
glish where a word is the unit of NE candidates. 
2+3 Runs and Data 
There were three kinds of NE exercises, the dry 
run, a restricted (hmlMn tbrmal rtm, and a gen- 
eral domain tbl'mal 1'1111, which will be explained 
later. Also we created three kinds of training 
(h~ta: the dry run trailfing data, the CI{.L_NE 
data and the formal run domain restricted trail> 
ing data. Td)le 2 shows the size of each data set. 
Note that CRL_NE (lata l)elongs to the Colllnltt- 
nication ll.esearch Laboratory (CI{L), but it is 
ronces ill the generM dolnaill evMuation and 2.1% in the 
restricted domain e, valuation (the t.ypes of the evaluation 
will be explained later). 
ineht(ted ill the tat)le, because the data was cre- 
ated by IREX participants, using the definition 
of II{EX-NE, +rod distributed through I\]{,EX. 
Data Number of 
articles 
Dry Run training 
Dry t.hm 
CIIL_NE data 
D)rlnal l'lln (restricted) training 
Formal run (restricted) 
Formal run (general) 
46 
36 
1.174 
23 
20 
71 
Table 2: Dnta size 
~n or(let to ensure the, fairness of the exercise 
in the formal \]'un~ we used newspaper articles 
which no one had ew~r seen. We, set the date to 
fl'eeze the system development (April 13, 1999). 
The date for the evahtation was set one month 
after that (lat;e (May 13 to \]7, 1999) so that 
we could select the test m'ticles fl'om the 1)cried 
t)etween those dates. \¥e thank the Mainichi 
Newspaper CorI)oration for provi(ling this data 
for us t\]:ee of charge. 
2.4 Restricted domain 
in the fbrmal run, in order to study system 
portability and the effect of domains on NE 
perfoilllanc(',, we had two kinds of evaluation: 
rest;rioted domain and general domMn. In the 
general domain ewthtation, w(, selected articles 
regardless of dolnain. The domain of the re- 
stricted domain evaluation was a.lmouneed one 
month before the develolmmnt freeze date. It; 
was an "arrest;" domain defined as follows and 
211 the articles in the restricted domain are se- 
lected based on the definition. 
77re articles arc 'related to an e'ucnt 
",,frost". The event is defined as th, c 
a'r'rc.st of a .suspect o1' s'~t,5'pects by po- 
lice, National \])olicc, State police of 
other police forces including the o'ncs 
of foreign countries. It includes arti- 
cles mentionirtg an arrest event in the 
past. It: excludes articles which have 
only i'n:formation about requesting an 
arrest warrant, art accusation or send- 
ing the pape'rs pc'training to a case to 
an Attorney's OJJicc. 
1107 
2.5 Results 
8 groups and 11 systems participated in the 
dry run, and 14 groups and 15 systems partici- 
pated in the %rmal run 2. Tim evaluation results 
were made public anonymously using systeln 
ID's. Table 3 shows the ew~luation results (F- 
measure) of the formal run. F-measure is cal- 
culated from recall and precision (IREX Coin- 
mittee, 1999). It ranges from 0 to 100, and the 
larger the better 
System ID 
1201 
1205 
1213 
1214 
1215 
1223 
1224 
1227 
1229 
1231 
1234 
1240 
1247 
1250a 
12501) 
general 
57.69 
80.05 
66.60 
70.34 
66.74 
72.18 
75.30 
77.37 
57.63 
74.82 
71.96 
60.96 
83.86 
69.82 
57.76 
restrict 
54.17 
78.08 
59.87 
80.37 
74.56 
74.90 
77.61 
85.02 
64.81 
81.94 
72.77 
58.46 
87.43 
70.12 
55.24 
diff. 
-3.52 
-1.97 
-6.73 
+10.03 
+7.82 
+2.72 
+2.31 
+7.65 
+7.18 
+7.12 
+0.81 
-2.50 
+3.57 
+0.30 
-2.52 
~li~ble 3: NE Formal run result 
3 Analyses of the results 
3.1 Difficulty across NE type 
In Table 4, tile F-measure of the best perform- 
ing system is shown in the "Best" column; the 
average F-measures are shown in the "Average" 
column tbr each NE type on the formal runs. It 
can be observed that identifying time and nu- 
lneric expressions is relatively easy, as the av- 
erage F-measures are more than 80%. In con- 
trast, the accuracy of the other types of NE is 
not so good. Ill particular, artifacts are quite 
difficult to identify. It is interesting to see that 
tagging artifacts in the general domain is mud1 
harder thins in the restricted domain. This is 
because of the limited types of artifacts in the 
restricted domain. Most of the artifacts in the 
2The participation to the dry run was not obligatory. 
This is why the number of participants is smaller in the 
dry run than that in the formal rmL 
restricted domain are the names of laws, as the 
doinain is the arrest domain. Systems lnight be 
able to find such types of names easily because 
they could be recognized by a small number of 
simple patterns or by a short list. The types 
of tim artifacts in the general donmin are quite 
diverse, including names of prizes, novels, ships, 
or paintings. It nfight be difficult to build pat- 
terns for these itelns, or systems may need very 
complicated rules or large dictionaries. 
3.2 Three types of systems 
Based on the questionnaire for the particit)ants 
we gatlmred alter the formal runs, we found that 
there are three types of systems. 
• Hand created pattern based 
These are pattern based systems where the 
patterns are created by hand. A typical 
system used prefix, sutlqx and proper noun 
dictionaries. Patterns in these systems look 
like "If proper nouns are followed by a suffix 
of person name (for example, a common 
suflqx like "San", which is ahnost equivalent 
to Mr. and Ms.) then the proper nouns are 
a t)erson nmne". This type of system was 
very common; there were 8 systems in this 
category. 
• Automatically created pattern based 
These are pattern based systems where 
some or all of the patterns are created an- 
tolnatically using a training corl)us. There 
were three systems in this category, and 
these systems used quite different meth- 
ods. One of them used the "error driven 
method", in which hand created patterns 
were applied to tagged training data and 
the system learned from tlm mistakes. An- 
other system learned patterns for a wide 
range of information, including, syntax, 
verb frame and discourse information fl'om 
training data. The last system used the 
local context of training data and several 
filters were applied to get more accurate 
patterns. 
Fully automatic 
Systems in this category created their 
knowledge automatically from a training 
corpus. There were four systems in this 
category. These systems basicMly tried to 
assign one of the four tags, beginning, mid- 
dle or ending of an NE, or out-ofNE, to 
1108 
NE type 13est 
Orgmfization 78 
Person 87 
Location 8d: 
Artifact 44 
Date 90 
Time 82 
Money 86 
Percent 84 
Total 84 
General domain 
Average Expert 
57 96 
68 99 
70 98 
26 90 
86 98 
83 97 
86 100 
86 97 
70 98 
Best 
75 
87 
88 
83 
93 
97 
100 
87 
Table 4: Results 
Restrict domain 
Average Novice Expert 
55 
69 
68 
58 
89 
90 
91 
88 
97 
94 
74 
96 
98 
100 
98 
100 
99 
92 
100 
98 
100 
72 94 99 
each word or each character. The source 
information for the training was typically 
character type, POS, dictionary in%rma- 
tion or lexical information. As tile learn- 
ing meclmnism, Maximmn Entrot)y models, 
decision trees, and HMMs were used. 
It is interesting to see that tile top three sys- 
tems came, fi'om each category; the best sys- 
tem was a hand create, d pattern based system, 
tile second system was an automatically created 
pattern based system and the third system was 
a fully automatic system. So we believe we can 
not conclude which type is SUl)erior to the eth- 
el'S. 
Analyzing the results of the top three sys- 
to, ms, we observed the, importance of tile dic- 
tionaries. The best hand created pattern based 
system seems to have a wide coverage dictionary 
for person, organization and location names 
and achieved very good accuracy tbr those cat- 
egories. Howe.ver, the hand created pattern 
based system failed to capture the evahmtion 
specific pattenls like "the middle of April". Sys- 
tems wore required to extract the entire ex- 
1)ression as a date expression, but; the system 
only extracted "April". The best hand created 
rule based system, as well as the best ~mtolnat- 
ically created pattern lmsed system also missed 
other specific patterns which inchlde abbrevi- 
ations ("Rai-Nichi" = Visit-Japan), conjmm- 
tions of locations ("Nichi-Be±" = ,\]alton-US), 
and street; addresse, s ("Meguro-ku, 0okayama 
2-12-1"). The best hilly autolnatic system was 
successflfl in extracting lllOSt of these specific 
patterns. However, the flflly automatic system 
has a probleln in its coverage. In lmrticular, the 
training data was newspaper articles published 
in 1994 and the test; data was fro151 1999, so 
there are several new names, e.g. the prilne min- 
ister's name which is not so co1111noll (0buchi) 
and a location nmne like "Kosovo", wlfich were 
rarely mentioned in 1994 but apt)eared a lot in 
1999. '.Fhe system missed many of them. 
3.3 Domain dependency 
In Table 3, the differences in performance be- 
tween the general domain and the restricted do- 
main ~r(' shown ill the cohmm "diff.". Many 
systems 1)erfl)nne(l better in the restri(:ted do- 
main, although ~ small ntllnl)er of systems per- 
forlned better ill the genera\] domain. There 
were two systems which intentiomflly tuned 
their systems towards the restricted domain, 
which are shown in bold in the table. Both of 
these were alnong the systelns which perfbrlned 
much better (more than 7%) ill the restricte.d 
domain. The system which achieved the largest 
improvement was a fully automatic system, and 
it only replaced the training data for the domain 
restricted task (so this is an intentionally tuned 
system). It shows the domain dependency of the 
task, although further investigation is needed to 
see why some other systems can perform nmch 
better even without domain tinting. 
3.4 Comparison to human performance 
In Table 4, hulllan performance is shown in 
the "Novice" and "Expert" cohtmns. "Novice" 
means tile average F-measure of three gradu- 
ate students all(l "l;xpert" means the average 
F-measure of the two people who were most re- 
1109 
sponsible for creating tile definition and created 
the answer. They frst created two answers in- 
dependently and checked them by themselves. 
The results after the checking are shown in the 
table, so many careless mistakes were deleted 
at this time. We Call say that 98-99 F-measure 
is the performance of experts who create them 
very carefully, and 94 is a usual person's perfof 
lnance. 
We can find a similar pattern of performance 
among different NEs. Hmnans also performed 
more poorly for artifacts and very well for time 
and numeric expressions. 
Tile difference t)etween the best system per- 
formance and hulnan perfbrlnance is 7 or more 
F-measure, as opposed to the case in English 
where the top systems perform at, a level compa- 
rable or superior to human perfornlancc. There 
could be several reasons for this. One obvious 
reason is that we introduced a ditficult NE type, 
artifact~ which degrades the overall performance 
more for the system side than the lmman side. 
Also, the difficulty of identifying the expression 
boundaries may contribute to the difference. Fi- 
nally, we believe that the systems can possibly 
improve, as IREX was the first evaluation based 
project in Japanese, whereas in English there 
trove been 7 MUC's and tile technology may 
have matured by now. 
5 Acknowledgment 
We would like to thank all the participants of 
IREX projects. The project would never have 
been as successful as it was without the partic- 
ipants, all of whom were very cooperative and 
constructive. 
4 Conclusion 
We reported on tile NE task of tim IREX 
project. We first explained tile task and the 
definition, as well a.s the d~ta we created and 
the results. The analyses of the result were de- 
scribed, which include analysis of task ditficulty 
across the NE types and system types, analysis 
of domain dependency and comparison to hu- 
man performance. 
As this is one of the first projects of this type 
in ,Japan, we may have a lot to do ill the fu- 
ture and holmflflly tile results of tile project 
will be beneficial for fllture projects. As tile 
next step, II/,EX will be merged with a simi- 
lar project NTCIR (NTCIR Homepage, 2000) 
which places more eml)hasis on IR., with a newly 
created project for sumlnarization, TSC (TSC 
Homepage, 2000), and contilme this kind of ef- 
fort for tile fllture. 

References 

IREX Committee 1999 Proceedings of the 
IREX Workshop 

Satoshi Sekine, Hitoshi Isahara. 2000 : "IREX: 
IR and IE Ewdnation Project in Japanese" 
Proceedings of the LREC-2000 coT@fence 

IREX Hornepage 
http://cs.nyu.edu/projects/proteus/irex 

MUC Homepage http://www.muc.saic.com/ 

TREC Homepage trec.nist.gov/ 

NTCIR H(nneI)age 
httl)://www.rd.nacsis.ac.jp/fitcadm/index- 
en.html 

TS C ttomeI)age httl,:// al  ga.jaist.ac.jp:S000/tsc/ 
