NYU: DESCRIPTION OF THE JAPANESE NE SYSTEM USED
FOR MET-2
Satoshi Sekine
Computer Science Department
New York University
715 Broadway, 7th Floor
New York, NY 10003, USA
sekine@cs.nyu.edu
INTRODUCTION
In this paper, experiments on the Japanese Named Entity task are reported. We employed a supervised
learning mechanism. Recently, several systems have been proposed for this task, but many of them use
hand-coded patterns. Creating these patterns is laborious work, and when we adapt these systems to a new
domain or a new definition of named entities, a large amount of additional work is likely to be needed. On the
other hand, in a supervised learning system, what is needed to adapt the system is to make new training data,
and perhaps a small amount of additional work. While this is also not a very easy task, it is easier than creating
complicated patterns. For example, based on our experience, 100 training articles can be created in a day.
There have also been several machine learning systems applied to this task. However, these either 1)
partially need hand-made rules, 2) have parameters which must be adjusted by hand, 3) do not perform well
by fully automatic means, or 4) need a huge amount of training data. Our system does not work fully automatically, but
performs well with a small training corpus and does not have parameters to be adjusted by hand. We will
discuss one of the related systems later.
ALGORITHM
In this section, the algorithm of the system will be presented. There are two phases, one for creating the
decision tree from training data (training phase) and the other for generating the tagged output based on
the decision tree (running phase). We use a Japanese morphological analyzer, JUMAN [6], and a program
package for decision trees, C4.5 [7]. We use three kinds of feature sets in the decision tree:
- Part-of-speech tagged by JUMAN
  We define the set of our categories based on its major category and minor category.
- Character type information
  Character type, like Kanji, Hiragana, Katakana, alphabet, number or symbol, etc., and some combinations of these.
- Special Dictionaries
  Lists of entities created based on JUMAN dictionary entries, lists distributed by SAIC for MUC, lists
  found on the Web, or lists based on human knowledge. Table 1 shows the number of entities in each
  dictionary (1). Organization names have two types of dictionary: one for proper names and the other for
  general nouns which should be tagged when they co-occur with proper names. Also, we have a special
  dictionary which contains words written in the Roman alphabet which are most likely not an organization
  (e.g. TEL, FAX). We made a list of 93 such words.

(1) Some of the lists are available at [8].
Entity    prefix   name       suffix
Org.      14       10076/49   175
Person    0        17672      82
Loc.      0        14903      60
Date      24       199        29
Time      2        24         5
Money     15       0          39
Percent   0        99         3

Table 1: Special Dictionary Entries
Creating the special dictionaries is not very easy, but it is not very laborious work. The initial dictionary was
built in about a week. In the course of the system development, in particular during the creation of the training
corpus, we added some entities to the dictionaries.
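The special-dictionary look-up described above, in which a single token can match several dictionaries at once, might be sketched as follows. The dictionary contents and the helper name `dict_features` are illustrative, not part of the actual system:

```python
# Hypothetical sketch of special-dictionary look-up: one token may match
# entries in several dictionaries at the same time.
SPECIAL_DICTS = {
    "org":    {"Matsushita", "NTT"},
    "person": {"Matsushita"},
    "loc":    {"Matsushita", "Isuraeru"},
}

def dict_features(token):
    """Return the set of dictionary types whose entries contain the token."""
    return {name for name, entries in SPECIAL_DICTS.items() if token in entries}

# "Matsushita" matches the organization, person and location dictionaries,
# as in the example in the Training Phase section.
print(sorted(dict_features("Matsushita")))
```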
The decision tree gives an output for each token. It is one of the four possible combinations of opening,
continuing and closing a named entity, or having no named entity, as shown in Table 2. In this paper, we
will use two different sets of terms in order to avoid confusion between positions relative to a token and
regions of named entities. The terms beginning and ending are used to indicate positions, whereas opening
and closing are used to indicate the start and end of named entities. Note that there is no overlapping or
embedding of named entities. An example of real data is shown in Figure 1.

Output   beginning       ending
OP-CL    opening of NE   closing of NE
OP-CN    opening of NE   cont. of NE
CN-CN    cont. of NE     cont. of NE
CN-CL    cont. of NE     closing of NE
none     none            none

Table 2: Five Types of Output
Training Phase
First, the training sentences are segmented and part-of-speech tagged by JUMAN. Then each token is
analyzed by its character type and is matched against entries in the special dictionaries. One token can
match entries in several dictionaries. For example, "Matsushita" could match the organization, person and
location dictionaries.
Using the training data, a decision tree is built. It learns about the opening and closing of named entities
based on the three kinds of information about the previous, current and following tokens. The three types of
information are the part-of-speech, character type and special dictionary information described above.
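The assembly of one training instance from the three feature kinds for the previous, current and following tokens might be sketched like this. The feature names and the simplified `char_type` are hypothetical; the real system uses JUMAN's output and the full set of Japanese character classes:

```python
# Illustrative sketch of building a decision-tree instance: for each of the
# previous, current and following tokens, record part-of-speech, character
# type and special-dictionary information (a 3 x 3 feature window).
def char_type(token):
    # Grossly simplified stand-in for the real character classifier,
    # which distinguishes Kanji, Hiragana, Katakana, numbers, symbols, etc.
    if token.isdigit():
        return "Num"
    return "Kata" if token.isupper() else "Other"

def features(tokens, pos_tags, dict_tags, i):
    """Feature dictionary for token i, with 'NONE' at sentence boundaries."""
    feats = {}
    for offset, name in ((-1, "prev"), (0, "cur"), (1, "next")):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[name + "_pos"] = pos_tags[j]
            feats[name + "_ctype"] = char_type(tokens[j])
            feats[name + "_dict"] = dict_tags[j]
        else:
            feats[name + "_pos"] = feats[name + "_ctype"] = feats[name + "_dict"] = "NONE"
    return feats
```

For the first token of the Figure 1 example (ISURAERU), the previous-token features would all be "NONE", the current dictionary feature "loc", and the following dictionary feature "org-S".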
If we just used the deterministic decision created by the tree, it could cause a problem in the running phase.
Because the decisions are made locally, the system could make an inconsistent sequence of decisions overall.
For example, one token could be tagged as the opening of an organization, while the next token might be
tagged as the closing of a person name. We can think of several strategies to solve this problem (for example,
the method of [2] will be described in a later section), but we used a probabilistic method.
The instances in the training corpus corresponding to a leaf of the decision tree may not all have the
same tag. At a leaf we don't just record the most probable tag; rather, we keep the probabilities of
all possible tags for that leaf. In this way we can salvage cases where a tag is part of the most probable
globally-consistent tagging of the text, even though it is not the most probable tag for this token, and so
would be discarded if we made a deterministic decision at each token.
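Keeping a tag distribution at each leaf, rather than a single majority tag, could be sketched as follows; the counts are invented for illustration:

```python
# Sketch: at each decision-tree leaf, store the relative frequency of every
# tag seen among the training instances that reach that leaf, instead of
# collapsing to the single most frequent tag.
from collections import Counter

def leaf_probabilities(training_tags_at_leaf):
    """Map each tag to its relative frequency at this leaf."""
    counts = Counter(training_tags_at_leaf)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

# e.g. a leaf reached by 4 'none' instances and 2 'org-OP-CN' instances
# yields {'none': 0.67, 'org-OP-CN': 0.33}, as for ISURAERU in Figure 2.
probs = leaf_probabilities(["none"] * 4 + ["org-OP-CN"] * 2)
```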
Running Phase
In the running phase, the first three steps (token segmentation and part-of-speech tagging by JUMAN,
analysis of character type, and special dictionary look-up) are identical to those in the training phase. Then,
in order to find the probabilities of opening and closing a named entity for each token, the properties of the
previous, current and following tokens are examined against the decision tree. Figure 2 shows two example
paths in the decision tree. For each token, the probabilities of 'none' and the four combinations of answer pairs
for each named entity type are assigned. For instance, if we have 7 named entity types, then 29 probabilities
are generated.
Once the probabilities for all the tokens in a sentence are assigned, the remaining task is to discover the
most probable consistent path through the sentence. Here, a consistent path means, for example, that a path
cannot have org-OP-CN and date-OP-CL in a row, but can have loc-OP-CN and loc-CN-CL. The output is
generated from the consistent sequence with the highest probability for each sentence. The Viterbi algorithm
is used in the search; this can be run in time linear in the length of the input.
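The consistent-path search might be sketched with a small Viterbi-style dynamic program. The consistency rule below is a simplified reading of the description above, and the per-token probabilities are assumed to multiply along a path; the real system's details may differ:

```python
# Minimal Viterbi sketch over per-token tag distributions. Consistency rule
# (simplified): a tag ending in a continuation (OP-CN, CN-CN) must be
# followed by a continuation of the same entity type (CN-CN or CN-CL);
# any other tag may be followed by 'none' or an opening tag.
def consistent(prev_tag, tag):
    if prev_tag.endswith("-CN"):             # previous token left an entity open
        etype = prev_tag.split("-")[0]
        return tag.startswith(etype + "-CN-")
    return tag == "none" or "-OP-" in tag

def viterbi(token_probs):
    """token_probs: one {tag: probability} dict per token.
    Returns (probability, tag_path) of the best consistent path."""
    best = {tag: (p, [tag]) for tag, p in token_probs[0].items()}
    for dist in token_probs[1:]:
        new_best = {}
        for tag, p in dist.items():
            cands = [(bp * p, path + [tag])
                     for prev, (bp, path) in best.items() if consistent(prev, tag)]
            if cands:
                new_best[tag] = max(cands)
        best = new_best
    # a legal path cannot end with an entity still open
    final = {t: v for t, v in best.items() if not t.endswith("-CN")}
    return max(final.values())
```

Run on the two-token example from Figure 2, this prefers the organization path (0.33 x 0.86 = 0.28) over the all-none path (0.67 x 0.14 = 0.09).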
EXAMPLE
Figure 1 shows an example sentence along withthree types of information,part-of-speech, character type
and special dictionary information, and given information of openingand closingofnamed entities. Figure
2shows two example paths in thedecision tree. For the purpose of demonstration, we used the #0Crst and
secondtoken of the example sentence in Figure 1. Each line corresponds to a question asked bythe tree
nodes alongthepath. The last lineshows the probabilities of named entity information whichhave more
than 0.0 probability. This instance demonstrates howthe probabilitymethod works. As we can see, the
probability of none for the #0Crst token #28Isuraeru = Israel#29 is higher than that for theopening of organization
#280.67 to 0.33#29, butinthe secondtoken #28Keisatsu =Police#29, the probability of closing organization is much
higher than none #280.86 to 0.14#29. The combined probabilities of thetwo consistentpaths are calculated. One
of these paths makes thetwotokens an organization entity while alongtheother path, neither token is part
of a named entity.The probabilities are higher in the #0Crst case #280.28#29 than thatinthelatter case #280.09#29, So
thetwotokens are tagged as an organization entity.
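The combined path probabilities quoted above follow from multiplying the per-token probabilities shown in Figure 2:

```python
# The two consistent paths over the tokens ISURAERU and KEISATSU:
org_path = 0.33 * 0.86    # org-OP-CN then org-CN-CL
none_path = 0.67 * 0.14   # none then none

print(round(org_path, 2), round(none_path, 2))  # 0.28 0.09
```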
Token          ISURAERU    KEISATSU    NI        YORU    TO        ,       ERUSAREMU
POS            PN-loc      N           postpos   V       postpos   comma   PN-loc
Char. type     Kata        Kanji       Hira      Hira    Hira      Comma   Kata
Special Dict.  loc         org-S       -         -       -         -       loc
NE answer      org-OP-CN   org-CN-CL   -         -       -         -       loc-OP-CN

Token          SHI         HOKUBU      DE        26           NICHI        GOGO          ,
POS            N-suf       N           postpos   number       N-suf        N             comma
Char. type     Kanji       Kanji       Hira      Num          Kanji        Kanji         Comma
Special Dict.  loc-S       -           -         -            date-S       time,time-P   -
NE answer      loc-CN-CL   -           -         date-OP-CN   date-CN-CL   time-OP-CL    -

Figure 1: Sentence Example
RESULTS
We will report the results of five experiments, described in Table 3. Here, "Training data", "Dry run data"
and "Formal run data" are the data distributed by SAIC, and "seefu data" is the data created by Oki, NTT
Data and NYU (available through [8]). Note that all Training, Dry run and seefu data are on the topic of
vehicle crash, and only the Formal run data is on the topic of space craft launch. Numbers in brackets
indicate the number of articles.

ISURAERU (first token)
  if current token is a location -> yes
  if next token is a loc-suffix -> no
  if next token is a person-suffix -> no
  if next token is a org-suffix -> yes
  if previous token is a location -> no
  THEN none = 0.67, org-OP-CN = 0.33

KEISATSU (second token)
  if current token is a location -> no
  if current token is a organization -> no
  if current token is a time -> no
  if current token is a loc-suffix -> no
  if next token is a time-suffix -> no
  if current token is a time-suffix -> no
  if next token is a date-suffix -> no
  if current token is a date-suffix -> no
  if current token is a date -> no
  if next token is a location -> no
  if current token is a org-suffix -> yes
  if previous token is a location -> yes
  THEN none = 0.14, org-CN-CL = 0.86

Figure 2: Decision Tree Path Example
Experiment                 Training Data                                   Test Data
1) Formal run              Training data (114), seefu data (150),          Formal run data
                           Dry run data (30)
2) Best in-house Dry run   Training data (114), seefu data (150)           Dry run data
3) 75/25 experiment        75% of Formal run data (75)                     25% of Formal run data
4) All training + 75/25    Training data (114), seefu data (150), Dry      25% of Formal run data
                           run data (30), 75% of Formal run data (75)
5) Add planet names        same as 4)                                      same as 4)

Table 3: Runs
The results of the Formal run and the best in-house Dry run are shown in Table 4. We can clearly see that
the recall of named entities (person, organization and location) is bad. This is caused by the change of
topic. For example, there are very few foreign person names written in Katakana in the training data (as
a foreign person would hardly be a victim of a crash in Japan). However, in the space craft launch topic, there
are many foreign person names written in Katakana. This is the reason why the recall of persons is so low.
Also, in the test documents, planet names such as "the Sun", "the Earth" or "Saturn" are tagged as locations, which
could not be predicted from the training topic. We missed all such names in the formal test.
The best in-house Dry run result was achieved before the formal run without looking at the test data. So
it should be regarded as an example of the performance when we know the topic of the material. We think this
is satisfactory, considering that the effort we made was just preparing dictionaries and no patterns.
Table 5 shows three experiments performed after the formal run. As the topic change may degrade the
performance, we conducted experiments in which the training data includes documents on the same topic.
The first experiment used 75% of the formal run data for training and the rest of the data for testing. Four
such experiments were made to obtain the result for the entire corpus. The second experiment includes the
training data used in the formal run in addition to the 75% of the formal run data. The table shows about
a 1% improvement over the formal run. This is an encouraging result: the better performance was achieved
with only 75 articles on the same topic, compared with 294 articles on a different topic used in the formal
run. The result of the second experiment also shows a good sign that documents on a different topic helped
to improve the performance. This result suggests the idea of a "domain adaptation scheme": that is, to have a
large general corpus of tagged documents as the basis, and to add a small set of domain-specific documents to obtain a
domain-specific system. Lastly, in the third experiment, we added the planet names to the location dictionary.
From the formal run result, it was clear that one of the main reasons for the performance degradation is the
lack of the planet names. The addition improves the result by 3.5%, which is better than the other trials. Although there
are several other obvious problems to be fixed, the F-measure of 86.34 is comparable to the best in-house Dry
run experiment described before (Experiment 2; F-measure = 88.62).

              Formal run            Best in-house dry run
Entity        Recall   Precision    Recall   Precision
Org.          75       83           78       87
Person        48       74           87       90
Loc.          70       87           91       95
Date          96       95           97       91
Time          95       96           98       98
Money         90       97           100      100
Percent       90       95           88       100
Overall       75       85           87       90
F-measure     79.49                 88.62

Table 4: Results of the Formal Run and the best in-house Dry run
Experiment                F-measure
3) 75/25 experiment       80.46
4) All training + 75/25   82.73
5) Add planet names       86.34

Table 5: Comparative Results
RELATED WORK
There have been several efforts to apply machine learning techniques to the same task [4] [3] [5] [2]. In
this section, we will discuss a system which is one of the most advanced and which closely resembles our own
[2]. A good review of most of the other systems can be found in their paper.
Their system uses the decision tree algorithm and almost the same features. However, there are significant
differences between the systems. The main difference is that they have more than one decision tree, each
of which decides if a particular named entity starts/ends at the current token. In contrast, our system has
only one decision tree, which produces probabilities of information about the named entity. In this regard,
we are similar to [3], which also uses a probabilistic method in its N-gram based system. This is a crucial
difference which also has important consequences. Because the system of [2] makes multiple decisions at
each token, it could assign multiple, possibly inconsistent tags. They solved the problem by introducing
two somewhat idiosyncratic methods. One of them is the distance score, which is used to find an opening
and closing pair for each named entity, mainly based on distance information. The other is the tag priority
scheme, which chooses a named entity among different types of overlapping candidates based on the priority
order of named entities. These methods require parameters which must be adjusted when they are applied
to a new domain. In contrast, our system does not require such methods, as the multiple possibilities are
resolved by the probabilistic method. This is a strong advantage, because we don't need manual adjustments.
The result they reported is not comparable to our result, because the text and definition are different.
But the total F-score of our system is similar to theirs, even though the size of our training data is much
smaller.
DISCUSSION
First, we have to consider the topic or domain dependency of the task. It is clear that in order to achieve
good performance in this framework, we have to investigate dictionary entries for the task. It may or may not
be easy to modify the dictionary. For example, a list of foreign person names written in Katakana is not so easy
to create, whereas a list of planet names is easy to find. This difficulty also exists in pattern-based methods,
but in our framework it is not necessary to create domain dependent patterns.
Currently, creating dictionaries is done by hand. One possibility to automate the process is to use a
bootstrapping method. Starting with core dictionaries, we can run the system on untagged texts, and increase
the entities in the dictionaries.
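A minimal sketch of this bootstrapping idea, assuming the tagger returns names with a confidence score; the function names and the threshold are hypothetical, not part of the actual system:

```python
# Hypothetical bootstrapping loop: run the tagger over untagged texts and
# add confidently recognized new names to the matching special dictionary.
def bootstrap(dictionaries, untagged_texts, tagger, threshold=0.9):
    """dictionaries: {entity_type: set of names};
    tagger(text) yields (name, entity_type, confidence) triples."""
    for text in untagged_texts:
        for name, etype, confidence in tagger(text):
            if confidence >= threshold and name not in dictionaries[etype]:
                dictionaries[etype].add(name)
    return dictionaries
```

The core dictionaries grow with each pass, which could in turn improve the features available to the decision tree on the next run.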
Another issue is aliases. In newspaper articles, aliases are often used. The full name is used only the
first time the company is mentioned (Matsushita Denki Sangyou Kabushiki Kaisya = Matsushita Electric
Industrial Co. Ltd.) and then aliases (Matsushita or Matsushita Densan = Matsushita E.I.) are used in
the later sections of the article. Our system cannot handle these aliases, unless the aliases are registered in
the dictionaries.
Also, lexical information should help the accuracy. For example, a name, possibly a person or an organization,
in a particular argument slot of a verb can be disambiguated by the verb: a name in
the object slot of the verb 'hire' might be a person, while a name in the subject slot of the verb 'manufacture'
might be an organization.

REFERENCES
[1] Defense Advanced Research Projects Agency, "Proceedings of Workshop on Tipster Program Phase II", Morgan Kaufmann Publishers (1996)
[2] Bennett, S., Aone, C. and Lovell, C., "Learning to Tag Multilingual Texts Through Observation", Conference on Empirical Methods in Natural Language Processing (1997)
[3] Bikel, D., Miller, S., Schwartz, R. and Weischedel, R., "Nymble: a High-Performance Learning Name-finder", Proceedings of the Fifth Conference on Applied Natural Language Processing (1997)
[4] Cowie, J., "Description of the CRL/NMSU Systems Used for MUC-6", Proceedings of the Sixth Message Understanding Conference (MUC-6) (1995)
[5] Gallippi, A., "Learning to Recognize Names Across Languages", Proceedings of the 16th International Conference on Computational Linguistics (COLING-96) (1996)
[6] Matsumoto, Y., Kurohashi, S., Yamaji, O., Taeki, Y. and Nagao, M., "Japanese Morphological Analyzing System: JUMAN", Kyoto University and Nara Institute of Science and Technology (1997)
[7] Quinlan, R., "C4.5: Program for Machine Learning", Morgan Kaufmann Publishers (1993)
[8] Sekine, S., "Homepage of data related to Japanese Named Entity", http://cs.nyu.edu/cs/projects/proteus/met2j (1997)
