Japanese IE System and Customization Tool 
Chikashi Nobata 
Department of Information Science 
University of Tokyo 
Science Building 7. Hongou 7-3-1 
Bunkyo-ku, Tokyo 113 Japan 
nova @is. s. u-tokyo, ac.jp 
Satoshi Sekine, Roman Yangarber 
Computer Science Department 
New York University 
715 Broadway, 7th floor 
New York, NY 10003, USA 
{ sekine\[ roman} @cs. nyu. edu 
1 Introduction 
This paper reports on the development of thee 
Japanese Information Extraction system and the 
Japanese Information Extraction customization 
tool. These systems are based on the correspond- 
ing English systems, the Proteus Information Ex- 
traction system (Grishman 97) and the Proteus 
Extraction Tool (Yangarber and Grishman 97), 
developed at NYU. In this paper, we will, in 
particular, describe the differences between the 
Japanese systems and English systems. 
2 The Japanese IE System 
Figure 1 shows the overall structure of the 
Japanese IE system. The system consists of a 
cascade of modules with their attendant knowl- 
edge bases, with the input text document passed 
through the pipeline of modules. Because 
document 
i 
~5~hl ~- 5h~-l-y-s~ .... t ....................... 
I morphological analysis (JUMAN) \] 
\[ name recognition \[ 
\[ partial syntactic analysis \] 
\[ event pattern matching \] 
\[ co-reference analysis \[ 
\[ inference I 
/ 
\[ template generation \[ 
extracted templates 
Figure 1: Japanese Proteus IE system architec- 
ture 
Japanese sentences have no spaces between the 
tokens, we first have to run a morphological an- 
alyzer in order to segment each sentence into to- 
kens. Although an alternative is to use the input 
sequence of characters as it is, this may cause a 
serious problem when we apply patterns because 
of ambiguities. The first module, morphological 
analysis is responsible for breaking the sentence 
into tokens. We used JUMAN (Matsumoto et al. 
97) for this purpose. It also provides the part-of- 
speech information for each token, which will be 
used in the next module. 
The second module is responsible for extracting 
named entities, like organization, person, location 
names, time expressions, and numeric expressions. 
This is different from the English system, which 
uses the pattern matching mechanism for named 
entity detection. In the Japanese system, this is 
done by an independent program which uses a de- 
cision tree technique to assign tags of named en- 
tity information to the input text (Sekine 98). 
The next two modules employ regular expres- 
sion pattern matching, applying patterns of suc- 
cessively increasing complexity to the document. 
These are essentially the same as those of the 
English system. The patterns - the regular ex- 
pressions with their associated actions, -- are en- 
coded in a special pattern specification language, 
and are compiled and stored in a separate pattern 
base. Pattern matching is a form of determinis- 
tic, bottom-up partial parsing. The numbers of 
patterns between the two systems are slightly dif- 
ferent mainly due to the difference in the methods 
of detecting named entities. 
Language Event Other 
pattern pattern 
English 51 187 
Japanese i 39 64 
Total 
238 
103 
Table 1: Number of patterns 
91 
The following explains the pattern matching 
component using some simplified examples of 
Japanese patterns. There are two phases in the 
pattern matching component. The first phase uses 
patterns to identify small syntactic units, such as 
noun and verb phrases. Thus, e.g., in order to 
analyze the example sentence in Appendix A, this 
phase will employ a pattern such as: 
np(person) np(position) 
which matches an ordered pair of noun phrases 
(up) whose lexical heads belong to the semantic 
class "person" and "position", respectively. The 
pattern's action will subsume the matched text 
into a new, larger constituent, with the side ef- 
fect that the entity corresponding to the "person" 
noun phrase will acquire a slot named "position", 
linked to the position entity. 
The second phase called event patterns applies 
domain-specific and scenario-specific patterns, to 
resolve higher-level syntactic constructions (appo- 
sition, prepositional phrase attachment), conjunc- 
tion, and clause constructions. The actions trig- 
gered by the patterns specify operations on the 
logical form representation of the sentence, which 
evolves as the pattern matching progresses. The 
logical form contains the descriptions of the enti- 
ties, relationships, and events discovered so far by 
the analysis. For example, there is a pattern 
np(person) GA np(position) NI 
vp (SHOUKAKU- SURU) 
(here, the Japanese characters are written in 
typewriter type-face) This is an event pat- 
tern and it constructs several relationships re- 
garding the person, position and the predicate 
SHOUKAKU-SURU (to promote). 
The subsequent phases operate on the logical 
form built in the pattern matching phases. Refer- 
ence resolution links anaphoric pronouns to their 
antecedents, and resolves other co-referring ex- 
pressions. We create some Japanese-specific rules 
for abbreviations and any other equivalent expres- 
sions. 
The Japanese system uses the corresponding 
English components for unification of partial event 
structures and template generation. 
3 Customization Tool 
There is not much to mention about the port- 
ing of the English Customization tool (PET) to 
Japanese. There were programming level difficul- 
ties because its window system is developed us- 
ing GARNET, a graphical system for LISP, which 
does not support the Japanese language. 
4 Evaluation 
Table 2 shows the result of the information extrac- 
tion experiment for English and Japanese. The 
English results are based on the formal MUC-6 ex- 
periment. The Japanese experiment is conducted 
using Nikkei newspaper articles on the same do- 
main, "executive succession events". We spent 
about two months of one person's part-time labor 
to develop the patterns. Actually, the developer 
created the patterns at the same time as he devel- 
oped the Japanese system. It achieves a slightly 
better score. This may due to the fact that he 
spent more time than that spent for the English 
rule creation (one month), or due to the tendency 
we found that there is a typical document style 
for this kind of articles. The latter is not clear as 
we have not investigated the English articles. 
5 Future Work 
Structural generalization 
In the English system, there is a facility to general- 
ize a pattern based on structural variation. For ex- 
ample, an active mood pattern can be generalized 
to a passive pattern or a relative clause pattern. 
A similar mechanism might be useful in Japanese. 
Japanese is known as a free word-order language. 
In principle, a pattern can be generalized based 
on this property of Japanese, but there are some 
exceptions and some heuristics should be consid- 
ered. 
Reference resolution for zero pronoun 
In Japanese, subject or object nouns can some- 
times be omitted. It is a difficult problem to re- 
cover the noun, but we found several cases where 
such technologies are needed to achieve better per- 
formance. 
6 Acknowledgments 
The English Proteus Information Extraction sys- 
tem, on which this work is based, was developed 
by Professor Ralph Grishman. We would like to 
gratefully acknowledge his assistance in the devel- 
opment of the Japanese system. 
92 
Language Corpus # of articles 
English training 100 
test 51 
Japanese training! 87 
test 90 
Recall 
61 
47 
64 
57 
Table 2: Evaluation 
Precision F-measure 
74 67.01 
70 56.40 
71 67.32 
70 62.61 
93 
Appendix A: 
=========== 
Position Company Person Status On the Job Reason 
............................................................ 
~ ~~ ~~ OUT YES Reassignment 
~ ~~ ~~ IN NO Reassignment 
~ ~~ ~X~ OUT YES Reassignment 
~ ~,~ 4~ IN NO Reassignment 
Appendix B: 
(defpattern jpn-decision4 
"np-sem(C-company) '~' j-at-date? j-at-board-of-directors? 
np-sem(C-person) np-sem(C-position) '~' 
np-sem(C-position) '~[' 
jnoun-appoint '~' '~' 3noun-jinji '~' jverb-decision: 
verb-tense=future, 
person-at=5.attributes, 
company-prev-at=l.attributes, 
person-prev-at=6.attributes, 
company-at=l.attributes, 
position-at=8.attributes") 
(defclausefun when-j-appoint (phrase-type) 
(let ((person-at (binding 'person-at)) 
(person-entity (essential-entity-bound 'person-at 'C-person)) 
(company-prev-entity (entity-bound 'company-prev-at)) 
(position-prev-entity (entity-bound 'position-'~rev-at)) 
(company-entity (entity-bound 'company-at)) 
(position-entity (entity-bound 'position-at)) 
(reason-symbol 'REASSIGNMENT) 
(verb-span (binding 'verb-span)) 
(tense-verb-word (first-word-bound 'verb-span) 
(tense-verb-symbol) 
(new-event) 
) 
(when (null position-entity) 
(error "~%when-j-appoint:cannot output appoint event.-%")) 
(not-an-antecedent position-entity) 
(setq tense-verb-symbol (intern tense-verb-word)) 
(format *trace-output* "~%-S ~S" verb-span tense-verb-symbol) 
(setq new-event 
(assert-event 
:predicate 'appoint 
94 
:person person-entity 
:position position-entity 
:company company-entity 
:reason reason-symbol 
:verb verb-head 
:verb-tense verb-tense)) 
(when position-prey-entity 
(assert-event 
:predicate 'leave-job 
:person person-entity 
:position position-prev-entity 
:company company-entity 
:reason reason-symbol 
:verb verb-head 
:verb-tense verb-tense))) 
95 

References 

Ralph Grishman: 
Information extraction: Techniques and chal- 
lenges. In Maria Teresa Pazienza, editor, In- 
formation Extraction. Springer-Verlag, Lecture 
Notes in Artificial Intelligence, Rome, 1997. 

Yuuji Matsumoto, Sadao Kurohashi, Osamu Ya- 
maji, Yuu Taeki and Makoto Nagao: Japanese 
morphological analyzing System: JUMAN Ky- 
oto University and Nara Institute of Science 
and Technology 1997 

Satoshi Sekine: NYU:Description of The 
Japanese NE System Used for MET-2 In Proc. 
MUC-7 Verginia, USA, 1998. 

Roman Yangarber and Ralph Grishman: Cus- 
tomization of Information Extraction Systems 
In Proc. Int'l Workshop on Lexically Driven 
Information Extraction, pages 1-11, Frascati, 
Italy, 1997. 
