NEC:
DESCRIPTION OF THE VENIEX SYSTE M
AS USED FOR MUC- 5
Kazunori MURAKI, Shinichi DOI and Shinichi ANDO
NEC Corp. Information Technology Research Laboratories
Human Language Research Laborator y
4—1—1, Miyazaki, Miyamae-ku, Kawasaki 216, JAPA N
E-mail: {k-muraki, doi, ando}@hum .cl.nec.co.jp
Phone: +81—44—856—2148, Fax: +81—44—856—2238
INTRODUCTION
NEC Corporation has had years of experience in natural language processing and machine translation[l ,
2, 3, 4, 5], and currently markets commercial natural language processing systems . Utilizing dictionaries
and parsing engines we have already had, we have developed the VENIEX System (VENus for Informa-
tion EXtraction) as used for MUC-5 in only three months. Our method is to apply both domain-specific
keyword-based analysis and full sentential parsing with general grammar[6, 7] . The keyword dictionary of
VENIEX contains about thirty thousand entries, whose semantic structures are sub..ME_Capability frame,
and the parsing and discourse processing are controlled with the information given in this semantic struc-
ture of keywords. The resulting scores of VENIEX for formal run *texts were from 0 .7181(minimum) t o
0.7548(maximum) in Richness-Normalized Error and 48.33 in F-MEASURES(P&R) .
SYSTEM ARCHITECTURE
The overall system architecture is shown in Fig. 1. An input text is divided into sentences and eac h
sentence is processed separately. ME_Capability frames are extracted from each sentence . An example of
the procedure of information extraction from one sentence by VENIEX is shown in Fig . 2-4.
The characteristic modules of VENIEX are as follows :
• Keyword Dictionary which contains about thirty thousand entries, whose semantic structures ar e
sub_ ME_Capability frame ,
• Parser which generates ME_Capability frames by correlating keywords during full sentential parsing,
whose process is controlled with the information in this semantic structure of keywords ,
• Discourse Processor which combines MECapability frames of each sentence .
We call this lexical-information-driven method for parsing and discourse processing "Lexical-Discourse-
Parsing". This method utilizes the merits of both domain-specific keyword-based analysis and full sentential
parsing and discourse processing with general grammar. It also reduces expenses of general parsing and
discourse processing.
Preprocesso r
This module divides an input text into a header and a body of the text, and stores the document number ,
date and source information from the header for entry into the template . It also divides the body of the text
into sentences, which will be processed separately during morphological analysis and parsing .
147
(Input: Tipster text )
	
, -;.. .. . . . . . . . . . . . . . . . . . . . . . . . .. .. .. .. .. .. .. . . .. . . . . . . . . . . . .. . .. ..... . . . . . . . . . .	 +	
	
i
	
'
( Output : templa..e-")-
Figure 1 : System Architecture of VENIE X
Dictionarie s
Our system utilizes two dictionaries, a syntactic dictionary and a keyword dictionary. Both dictionaries
are converted from the machine translation dictionaries we had developed . The syntactic dictionary contain s
about ninety thousand entries and the keyword dictionary contains about thirty thousand entries, includin g
the names of corporations, pieces of equipment, devices, place names, etc. Also, we extracted the names we
didn't have in our original dictionaries from the Tipster corpus, and enlarged the keyword dictionary .
We added semantic structures which are sub_ME_Capability frame -partial structure of ME_Capability
frame- to the entries of the keyword dictionary. Examples of the sub-frames are shown in Fig . 2.
Fig.2-a) is an example of an Entity sub-frame, which provides slots for a name and a type of an entity.
This sub-frame can provide other-name-slot whose value is a list of other names of the entity including
nicknames and abbreviations, such as "NEC" and " E " of " 8 *It% ".
Fig.2-b) is an example of a ME_Capability sub-frame, which provides slots for the process type an d
detailed information of the process including its type and the equipment used . The words to which a
ME_Capability sub-frame is added are extracted from technical term dictionaries of microelectronics and th e
Tipster corpus.
Fig.2-c) is an example of a Relation sub-frame, which is added to words representing the relation betwee n
words with an Entity sub-frame and words with a ME_Capability sub-frame . These sorts of words are gener-
ally Japanese verbs . The Relation sub-frame provides case slots with a case marker -Japanese postpositiona l
particles- representing grammatical relations . Each case slot contains a sub-slot representing whether the
filler of this case slot is a word with an Entity sub-frame or a word with a ME_Capability sub-frame . If it
Morphological
Analyzer
t
Local Parser
Preprocessor
Discourse
Processor
Templat e
Generator
148
a) Entity
ex)
b) ME_Capability
ex)
CV D
13*Asa4 `J
:/T ,f-f
;/ 4 T 4 - : Q*As 'eq
J 474 — SOU : f
~4t1z1/ 1,1:3 4tt
Vdfi~J ' Ii : CVD
~: --~	 f
CVDa IR
c) Relatio n
ex)
Figure 2 : Example of sub_ME_Capability Fram e
key :
	
:/T~fr4--
roleslots : [NAZI
Vr:
key :
	
4~rz=I/
	
l`El '- d fit
149
Subject Marker Object Marke r
Input sentenc e
s	 x#it4F1J ICVDa l
nihon-shinkuu-gijutsu C VD -so uchi
	
seisan
Entity
	
CVD-Equipment manufactur e
Figure 3 : Example of Input Sentenc e
is a word with an Entity sub-frame, the case slot also contains a sub-slot representing the role of the Entity
sub-frame to a ME_Capability sub-frame, whose value is a list of " MRI' (developer)", " M1Z. (manufac-
turer)", " kk4 (distributor)" or "J k/ 11~ (purchaser_or_user)" . Therefore, the Relation sub-frame in
Fig.2-c) means that :
• The Japanese verb " l (manufacture)" has two case slots .
• The filler of the first slot with a subject marker " 7r" is a word with an Entity sub-frame, whose role
to a ME_Capability sub-frame is " *AZ' (manufacturer)" .
• The filler of the second slot with an object marker " " is a word with a ME_Capability sub-frame .
Morphological Analyzer
This module divides input sentences into morphemes and gives every morpheme lexical attributes wit h
a syntactic dictionary and a keyword dictionary. For example, an input sentence " R *Acs Ofr CVD
I®.1~ 6 0 (Nihon—shinkuu—gijutsu manufactures a piece of CVD-equipment.)" is divided as shown in
Fig. 3, and the semantic structures shown in Fig. 2 are given to morphemes " 5 * I!#5t$T ", "CVD fib "
and "IS " .
If a morpheme is encountered that doesn't exist in the dictionaries, it is marked as an unknown wor d
and its part of speech is estimated from neighboring morphemes . For example, in a text that contains
many nouns, the recognition of unknown words becomes an important function because these words ma y
be important proper nouns . Numerical values are also tagged with the same kinds of information as word s
because they often perform as content words, and are often useful for determining sentence structures .
Local Parser
This module re-collects morphemes given by the Morphological Analyzer and produces phrases . It
also combines the ME_Capability sub-frames given to the words in a phrase, and assigns a new combine d
ME_Capability sub-frame to the output phrase.
This module also deduces keywords from particular suffixes and patterns . For example, the nouns pre -
ceding the suffix "it " or " .t4± " is considered as business entities . Unknown noun preceding parenthese s
inserted a place name can be business entities, too .
150
Sentence Structure
ICVD f Y . o
Entity
	
ME_Capability Relatio n
ME_Capability Fram e
Figure 4: Example of Information Extractio n
Parser
This module re-collects phrases produced by Local Parser and outputs parse trees and semantic structures ,
which are ME_Capability frames. Its function involves not only parsing but also semantic interpretation ,
lexical disambiguation and information extraction. The main body of the analyzer is a unification-base d
chart parser, and the parsing strategy is bottom-up breadth-first. The solution with the highest preferenc e
score is selected. Our Local Parser and Discourse Processor are based on the same parsing engine and diffe r
only in parsing rules. Sharing engines and functions by modules, we can efficiently develop the VENIEX
system.
The Parser can handle a wide variety of complex sentences . It analyzes and generates modifying con-
nections between phrases and the relation between keywords. It constructs semantic structures which ar e
ME_Capability frames from sub-frames described in a keyword dictionary by correlating keywords during
full sentential parsing, whose process is controlled with the information in the sub-frames . For example, as
illustrated in Fig. 4, the Parser recognizes the structure of the sentence " B *X=#i*>3 t CVD l~'k l $
Z . " and constructs a ME_Capability frame from sub-frames shown in Fig. 2.
This module can also deduce keywords. If an unknown noun fills a Relation sub-frame's case slot whose
filler must be a word with an Entity sub-frame, this noun can be considered as an entity.
In addition, the Parser recognizes special expressions whose sub_ME_Capability frames are used for dis-
course processing . It selects the most important Entity sub-frame and the ME-Capability sub-frame, an d
also analyzes the Entity sub-frame and the ME_Capability sub-frame represented by anaphoric expressions .
~-f usirk I:7 .1=
,f ? 1) /
	
IJ : CVD
--~ 	 f	
221.1 : CVD t
s ;/ r 4 T 4 —~ : 8 *A_a
T'fT 4
151
Parser keep these sub-frames respectively in "currentEnt" slot, "currentME" slot, "anaphorEnt" slot an d
"anphorME" slot . We will later show examples of these slots with a walkthrough example.
Though it is not illustrated in Fig . 1, VENIEX has another module as a fail-safe system between the Parse r
and the Discourse Processor. If the Parser cannot analyze an input sentence and outputs only fragments of
ME_Capability frame, this Postparser module re-collects and combines the fragments without considering
the sentence structure .
Discourse Processor
This module combines ME_Capability frames generated by the Parser into frames representing content of
the whole article . It recognizes relation among the ME_Capability frames by resolving co-reference for entities
and microelectronics . The co-reference resolution is achieved by unifying "currentEnt" with "anaphorEnt "
and unifying "currentME" with "anphorME". VENIEX can resolve co-reference represented by a wide vari-
ety of expressions: anaphoric expression (identical and unidentical), cleft sentence, ellipsis, name of Entities,
etc[7] . We will later show an example of this process with a walkthrough example as well .
Template Generato r
In VENIEX, the outputs of the Discourse Processor are ME_Capability frames. In other words, essential
information has already been extracted during morphological, syntactic and discourse analysis. All that
remains is to transform the frames and the information of the input article stored by the Preprocessor to
the output templates in the official form .
PROCESSING WALKTHROUGH TEX T
Overview
VENIEX has two steps for ME information extraction; 1) extracting ME_Capability frames separately
from each sentence, 2) combining the frames above into frames representing content of the whole text . This
method has two tasks in constructing a body of knowledge with small pieces of information contained in
more than one sentence . First, it must construct new information with pieces of partial information scattere d
in different sentences . Second, it must identify identical information represented by different expressions .
VENIEX attains these tasks by discourse processing on surface expressions focused on ellipsis, anaphor a
and so on . The walkthrough text, however, has discourse problems that can't be solved with that particula r
surface process. Therefore VENIEX can't merge the information sufficiently and outputs two ME object s
for only one ME object in the text. Also, VENIEX fails in complement of ellipsis and extracts only on e
entity for two entities. As a result, the evaluation of walkthrough text is 66.67 P&R.
Morphological Analyze r
The Morphological Analyzer divides a sentence into morphemes and assigns corresponding syntactic in-
formation to each morpheme using the syntactic dictionary. At the same time, it assigns some information
from the keyword dictionary to morphemes.
Fig. 5 below shows the result of morphological analysis of the 1st sentence of the walkthrough .
/*4'W /MA
{key Ii {slot ['WWI) {csh ~+t ; ckey
	
' 5 4 4 — ; —; rolslots (*MC }
4E1J{csh 'k ; ckey -f ci .x.
	
Malt DI}
/Rff
{key -7 4E7 LL' 1-U.=
	
{ Ili top{RIR f}}}/q)/ ; ;
{key [ 'f {slot ["YI'PI'gJ {csh 711 ; ckey = :/T 4 T 4 — ; rolslots Mt, Rat,
152
*WiJ{csh t ; ckey 7 ,( 13 -T-L~j Fo='
	
fl }]}}
/) ( - - t – / . / 13 *AM*
{key x Yt4T4 — { :'5 4T4-8U k ; spell *Asa }}
/ (/*4±/4 *iIR *'-
{key 1(b: , {gazette B* (C1) **III (W) *4
	
(tti) ; level 3}}
/rti/. /#t$c/Vf/11i1/1 /) /li/ ;♦; 1
{key Hk {gazette *RI (®) ; level i} }
/ 0)/*/Jj
{key t% {slot Ef i {csh ti* ; ckey .X. :/5 -45 -– rolslots [ i ] } ,
1~~]{csh ; ckey 7 .(gnxL4 F o =Agtit }]}}
/M/4 /VP/0)/>ikt$
{key 9tl {slot [4S',j {csh 0; ckey x-:/t 4 T 4 – ; rolslots [PO' A', Wig , EMI ,
~~{csh ; ckey '7 4 . 1}]}}
is / BTU'f ./4' –tintri
{key SY5-4T4 — {x ;/t t4–SU 11 ;
SU k [e— T4 —3 — 4
	
—~ stb,
	
4 .2----f Yy–~ ;/at)L] ;
spell BTU'f ~q•-~vatl~'}}
/#t/ (/*4i/7 4tI-a–t 71ii
{key at, {gazette *IM (1) 7t'l%m – t y 7 (W) ; level 2}}
/e/*I i
{key,( Z {gazette i+~® ([ 1) ; /}levely~i}}
/I'-/ p#/*4t/ {6 /LL/♦ /* 4$/IC
{key X* {slot [ f * f fn ] {csh 0; ckey x ' 4 T 4 — ; rolslots [
	
] } ,
14~ra7{csh ; ckey -74 13xL4 F Q= 7 A41E1E }]} }
/0/1W/MR
{key 74ux.LLq rn-- .7.4ai'l{ Mk top{ RR Ran }
/0)/e-D //
	
/ itXMIll
	
(4MfJ @AAtk) RI
{key '74 rrx L7 F ti.-L AI.2 {
	
L'f i"] :/7 {1ASU ®®CVD ; 74 11,A *% ;
RR ii { WWII ®®cVD
	
}}}}
//& . 11M&•W4
{key I10' {slot [ i J {csh
	
ckey / 4 t 4 — ; roleslots [*III,
	
o° ] }
II'L-]{csh 'k ; ckey z '( 'tixLy 1- 0=7.51it } ,
ftig4i] {csh orO ,
ckey .x,/T4T4– ; roleslots [JJIA / illfi ]}]}}
{key 4$11i1iiH{slot [f*-N''i{csh 0 ; ckey x ./t4T4 -
t J {csh T ; ckey 11f4 }}] }
/ L/f:/0 /
Figure 5: Walkthrough — The result of morphological analysis —
The notation "/" in Fig. 5 is a delimiter of two morphemes. A morpheme recognized as a keyword
is followed by a corresponding sub_ME_Capability frame, which is a partial structure of ME_Capability
frame, loaded from the keyword dictionary. For example the word " *A'' , which means "manufacturing",
has information that the entity which appears as the subject plays a manufacturer part of the object, th e
ME_Capability frame. The word " 8 *AM* ", which is a company name, has information that the type
is company. The word " *Xli C V D ((t*71l*l ) ", which means "CVD equipment", conveys
information that the type is layering and that the film is metal, and implies the existence of equipment .
VENIEX gathers the sub_ME_Capability frames and combines them into the ME_Capability frames .
Local Parser
The Local Parser recognizes a Japanese phrase by utilizing local patterns and the syntactic informatio n
given by the Morphological Analyzer . The Local Parser combines sub_ME_Capability frames in one phrase .
The way of combination differs according to the sort of keywords ; " s :/ T 4 t 4 -- " (entity), " 4f n S-
I .
153
L P' l q 4 .z gg " (microelectronics), " BM " (relationship) and so on.
The output of the Morphological Analyzer is shown in Fig . 6.
/ *A Sli ; li tc .'o)
{key -74
	
I-
	
A*a{ Mk top{ aft Man}
/ * L–2i-,
{key P4* {slot [ r~4 {csh ht; ckey x- :/f- 4 4 — ;
roleslots W M, *AL MIM },
1 W sJ {csh T. ;
	
ckey
	
~(q 1:7x-l/ 4 F Q= 4 ~)l~ }] }}
/ B* ##4i (*414M)IIA3I:O r1 . 4iAA*t**fc)y~1±
{key xYT 4 ? 4 — { dT
	
(®) *OM (A) 7J'~ (i) ;
spell B *>J311;
.i. :/5-4 .t.4 –£U A }}
/CI S
{key
	
,{gazette *I1 (Q)
/ *64*
{key 110 {slot (1 ) {csh
14A''6] {csh
/i e
/W $.
{key DIM {slot [424 3 ) {csh
	
ckey - ' 4 T 4 — ;
roleslots ma*, '>Sl>] `, RIM},} ,
1847 {csh T ; ckey D}]}}
/BTU-f Yy — t'/a i )l.#t (*±7-~tt .–•h Ii)
{key
	
{ ? * ld7? * l (l) ?his--h'7 MI) ;
spell BTUT Yy—t's*'>V;
SU* [Ls— TT —3— 4
	
t: — 4 :7---4
	
)l] ;
x :/T4 T'( –Al
	
}}
/IW tta)
{key ~ ~(q q x L' 4 I• u z
	
fl {) top{ R1 Ra }}}
/ t
	
/
	
/1MMfii C V D (1) 1W
{key -74oxL7h p =7 .Z4E6t {7i1 V4tJY{1I EMU ®®CVD ; 74)lA Mi ;
RR VW{ IMP ®®cvD Mt }}} }
/ &
{key MA {slot Ef
	
J {csh
	
ckey x : /5 -45 -4 — roleslots
	
,
	
] },
4a'4) {csh ; ckey
	
nxl/ I` nr 4,. fl },
.PPE'433 {csh or($ ,
	
'. e) ;
ckey x YT4T4 — ; roleslots
	
I]
/ k / **Lk.
{key 4 ii.
	
{slot [f4 W J {csh fit; ckey xYT 4 T 4 — } ,
rt'itl {csh 1 ; ckey 11A }] }}
Figure 6 : Walkthrough — The result of local parsing —
In Fig. 6, the entity " B *Ast f " acquires the new information by extracting the keyword which shows
its location in the identical phrase .
Parser
The Parser recognizes the syntactic structure of each sentence in the input text . An ME_Capability frame
in a phrase is combined with corresponding ME_Capability frames in other phrases if they have syntacti c
level 1}}
; ckey
	
44 — ; roleslots [Jtlt] }
; ckey ,f
	
Er .-r- MAN DI}
{key b, {gazette fL1 (WI) ; level 1}}
/ # 4i'lr/n A. /'`
	
'jam
{key I10* {slot [ MaJ {csh bt; ckey .x. :'/ -I- 4 T T — ; roleslots []
	
`] },
4-RAJ {csh ; ckey 74 gaxV4 FQ=7.c
	
}]}}
154
relations. The way to combine the frames depends on the sort of each of the keywords .
The output of the Parser to walkthrough text is shown in Fig. 7. In Fig. 7, the number following a
notation of "_" is an index. If two objects have a same index, these objects are the identical.
+++ 00 +++++++++++++++++++++++++++++++++
{entities [xYT45- 4 — { T( *IA (®) 7-t!i- .3.--h .y7 (lam) ; spell BTU4L5' — ~' /a-')1';
SURF [e— T4 — .—4Y9—fvajJL, }—743--x(%9—.1- :/al-)t,] ;
x /T4T4-$U *Al 389131362,
x:44 — %744-SU MA; spell 8 4 *1;
	
f~f 8* (®) 441011I (A)
	
(A') } 99129292] ;
total [74 'i L7I R- .zCEit{MA top{Mt
	
{*,
	
q }} ;
MR* q ; *AV q ; >
	
q ; J A '/ U
	
q } ,
R xlit{Mat q ; J*At/'JJBg q ;
fj 1/4 fi'J ,/y {74 )l' A Itg; JA$U ®®CVD ;
Vitt M R { i U ®®CVD
	
$
	
[x'5•474 — 99129292]}};
WAS' [x'T4T4— 99129292] ;
°a [x ' 4 T4 — _89129292]). .89129293,
	
~•(gRxL7l ri=p,-tit {fib top{ *lit
	
{>6I> q }} ;
Jg1 g/TUJ1lg 0 ; keg q ; Slag q ; ;OA4' q }] :
currentEnt :/'I-45-4 4 — 99129292;
currentME V4 4 RT-1/ I• c~ = JAM 99129293}
+++ 01 +++++++++++++++++++++++++++++++++
{entities [x ' 4 T 4 — {spell 13 * 7 4 ;;j~ aflx' ?4 4—SU
	
} ..92602607] ;
total ['~ 'f R Lx 7 I` R % Agtfi {MAt/TUJ
	
q ; Vq 9 q ;
fj top{ MR l { Slag [x LT4 T 4 — _92602607]}};
> > 'tV [s 't4 T4 - 92602607] ;
Piel* [x JT4 T4 — _92602607]). 92602606] ;
anaphorfE v4 4 us L ), R= AWE 92602606;
currentEnt :/54 4 — .92602607;
currentHE 4 7 R x L' I- R=4 *a 92602606}
+++ 02 ++++++++++++++
{entities [x/?4T4–{x,/T4T4-2U ,1 ;
SUS [e— 5- 4—7—7)1,/iyr/, e—T 4.L—7)11 i '] ;
spell BT Ti7 A,/I•' } 96207866] ;
anaphorEnt x ' 4 4 — 96207096}
{}
+++ 04 +++++++++
{}
+++ 05 ++++++++++++
{entities [= :/t 4 7 4 — { m :/f. 4 41y—SUc Ike ; spell 8
	
a4l } -102793590] ;
total (v /
	
x.L I•R= AfIl BG{PIIRI q ; IAMB/ II yc q ;
*j& L4 t J :/y {AN ®®CVD ; 74)1,A ika ;
MN RIKMIA' [x:/t4T4
102793590] } };
]ll14 [x %T 4 T 4 — _102793590] ;
lea ` [x %7474 — _102793416,
x. %T 4 T4 — 102793590] } _102793419] ;
anaphorEnt T. '? 4 T 4 — 102793416;
currentME -74 7 R x L I' R=rl Mg 102793419}
+++ 06 +++++++++++++++++++++++++++++++++
155
{anaphorEnt ' 4 T 4 — _108363917 ;
anaphorME v 4 4 q s V }. q
	
42glIgMt/fUlit 0 ; k 'g 0 ; M`A' q ;
>j` [= JT 4 T T — _108363917] ;
fj top{ cl Riff{
	
[s :/44 — _108363917]} }
} _108363918 ;
currentEnt :/-t4 108363917}
Figure 7: Walkthrough — The result ofparsing —
In the 1st sentence (No . 00), for example, manufacture and distribution on the CVD equipment is
extracted. The sentence consists of two simple sentences and the extracted information lies in the 2nd simple
sentence. VENIEX recognizes that these two simple sentences share the nominative case, and combines th e
ME_Capability frames according to the path of information ;
{"Q*As#0 "_" rlLt0 "-{"tZ
	
" - "CVD E "}}
({Entity — "agree" — {"manufacture and distribute" — Equipment}}) .
This results in VENIEX extracting the ME_Capability frame, as shown in Fig . 7.
The Parser in VENIEX extracts 5 ME_Capability frames from 4 sentences ; the 1st sentence, the 2nd sen-
tence (No. 01), the 6th sentence (No . 05) and the 7th sentence (No . 06). As for the 6th sentence, VENIEX
succeed in extracting 2 ME_Capability frames from the noun phrase, " 8 *A # Miff Frig • Pkn L T
to 3 C JtJ CVD
	
", and an entire of sentence .
Meanwhile, the Parser keeps a sub_ME_Capability frame, which appears as an entity or microelec-
tronics in the sentence, respectively in "entities" slot and "currentME" slot . Additionally, for an entity
(sub_ME_Capability frame), which is the subjective case in the given sentence, the Parser keeps it in "cur-
rentEnt" slot. For a sub_ME_Capability frame represented by anaphoric expressions, the Parser also instan -
tiates "anaphorEnt" slot or "anaphorME"slot. After anaphora resolution, it puts referred "currentEnt" slo t
or "currentME" slot in corresponding "anaphorEnt" slot or "anaphorME"slot.
Discourse Processor
The Discourse Processor merges ME_Capability frames the Parser output by utilizing "entities" slot ,
"currentEnt" slot, "currentME" slot, "anaphorEnt" slot and "anaphorME" slot. For example, the 2nd sen-
tence in the walkthrough carries information that the entity " *As 'J " distributes some equipment
and the equipment which appears in anaphoric expression " Jiff " refers to the CVD equipment in the
1st sentence. In processing the 2nd sentence, the Discourse Processor recognizes the expression " fit "
as an anaphoric expression and instantiates the "anaphorME" while extracting sub_ME_Capability frame
(see Fig. 7). It checks consistency between the "anaphorME" slot and the "currentME" slot instantiated in
processing the previous sentence and identifies these as the same object .
VENIEX makes two mistakes in discourse processing for walkthrough text .
One appears in ellipsis processing . Ellipsis of nominative in the 6th sentence must be resolved for ex-
tracting the entity which is the distributor. The distributor entity must be "BTU 7 )t'!< i 4 " in the 3rd
sentence because it is clear, according to context, that the 2nd paragraph is written about its activities. But
VENIEX selects "currentEnt" slot which is the nominative of the 2nd sentence, because it lacks knowledg e
to process a paragraph or joint venture .
The other mistake is caused by failure in merging ME_Capability frame in the 1st paragraph with one in
the 2nd paragraph . These frames must be identical objects because the topic of the article is a joint venture,
and a joint venture distributes often products of parent company .' (We think, however, that the equivalence
of the CVD equipment cannot be decided based only on these clues, and it is possible to interpret that thi s
equipment are different.) VENIEX processes all ME objects separately when there is no specified referentia l
156
expression.
As a result, VENIEX output the template shown in Fig . 8.
< T :/ F -000452-1> : _
-i': 000452
Rt'7Ci 8 : 890804
=Is – .z tlipTf: " 8 #UBifr AE **MI '.
Fl : < z4 q ux1• u=x OE -000452-1 >
< ''q 4 7 Er = L F E= Rlalt -000452-2 >
T R B : 930820
IileP UU : 2
<-74uxL71. 13 Rant -000462-1> : _
MA: < L4fi') Y-000462-1 >
a 4: < .x. :/ t- 4 T 4--000452-1 >
,°`: <xLT4T4--000452-1>
<-74 u = 1, P > u . - M - 0 0 0 4 5 2 -2 > : _M : < L
4 ") ' -000452-2 >
alb `: < x :/ 5- 4 T 4 - -000452-1 >
Mg: < x%T4T4--000452-1 >
< x T 4 T 4- -000452-1> : _
7G 'T41' 4
	
El
	
l 1P1
*>7"r : E3* (1) 4**J1l (!~) *')-ftf ( )
x' 4T4--$U :
< L4 fi') ' -000452-1> : _
SSU : CVD
74 )le A :
< 1 t9 :/-0 0 0 4 5 2 -/4 : =
S.211 : CVD
74 ILA : CA
it
	
<
	
-000462-1>
<MK -000462-1> : _
>`: < x T 4 j-' 4
RCM : CVD N
#iE : MP 113
Figure 8: Walkthrough — The template —
RESULTS AND FUTURE WORK
The resulting scores of VENIEX at formal run were from 0.7476(minimum) to 0 .7858(maximum) in
Richness-Normalized Error and 47.41 in F-MEASURES(P&R), which are shown in Table 1. We have im-
proved the system a little after the formal run -only by debugging parsing rules, not by adding new rules
and/or dictionaries-, and the current scores of VENIEX for formal run texts are from 0.7181(minimum)
to 0.7548(maximum) in Richness-Normalized Error and 48.33 in F-MEASURES(P&R), which are shown in
Table 2. The current scores for dry run texts are also shown in Table 3.
Though we have developed the VENIEX System in only three months, there wasn't so much difference
in scores with other systems in MUG-5 . But the scores were lower than we had expected. The main reaso n
is the lowness of recall rate . We didn't have enough time to collect keywords, especially verbs representin g
the relations between entities and microelectronics.
We have developed many new functions for the MUG5 system, such as co-reference resolution and key-
word deduction . We have been evaluating these functions separately to judge weather they worked as w e
designed . For example, to evaluate the performance of keyword deduction function in the Local Parser and
the Parser, we made an information extraction experiment without dictionaries of entities . The resultin g
157
a) From the error-based score reports
Richness-Nomalized Error
ERR UND OVG SUB Min-err Max err
67 55 30 14 0.7476 0.7858
b) From the recall-precision-based score reports
REC PRE P&R (F-Meseasure )
All-Object 39 61 47.41
Text-Filtering 66 83 -
Table 1 : Summary of our MUC-5 Score
a) From the error-based score reports
Richness-Nomalized Error
ERR UND OVG SUB Min-err Max-err
66 55 26 14 0.7181 0.7548
b) From the recall-precision-based score reports
REC PRE P&R (F-Meseasure)
All-Object 39 64 48.33
Text-Filtering 68 85 -
Table 2: Current Scores for Formal Run Texts
a) From the error-based score reports
Richness-Nomalized Error
ERR UND OVG SUB Min-err Max-err
53 39 21 10 0.5566 0.5963
b) From the recall-precision-based score reports
REC PRE P&R (F-Meseasure )
All-Object 55 71 61.91
Text-Filtering 88 91
Table 3 : Current Scores for Dry Run Text s
a) From the error-based score reports
Richness-Nomalized Error
ERR UND OVG SUB Min-err Max err
73 65 25 14
	
_ 0.7639 0.8029
b) From the recall-precision-based score reports
REC PRE P&R (F-Meseasure )
All-Object 30 64 40.87
Text-Filtering 59 85
Table 4: Scores of Experiment without Entity Dictionary
15 8
scores for formal run text are shown in Table 4 . The result says that this function works well. Through
the development of VENIEX system for MUG-5, we have learned that we can realize information extractio n
system with our natural language processing techniques . But to improve the system, we must make more
detailed evaluation of performance of each function .
One of the biggest theme of future work is automated or semi-automated training of the system . We
plan to develop a bootstrapping method to improve the system with iterating cycles of "refining system" -
"evaluating the performance".
References
[1] Muraki, K., "VENUS: Two-phase Machine Translation System", Future Generations Computer Sys-
tems, 2, 198 6
[2] Ichiyama, S., "Multi-lingual Machine Translation System", Office Equipment and Products, 18-131 ,
August 1989
[3] Okumura, A., Muraki, K. and Akamine, S., "Multi-lingual Sentence Generation from the PIVOT inter -
lingua", Proceedings of MT SUMMIT III, July 1991
[4] Doi, S., Muraki, K ., Kamei, S. and Yamabana, K ., "Long Sentence Analysis by Domain-Specific Patter n
Grammar", Proceedings of EACL 93, April 1993
[5] Yamabana, K ., Kamei, S. and Muraki, K., "On Representation of Preference Scores", Proceedings of
TMI-93, July 1993
[6] Ando, S., Doi, S. and Muraki, K ., "Information Extraction System based on Keywords and Text Struc-
ture", Proceedings of the 47th Annual Conference of IPSJ, October 1993 (in Japanese )
[7] Doi, S., Ando, S. and Muraki, K., "Context Analysis in Information Extraction System based on Key -
words and Text Structure", Proceedings of the 47th Annual Conference of IPSJ, October 1993 (in
Japanese)
159
