AN ANALYSIS OF THE JOINT VENTURE JAPANESE TEXT PROTOTYPE
AND ITS EFFECT ON SYSTEM PERFORMANCE
Steve Maiorano
Office of Research & Development
Washington, D.C. 20505
email: maiorano@crl.nmsu.edu
BACKGROUND 
The TIPSTER Data Extraction and 
Fifth Message Understanding 
Conference (MUC-5) tasks focused on 
the process of data extraction. This
is a procedure in which pre- 
specified types of information are 
identified within free text, 
extracted, and inserted 
automatically within a template. 
Three TIPSTER contractors -- BBN, 
GE/CMU, NMSU/Brandeis -- 
participated in the August '93 MUC-5 
evaluation for both the English 
joint venture (EJV) and English 
microelectronics (EME) domains and 
their Japanese-language 
counterparts, the JJV and JME
applications. Two other contractors
-- SRI and SRA -- participated in
the EJV and JJV domains alone. CMU's
Textract system took part in the
Japanese-language domains only. Of
the five systems that tested in both 
English and Japanese, all but one 
scored higher in the Japanese- 
language applications according to 
both the summary error-based scores 
and recall/precision-based metrics.
This overall result has led some
participants and observers to 
suggest that Japanese is an "easier" 
language than English. 
Japanese-language usage in the total 
1297-article JJV corpus exhibits the
same degree of ellipsis-generated 
vagueness and ambiguity as in other 
domains and genres of Japanese 
writing. In matters of information
presentation, however, JJV articles
are very formulaic. This paper argues
that the stereotypical structure of
the topic sentence in the JJV corpus,
together with the "default" pattern
of certain template fills, gives the
Japanese systems a ready basis for
extracting information and inserting
it into a template. The result is
better overall system performance
in JJV than in EJV, as indicated by
the scoring metrics.
METHODOLOGY 
The argument outlined in this paper 
is based upon a discourse analysis
of two portions of the entire 1297-
article JJV corpus: the 150-article
JJV test set and 100 randomly
selected development-set articles.
In addition, a descriptive analysis
was performed on approximately 50 
JJV test articles and corresponding 
template results for varying 
combinations of the six systems that 
participated in MUC-5; all six 
systems, however, were analyzed on a 
subset of 12 selected articles, or a 
total of 72 individual template 
results. The entire descriptive 
examination is motivated by a desire 
to understand better the various 
systems' capabilities in order to 
make the numerical results more 
tangible to potential users. The 
assumption is that one can construct 
a composite performance-based 
description for each system derived 
from the analysis of individual 
templates, and that the resulting 
snapshot -- what the system actually 
does -- will be more comprehensible 
to users than the theoretical model 
of a system outlined in a technical 
summary -- what it should do. 
Although the discourse analysis has
not yielded a full-blown discourse
structure for the JJV corpus, the
most essential element of the
evolving top-down paradigm, the
topic sentence, is identified. Any
attempt to formulate a complete
discourse paradigm for JJV must 
first deal with this sentence. It 
contains much information 
significant in its own right and -- 
more to the point for data 
extraction -- relevant to template 
insertion. In fact, most of the 
time the topic sentence contains all
the minimally required data for 
instantiating and tracking a tie-up 
relationship. 
This paper first examines the 
stereotypical nature of this topic 
sentence -- hereafter referred to as 
an article's "Impact Line" -- before
moving on to a discussion of the
"default" mechanism. The Impact Line
prototype operating in conjunction
with the instantiation of certain
high-percentage slot fills
("defaults") provides a proficient 
extraction heuristic and 
corresponding salubrious 
quantitative effect upon system 
performance. 
JJV DOMAIN AND THE IMPACT LINE 
The JV application focuses on 
tracking tie-ups between at least 
two entities. It is necessary, 
therefore, to 1) identify the
entities engaged in some business 
activity or development project and 
2) to confirm that the arrangement 
between them is a tie-up 
relationship. Therefore, for the 
Impact Line to have any "impact" at
all in this application, its
prototype should at least contain 
the information necessary in 
fulfilling the above criteria. 
Two definitions of the prototypical
Impact Line, version 1 and version
2, are presented below. Version 1
discusses the data items necessary
to meet the above-mentioned criteria
for generating a tie-up: two
entities and the indication of a
tie-up. In order to show how the
structure of this version-1 Impact
Line facilitates the identification
and extraction of these data items,
moreover, the first definition
discusses the grammatical role of
the Japanese topic marker (は) "wa,"
its importance in marking relevant
proper nouns in the JJV corpus, and
the Impact Line's verbal element.
By this definition, 81% of the JJV
test set is Impact Line
prototypical.
Version 2 is a more restrictive 
definition requiring the presence of 
two more extractable data elements 
in the Impact Line in addition to 
the criteria of version 1. The
second definition, therefore, 
discusses the types and distribution 
of Impact Line data items. This 
version of the prototype occurs 65% 
of the time. 
DEFINITION OF THE PROTOTYPICAL 
IMPACT LINE (VERSION 1) 
(1) IMPACT LINE TOPIC MARKER
(GRAMMATICAL FORCE) 
In the same way that the Impact Line
is crucial to developing a complete
discourse paradigm for JJV, or
perhaps any domain of Japanese
newspaper articles,[1] any discussion
about what constitutes a
prototypical Impact Line must start
with the Japanese topic marker
(<TM) "wa," whose role as designator
of the Impact Line's grammatical
subject is predominant in the JJV
test corpus. The "wa"-designated
subject sets the tone for the Impact
Line as the Impact Line does for the
JJV article.

[1] I am just beginning to analyze newspaper
"announcement" articles in other domains, such as
JME, to see if the Impact Line prototype has validity
and can form the basis for a metamodel that is not
domain specific.

In Japanese discourse generally,
"wa" is a particle that indicates
the theme or topic of a sentence and
as such often, but not always,
corresponds to the subject of the
sentence. Perhaps just as often
"wa" serves to highlight or
topicalize other pieces of
information, while the particle "ga"
marks the subject. For example:
Kono hon wa Ken ga yonda.
(Speaking of this book, Ken has read
it.)
Eigo wa Ken ga umai desu.
(With regard to English, Ken is
skillful.)
The subject Ken is designated by ga
and the topic by wa. However, when
the subject or agent of the action
is also the sentence topic, wa
marks the grammatical subject. For
example:
Ken wa kono hon o yonda.
(Speaking of Ken, he read this 
book.) 
It is this latter grammatical 
function of "wa" as the sentence 
topic and agent-of-action 
designator that predominates in the 
JJV test articles. Example 1 
below is #2638 from the JJV test
set: 
PN-Subject <TM Numeral+N
Tokyo Marine & Fire 17th
N Prt NP
English big/gen'l/insur
N PN
comp. Commercial Union
N N PN Prt
comp. hqs. London with
VP Prt VP
business/tie-up/did announcem't/did
Translation: 
Tokyo Marine & Fire [Insurance
Co.] announced on the 17th that it
has concluded a business tie-up with
a large English general insurance
company, Commercial Union 
(headquarters London). 
Given the grammatical importance of 
"wa" in indicating the subject of 
the Impact Line, this function takes 
on added significance in the JV
domain where the identification of
tie-up entities in a tie-up
relationship triggers the extraction
process. The Impact Line topic
marker in JJV articles is a reliable
designator of proper nouns that are
valid tie-up partners to be
extracted and inserted into the
template. In fact, in 117 Impact
Lines out of 145[2] JJV test-set
articles (81%), "wa" marks at least
one tie-up partner;[3] and
this tie-up partner is not simply
the Impact Line topic, but the agent
of action as well.

[2] Five of the 150 test-set articles produced a template but not any tie-ups because they were about
either sister-city relationships or talks that were broken off. Therefore, the baseline figure that will be
used hereafter in discussing the JJV test set is 145.
[3] There was a similarly high percentage of 79% for
100 randomly selected JJV development set articles.

Furthermore, in 19 instances out of
those 117, the topic marker is
preceded immediately by two proper
nouns designating two principal
tie-up partners. Typically the
structure will look like Example 2
below:
(Ex.2)
PN Conj
Japan/IBM and
PN <TM
Sumitomo Electric
The conjunction と ("to") binds the
two entities IBM Japan and Sumitomo
Electric as co-subjects. Alternately
this paradigm allows for modifiers
before either or both of the
entities (Examples 3 -- 5):
(Ex.3)
Toyota and US carmaker GM <TM
(Ex.4)
Japanese carmaker Toyota and GM <TM
(Ex.5)
Japanese carmaker Toyota and
US carmaker GM <TM
Thus far, the prototypical Impact 
Line can be encapsulated in the 
following short notation: 
..... X 
where X is a principal tie-up
entity and the ellipsis marks allow
inclusion of multiple subjects as
shown in Examples 2 -- 5. It is
important to note, moreover, that
whether modifiers precede an ENTITY-
designate or not, or whether a
conjunction is present or not, the
topic marker "wa" is preceded
immediately -- in the grammatical
sense -- by an entity that is a
principal tie-up partner. Twenty-
one of the 117 "wa"-designated
entities are preceded immediately by
information about the entity -- such
as location -- enclosed in
parentheses, rather than the entity
name itself. For example:
Nikko Securities (hqs. Tokyo) <TM 
Orthographically this may be 
misleading, but grammatically the 
topic marker indicates the entity, 
not its headquarters location. 
Therefore, such cases retain their 
prototypical validity. 
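The topic-marker pattern described in this section can be sketched in code. What follows is only an illustrative sketch under stated assumptions, not a description of how any MUC-5 system worked: it assumes the Impact Line has already been segmented and tagged, and the tag names (PN, MOD, CONJ, PAREN, TM) and the function name are hypothetical.

```python
def wa_marked_entities(tokens):
    """Return entity names that grammatically precede the first topic marker.

    tokens is a list of (text, tag) pairs. The tags are assumptions made for
    this sketch: "PN" proper noun, "MOD" modifier, "CONJ" the conjunction
    "to", "PAREN" parenthesized information, "TM" the topic marker "wa".
    """
    tm_index = next((i for i, (_, tag) in enumerate(tokens) if tag == "TM"), None)
    if tm_index is None:
        return []

    entities = []
    expecting_entity = True  # walking leftwards, an entity name is needed first
    for text, tag in reversed(tokens[:tm_index]):
        if tag == "PN" and expecting_entity:
            entities.append(text)
            expecting_entity = False
        elif tag == "CONJ" and not expecting_entity:
            expecting_entity = True  # a co-subject may precede the conjunction
        elif tag in ("PAREN", "MOD"):
            continue  # parentheticals and modifiers do not end the subject phrase
        else:
            break  # any other token ends the subject phrase
    return entities[::-1]


# Romanized versions of the Nikko Securities example and Example 5.
nikko = [("Nikko Securities", "PN"), ("(hqs. Tokyo)", "PAREN"), ("wa", "TM")]
example5 = [("Japanese carmaker", "MOD"), ("Toyota", "PN"), ("to", "CONJ"),
            ("US carmaker", "MOD"), ("GM", "PN"), ("wa", "TM")]

print(wa_marked_entities(nikko))
print(wa_marked_entities(example5))
```

The parenthetical is skipped rather than returned, which mirrors the point that the topic marker grammatically indicates the entity and not its headquarters location.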
(2) IMPACT LINE TOPIC MARKER 
(PRACTICAL FORCE) 
The Impact Line topic marker exerts 
a force that extends beyond the 
scope of a JJV article's first 
sentence. In instances of ellipsis, 
which occurs frequently throughout 
the JJV corpus, the appropriate 
subject can be supplied by inserting 
the Impact Line "wa"-designated 
subject. Article #1747 is a classic
example of Japanese presentation:
Literal translation ([ ] indicates
zero anaphora):
1) On the 6th Joyo Bank announced
that [ ] had concluded a
comprehensive business tie-up with
Nomura Securities. 2) In the
securities area, [ ] already has a
tie-up arrangement with Nikko
Securities, but in order to meet the
diverse needs of [ ] regional
customers, [ ] is making up for the
lack of securities-related services
through tie-ups with several
companies .... 4) As far as the tie-
up with Nomura is concerned, M & A
(company mergers and acquisitions)
business is included, and Joyo is
poised to move aggressively into
this area.
Note that the Impact Line subject, 
Joyo Bank, does not appear again 
until the fourth sentence, which is 
the last line of the article. Until 
it reappears as the subject, it is 
omitted and one needs to supply a 
pronoun or proper name -- "it",
"its", "Joyo" -- in order to read
the passage understandably in
English. In other words, the
heuristic, which states that
ellipsis can be filled by the
subject marked by the Impact Line
topic marker, works quite well here.
Admittedly this is an easy case 
because stylistically Japanese 
allows ellipsis in a sentence that 
follows one in which the subject was 
introduced originally. In fact, 
using the term heuristic qua a 
convention with grammatical and 
stylistic acceptability may be 
inappropriate. However, in numerous 
other instances when convenience 
dominates and ellipsis is propagated 
throughout a text beyond the decent 
bounds of style, assigning the 
proper subject is less clear-cut. 
Particularly troublesome are those 
cases in which ellipsis continues 
for several sentences before the 
introduction of a new subject 
appropriately designated by another 
topic marker. Thereafter, the 
subject -- which one? -- is again 
omitted, and one must decide between 
calling upon the proximate "wa"-
designated subject or the original
Impact Line "wa"-designated agent.
When coding or checking 100 of the
150 test-set articles, I noted only
one instance (#2111) in which
context demanded that the subject of
a particularly complex sentence was
not the default Impact Line "wa"-
designated one. It is, therefore, a
powerful heuristic, especially in
the JJV corpus where the articles
are on average short and the
"protagonist" principal tie-up
entity is highlighted at the outset
by the Impact Line "wa." The
protagonist entity usually announces
the tie-up to the public, and in
this sense, "has the action"
throughout the remainder of the 
text. In short, when in doubt one 
should revert to the initial topic 
subject. 
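The "when in doubt, revert to the initial topic subject" rule can be written down in a few lines. This is a minimal sketch, assuming each sentence has already been reduced to its explicit "wa"-designated subject (or None where the subject is omitted); the function name and the input representation are hypothetical.

```python
def resolve_subjects(sentence_subjects):
    """Fill omitted subjects with the Impact Line subject.

    sentence_subjects holds one entry per sentence: the explicit
    "wa"-designated subject, or None where the subject is omitted.
    The first entry is taken to be the Impact Line.
    """
    if not sentence_subjects:
        return []
    impact_line_subject = sentence_subjects[0]
    return [s if s is not None else impact_line_subject
            for s in sentence_subjects]


# Article #1747: Joyo Bank is named in sentence 1 and not again until sentence 4.
article_1747 = ["Joyo Bank", None, None, "Joyo"]
print(resolve_subjects(article_1747))

# Hypothetical harder case: a new subject appears in sentence 3 and is then
# dropped; the rule reverts to the Impact Line subject, not the proximate one.
print(resolve_subjects(["Entity A", None, "Entity B", None]))
```

The sketch deliberately ignores the proximate subject, so it would be wrong for the single exception noted in the text (#2111).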
INVALID USES OF "WA"
Before turning to the Impact Line 
verbal element and finishing the 
prototype version-1 definition, the 
two types of occurrences below help 
illustrate further the legitimate 
uses of "wa" by showing what does
not qualify as prototypical: 
1. In the JJV test set, there are 
three instances in which the Impact 
Line topic marker is not preceded by 
an ENTITY but by a PERSON who is 
announcing a tie-up. The entity name 
is present as a modifier, e.g., 
Japan Development Bank's Takahashi 
Hajime president <TM
Such instances are eliminated from
consideration as a prototype because
the initial "wa" is not preceded by
a principal tie-up partner.
2. In one instance the initial "wa"
marks a valid entity for extraction;
however, it is not a principal tie-
up partner; it is the PARENT of one
of the principals.
(3) IMPACT LINE: OTHER
REQUISITE ELEMENTS 
As mentioned above under GRAMMATICAL 
FORCE, the JV application tracks 
tie-up relationships between two or 
more entities. And, it has already 
been demonstrated that the Impact 
Line topic marker is a reliable 
indicator (81% of the JJV test set) 
of at least one of those entities. 
The next question is: Does the 
prototypical Impact Line also 
contain the other elements required 
for instantiating a tie-up? That 
is: 1) Is the name of the other tie-
up entity(ties) present in the
Impact Line, and 2) is there any 
explicit indication that the 
arrangement between the two entities 
is in fact a tie-up relationship? 
1) Remarkably, there are only seven
instances -- over and above the 
previously cited 117 -- in which an 
Impact Line would otherwise be 
considered prototypical except that 
the other tie-up partner name(s) is 
not specified until later in the 
text. In other words, 81% of JJV 
test-set Impact Lines indicate 
clearly not only by virtue of the 
topic marker at least one tie-up 
entity, but also introduce the name
of the other principal partner as 
well. 
2) In order to confirm that any two 
or more entities present in the 
Impact Line are in a tie-up 
relationship, the Impact Line must 
state specifically that this is the 
case. The verbal elements at the end 
of the Impact Line are important to 
look at, therefore, in determining 
whether there is a tie-up or not. 
Typically, Japanese text will 
stipulate "teikei," which is the
most frequent term for tie-up, but
will also use other phrases that are
either synonymous or describe an
arrangement or activity that
presupposes a tie-up, such as:
(agreed to join)
(signed contract to establish JV
company)
(announced the formalization of an
R&D contract)
All of the previously judged 117
prototypical instances meet this
standard, and not surprisingly,
given the formulaic nature of the
Impact Line, 96 out of those 117
(82%) employ the word "teikei."
(Example 7 later discusses an
Impact Line in which "teikei" does
not appear.)
(4) VERSION-1 REVIEW 
Example 1:
PN-Subject <TM Numeral+N
Tokyo Marine & Fire 17th
N Prt NP
English big/gen'l/insur
N PN
comp. Commercial Union
N N PN Prt
comp. hqs. London with
VP Prt VP
business/tie-up/did announcem't/did
Example 1 is reprised above to
review the elements of a
prototypical Impact Line. It must
contain all the elements required by
a valid tie-up. Therefore, the
Impact Line must state that there is
a tie-up (or, was, in the case of
dissolution) between at least two
entities that are named; more if the
partnership so stipulates.[4]
Furthermore, at least one of the
named tie-up entities -- the
"protagonist" -- must be followed
immediately by the topic marker
"wa."
Version-1 Criteria:
• Two Entities: Tokyo Marine & Fire
and Commercial Union
• "Wa"-Designated Protagonist Tie-Up
Entity: Tokyo Marine & Fire
• Existence of Tie-Up Relationship:
indicated by keyword "teikei"
At first glance this seems like an 
onerous burden for a prototypical 
structure to bear. But it is the 
discourse nature of Impact Lines in 
the JJV domain to be replete with
pertinent information, much of it 
suitable for extraction. In view of 
the fact that the Impact Line 
introduces much data at the outset 
of an article, a more restrictive 
definition (version 2) requiring the 
Impact Line to contain additional 
extractable data items is presented 
below. 
[4] Two articles with 3 tie-up partners and one with 4 are included in the 117 prototypical cases.

DEFINITION OF PROTOTYPICAL
IMPACT LINE (VERSION 2)
The definition of version 2 requires
the presence of two extractable data
items in the Impact Line in addition
to the minimum criteria of version
1. As the Impact Line in Example
1 above shows, a valid tie-up 
relationship exists between Tokyo 
Marine & Fire and Commercial Union. 
Moreover, the statement presents two 
additional pieces of information 
that are relevant for extraction: 
Commercial Union is an English 
company (NATIONALITY) and its 
headquarters is in London (ENTITY 
LOCATION). One is also told that 
Commercial Union is, indeed, a 
company (ENTITY TYPE), but this is 
considered less an item that is 
extracted discretely than one that 
follows automatically from the 
identification of the entity itself. 
This slot will be discussed later as 
a "default" fill.
The types of extractable data items 
that occur in the 117 prototypical 
Impact Lines are listed, with the 
SLOT NAME followed by instances of 
occurrence enclosed in parentheses: 
ENTITY LOCATION (79)*, INDUSTRY TYPE 
(88), PRODUCT/SERVICE (88), 
NATIONALITY (56)*, PERSON NAME 
(44)*, PERSON POSITION (40)*, PERSON 
ENTITY AFFILIATION (44)*, ALIAS 
(25), START TIME (12), END TIME (1),
CHILD COMPANY (11), ECONOMIC
ACTIVITY SITE (9), INVESTMENT (1),
FACILITY NAME (1), FACILITY LOCATION
(1), and JV COMPANY (1).
The *-marked slots indicate that 
when these particular data items 
appear in a JJV test-set article,
they are more apt to appear in the
Impact Line than in the remainder of 
the text. For example, ENTITY 
LOCATION information occurs in the 
Impact Line in 79 cases out of a 
total of 118 instantiations in the 
JJV test set, or 67% for the JJV 
test corpus; the percentages for 
PERSON NAME, PERSON ENTITY 
AFFILIATION, PERSON POSITION, AND 
NATIONALITY are 59%, 53%, 53%, and
44% respectively. There are,
moreover, orthographic consistencies
in the textual presentation of
certain information that should be
noted: All but three of the 79
ENTITY LOCATION items are enclosed
in parens; all but six for the
ALIAS; and all of the PERSON NAME,
POSITION, ENTITY AFFILIATION data.
Viewed another way, out of 117 
version-1 prototypical Impact Lines,
eight have no additional data items;
15 have just one; 27 have two; 19
have three; 17 have four; and 31
Impact Lines have five or more data
items. In other words, if the
version-2 definition of a
prototypical Impact Line were to
require the presence of two
additional data elements, such as
NATIONALITY and ENTITY LOCATION as
in the case of Example 1 above,
then there are 94 (117 minus the 23
that have fewer than two additional
items) instances out of the 145 JJV
test corpus that qualify, or 65% of
the JJV test corpus. Viewed from
either version of the Impact Line
prototype, articles in the JJV test
corpus possess at the outset a 
wealth of potential information for 
the extraction task -- 81% in its 
most lenient interpretation and 65% 
in its more restrictive. 
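The version-1 and version-2 percentages follow directly from the counts given above. The short sketch below simply reproduces that arithmetic; the variable and function names are illustrative only, and percentages are rounded to the nearest whole number as in the text.

```python
BASELINE = 145  # test-set articles that produced at least one tie-up

# Version-1 Impact Lines, grouped by number of additional data items.
additional_items = {"0": 8, "1": 15, "2": 27, "3": 19, "4": 17, "5+": 31}

# Version 1: every Impact Line counted in the distribution.
version1 = sum(additional_items.values())

# Version 2: drop the lines with fewer than two additional data items.
version2 = version1 - additional_items["0"] - additional_items["1"]


def percent(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


print(version1, percent(version1, BASELINE))  # version-1 count and share
print(version2, percent(version2, BASELINE))  # version-2 count and share
```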
Two Impact Line examples from the 
JJV test corpus are given below to
highlight the requirements of the 
version-2 definition of the Impact 
Line prototype: 
Example 6:
PN-Subj <TM N+Prt(Adj)
Hitachi/manuf./place American
NP
large/computer/maker
PN
Hewlett Packard Co. (HP)
Conj Prt N Prt
with tie-up <DO marker
VP
formal/announcement/did
Translation: 
Hitachi Manufacturing formally
announced a tie-up with the large
American computer maker, Hewlett
Packard.
Version-2 Criteria
• Two Entities: Hitachi Manufacturing
and Hewlett Packard
• "Protagonist" Tie-up Entity Marked
by "wa": Hitachi Manufacturing
• Tie-up Relationship: indicated by
keyword "teikei"
• Two Data Items: Nationality
(American)
Alias (HP)
Example 7:
PN-Subj <TM N
Asahi/beer 21st
N Prt NP
American draft/beer/maker
PN
Adolph Coors Co.
PN Prt N Prt
(Colorado) beer <DO marker
Adj Prt VP
domestic license/production/do
V+Nom(N) VP Prt
selling was decided that
VP
announcement/did
Translation: 
On the 21st, Asahi Beer announced 
the decision that it will do 
the licensed production and selling 
of Adolph Coors' beer domestically;
Adolph Coors (Colorado) is an
American draft beer maker. 
Version-2 Criteria 
• Two Entities: Asahi Beer and Adolph 
Coors 
• "Protagonist" Entity Marked by 
"wa": Asahi Beer 
• Tie-up Relationship: indicated by 
phrases "produce" and "sell" that
describe activities which
presuppose tie-up
• Two Data Items (minimum):
Nationality (American)
Entity Location (Colorado)
• Additional Data Items Present:
Industry Type (Production)
Product/Service ("beer")
Industry Type (Sales)
Product/Service ("beer")
Economic Activity Agent (Asahi
Beer)
• (Acceptable Additional Item:
Economic Activity Site
(inference that "domestic" = Japan))
TEMPLATE DEFAULTS 
Given the fact that the JJV topic
sentence is stereotypical in both
the amount of data contained 
(magnitude) and the way in which it 
is presented (Impact Line 
prototype), how this discourse 
structure might jump-start a system 
by providing top-level information 
which can be propagated throughout 
the template is examined next. One 
needs to discuss first, however, the 
notion of template "default" fills. 
Default fills can be classified as 
either de jure, de facto, or 
logical. De jure defaults include 
the top-level or TEMPLATE OBJECT 
fills, such as the DOC-NR, DOC-DATE 
and DOC-SOURCE, whose slots are
filled by SGML-tagged data items.
They are, what one might call,
"gimmes" by design and, therefore, 
are not incorporated in the scoring 
algorithm that measures system 
performance. The de facto and 
logical defaults need some 
explanation. 
De facto defaults correspond to 
those set fills instantiated with a 
very high percentage of one type of 
data. Judging by actual systems' 
output and the patterns of certain 
answer-key template fills, no one 
will dispute that, in the end, data 
fell out of text into some set fills
at a much higher frequency than was
intuited originally when the
template was being designed.[5]
Below is a snapshot of high-
percentage JJV test-set set fills.
(The second figure represents
percentages for 100 randomly
selected development-set articles.)

SLOT NAME         FILL      TEST-SET%  DEV-SET%
TIE-UP STATUS     EXISTING  95%        91.50%
ENTITY TYPE       COMPANY   98.30%     96.60%
REL-ENT2-TO-ENT1  PARTNER   82.60%     84.50%
ENT REL STATUS    CURRENT   94.50%     95.50%

[5] Some of the distinctions that were made at
design time over the course of processing
approximately 50 articles became blurred unavoidably
as the fill rules evolved. Therefore, the initial random
distribution between, e.g., the ENTITY TYPE set fills
of COMPANY, GOVERNMENT, INDIVIDUAL, and
OTHER became lopsided in favor of COMPANY.

Given these percentages, how did the 
systems actually perform? Is there 
any indication that these de facto 
default fills were instantiated? 
The figures below seem to offer 
evidence for this. Every system 
evaluated on the TIPSTER JJV test 
corpus for MUC-5 showed
substantially lower error rates for
each of the above set fills versus
their overall (All-Objects) error
scores. 
SYSTEM  TIE-UP  ENTITY  REL-    ER      OVERALL
        STATUS  TYPE    2-TO-1  STATUS  ERROR
1       28      28      35      33      54
2       47      42      51      49      72
3       40      37      46      45      63
4       47      48      45      45      70
5       56      46      53      51      70
6       25      26      35      31      50
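The claim that every system scored lower error on each of the four set fills than on its overall score can be checked mechanically. The sketch below is illustrative only; it transcribes the figures from the table, with the overall figure for System 6 read as 50, which is an assumption about a damaged digit in the source.

```python
# Error rates per system, in table order:
# (tie-up status, entity type, rel-2-to-1, ER status, overall error)
ERROR_RATES = {
    1: (28, 28, 35, 33, 54),
    2: (47, 42, 51, 49, 72),
    3: (40, 37, 46, 45, 63),
    4: (47, 48, 45, 45, 70),
    5: (56, 46, 53, 51, 70),
    6: (25, 26, 35, 31, 50),  # overall value assumed to be 50
}


def margin(row):
    """Overall error minus the worst of the four set-fill errors."""
    *slot_errors, overall = row
    return overall - max(slot_errors)


margins = {system: margin(row) for system, row in ERROR_RATES.items()}
print(margins)
print(all(m > 0 for m in margins.values()))
```

A positive margin for a system means that even its worst default-prone slot was scored better than its template as a whole.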
The descriptive analysis of the 12 
templates mentioned above in 
METHODOLOGY shows a similarly 
distinctive trend in actual systems'
output. The 12 templates were not 
randomly selected: All of them meet 
the version-1 definition for the 
Impact Line prototype, and only four 
do not meet the restrictive one; six 
articles are short -- six lines or 
less in length; one article 
specifies three principal tie-up 
partners in the Impact Line rather 
than the usual two; two articles 
contain multiple tie-ups rather than 
the usual (84% of JJV test corpus) 
one tie-up; one article specifically 
mentions the formation of a JV
company in the Impact Line; two 
Impact Lines introduce a principal 
tie-up entity marked by the topic 
marker "wa" that is clausally 
modified by the name of its parent 
company; and one article's Impact 
Line marks two tie-up entities. In 
short, whenever a correct ENTITY was 
instantiated by any system, the 
above-mentioned default fills 
cascaded throughout the template, 
even if -- practically speaking -- 
the resulting fills indicated that a 
lone COMPANY was in a CURRENT
PARTNER relationship with itself. 
The discussion of article 1528 below 
shows such an instance of this. 
Other template fills can be regarded
as logical defaults, or those that
are a logical consequence of the
template object-oriented design. If
the keyword "teikei" confirms that
there is a tie-up and its status is,
as mentioned above, EXISTING, then
obviously the template has a tie-up
event; i.e., a TIE-UP OBJECT must be
instantiated to accommodate the
extraction of such information as
TIE-UP STATUS, ENTITY, etc.
Similarly, if there is a tie-up
event and two entities are in a
relationship defined as PARTNER,
then obviously there is an ENTITY
RELATIONSHIP. If there is an
INDUSTRY TYPE identified, there must
be an ECONOMIC ACTIVITY OBJECT to
accommodate the INDUSTRY OBJECT, 
which in turn accommodates the 
INDUSTRY TYPE. The template 
structure and other logical effects 
for inserting extracted data items 
into it will be outlined further 
below in the discussion of #1528. 
THE COMBINED EFFECTS OF 
PROTOTYPICAL DISCOURSE AND THE 
DEFAULT MECHANISM 
To illustrate the potential effects
that stereotypical JJV discourse
structure has on template fills and
overall performance when the de
facto defaults are considered as
well, the example of article #1528
is submitted below.
#1528 Impact Line:
PN <TM PN
Shiseido ophthalmic/pharm./co.
PN N PN
Senju Pharm'tical (hqs. Osaka
N PN
pres./Yoshida/Shoji/Mr.)
NP PN
orthopedic/pharm./co. Maruho
N PN
(ditto, Yamamoto/Hideo/Mr.)
Conj
and
NP Prt N
medical/supplies sales
Prt VP Prt
tie-up/did announcement/did
Translation: 
Shiseido announced that it had 
[concluded] a medical supplies
sales tie-up with Senju
Pharmaceutical (headquarters Osaka,
Mr. Shoji Yoshida, president), an
ophthalmic pharmaceutical company,
and Maruho (ditto, Mr. Hideo
Yamamoto), an orthopedic
pharmaceutical company...(remainder
omitted)
Number 1528 is a short six-line 
article with a version-2
prototypical Impact Line containing
the following data items:
• Existence of Tie-up Relationship:
indicated by keyword "teikei"
• "Protagonist" Tie-up Partner
indicated by topic marker "wa":
Shiseido
• Tie-up Partner: Senju
Pharmaceutical
• Entity Location (specifically
named): Osaka
• Person Name: Shoji Yoshida
• Person Position: President
• Entity Affiliation (info follows
entity it describes): Senju
• Tie-up Partner: Maruho
• Entity Location (inferred from
"ditto"): Osaka
• Person Name: Hideo Yamamoto
• Person Position: (unclear whether
"ditto" indicates president)
• Entity Affiliation: Maruho
• Industry Type: Sales
• Product/Service String: "medical
supplies"
Data items from remainder of text: 
• Alternate Product/Service String 
for Sales 
• Another Industry Type: Production 
• Product/Service String for 
Production 
• Alternate Product/Service String 
for Production 
• Economic Activity Agents: Shiseido, 
Senju, Maruho 
• Start Time for Production 
• Revenue for Sales
• Start Time for Revenue 
• Revenue Type 
• Revenue Rate 
Adding the logical and de facto
default slots -- such as TIE-UP,
TIE-UP STATUS, ENTITY TYPE, ENTITY
RELATIONSHIP, REL-ENT2-TO-ENT1,
ENTITY RELATIONSHIP STATUS, ECONOMIC
ACTIVITY, etc. -- there are a total of
47 possible fills that are scored.
SYSTEM 1: MINIMUM CASE
SCENARIO 
Given the plethora of data items in 
the Impact Line and its prototypical 
structure, minimally a system should
be able to identify and extract an
ENTITY NAME (Shiseido) by the topic
marker "wa" because this element of
the Impact Line is the most
consistent part of the prototype.
Suppose, moreover, a system
confirms the existence of a tie-up
event (CONTENT) by identifying the
keyword "teikei," which is another
consistent element of the Impact
Line prototype, and one other data
item from the Impact Line such as 
the INDUSTRY TYPE SALES, which also 
has a keyword associated with it 
"hanbai." This system would have in 
effect identified and extracted 
three data items from the Impact 
Line. The default instantiations 
associated with the extraction of 
these items would be: TIE-UP STATUS 
(EXISTING), the named ENTITY (is a 
constituent of the TIE-UP), ENTITY 
TYPE (COMPANY), an ENTITY
RELATIONSHIP, the named ENTITY (is a
constituent of the ER), an ECONOMIC
ACTIVITY (accommodates INDUSTRY),
INDUSTRY (accommodates INDUSTRY
TYPE), REL-ENT2-TO-ENT1 (PARTNER),
and ENTITY RELATIONSHIP STATUS 
(CURRENT), for a total of 12 
template fills. 
This can also be viewed below 
schematically in template fashion. 
(The bold lettering indicates the 
three data items extracted from the 
Impact Line to highlight their place 
of insertion into the template and 
the embedding described above; 
italicized print indicates de facto 
default fills; plain text designates 
logical defaults; the <TEMPLATE 
OBJECT> de jure default fills are 
not scored except for CONTENT; and 
the numbers (1) - (12) represent the
total correct fills.) 
<TEMPLATE-1>:=
Doc Number: 1528
Doc Date: 900227
News Source: Nikkei Shimbun
Content: <TIE-UP-1> (1)
<TIE-UP-1>:=
Tie-up Status: Existing (2)
Entity: <ENTITY-1> (3)
Econ Activity: <ECON ACTIVITY-1> (4)
<ENTITY-1>:=
Entity Name: Shiseido (5)
Entity Type: Company (6)
ER: <ER-1> (7)
<ER-1>:=
Ent1: <ENTITY-1> (8)
Rel-Ent1-To-Ent2: Partner (9)
Status: Current (10)
<ECON ACTIVITY-1>:=
Industry: <INDUSTRY-1> (11)
<INDUSTRY-1>:=
Industry Type: Sales (12)
To review the logic outlined above: 
An entity name is correctly 
identified by the topic-marker 
heuristic; in order to place the 
name within the template, an ENTITY 
OBJECT must be generated to 
accommodate it; this is accomplished 
through the generation of a TIE-UP 
OBJECT which, in turn, is generated 
by the CONTENT pointer; CONTENT is 
confirmed by the keyword "teikei"; 
the third data item "sales" can be 
inserted into the template once an 
ECON ACTIVITY OBJECT is generated in 
order to accommodate the INDUSTRY 
OBJECT needed to instantiate the 
INDUSTRY TYPE data; if a named 
ENTITY is inserted as above, it, by 
definition, must be a constituent 
part -- or principal partner -- of a 
TIE-UP, and also, by definition, 
must be in an ENTITY RELATIONSHIP 
with another entity (not identified 
here); the rest of the slots are de 
facto default fills. 
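The propagation just reviewed can be sketched in code. What follows is a minimal, hypothetical Python illustration, not the implementation of any MUC-5 system: the function name `propagate_defaults` and the (object, slot, value) tuple layout are assumptions introduced here, while the slot names and default values are taken from the schematic template above.

```python
def propagate_defaults(entity_name, tie_up_confirmed, industry_type):
    """Expand up to three Impact Line items into (object, slot, value) fills.

    Hypothetical sketch: object and slot names follow the paper's schematic
    template; the function itself is illustrative only.
    """
    fills = []
    if not tie_up_confirmed:
        # Without the "teikei" keyword there is no CONTENT pointer,
        # so no TIE-UP object exists to hold anything else.
        return fills

    # CONTENT generates the TIE-UP object; its status defaults to Existing.
    fills.append(("TEMPLATE-1", "Content", "<TIE-UP-1>"))
    fills.append(("TIE-UP-1", "Tie-up Status", "Existing"))

    if entity_name:
        # A named entity needs an ENTITY object, which by definition is a
        # constituent of the TIE-UP and of an ENTITY RELATIONSHIP (ER).
        fills.append(("TIE-UP-1", "Entity", "<ENTITY-1>"))
        fills.append(("ENTITY-1", "Entity Name", entity_name))
        fills.append(("ENTITY-1", "Entity Type", "Company"))
        fills.append(("ENTITY-1", "ER", "<ER-1>"))
        fills.append(("ER-1", "Ent1", "<ENTITY-1>"))
        fills.append(("ER-1", "Rel-Ent2-To-Ent1", "Partner"))
        fills.append(("ER-1", "Status", "Current"))

    if industry_type:
        # INDUSTRY TYPE needs an INDUSTRY object, which in turn needs an
        # ECON ACTIVITY object attached to the TIE-UP.
        fills.append(("TIE-UP-1", "Econ Activity", "<ECON ACTIVITY-1>"))
        fills.append(("ECON ACTIVITY-1", "Industry", "<INDUSTRY-1>"))
        fills.append(("INDUSTRY-1", "Industry Type", industry_type))

    return fills


fills = propagate_defaults("Shiseido", True, "Sales")
print(len(fills))  # 12
```

Three extracted items yield twelve fills because each one forces the creation of the objects that contain it, and those objects carry their own pointer and default slots.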
The results of identifying and 
extracting successfully three data 
items from the Impact Line would be 
as follows: 
• 12 slots are filled out of a 
possible total of 47 
• All 12 are correct 
• Recall = 26 
• Precision = 100 
• Error = 74 
• Undergeneration = 74 
This means that what the system did 
capture, it did so accurately; and 
it did so through the identification 
of only a small percentage of the 
data items available to it in the 
Impact Line. Through the "default" 
mechanism, three discrete elements 
proliferated into a template with 12 
correct fills. 
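The System 1 figures can be checked from the slot counts alone. The sketch below assumes the usual MUC-style definitions: recall and precision over possible and actual fills with half credit for partial matches, error per response fill, and undergeneration as missing over possible. The function name `muc_scores` and its argument names are introduced here for illustration, and the 35 missing fills are simply the 47 possible minus the 12 filled.

```python
def muc_scores(correct, partial=0, incorrect=0, missing=0, spurious=0):
    """Return MUC-style scores as whole percentages.

    Assumed definitions (not quoted from the paper):
      possible = correct + partial + incorrect + missing
      actual   = correct + partial + incorrect + spurious
    """
    possible = correct + partial + incorrect + missing
    actual = correct + partial + incorrect + spurious
    total = possible + spurious
    credit = correct + 0.5 * partial  # partial matches earn half credit

    def pct(numerator, denominator):
        # Guard against empty templates; report 0 rather than dividing by zero.
        return round(100 * numerator / denominator) if denominator else 0

    return {
        "recall": pct(credit, possible),
        "precision": pct(credit, actual),
        "error": pct(incorrect + 0.5 * partial + missing + spurious, total),
        "undergeneration": pct(missing, possible),
    }


# System 1: 12 fills, all correct, out of 47 possible, so 35 are missing.
print(muc_scores(correct=12, missing=35))
# {'recall': 26, 'precision': 100, 'error': 74, 'undergeneration': 74}
```

Whenever some fills are wrong, the scores additionally depend on how they divide into incorrect, partial, missing, and spurious.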
SYSTEM 2: BETTER CASE SCENARIO 
Suppose, however, another system, 
System 2, extracts successfully the 
same three data items as System 1 
and, in addition, identifies other 
Impact Line information such as 
ENTITY LOCATION (Osaka), PERSON NAME 
(Shoji Yoshida), PERSON POSITION 
(President), ENTITY AFFILIATION 
(Shiseido), and another named ENTITY 
(Senju). System 2, moreover, 
successfully recognizes a START TIME 
which appears in text after the 
Impact Line. Finally, this system 
incorrectly extracts a second 
INDUSTRY TYPE (RESEARCH rather than 
PRODUCTION), and lists only two ECON 
ACTIVITY AGENTS (Shiseido and Senju) 
rather than three (Shiseido, Senju, 
and Maruho) because it failed to 
identify the third entity name in 
the Impact Line. System 2, in 
short, has done a better job than 
System 1 in making use of the 
top-level Impact Line data available 
to it. It still misses several 
Impact Line items and misidentifies 
(undergenerates) two others, but 
coupled with the instantiation of 
the same defaults outlined in the 
schematic above, its results would 
look more impressive: 
• Out of 47 total possible scored 
slots, 29 are filled; 26 correctly. 
• Recall = 55 
• Precision = 90 
• Error = 46 
• Undergeneration = 40 
SYSTEM 3: BETTER STILL 
Finally, suppose yet another system, 
System 3, does an even more thorough 
job of extracting data from the 
Impact Line. In addition to what 
System 2 recognizes, this system 
identifies the third entity 
(Maruho), a second PERSON (Hideo 
Yamamoto) with ENTITY AFFILIATION 
(Maruho) and POSITION (infers 
"President" from "ditto" which is 
scored as acceptable), and the 
PRODUCT/SERVICE string associated 
with SALES. Like System 2 above, 
System 3 recognizes a START TIME 
from the body of the text and 
misidentifies a second INDUSTRY TYPE 
as RESEARCH. Since this system has 
managed to extract every piece of 
Impact Line information and insert 
it into the template along with the 
default fills, not surprisingly its 
results would look impressive 
indeed. 
• Out of 47 possible scored slots, 38 
are filled; 37 correctly. 
• Recall = 80 
• Precision = 99 
• Error = 20 
• Undergeneration = 19 
CONCLUSION 
This paper has shown that JJV 
articles possess a stereotypical 
pattern of introducing much 
significant information amenable to 
the data extraction task. This 
stereotypical pattern is embodied in 
what has been outlined here as the 
Impact Line prototype. Furthermore, 
the "mining" of the Impact Line to a 
minimal degree by extracting the 
topic marker-designated ENTITY is, 
one could say, a little that goes a 
long way. This is due in large part 
to that ENTITY's strategic place in 
the template and the way in which 
default fills associated with it are 
propagated throughout the template. 
Hence, higher scores result for JJV 
than EJV. 
A system, such as System 3 above, 
that takes full advantage of the 
Impact Line prototype and the 
plethora of information available 
therein can maximize its capability 
and show a quantum leap in 
statistical performance. Obviously, 
the formulation of a complete JJV 
discourse structure would raise 
performance to another level. 
Discourse analysis alone, however, 
will not resolve all the problems 
endemic to Japanese, such as 
ellipsis. If the formulistic nature 
of Japanese discourse in the JJV 
domain is a boon to data extraction, 
then its penchant for omitting 
sentence topics altogether is a 
potential minefield. Discrete data 
items that have been easily 
identified at the outset need to be 
correctly referenced to other 
activities that follow or the 
resulting template fills will paint 
a totally misleading picture as to 
who is doing what to whom. This 
paper has discussed a heuristic for 
topic-marker substitution that might 
help in this regard, but it is only 
a small part of the equation for 
making Japanese more explicit. 
