REFERENCE RESOLUTION USING SEMANTIC PATTERNS 
IN JAPANESE NEWSPAPER, ARTICLES 
Ta,kahiro Wakao 
University of Shcflic'ld, I)epa,rtment of (~ompul,er Science 
l{cgcnl; Cou l, 211 Portobello St, Sheffield SI dDP, (JK 
I • I ma~l: t. w~tkao @dcs.sh@ ac.'uk 
1 INTRODUCTION 
Reference resohttion is one of the important tasks in 
naturM l~mguage l)rocessing. I n Japanese newspaper 
articles, pronouns are not often used ~m referential 
expressions for COl'fll)ally ll0.11\]es~ \])lit shortelled (:()Ill- 
party names and doush.a ("the same eompany") are 
used more often (Murald et al. 1993). Although there 
have beeo. studies of reference resolution Ib,' wmous 
aou. phrases in Japanese (Shibata el al. 1990; Kitani 
|994), except Kitani's work, they do not clearly show 
how to lind the referents in computa.tionally i)lausible 
ways for a large amount of data, suc, h as a newst)aper 
database. In this l)aper 1, we determilm the referents 
of dousha and their locations I)y hand, and then pro- 
pose one simph" and two heuristic methods which use 
SClllantic information in text ,';uc.h as collll)ally ilalllC8 
and their patterns, so as to I,est these three methods 
on how accurately they lind t.he correct referents. 
Dousha is f(mnd with several l)artich~s such as 
"\]~e", "ga", "*to", and "go" in neWSl)al)er artMes. 
Those which co-occur with ha and ga arc choseu for 
the data since they are l.hc. two most fre(luent parti- 
cles when dousha is in the sul)jeet position in a sen- 
tenc(:, q'ypically, ha marks the topic of the sentence 
and ga marks the subject of tim sellt(~l/(;e. A typical 
use of dousha is as follows: 
Nihon Kentakii I'~Hraido (;hikin ha, 
Japan Kentucky Fried (;hicken ha, 
sekai s;d(lai no piza chien, 
world's largesl, pizza chain store, 
Piza Ilatto to teikei wo musul)i, 
Pizza llut to tie-up estal)lish, 
kotoshi gogatsu kara zenkoku de 
starting May t, his year, nation-wide, 
takuhai piza chien no tenkai wo 
pizza, deliw?l'y chain store extension 
l,|,his paper was written wheu the author was at the (;mn- 
l)uting l{escm'ch t~&bot';d.ol'y of New Mcxi(:(, .~tld.e (Jnlwwslty. 
The aul.ho," \]l~ts been al; Unlvcrsity of Shelllcld slncc J;mu;n'y 
\[ 994. 
hajintesu to hapl)you shita. 
begin almounced. 
sarani dousha ha furaido chikin no 
I'V| oreover, the 8allle COlll\])ally 
chicken of 
fried 
takuhai saabisu nimo nori(lasu. 
delivery service as well will start. 
A rough translation is: 
"Kentucky l"ried ()hieken Japan allllOlltlced that it 
had established a tie-Ul) with the world largest l)izza 
chain store, l)izza tlut, and I)egan to expand pizza de- 
livery chain stores nation-wide starting in May this 
year. Moreover, the company will start delivery of 
fried chicken as well." 
Pottsha ill t\]le second sel,te\]lce relel:S to Ken- 
l.ucky Fried ('&icken Japan as "the company" does in 
l,hc English translation. As shown in this example, 
some articles COlltailt lllore than one possible referent 
or ronlt)any ~ aIId the reference resolution of doush.a 
should identify the referent correcl, ly. 
2 LOCATIONS ANI) CONTEXTS 
OF THE tH~,Ii'ERENTS 
Most of the Japanese newspal)er articles e×amined 
in this study are in t.he domain of Joint-Ventures. 
The som'ees of lh<'. newspaper articles are mostly lhe 
Nikkci and the Ashahi. '\['h(! total number <)l" the ar- 
ticles is 1375, and there are 42 cases of dousha with 
ga amt 66 cases (>f dousha with ka in the entire set of 
articles. 
The followiug tables, Table 1 and Tabh'. 2, show 
the locations and contexts where the referents of both 
subsets of dousha appear. 
1133 
Table t Locations and contexts of the referents of dousha with ga 
dousha with ga 
location \] context 
Within the same sentence 
Snbject 
Non-subject 
company name + ha 
part of the subject * 
compauy name + niyorulo 
others * * * 
In the previous sentence 
Subject 
Non-subject 
company name + ha 
company name + ga 
empha.sis structure ** 
part of the subject * 
company name + lo 
In two sentences before 
Subject \[ company name + ha 
I company name + ga In previous paragraph 
Topic of the paragraph \[ company name + ha 
In two paragraphs before 
Topic of the paragraph \[ company name + ha 
number of eases 
19 
13 
2 
Table 2 Locations and contexts of the referents of dousha with ha 
dousha with ha 
location \] context 
Subject 
Subject 
Non-subject 
Within the same sentence 
company name + ga 
company name + deha 
In the previous sentence 
company name + ha 
eml)hasis structure ** 
part of the subject * 
others 
In two sentences before 
Subject \[ company name + ha \[ 
part of the subject • 
number of cases 
32 
21 
5 
4 
2 
17 
1.6 
1 
In three sentences before (in the same paragraph) 
Subject \[ company name + ha 
In previous paragraph 
Topic of the paragraph \[ company name + ha 
Topic of the paragraph \[ colnpany name + ga 
In two paragraphs before 
\[lbpic of the paragraph I company name + ha 
In three paragraphs before 2 
Topic of the paragraph \[ company name + ha __ 2 
Note for Table 1 and Table 2 
company name referred to is a part of a larger subject noun phrase. 
company name referred to comes at the end of tim 
sentence, a way of emphasising the company name in Japanese. 
company name with to (with), kara (from), 
wo tsuuji (through), tono aidade (between or among). 
1134 
1,'or doush(~ with ga (Table 1), the referred coin- 
pany nan'les, mr the referents appear in non-sul/ject 
positions fi:om time to time, especially if the referent 
appears in the same sentence as dousha does. For 
dousha with ha (Table 2), compared with Table 1, 
very l>w referents are located in the same sentence, 
and most of the referents are in the subject position. 
For both occurrences of dousha, a considerable num- 
ber of the referellts appear two or more sentelice8 
beR)re, and a few of them show up even two or three 
paragraphs before. 
3 THREE HEURISTIC METHODS 
TESTED 
3.1 Three Heuristic Methods 
Oile simple and two heuristic iilethods to fill(I the rcl L 
erents of dousha are described below. The lirst, the 
simple method, is to take the closest COml/any name, 
(the one which appears most recently before dousha), 
as its referent (SilnI)le Closest Method or SCM). 
It is used ill this paper to indicate the t)~Lseline pcr- 
lbrmance tbr reference resolution of dousha. 
The second method is a modified Siml)le Closest 
Method for dousha with ca. It is basically the same 
as SCM except that: 
• if there is (tile or there, conlpany liaine hi the 
sairie seiltellce before the dousha, take |.lie clos- 
est COllipally nanw. as the referent. 
• if there is a conipany llaille inllnediately fol- 
lowed by ha, ca, deha, or niyorulo somewhere 
bel5re dousDa, use the closest such company 
name as the referent. 
• if the previous sentence ends with a cOral)any 
ll;_~llle, thus l)utting aii enlphasis on the COlll- 
pally liaine, make it the rcl\]2relit. 
• if there is a pattern "COlilpaily liame lie hlllliali 
lialIle title..." (equivalent to <'title hiilllaii lialno 
of cOUll)any elaine..." in I'\]nglish) in the prove- 
Oils SOlltellce, then iiso the COllipaliy n~iliie as 
tim reforelit. Typical titles are sh.achou (presi- 
dent) alld kaichou (Chairinan uP I/oard). 
The theM heuristic method is used t~r dousha with ha 
cases. It is also based on SCM except the following 
points: 
• if there is a company name innnediately tbL 
lowed by ha, ga, deha, or uiyoruto somewhere. 
before dousha, use the closest such colnl)any 
name as the referent. 
* if the previous sentence ends with a company 
name, thus putting an eniphasis Oil the coin- 
liany nalne, make it the reli'~rent. 
• if there is a pattern "coral)any nanie no human 
name title..." (equivalent to "title human name 
of company name..." in English) in the prove- 
mils seiltelice~ theil /ise the cowilialiy Ilanie as 
the refi;rent. 
The third method is in fact a set of the second 
method, and both of them use semantic information 
(i.e. company name, human name., title), syntactic 
patterns (i.e. where a conll)any name, a human name, 
or a title appears in a sentence) and several specific 
lexical items which come immediately after tim com- 
pany Ilallies. 
3.2 Test Results 
The three lnethods haw'. heen tested on the develop- 
merit data from which the lnethods were produced 
and on the set. of unseen test data. 
3.2.1 Against the dew;lolmient data 
As mentioned in section two, there are 42 cases of 
dousha with ga and 66 cases of dousha with ha. 
For the dousha with ga rases, the Simple Clos- 
est Method identifie.s the referents 67% correctly (27 
correct out of 42), and the second inethod does so 
90% (38 out of 42) correctly. ,qCM misses a number 
of referents whMi appear iii previous sentences, and 
most of those which appear two or inore sentelices 
previously. 
For the cases of dousha with ha, SCM identities 
the referents correctly only 52% (34 correct out of 
66), however, the third heuristic method correctly 
ideiltilies 94% (62 out of 66). 
3.2.2 Against the test; data 
The test data was taken front Japanese newspaper 
articles on micro-electronics. There are 1078 arti-- 
c.les, and 51 cases of dousha with ga and 250 cases of 
dousha with ha. The test has been conducted against 
the. all get cases (51 of them) and the first t O0 Bet cases. 
For the dousha with ga cases, the Simple Clos- 
est Method identifies the referents 80% correctly (4 I 
correct out of 51), and the second method does so 
96% (49 out of 51) correctly. 
For the c~Lses of dousha with ha, SCM identifies 
the referents correctly only 83% (83 correct out of 
100), however, tl,e third heuristic method correctly 
ide.ntifies 96% (96 out of 100). 
The following table, Table 3, shows the summary 
of the test. results. 
1135 
Table 3 Summary of'rest Results 
\[ Development Data Test Dat_a 
dousha with ga 
SCM I 67% 80% 
2ndmethod I 90% 96% 
dousha with ha 
SCM 52% 83 % 
3rdmethod 94% 96% 
4 DISCUSSION 
Tile second and third heuristic methods show high 
accuracy in finding the referents of dousha with ga 
and ha. This means that partial semantic parsing 
(in which key semantic information such as company 
name, human name, and title is marked) is snfli- 
cient for reference resolution of important referential 
expressions such as dousha in Japanese. Moreover, 
since the two modified methods are simple, they will 
be easily implemented by computationally inexpen- 
sive finite-state pattern matchers (Hobbs el aL 1992; 
Cowie ct al. 1993). Therefore, they will be suitable 
for large scale text processing (Jaeobs 1992; Chinchor 
el al. 1993). 
One important point to realize is that the sec- 
ond and third methods, although they are simple to 
implement, achieve something that is rather compli- 
cated and may be computatlonally expensive other- 
wise. For example, in order to find the correct refer- 
ent of a given dousha, you may have to skip one entire 
paragraph and find the referent two paragraphs be- 
fore, or you may have to choose the right company 
name from several possible company names which ap- 
pear before the given dousha. The modified methods 
do this correctly most of the time without worrying 
about constructing sometimes complicated syntactic 
structures of the sentences in the search window for 
the possible referent. 
Another important point is that the modified 
methods make good use of post-nominal particles, 
especially ha and ga. For example, if the referent is 
located two sentences or more before, then the ref- 
erent (the company name) comes with ha ahnost all 
the time (35 out of 38 such cases for both dousha). 
It seems that if tile referent of the dousha in consid- 
eration is more than a certain distance before, two 
sentences in this case, then tile referent is marked 
with ha most of the time. Kitani also uses this ha 
or ga marked company names as key information in 
his reference resolution algorithm for dousha (Kitani 
1994). 
5 CONCLUSION 
The locations and contexts of tile referents of dousha 
in Japanese Joint-Venture articles are determined by 
hand. Three henristic methods are proposed and 
tested. The methods which use semantic informa- 
tion in the text and its patterns show high accuracy 
in finding the referents (96% for dousha with ga and 
96% for dousha with ha for the unseen test data). The 
high success rates snggest that a semantic pattern- 
matching approach is not only a valid method but 
also an ef\[icicnt method for reference resolution in 
the newspaper article domains. Since the Japanese 
language is highly case-inflected, case (particle) infof 
mation is used effectively in these methods for refer- 
ence resolution. -How much one can do with semantic 
pattern matching for reference resolution of similar 
expressions such as "the company" or "tile Japanese 
company" in English newspaper articles is a topic for 
fllture research. 
6 ACKNOWLEDGEMENT 
I would like to thank tile Tipster project group at the 
CRL for their inspiration and suggestions. 1 would 
also like to thank Dr. Yorick Wilks, Dr. John Barn- 
den, Mr. Steve Ilchnreieh, and Dr. Jim Cowie for 
their productive comments. The newspaper articles 
used in this study are from the Tipster Information 
Extraction project provided by AR.PA. 
7 REFERENCES 
Chinchor, N., L. llirschman, and D. Lewis (1993). 
Evaluating Message Understanding Systems: An 
Analysis of the Third Message Understanding Con- 
fercnce (MUC-3). Computational Linguistics, .19(3), 
pp. 409-449. 
Cowie, J., T. Wakao, L. Guthrie, W. Jin, J. Puste- 
jovsky, and S. Waterman (1993). The Diderot Infof 
marion Extraction System. In the proceedings of The 
First Conference of the Pacific Association for Com- 
putational Linguistics (PACLING 93) Simon Fraser 
University, Vancouver, B.C. Canada, pp. 23-32. 
Jacobs, P.S. (1992). Introduction: Text Power and 
hltetligent Systems. In P.S. Jaeobs Ed., Tezt-Based 
Inlelligent Systems. Lawrence Erlbaum Associates, 
llillsdale New Jersey, pp. 1-8. 
tIobbs, J., D. Appelt, M. Tyson, J. Bear, and D. 
Israel (1992). SRI lntert,ational Description of tile 
FASTUS System used for MUC-4. In the proceedings 
of l,'ourlh Message Understanding Conference (M UC- 
4), Morgan Kauflnann Publishers, San Mateo, pp. 
269-275. 
1136 
Kitani, T. (1994). Merging Information by l)iseourse 
I)roeessing for Information Extraction. In the pro- 
ceedings of the tenth IEF, E Conference on Artificial 
IT~lelligeucc for Applications, pp. 168-173. 
Muraki, K., S. Doi, and S. Ando (1993). Context 
Analysis in Information Extraction System based on 
Keywords and Text Structure. Ill tile proceedings 
of the /j Tth National Conference of Information Pro- 
ccssin(I Society of Japan, 3-81. (In Japanese). 
Shibata, M., O. Tanaka, and J. Fukumoto (1990). 
Anaphora in Newspaper Editorials. In the proceed- 
ings of the ~Olh National Conference of Information 
Processing ,%cicl?l of Japan, 51"/i. (In Japanese). 
1137 
