Identification of Coreference Between Names and Faces 
Koichi Yamada and Kazunari Sugiyama 
Yasunori Yonamine and Hiroshi Nakagawa 
Faculty of Engineering, Yokohama National University 
79-5 Tokiwadai Hodogaya-ku Yokohama City, Kanagawa 240-8501 Japan 
Phone: +81-45-339-4137 
{aron, ksugi, yasunet, nakagawa}@naklab.dnj.ynu.ac.jp 
Abstract 
To retrieve multimedia content by its meaning, it is necessary to use
not only the content of each individual medium, such as image or
language, but also the semantic relations that hold between them. For
this purpose, we propose a method for finding coreferences between
human names in a newspaper article and human faces in the accompanying
photograph. The method is based on machine learning and on a
hypothesis-driven method for combining names with their corresponding
faces. Our experimental results show that the recall and precision of
our method are better than those of a system that uses information
exclusively from either the text or the image.
1 Introduction 
In multimedia content retrieval, almost all research has focused on
information extracted from a single medium, e.g. (Han and
Myaeng, 1996) (Smeaton and Quigley, 1996).
These methods do not take into account semantic relations, such as
coreference between faces and names, that hold between the contents of
the individual media. To retrieve multimedia content using this kind
of relation, such relations must first be found.
In this research, we use photograph news articles distributed on the
Internet (Mai, 1997) and develop a system that identifies a person's
name in the text of such an article with his or her face in the
accompanying photograph. The system is based on 1) machine learning
applied to the contents of each medium to build decision trees that
extract face regions and human names, and 2) a hypothesis-based method
for combining the results extracted by the decision trees of 1).
Since, in general, both the image and the text yield more than one
candidate, the output of our system is the coreference between a set
of face regions and a set of names.
There is much research on human face recognition (Rowley et al.,
1996)(Hunke, 1994)(Yang et al., 1997)(Turk and Pentland, 1991) and on
human name extraction, e.g. (MUC, 1995). However, almost all of it
deals with the contents of a single medium and does not consider
combining multimedia contents. One example of combining multimedia
contents is the research on captioned images (Srihari and Burhans,
1994) (Srihari, 1995). Their system analyzes an image and the
corresponding caption to identify the coreference between faces in the
image and names in the caption. The text in their research is
restricted to captions, which describe the contents of the
corresponding images. However, in newspapers or photo news, captions
do not always exist, and long captions like those used in their
research are rare. Therefore, in general, we have to develop a method
that captures effective linguistic expressions not from captions but
from the body of the text itself.
In the field of video content retrieval, although there is much
research ((Flickner et al., 1995), etc.), few studies have combined
image and language media (Satoh et al., 1997)(Satoh and Kanade,
1997)(Smith and Kanade, 1997)(Wactlar et al., 1996)(Smoliar and
Zhang, 1994). In this field, the language media are the soundtracks
or captions in the video, or sometimes its transcriptions. For the
analysis of video contents, information that extends along the time
axis is effective and is used in such systems. For the analysis of
still images, on the other hand, methods different from those for
video content retrieval are required, because still images provide a
smaller and more limited amount of information than videos do.
Section 2 states the background and gives an overview of our system.
Sections 3 and 4 describe the language module and the image module,
respectively. Section 5 describes the method for combining the results
of the language module and the image module. Section 6 shows the
experimental results, and Section 7 gives our conclusions.
2 System architecture for combining 
To find coreferences between names in the text 
and faces in the image of the same photograph 
news article, we have to extract human names 
from the text and recognize faces in the image 
(Figure 1). 
[Figure 1 diagram: a photograph news article; face recognition on the
image; name extraction on the text; correspondence relation between
the two.]
Figure 1: Human name extraction and face recognition.
The problem is that the face of a person whose name appears in the
text does not always appear in the image, and vice versa. Therefore,
we have to develop a method that automatically extracts a person whose
name appears in the text and whose face simultaneously appears in the
image of the same article. For convenience, we define common person,
common name and common face as follows.
Definition 1 A person whose name and face 
appear in the text of the article and in the photo 
image of the same article respectively, is called 
common person. The name of the common per- 
son is called common name, and the face of the 
common person is called common face. 
This research starts from an intuition that we state as the following
assumptions:
Assumption 1 The name of a common person 
has a certain linguistic feature in the text dis- 
tinct from that of a non common person. 
Assumption 2 The face of a common person 
has a certain image feature distinct from that of 
a non common person. 
These two assumptions are our starting point for seeking a method that
identifies how the way common names or faces appear in each medium
differs from the way non common names or faces appear, and that
assigns certainties of commonness to names and faces, respectively,
based on the above assumptions.
Since each medium requires its own processing methodology, our system
has a language module to process the text and an image module to
process the image. Our system also has a combining module, which
derives the final certainty of a name and a face from the certainty of
the name calculated by the language module and the certainty of the
face calculated by the image module.
For the image module, it is necessary to use 
the resulting information given by the language 
module, such as the number of names of high 
certainty, because the features of regions like 
where and how large they are, depend on the 
number of common persons. For example, the 
image module should select the largest region 
if the language module extracts only one name. 
On the other hand, for the language module, it 
is also necessary to use the result we get from 
the image module, such as the number of faces 
of high certainty, to select names of the common 
person. 
However, given the interactive nature of these procedures between the
language module and the image module, it is clear that neither module
can wait for the other to complete its analysis. To resolve this
situation, we consider two kinds of method.
Method 1: First, the image (or language) module analyzes its contents
and outputs partial results. Then, assuming that the result of the
image (or language) module is correct, the language (or image) module
analyzes the text (or image).
Needless to say, the assumed partial results might be wrong. In that
case, the image (or language) module has to backtrack to resolve the
conflict between the result of the image module and that of the
language module. In other words, this method is a kind of search with
backtracking, and it requires a threshold value by which the system
decides whether backtracking is needed. Moreover, the result depends
on which medium is analyzed first.
Method 2: Before combining the results of image processing and those
of language processing, the system works out all the hypotheses about
the number of common persons. Using all of these hypotheses, the
system selects the best combination of the results. Its strong
advantages are that 1) the optimal solution is always found, and 2)
each module can run independently.
Considering the advantages and shortcomings of the two methods
described above, it is reasonable to adopt Method 2. In this research,
the hypotheses about the number of common persons are "one", "two" and
"more than two." The reasons for introducing "more than two" are the
following: images containing four or more persons are very rare, and
such images have features similar to those of images containing three
persons.
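[Figure 2 diagram: an article is fed to the image module and the
language module; each produces outputs under each of 3 hypotheses
(number of common persons = "1", "2", "more than 2"); the combining
module outputs the common persons with certainties, e.g. John: 0.8,
Paul: 0.6.]
Figure 2: Overview of our system.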
3 Extraction of human name 
candidates 
The language module extracts the human name candidates from all human
names appearing in the text and assigns a certainty of commonness to
each candidate for a common name. When an extracted name is a common
name, the person is regarded as an important person in the article.
Therefore, the linguistic expressions around the name probably have
specific linguistic features. Thus, our system decides whether an
extracted name is a common name using information about the linguistic
expressions around the human name. To select features that are
effective for this purpose from all the features generated from the
text, we employ a machine learning technique, because some important
features could be overlooked if they were selected by hand. Moreover,
a machine learning technique might be able to learn phenomena that are
hard for humans to notice.
It is hard to recognize meaningful linguistic features without
morphological analysis. On the other hand, if the system performs
syntactic analysis, the handling of ambiguity becomes a big problem.
Furthermore, in practical use, the high processing cost becomes a
problem when processing a huge number of news articles. Consequently,
we adopt an approach based on word sequence patterns. For this, we
first analyze the texts of news articles with the morphological
analyzer JUMAN (version 3.6) (Kurohashi and Nagao, 1998) to extract
part of speech tags as features for machine learning. Note that a
compound noun is treated as one noun, because if we treated the
component words of a compound noun individually, the patterns we would
have to deal with would become too complicated for machine learning
systems. The features used for learning are the following.
Compound noun which contains a 
human name 
A human name appearing in a news article might have adjacent words
that give additional information about the person, such as title, age,
or year of birth. A name together with certain kinds of words, such as
a title, sometimes forms one compound noun and is treated as one
morpheme in our system. Our system tries to find this type of
information as features for machine learning.
Part of speech tags around a human 
name 
As is well known, syntactic parsing is computationally heavy and
usually highly ambiguous. Thus, instead of syntactic parsing, we
extract the combination of a word, its part of speech tag and its
position relative to the focused name for learning. In particular, we
focus on the words around the human name to capture the characteristic
linguistic expressions about it. Our system employs two levels of the
part of speech tags defined by the morphological analyzer JUMAN.
Since our system is for Japanese, an object is marked by a case
particle. In pattern matching, instead of the sophisticated case
analysis done by syntactic parsing, our system simply uses the
particle that follows the word as a feature. As for the predicate, we
choose the predicate that comes after the name and is nearest to it,
because in Japanese a predicate comes after the subject, object, and
other syntactic components.
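The word, tag and relative-position features described above can be
sketched as follows. This is only an illustration under stated
assumptions: the tag strings, the romanized example sentence, the
function name window_features, the window size, and the rule that a
verb or adjective counts as a predicate are all hypothetical and are
not taken from JUMAN or from our system.

```python
from typing import Dict, List, Tuple

# A morpheme is (surface, coarse_pos, fine_pos). The two tag levels stand in
# for a two-level part of speech scheme; the tag strings are illustrative.
Morpheme = Tuple[str, str, str]


def window_features(morphemes: List[Morpheme], name_index: int,
                    window: int = 5) -> Dict[str, str]:
    """Collect word/POS features keyed by position relative to the name."""
    features: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the name itself
        i = name_index + offset
        if 0 <= i < len(morphemes):
            surface, pos1, pos2 = morphemes[i]
            features[f"word[{offset:+d}]"] = surface
            features[f"pos1[{offset:+d}]"] = pos1
            features[f"pos2[{offset:+d}]"] = pos2
    # The particle directly after the name replaces full case analysis.
    nxt = name_index + 1
    if nxt < len(morphemes) and morphemes[nxt][1] == "particle":
        features["following_particle"] = morphemes[nxt][0]
    # Nearest predicate after the name (assumed here: verb or adjective).
    for surface, pos1, _ in morphemes[name_index + 1:]:
        if pos1 in ("verb", "adjective"):
            features["nearest_predicate"] = surface
            break
    return features


# Hypothetical romanized sentence: "Tanaka ga kaiken wo hiraita".
sentence = [
    ("Tanaka", "noun", "person_name"),
    ("ga", "particle", "case_particle"),
    ("kaiken", "noun", "common_noun"),
    ("wo", "particle", "case_particle"),
    ("hiraita", "verb", "verb"),
]
feats = window_features(sentence, name_index=0, window=2)
print(feats)
```

Each key encodes the relative position, so a decision tree learner can
test, for example, whether a particular particle follows the name.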
Location and frequency of a human 
name 
Location of the word is important because it 
reflects structures of documents. Our system 
uses features as follows: 1) whether the word 
is in the title or not, 2) the line number of the 
line the word is in, and 3) the number of the 
paragraph the word is in. Our system also uses 
the order of the occurrence of the name in all 
the name occurrences and the frequency of the 
name in the text. 
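A minimal sketch of these location and frequency features is given
here. The function name location_features, the input layout (a title
string, paragraphs as lists of lines, and a list of name occurrences in
text order) and the example article are assumptions made for
illustration only.

```python
from typing import Dict, List, Optional


def location_features(title: str, paragraphs: List[List[str]], name: str,
                      name_occurrences: List[str]) -> Dict[str, object]:
    """Location and frequency features of `name`.

    paragraphs: body text as a list of paragraphs, each a list of lines.
    name_occurrences: every human name occurrence in the text, in order.
    """
    line_no = 0
    first_line: Optional[int] = None
    first_paragraph: Optional[int] = None
    for p_idx, lines in enumerate(paragraphs, start=1):
        for line in lines:
            line_no += 1
            if first_line is None and name in line:
                first_line, first_paragraph = line_no, p_idx
    order = (name_occurrences.index(name) + 1
             if name in name_occurrences else None)
    return {
        "in_title": name in title,
        "line_number": first_line,            # 1-based, counted over the body
        "paragraph_number": first_paragraph,  # 1-based
        "occurrence_order": order,            # rank among all name occurrences
        "frequency": name_occurrences.count(name),
    }


# Hypothetical article used only to exercise the function.
title = "Sato visits Kyoto"
paragraphs = [["Prime minister Sato arrived on Monday.", "He met Suzuki."],
              ["Suzuki welcomed Sato."]]
occurrences = ["Sato", "Suzuki", "Suzuki", "Sato"]
feats = location_features(title, paragraphs, "Suzuki", occurrences)
print(feats)
```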
Using the linguistic features described above, extracted from training
data, as inputs, we use C5.0 (Rul, 1998) to generate decision trees.
For each case in the test data, C5.0 outputs the class predicted by
the decision tree together with the confidence of the prediction. We
use the confidence as the output of this module.
Another factor in selecting features for learning is how many
morphemes around the name are used. In our experiment, ten morphemes
around the name are used. The experimental results will be shown in
section 6.
4 Extraction of human face 
candidates 
To identify coreferences between a face in the image and a name in the
text, this module should extract regions that are candidates for a
common face. In this section, we describe the image module, which
extracts face candidates from the image. The face candidates are the
faces of persons who might be common persons. Next, in the same way as
the language module, this module learns the characteristic features of
the region of a common face, which are used to decide whether a region
extracted as a face is a common face or not.
4.1 Extraction of face regions 
To extract face regions, this module uses the following methods: 1)
filtering to remove noise, and 2) RGB based modeling of skin color to
extract face regions. Furthermore, this module generates features of
each region and learns the characteristics of the common face with
C5.0. The value of each feature, e.g. the location or size of a face
region, depends upon the number of persons appearing in the image, as
shown in Figure 3, and in the text. To optimize feature based
recognition, this module runs its processing under each of three
hypotheses, namely that the number of common persons is one, two, or
more than two.
[Figure 3 diagram: example face layouts for images containing one,
two, and three persons.]
Figure 3: Differences in the features according to the number of
persons.
4.2 Skin color modeling 
The advantages of using color for face detection are that it is robust
against orientation, occlusion and intensities, and that it can be
processed fast. The disadvantage is that it is difficult to separate a
face from other parts of the human body, such as hands, and to locate
it accurately.
Darrell et al. (Darrell et al., 1998) convert (R, G, B) tuples into
tuples of the form (log(G), log(R) - log(G), log(B) - (log(R) +
log(G))/2), which is called "log color-opponent space", and detect
skin color by using a classifier with an empirically estimated
Gaussian probability model of "skin" and "not-skin" in that space.
Yang et al. (Yang and Waibel, 1995) develop a real-time face tracking
system and propose an adaptive skin color model for different lighting
conditions, based on the fact that the skin color distribution under a
given lighting condition can be characterized by a multivariate
Gaussian distribution (Yang et al., 1997). The variables are chromatic
colors, that is, r = R/(R+G+B) and g = G/(R+G+B). On the other hand,
Satoh et al. (Satoh et al., 1997) use the Gaussian distribution in
(R, G, B) space in their face detection system because this model is
more sensitive to the brightness of skin color.
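The two color transformations mentioned above can be written down
directly from their formulas. The sketch below is ours, not code from
the cited systems; the function names and the handling of degenerate
pixels (returning None for a black pixel, rejecting non-positive
components before taking logarithms) are assumptions.

```python
import math
from typing import Optional, Tuple


def chromatic(r: float, g: float, b: float) -> Optional[Tuple[float, float]]:
    """Chromatic colors r = R/(R+G+B), g = G/(R+G+B)."""
    total = r + g + b
    if total == 0:
        return None  # undefined for a black pixel (assumed convention)
    return (r / total, g / total)


def log_opponent(r: float, g: float, b: float) -> Tuple[float, float, float]:
    """(log G, log R - log G, log B - (log R + log G) / 2)."""
    if min(r, g, b) <= 0:
        raise ValueError("log color-opponent space needs positive components")
    lr, lg, lb = math.log(r), math.log(g), math.log(b)
    return (lg, lr - lg, lb - (lr + lg) / 2)


print(chromatic(200, 150, 130))
print(log_opponent(200, 150, 130))
```

Both transformations discard or separate overall brightness, which is
why a model built directly in (R, G, B) space reacts more strongly to
the brightness of skin color.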
The newspaper pictures we treat are scene pictures that include not
only a common face but also other faces, and a face does not always
look straight ahead. Thus, we use color information to detect a face,
because color does not depend on orientation. Suppose that the skin
color distribution follows the Gaussian distribution in (R, G, B)
space (Satoh et al., 1997). Then we introduce the Mahalanobis
distance, that is, the distance from the center of gravity of a group
that takes the variance-covariance of the data into account. We
calculate the mean intensity $M = (\bar{R}, \bar{G}, \bar{B})^T$, the
variance-covariance matrix $V$ and the Mahalanobis distance $d$ from
skin color data of 5 pixel × 5 pixel blocks, which are extracted from
the cheek colored areas of 85 persons (Satoh et al., 1997). Almost all
of the cheek colored areas show natural skin color, and they are
rarely in shadow even if the person wears a hat, etc. Let $I$ be the
intensities of a pixel of the input image. Then, if that pixel
satisfies (1), we take it as a candidate pixel with skin color.

$$d^2 > (I - M)^T V^{-1} (I - M) \qquad (1)$$

where the value of $d$ is experimentally optimized.
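The test in (1) can be sketched in a few lines. The sample colors
below are invented for illustration (they are not the cheek data of 85
persons), the threshold d = 3 is arbitrary, and the function names are
ours; the sketch only shows the computation of M, V and the squared
Mahalanobis distance.

```python
from typing import List, Sequence

Vec = Sequence[float]


def mean_vector(samples: List[Vec]) -> List[float]:
    n = len(samples)
    return [sum(s[k] for s in samples) / n for k in range(3)]


def covariance(samples: List[Vec], mean: Vec) -> List[List[float]]:
    """Unbiased 3 x 3 variance-covariance matrix."""
    n = len(samples)
    return [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
             for j in range(3)] for i in range(3)]


def inverse3(m: List[List[float]]) -> List[List[float]]:
    """Inverse of a 3 x 3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("covariance matrix is singular")
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]


def mahalanobis_sq(pixel: Vec, mean: Vec, v_inv: List[List[float]]) -> float:
    """(I - M)^T V^-1 (I - M)."""
    diff = [pixel[k] - mean[k] for k in range(3)]
    return sum(diff[i] * v_inv[i][j] * diff[j]
               for i in range(3) for j in range(3))


def is_skin(pixel: Vec, mean: Vec, v_inv: List[List[float]], d: float) -> bool:
    """Condition (1): d^2 > (I - M)^T V^-1 (I - M)."""
    return d * d > mahalanobis_sq(pixel, mean, v_inv)


# Made-up (R, G, B) skin samples, for illustration only.
SAMPLES = [(200, 150, 130), (210, 160, 135), (190, 145, 120),
           (205, 148, 140), (195, 158, 128)]
M = mean_vector(SAMPLES)
V_INV = inverse3(covariance(SAMPLES, M))
print(is_skin((202, 152, 131), M, V_INV, d=3.0))
print(is_skin((0, 0, 255), M, V_INV, d=3.0))
```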
The method described above is not very accurate in some cases: some
extra, non-facial regions are extracted as well. To achieve higher
accuracy, we examine the distribution of the samples in
(R + G + B) - (R - B) space and draw border lines so that they contain
more than 80% of the samples. We decide the resulting triangle
manually by observing various output images. We extract the pixels
that lie inside the triangle shown in Figure 4.
[Figure 4 plot: skin color samples and the bounding triangle; axis
R+G+B with ticks from 100 to 800.]
Figure 4: Skin color in (R + G + B) - (R - B) space.
Some results of this method are shown in Figure 5. As can be seen, not
only faces but also hands and other regions whose color is similar to
skin color are extracted. To eliminate these undesirable regions, we
use decision trees built by C5.0, as described in 4.3.
Figure 5: Result of face candidate region ex- 
traction. 
4.3 Features of extracted regions 
In this research, we use the following 17 features, which include the
composition information of the whole image in addition to the form and
color of the region used in conventional image retrieval (Han and
Myaeng, 1996).
The following five features express the form of a skin color region:
1) ratio of the region to the largest region, 2) ratio between the
length in the X-axis direction and the length in the Y-axis direction,
3) rectangularity, 4) ellipticity, 5) eccentricity.
The features concerning color are the following: 6-9) the means of R,
G, B and intensity Y.
The following eight features are positional in- 
formation of the region. 
10) Aspect ratio of the whole image. 
11,12) x,y coordinates of the center of gravity 
of the region. 
13) Distance between the center of gravity of
the region and the center of the whole image,
normalized with a half of the length of
the diagonal line of the image.
14) The order of the region in descending order 
of 13). 
15) Distance between the center of gravity of 
the region and the center of the upper end 
of the whole image, normalized with the 
length from the center of the upper end to 
the left lower end (or the right lower end). 
16) The order of the region in descending order 
of 15). 
17) Suppose that the image is divided into 3 × 
3 sub-areas. Which of these sub-areas the 
center of gravity is in. 
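Several of the positional features above are simple geometry and can
be sketched as follows. The coordinate convention (origin at the upper
left corner, y increasing downward), the aspect ratio taken as width
over height, the row-major numbering 0 to 8 of the 3 × 3 sub-areas, and
all function names are assumptions of this sketch.

```python
import math
from typing import Dict, List


def positional_features(cx: float, cy: float,
                        width: float, height: float) -> Dict[str, float]:
    """Features 10, 13, 15 and 17 for a region with center of gravity (cx, cy)."""
    # 13) distance to the image center, normalized by half the diagonal
    half_diagonal = math.hypot(width, height) / 2
    d_center = math.hypot(cx - width / 2, cy - height / 2) / half_diagonal
    # 15) distance to the center of the upper end, normalized by the length
    #     from that point to a lower corner
    top_to_corner = math.hypot(width / 2, height)
    d_top = math.hypot(cx - width / 2, cy) / top_to_corner
    # 17) index of the 3 x 3 sub-area containing the center of gravity
    col = min(int(3 * cx / width), 2)
    row = min(int(3 * cy / height), 2)
    return {"aspect_ratio": width / height, "d_center": d_center,
            "d_top": d_top, "cell": row * 3 + col}


def descending_ranks(values: List[float]) -> List[int]:
    """Features 14 and 16: rank of each region in descending order of a value."""
    order = sorted(range(len(values)), key=lambda k: values[k], reverse=True)
    ranks = [0] * len(values)
    for rank, k in enumerate(order, start=1):
        ranks[k] = rank
    return ranks


center = positional_features(125, 100, 250, 200)
corner = positional_features(250, 200, 250, 200)
print(center, corner)
print(descending_ranks([0.2, 0.9, 0.5]))
```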
Using these 17 features extracted from training data as input
features, we use C5.0 to learn decision trees, which extract
candidates for the common face with different certainties, as
described in section 3. The experimental results will be shown in
section 6.
5 Combining candidates from image 
and language 
In this section, we describe the combining mod- 
ule whose inputs are the candidates extracted 
by the language module and the image module 
described in section 3 and 4, respectively. Its 
output is the result of the whole system. 
5.1 Input 
As already stated, since the language module and the image module each
run under the hypotheses of "one", "two", or "more than two" persons,
each module outputs three results, one for each hypothesis. The
outputs of both modules are expressed as follows:

$$(\text{output of language module}) = f_{lang}(n, x) \qquad (2)$$
$$(\text{output of image module}) = f_{image}(m, y) \qquad (3)$$
Note that n and m are the numbers of common persons adopted as the
hypothesis, and x and y are ranks in descending order of the certainty
that the person is common. The certainty of the decision in the
language module and the image module is the confidence output by C5.0.
For example, $f_{lang}(n, 2)$ expresses the certainty of the person
who has the second highest certainty. Each output looks something like
the graph in Figure 6. In this figure, all of the extracted candidate
names or faces are sorted in descending order of the certainties
calculated by the distinct decision trees of the language module or
the image module, because the number of common persons might be more
than one. By introducing certainties, as described later, we obtain
great flexibility in combining candidates from the language module and
those from the image module.

[Figure 6 plot: certainty $f_{lang}$ or $f_{image}$ (from 0 to 1.0)
against the first, second, third, ... candidate of person's name or
face.]
Figure 6: Output from each module under one hypothesis.
5.2 Combination of hypotheses 
Since the language module and the image mod- 
ule process under each of three hypotheses, 
there are 3 × 3 combinations of the results. 
This combining module selects the best pair 
from those combinations and outputs the re- 
sults based on the selected pair. To select the 
best pair, we introduce some kinds of distance 
described as follows. 
Distance between outputs of two media 
The distance $f_{li}$ between the result of the image module and the
result of the language module is defined by (4).

$$f_{li}(n, m) = \sum_{z=1}^{M} \frac{|f_{lang}(n, z) - f_{image}(m, z)|}{f_{lang}(n, z) + f_{image}(m, z)} \qquad (4)$$

where $M$ is the maximum number of persons known from the results of
both modules. As (4) shows, the nearer the certainties of the
candidates of the same rank z from the language module and the image
module are, the smaller $f_{li}(n, m)$ is.
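As a small worked sketch of this distance: each term is the absolute
difference of the two certainties at the same rank divided by their
sum, summed over ranks. The function name media_distance, the padding
of the shorter list with zeros, and the convention that a 0/0 term
contributes nothing are assumptions of the sketch; the certainty values
are invented.

```python
from typing import List


def media_distance(lang: List[float], image: List[float]) -> float:
    """Sum over ranks z of |lang_z - image_z| / (lang_z + image_z).

    Both lists hold certainties sorted in descending order. The shorter
    list is padded with 0 (assumption), and 0/0 terms are skipped.
    """
    m = max(len(lang), len(image))
    total = 0.0
    for z in range(m):
        a = lang[z] if z < len(lang) else 0.0
        b = image[z] if z < len(image) else 0.0
        if a + b > 0:
            total += abs(a - b) / (a + b)
    return total


# Rank 1: |0.8 - 0.8| / 1.6 = 0, rank 2: 0.4 / 0.8 = 0.5, rank 3: 0.1 / 0.1 = 1
distance = media_distance([0.8, 0.6], [0.8, 0.2, 0.1])
print(distance)
```

Because each term is normalized by the sum of the two certainties, it
lies between 0 and 1 regardless of how large the certainties are.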
Distance between output of media and 
hypothesis 
If there is a difference between a hypothesis and the output
calculated under that hypothesis, say $f_{lang}$ or $f_{image}$, the
hypothesis should not be considered valid. Therefore, we introduce the
distance between the hypothesis and the output of the language module,
$f_{lang}$, or that of the image module, $f_{image}$. A hypothesis of
n common persons is defined in (5).

$$f_a(n, x) = \begin{cases} 1 & (x \le n) \\ 0 & (x > n) \end{cases} \qquad (5)$$

where x is the rank of the candidate in order of certainty. Since the
language module and the image module each have their own hypothesis,
the combining module calculates the distance $f_{al}$, defined by (6),
between the hypothesis used in the language module and the result from
the language module. It also calculates the distance $f_{ai}$, defined
by (7), between the hypothesis used in the image module and the result
from the image module.

$$f_{al}(n) = \sum_{z=1}^{3} \frac{|f_{lang}(n, z) - f_a(n, z)|}{f_{lang}(n, z) + f_a(n, z)} \qquad (6)$$

$$f_{ai}(m) = \sum_{z=1}^{3} \frac{|f_{image}(m, z) - f_a(m, z)|}{f_{image}(m, z) + f_a(m, z)} \qquad (7)$$

When the hypothesis is "more than two", the certainties of candidates
ranked fourth or lower are ignored.
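The distance between a module's output and its hypothesis can be
sketched in the same way. Here the hypothesis is a step function that
is 1 for the first n ranks and 0 afterwards, and only the first three
ranks are compared. Encoding "more than two" as n = 3, padding missing
candidates with 0, skipping 0/0 terms, and the function names are
assumptions of the sketch.

```python
from typing import List


def hypothesis(n: int, x: int) -> float:
    """Ideal certainty at rank x under a hypothesis of n common persons."""
    return 1.0 if x <= n else 0.0


def hypothesis_distance(certainties: List[float], n: int, ranks: int = 3) -> float:
    """Sum over ranks 1..3 of |c_z - h(n, z)| / (c_z + h(n, z)).

    certainties: module output sorted in descending order; ranks beyond
    the third are ignored, and "more than two" is encoded as n = 3.
    """
    total = 0.0
    for z in range(1, ranks + 1):
        c = certainties[z - 1] if z <= len(certainties) else 0.0
        h = hypothesis(n, z)
        if c + h > 0:
            total += abs(c - h) / (c + h)
    return total


output = [0.9, 0.3, 0.1]  # invented certainties
for n in (1, 2, 3):
    print(n, round(hypothesis_distance(output, n), 3))
```

Note that any non-zero certainty at a rank where the hypothesis is 0
contributes a full 1 to the distance, so the measure penalizes extra
candidates strongly.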
Decreasing factor for each inconsistent 
hypothesis 
Different hypotheses in the language module and the image module
indicate an inconsistency. However, since the analysis of each module
is not perfect, our system does not exclude such inconsistent
combinations of hypotheses. Instead, we decrease the certainty of such
inconsistent combinations. For this, we use decreasing factors
D(m, n), where n and m are the hypothesized numbers of persons in the
language module and the image module, respectively. We tuned the
actual values of D(m, n) empirically, as shown in Table 1.
Table 1: Decreasing factor D(m, n) for each inconsistent hypothesis.
                          n
                 1      2      3 or more
    1            1.0    0.9    0.5
m   2            0.9    1.0    0.6
    3 or more    0.5    0.6    0.8
Integration of the measures 
Using these three distances, namely $f_{li}$, $f_{al}$ and $f_{ai}$,
together with D(n, m), the combining module finally calculates the
total certainty f(n, m), defined by (8), for each combination of
hypotheses. The smaller f(n, m) is, the nearer the result from the
language module is to the result from the image module.

$$f(n, m) = \{f_{li}(n, m) + 1\}\{f_{al}(n) + 1\}\{f_{ai}(m) + 1\} \times \frac{1}{D(n, m)} \qquad (8)$$
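The selection of the best pair of hypotheses can be sketched as a
search over the 3 × 3 combinations. The sketch reads (8) as the product
of the three distances, each increased by 1, divided by the decreasing
factor, so that inconsistent pairs receive a larger (worse) score. The
distance values in the example are invented, "more than two" is encoded
as 3, and the names total_score and best_pair are hypothetical.

```python
from typing import Dict, Tuple

# Decreasing factors from Table 1, keyed by (n, m); 3 stands for "3 or more".
D: Dict[Tuple[int, int], float] = {
    (1, 1): 1.0, (1, 2): 0.9, (1, 3): 0.5,
    (2, 1): 0.9, (2, 2): 1.0, (2, 3): 0.6,
    (3, 1): 0.5, (3, 2): 0.6, (3, 3): 0.8,
}


def total_score(f_li: float, f_al: float, f_ai: float, d: float) -> float:
    """(f_li + 1)(f_al + 1)(f_ai + 1) / D; smaller is better."""
    return (f_li + 1) * (f_al + 1) * (f_ai + 1) / d


def best_pair(f_li: Dict[Tuple[int, int], float],
              f_al: Dict[int, float],
              f_ai: Dict[int, float]) -> Tuple[int, int]:
    """Return the (n, m) pair with the smallest total score."""
    pairs = [(n, m) for n in (1, 2, 3) for m in (1, 2, 3)]
    return min(pairs, key=lambda p: total_score(f_li[p], f_al[p[0]],
                                                f_ai[p[1]], D[p]))


# Invented distances for one article.
f_li = {p: 0.5 for p in D}
f_al = {1: 0.2, 2: 1.0, 3: 2.0}
f_ai = {1: 0.3, 2: 1.2, 3: 2.1}
print(best_pair(f_li, f_al, f_ai))
```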
5.3 Combining the results 
When the combination with the smallest f(n, m) has been selected, the
results from the language module and the image module are fixed. The
system combines these results into one result $f_{union}(n, m, z)$,
where the person corresponding to z is expected to be a common person.
$f_{union}(n, m, z)$ is the final output of the whole system. For this
combination, we investigate the following two methods. In (9), the
consistency of the number of common persons is regarded as an
important factor. In (10), on the other hand, when at least one of the
two modules, namely the language module or the image module, assigns a
high certainty to a candidate person, the whole system also assigns a
high certainty to that candidate person.

$$\forall z, \quad f_{union}(n, m, z) = f_{lang}(n, z) \times f_{image}(m, z) \qquad (9)$$

$$\forall z, \quad f_{union}(n, m, z) = 1 - \{1 - f_{lang}(n, z)\}\{1 - f_{image}(m, z)\} \qquad (10)$$
The final outputs of the whole system look something like this:
"John: common person (certainty: 0.8)", "Paul: common person
(certainty: 0.4)" and so on. These results are used to find the face
in the image when we specify a certain name in the text in order to
retrieve his/her face image, or vice versa.
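The difference between the two combining rules can be seen on a single
candidate. The sketch below takes rule (9) to be the product of the two
certainties and rule (10) to be one minus the product of their
complements; the function names and the input values are chosen only
for illustration.

```python
def combine_product(lang: float, image: float) -> float:
    """Rule (9): high only when both modules agree on a high certainty."""
    return lang * image


def combine_either(lang: float, image: float) -> float:
    """Rule (10): high when at least one module gives a high certainty."""
    return 1 - (1 - lang) * (1 - image)


strict = combine_product(0.8, 0.5)
lenient = combine_either(0.8, 0.5)
print(strict, lenient)
```

For certainties in [0, 1] the product never exceeds either input, while
the second rule never falls below either input, which is consistent
with the first rule favoring precision and the second favoring recall.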
6 Experiments 
We experimentally evaluated the proposed system by comparing it with
simple systems that contain only the language module or only the image
module, in order to confirm the effect of the combining process. The
language module and the image module work under the three kinds of
hypothesis in the simple systems as well. Thus, as the baseline of the
evaluation, we use the result of each simple system that has the
minimum distance between the output of the medium and the hypothesis,
as defined by formulas (6) and (7). In our experiments, we use the
photograph news on the web page called "AULOS" distributed by The
Mainichi Newspapers (Mai, 1997). The average length of the text of an
article is about 300 characters or 100 words. Almost all of the images
are full color, and their average size is about 250 × 200 pixels.
Moreover, the images are not accompanied by captions. For this
evaluation, we use articles with full color images published in May
and June 1997. For common name extraction, we performed four-fold
cross-validation on the 228 articles from this period that contain
common human names. For common face extraction, we performed
three-fold cross-validation on the set of color photograph images
contained in the articles used by the language module. To evaluate how
accurately the system identifies a given person as a common person, we
calculated the recall and precision of the system's decisions about
whether a person is common. Since the outputs of our system are
certainties, recall and precision are defined as follows.
$$\text{Recall} = \frac{\sum_{i \in cc} W(i)}{\text{Number of the common persons}}, \qquad \text{Precision} = \frac{\sum_{i \in cc} W(i)}{\sum_{i} W(i)} \qquad (11)$$

where W(i) is the certainty of person i, and cc
means a set of all correctly identified persons.
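A small sketch of these certainty-weighted measures follows. The
person names, certainties and the function name
weighted_recall_precision are invented for illustration; the sum in the
denominator of precision is taken over every person the system outputs,
which is our reading of the definition.

```python
from typing import Dict, Set, Tuple


def weighted_recall_precision(certainties: Dict[str, float],
                              correct: Set[str],
                              n_common: int) -> Tuple[float, float]:
    """Recall and precision with each person weighted by certainty W(i).

    certainties: W(i) for every person output by the system.
    correct: the correctly identified persons (the set cc).
    n_common: number of true common persons.
    """
    hit = sum(w for person, w in certainties.items() if person in correct)
    total = sum(certainties.values())
    recall = hit / n_common
    precision = hit / total if total > 0 else 0.0
    return recall, precision


recall, precision = weighted_recall_precision(
    {"John": 0.8, "Paul": 0.6, "Mary": 0.4}, {"John", "Paul"}, n_common=2)
print(recall, precision)
```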
Table 2: The evaluation results of the outputs from each module.
                                  Recall  Precision
Language module                   0.68    0.67
Image module                      0.52    0.64
Combining module based on (9)     0.42    0.74
Combining module based on (10)    0.76    0.69
The evaluation results for each module are shown in Table 2.
For the language module and the combining module, we evaluate names
and their certainties. For the image module, on the other hand, we
evaluate only certainties, under the assumption that the human name of
the face assigned the higher certainty is correct, because the image
module does not output human names.
The effect of combining appears as the difference between the results
of the combining module and the results of the language module or the
image module. The combining module has two variations. The module
based on (10) improved both recall and precision by combining. The
reason for the high recall is that one module picks up a person whom
the other module fails to pick up. Since high precision is maintained,
this compensation is really effective. On the other hand, the
combining module based on (9) improves precision more than the module
based on (10). The reason for this phenomenon is that the module is
able to cancel noise that appears in the contents of one medium by
using the contents of the other. However, recall decreased, as
expected from (9).
7 Conclusions 
We have developed a system that identifies coreferences between human
faces in the image and human names in the text by selecting
combinations of hypotheses and combining the results from the language
module and the image module. In our experiments, recall ranges from
42% to 76% and precision from 69% to 74%. This result indicates that
practical semi-automatic extraction of common persons from multimedia
contents for IR purposes would come within sight with some technical
improvement along this line of research.

References 
T. Darrell, G. Gordon, M. Harville, and J. Woodfill.
1998. Integrated person tracking using stereo,
color, and pattern detection. CVPR'98, pages
601-609.
Myron Flickner, Harpreet Sawhney, et al. 1995.
Query by image and video content: The QBIC
system. Computer, 28(9):23-32.
Kyung-Ah Han and Sung-Hyun Myaeng. 1996. Image
organization and retrieval with automatically
constructed feature vectors. SIGIR'96, pages
157-165.
H. M. Hunke. 1994. Locating and tracking of human 
faces with neural networks. Tech. Report CMU- 
CS-94-155, Carnegie Mellon University. 
Sadao Kurohashi and Makoto Nagao. 1998.
Japanese morphological analysis system JUMAN
manual (version 3.6). Kyoto University.
The Mainichi Newspapers, 1997. AULOS Photo
News. http://www.mainichi.co.jp/.
DARPA, 1995. Proceedings of the Sixth Message Un- 
derstanding Conference (MUC-6). 
H. A. Rowley, S. Baluja, and Takeo Kanade. 1996. 
Neural network-based face detection. CVPR'96, 
pages 203-208. 
RuleQuest Research Pty Ltd, 1998. See5/ C5.0. 
http://www.rulequest.com/. 
Shin'ichi Satoh and Takeo Kanade. 1997. Name-It: 
Association of face and name in video. CVPR'97, 
pages 368-373. 
Shin'ichi Satoh, Yuichi Nakamura, and Takeo 
Kanade. 1997. Name-It: Naming and detecting 
faces in video by the integration of image and nat- 
ural language processing. IJCAI-97, pages 1488- 
1493. 
Alan F. Smeaton and Ian Quigley. 1996. Experiments
on using semantic distances between words
in image caption retrieval. SIGIR'96, pages 174-
180.
Michael A. Smith and Takeo Kanade. 1997. Video 
skimming and characterization through the com- 
bination of image and language understanding 
techniques. CVPR'97, pages 775-781. 
Stephen W. Smoliar and HongJiang Zhang. 1994.
Content-based video indexing and retrieval.
Multimedia, 1(2):62-72.
Rohini K. Srihari and Debra T. Burhans. 1994.
Visual semantics: Extracting visual information
from text accompanying pictures. AAAI-94,
1:793-798.
Rohini K. Srihari. 1995. Automatic indexing 
and content-based retrieval of captioned images. 
Computer, 28(9):49-56. 
M. Turk and A. Pentland. 1991. Eigenfaces for
recognition. Journal of Cognitive Neuroscience,
3(1):71-86.
Howard D. Wactlar, Takeo Kanade, Michael A. 
Smith, and Scott M. Stevens. 1996. Intelligent 
access to digital video: Informedia project. Com- 
puter, 29(5):46-52. 
Jie Yang and Alex Waibel. 1995. Tracking human 
faces in real-time. Tech. Report CMU-CS-95-210, 
Carnegie Mellon University. 
Jie Yang, Weier Lu, and Alex Waibel. 1997. Skin- 
color modeling and adaptation. Tech. Report 
CMU-CS-97-146, Carnegie Mellon University. 
