<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0203">
  <Title>Identification of Coreference Between Names and Faces</Title>
  <Section position="2" start_page="0" end_page="18" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In multimedia content retrieval, almost all research has focused on information extracted from a single medium, e.g. (Han and Myaeng, 1996) (Smeaton and Quigley, 1996).</Paragraph>
    <Paragraph position="1"> These methods do not take into account semantic relations, such as the coreference between faces and names, that hold between the contents of the individual media. In order to retrieve multimedia contents by means of such relations, it is first necessary to identify them.</Paragraph>
    <Paragraph position="2"> In this research, we use photograph news articles distributed on the Internet (Mai, 1997) and develop a system which identifies a person's name in the text of such an article and his/her face in the accompanying photograph, based on 1) machine learning applied to the individual media contents to build decision trees which extract face regions and human names, and 2) a hypothesis-based combining method for the results extracted by the decision trees of 1). Since, in general, both the image and the text yield more than one candidate, the output of our system is a coreference between a set of face regions and a set of names.</Paragraph>
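The two-stage design just described (per-medium extraction, then combining) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the candidate names, regions, certainty values, and the 0.5 threshold are all assumptions for the example.

```python
# Minimal sketch of the two-stage pipeline. Each medium yields candidates
# with certainty scores (stand-ins for the decision-tree outputs); the
# combining step pairs a *set* of names with a *set* of face regions.
# All names, regions, certainties, and the 0.5 threshold are hypothetical.

def extract_names(name_certs, threshold=0.5):
    """Stand-in for the language module: keep high-certainty names."""
    return {n for n, c in name_certs.items() if c > threshold}

def extract_faces(region_certs, threshold=0.5):
    """Stand-in for the image module: keep high-certainty face regions."""
    return {r for r, c in region_certs.items() if c > threshold}

def combine(names, faces):
    """The output is a coreference between a set of names and a set of
    face regions, since each medium may yield several candidates."""
    return (frozenset(names), frozenset(faces))

# Hypothetical article: three name candidates, three region candidates.
names = extract_names({"Alice": 0.9, "Bob": 0.7, "Carol": 0.2})
faces = extract_faces({"region_1": 0.8, "region_2": 0.6, "region_3": 0.1})
print(combine(names, faces))
```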
    <Paragraph position="3"> There is a large body of research on human face recognition (Rowley et al., 1996) (Hunke, 1994) (Yang et al., 1997) (Turk and Pentland, 1991) and on human name extraction, e.g. (MUC, 1995). However, almost all of it deals with the contents of a single medium and does not take the combination of multimedia contents into account. As an example of combining multimedia contents, there is research on captioned images (Srihari and Burhans, 1994) (Srihari, 1995). Their system analyzes an image and the corresponding caption to identify the coreference between faces in the image and names in the caption. The text in their research is restricted to captions, which describe the contents of the corresponding images. However, in newspapers or photo news, captions do not always exist, and long captions like those used in their research are rare. Therefore, in general, we have to develop a method that captures effective linguistic expressions not from captions but from the body of the text itself.</Paragraph>
    <Paragraph position="4"> In the field of video content retrieval, although there are many studies ((Flickner et al., 1995), etc.), little work has been done to combine image and language media (Satoh et al., 1997) (Satoh and Kanade, 1997) (Smith and Kanade, 1997) (Wactlar et al., 1996) (Smoliar and Zhang, 1994). In this field, the language media are the soundtrack or captions in the video, or sometimes their transcriptions. For the analysis of video contents, information distributed along the time axis is effective and is used in such systems. For the analysis of still images, on the other hand, methods different from those for video content retrieval are required, because still images provide a relatively small and limited amount of information compared with videos.</Paragraph>
    <Paragraph position="5"> In section 2, the background and an overview of our system are given. In sections 3 and 4, we describe the language module and the image module, respectively. Section 5 describes the method for combining the results of the language module and the image module. In section 6, the experimental results are shown. Section 7 gives our conclusions.</Paragraph>
    <Paragraph position="6"> 2 System architecture for combining
To find coreferences between names in the text and faces in the image of the same photograph news article, we have to extract human names from the text and recognize faces in the image (Figure 1).</Paragraph>
    <Paragraph position="7"> [Figure 1: A photograph news article, consisting of a photo image and the accompanying text.] </Paragraph>
    <Paragraph position="10">  The problem is that the face of a person whose name appears in the text does not always appear in the image, and vice versa. Therefore, we have to develop a method which automatically identifies a person whose name appears in the text and whose face simultaneously appears in the image of the same article.</Paragraph>
    <Paragraph position="11"> For convenience, we define common person, common name and common face as follows.</Paragraph>
    <Paragraph position="12"> Definition 1 A person whose name appears in the text of an article and whose face appears in the photo image of the same article is called a common person. The name of the common person is called a common name, and the face of the common person is called a common face.</Paragraph>
    <Paragraph position="13"> This research is motivated by an intuition that is stated as the following assumptions: Assumption 1 The name of a common person has a certain linguistic feature in the text distinct from that of a non-common person.</Paragraph>
    <Paragraph position="14"> Assumption 2 The face of a common person has a certain image feature distinct from that of a non-common person.</Paragraph>
    <Paragraph position="15"> These two assumptions are our starting point: based on them, we seek a method that identifies the difference between the way common names or faces appear in each medium and the way non-common names or faces appear, and that assigns certainties of commonness to names and faces, respectively.</Paragraph>
    <Paragraph position="16"> Since each medium requires its own processing methodology, our system has a language module to process the text and an image module to process the image. Our system also has a combining module, which derives the final certainty of a name and a face from the name certainty calculated by the language module and the face certainty calculated by the image module, respectively.</Paragraph>
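As a toy illustration of the combining module's role, suppose the final certainty of a name/face pair were derived from the two per-medium certainties by a simple product rule. This rule is purely an assumption for the sketch, not the paper's method; the actual combining procedure is the subject of section 5.

```python
def final_certainty(name_cert, face_cert):
    # Product rule (an illustrative assumption, not the paper's method):
    # a pairing is only as certain as both of its per-medium parts.
    return name_cert * face_cert

# Hypothetical certainties from the language and image modules.
print(round(final_certainty(0.9, 0.8), 2))  # → 0.72
```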
    <Paragraph position="17"> The image module needs to use the information produced by the language module, such as the number of names with high certainty, because the features of face regions, such as where they are and how large they are, depend on the number of common persons. For example, the image module should select the largest region if the language module extracts only one name.</Paragraph>
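This dependence on the number of names can be seen in a minimal sketch where candidate regions are ranked by area. The region identifiers and areas are hypothetical, and real region features (position, relative size) would be richer than area alone.

```python
def select_regions(region_areas, num_names):
    """Keep the `num_names` largest face regions (at least one).
    `region_areas` maps a hypothetical region id to its area in pixels."""
    ranked = sorted(region_areas, key=region_areas.get, reverse=True)
    return ranked[:max(1, num_names)]

# If the language module extracted only one name, keep the largest region.
regions = {"region_1": 5200, "region_2": 3100, "region_3": 900}
print(select_regions(regions, num_names=1))  # → ['region_1']
```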
    <Paragraph position="18"> On the other hand, the language module likewise needs to use the result obtained from the image module, such as the number of faces with high certainty, to select the names of common persons.</Paragraph>
    <Paragraph position="19"> However, given the interactive nature of these procedures, it is clear that one module cannot wait until the other module has completed its analysis. To resolve this situation, we consider two methods. Method 1: First, the image (or language) module analyzes its contents and outputs partial results.</Paragraph>
    <Paragraph position="20"> Then, assuming the partial results of the image (or language) module are correct, the language (or image) module analyzes the text (or image). Needless to say, the assumed partial results might be wrong. In that case, the image (or language) module has to backtrack to resolve the conflict between the result of the image module and that of the language module. In other words, this method is a kind of search with backtracking, and it also requires a threshold value by which the system decides whether backtracking is needed. Moreover, the result depends on which medium is analyzed first.</Paragraph>
    <Paragraph position="21"> Method 2: Before combining the results of image processing with those of language processing, the system works out all the hypotheses about the number of common persons. Using all of these hypotheses, the system selects the best combination of the results. Its strong advantages are that 1) the optimal solution is always found, and 2) each module can run independently.</Paragraph>
    <Paragraph position="22"> Considering the advantages and shortcomings of the two methods described above, it is reasonable to adopt Method 2. In this research, the hypotheses about the number of common persons are "one", "two" and "more than two." The reasons for introducing "more than two" are as follows: images containing four or more persons are very rare, and such images have features similar to those of images containing three persons.</Paragraph>
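Method 2 can be sketched as an exhaustive search over the three hypotheses. The scoring rule below is an illustrative stand-in for the actual combining criterion described in section 5: selected candidates contribute their certainty, and rejected ones contribute their certainty of absence, so the best hypothesis is the one that best separates high-certainty from low-certainty candidates in both media.

```python
# Sketch of Method 2: the language and image modules run independently;
# the system then evaluates every hypothesis about the number of common
# persons and keeps the best-scoring one. The scoring rule is an
# illustrative assumption, not the paper's actual criterion.
HYPOTHESES = [1, 2, 3]  # "one", "two", "more than two" (modeled as three)

def medium_score(certs, h):
    """Score one medium under hypothesis h: the h most certain candidates
    contribute their certainty; the rest contribute (1 - certainty)."""
    ranked = sorted(certs.values(), reverse=True)
    return sum(ranked[:h]) + sum(1.0 - c for c in ranked[h:])

def best_hypothesis(name_certs, face_certs):
    return max(HYPOTHESES,
               key=lambda h: medium_score(name_certs, h)
                             + medium_score(face_certs, h))

# Hypothetical certainties: two strong names and two strong faces,
# so the hypothesis "two common persons" wins.
names = {"Alice": 0.9, "Bob": 0.8, "Carol": 0.2}
faces = {"region_1": 0.9, "region_2": 0.7, "region_3": 0.3}
print(best_hypothesis(names, faces))  # → 2
```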
  </Section>
</Paper>