File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/99/w99-0203_metho.xml
Size: 18,082 bytes
Last Modified: 2025-10-06 14:15:27
<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0203"> <Title>Identification of Coreference Between Names and Faces</Title> <Section position="3" start_page="18" end_page="19" type="metho"> <SectionTitle> 3 Extraction of human name </SectionTitle> <Paragraph position="0"> candidates The language module extracts the human name candidates from all human names appearing in the text and assigns certainty of commonness to each of the candidates of a common name. When the extracted name is a common name, the person is regarded as the important person in the article. Therefore, the linguistic expressions around the name probably have the specific linguistic features. Thus, our system decides whether an extracted name is a common name or not with information of the linguistic expressions around the human name. To select effective features for this purpose from the all features generated from the text, we employ a machine learning technique, because some important features could be fallen out if selected by hand. Moreover, machine learning technique might be able to learn incomprehensible phenomena for human.</Paragraph> <Paragraph position="1"> It is hard to recognize meaningful linguistic features without morphological analysis. On the other hand, if the system does the syntax analysis, the handling of the ambiguity becomes a big problem. Furthermore, on the practical use, high processing cost becomes a problem to process huge amount of news articles. As the consequence, we adopt a word sequence pattern based approach. For this, firstly, we analyze texts of news articles with morphological analyzer JUMAN(version 3.6)(Kurohashi and Nagao, 1998) to extract the part of speech tags as the features in machine learning. Note that a compound noun is treated as one noun because if we treat component words of the compound noun individually, the patterns we have to deal with become too complicated for machine learning systems. The features to be used for learning are the followings.</Paragraph> <Paragraph position="2"> Compound noun which contains a human name The human name appearing in the news articles might have the adjacent words which describe additional information about the name such as title, age, year of birth and so on. The name with some kind of words, like title, sometimes becomes one compound noun and treated as one morpheme in our system. Our system tries to find this type of information as features for machine learning.</Paragraph> <Paragraph position="3"> Part of speech tags around a human name As well known, syntactic parsing is computationally heavy and usually has high ambiguities. Thus, instead of syntactic parsing, we extract the combination of a word, its part of speech tag and its relative position to the focused name for learning. Especially we focus on the words around the human name to capture the characteristic linguistic expressions about the human name. Our system employs two levels of the part of speech tag defined by the morphological analyzer JUMAN.</Paragraph> <Paragraph position="4"> Since our system is for Japanese, object is described by a case particle. In pattern matching, instead of the sophisticated case analysis done by syntactic parsing, our system first applies the particle followed to the word as a feature. As for a predicate, we choose the predicate whose position is after the name and nearest to the name because in Japanese a predicate comes after subject, object, and other syntactic components. null Location and frequency of a human name Location of the word is important because it reflects structures of documents. Our system uses features as follows: 1) whether the word is in the title or not, 2) the line number of the line the word is in, and 3) the number of the paragraph the word is in. Our system also uses the order of the occurrence of the name in all the name occurrences and the frequency of the name in the text.</Paragraph> <Paragraph position="5"> Using linguistic features described above extracted from training data as inputs, we use C5.0 (Rul, 1998) to generate decision trees. For each case in test data, C5.0 outputs th~ class predicted by the decision tree with the confidence of the prediction. We use the confidence as the output of this module.</Paragraph> <Paragraph position="6"> Another factor for selecting feature for learning is how many morphemes around the name are used. In our experiment, ten morphemes around the name are used. The experimental results will be shown in section 6.</Paragraph> </Section> <Section position="4" start_page="19" end_page="20" type="metho"> <SectionTitle> 4 Extraction of human face candidates </SectionTitle> <Paragraph position="0"> To identify coreferences between the face in the image and the name in the text, this module should extract regions that are candidates of common face. In this section, we describe the image module which extracts face candidates from the image. The face candidates are the faces of persons who might be common persons.</Paragraph> <Paragraph position="1"> Next, as same as the language module does, this module learns the characteristic features of the region of a common face that are used to decide whether an extracted region as a face is a common face or not.</Paragraph> <Section position="1" start_page="19" end_page="19" type="sub_section"> <SectionTitle> 4.1 Extraction of face regions </SectionTitle> <Paragraph position="0"> To extract face regions, this module uses the following methods: 1) Filtering to remove noise, and 2) RGB based modeling of skin color to extract face region. Furthermore, this module generates features of each region and learns characteristics of the common face by C5.0. The value of each feature, e.g. location of face region, region size, depends upon the number of the persons appearing in the image as shown in Figure 3 and the text. To optimize feature based recognition, this module proceeds the processes corresponding to three hypotheses, say the number of common person is one, two, or more than two.</Paragraph> <Paragraph position="1"> to the number of the person.</Paragraph> </Section> <Section position="2" start_page="19" end_page="20" type="sub_section"> <SectionTitle> 4.2 Skin color modeling </SectionTitle> <Paragraph position="0"> The advantage of using color for face detection is robust against orientation, occlusion and intensities, and able to process fast, but the demerit is the difficulty in detecting only a face fi'om a human body or other parts like hands, and to locate it accurately.</Paragraph> <Paragraph position="1"> Darrell et al.(Darrell et al., 1998) convert (R,G,B) tuples into tuples of the form</Paragraph> <Paragraph position="3"> space&quot;, and detect skin color by using a classifier with an empirically estimated Gaussian probability model of &quot;skin&quot; and &quot;not-skin&quot; in' the space. Yang et al.(Yang and Waibel, 1995) develop a real-time face tracking system, and they propose an adaptive skin color model under different lighting condition based on the fact that its distribution under a certain lighting condition can be Characterized by a multivariate Gaussian distribution(Yang et al., 1997).</Paragraph> <Paragraph position="4"> The variables are chromatic colors, that is, r = R/(R+G+B), and g = G/(R+G+B). On the other hand, Satoh et al.(Satoh et al., 1997) use the Gaussian distribution in (R, G, B) space in their face detection system because this model is more sensitive to brightness of skin color.</Paragraph> <Paragraph position="5"> The picture of the newspaper we treat is a scene picture that includes not only a common face but also other faces, and a face doesn't always look straight forward. Thus, we use color information to detect a face because the color doesn't depend on its orientation. Suppose that the skin color distribution complies with the Gaussian distribution in (R,G, B) space(Satoh et al., 1997). Then, we introduce the Mahalanobis distance. That is the distance fi'oln the center of gravity of the group considering variance-covariance of data. We calculate the mean intensity M(= (/~,G,/})T), variance-covariance matrix V and Mahalanobis distance d from skin color data of 5pixel x 5pixel blocks, which are extracted from the cheek colored areas of 85 persons (Satoh et al., 1997). The almost all of cheek colored areas express natural skin color and they are rarely in a. shadow even if the people wear hat, etc. Suppose I be intensities of a pixel of the input image. Then, if that pixel satisfies (1), we take that pixel as the candidate pixel with skin color.</Paragraph> <Paragraph position="6"> d 2 > (I- M)Tv-I(I - M) (1) where value d is experimentally optinfized.</Paragraph> <Paragraph position="7"> The method we described above is not so accurate in some cases. Some extra, non-facial regions would also be extracted simultaneously.</Paragraph> <Paragraph position="8"> To achieve higher accuracies, we examine the distribution of (R + G + B) - (R - B), and draw border lines in order to contain more than 80% of the sample. We decide the triangle manually by observing the various output images. We extract the pixels which is in the triangle shown in Figure 4.</Paragraph> <Paragraph position="9"> Some results by this method is shown in Figure 5. As you can see, not only faces but also hands and other regions whose color is similar to skin color are also extracted. To elilninate these undesirable regions, we use a decision trees built by C5.0 as stated in 4.3.</Paragraph> </Section> <Section position="3" start_page="20" end_page="20" type="sub_section"> <SectionTitle> 4.3 Features of extracted regions </SectionTitle> <Paragraph position="0"> In this research, we use the following 17 features including the composition information of the whole image, in addition to the form and color of the region that is used with conventional image retrieval(Han and Myaeng, 1996).</Paragraph> <Paragraph position="1"> The following five features are used to'express the form of skin color region: 1) Ratio of region to the largest region, 2) Ratio between the length of X-axis direction and the length of Y-axis direction, 3) Rectangularity, 4) Ellipticity, 5) Eccentricity.</Paragraph> <Paragraph position="2"> The feature about the color is the followings: 6-9) Each of the mean of R, G, B and intensity Y.</Paragraph> <Paragraph position="3"> The following eight features are positional information of the region.</Paragraph> <Paragraph position="4"> 10) Aspect ratio of the whole image.</Paragraph> <Paragraph position="5"> 11,12) x,y coordinates of the center of gravity of the region.</Paragraph> <Paragraph position="6"> 13) Distance between the center of gravity of the region and the center of the whole image, normalized with a half of the length of the diagonM line of the image.</Paragraph> <Paragraph position="7"> 14) The order of the region in descending order of 13).</Paragraph> <Paragraph position="8"> 15) Distance between the center of gravity of the region and the center of the upper end of the whole image, normalized with the length from the center of the upper end to the left lower end (or the right lower end). 16) The order of the region in descending order of 15).</Paragraph> <Paragraph position="9"> 17) Suppose that the image is divided into 3 x 3 sub-areas. Which of these sub-areas the center of gravity is in.</Paragraph> <Paragraph position="10"> Using these 17 features extracted from trMning data as input features, we use C5.0 to learn decision trees, which extract candidates of common face with different certainties as described in section 3. The experhnental results will be shown in section 6.</Paragraph> </Section> </Section> <Section position="5" start_page="20" end_page="22" type="metho"> <SectionTitle> 5 Combining candidates from image </SectionTitle> <Paragraph position="0"> and language In this section, we describe the combining module whose inputs are the candidates extracted by the language module and the image module described in section 3 and 4, respectively. Its output is the result of the whole system.</Paragraph> <Section position="1" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 5.1 Input </SectionTitle> <Paragraph position="0"> As already said, since the language module and the image module process under hypothesis of &quot;one&quot;, &quot;two&quot;, or &quot;lnore than two&quot; persons, respectively, one module outputs three results according to these three hypotheses. Then outputs from both modules are expressed a.s follows: null (output of language ntodule ) (output of image module)</Paragraph> <Paragraph position="2"> Note that n and m are the number of the common persons adopted a.s the hypothesis, x and y are orders in ascending order of certainty about the person being common. The certainty of the decision in the language module and the image module is the confidence output by C5.0.</Paragraph> <Paragraph position="3"> For example, fl,,~a(n, 2) expresses the certainty of the person'who has the second highest certainty. Each output is something like the graph on Figure 6. In this figure, all of the extracted candidate names or faces are sorted in descending order of calculated certainties by distinct decision trees of the language module or the image module because the nmnber of the common person might be more than one. By introducing certainties, as later described, we obtain enormous flexibility in combining ca,ndidates from the language module and those from the image module.</Paragraph> </Section> <Section position="2" start_page="21" end_page="22" type="sub_section"> <SectionTitle> 5.2 Combination of hypotheses </SectionTitle> <Paragraph position="0"> Since the language module and the image module process under each of three hypotheses, there are 3 x 3 combinations of the results.</Paragraph> <Paragraph position="1"> This combining module selects the best pair from those combinations and outputs the results based on the selected pair. To select the best pair, we introduce some kinds of distance described as follows.</Paragraph> <Paragraph position="2"> Distance between outputs of two media The distance between the result of the image module and the result of the language module fu is defined by (4).</Paragraph> <Paragraph position="4"> where 114 is the maximum number of the persons known from the results of both modules. As you know from (4), the'nearer the certainties of the candidates from the language module and the image module which have the same order z are, the smaller the fu(n, m) is.</Paragraph> <Paragraph position="5"> Distance between output of media and hypothesis If there is difference between a hypothesis and the output calculated under the hypothesis, say ft~ng and fimag~, the hypothesis should not be considered to be valid. Therefore, we introduce the distance between the hypothesis and the output of the language module: ft~g or that of the image module: fimoee. A hypothesis of n common persons is defined in (5).</Paragraph> <Paragraph position="7"> where x is the order of certainty of candidates.</Paragraph> <Paragraph position="8"> Since each of the language module and the image module has its own hypothesis, the combining module calculates the distance fat defined by (6) between the hypothesis used in the la.nguage module and the result fl'om the language module. It also calculates the distance fai defined by (7) between the hypothesis used in the image module and the result from the image module.</Paragraph> <Paragraph position="10"> In the case that the hypothesis is &quot;more than two&quot;, the certMnty of candidates whose order is fourth or larger are ignored.</Paragraph> <Paragraph position="11"> Decreasing factor for each inconsistent hypothesis Different hypotheses of the language module and the image module indical~e inconsistency. However, since the analysis of each module is not perfect, our system does not exclude such inconsistent coml)inations of hypotheses. Instead, we decrease the certainty of such inconsistent combinations. For this, we use decreasing factors D(m,n) where n and m mean the hypothesized nmnber of person in the language module and the image module, respectively. We empirically tuned up the actual values of D(m, n) as shown in Table 1.</Paragraph> <Paragraph position="12"> Integration of the measures Using these three distances, namely fti, f~t and foi, and D(n,m) , the combining module finally calculates total certainty f(n,m) defined by (8) for each combination of hypotheses. The smaller the f(n, m) is, the nearer the result from the language module is the result of the image module.</Paragraph> <Paragraph position="14"/> </Section> <Section position="3" start_page="22" end_page="22" type="sub_section"> <SectionTitle> 5.3 Combining the results </SectionTitle> <Paragraph position="0"> When a combination which has the smallest f(n, m) has been selected, the results from the language module and the image module are fixed. The system combines these results into one result funion(n, m, z), where the person corresponding to z is expected to be a common person, funion(n, m, z) is the final output of the whole system. For this combining, we investigate two methods as follows. In (9), the consistency on the number of common persons is regarded as an important factor. On the other hand, in (10), when at least one of two module, namely time language module or the image module, assigns high certainty to a candidate person, the whole system finally assigns high certainty to the candidate person.</Paragraph> <Paragraph position="2"> The final outputs of whole system are something like these: &quot;John: common person (certainty: 0.8)&quot;, &quot;Paul: common person (certainty: 0.4)&quot; and so on. These results are used to find the face on the image if We specify a certain name in the text to retrieve his/her face image, or vice versa..</Paragraph> </Section> </Section> class="xml-element"></Paper>