<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2071">
  <Title>Discriminating image senses by clustering with multimodal features</Title>
  <Section position="5" start_page="548" end_page="550" type="metho">
    <SectionTitle>
3 Model
</SectionTitle>
    <Paragraph position="0"> Our goal is to provide a mapping between images and a set of iconographically coherent clusters for a given query word, in an unsupervised framework. Our approach involves extracting and weighting unordered bags-of-words (BOWs; henceforth) features from the webpage text, simple local and global features from the image, and running spectral clustering on top. Fig. 3 shows an overview of the implementation.</Paragraph>
    <Section position="1" start_page="549" end_page="549" type="sub_section">
      <SectionTitle>
3.1 Feature extraction
</SectionTitle>
      <Paragraph position="0"> Document and text filtering A pruning process was used to filter out image-document pairs based on e.g. language specification, exclusion of &amp;quot;Index of&amp;quot; pages, pages lacking an extractable target image, or a cutoff threshold of number of tokens in the body. For remaining documents, text was preprocessed (e.g. lower-casing, removing punctuation, tokens being very short, having numbers or no vowels, etc.). We used a stop word list, but avoided stemming to make the algorithm language independent in other respects. When using image features, grayscale images (no color histograms) and images without salient regions (no keypoints detected) were also removed.</Paragraph>
      <Paragraph position="1"> Text features We used the following BOWs: (a) tokens in the page body; (b) tokens in a +-10 window around the target image (if multiple, the first was considered); (c) tokens in a +-10 window around any instances of the query keyword (e.g.</Paragraph>
      <Paragraph position="2"> squash); (d) tokens of the target image's alt attribute; (e) tokens of the title tag; (f) some meta tokens.3 Tf-idf was applied to a weighted average of the BOWs. Webpage design is flexible, and some inconsistencies and a certain degree of noise remained in the text features.</Paragraph>
      <Paragraph position="3"> Image features Given the large variability in the retrieved image set for a given query, it is difficult to model images in an unsupervised fashion. Simple features have been shown to provide performance rivaling that of more elaborate models in object recognition (Csurka et al, 2004) and (Chapelle, Haffner, and Vapnik, 1999), and the following image bags of features were considered: Bags of keypoints: In order to obtain a compact representation of the textures of an image, patches are extracted automatically around interesting regions or keypoints in each image. The keypoint detection algorithm (Kadir and Brady, 2001) uses a saliency measure based on entropy to select regions. After extraction, keypoints were represented by a histogram of gradient magnitude of the pixel values in the region (SIFT) (Lowe, 2004).</Paragraph>
      <Paragraph position="4"> These descriptors were clustered using a Gaussian Mixture with [?] 300 components, and the resulting global patch codebook (i.e. histogram of codebook entries) was used as lookup table to assign each keypoint to a codebook entry.</Paragraph>
      <Paragraph position="5"> 3Adding to META content, keywords was an attribute, but is irregular. Embedded BODY pairs are rare; thus not used. Color histograms: Due to its similarity to how humans perceive color, HSV (hue, saturation, brightness) color space was used to bin pixel color values for each image. Eight bins were used per channel, obtaining an 83 dimensional vector.</Paragraph>
    </Section>
    <Section position="2" start_page="549" end_page="549" type="sub_section">
      <SectionTitle>
3.2 Measuring similarity between images
</SectionTitle>
      <Paragraph position="0"> For the BOWs text representation, we use the common measure of cosine similarity (cs) of two tf-idf vectors (Jurafsky and Martin, 2000). The cosine similarity measure is also appropriate for keypoint representation as it is also an unordered bag.</Paragraph>
      <Paragraph position="1"> There are several measures for histogram comparison (i.e. L1, kh2). As in (Fowlkes et al, 2004) we use the kh2 distance measure between histograms hi and hj.</Paragraph>
      <Paragraph position="3"/>
    </Section>
    <Section position="3" start_page="549" end_page="550" type="sub_section">
      <SectionTitle>
3.3 Spectral Clustering
</SectionTitle>
      <Paragraph position="0"> Spectral clustering is a powerful way to separate non-convex groups of data. Spectral methods for clustering are a family of algorithms that work by first constructing a pairwise-affinity matrix from the data, computing an eigendecomposition of the data, embedding the data into this low-dimensional manifold, and finally applying traditional clustering techniques (i.e. k-means) to it.</Paragraph>
      <Paragraph position="1"> Consider a graph with a set of n vertices each one representing an image document, and the edges of the graph represent the pairwise affinities between the vertices. Let W be an nxn symmetric matrix of pairwise affinities. We define these as the Gaussian-weighted distance</Paragraph>
      <Paragraph position="3"> (2) where {at,ak,ac} are scaling parameters for text, keypoints, and color features. It has been shown that the use of multiple eigenvectors of W is a valid space onto which the data can be embedded (Ng, Jordan, Weiss, 2002). In this space noise is reduced while the most significant affinities are preserved. After this, any traditional clustering algorithm can be applied in this new space to get the final clusters. Note that this is a nonlinear mapping of the original space. In particular, we employ a variant of k-means, which includes a selective step that is quasi-optimal in a Vector Quantization sense (Ueda and Nakano, 1994). It has the added advantage of being more  robust to initialization than traditional k-means. The algorithm follows, 1. For given documents, compute the affinity matrix W as defined in equation 2.</Paragraph>
      <Paragraph position="4">  2. Let D be a diagonal matrix whose (i,i)-th element is the sum of W's i-th row, and define L = D[?]1/2WD[?]1/2.</Paragraph>
      <Paragraph position="5"> 3. Find the k largest eigenvectors V of L.</Paragraph>
      <Paragraph position="6"> 4. Define E as V , with normalized rows.</Paragraph>
      <Paragraph position="7"> 5. Perform clustering on the columns of E,  which represent the embedding of each image into the new space, using a selective step as in (Ueda and Nakano, 1994).</Paragraph>
      <Paragraph position="8"> Why Spectral Clustering? Why apply a variant of k-means in the embedded space as opposed to the original feature space? The k-means algorithm cannot separate non-convex clusters. Furthermore, it is unable to cope with noisy dimensions (this is especially true in the case of the text data) and highly non-ellipsoid clusters. (Ng, Jordan, Weiss, 2002) stated that spectral clustering outperforms k-means not only on these high dimensional problems, but also in low-dimensional, multi-class data sets. Moreover, there are problems where Euclidean measures of distance required by k-means are not appropriate (for instance histograms), or others where there is not even a natural vector space representation. Also, spectral clustering provides a simple way of combining dissimilar vector spaces, like in this case text, keypoint and color features.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="550" end_page="552" type="metho">
    <SectionTitle>
4 Experiments and results
</SectionTitle>
    <Paragraph position="0"> In the first set of experiments, we used all features for clustering. We considered three levels of sense granularity: (1) all senses (All), (2) merging related senses with their corresponding core sense (Meta), (3) just the core senses (Core). For experiments (1) and (2), we used 40 clusters and all labeled images. For (3), we considered only images labeled with core senses, and thus reduced the number of clusters to 20 for a more fair comparison. Results were evaluated according to global cluster purity, cf. Equation 3.4  for 5 runs with different initializations. For each keyword, the table lists the number of senses, median, and range of global cluster purity, followed by the baseline. All senses used the full set of sense labels and 40 clusters. Meta senses merged core senses with their respective related senses, considering all images and using 40 clusters. Core senses were clustered into 20 clusters, using only images labeled with core sense labels. Purity was stable across runs, and peaked for Core. The baseline reflected the frequency of the most common sense.  sense images were grouped into 20 clusters, on the basis of individual feature types, and global cluster purity was measured. The table lists the median and range from 5 runs with different initializations. Img included just image features; TxtWin local tokens in a +-10 window around the target image anchor; BodyTxt global tokens in the page BODY; and Baseline uses the most common sense. Text performed better than image features, and global text appeared better than local. All features performed above the baseline.</Paragraph>
    <Paragraph position="1"> Median and range results are reported for five runs, given each condition, comparing against the baseline (i.e. choosing the most common sense).</Paragraph>
    <Paragraph position="2"> Table 2 shows that purity was surprisingly good, stable across query terms, and that it was highest when only core sense data was considered. In addition, purity tended to be slightly higher for BASS, which may be related to the annotator being less confident about its fine-grained sense distinctions, and thus less strict for assigning core sense labels for this query term.5 In addition, we looked  for all senses was 0.67, and for meta senses 0.83. Not all clusters were as pure as this one; global purity for all 40 cluster was 0.49. This cluster appeared to show some iconography; mostly standing cranes. Interestingly, another cluster contained several images of flying cranes. Most weighted tokens: cranes whooping birds wildlife species. Table 1 has sense labels.  Individual cluster purity for all senses was 0.5, and for meta senses 1.0. Global purity for all 40 cluster was 0.52. This cluster both shows visually coherent images, and a sensible meta semantic field. Most weighted tokens: chayote calabaza add bitter cup. Presumably, some tokens reflect the vegetable's use within the cooking domain. sense data based on a particular feature. Table 3 shows that global text features were most informative (although not homogenously), but also that each feature type performed better than the base-line in isolation. This indicates that an optimal feature combination may improve over current performance, using manually selected parameters. In addition, purity is not the whole story. Figs. 4 and 5 show examples of two selected interesting clusters obtained for CRANE and SQUASH, respectively, using combined image and text features and all individual senses.6 Inspection of image clusters indicated that image features, both in isolation and when used in combination, appeared to con- null that further exploring image features may be vital for attaining more subtle iconographic senses.</Paragraph>
    <Paragraph position="3"> Moreover, as discussed in the introduction, images are not necessarily anchored in the immediate text which they refer to. This could explain why local text features do not perform as well as global ones. Lastly, in addition, Fig. 6 shows an example of a partial cluster where the algorithm inferred a specific related sense.</Paragraph>
    <Paragraph position="4"> We also experimented with different number of clusters for BASS. The results are in Table 4, lacking a clear trend, with comparable variation to different initializations. This is surprising, since we would expect purity to increase with number of  number of clusters (5 runs each with distinct initializations), and recorded median and range of global purity for all six senses of the query term, and for the four meta senses, without a clear trend.</Paragraph>
    <Paragraph position="5"> clusters (Sch&amp;quot;utze, 1998), but may be due to the spectral clustering. Inspection showed that 6 clusters were dominated by core senses, whereas with 40 clusters a few were also dominated by RELATED senses or PEOPLE. No cluster was dominated by an UNRELATED label, which makes sense since semantic linkage should be absent between unrelated items.</Paragraph>
  </Section>
  <Section position="7" start_page="552" end_page="553" type="metho">
    <SectionTitle>
5 Comparison to previous work
</SectionTitle>
    <Paragraph position="0"> Space does not allow a complete review of the WSD literature. (Yarowsky, 1995) demonstrated that semi-supervised WSD could be successful.</Paragraph>
    <Paragraph position="1"> (Sch&amp;quot;utze, 1998) and (Lin and Pantel, 2002a, b) show that clustering methods are helpful in this area.</Paragraph>
    <Paragraph position="2"> While ISD has received less attention, image categorization has been approached previously by adding text features. For example, (Frankel, Swain, and Athitsos, 1996)'s WebSeer system attempted to mutually distinguish photos, handdrawn, and computer-drawn images, using a combination of HTML markup, web page text, and image information. (Yanai and Barnard, 2005) found that adding text features could benefit identifying relevant web images. Using text-annotated images (i.e. images annotated with relevant keywords), (Barnard and Forsyth, 2001) clustered them exploring a semantic hierarchy; similarly (Barnard, Duygulu, and Forsyth, 2002) conducted art clustering, and (Barnard and Johnson, 2005) used text-annotated images to improve WSD. The latter paper obtained best results when combining text and image features, but contrary to our findings, image features performed better in isolation than just text. They did use a larger set of image features and segmentation, however, we suspect that differences can rather be attributed to corpus type. In fact, (Yanai, Shirahatti, and Barnard, 2005) noted that human evaluators rated images obtained via a keyword retrieval method higher compared to image-based retrieval methods, which they relate to the importance of semantics for what humans regard as matching, and because pictorial semantics is hard to detect.</Paragraph>
    <Paragraph position="3"> (Cai et al, 2004) use similar methods to rank visual search results. While their work does not focus explicitly on sense and does not provide in-depth discussion of visual sense phenomena, these do appear in, for example, figs. 7 and 9 of their paper. An interesting aspect of their work is the use of page layout segmentation to associate text with images in web documents. Unfortunately, the au- null thors only provide an illustrative query example, and no numerical evaluation, making any comparison difficult. (Wang et al, 2004) use similar features with the goal to improve image retrieval through similarity propagation, querying specific web sites. (Fuji and Ishikawa, 2005) deal with image ambiguity for establishing an online multimedia encyclopedia, but their method does not integrate image features, and appears to depend on previous encyclopedic background knowledge, limited to a domain set.</Paragraph>
  </Section>
class="xml-element"></Paper>