File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/p06-2071_intro.xml
Size: 6,650 bytes
Last Modified: 2025-10-06 14:03:40
<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2071"> <Title>Discriminating image senses by clustering with multimodal features</Title> <Section position="4" start_page="547" end_page="548" type="intro"> <SectionTitle> 2 Data and annotation </SectionTitle> <Paragraph position="0"> Yahoo!'s image query API was used to obtain a corpus of pairs of semantically ambiguous images, in thumbnail and true size, and their corresponding web sites for three ambiguous keywords inspired by (Yarowsky, 1995): BASS, CRANE, and SQUASH. We apply query augmentation (cf. Table 1), and exact duplicates were filtered out by identical image URLs, but cases occurred where both thumbnail and true-size image were included.</Paragraph> <Paragraph position="1"> Also, some images shared the same webpage or came from the same site. Generally, the latter gives important information about shared discourse topic, however the images do not necessarily depict the same sense (e.g. a CRANE bird vs. a meadow), and image features can separate them into different clusters.</Paragraph> <Paragraph position="2"> Annotation overview The images were annotated with one of several labels by one of the authors out of context (without considering the web site and its text), after applying text-based filtering (cf. section 3.1). For annotation purposes, images were numbered and displayed on a web page in thumbnail size. In case the thumbnail was not sufficient for disambiguation, the image linked at its true size to the thumbnail was inspected.2 The true-size view depended on the size of the original picture and showed the image and its name.</Paragraph> <Paragraph position="3"> However, the annotator tried to resist name influence, and make judgements based just on the image. For each query, 2 to 4 core word senses (e.g.</Paragraph> <Paragraph position="4"> squash vegetable and squash sport for SQUASH) were distinguished from inspecting the data. However, because &quot;context&quot; was restricted to the image content, and there was no guarantee that the image actually depicts the query term, additional annotator senses were introduced. Thus, for most core senses, a RELATED label was included, accounting for meanings that seemed related to core meaning but lacked a core sense object in the image. Some examples for RELATED senses are in Fig. 1. In addition, for each query term, a PEOPLE label was included because such images are common due to the nature of how people take pictures (e.g. portraits of persons or group pictures of crowds, when core or related senses did not apply), as was an 2We noticed a few cases where Yahoo! retrieved a thumbnail image different from the true size image.</Paragraph> <Paragraph position="5"> 1. fish 35% any fish, people holding catch 2. musical instrument 28% any bass-looking instrument, playing 3. related: fish 10% fishing (gear, boats, farms), rel. food, rel. charts/maps 4. related: musical instrument 8% speakers, accessories, works, chords, rel. music 5. unrelated 12% miscellaneous (above senses not applicable) 6. people 7% faces, crowd (above senses not applicable) CRANE (2650) 5: crane, construction cranes, whooping crane, sandhill crane, origami cranes 1. machine 21% machine crane, incl. panoramas 2. bird 26% crane bird or chick 3. origami 4% origami bird 4. related: machine 11% other machinery, construction, motor, steering, seat 5. related: bird 11% egg, other birds, wildlife, insects, hunting, rel. maps/charts 6. related: origami 1% origami shapes (stars, pigs), paper folding 7. 
<Paragraph position="5"> For a human annotator, even when using these more natural word senses, assigning sense labels to images based on the image alone is more challenging and subjective than labeling word senses in textual context. First of all, the annotation is heavily dependent on domain knowledge, and it is not feasible for a layperson to recognize fine-grained semantics. For example, it is straightforward for a layperson to distinguish between a robin and a crane, but determining whether a given fish should have the common name bass applied to it, or whether an instrument is indeed a bass instrument, is extremely difficult (see Fig. 2; e.g. deciding whether a picture of a fish fillet is a picture of a fish is tricky). Furthermore, most images display objects only partially, for example just the neck of a classical double bass instead of the whole instrument. In addition, scaling, proportions, and components are key cues for object discrimination in real life, e.g. for singling out an electric bass from an electric guitar, but an image may not provide these details. Thus, senses are even fuzzier for ISD than for WSD labeling. Given that laypeople are in the majority, it is fair to assume their perspective and naivety. This also led to the annotations' level of specificity differing according to the search term. Annotation criteria depended on the keyword term, its senses, and their coverage, as shown in Table 1. Nevertheless, several borderline cases for label assignment occurred. Considering that the annotation task is quite subjective, this is to be expected. In fact, one person's labeling often appears as justifiable as a contradicting label provided by another person. We explore the vagueness and subjective nature of image annotation further in a companion paper (Alm, Loeff, and Forsyth, 2006).</Paragraph>
</Section>
</Paper>