Towards a Framework for Learning Structured Shape Models from
Text-Annotated Images
Sven Wachsmuth^{1,2}, Suzanne Stevenson^1, Sven Dickinson^1
^1 University of Toronto, Dept. of Computer Science, Toronto, ON, Canada
^2 Bielefeld University, Faculty of Technology, 33594 Bielefeld, Germany
{swachsmu,suzanne,sven}@cs.toronto.edu
Abstract
We present on-going work on the topic of learn-
ing translation models between image data and
text (English) captions. Most approaches to
this problem assume a one-to-one or a flat, one-
to-many mapping between a segmented image
region and a word. However, this assump-
tion is very restrictive from the computer vi-
sion standpoint, and fails to account for two
important properties of image segmentation: 1)
objects often consist of multiple parts, each
captured by an individual region; and 2) indi-
vidual regions are often over-segmented into
multiple subregions. Moreover, this assump-
tion also fails to capture the structural rela-
tions among words, e.g., part/whole relations.
We outline a general framework that accommo-
dates a many-to-many mapping between im-
age regions and words, allowing for struc-
tured descriptions on both sides. In this paper,
we describe our extensions to the probabilis-
tic translation model of Brown et al. (1993) (as
in Duygulu et al. (2002)) that enable the cre-
ation of structured models of image objects.
We demonstrate our work in progress, in which
a set of annotated images is used to derive a set
of labeled, structured descriptions in the pres-
ence of oversegmentation.
1 Introduction
Researchers in computer vision and computational lin-
guistics have similar goals in their desire to automati-
cally associate semantic information with the visual or
linguistic representations they extract from an image or
text. Given paired image and text data, one approach
0Wachsmuth is supported by the German Research Founda-
tion (DFG). Stevenson and Dickinson gratefully acknowledge
the support of NSERC of Canada.
is to use the visual and linguistic representations as im-
plicit semantics for each other—that is, using the words
as names for the visual features, and using the image ob-
jects as referents for the words in the text (cf. Roy, 2002).
The goal of our work is to automatically acquire struc-
tured object models from image data associated with text,
at the same time learning an assignment of text labels for
objects as well as for their subparts (and, in the long run,
also for collections of objects).
Multimodal datasets that contain both images and text
are ubiquitous, including annotated medical images and
the Corel dataset, not to mention the World Wide Web,
allowing the possibility of associating textual and visual
information in this way. For example, if a web crawler
encountered many images containing a particular shape,
and also found that the word chair was contained in the
captions of those images, it might associate the shape
with the word chair, simultaneously indicating a name
for the shape and a visual “definition” for the word. Such
a framework could then learn the class names for a set
of shape classes, effectively yielding a translation model
between image shapes (or more generally, features) and
words (Duygulu et al., 2002). This translation model
could then be used to answer many types of queries, in-
cluding labeling a new image in terms of its visible ob-
jects, or generating a visual prototype for a given class
name. Furthermore, since figure captions (or, in general,
image annotations) may contain words for entire objects,
as well as words for their component parts, a natural se-
mantic hierarchy may emerge from the words. For exam-
ple, just as tables in the image may be composed of “leg”
image parts, the word leg can be associated with the word
table in a part-whole relation.
Others have explored the problem of learning
associations between image regions (or features)
and text, including Barnard and Forsyth (2001),
Duygulu et al. (2002), Blei and Jordan (2002), and
Cascia et al. (1998). As impressive as the results are,
these approaches make limiting assumptions that prevent
them from being appropriate to our goals of a structured
object model. On the vision side, each segmented region
is mapped one-to-one or one-to-many to words. Concep-
tually, associating a word with only one region prevents
an appropriate treatment of objects with parts, since such
objects may consistently be region-segmented into a
collection of regions corresponding to those components.
Practically, even putting aside the goal of part-whole
processing, any given region may be (incorrectly)
oversegmented into a set of subregions (that are not
component parts) in real images. Barnard et al. (2003)
propose a ranking scheme for potential merges of regions
based on a model of word-region association, but do
not address the creation of a structured object model
from sequences of merges. To address these issues, we
propose a more elaborate translation/association model
in which we use the text of the image captions to guide
us in structuring the regions.
On the language side of this task, words have typi-
cally been treated individually with no semantic struc-
ture among them (though see Roy, 2002, which induces
syntactic structure among the words). Multiple words
may be assigned as the label to a region, but there is
no knowledge of the relations among the words (and
in fact they may be treated as interchangeable labels,
Duygulu et al., 2002). The more restrictive goal of image
labeling has put the focus on the image as the (structured)
object. In contrast, we take the approach, in principle, of
building a structured hierarchy for both the image objects and
their text labels. In this way, we aim not only to use the
words to help guide us in how to interpret image regions,
but also to use the image structure to help us induce a
part/whole hierarchy among the words. For example, as-
sume we find consistently associated leg and top regions
together referred to as a table. Then instead of treating
leg and table, e.g., as two labels for the same object, we
could capture the image part-whole structure as word re-
lations in our lexicon.
Our goal of inducing associated structured hierarchies
of visual and linguistic descriptions is a long-term one,
and this paper reports on our work thus far. We start with
the probabilistic translation model of Brown et al. (1993)
(as in Duygulu et al., 2002), and extend it to structured
shape descriptions of visual data. As alluded to earlier,
we distinguish between two types of structured shape de-
scriptions: collections of regions that should be merged
due to oversegmentation versus collections of regions that
represent components of an object. To handle both types,
we incorporate into our algorithm several region merge
operations that iteratively evaluate potential merges in
terms of their improvement to the translation model.
These operations can exploit probabilities over region
adjacency, thus constraining the potential combinatorial
explosion of possible region merges. We also permit a
many-to-many mapping between regions and words, in
support of our goal of inducing structured text as well,
although here we report only on the structured image
model, assuming similar mechanisms will be useful on
the text side.
We are currently developing a system to demonstrate
our proposal. The input to the system is a set of images
segmented into regions organized into a region adjacency
graph. Nodes in the graph encode the qualitative shape of
a region using a shock graph (Siddiqi et al., 1999), while
undirected edges represent region adjacency (used to con-
strain possible merges). On the text side, each image has
an associated caption which is processed by a part-of-
speech tagger (Brill, 1994) and chunker (Abney, 1991).
The result is a set of noun phrases (nouns with associated
modifiers) which may or may not pertain to image con-
tent. The output of the system is a set of many-to-many
(possibly structured) associations between image regions
and text words.
This paper represents work in progress, and not all the
components have been fully integrated. Initially, we have
focused on the issues of building the structured image
models. We demonstrate the ideas on a set of annotated
synthetic scenes with both multi-part objects and over-
segmented objects/parts. The results show that at least
on simple scenes, the model can cope with oversegmen-
tation and converge to a set of meaningful many-to-many
(regions to words) mappings.
2 Visual Shape Description
In order to learn structured visual representations, we
must be able to make meaningful generalizations over
image regions that are sufficiently similar to be treated
as equivalent. The key lies in determining categorical
shape classes whose definitions are invariant to within-
class shape deformation, color, texture, and part articula-
tion. In previous work, we have explored various generic
shape representations, and their application to generic ob-
ject recognition (Siddiqi et al., 1999; Shokoufandeh et al.,
2002) and content-based image retrieval (Dickinson et
al., 1998). Here we draw on our previous work, and adopt
a view-based 3-D shape representation, called a shock
graph, that is invariant to minor shape deformation, part
articulation, translation, rotation, and scale, along with
minor rotation in depth.
The vision component consists of a number of
steps. First, the image is segmented into regions, us-
ing the mean-shift region segmentation algorithm of
Comaniciu and Meer (1997). (The results presented in Section 4.2
are based on a synthetic region segmentation. When working with
real images, we plan to use the mean-shift algorithm, although any
region segmentation algorithm could conceivably be used.) The
result is a region adjacency graph, in which nodes represent homogeneous
Figure 1: The Shock Graph Qualitative Shape Represen-
tation: (a) the taxonomy of qualitative shape parts; (b) the
computed shock points of a 2-D closed contour; and (c)
the resulting shock graph.
regions, and edges capture region adjacency. The param-
eters of the segmentation algorithm can be set so that it
typically errs on the side of oversegmentation (regions
may be broken into fragments), although undersegmen-
tation is still possible (regions may be merged incorrectly
with their neighbors). Next, the qualitative shape of each
region is encoded by its shock graph (Siddiqi et al., 1999),
in which nodes represent clusters of skeleton points that
share the same qualitative radius function, and edges rep-
resent adjacent clusters (directed from larger to smaller
average radii). As shown in Figure 1(a), the radius func-
tion may be: 1) monotonically increasing, reflecting a
bump or protrusion; 2) a local minimum, monotonically
increasing on either side of the minimum, reflecting a
neck-like structure; 3) constant, reflecting an elongated
structure; or 4) a local maximum, reflecting a disk-like or
blob-like structure. An example of a 2-D shape, along
with its corresponding shock graph, is shown in Fig-
ures 1(b) and (c).
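The four qualitative classes above can be read directly off samples of the radius function along a skeleton branch. The following is a minimal sketch of that classification (our own illustration, not the authors' implementation; the function name and the tolerance parameter are assumptions):

```python
# Sketch: map a sampled radius function along a medial-axis branch to one of
# the four qualitative shock types of Siddiqi et al. (1999).
def shock_type(radii, tol=1e-6):
    """Return 1-4 for a monotone, neck, constant, or disk-like radius profile."""
    diffs = [b - a for a, b in zip(radii, radii[1:])]
    if all(abs(d) <= tol for d in diffs):
        return 3                       # constant radius: elongated structure
    if all(d > tol for d in diffs) or all(d < -tol for d in diffs):
        return 1                       # monotone: bump or protrusion
    i = radii.index(min(radii))
    if 0 < i < len(radii) - 1:
        # decreasing then increasing around an interior minimum: a neck
        if all(d < tol for d in diffs[:i]) and all(d > -tol for d in diffs[i:]):
            return 2
    return 4                           # interior maximum: disk- or blob-like
```

A real implementation would operate on clustered skeleton points rather than raw samples, but the case analysis is the same.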
The set of all regions from all training images is clus-
tered according to a distance function that measures the
similarity of two shock graphs in terms of their structure
and their node attributes. As mentioned above, the key
requirement of our shape representation and distance is
that it be invariant to both within-class shape deforma-
tion as well as image transformation.

Figure 2: Generic Shape Matching

We have developed
a matching algorithm for 2-D shape recognition. As illus-
trated in Figure 2, the matcher can compute shock graph
correspondence between different exemplars belonging
to the same class.
During training, regions are compared to region
(shape) class prototypes. If the distance to a prototype is
small, the region is added to the class, and the prototype
recomputed as that region whose sum distance to all other
class members is minimum. However, if the distance to
the nearest prototype is large, a new class and prototype
are created from the region. Using the region adjacency
graph, we can also calculate the probability that two pro-
totypes are adjacent in an image. This is typically a very
large, yet sparse, matrix.
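The incremental clustering just described can be sketched as follows. This is a simplified stand-in (our own names): `dist` abstracts the shock-graph distance function, and the medoid update mirrors the prototype recomputation in the text:

```python
# Sketch of greedy prototype clustering: assign a region to the nearest
# class if close enough, else open a new class; the prototype is the medoid.
def cluster(regions, dist, threshold):
    """Return a list of [prototype, members] classes."""
    classes = []
    for r in regions:
        if classes:
            best = min(classes, key=lambda c: dist(r, c[0]))
            if dist(r, best[0]) <= threshold:
                best[1].append(r)
                # re-elect the medoid: the member whose summed distance
                # to all other class members is minimum
                best[0] = min(best[1],
                              key=lambda m: sum(dist(m, o) for o in best[1]))
                continue
        classes.append([r, [r]])       # distance too large: new class
    return classes
```

With the region adjacency graph, co-occurrence counts over the resulting classes give the (sparse) prototype-adjacency probabilities mentioned above.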
3 Learning of Translation Models
The learning of translation models from a corpus of bilin-
gual text has been extensively studied in computational
linguistics. Probabilistic translation models generally
seek to find the translation string e that maximizes the
probability Pr(e|f), given the source string f (where f referred
to French and e to English in the original work,
Brown et al., 1993). Using Bayes rule and maximizing
the numerator, the following equation is obtained:

    ê = argmax_e Pr(f|e) Pr(e)    (1)

The application of Bayes rule incorporates Pr(e) into the
formula, which takes into account the probability that ê is
a correct English string.
Pr(f|e) is known as the translation model (prediction
of f from e), and Pr(e) as the language model (probabilities
over e independent of f). Like others (Duygulu et
al., 2002), we will concentrate on the translation model;
taking f as the words in the text and e as the regions in the
images, we thus predict words from image regions. However,
we see the omission of the language model component,
Pr(e) (in our case, probabilities over the "language"
of images, i.e., over "good" region associations), as a
shortcoming. Indeed, as we see below, we insert some
simple aspects of a "language model" into our current
formulation, i.e., using the region adjacency graph to
restrict possible merges, and using the a priori probability
of a region Pr(r) if translating from words to regions. In
future work, we plan to elaborate the Pr(e) component
more thoroughly.
Data sparseness prevents the direct estimation of
Pr(f|e) (which predicts one complete sequence of symbols
from another), so practical translation models must
make independence assumptions to reduce the number of
parameters to be estimated. The first model of
Brown et al. (1993), which will be used and expanded in
our initial formulation, uses the following approximation
to Pr(f|e):

    Pr(f|e) = Σ_a Pr(M) Π_{j=1...M} Pr(a_j|L) Pr(f_j | e_{a_j})    (2)

where M is the number of French words in f, L is the
number of English words in e, and a is an alignment
that maps each French word to one of the English words,
or to the "null" word e_0. Pr(M) = ε is constant and
Pr(a_j|L) = 1/(L+1) depends only on the number of English
words. The conditional probability of f_j depends
only on its own alignment to an English word, and not on
the translation of other words f_i. These assumptions lead
to the following formulation, in which t(f_j | e_{a_j}) defines a
translation table from English words to French words:

    Pr(f|e) = ε / (L+1)^M  Π_{j=1...M} Σ_{a_j=0...L} t(f_j | e_{a_j})    (3)
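For concreteness, the translation table t(f|e) of Eqns. (2)-(3) is standardly estimated by EM. The sketch below is our own illustration with assumed names (`None` plays the role of the null word e_0); it shows one EM pass over a parallel corpus and the likelihood of Eqn. (3):

```python
# Sketch: one EM pass for IBM Model 1, plus the Eqn. (3) likelihood.
from collections import defaultdict

def em_step(pairs, t):
    """pairs: list of (f_words, e_words); t: dict (f, e) -> probability."""
    count = defaultdict(float)
    total = defaultdict(float)
    for f, e in pairs:
        e = [None] + e                         # include the null word e_0
        for fj in f:
            z = sum(t[(fj, ei)] for ei in e)   # normalize over alignments
            for ei in e:
                c = t[(fj, ei)] / z            # expected alignment count
                count[(fj, ei)] += c
                total[ei] += c
    return {k: v / total[k[1]] for k, v in count.items()}  # re-estimate t

def likelihood(f, e, t, eps=1.0):
    """Pr(f|e) of Eqn. (3): eps/(L+1)^M * prod_j sum_a t(f_j|e_a)."""
    e = [None] + e
    p = eps / len(e) ** len(f)
    for fj in f:
        p *= sum(t[(fj, ei)] for ei in e)
    return p
```

Iterating `em_step` from a uniform table concentrates probability on consistently co-occurring pairs, which is the behavior our region/word setting relies on.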
To learn such a translation between image objects and
text passages, it is necessary to: 1) Define the vocabu-
lary of image objects; 2) Extract this vocabulary from
an image; 3) Extract text that describes an image ob-
ject; 4) Deal with multiple word descriptions of ob-
jects; and 5) Deal with compound objects consisting of
parts. Duygulu et al. (2002) assume that all words (more
specifically, all nouns) are possible names of objects.
Each segmented region in an image is characterized by
a 33-dimensional feature vector. The vocabulary of im-
age objects is defined by a vector quantization of this
feature space. In the translation model of Brown et al.,
Duygulu et al. (2002) substitute the French string f by the
sequence w of caption words, and the English string e
by the sequence r of regions extracted from the image
(which they refer to as blobs, b). They do not consider
multiple word sequences describing an image object, nor
image objects that consist of multiple regions (overseg-
mentations or component parts).
In section 2 we argued that many object categories are
better characterized by generic shape descriptions rather
than finite sets of appearance-based features. However,
in moving to a shape-based representation, we need to
deal with image objects consisting of multiple regions
(cf. Barnard et al., 2003). We distinguish three different
types of multiple region sets:
1. Type A (accidental): Region over-segmentation due
to illumination effects or exemplar-specific mark-
ings on the object that results in a collection of sub-
regions that is not generic to the object’s class.
2. Type P (parts): Region over-segmentation common
to many exemplars of a given class that results in
a collection of subregions that may represent mean-
ingful parts of the object class. In this case, it is
assumed that on some occasions, the object is seen
as a silhouette, with no over-segmentation into parts.
3. Type C (compound): Objects that are always seg-
mented into their parts (e.g., due to differently col-
ored or textured parts). This type is similar to Type
P, except that these objects never appear as a whole
silhouette. (Our mechanism for dealing with these
objects will also allow us, in the future, to handle
conventional collections of objects, such as a set of
chairs with a table.)
We can extend the one-to-one translation model in
Eqn. (3) above by grouping or merging symbols (in this
case, regions) and then treating the group as a new sym-
bol to be aligned. Theoretically, then, multiple regions
can be handled in the same translation framework, by
adding to the sequence of regions in each image, the re-
gions resulting from all possible merges of image regions:
    Pr(w|r) = ε / (L̃+1)^M  Π_{j=1...M} Σ_{a_j=0...L̃} t(w_j | r_{a_j})    (4)

where L̃ denotes the total number of segmented and
merged regions in an image. However, in practice this
causes complexity and stability problems; the number of
possible merges may be intractable, while the number of
semantically meaningful merges is quite small.
Motivated by the three types of multiple region sets
described above, we have instead developed an iterative
bootstrapping strategy that filters hypothetically mean-
ingful merges and adds these to the data set. Our method
proceeds as follows:
1. As in Duygulu et al., we calculate a translation
model t_0(w|r) between words and regions, using a
data set of N image/caption pairs D = {(w_d, r_d) | d =
1...N}. r_d initially includes a region for each segmented
region in image d.

2. We next account for accidental over-segmentations
(Type A above) by adding all merges to the data set
that increase the score based on the old translation
model:

    score(D_{i+1}) = Π_{(w,r) ∈ D_{i+1}} P(w | r; t_i(w|r))    (5)
That is, we use the current translation model to de-
termine whether to merge any two adjacent regions
into a new region. If the quality of the translation is
improved by the merge, we add the new region to r.
If the dataset was extended by any number of new
regions, the algorithm starts again with step 1 and
recalculates the translation model.
3. We then account for regular over-segmentation
(Type P above) by extending the number of regions
merged for adjacent region sets—i.e., merges are no
longer restricted to be pairwise. In this step, though,
only sets of regions that frequently appear together
in images are candidates for merging. Again, those
that increase the score are iteratively added to the
data set until the data set is stable.
4. For compound objects (Type C above), the score cri-
terion does not apply because the silhouette of the
merged structure does not appear in the rest of the
data set. Since the current translation model has
no information about the whole object, merging the
component regions cannot increase the quality of the
translation model.
Instead, we develop a new scoring criterion, based
on Melamed (1997). First, the current translation
model is used to induce translation links between
words and regions, and the mutual information of
words and regions is calculated, using the link
counts for the joint distribution. Next, the increase
in mutual information is estimated for a hypothetical
data set D′ in which the regions of potential compounds
are merged. If a compound contributes to an
increase in mutual information in D′, then the merge
is added to our data set.
5. The sequence of steps above is repeated until no new
regions are added to the data set.
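The control flow of steps 1-3 and 5 above can be summarized as follows. This is an illustrative skeleton, not our actual system: `train`, `score`, `candidates`, and `apply_merge` are placeholders for the translation-model estimation, the Eqn. (5) criterion, the adjacency-constrained merge proposals, and region union, respectively:

```python
# Skeleton of the iterative merge-bootstrapping strategy (steps 1-3 and 5).
def bootstrap(data, train, score, candidates, apply_merge):
    t = train(data)                      # step 1: initial model t_0(w|r)
    stable = False
    while not stable:                    # step 5: repeat until no new regions
        stable = True
        new_data = []
        for d in data:                   # steps 2/3: evaluate region merges
            improved = True
            while improved:              # re-propose after each accepted merge
                improved = False
                for m in candidates(d):  # adjacency-constrained proposals
                    merged = apply_merge(d, m)
                    if score(merged, t) > score(d, t):
                        d = merged       # keep merges that improve Eqn. (5)
                        stable = False
                        improved = True
                        break            # proposals for the old d are stale
            new_data.append(d)
        data = new_data
        if not stable:
            t = train(data)              # recalculate the translation model
    return t, data
```

The inner `while` re-derives merge proposals after every accepted merge, so stale candidates are never applied to an already-extended record.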
In our algorithm above, we mapped our three ap-
proaches to dealing with region merges to the three types
of multiple region sets identified earlier (Types A, P, C).
Indeed, each step in the algorithm is inspired by the cor-
responding type of region set; however, each step may
apply to other types. For example, in a given data set, the
legs of a table may only infrequently be segmented into
separate regions, so that a merge to form a table may oc-
cur in step 2 (Type A) instead of step 3 (Type P). Thus,
the actual application of steps 2–4 depends on the precise
make-up of regions and their frequencies in the data set.
In our demonstration system reported next, step 3 of
the algorithm is currently applied without considering
how frequently a region pair appears. It iteratively
generates three pairwise merges, with the output restricted
to those that yield a shape seen before. We expect
that considering only frequent shape pairs will stabilize
merging effects and reduce computational complexity when
merge operations are more expensive than they are on the
synthetic dataset. Our implementation of step 4 is at an
early stage and currently considers combinations of any two
regions, whether adjacent or not. This causes problems for
images with more than one object or additional background
shapes.
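The step-4 criterion can be illustrated with link counts: mutual information between words and regions is computed from the induced translation links, and a compound merge is accepted when it raises this value on the hypothetical data set. A minimal sketch (our own, in the spirit of Melamed, 1997; the data structures are assumptions):

```python
# Sketch: mutual information of words and regions from translation links,
# using link counts as the joint distribution.
from math import log

def mutual_information(links):
    """links: list of (word, region) translation links (one-best alignments)."""
    n = len(links)
    pw, pr, pwr = {}, {}, {}
    for w, r in links:
        pw[w] = pw.get(w, 0) + 1
        pr[r] = pr.get(r, 0) + 1
        pwr[(w, r)] = pwr.get((w, r), 0) + 1
    # I(W;R) = sum over linked pairs of p(w,r) log[p(w,r) / (p(w) p(r))]
    return sum(c / n * log((c / n) / ((pw[w] / n) * (pr[r] / n)))
               for (w, r), c in pwr.items())
```

Merging, say, the `leg` and `top` regions of a table into one compound region makes the word-region links less ambiguous and so raises this quantity.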
4 Demonstration
4.1 Scene Generation
As this paper represents work in progress, we have
only tested our model on synthetic scenes with captions.
Scenes contain objects composed of parts drawn from a
small vocabulary of eight shapes, numbered 1–8, in Fig-
ure 3. (Our shapes are specified in terms of qualitative re-
lationships over lines and curves; precise angle measure-
ment is not important, Dickinson et al., 1992.) To simu-
late undersegmentation, primitive parts may be grouped
into larger regions; for example, an object composed of
three parts may appear as a single silhouette, represent-
ing the union of the three constituent parts. To simulate
oversegmentation, four of the shape primitives (1, 5, 6, 8)
can appear according to a finite set of oversegmentation
models, as shown in Figure 3. To add ambiguity, over-
segmentation may yield a subshape matching one of the
shape categories (e.g., primitive shape 5, the trapezoidal
shape, can be decomposed into shapes 1 and 4) or, al-
ternatively, matching a subshape arising from a different
oversegmentation. For example, the shape in the bottom
right of Figure 3 is decomposed into two parts, one of
which (25, representing two parallel lines bridged at one
end by a concave curve and at the other end by a line) oc-
curs in a different oversegmentation model (in this case,
the oversegmentation shown immediately above it).
Scenes are generated containing one or two objects,
drawn from a database of six objects (two chairs, two ta-
bles, and two lamps, differing in their primitive decom-
positions), shown in Figure 4. Given an object model,
a decomposition grammar (i.e., a set of rewrite rules) is
automatically generated that takes the silhouette of the
shape and decomposes it into pieces that are either: 1)
unions of the object’s primitive parts, representing an un-
dersegmentation of the object; 2) the object’s primitive
parts; or 3) oversegmentations of the object’s primitive
parts. In addition, the scene can contain up to four back-
ground shapes, drawn from Figure 3. These shapes in-
troduce ambiguity in the mapping from words to objects
in the scene, and can participate in merges of regions in
our algorithm. Finally, each scene has an associated text
caption that contains one word for each database object,
which specifies either the name of the whole object (ta-
ble/stand, chair/stool, lamp/light), or a part of the ob-
Figure 3: Top: The eight primitive shapes used to con-
struct objects in the scene. Below: The various ways
in which four of the shapes (1, 5, 6, 8) can be overseg-
mented.
ject (base or leg). Just as the scene contains background
shapes, the caption may contain up to four “background”
words that have nothing to do with the objects (or primi-
tive parts) in the database.
We have developed a parameterized, synthetic scene
generator that uses the derived rules to automatically gen-
erate scenes with varying degrees of undersegmentation,
oversegmentation, ambiguous background objects, and
extraneous caption words. Although no substitute for
testing the model on real images, it has the advantage of
allowing us to analyze the behavior of the framework as
a function of these underlying parameters. Examples of
input scenes it produces are shown in Figure 5.
4.2 Experimental Results
The first experiment we report here (Exp. 1) tests our abil-
ity to learn a translation model in the presence of Type
A and Type P segmentation errors. We generated 1000
scenes with the following parameters: 1 or 2 objects per
image, forced oversegmentation to a depth of 4, maxi-
mum 4 background shapes, one relevant word (part or
whole descriptor), and maximum 2 meaningless random
words per image. Table 1 shows the translation tables
(Pr(w|r)) for this dataset, stopping the algorithm after
step 1 (no merging) and after step 3. For all of the ob-
jects, the merging step increased the probability of one
word, and decreased the probability of the others, creat-
ing a stronger word-shape association. For 5 of the ob-
jects, the highest probability word is a correct identifier
of the object (stand, chair, stool, light, lamp), and for the
[Figure 4 appears here; its six objects are labeled with the associated words: table; leg / table, stand; base / chair; leg / chair, stool; base / lamp; base / light, lamp; base]
Figure 4: The database of six objects and associated
words from which scenes are generated. Shape parts are
labeled according to Figure 3.
[Figure 5 appears here; the example captions include words such as leg, lamp, light, table, base, cup, and phone.]
Figure 5: Examples of scenes input to our system.
other object, a word indicating a part of the object has
high probability (leg for the first table object).
Although increasing the strength of one probability has
an advantage, we need to explore ways to allow associa-
tion of more than one “whole object” word (such as lamp
and light) with a single object (cf. Duygulu et al., 2002).
Since we maintain the component regions of a merged
region, having both a part and a whole word, such as leg
and table, associated with the same image is not a prob-
lem. Incorporating these into a structured word hierarchy
should help to focus associations appropriately.
Another way to view the data is to see which shapes
are most consistently associated with the meaningful
words in the captions. Here we calculate P(r|w) by
Pr(w|r) Pr(r), with the latter normalized over all shapes.
A problem with this formulation is that, due to the Pr(r)
component, high-frequency shapes can increase the probability
of primitive components. However, the merging
steps (2 and 3) of our algorithm raise the frequencies of
complex (multi-region) shapes. Table 2 shows the five
shapes with the highest values for each meaningful word,
again before and after the merging steps in Exp. 1.
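The ranking used for Table 2 amounts to scoring each shape r for a word w by Pr(w|r) Pr(r), with Pr(r) normalized over all shapes. A small sketch with assumed data structures (a translation table keyed by (word, region) and raw shape counts):

```python
# Sketch: rank shapes for a word by P(r|w) proportional to Pr(w|r) * Pr(r).
def rank_shapes(word, t, shape_counts, top=5):
    """t: dict (word, region) -> Pr(w|r); shape_counts: region -> frequency."""
    n = sum(shape_counts.values())
    scores = {r: t.get((word, r), 0.0) * c / n   # Pr(w|r) * normalized Pr(r)
              for r, c in shape_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top]
```

Note how the Pr(r) factor lets a frequent primitive shape outrank a rarer but better-fitting compound shape, which is exactly the effect the merging steps counteract by raising the frequencies of multi-region shapes.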
word    obj 1  obj 2  obj 3  obj 4  obj 5  obj 6
-        0.00   0.00   0.00   0.00   0.00   0.00
table    0.07   0.01   0.00   0.00   0.00   0.00
stand    0.00   0.98   0.00   0.00   0.00   0.00
chair    0.01   0.00   0.65   0.01   0.00   0.00
stool    0.00   0.00   0.00   0.13   0.00   0.00
lamp     0.00   0.00   0.00   0.00   0.41   0.80
light    0.00   0.00   0.00   0.00   0.33   0.00
leg      0.81   0.00   0.30   0.00   0.01   0.00
base     0.00   0.01   0.00   0.28   0.03   0.10
(a) before merging (step 1)
- 0.00 0.00 0.00 0.00 0.00 0.00
table 0.00 0.00 0.00 0.00 0.00 0.00
stand 0.00 1.00 0.00 0.00 0.00 0.00
chair 0.00 0.00 0.98 0.00 0.00 0.00
stool 0.00 0.00 0.00 0.41 0.00 0.00
lamp 0.00 0.00 0.00 0.00 0.03 0.99
light 0.00 0.00 0.00 0.00 0.97 0.00
leg 1.00 0.00 0.01 0.00 0.00 0.00
base 0.00 0.00 0.00 0.00 0.00 0.00
(b) after merging (steps 2/3)
Table 1: Exp. 1: Translation tables from shapes to words, P(w|r), for the 6 object silhouettes.
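The P(w|r) entries in such a table are estimated by EM over region–word co-occurrences in the captioned images. A minimal sketch in the spirit of the Brown et al. (1993) Model 1 translation model (as applied by Duygulu et al. (2002)); the region labels, captions, and function name below are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

def model1_em(pairs, iters=20):
    """IBM Model 1-style EM: estimate t(w|r) from parallel
    (regions, words) pairs. Illustrative sketch only."""
    t = defaultdict(lambda: 1.0)  # uniform (unnormalized) init of t(w|r)
    for _ in range(iters):
        count = defaultdict(float)  # expected counts c(w, r)
        total = defaultdict(float)  # marginal counts per region r
        for regions, words in pairs:
            for w in words:
                z = sum(t[(w, r)] for r in regions)  # normalizer over regions
                for r in regions:
                    p = t[(w, r)] / z  # posterior alignment probability
                    count[(w, r)] += p
                    total[r] += p
        # M-step: renormalize expected counts into t(w|r)
        t = defaultdict(float,
                        {(w, r): c / total[r] for (w, r), c in count.items()})
    return t

# toy captioned "images": segmented region labels paired with caption words
pairs = [(["leg", "top"], ["table", "leg"]),
         (["leg", "top"], ["table"]),
         (["seat", "leg"], ["chair", "leg"])]
t = model1_em(pairs)
# EM concentrates "table" on the region co-occurring only with it ("top")
```

The co-occurrence asymmetry drives the estimates apart from their uniform start, exactly the effect visible between Table 1(a) and (b).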
word   P(r|w) for the top five shapes
table  0.70  0.14  0.09  0.04  0.03
stand  0.55  0.33  0.13  0.00  0.00
chair  0.44  0.27  0.21  0.07  0.02
stool  0.32  0.31  0.24  0.06  0.06
lamp   0.64  0.10  0.08  0.07  0.07
light  0.28  0.23  0.20  0.20  0.09
leg    0.84  0.15  0.01  0.00  0.00
base   0.28  0.23  0.11  0.11  0.09
[the shape icons (merged regions) and primitive-shape numbers heading each column are not recoverable from the extraction]
Table 3: Exp. 2: The five shapes with highest P(r|w) for the meaningful words, after step 4. Shape icons (for merged regions) or primitive shapes (indicated by number) have the probability for that word listed below.
eral complex shapes increase in probability after merging,
and a number of new complex shapes appear in the lists.
We report on one other experiment (Exp. 2) which was
designed to test our approach to handling oversegmenta-
tions of Type C in step 4 of our algorithm. Our dataset
again had 1000 images; here there was only one object
per image, but every object was oversegmented into its
primitive parts (that is, an object never appeared as a
complete silhouette). (We did not allow oversegmenta-
tion of the primitives here, nor did we include irrelevant
words in the captions.) Because our 6 objects never ap-
pear “whole,” steps 2 and 3 of our algorithm cannot ap-
ply; before step 4, words are associated with primitive
shapes only. After step 4, the highest probability word
(P(w|r)) for 4 of the objects is a correct identifier of the
object (stand, chair, stool, light); for one object, a word
indicating a part of the object had high probability (leg for
the rectangular table). (One object silhouette—the sec-
ond lamp—was not fully reconstructed.) Table 3 shows
the five shapes with the highest P(r|w) values for each
meaningful word, after step 4. For 3 of the whole ob-
ject words (stand, stool, light), and both part words (leg,
base), the best shape is a correct one. For the remain-
ing whole object words (table, chair, lamp), a correct full
silhouette is one of the top five. Step 4 clearly has high
potential for reconstructing objects that are consistently
oversegmented into their parts.
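Reading off a Table 3-style ranking from a learned P(w|r) table amounts to Bayes inversion. A minimal sketch, assuming a uniform prior over shapes (an assumption; the paper does not state its prior) and using values from Table 1(b); the names obj1..obj6 are placeholders for the silhouette icons:

```python
def rank_shapes(p_w_given_r, p_r, word, k=5):
    """Rank shapes by P(r|w), proportional to P(w|r) * P(r), normalized over shapes."""
    scores = {r: probs.get(word, 0.0) * p_r[r]
              for r, probs in p_w_given_r.items()}
    z = sum(scores.values()) or 1.0  # avoid division by zero for unseen words
    return sorted(((s / z, r) for r, s in scores.items()), reverse=True)[:k]

# nonzero P(w|r) columns of Table 1(b), one dict per object silhouette
p_w_given_r = {
    "obj1": {"leg": 1.00},
    "obj2": {"stand": 1.00},
    "obj3": {"chair": 0.98, "leg": 0.01},
    "obj4": {"stool": 0.41},
    "obj5": {"lamp": 0.03, "light": 0.97},
    "obj6": {"lamp": 0.99},
}
p_r = {r: 1 / 6 for r in p_w_given_r}  # uniform shape prior (assumed)

top = rank_shapes(p_w_given_r, p_r, "lamp")
# the sixth silhouette dominates the ranking for "lamp"
```

With these numbers, "lamp" concentrates almost all of its P(r|w) mass on the shape whose P(lamp|r) is 0.99, mirroring how the top-five lists in Table 3 are read.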
5 Conclusions
We have outlined a framework for the creation of asso-
ciated visual and linguistic structured models, from im-
ages annotated with textual captions. Thus far, we have
focused on the important open problem of dealing with
oversegmentation in images. We have developed a set of
extensions to a probabilistic translation model (Brown et
al., 1993) that enable us to successfully merge overseg-
mented regions into coherent objects. Our initial exper-
iments on synthetic data demonstrate that our algorithm
can learn a useful translation model between image ob-
jects and words, even in the presence of substantial over-
segmentation. We are currently experimenting with vari-
ous parameters in our synthetic scene generator to guide
further development of the algorithm, as well as experi-
menting on real data from the Web.

References
Steven Abney. 1991. Parsing by Chunks. In Robert
Berwick, Steven Abney, and Carol Tenny, editors,
Principle-Based Parsing. Kluwer.
Kobus Barnard and David Forsyth. 2001. Learning the
Semantics of Words and Pictures. In Proc. of Int. Conf.
on Computer Vision (ICCV-2001), pages 408–415.
Kobus Barnard, Pinar Duygulu, Raghavendra Guru,
Prasad Gabbur, and David Forsyth. 2003. The ef-
fects of segmentation and feature choice in a transla-
tion model of object recognition. In Proc. of Computer
Vision and Pattern Recognition, to appear.
David M. Blei and Michael I. Jordan. 2002. Modeling
Annotated Data. Technical report, Computer Science
Division, University of California, Berkeley, USA.
Eric Brill. 1994. Some advances in transformation-based
part of speech tagging. In Proceedings of the 12th Na-
tional Conference on Artificial Intelligence, volume 1,
pages 722–727, Menlo Park, CA, USA. AAAI Press.
P. Brown, S.A. Della Pietra, V.J. Della Pietra, and R.L.
Mercer. 1993. The mathematics of statistical machine
translation: Parameter estimation. Computational Lin-
guistics, 19(2):263–311.
Marco La Cascia, Saratendu Sethi, and Stan Sclaroff.
1998. Combining Textual and Visual Cues for
Content-based Image Retrieval on the World Wide
Web. In Proc. of IEEE Workshop on Content-based
Access of Image and Video Libraries, June.
D. Comaniciu and P. Meer. 1997. Robust analysis of
feature spaces: Color image segmentation. In IEEE
Computer Society Conference on Computer Vision and
Pattern Recognition, pages 750–755.
S. Dickinson, A. Pentland, and A. Rosenfeld. 1992.
3-D shape recovery using distributed aspect matching.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 14(2):174–198.
S. Dickinson, A. Pentland, and S. Stevenson. 1998.
Viewpoint-invariant indexing for content-based image
retrieval. In IEEE International Workshop on Content-
based Access of Image and Video Databases, Bombay.
P. Duygulu, Kobus Barnard, J.F.G. de Freitas, and D.A.
Forsyth. 2002. Object Recognition as Machine Trans-
lation: Learning a Lexicon for a Fixed Image Vocab-
ulary. In Proc. of European Conference on Computer
Vision (ECCV-2002), volume 4, pages 97–112.
I. Dan Melamed. 1997. Automatic Discovery of Non-
Compositional Compounds in Parallel Data. In 2nd
Conference on Empirical Methods in Natural Lan-
guage Processing (EMNLP’97), Providence, RI.
Deb Roy. 2002. Learning visually grounded words and
syntax of natural spoken language. Evolution of Com-
munication, 4(1).
A. Shokoufandeh, S. Dickinson, C. Jonsson, L. Bret-
zner, and T. Lindeberg. 2002. The representation and
matching of qualitative shape at multiple scales. In
Proceedings, ECCV, pages 759–775, Copenhagen.
K. Siddiqi, A. Shokoufandeh, S. Dickinson, and
S. Zucker. 1999. Shock graphs and shape matching.
International Journal of Computer Vision, 30:1–24.
