The Problem of Naming Shapes: 
Vision-Language Interface 
by 
R. Baj csy* 
and 
A.K. Joshi ~= 
Computer and Information Science Department 
University of Pennsylvania 
Philadelphia, PA 19104 
i. Introduction 
In this paper, we willpose more questions 
than present solutions. We want to raise some 
questions in the context of the representation of 
shapes of 3-D objects. One way to get a handle on 
this problem is to investigate whether labels of 
shapes and their acquisition reveals any structure 
of attributes or components of shapes that might 
be used for representation purposes. Another 
aspect of the puzzle of representation is the 
question whether the information is to be stored 
in analog or propositional form, and at what level 
this transformation from analog to propositional 
form takes place. 
In general, shape of a 3-D compact object has 
two aspects: the surface aspect, and the volume 
" The surface aspect includes properties 
ncavity, convexity, planarity of surfaces, 
edges, and corners. The volume aspect distin- 
guishes objects with holes from those without 
(topological properties), and describes objects 
with respect to their sy~netry planes and axes, 
relative proportions, etc. 
We will discuss some questions pertinent to 
representation of a shape of a 3-D compact object, 
without holes, for example: Is the surface 
aspect more important than the volume aspect? 
Are there any shape primitives? In what form 
are shape attributes stored?, etc. We shall 
extensively draw from psychological and psycho- 
linguistic literature, as well as from the recent 
AI activities in this area. 
2. Surface and Volume 
In this section, we will investigate the 
relationship between the surface aspect and 
the volume aspect from the developmental point 
of view and from the needs of a recognition 
process. By doing so, we hope to learn about 
the representation of shapes. Later, we will 
examine the naming process for shapes and its re- 
lation to representation. 
* This work has been supported under NSF Grant 
#MCS76-19465 and NSF Grant #MCS76-19466. 
There is evidence that a silhouette of an ob- 
ject, that is its boundary with respect to the 
background, is the determining factor for the rec- 
ognition of the object (Rock 1975, Zusne 1970). 
If we accept the above hypotheses then the fact 
that the silhouette is a projected outline of the 
3-D object implies that the recognition of the 3-D 
object at first is reduced to the recognition of a 
2-D outline. This is not entirely true, however, 
as Gibson (Gibson 1950) has argued. According to 
Gibson's theory, the primitives of form perception 
are gradients of various variables as opposed to 
the absolute values of these variables. From this 
follows the emphasis on perceiving the surface 
first and the perception of the outline only falls 
out as a consequence of discontinuities of the 
surface with respect to the background. 
We are pursuaded by Gibson's argument and re- 
gard the recognition process as starting with sur- 
face properties; Miller and Johnson-Laird (Miller & 
Johnson-Laird 1976) have suggested some surface 
predicates as possible primitives, such as convex, 
concave, planar, edge, and corner. The 2-D out- 
line is furthermore analyzed as a whole according 
to the Gestalist and some salient features 
(Pragantz) are detected faster and more frequently 
than others (Koffka 1935, Goldmeir 1972, Rosh 
1973); such pragmatic features are for example, 
rectangularity, symmetry, regularity, parallelness, 
and rectilinearity. 
Piaget also argues (Paiget, Inhelder 1956) 
from the developmental point of view that children 
first learn to recognize surfaces and their out- 
lines, and only later, after an ability to compose 
multiple views of the same object has been devel- 
oped, they can form a concept of its volume. 
Volume representation becomes essential as 
soon as there is motion of the object or of the 
observer. Note that the salient features of 2-D 
shapes are invariant under transformations such as 
rotation, translation, expansion and shrinking. 
Features with a similar property must be found in 
the 3-D space for the volume representation. We 
feel that the most important feature is sy~netry. 
Clark's work seem to support this (Clark 1975); 
he shows that in language space as in the percep- 
tual space, we have 3 prlmar~ planes of reference: 
ground level ; vertical: left-rlght; vertical: 
front-back. While the ground level is not a sym- 
metry plane, the two vertical ones are sy~netry 
157 
planes. The fact that the ground level is not a 
sy~netry plane is supported by the experiments of 
Rock (Rock 1973), who has shown that some familiar 
shapes are hard to recognize with 180 ° rotation 
with respect to the ground level. After a careful 
examination of the relevant literature to date, we 
find that there is a claim that we ean recognize 
shapes via some features which are ndre salient 
than others. But does it follow from this that 
shape is an independent attribute like color, or 
is it a derived concept from other features? 
In an effort to answer this question, we set 
out to examine labels of shapes in the hope that 
if there are any shape primitives (other than 
angles, edges, paralleLness, and the like) then 
they may show up in labels describing more complex 
shapes. One inmediate observation we can make is 
that thereare very few names which only describe 
a shape, such as triangle or rectangle. More 
conmenly, label of a shape is derived from the 
label objects which have such a typical shape, for 
example, letter-like shapes (V, L, X), cross-like 
shape, pear-like shape, heart-like shape, etc. A 
special category of labels are well defined geo- 
metric objects, such as circle, ellipse, sphere, 
torus, etc. The question is whether we store for 
every shape a template or whether there are any 
con~non primitives from which we can describe dif- 
ferent shapes. 
In addition to the 2-D features mentioned 
earlier, primarily 2-D features, we do use 3-D 
shape descriptions (primitives) such as: round, 
having 3 syn~netryplanes and all the syn~netryaxes 
approximately of the same length, elongated, 
where the size in one dimension is much longer 
than thetwo remaining, thin, where the size of 
one dimension is much smaller than the other, 
etc. Note that many of these descriptions are 
vague, though often there more accurate shape 
labels available; for example, cone stands for 
an elongated object with two sy~net-ry planes, 
a circular crossection, and sides tapering 
evenly up to a point, called appex (Webster's 
dicitionary). 
We believe that there are some descrip- 
tions of shapes which aremore primitive than 
others; for example, round, elongated, thin, 
flat, circular, planar, etc., as opposed to 
heart-like, star-like, and so on. As pointed 
out earlier, these latter~ descriptors are 
derived from the names of previously recognized 
objects. When we use these descriptions during 
a recognition process we do not necessarily 
match exactly all features of the template 
shape to the recognized shape, but rather we 
depict some characteristic properties we 
associate wi~h the given label, and only these 
are matched during the recognition process. 
In this sense, we approximate the real data to our 
model and primitives. The labels which encompass 
andre complex structure of these properties (like 
cone, heart, star, etc.) when they are used in 
describing other shapes, are used as economical 
shorthand expressions for the complexity that 
these shapes represent. (This appears to be re- 
lated to the eodability notion of Chafe (Chafe 
1975)). 
3. Analog and Propositional Representation 
In this section, we will discuss certain 
issues concerning the form of the stored informa- 
tion, necessary not only for recognition purposes 
(matching the perceived data with a stored model) 
but also for recall, and introspection of images. 
There are two questions: 
i. At which level the analog inform/tion is con- 
verted to propositional (verbal or non-verbal) 
and after this conversion, is the analog in- 
formation retained? 
2. How much of the propositional information is 
procedural and how much structural? 
For simplicity, we will regard analog infor- 
mation in our context as picture points, or retina 
points. Any further labeling, of a point or of a 
cluster of points, such as an edge, line, region, 
etc. leads to derived entities by one criterion or 
another and therefore may be regarded as proposi- 
tional.* 
At this point, it is appropriate to point out 
that any such unit as an edge, line or region can 
be described in at least two different ways; one 
is structural or organizational, and the other is 
parametric or dimensional. Structural information 
refers to the organization of perceptual elements 
into groups. Figure-ground and part-whole rela- 
tionships are paradigm examples of structural in- 
formation. Parametric information refers to the 
continuous values of the stimulus along various 
perceivable dimensions. Color, size, position, 
orientation, and sy~netry, are some examples of 
parametric information. 
We are not advocating that these two types of 
information are independent (cf. Palmer 1975). It 
is, for example, a well known experience that by 
changing drastically one dimension (one parameter) 
of an object (say a box), one can change the 
structure of the object (in this case, it becomes 
a wall-like object). However, we do wish to keep 
the distinction between structural and parametric 
information. The importance of this distinction 
is that while structural information is inherently 
discrete and propositional, parametric information, 
is both holistic (integral) and atomic (separable). 
The fact that parametric information is separable 
is quite obvious if we just recognize that differ- 
parameters represent clearly distinguishable dif- 
ferent aspects of the visual information. For 
example, color, size, position, etc. On the other 
hand all these parameters are represented 
holistically in an image, and can be separated 
only by feature (parameter) extraction procedures 
(Palmer 1975). 
Parametric information is separable; however, 
the question is whether each parameter-feature 
The distinction is not really as sharp as 
stated here. One way to make the distinction 
is to look at the closeness with which a trans- 
formation of a representation parallels the 
transformation of the object renresented. The 
closer it is the more analog the representation 
is. 
158 
has continuous or discrete values. Continuous 
values would imply some retainment of analog in- 
formation (Kossylyn 1977), while discrete values 
would not. Opponents of the discrete value rep- 
resentationargue that a) the number of primitives 
needed would be astronomical, and b) the number of 
potential relationships between primitives would 
be also very large (Fishier 1977). This is 
further supported byexperiments on recall of 
mental images (Kosslyn, Shwartz 1977) where these 
images appear in continuous-analog fashion. 
Another similar argument in favor of analog rep- 
resentation is the experiment of comparing objects 
with respect to some of their parameters, like 
size, or experiments on mental rotation (Shepard, 
Metzler 1971). 
Pylyshyn (Pylyshyn 1977) cautiously argues 
against the analog representation for the same 
object viewed under different conditions as a 
result of the semantic interpretation function 
(SIF). The SIF will extract only those invariances 
characteristic for the object in a given situation, 
and thus reduce the number of possible discrete 
values and their range for a given parameter. The 
invariances are determined by laws of physics and 
optics, and by the context, i.e., the object sizes 
will remain fixed as they move, the smaller ob- 
jects will partially occlude the larger object, 
etc. 
We would like to propose a discrete value 
representation for parametric information with an 
associated interpolation function (sampling is an 
inverse of interpolation) and a clustering pr O- 
cedure. During the recognition process, a 
clustering procedure is evoked in order to cate- 
gorize a parameter while during an image recall 
an interpolation procedure is applied to generate 
the continuous data. Our model seems not to 
contradict Kosslyn's findings, that is we assume 
as he does, that the deep representation of an 
image consists of stored facts about the image 
in a propositional format. Facts include infor- 
mation about: 
a) How and where a part is attached to the whole 
objeet. 
b) How to find a given part. 
e) A name of the category that the object belongs 
to. 
d) The range of the size of the object, which 
implies the resolution necessary to see the 
object or part. 
e) The name of the file which contains visual 
features that the object is composed of 
(corners, edges, curvature descriptions of 
edge segments, their relationships, etc. ). 
The only place where we differ from Kossyln's 
model is in the details of the perceptual memory. 
While his perceptual memory contains coordinates 
for every point, our perceptual memory has iden- 
tified and stored clusters of these points, like 
corners, edges, lines, etc. From these features 
and the interpolation procedure, we create the 
continuous image. This is very much in the spirit 
of a constructive vision theory as proposed by 
Kosslyn and others. A similar argument can be 
used for preserving continuity in transformation 
of images, such as rotation (Shepard, Metzler 
1971) and expansion (Kosslyn 1975, 1976). The 
contraction process is the inverse of expansion 
and therefore will envoke the sampling routine 
instead of the interpolation routine. The problem 
of too many discrete values and their relation- 
ships, as stated by Fishier, is taken care of by 
the fact that for each parameter there is an asso- 
ciated range with only a few categories such as 
small, medium, and large. As pointed out by 
Pylyshyn, it is the range of parameters which is 
context dependent and thus differs from situation 
to situation. This view also offers some explana- 
tion that often incomplete figures are perceived 
as whole. 
We also want to postulate that analog infor- 
mation, as we specified it, is not retained, and 
if there are ambiguities due to the inadequacy of 
the input data, a new set of data is inputed. 
This is supported by several psychological experi- 
ments, for example, by asking-people to recognize 
a building where they work from accurate drawings 
and sloppy pictures (Norman 1975). The over- 
whelming evidence is that people prefer a sloppy 
picture to the more accurate one, for recognizing 
their own building. Even the experiment of 
Averbach and Sperling (Averbach and Sperling 1968) 
concerning the visual short memory after 1/20 see 
exposure to letters does not contradict our hy- 
pothesis that we maintain in this ease, edges 
rather than picture points, although it allows the 
other interpretation as well. 
We now turn to the second question. Since 
propositional information can be represented by 
an equivalent procedure (giving a true or a f&ise 
value), the question of propositional information 
vs struetural information can be replaced by the 
question: What are the necessary procedures that 
have to be performed during a recognition process 
and what type of data they require? Clearly, the 
parametric information is derived pronedurally. 
There are well defined procedures for finding 
color, size, orientation, etc. The part-whole 
relationship as well as the instance relationship 
clearly have to be structurally represented 
(Miller and Johnson-Laird 1977). 
While the structural information is derived 
from symbolic - propositional data and the trans- 
formations performed are, for example, reductions, 
and expansions, the parametric information is 
derived from the perceptual data and the transfor- 
mations performed are more like measurements, 
detections, and geometric transformations. 
In the context of 3-D shape representation 
we believe in a combination of procedural - para- 
metric and propositional nodes organized in a 
structure. Take an example of representing a 
shape of a human. We have the part-whole rela- 
tionship: head, neck, torso, arms, legs, etc. 
Head has parts: eye, nose, mouth, etc. These 
concepts are propositional - symbolic. From the 
shape point of view, however, head is round, neck 
is short and wide elongated blob, the arms and 
legs are elongated and the torso is elongated but 
wide. Although these labels correspong to 2-D 
as well as 3-D shape, there is a mechanism: pro- 
jection transformation which transforms elongated 
3-D into elongated 2-D shape. In any case, rotund, 
159 
elongated, wide, short, are procedures - tests 
whether an object is round, elongated, etc.- We 
know that round (circle) in 2-D corresponds to 
sphere in 3-D, elongated (rectangle, or ellipse) 
to a polyhedra or cylinder, or ellipsoid. 
When we view only one view of a scene or a 
photograph, we analyse the 2-D outline. However, 
when we have more than one view at our disposal 
or when we are asked to make 3-D interpretation 
then we reach from the 2-D information to corre- 
sponding 3-D representation. This is the time 
when volume primitives like sphere, cylinder, and 
their like come into play. These primitives do 
not seem to be explicit (we do not say a shape of 
a man is a sphere attached to several cylinders) 
in the representation. Rather what is in the 
shape representation are the feature primitives, 
(like the sy~netryplanes, the ratio of syn~netry 
axis) attached to other pointers, which point also, 
if appropriate, to labels like sphere, cylinder, 
flat object, polyhedron, etc. These labels are in 
turn used for shortening a complex description. 
An implementation of a 3-D shape deeomposition 
and labelling system is under development (Bajcsy, 
Soroka 1977). Earlier we have experimented with 
a partially ordered structure as means to repre- 
sent 2-D shape (Tidhar 1974, Bajcsy, Tidhar 1977) 
in recognition of outdoor landscapes (Sloan 1977) 
and in the context of natural language under- 
standing (Joshi and Rosenschein (1975), Rosenschein 
(75)). 
Note that not always are we able to describe 
a shape as a composition of some volume primitives 
like sphere, cylinder, or a flat object. As an 
example in the case is a shape of a heart. A 
heart has 2 sy~netryplanes and it is roughly 
round, but its typical features are the two 
corners centered, one, concave and the other 
convex connected by a convex smooth surface. Here 
clearly, any attempt to describe this shape, by 
two ellipsoids or some other 'primitive' is 
artificial. Thus, the representation will have 
only feature primitives but no volume primitives. 
Of course, there are cases that fall between. 
As an example, consider a kidney shape where one 
can say it is an ellipsoid with a concavity on 
one side. 
What are the implications from all of this? 
i. We do not measure or extract spheres, cylin- 
ders and their like as primitives, but rather 
we measure convexity, eoneavity, planar, 
corners, symmetry planes, which are primitive 
features. 
2. These features form different structures to 
which are attached different but in general, 
not independent labels. 
3. While these structures represent explicit con- 
ceptual relationships, the nodes are either 
labels or procedures with discrete values 
denoting, in general, N-ary relations. 
4. Conclusions 
In this paper, we have considered the fol= 
lowing problems: 
i. How mueh of analog information is retained 
during recognition process and at which level 
the transformation from analog to propositional 
takes place? 
2. How much of the information stored is pro- 
cedural (implicit) and structural (explicit) 
form? 
3. What are the primitives for two dimensional 
and three dimensional shapes? 
4. How is the labelling of shapes effected by the 
way the shapes are represented?- By studying 
the shape labels can we hope to learn some- 
thing about the internal representation of 
shapes? 
Clearly, these four questions are intimately 
related to the general problem: representation of 
three dimensional objects. 
We are led to the following conclusions. Our 
conclusions are derived primarily on the basis of 
our experience in constructing 2-D and 3-D recog- 
nition systems and the study of the relevant psyco- 
logicaland psycholinguistic literature. 
i. Analog information is not retained even in a 
short term memory. 
2. Our experience and the analysis of the relevant 
literature leads us to be in favor of the con- 
structuve vision theory. The visual informa- 
tion is represented as structures, with nodes 
which are either unary or n-ary predicates. 
The structures denote conceptual relationships 
such as part-whole, class inclusion, cause- 
effect, ete. 
3. The shape primitives are on the level of prim- 
itive features rather than primitive shapes. 
By primitive features we mean, corners, con- 
vex, concave and planar surfaces and their 
like. 
4. The labels of shapes, except in a few special 
cases, do not describe any shape properties 
and are derived from objects associated with 
that shape. 
5. In order to preserve continuity, we need inter- 
polation procedures. We assume that several 
such procedures exist, for example, clustering 
mechanisms, sampling p~ocedures, perspective 
transformations, rotation, etc. These are 
available as a general mechanisms for image 
processing. 
We certainly have not offered complete solu- 
tions to all the issues diseussed above, but we 
hope that we have raised several valid questions 
and suggested some approaches. 
References 
i. Averbach, E., and Sperling, G.: Short-Term 
Storage of Information in Vision in: Con- 
temporary Theoryand Researeh in Visual Per- ~ 
, (ed.) R.N. Haber, NY, Holt, Rinehart 
ston, Ine. 1968 
2. Bajcsy, R., and Soroka, B.: Steps towards the 
Representation of Complex Three-Dimensional 
Objects, Proceedings o n Int. Artificial Intel- 
ligemce Conference, Boston, August ig77. 
160 
3. 
4. 
5. 
6. 
7. 
8. 
9. 
i0. 
ii. 
12. 
13. 
14. 
15. 
16. 
17. 
Bajcsy, R., and Tidhar, A.: Using a Structured 
World Model inFlexible Recognition of Two 
Dimensional Patterns, Pattern Recognition Vol. 
9, pp. 1-10, 1977. 
Clark, E.V.: What's in a Word? On the 
Child's Acquisition of Semantics in His First 
language, in: CognitiveDevelopmentand the 
Acquisition of Imnguage, (ed.) T.E. Moore, 
Academic Press, NY 1973, pp. 65-110. 
Clark, H.L.: Space, Time Semantics, and the 
Child, in: Cognitive Development and the 
Acquisition of language, (ed.) T.E. Moore, 
Academic Press, NY 1973 pp. 27-63. 
Chafe, W.L.: Creativity in Verbalization as 
Evidence for Analogic Knowledge, Proc. on 
Theoretical Issues in Natural language Pro- 
cessing, Cambridge, June 1975 pp. 144Z145. 
Fishler, M.A.: On the Representation of 
Natural Scenes, Advanced Papers for The Work- 
shop on Computer Vision Systems, Univ. of 
Massachusetts, June 1977, Amherst. 
Gibson, J.J.: The Perception of the Visual 
World, Boston, MA, Houghton, 1950. 
Goldmeir, E.: Similarity inVisually Per- 
ceived Forms, Psychological Issues 8, 1972, 
No. 1 pp. 1-135. 
Koffka, K.: Principles of Gestalt Psychology, 
New York, Harcourt, Brace 1935. 
Kosslyn, S.M.: Information Representation in 
Visual Images, Cognitive Psychology\[, pp. 
341-370, 1975. 
Kosslyn, S.M.: Can Imagery Be Distinguished 
from Other Forms of Internal Representation? 
Evidence from Studies of Information Retriev- 
al Times, Memory& Cognition Vol. 4, 1976, 
No. 3, pp. 291-297. 
Kosslyn, S.M., and Shwar~z, S.P.: Visual 
Images as Spatial Representations inActive 
Memory, in: Machine Visions, (eds.) E.M. 
Riseman g A.R. Hanson, NYAcademic Press 
(in press) 1978. 
Miller, A., and Johnson-~, P.N.: 
language and PercepTion, Harvard Univ. Press, 
Cambridge, MA1976. 
Norman, D.A., and Bobrow, D.G. : On the Role 
of Active Memory Processes in Perception and 
Cognition, in: C.N. Cofer (ed.) The 
Structure of Human Memory, San Francisco, 
W.H. Freeman, 1975. 
Palmer, S.E.: 'The Nature of Perceptual Rep- 
resentation: An examination of the Analog/ 
Propositional Contraversy, Proc. on Theoret- 
ical Issues in Natural Language Processing, 
Cambridge, June 1975 pp. 151-159. 
Piaget, J., and Inhelder, B.; 
Conception o_~fSpace, New York: 
Press, 1956. 
The Child's 
Humanities 
18. 
19. 
20. 
21. 
22. 
23. 
24. 
25. 
26. 
27. 
Pylshyn, Z.W.: Representation of Knowledge: 
Non-Linguistic Forms, Proc. on Theoretical 
Issues in Natural Languase Process i~, 
Camb-~dge, June 1975 pp. 160-163. 
Rock, I.: Orientation and Form, Academic 
Press, Inc. Ny 1973. 
Rock, I.: An__ Introduction t oPerception, 
MacMillan Publ. Co., NY 1975. 
Rosh, E.H.: 0nthe Internal Structure of 
Perceptual and Semantic Categories, in: 
Cognitive Development and the Acquisition of 
language. (ed.) T.E. Moore, Academic Press, 
NY 1973, pp. 111-144. 
Shepard, R.N., and Metzler, J.: Mental 
Rotation of Three-Dimensional Objects, 
Science, 171, 1971, pp. 701-703. 
Tidhar, A.: Using a Structured World Model 
in Flexible Recognition of Two Dimensional 
Pattern, Moore School Tech. Report No. 75-02, 
Univ. of Pennsylv--~, Philadelp~a~ 1974. 
Zusne, L. : Visual Perception of Form, 
Academic Press, 1970, NY and London. 
Sloan, K. : World Model Driven Recognition 
of Natural Scenes, Ph.D. Dissertation, 
Computer Science Department, University of 
Pennsylvania, Philadelphia, June 1977. 
Joshi, A.K., and Rosensehein, S.J., "A 
Formalism for Relating Lexical and Pragmatic 
Information: Its Relevance to Recognition 
and Generation", Proc. of TINIAP Workshop, 
Cambridge 1975. 
Rosenschein, S.J., "Structuring a Pattern 
Space, with Applications to Lexical Informa- 
tion and Event Interpretation", Ph.D. 
Dissertation, University of Pennsylvania, 
Philadelphia, PA 1975. 
161 
