Proceedings of the 2nd Workshop on Ontology Learning and Population, pages 1–9,
Sydney, July 2006. ©2006 Association for Computational Linguistics
Enriching a formal ontology with a thesaurus: an application in the
cultural heritage domain
 
Roberto Navigli, Paola Velardi 
Dipartimento di Informatica, Università “La Sapienza”, Italy  
navigli,velardi@di.uniroma1.it 
 
 
Abstract 
This paper describes a pattern-based method to automatically enrich a
core ontology with the definitions of a domain glossary. We show an
application of our methodology to the cultural heritage domain, using
the CIDOC CRM core ontology. To enrich the CIDOC, we use available
resources such as the AAT art and architecture glossary, WordNet, the
Dmoz taxonomy for named entities, and others.
1 Introduction 
Large-scale, automatic semantic annotation of web documents based on
well-established domain ontologies would allow various Semantic Web
applications to emerge and gain acceptance. Wide-coverage ontologies are
indeed available for general-purpose domains (e.g. WordNet, CYC, SUMO¹);
however, semantic annotation in unconstrained areas still seems out of
reach for state-of-the-art systems. Domain-specific ontologies are
preferable since they limit the domain and make the applications
feasible. Furthermore, real-world applications (e.g. tourism, cultural
heritage, e-commerce) are dominated by the requirements of the related
web communities, who have begun to believe in the benefits deriving from
the application of Semantic Web techniques. These communities are
interested in extracting specific types of information from texts,
rather than general-purpose relations. Accordingly, they have made
remarkable efforts to conceptualize their competence domain through the
definition of a core ontology².
                                                
Relevant examples are in the area of enterprise modeling (Fox et al.
1997) (Uschold et al. 1998) and cultural heritage (Doerr, 2003). Core
ontologies are indeed a necessary starting point to model in a
principled way the basic concepts, relations and axioms of a given
domain. But in order for an ontology to be really usable in
applications, it is necessary to enrich the core structure with the
thousands of concepts and instances that “make” the domain.
In this paper we present a methodology to automatically annotate a
glossary $\mathcal{G}$ with the semantic relations of an existing core
ontology O. Glosses are then converted into formal concepts, used to
enrich O. The annotation of glossary definitions is performed using
regular expressions, a widely adopted text mining approach. However,
while in the literature regular expressions mostly seek patterns at the
lexical and part-of-speech level, we defined more complex expressions
enriched with syntactic and semantic constraints. A word sense
disambiguation algorithm, SSI (Velardi and Navigli, 2005), is used to
automatically replace the high-level semantic constraints specified in
the core ontology with fine-grained sense restrictions, using the sense
inventory of a general-purpose lexicalized ontology, WordNet.
We tested our methodology in the cultural heritage domain, since for
this domain several well-established resources are available, like the
CIDOC-CRM core ontology, the AAT art and architecture thesaurus, and
others.
The paper is organized as follows: in Section 2 we briefly present the
CIDOC and the other resources used in this work. In Section 3 we
describe in detail the ontology enrichment algorithm. Finally, in
Section 4 we provide a performance evaluation on a subset of CIDOC
properties and a sub-tree of the AAT thesaurus. Related literature is
examined in Section 5.

¹ WordNet: http://wordnet.princeton.edu, CYC: http://www.opencyc.org,
SUMO: http://www.ontologyportal.org
² A core ontology is a very basic ontology consisting only of the
minimal concepts, relations and axioms required to understand the other
concepts in the domain.
2 Semantic and lexical resources in 
the cultural heritage domain 
In this section we briefly describe the 
resources that have been used in this work. 
2.1 The CIDOC CRM
The core ontology O is the CIDOC CRM 
(Doerr, 2003), a formal core ontology whose
purpose is to facilitate the integration and 
exchange of cultural heritage information 
between heterogeneous sources. It is currently 
being elaborated to become an ISO standard. 
In the current version (4.0) the CIDOC 
includes 84 taxonomically structured concepts 
(called entities) and a flat set of 141 semantic 
relations, called properties. Properties are 
defined in terms of domain (the class for which 
a property is formally defined) and range (the 
class that comprises all potential values of a 
property), e.g.: 
 
P46 is composed of (forms part of) 
Domain:  E19 Physical Object 
Range:  E42 Object Identifier 
 
The CIDOC is an “informal” resource. To make it usable by a computer
program, we replaced specifications written in natural language with
formal ones. For each property R, we created a tuple $R(C_d, C_r)$,
where $C_d$ and $C_r$ are the domain and range entities specified in
the CIDOC reference manual.
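As an illustration of this tuple representation, the following minimal sketch stores a few properties as records with a domain and a range. The class name `Property`, the dictionary `PROPERTIES` and the helper `signature` are hypothetical names chosen here; the three entries are copied from Table 2, and this is not the authors' actual data structure.

```python
from typing import NamedTuple


class Property(NamedTuple):
    code: str     # CIDOC property code
    name: str     # property label
    domain: str   # C_d: entity for which the property is defined
    range_: str   # C_r: entity comprising the potential values


# A few rows of Table 2, encoded as R(C_d, C_r) tuples.
PROPERTIES = {
    "P62": Property("P62", "depicts", "Physical Man-Made Stuff", "Entity"),
    "P45": Property("P45", "consists of", "Physical Stuff", "Material"),
    "P14": Property("P14", "carried out by", "Activity", "Actor"),
}


def signature(prop: Property) -> str:
    """Render a property in the R(C_d, C_r) form used in the text."""
    return f"{prop.name}({prop.domain}, {prop.range_})"


print(signature(PROPERTIES["P62"]))
```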
2.2 The AAT thesaurus
The domain glossary $\mathcal{G}$ is the Art and Architecture Thesaurus
(AAT), a controlled vocabulary for use by indexers, catalogers, and
other professionals concerned with information management in the fields
of art and architecture. In its current version³ it includes more than
133,000 terms, descriptions, bibliographic citations, and other
information relating to fine art, architecture, decorative arts,
archival materials, and material culture. An example is the following:
 
maestà
Note: Refers to a work of a specific iconographic type, depicting the
Virgin Mary and Christ Child enthroned in the center with saints and
angels in adoration to each side. The type developed in Italy in the
13th century and was based on earlier Greek types. Works of this type
are typically two-dimensional, including painted panels (often
altarpieces), manuscript illuminations, and low-relief carvings.
Hierarchical Position:
 Objects Facet
 .. Visual and Verbal Communication
 .... Visual Works
 ...... <visual works>
 ........ <visual works by subject type>
 .......... maestà

³ http://www.getty.edu/research/conducting_research/vocabularies/aat/
 
We manually mapped the top CIDOC entities to AAT concepts, as shown in
Table 1.

AAT topmost concept      CIDOC entities
Top concept of AAT       CRM Entity (E1), Persistent Item (E77)
Styles and Periods       Period (E4)
Events                   Event (E5)
Activities Facet         Activity (E7)
Processes/Techniques     Beginning of Existence (E63)
Objects Facet            Physical Stuff (E18), Physical Object (E19)
Artifacts                Physical Man-Made Stuff (E24)
Materials Facet          Material (E57)
Agents Facet             Actor (E39)
Time                     Time-Span (E52)
Place                    Place (E53)
Table 1: mapping between AAT and CIDOC.
2.3 Additional resources 
A general-purpose lexicalised ontology, WordNet, is used to bridge the
high-level concepts defined in the core ontology with the words in a
fragment of text. As clarified later, WordNet is used to verify that
certain words in a string of text f satisfy the range constraints
$R(C_d, C_r)$ in the CIDOC. In order to do so, we manually linked the
WordNet topmost concepts to the CIDOC entities. For example, the concept
E19 (Physical Object) is mapped to the WordNet synset “object, physical
object”. Furthermore, we created a gazetteer I of named entities by
extracting names from Dmoz⁴, a large human-edited directory of the web,
the Union List of Artist Names (ULAN) and the Getty Thesaurus of
Geographic Names (GTG) provided by the Getty institute, along with the
AAT. Named entities often occur in AAT definitions; therefore, NE
recognition is relevant for our task.

⁴ http://dmoz.org/about.html
3 Enriching the CIDOC CRM with
the AAT thesaurus
In this Section we describe in detail the method 
for automatic semantic annotation and 
ontology enrichment in the cultural heritage 
domain. 
We start with an example of the task to be performed: given a gloss $G$
of a term t in the glossary $\mathcal{G}$, the first objective is to
annotate certain gloss fragments with CIDOC relations. For example, the
following gloss fragment for “vedute” is annotated with a CIDOC
relation, as follows:
[...] The first vedute probably were <carried-out-by>painted by northern
European artists</carried-out-by> [...]
Then, for each annotated fragment, we extract a semantic relation
instance $R(C_t, C_w)$, where R is a relation in O, and $C_t$ and $C_w$
are respectively the domain and range of R. The concept $C_t$
corresponds to its lexical realization t, while $C_w$ is the concept
associated with the “head” word w in the annotated segment of the gloss.
In the previous example, the relation instance is:
$R_{\textit{carried\_out\_by}}(\textit{vedute}, \textit{European\_artist})$
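The step from an annotated fragment to a relation instance can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function name `extract_instances` is hypothetical, and the last word of the tagged fragment is used as a crude stand-in for the head word w (no lemmatisation or term detection, so it yields `artists` rather than the concept `European_artist`).

```python
import re

# Matches <relation>fragment</relation>, with the same tag name on both sides.
TAGGED = re.compile(r"<(?P<rel>[\w-]+)>(?P<frag>.*?)</(?P=rel)>", re.S)


def extract_instances(term, annotated_gloss):
    """Return (relation, term, head) triples for every tagged fragment.

    The head is approximated by the last word of the fragment.
    """
    triples = []
    for match in TAGGED.finditer(annotated_gloss):
        words = match.group("frag").split()
        if words:
            head = words[-1].strip(".,;")
            triples.append((match.group("rel"), term, head))
    return triples


gloss = ("The first vedute probably were <carried-out-by>painted by "
         "northern European artists</carried-out-by>")
print(extract_instances("vedute", gloss))
```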
The annotation process makes it possible to automatically enrich O with
an existing glossary in the same domain as O, since each pair of term
and gloss $(t, G)$ in the glossary $\mathcal{G}$ is transformed into a
formal definition, compliant with O. Furthermore, the very same method
used to annotate definitions can be used to annotate free text with the
relations of the enriched ontology O’.
We now describe the method in detail. Let $\mathcal{G}$ be a glossary,
t a term in $\mathcal{G}$ and $G$ the corresponding natural language
definition (gloss). The main steps of the algorithm are the following:
1. Part-of-Speech analysis.
Each input gloss is processed with a part-of-speech tagger, TreeTagger⁵.
As a result, for each gloss $G = w_1 w_2 \ldots w_n$, a string of
part-of-speech tags $p_1 p_2 \ldots p_n$ is produced, where $p_i \in P$
is the part-of-speech tag chosen by TreeTagger for word $w_i$, and
$P = \{N, A, V, J, R, C, P, S, W\}$ is a simplified set of syntactic
categories (respectively, nouns, articles, verbs, adjectives, adverbs,
conjunctions, prepositions, symbols, wh-words). Terminological strings
(european artist) are detected using our Term Extractor tool, already
described in (Navigli and Velardi, 2004).

⁵ TreeTagger is available at:
http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger.
2. Named Entity recognition.
We augmented TreeTagger with the ability to capture named entities of
locations, organizations, persons, numbers and time expressions. In
order to do so, we use regular expressions (Friedl, 1997) in a rather
standard way; therefore we omit details. When a named entity string
$w_i w_{i+1} \ldots w_{i+j}$ is recognized, it is transformed into a
single term and a specific part of speech denoting the kind of entity is
assigned to it (L for cities (e.g. Venice), countries and continents, T
for time and historical periods (e.g. Middle Ages), O for organizations
and persons (e.g. Leonardo Da Vinci), B for numbers).
3. Annotation of sentence segments with CIDOC properties.
Once the text has been parsed, we use manually defined regular
expressions to capture relevant fragments. The regular expressions are
used to annotate gloss segments with properties grounded on the
CIDOC-CRM relation model. Given a gloss $G$ and a property⁶ R, we define
a relation checker $c_R$ taking $G$ as input and producing as output a
set $F_R$ of fragments of $G$ annotated with the property R: <R>f</R>.
The selection of a fragment f to be included in the set $F_R$ is based
on three different kinds of constraints:

• a part-of-speech constraint p(f, pos-string) matches the
part-of-speech (pos) string associated with the fragment f against a
regular expression (pos-string), specifying the required syntactic
structure.
• a lexical constraint l(f, k, lexical-constraint) matches the lemma of
the word in the k-th position of f against a regular expression
(lexical-constraint), constraining the lexical conformation of words
occurring within the fragment f.
• semantic constraints on domain and range $s_D$(f, semantic-domain) and
s(f, k, semantic-range) are valid, respectively, if the term t and the
word w in the k-th position of f match the semantic constraints on
domain and range imposed by the CIDOC, i.e. if there exists at least one
sense $C_t$ of t and one sense $C_w$ of w such that:
$R^{*}_{\textit{kind-of}}(C_d, C_t)$ and $R^{*}_{\textit{kind-of}}(C_r, C_w)$⁷

⁶ In what follows, we adopt the CIDOC terminology for relations and
concepts, i.e. properties and entities.
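The semantic constraint above amounts to a reachability test over kind-of links. The following sketch shows one way to check it, assuming a toy dictionary of hypernym links (`KIND_OF` is invented for illustration and is not WordNet data); the function names `kind_of_star` and `satisfies` are likewise hypothetical.

```python
# Toy kind-of links (concept -> direct hypernyms); illustrative only.
KIND_OF = {
    "stone#1": ["material#1"],
    "material#1": ["substance#1"],
    "substance#1": ["entity#1"],
}


def kind_of_star(ancestor, concept, links=KIND_OF):
    """True if `ancestor` is reached from `concept` in zero or more kind-of steps."""
    seen, stack = set(), [concept]
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(links.get(current, []))
    return False


def satisfies(constraint, senses, links=KIND_OF):
    """A word satisfies a constraint if at least one of its senses is subsumed by it."""
    return any(kind_of_star(constraint, sense, links) for sense in senses)


print(satisfies("material#1", ["stone#2", "stone#1"]))
```

Because only one sense has to be subsumed, an ambiguous word can pass the check through any of its senses, which is why a separate disambiguation step is still needed afterwards.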
 
 
More formally, the annotation process is defined as follows:
A relation checker $c_R$ for a property R is a logical expression
composed of constraint predicates and logical connectives, using the
following production rules:

$c_R \rightarrow s_D(f, \textit{semantic-domain}) \wedge c_R'$
$c_R' \rightarrow \neg c_R' \mid (c_R' \wedge c_R') \mid (c_R' \vee c_R')$
$c_R' \rightarrow p(f, \textit{pos-string}) \mid l(f, k, \textit{lexical-constraint}) \mid s(f, k, \textit{semantic-range})$

where f is a variable representing a sentence fragment. Notice that a
relation checker must always specify a semantic constraint $s_D$ on the
domain of the relation R being checked on fragment f. Optionally, it may
also specify a semantic constraint s on the k-th element of f, the range
of R.
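The production rules can be read as a small combinator language. The sketch below is an assumption-laden illustration rather than the authors' code: `Fragment`, `p`, `l`, `conj`, `disj` and `neg` are hypothetical names, positions are plain 0-based word indices instead of parenthesized groups, and the semantic predicates $s_D$ and s are left out because they need a WordNet lookup.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Fragment:
    words: List[str]  # one entry per word of the fragment
    pos: str          # one simplified tag per word, e.g. "VPJN"


Checker = Callable[[Fragment], bool]


def p(pos_pattern: str) -> Checker:
    """Part-of-speech constraint: the whole tag string must match the pattern."""
    return lambda f: re.fullmatch(pos_pattern, f.pos) is not None


def l(k: int, lexical_pattern: str) -> Checker:
    """Lexical constraint on the word at 0-based index k."""
    return lambda f: k < len(f.words) and re.search(lexical_pattern, f.words[k]) is not None


def conj(*checkers: Checker) -> Checker:
    return lambda f: all(c(f) for c in checkers)


def disj(*checkers: Checker) -> Checker:
    return lambda f: any(c(f) for c in checkers)


def neg(checker: Checker) -> Checker:
    return lambda f: not checker(f)


# Syntactic and lexical part of a checker, built from the combinators.
composed_of = conj(
    p(r"VP[CRJVN]*N"),
    l(0, r"^(consisting|composed|comprised|constructed)$"),
    l(1, r"^of$"),
)

print(composed_of(Fragment(["composed", "of", "knots"], "VPN")))
```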
For example, the following excerpt of the checker for the is-composed-of
relation (P46):

\[
\begin{aligned}
c_{\textit{is-composed-of}}(f) ={}& s_D(f, \textit{physical\_object\#1}) \\
&\wedge\ p(f, \texttt{"(V)}_1\texttt{(P)}_2\texttt{R?A?[CRJVN]*(N)}_3\texttt{"}) \\
&\wedge\ l(f, 1, \texttt{"\^{}(consisting|composed|comprised|constructed)\$"}) \\
&\wedge\ l(f, 2, \texttt{"of"}) \\
&\wedge\ s(f, 3, \textit{physical\_object\#1})
\end{aligned}
\tag{1}
\]

reads as follows: “the fragment f is valid if it consists of a verb in
the set { consisting, composed, comprised, constructed }, followed by a
preposition “of”, a possibly empty number of adverbs, adjectives, verbs
and nouns, and terminated by a noun interpretable as a physical object
in the WordNet concept inventory”. The first predicate, $s_D$, requires
that the term t whose gloss contains f (i.e., its domain) also be
interpretable as a physical object.
Notice that some letters in the regular expression specified for the
part-of-speech constraint are enclosed in parentheses. This makes it
possible to identify the relative positions of the words to be matched
against lexical and semantic constraints, as shown graphically in
Figure 1.
Figure 1. Correspondence between parenthesized part-of-speech tags and
words in a gloss fragment.
 
Checker (1) recognizes, among others, the following fragments (the words
whose part-of-speech tags are enclosed in parentheses are parenthesized
and indexed accordingly):
 (consisting)$_1$ (of)$_2$ semi-precious (stones)$_3$ (matching
part-of-speech string: (V)$_1$(P)$_2$J(N)$_3$)
 (composed)$_1$ (of)$_2$ (knots)$_3$ (matching part-of-speech string:
(V)$_1$(P)$_2$(N)$_3$)

⁷ $R^{*}_{\textit{kind-of}}$ denotes zero, one, or more applications of
$R_{\textit{kind-of}}$.
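The role of the parenthesized tags can be illustrated with ordinary regular-expression groups: since the part-of-speech string has one character per word, the start offset of each group is the index of the corresponding word. The sketch below follows checker (1) under simplifying assumptions: the function name `check_is_composed_of` is hypothetical, the domain constraint $s_D$ is omitted, and `is_physical_object` is a caller-supplied stand-in for the WordNet-based range test.

```python
import re

# One tag character per word, so group offsets double as word indices.
POS_PATTERN = re.compile(r"(V)(P)R?A?[CRJVN]*(N)")
VERB_PATTERN = re.compile(r"^(consisting|composed|comprised|constructed)$")


def check_is_composed_of(words, pos, is_physical_object):
    """Return the candidate range word if the fragment passes, else None."""
    match = POS_PATTERN.fullmatch(pos)
    if match is None:
        return None
    # Map the three parenthesized tags back to the words they cover.
    verb, prep, noun = (words[match.start(i)] for i in (1, 2, 3))
    if VERB_PATTERN.match(verb) and prep == "of" and is_physical_object(noun):
        return noun
    return None


toy_lexicon = {"stones", "knots"}  # illustrative stand-in for a WordNet check
print(check_is_composed_of(
    ["consisting", "of", "semi-precious", "stones"], "VPJN",
    lambda w: w in toy_lexicon))
```

Returning the word under the third group is also a natural place to hook a marker of ontological relevance for the candidate range.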
As a second example, an excerpt of the checker for the consists-of (P45)
relation is the following:

\[
\begin{aligned}
c_{\textit{consists-of}}(f) ={}& s_D(f, \textit{physical\_object\#1}) \\
&\wedge\ p(f, \texttt{"(V)}_1\texttt{(P)}_2\texttt{A?[JN,VC]*(N)}_3\texttt{"}) \\
&\wedge\ l(f, 1, \texttt{"\^{}(make|do|produce|decorated)\$"}) \\
&\wedge\ l(f, 2, \texttt{"\^{}(of|by|with)\$"}) \\
&\wedge\ \neg s(f, 3, \textit{color\#1}) \wedge \neg s(f, 3, \textit{activity\#1}) \\
&\wedge\ (s(f, 3, \textit{material\#1}) \vee s(f, 3, \textit{solid\#1}) \vee s(f, 3, \textit{liquid\#1}))
\end{aligned}
\tag{2}
\]

recognizing, among others, the following phrases:
 (made)$_1$ (with)$_2$ the red earth pigment (sinopia)$_3$ (matching
part-of-speech string: (V)$_1$(P)$_2$AJN(N)$_3$)
 (decorated)$_1$ (with)$_2$ red, black, and white (paint)$_3$ (matching
part-of-speech string: (V)$_1$(P)$_2$JCJ(N)$_3$)
Notice that in both checkers (1) and (2) semantic constraints are
specified in terms of WordNet sense numbers (material#1, solid#1 and
liquid#1), and can also be negative (¬color#1 and ¬activity#1). The
motivation is that CIDOC constraints are coarse-grained due to the small
number of available core concepts: for example, the property P45
consists of simply requires that the range belongs to the class Material
(E57). Using these coarse-grained constraints would produce false
positives in the annotation task, as discussed later. Using WordNet for
semantic constraints has two advantages: first, it is possible to write
more fine-grained (and hence more reliable) constraints; second, regular
expressions can be re-used, at least in part, for other domains and
ontologies. In fact, several CIDOC properties are rather
general-purpose. Notice that, as remarked in Section 2.3, replacing
coarse CIDOC sense restrictions with WordNet fine-grained restrictions
is possible since we mapped the 84 CIDOC entities onto WordNet topmost
concepts.
4. Formalisation of glosses.
The annotations generated in the previous step are the basis for
extracting property instances to enrich the CIDOC CRM with a
conceptualization of the AAT terms. In general, for each gloss $G$
defining a concept $C_t$, and for each fragment $f \in F_R$ of $G$
annotated with the property R: <R>f</R>, it is possible to extract one
or more property instances in the form of a triple $R(C_t, C_w)$, where
$C_w$ is the concept associated with a term or multi-word expression w
occurring in f (i.e. its language realization) and $C_t$ is the concept
associated with the defined term t in AAT. For example, from the
definition of tatting (a kind of lace) the algorithm automatically
annotates the phrase composed of knots, suggesting that this phrase
specifies the range of the is-composed-of property for the term tatting:
$R_{\textit{is-composed-of}}(C_{\textit{tatting}}, C_{\textit{knot}})$
In this property instance, $C_{\textit{tatting}}$ is the domain of the
property (a term in the AAT glossary) and $C_{\textit{knot}}$ is the
range (a specific term in the definition $G$ of tatting).
Selecting the concept associated with the domain is rather
straightforward: glossary terms are in general not ambiguous, and, if
they are, we simply use a numbering policy to identify the appropriate
concept. In the example at hand, $C_{\textit{tatting}}$ = tatting#1 (the
first and only sense in AAT). Therefore, if $C_t$ matches the domain
restrictions in the regular expression for R, then the domain of the
relation is considered to be $C_t$. Selecting the range of a relation is
instead more complicated. The first problem is to select the correct
words in a fragment f. Only certain words of an annotated gloss fragment
can be exploited to extract the range of a property instance. For
example, in the phrase “depiction of fruit, flowers, and other objects”
(from the definition of still life), only fruit, flowers, objects
represent the range of the property instances of kind depicts (P62).
When writing relation checkers, as described in the previous step, we
can add markers of ontological relevance by specifying a predicate
r(f, k) for each relevant position k in a fragment f. The purpose of
these markers is precisely to identify words in f whose corresponding
concepts are in the range of a property. For instance, the checker (1)
$c_{\textit{is-composed-of}}$ from the previous step is augmented with
the conjunction: $\wedge\ r(f, 3)$. We added the predicate r(f, 3)
because the third parenthesis in the part-of-speech string refers to an
ontologically relevant element (i.e. the candidate range of the
is-composed-of property).
The second problem is that words that are candidate ranges can be
ambiguous, and they often are, especially if they do not belong to the
domain glossary $\mathcal{G}$. Considering the previous example of the
property depicts, the word fruit is not a term of the AAT glossary, and
it has 3 senses in WordNet (the fruit of a plant, the consequence of
some action, an amount of product). The property depicts, as defined in
the CIDOC, simply requires that the range be of type Entity (E1).
Therefore, all three senses of fruit in WordNet satisfy this constraint.
Whenever the range constraints in a relation checker do not allow a full
disambiguation, we apply the SSI algorithm (Navigli and Velardi, 2005),
a semantic disambiguation algorithm based on structural pattern
recognition, available on-line⁸. The algorithm is applied to the words
belonging to the fragment f and is based on the detection of relevant
semantic interconnection patterns between the appropriate senses. These
patterns are extracted from a lexical knowledge base that merges WordNet
with other resources, like word collocations, on-line dictionaries, etc.
For example, in the fragment “depictions of fruit, flowers, and other
objects” the following properties are created for the concept
still_life#1:

$R_{\textit{depicts}}(\textit{still\_life\#1}, \textit{fruit\#1})$
$R_{\textit{depicts}}(\textit{still\_life\#1}, \textit{flower\#2})$
$R_{\textit{depicts}}(\textit{still\_life\#1}, \textit{object\#1})$

Some of the semantic patterns supporting this sense selection are shown
in Figure 2.
A further possibility is that the range of a relation R is a concept
instance. We create concept instances if the word w extracted from the
fragment f is a named entity. For example, the definition of Venetian
lace is annotated as “Refers to needle lace created
<current-or-former-location>in Venice</current-or-former-location>
[…]”.
As a result, the following triple is produced:
$R_{\textit{has-current-or-former-location}}(\textit{Venetian\_lace\#1}, \textit{Venice:city\#1})$
where Venetian_lace#1 is the concept label generated for the term
Venetian lace in the AAT and Venice is an instance of the concept city#1
(city, metropolis, urban center) in WordNet.
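The choice between a concept and a concept instance for the range can be sketched as a gazetteer lookup. This is only a toy illustration: `GAZETTEER` contains the single entry taken from the example above, `range_label` is a hypothetical helper, and the `disambiguate` callback stands in for whatever sense selection is applied to ordinary words.

```python
# Toy gazetteer: named entity -> WordNet concept it is an instance of.
GAZETTEER = {"Venice": "city#1"}


def range_label(word, gazetteer=GAZETTEER, disambiguate=None):
    """Return 'name:concept' for a named entity, else a disambiguated sense label."""
    if word in gazetteer:
        return f"{word}:{gazetteer[word]}"
    return disambiguate(word) if disambiguate else None


print(range_label("Venice"))
```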
 
                                                
⁸ SSI is an on-line knowledge-based WSD algorithm accessible from
http://lcl.di.uniroma1.it/ssi. The on-line version also outputs the
detected semantic connections (such as those in Figure 2).
Figure 2. Semantic Interconnections selected by the SSI algorithm when
given the word list: “depiction, fruit, flower, object”.
4 Evaluation 
Since the CIDOC-CRM model formalizes a large number of fine-grained
properties (precisely, 141), we selected a subset of properties for our
experiments (reported in Table 2). We wrote a relation checker for each
property in the Table. By applying the checkers in cascade to a gloss
$G$, a set of annotations is produced. The following is an example of an
annotated gloss for the term “vedute”:
 
Refers to detailed, largely factual topographical views, especially
<has-time-span>18th-century</has-time-span> Italian paintings,
drawings, or prints of cities. The first vedute probably were
<carried-out-by>painted by northern European artists</carried-out-by>
who worked <has-former-or-current-location>in
Italy</has-former-or-current-location><has-time-span>in the 16th
century</has-time-span>. The term refers more generally to any painting,
drawing or print <depicts>representing a landscape or town
view</depicts> that is largely topographical in conception.
 
Figure 3 shows a more comprehensive graph representation of the outcome
for the concepts vedute#1 and maestà#1 (see the gloss in Section 2.2).
To evaluate the methodology described in Section 3 we considered 814
glosses from the Visual Works sub-tree of the AAT thesaurus, containing
a total of 27,925 words. The authors wrote the relation checkers by
tuning them on a subset of 122 glosses, and tested their generality on
the remaining 692. The test set was manually tagged with the subset of
the CIDOC-CRM properties shown in Table 2 by two annotators with
adjudication (requiring a careful comparison of the two sets of
annotations).
We performed two experiments: in the first, we evaluated the gloss
annotation task, in the second the property instance extraction task,
i.e. the ability to identify the appropriate domain and range of a
property instance. In the case of the gloss annotation task, for
evaluating each piece of information we adopted the measures of
“labeled” precision and recall. These measures are commonly used to
evaluate parse trees obtained by a parser (Charniak, 1997) and allow the
rewarding of good partial results. Given a property R, labeled precision
is the number of words annotated correctly with R over the number of
words annotated automatically with R, while labeled recall is the number
of words annotated correctly with R over the total number of words
manually annotated with R.
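These two word-level measures can be computed as in the following sketch. The representation of annotations as sets of (word index, property) pairs and the function name `labeled_precision_recall` are assumptions made for illustration; the toy data is invented and unrelated to the reported results.

```python
def labeled_precision_recall(gold, predicted, relation):
    """Word-level labeled precision and recall for one property.

    gold, predicted: sets of (word_index, relation) pairs, i.e. the words
    tagged manually and automatically with each property.
    """
    gold_words = {i for i, r in gold if r == relation}
    pred_words = {i for i, r in predicted if r == relation}
    correct = len(gold_words & pred_words)
    precision = correct / len(pred_words) if pred_words else 0.0
    recall = correct / len(gold_words) if gold_words else 0.0
    return precision, recall


# Invented example: four words tagged manually, three of them found automatically.
gold = {(3, "depicts"), (4, "depicts"), (5, "depicts"), (6, "depicts")}
predicted = {(4, "depicts"), (5, "depicts"), (6, "depicts"), (1, "has-time-span")}
print(labeled_precision_recall(gold, predicted, "depicts"))
```

A fragment whose boundaries are slightly off still earns credit for the words it tags correctly, which is the partial-reward behaviour these measures are chosen for.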
Table 3 shows the results obtained by applying the checkers to tag the
test set (containing a total number of 1,328 distinct annotations and
5,965 annotated words). Note that here we are evaluating the ability of
the system to assign the correct tag to every word in a gloss fragment
f, according to the appropriate relation checker. We chose to evaluate
the tag assigned to single words rather than to a whole phrase, because
otherwise each misalignment would count as a mistake even if most of a
phrase was tagged correctly by the automatic annotator.
The second experiment consisted in the evaluation of the property
instances extracted. Starting from 1,328 manually annotated fragments of
692 glosses, the checkers extracted an overall number of 1,101 property
instances. We randomly selected a subset of 160 glosses for evaluation,
from which we manually extracted 344 property instances. Two aspects of
the property instance extraction task had to be assessed:
• the extraction of the appropriate range words in a gloss, for a given
property instance
• the precision and recall in the extraction of the appropriate concepts
for both domain and range of the property instance.
An overall number of 233 property instances were automatically collected
by the checkers, out of which 203 were correct with respect to the first
assessment (87.12% precision (203/233), 59.01% recall (203/344)).
Code  Name                            Domain                    Range                    Example
P26   moved to                        Move                      Place                    P26(installation of public sculpture, public place)
P27   moved from                      Move                      Place                    P27(removal of cornice pictures, wall)
P53   has former/current location     Physical Stuff            Place                    P53(fancy pictures, London)
P55   has current location            Physical Object           Place                    P55(macrame, Genoa)
P46   is composed of (is part of)     Physical Stuff            Physical Stuff           P46(lace, knot)
P62   depicts                         Physical Man-Made Stuff   Entity                   P62(still life, fruit)
P4    has time span                   Temporal Entity           Time Span                P4(pattern drawings, Renaissance)
P14   carried out by (performed)      Activity                  Actor                    P14(blotted line drawings, Andy Warhol)
P92   brought into existence by       Persistent Item           Beginning of Existence   P92(aquatints, aquatint process)
P45   consists of (incorporated in)   Physical Stuff            Material                 P45(sculpture, stone)
Table 2: A subset of the relations from the CIDOC-CRM model.

Figure 3. Extracted conceptualisation (in graphical form) of the terms
maestà#1 and vedute#1 (sense numbers are omitted for clarity).

In the second evaluation, for each property instance $R(C_t, C_w)$ we
assessed the semantic correctness of both the concepts $C_t$ and $C_w$.
The appropriateness of the concept $C_t$ chosen for the domain must be
evaluated, since, even if a term t satisfies the semantic constraints of
the domain for a property R, it can still be the case that a fragment f
in $G$ does not refer to t, as in the following example:
pastels (visual works) -- Works of art, typically on a paper or vellum
support, to which designs are applied using crayons made of ground
pigment held together with a binder, typically oil or water and gum.

In this example, ground pigment refers to crayons (not to pastels).
The evaluation of the semantic correctness of the domain and range of
the property instances extracted led to the final figures of 81.11%
(189/233) precision and 54.94% (189/344) recall, due to 9 errors in the
choice of $C_t$ as a domain for an instance $R(C_t, C_w)$ and 5 errors
in the semantic disambiguation of range words w not appearing in AAT,
but encoded in WordNet (as described in the last part of Section 3). A
final experiment was performed to evaluate the generality of the
approach presented in this paper.
As already remarked, the same procedure used for annotating the glosses
of a thesaurus can be used to annotate web documents. Our objective in
this third experiment was to:
• Evaluate the ability of the system to annotate fragments of web
documents with CIDOC relations
• Evaluate the domain dependency of the relation checkers, by letting
the system annotate documents not in the cultural heritage domain.
We then selected 5 documents at random from an historical archive and an
artist’s biographies archive⁹ including about 6,000 words in total,
about 5,000 of which in the historical domain. We then ran the automatic
annotation procedure on these documents and we evaluated the result,
using the same criteria as in Table 3.
 
Property                               Precision           Recall
P26 – moved to                         84.95% (79/93)      64.23% (79/123)
P27 – moved from                       81.25% (39/48)      78.00% (39/50)
P53 – has former or current location   78.09% (916/1173)   67.80% (916/1351)
P55 – has current location             100.0% (8/8)        100.0% (8/8)
P46 – composed of                      87.49% (944/1079)   70.76% (944/1334)
P62 – depicts                          94.15% (370/393)    65.26% (370/567)
P4 – has time span                     91.93% (547/595)    76.40% (547/716)
P14 – carried out by                   91.71% (343/374)    71.91% (343/477)
P92 – brought into existence           89.54% (471/526)    62.72% (471/751)
P45 – consists of                      74.67% (398/533)    57.60% (398/691)
Avg. performance                       85.34 (4115/4822)   67.81 (4115/6068)
Table 3: Precision and Recall of the gloss annotation task.
Table 4 presents the results of the experiment. Only 5 out of 10
properties had at least one instance in the analysed documents. It is
remarkable that, especially for the less domain-dependent properties,
the precision and recall of the algorithm are still high, thus showing
the generality of the method. Notice that the historical documents
influenced the result much more than the artist biographies, because of
their size.

⁹ http://historicaltextarchive.com and http://www.artnet.com/library
In Table 4 the recall of P14 (carried out by) is omitted. This is
motivated by the fact that this property, in a generic domain,
corresponds to the agent relation (“an active animate entity that
voluntarily initiates an action”¹⁰), while in the cultural heritage
domain it has a narrower interpretation (an example of this relation in
the CIDOC handbook is: “the painting of the Sistine Chapel (E7) was
carried out by Michelangelo Buonarroti (E21) in the role of master
craftsman (E55)”). However, the domain and range restrictions for P14
correspond to an agent relation; therefore, in a generic domain, one
should annotate as “carried out by” almost any verb phrase with the
subject (including pronouns and anaphoric references) in the class
Human.
 
Property                                Precision          Recall
P53 – has former or current location    79.84% (198/248)   77.95% (198/254)
P46 – composed of                       83.58% (112/134)   96.55% (112/116)
P4 – has time span                      78.32% (112/143)   50.68% (112/221)
P14 – carried out by                    60.61% (40/66)     -
P45 – consists of                       85.71% (6/7)       37.50% (6/16)
Avg. performance                        78.26% (468/598)   77.10% (468/607)
Table 4: Precision and Recall of a web 
document annotation task. 
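
The figures in Table 4 can be re-derived from the raw counts. The following is a minimal sketch, not the authors' evaluation script: it assumes that each cell reads "correct/annotated" for precision and "correct/expected" for recall, that the counts are those consistent with the reported percentages, and that the average row is a micro-average over the summed counts. The names TABLE4, pct and micro_average are hypothetical.

```python
# Counts per property as (correct, annotated, expected).
# Assumption: values are the ones consistent with the percentages in Table 4;
# expected is None for P14, whose recall is omitted in the paper.
TABLE4 = {
    "P53 has former or current location": (198, 248, 254),
    "P46 composed of": (112, 134, 116),
    "P4 has time span": (112, 143, 221),
    "P14 carried out by": (40, 66, None),
    "P45 consists of": (6, 7, 16),
}


def pct(numerator, denominator):
    """Return numerator/denominator as a percentage rounded to two decimals."""
    return round(100 * numerator / denominator, 2)


def micro_average(rows):
    """Micro-averaged (precision, recall) over all rows.

    This mirrors the reported average row: the numerator keeps every correct
    annotation, while the recall denominator sums only the rows for which an
    expected count is given.
    """
    correct = sum(c for c, _, _ in rows.values())
    annotated = sum(a for _, a, _ in rows.values())
    expected = sum(e for _, _, e in rows.values() if e is not None)
    return pct(correct, annotated), pct(correct, expected)


if __name__ == "__main__":
    for name, (c, a, e) in TABLE4.items():
        recall = "-" if e is None else pct(c, e)
        print(name, pct(c, a), recall)
    print("Avg. performance", *micro_average(TABLE4))
```

Under these assumptions the summed counts are 468/598 for precision and 468/607 for recall, matching the average row of the table.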
 
5 Related work 
This paper presented a method to automatically
annotate the glosses of a thesaurus, the AAT,
with the properties (conceptual relations) of a
core ontology, the CIDOC-CRM. Several
methods for ontology population and semantic
annotation described in the literature (e.g. Thelen
and Riloff, 2002; Califf and Mooney, 2004;
Cimiano et al., 2005; Valarakos et al., 2004) use
regular expressions to identify named entities,
i.e. concept instances. Other methods extract
hypernym
11
 relations using syntactic and lexical
                                                
10
 http://www.jfsowa.com/ontology/thematic.htm
11
 In AAT the hypernymy relation is already available,
since AAT is a thesaurus, not a glossary. However, we
developed regular expressions also for hypernym extraction
from definitions. For the sake of space this is not discussed in
this paper; however, the remarkable result (with respect to analogous
evaluations in the literature) is that in 34% of the cases the
automatically extracted hypernym is the same as in AAT,
and in 26% of the cases, either the extracted hypernym is
more general than the one defined in AAT, or the contrary,
patterns (Snow et al., 2005; Morin and Jacquemin,
2004) or supervised clustering techniques
(Kashyap et al., 2003).
In our work, we automatically learn formal
concepts (e.g. the graphs of Figure 3), not simply
instances or taxonomies, compliant with the
semantics of a well-established core ontology,
the CIDOC CRM. The method is unsupervised, in the
sense that it does not need manual annotation of
a significant fragment of text. However, it relies
on a set of manually written regular expressions,
based on lexical, part-of-speech, and semantic
constraints. The structure of these regular expressions
is more complex than in similar works
using regular expressions, especially because of the use
of automatically verified semantic constraints.
This complexity is indeed necessary to identify
non-trivial relations in unconstrained text and
without training. The issue, however, is how well
this method generalizes to other domains:
• A first problem is the availability of the lexical
and semantic resources used by the
algorithm. The most critical requirement of
the method is the availability of sound
domain core ontologies, which hopefully
will be produced by other web communities
stimulated by the recent success of CIDOC
CRM. On the other hand, in the absence of an
agreed conceptual reference model, no
large-scale annotation is possible at all. As for the
other resources used by our algorithm,
glossaries, thesauri and gazetteers are widely
available in “mature” domains. Where they are not, we
have developed a methodology, described in
(Navigli and Velardi, 2005b), to
automatically create a glossary in novel
domains (e.g. enterprise interoperability),
extracting definition sentences from
domain-relevant documents and authoritative web
sites.
• The second problem is the generality
of the regular expressions. Clearly, the relation
checkers that we defined are tuned on the
CIDOC properties. This, however, is
consistent with our target: in specific
domains users are interested in identifying
specific relations, not general-purpose ones.
Certain relevant application domains (like
cultural heritage, e-commerce, or tourism)
are those that dictate specifications for
real-world applications of NLP techniques.
However, several CIDOC properties are
rather general (especially locative and
                                                                       
with respect to the AAT hierarchy.
temporal relations), so some relation
checkers easily apply to other domains, as
demonstrated by the experiment on
automatic annotation of historical archives in
Table 4. Furthermore, the method used to
verify semantic constraints is fully general,
since it is based on WordNet and a
general-purpose, untrained semantic disambiguation
algorithm, SSI.
• Finally, the authors are fairly convinced
that automatic pattern-learning
methods often require as much non-trivial
human effort as manual methods
(because of the need for annotated data,
careful parameter setting, etc.), and that
they are unable to combine
different types of features
(e.g. lexical, syntactic, semantic) in a non-trivial way. For
example, a recent work on learning
hypernymy patterns (Morin and Jacquemin,
2004) provides the full list of learned
patterns. The complexity of these patterns is
certainly lower than that of the regular expression
structures used in this work, and many of
them are rather intuitive.
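
To illustrate what combining lexical, part-of-speech and semantic constraints in one pattern can look like, here is a minimal sketch of a relation checker for P45 (consists of). It is not one of the regular expressions actually used in this work: the pattern, the "word/TAG" rendering of the gloss, and the names HYPERNYMS, is_a, CONSISTS_OF and check_consists_of are all hypothetical, and a toy is-a dictionary stands in for WordNet and the SSI disambiguation step.

```python
import re

# Toy is-a lexicon (assumption): a stand-in for WordNet plus sense
# disambiguation, which the real method uses to verify semantic constraints.
HYPERNYMS = {
    "bronze": "metal",
    "metal": "material",
    "marble": "stone",
    "stone": "material",
    "artist": "person",
}


def is_a(word, ancestor):
    """Follow hypernym links upward; True if `ancestor` is reached."""
    seen = set()
    while word is not None and word not in seen:
        if word == ancestor:
            return True
        seen.add(word)
        word = HYPERNYMS.get(word)
    return False


# Lexical and part-of-speech constraints over a "word/TAG" string:
# a participle such as "made", then "of", optional determiners or
# adjectives, then a noun that is captured as the candidate filler.
CONSISTS_OF = re.compile(
    r"\b(?:made|composed|consisting)/VB[NG] of/IN "
    r"(?:\w+/(?:DT|JJ) )*"
    r"(?P<filler>\w+)/NNS?\b"
)


def check_consists_of(tagged):
    """Return ("P45 consists of", filler) if the gloss matches, else None.

    `tagged` is a list of (word, POS tag) pairs. The semantic constraint
    requires the captured noun to be a kind of material.
    """
    text = " ".join(f"{word.lower()}/{tag}" for word, tag in tagged)
    match = CONSISTS_OF.search(text)
    if match and is_a(match.group("filler"), "material"):
        return ("P45 consists of", match.group("filler"))
    return None
```

With this sketch, the tagged fragment "statue made of polished bronze" is annotated with bronze as filler, while "portrait made of the artist" satisfies the lexical and part-of-speech constraints but is rejected by the semantic one. A purely lexical pattern could not make that distinction.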
In the literature the tasks on which automatic 
methods have been tested are rather constrained, 
and do not convincingly demonstrate the 
superiority of automatic with respect to manually 
defined patterns. For example, in Senseval-3 
(automated labeling of semantic roles
12
), 
participating systems are requested to identify 
semantic roles in a sentence fragment for which 
the “frame semantics” is given, therefore the 
possible semantic relations to be identified are
quite limited. 
However, we believe that our method can be 
automated to some degree (for example, machine 
learning methods can be used to bootstrap the
syntactic patterns, and to learn semantic 
constraints), a research line we are currently 
exploring. 

References 
M. E. Califf and R. J. Mooney, “Bottom-up relational
learning of pattern matching rules for information
extraction”, Machine Learning Research, 4 (2), 177-
210, 2004.
E. Charniak, “Statistical Techniques for Natural
Language Parsing”, AI Magazine 18(4), 33-44,
1997.
P. Cimiano, G. Ladwig and S. Staab, “Gimme the
context: context-driven automatic semantic
annotation with C-PANKOW”, In: Proceedings of
the 14th International WWW Conference, WWW
2005, Chiba, Japan, May 2005. ACM Press.
M. Doerr, “The CIDOC Conceptual Reference
Module: An Ontological Approach to Semantic
Interoperability of Metadata”, AI Magazine,
Volume 24, Number 3, Fall 2003.
M. S. Fox, M. Barbuceanu, M. Gruninger, and J. Lin,
“An Organisation Ontology for Enterprise
Modeling”, In Simulating Organizations:
Computational Models of Institutions and Groups,
M. Prietula, K. Carley & L. Gasser (Eds), Menlo
Park CA: AAAI/MIT Press, pp. 131-152, 1997.
J. E. F. Friedl, “Mastering Regular Expressions”,
O’Reilly eds., ISBN: 1-56592-257-3, First edition
January 1997.
V. Kashyap, C. Ramakrishnan, T. Rindflesch,
“Toward (Semi)-Automatic Generation of Bio-
medical Ontologies”, Proceedings of American
Medical Informatics Association, 2003.
G. A. Miller, “WordNet: a lexical database for
English”, In: Communications of the ACM 38 (11),
November 1995, pp. 39-41.
E. Morin and C. Jacquemin, “Automatic acquisition
and expansion of hypernym links”, Computer and
the Humanities, 38: 363-396, 2004.
R. Navigli, P. Velardi. Learning Domain Ontologies
from Document Warehouses and Dedicated
Websites, Computational Linguistics (30-2), MIT
Press, June, 2004.
R. Navigli and P. Velardi, “Structural Semantic
Interconnections: a knowledge-based approach to
word sense disambiguation”, Special Issue-
Syntactic and Structural Pattern Recognition, IEEE
TPAMI, Volume: 27, Issue: 7, 2005.
R. Navigli, P. Velardi. Automatic Acquisition of a
Thesaurus of Interoperability Terms, Proc. of 16th
IFAC World Congress, Praha, Czech Republic, July
4-8th, 2005b.
R. Snow, D. Jurafsky, A. Y. Ng, “Learning syntactic
patterns for automatic hypernym discovery”, NIPS
17, 2005.
M. Thelen, E. Riloff, “A Bootstrapping Method for
Learning Semantic Lexicons using Extraction
Pattern Contexts”, Proceedings of the Conference
on Empirical Methods in Natural Language
Processing, 2002.
M. Uschold, M. King, S. Moralee and Y. Zorgios,
“The Enterprise Ontology”, The Knowledge
Engineering Review, Vol. 13, Special Issue on
Putting Ontologies to Use (eds. Uschold. M. and
Tate. A.), 1998.
Valarakos, G. Paliouras, V. Karkaletsis, G. Vouros,
“Enhancing Ontological Knowledge through
Ontology Population and Enrichment”, in
Proceedings of the 14th EKAW conf., LNAI, Vol.
3257, pp. 144-156, Springer Verlag, 2004.
