<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0501">
  <Title>Enriching a formal ontology with a thesaurus: an application in the cultural heritage domain</Title>
  <Section position="5" start_page="2" end_page="4" type="metho">
    <SectionTitle>
2.1 The CIDOC CRM
</SectionTitle>
    <Paragraph position="0"> The core ontology O is the CIDOC CRM (Doerr, 2003), a formal core ontology whose purpose is to facilitate the integration and exchange of cultural heritage information between heterogeneous sources. It is currently being elaborated to become an ISO standard.</Paragraph>
    <Paragraph position="1"> In the current version (4.0) the CIDOC includes 84 taxonomically structured concepts (called entities) and a flat set of 141 semantic relations, called properties. Properties are defined in terms of domain (the class for which a property is formally defined) and range (the class that comprises all potential values of a property), e.g.: P46 is composed of (forms part of); Domain: E19 Physical Object; Range: E42 Object Identifier. The CIDOC is an &quot;informal&quot; resource. To make it usable by a computer program, we replaced specifications written in natural language with formal ones.</Paragraph>
    <Paragraph position="3"> For each property R, we created a tuple R(C_d, C_r), where C_d and C_r are the domain and range entities specified in the CIDOC reference manual.</Paragraph>
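    <Paragraph><![CDATA[The tuple form R(C_d, C_r) can be illustrated with a small Python sketch. It is only an illustration: the class name Property, its field names, the PROPERTIES table and the helper as_tuple are assumptions of this example, not part of the CIDOC CRM or of the system described here. The P46 values are the ones quoted in the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Property:
    """One CIDOC property encoded as R(C_d, C_r). Field names are hypothetical."""
    code: str           # property identifier, e.g. "P46"
    name: str           # property label R
    domain_entity: str  # C_d, the entity for which the property is defined
    range_entity: str   # C_r, the entity comprising the potential values


# Values copied from the P46 example in the text.
P46 = Property("P46", "is composed of", "E19 Physical Object", "E42 Object Identifier")

# A lookup table keyed by property code (illustrative only).
PROPERTIES = {P46.code: P46}


def as_tuple(prop):
    """Return the formal tuple (R, C_d, C_r) for a property."""
    return (prop.name, prop.domain_entity, prop.range_entity)
```

Storing domain and range as plain labels is enough for a lookup table; a fuller encoding would link them to the entity taxonomy.]]></Paragraph>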
    <Section position="1" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
2.2 The AAT thesaurus
</SectionTitle>
      <Paragraph position="0"> The domain glossary 𝒢 is the Art and Architecture Thesaurus (AAT), a controlled vocabulary for use by indexers, catalogers, and other professionals concerned with information management in the fields of art and architecture. In its current version (http://www.getty.edu/research/conducting_research/vocabularies/aat/) it includes more than 133,000 terms, descriptions, bibliographic citations, and other information relating to fine art, architecture, decorative arts, archival materials, and material culture. An example is the following: maesta. Note: Refers to a work of a specific iconographic type, depicting the Virgin Mary and Christ Child enthroned in the center with saints and angels in adoration to each side. The type developed in Italy in the 13th century and was based on earlier Greek types. Works of this type are typically two-dimensional, including painted panels (often altarpieces), manuscript illuminations, and low-relief carvings.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
Hierarchical Position:
Objects Facet
</SectionTitle>
      <Paragraph position="0"> .. Visual and Verbal Communication .... Visual Works ...... &lt;visual works&gt; ........ &lt;visual works by subject type&gt; .......... maesta. We manually mapped the top CIDOC entities to AAT concepts, as shown in Table 1.</Paragraph>
    </Section>
    <Section position="3" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
2.3 Additional resources
</SectionTitle>
      <Paragraph position="0"> A general purpose lexicalised ontology, WordNet, is used to bridge the high-level concepts defined in the core ontology with the words in a fragment of text.</Paragraph>
      <Paragraph position="2"> As clarified later, WordNet is used to verify that certain words in a string of text f satisfy the range constraints R(C_d, C_r) in the CIDOC. To do so, we manually linked the WordNet topmost concepts to the CIDOC entities. For example, the concept E19 (Physical Object) is mapped to the WordNet synset &quot;object, physical object&quot;. Furthermore, we created a gazetteer I of named entities, extracting names from Dmoz, a large human-edited directory of the web, the Union List of Artist Names (ULAN) and the Getty Thesaurus of Geographic Names (GTG) provided by the Getty institute, along with the AAT. Named entities often occur in AAT definitions; NE recognition is therefore relevant for our task.</Paragraph>
      <Paragraph position="3">  http://dmoz.org/about.html</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="4" end_page="8" type="metho">
    <SectionTitle>
3 Enriching the CIDOC CRM with the AAT thesaurus
</SectionTitle>
    <Paragraph position="0"> In this Section we describe in detail the method for automatic semantic annotation and ontology enrichment in the cultural heritage domain.</Paragraph>
    <Paragraph position="1"> We start with an example of the task to be performed: given a gloss G of a term t in the glossary 𝒢, the first objective is to annotate certain gloss fragments with CIDOC relations. For example, the following gloss fragment for &quot;vedute&quot; is annotated with a CIDOC relation, as follows: [...] The first vedute probably were &lt;carried-out-by&gt;painted by northern European artists&lt;/carried-out-by&gt; [...]</Paragraph>
    <Paragraph position="3"> Then, for each annotated fragment, we extract a semantic relation instance R(C_t, C_w), where C_t and C_w are respectively the domain and range of R.</Paragraph>
    <Paragraph position="5"> The concept C_t is the concept associated with the defined term t, while C_w is the concept associated with the &quot;head&quot; word w in the annotated segment of the gloss.</Paragraph>
    <Paragraph position="6"> In the previous example, the relation instance is: R_carried_out_by(vedute, European_artist). The annotation process makes it possible to automatically enrich O with an existing glossary in the same domain as O, since each pair of term and gloss (t, G) in the glossary 𝒢 is transformed into a formal definition compliant with O.</Paragraph>
    <Paragraph position="7"> Furthermore, the very same method used to annotate definitions can be used to annotate free text with the relations of the enriched ontology O'.</Paragraph>
    <Paragraph position="8"> We now describe the method in detail. Let 𝒢 be a glossary, t a term in 𝒢 and G the corresponding natural language definition (gloss). The main steps of the algorithm are the following: 1. Part-of-Speech analysis.</Paragraph>
    <Paragraph position="9">  Each input gloss is processed with a part-of-speech tagger, TreeTagger (available at: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger).</Paragraph>
    <Paragraph position="11"> { [...], R, C, P, S, W } is a simplified set of syntactic categories (respectively, nouns, articles, verbs, adjectives, adverbs, conjunctions, prepositions, symbols, wh-words). Terminological strings (european artist) are detected using our Term Extractor tool, already described in (Navigli and Velardi, 2004).</Paragraph>
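    <Paragraph><![CDATA[The reduction of tagger output to one category letter per word can be sketched as follows. This is a minimal sketch under stated assumptions: the Penn-style tag prefixes, the letters N, A, V and J for nouns, articles, verbs and adjectives, the fallback letter X, and the function names are all choices of this example; only the letters R, C, P, S and W are given in the text.

```python
# Hypothetical mapping from Penn-style tag prefixes to one-letter categories.
# Only R, C, P, S, W come from the text; N, A, V, J and the prefixes are assumed.
PREFIX_TO_CATEGORY = [
    ("W", "W"),    # wh-words, e.g. WDT, WP, WRB
    ("N", "N"),    # nouns
    ("V", "V"),    # verbs
    ("J", "J"),    # adjectives
    ("RB", "R"),   # adverbs
    ("DT", "A"),   # articles
    ("CC", "C"),   # conjunctions
    ("IN", "P"),   # prepositions
    ("SYM", "S"),  # symbols
]


def simplify(tag):
    """Map one fine-grained tag to a single category letter."""
    for prefix, category in PREFIX_TO_CATEGORY:
        if tag.startswith(prefix):
            return category
    return "X"  # anything not covered by the table above (sketch-only fallback)


def pos_string(tagged_words):
    """Concatenate the category letters of a list of (word, tag) pairs."""
    return "".join(simplify(tag) for _, tag in tagged_words)
```

Because every word contributes exactly one letter, a character offset in the resulting string is also a word index, which is what allows regular expressions over the string to point back at words.]]></Paragraph>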
    <Paragraph position="12"> 2. Named Entity recognition.</Paragraph>
    <Paragraph position="13"> We augmented TreeTagger with the ability to capture named entities of locations, organizations, persons, numbers and time expressions. To do so, we use regular expressions (Friedl, 1997) in a rather standard way; we therefore omit the details.</Paragraph>
    <Paragraph position="15"> When a named entity string w is recognized, it is transformed into a single term and a specific part of speech denoting the kind of entity is assigned to it (L for cities (e.g. Venice), countries and continents, T for time and historical periods (e.g. Middle Ages), O for organizations and persons (e.g. Leonardo Da Vinci), B for numbers).</Paragraph>
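    <Paragraph><![CDATA[The collapsing of a recognized named entity into a single term with its own tag can be sketched as below. This is a hedged illustration, not the actual recognizer: the tiny GAZETTEER, the longest-match lookup, the NUMBER pattern and the function name are assumptions of this example, whereas the tags L, T, O and B are the ones listed above.

```python
import re

# Hypothetical mini-gazetteer: lower-cased word sequences mapped to entity tags.
GAZETTEER = {
    ("venice",): "L",
    ("middle", "ages"): "T",
    ("leonardo", "da", "vinci"): "O",
}
NUMBER = re.compile(r"^\d+(\.\d+)?$")
MAX_LEN = max(len(key) for key in GAZETTEER)


def tag_named_entities(tokens):
    """Return (term, tag) pairs; multi-word entities become one underscore-joined term."""
    out = []
    i = 0
    while i < len(tokens):
        match = None
        # Prefer the longest gazetteer entry starting at position i.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                match = (n, GAZETTEER[key])
                break
        if match:
            n, tag = match
            out.append(("_".join(tokens[i:i + n]), tag))
            i += n
        elif NUMBER.match(tokens[i]):
            out.append((tokens[i], "B"))
            i += 1
        else:
            out.append((tokens[i], None))  # not a named entity
            i += 1
    return out
```

For instance, the three tokens of "Leonardo Da Vinci" come out as the single term Leonardo_Da_Vinci tagged O, so later pattern matching sees one word where the text had three.]]></Paragraph>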
    <Paragraph position="16"> 3. Annotation of sentence segments with CIDOC properties.</Paragraph>
    <Paragraph position="17"> Once the text has been parsed, we use manually defined regular expressions to capture relevant fragments. The regular expressions are used to annotate gloss segments with properties grounded on the CIDOC CRM relation model. In what follows, we adopt the CIDOC terminology for relations and concepts, i.e. properties and entities.</Paragraph>
    <Paragraph position="19"> Given a gloss G and a property R, we define a relation checker c_R taking G as input and producing as output a set F_R of fragments of G annotated with the property R: &lt;R&gt;f&lt;/R&gt;.</Paragraph>
    <Paragraph position="21"> The selection of a fragment f to be included in the set F_R is based on three different kinds of constraints. (i) A part-of-speech constraint p(f, pos-string) matches the part-of-speech (pos) string associated with the fragment f against a regular expression (pos-string), specifying the required syntactic structure. (ii) A lexical constraint l(f, k, lexical-constraint) matches the lemma of the word in the k-th position of f against a regular expression (lexical-constraint), constraining the lexical conformation of words occurring within the fragment f.</Paragraph>
    <Paragraph position="22"> (iii) Semantic constraints on domain and range, s_D(f, semantic-domain) and s(f, k, semantic-range), are valid, respectively, if the term t and the word w in the k-th position of f match the semantic constraints on domain and range imposed by the CIDOC, i.e. if there exists at least one sense C_t of t and one sense C_w of w such that:</Paragraph>
    <Paragraph position="26"> More formally, the annotation process is defined as follows: a relation checker c_R for a property R is a logical expression composed of constraint predicates and logical connectives, using the following production rules:</Paragraph>
    <Paragraph position="28"> where f is a variable representing a sentence fragment. Notice that a relation checker must always specify a semantic constraint s_D on the domain of the relation R being checked on fragment f. Optionally, it may also specify a semantic constraint s on the k-th element of f, the range of R.</Paragraph>
    <Paragraph position="29"> For example, the following excerpt of the checker for the is-composed-of relation (P46): [...] s(f, 3, physical_object#1) reads as follows: &quot;the fragment f is valid if it consists of a verb in the set { consisting, composed, comprised, constructed }, followed by the preposition &quot;of&quot;, a possibly empty sequence of adverbs, adjectives, verbs and nouns, and terminated by a noun interpretable as a physical object in the WordNet concept inventory&quot;. The first predicate, s_D, requires that the term t whose gloss contains f (i.e., its domain) also be interpretable as a physical object. Notice that some letters in the regular expression specified for the part-of-speech constraint are enclosed in parentheses. This makes it possible to identify the relative positions of words to be matched against lexical and semantic constraints, as shown graphically in Figure 1 (part-of-speech tags and words in a gloss fragment). Checker (1) recognizes, among others, the following fragments (the words whose part-of-speech tags are enclosed in parentheses are indicated in bold):</Paragraph>
    <Paragraph position="33"> [...] recognizing, among others, the following phrases: [...] Notice that in both checkers (1) and (2) semantic constraints are specified in terms of WordNet sense numbers (material#1, solid#1 and liquid#1), and can also be negative (!color#1 and !activity#1). The motivation is that CIDOC constraints are coarse-grained due to the small number of available core concepts: for example, the property P45 consists of simply requires that the range belong to the class Material (E57). Using these coarse-grained constraints would produce false positives in the annotation task, as discussed later. Using WordNet for semantic constraints has two advantages: first, it is possible to write more fine-grained (and hence more reliable) constraints; second, regular expressions can be re-used, at least in part, for other domains and ontologies. In fact, several CIDOC properties are rather general-purpose. Notice that, as remarked in Section 2.3, replacing coarse CIDOC sense restrictions with WordNet fine-grained restrictions is possible since we mapped the 84 CIDOC entities onto WordNet topmost concepts.</Paragraph>
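    <Paragraph><![CDATA[How the three kinds of constraints combine in a relation checker can be sketched in Python. This is a simplified illustration and not the checker used in our experiments: the toy HYPERNYMS and SENSES tables stand in for WordNet, the part-of-speech pattern is a reduced guess at the one described in prose above, the letters N, V and J are assumed category codes, and all function names are hypothetical.

```python
import re

# Toy is-a links standing in for WordNet hypernymy (assumption of this sketch).
HYPERNYMS = {
    "knot#1": "physical_object#1",
    "thread#1": "physical_object#1",
    "tatting#1": "physical_object#1",
    "joy#1": "feeling#1",
}
# Toy sense inventory: word -> candidate senses.
SENSES = {
    "knot": ["knot#1"],
    "thread": ["thread#1"],
    "tatting": ["tatting#1"],
    "joy": ["joy#1"],
}


def is_a(sense, target):
    """Follow hypernym links upward until target is found or the chain ends."""
    while sense is not None:
        if sense == target:
            return True
        sense = HYPERNYMS.get(sense)
    return False


def satisfies(word, target):
    """Semantic constraint: at least one sense of word is subsumed by target."""
    return any(is_a(s, target) for s in SENSES.get(word, []))


def group_positions(fragment, pos_pattern):
    """Word index of each parenthesized group, or None if the pattern fails.

    fragment is a list of (word, category_letter) pairs; since each word
    contributes one letter, a match offset is also a word index.
    """
    pos = "".join(letter for _, letter in fragment)
    m = re.fullmatch(pos_pattern, pos)
    if m is None:
        return None
    return [m.start(k) for k in range(1, m.re.groups + 1)]


def check_is_composed_of(term, fragment):
    """Sketch of a checker for is-composed-of: s_D, p, two l constraints, s."""
    positions = group_positions(fragment, r"(V)(P)[RJVN]*(N)")  # p(f, pos-string)
    if positions is None:
        return False
    verb, prep, head = positions
    verbs = r"consisting|composed|comprised|constructed"
    return (
        satisfies(term, "physical_object#1")                      # s_D
        and re.fullmatch(verbs, fragment[verb][0]) is not None    # l(f, 1, ...)
        and fragment[prep][0] == "of"                             # l(f, 2, ...)
        and satisfies(fragment[head][0], "physical_object#1")     # s(f, 3, ...)
    )
```

With these toy tables, the fragment "composed of knot" is accepted for the term tatting, while replacing the final noun with joy, or the verb with made, makes the checker fail. Negative constraints such as !color#1 are not covered by this sketch.]]></Paragraph>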
    <Paragraph position="34"> 4. Formalisation of glosses.</Paragraph>
    <Paragraph position="35"> The annotations generated in the previous step are the basis for extracting property instances to enrich the CIDOC CRM with a conceptualization of the AAT terms. In general, for each gloss G defining a concept C_t, and for each fragment f ∈ F_R of G annotated with the property R: &lt;R&gt;f&lt;/R&gt;, it is possible to extract one or more property instances in the form of a triple R(C_t, C_w), where C_w is the concept associated with a term or multi-word expression w occurring in f (i.e. its language realization) and C_t is the concept associated with the defined term t in AAT.</Paragraph>
    <Paragraph position="41"> For example, from the definition of tatting (a kind of lace) the algorithm automatically annotates the phrase composed of knots, suggesting that this phrase specifies the range of the is-composed-of property for the term tatting: R_is-composed-of(C_tatting, C_knot).</Paragraph>
    <Paragraph position="43"> In this property instance, C_tatting is the domain of the property (a term in the AAT glossary) and C_knot is the range (a specific term in the definition G of tatting). Selecting the concept associated with the domain is rather straightforward: glossary terms are in general not ambiguous, and, if they are, we simply use a numbering policy to identify the appropriate concept. In the example at hand, C_tatting = tatting#1 (the first and only sense in AAT). Therefore, if C_t matches the domain restrictions in the regular expression for R, then the domain of the relation is considered to be C_t.</Paragraph>
    <Paragraph position="47"> Selecting the range of a relation is instead more complicated. The first problem is to select the correct words in a fragment f.</Paragraph>
    <Paragraph position="48"> Only certain words of an annotated gloss fragment can be exploited to extract the range of a property instance. For example, in the phrase &quot;depiction of fruit, flowers, and other objects&quot; (from the definition of still life), only fruit, flowers, objects represent the range of the property instances of kind depicts (P62).</Paragraph>
    <Paragraph position="49"> When writing relation checkers, as described in the previous paragraph of this Section, we can add markers of ontological relevance by specifying a predicate r(f, k) for each relevant position k in a fragment f. The purpose of these markers is precisely to identify words in f whose corresponding concepts are in the range of a property. For instance, the checker (1) c_is-composed-of from the previous paragraph is augmented with the conjunction r(f, 3).</Paragraph>
    <Paragraph position="51"> We added the predicate r(f, 3) because the third parenthesis in the part-of-speech string refers to an ontologically relevant element (i.e. the candidate range of the is-composed-of property).</Paragraph>
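    <Paragraph><![CDATA[The role of the relevance markers r(f, k) in producing triples R(C_t, C_w) can be sketched as follows. This is a minimal illustration under stated assumptions: the PropertyInstance type, the function extract_instances and the use of 0-based word indices are choices of this example, and words are left as bare lemmas rather than disambiguated concepts.

```python
from typing import NamedTuple


class PropertyInstance(NamedTuple):
    """A triple R(C_t, C_w). Field names are hypothetical."""
    prop: str           # property R
    domain_concept: str # C_t, concept of the defined term
    range_concept: str  # C_w, concept of a relevant word in the fragment


def extract_instances(prop, term_concept, fragment, relevant):
    """Build one triple per position flagged as ontologically relevant.

    fragment is a list of lemmas; relevant holds the 0-based indices of the
    words marked by r(f, k). Mapping each lemma to a sense is left out here.
    """
    return [PropertyInstance(prop, term_concept, fragment[k]) for k in relevant]


# The still life example: only fruit, flower and object are relevant.
fragment = ["depiction", "of", "fruit", "flower", "and", "other", "object"]
triples = extract_instances("depicts", "still_life", fragment, [2, 3, 6])
```

One annotated fragment can thus yield several property instances, one for each marked word, while function words and the head noun depiction contribute none.]]></Paragraph>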
    <Paragraph position="52"> The second problem is that words that are candidate ranges can be ambiguous, and they often are, especially if they do not belong to the domain glossary 𝒢. Considering the previous example of the property depicts, the word fruit is not a term of the AAT glossary, and it has 3 senses in WordNet (the fruit of a plant, the consequence of some action, an amount of product). The property depicts, as defined in the CIDOC, simply requires that the range be of type Entity (E1). Therefore, all three senses of fruit in WordNet satisfy this constraint. Whenever the range constraints in a relation checker do not allow a full disambiguation, we apply the SSI algorithm (Navigli and Velardi, 2005), a semantic disambiguation algorithm based on structural pattern recognition, available on-line.</Paragraph>
  </Section>
  <Section position="7" start_page="8" end_page="8" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> The algorithm is applied to the words belonging to the fragment f and is based on the detection of relevant semantic interconnection patterns between the appropriate senses. These patterns are extracted from a lexical knowledge base that merges WordNet with other resources, like word collocations, on-line dictionaries, etc.</Paragraph>
    <Paragraph position="1"> For example, in the fragment &quot;depictions of fruit, flowers, and other objects&quot; the following properties are created for the concept still_life:</Paragraph>
    <Paragraph position="3"> Some of the semantic patterns supporting this sense selection are shown in Figure 2.</Paragraph>
    <Paragraph position="4"> A further possibility is that the range of a relation R is a concept instance. We create concept instances if the word w extracted from the fragment f is a named entity. For example, the definition of Venetian lace is annotated as &quot;Refers to needle lace created &lt;current-or-former-location&gt; in Venice&lt;/current-or-former-location&gt; [...]&quot;.</Paragraph>
    <Paragraph position="5"> As a result, the following triple is produced:</Paragraph>
    <Paragraph position="7"> where Venetian_lace#1 is the concept label generated for the term Venetian lace in the AAT and Venice is an instance of the concept city#1 (city, metropolis, urban center) in WordNet.</Paragraph>
    <Paragraph position="8">  SSI is an on-line knowledge-based WSD algorithm accessible from http://lcl.di.uniroma1.it/ssi. The on-line version also outputs the detected semantic connections (as those in Figure 2).</Paragraph>
    <Paragraph position="10"/>
  </Section>
</Paper>