<?xml version="1.0" standalone="yes"?>
<Paper uid="J04-2002">
  <Title>Learning Domain Ontologies from Document Warehouses and Dedicated Web Sites</Title>
  <Section position="3" start_page="153" end_page="153" type="metho">
    <SectionTitle>
2. The Ontology Engineering Architecture
</SectionTitle>
    <Paragraph position="0"> Figure 2 reports the proposed ontology-engineering method, that is, the sequence of steps and the intermediate outputs that are produced in building a domain ontology. As shown in the figure, ontology engineering is an iterative process involving concept learning (OntoLearn), machine-supported concept validation (ConSys), and management (SymOntoX).</Paragraph>
    <Paragraph position="1"> The engineering process starts with OntoLearn exploring available documents and related Web sites to learn domain concepts and detect taxonomic relations among them, producing as output a domain concept forest. Initially, we base concept learning on external, generic knowledge sources (we use WordNet and SemCor). In subsequent cycles, the domain ontology receives progressively more use as it becomes adequately populated.</Paragraph>
    <Paragraph position="2"> Ontology validation is undertaken with ConSys (Missikoff and Wang 2001), a Web-based groupware package that performs consensus building by means of thorough validation by the representatives of the communities active in the application domain. Throughout the cycle, OntoLearn operates in connection with the ontology management system, SymOntoX (Formica and Missikoff 2003). Ontology engineers use this management system to define and update concepts and their mutual connections, thus allowing the construction of a semantic net. Further, SymOntoX's environment can attach the automatically learned domain concept trees to the appropriate nodes of the core ontology, thereby enriching concepts with additional information. SymOntoX</Paragraph>
  </Section>
  <Section position="4" start_page="153" end_page="158" type="metho">
    <SectionTitle>
3 The LEKS-CNR laboratory in Rome.
4 The Fetish EC project, IST-13015 (http://fetish.singladura.com/index.php) and the Harmonise EC
</SectionTitle>
    <Paragraph position="0"> project, IST-2000-29329 (http://dbs.cordis.lu), both in the tourism domain, and the INTEROP Network of Excellence on interoperability IST-2003-508011.</Paragraph>
    <Paragraph position="1">  The ontology-engineering chain.</Paragraph>
    <Paragraph position="2"> also performs consistency checks. The self-learning cycle in Figure 2 consists, then, of two steps: first, domain users and experts use ConSys to validate the automatically learned ontology and forward their suggestions to the knowledge engineers, who implement them as updates to SymOntoX. Then, the updated domain ontology is used by OntoLearn to learn new concepts from new documents.</Paragraph>
    <Paragraph position="3"> The focus of this article is the description of the OntoLearn system. Details on other modules of the ontology-engineering architecture can be found in the referenced papers.</Paragraph>
    <Paragraph position="4"> 3. Architecture of the OntoLearn System Figure 3 shows the architecture of the OntoLearn system. There are three main phases: First, a domain terminology is extracted from available texts in the application domain (specialized Web sites and warehouses, or documents exchanged among members of a virtual community), and filtered using natural language processing and statistical techniques. Second, terms are semantically interpreted (in a sense that we clarify in Section 3.2) and ordered according to taxonomic relations, generating a domain concept forest (DCF). Third, the DCF is used to update the existing ontology (WordNet or any available domain ontology).</Paragraph>
    <Paragraph position="5"> In a &amp;quot;stand-alone&amp;quot; mode, OntoLearn automatically creates a specialized view of WordNet, pruning certain generic concepts and adding new domain concepts. When used within the engineering chain shown in Figure 2, ontology integration and updating is performed by the ontology engineers, who update an existing core ontology using SymOntoX.</Paragraph>
    <Paragraph position="6"> In this article we describe the stand-alone procedure.</Paragraph>
    <Section position="1" start_page="154" end_page="157" type="sub_section">
      <SectionTitle>
3.1 Phase 1: Terminology Extraction
</SectionTitle>
      <Paragraph position="0"> Terminology is the set of words or word strings that convey a single (possibly complex) meaning within a given community. In a sense, terminology is the surface appearance, in texts, of the domain knowledge of a community. Because of their low ambiguity and high specificity, these words are also particularly useful for conceptualizing a knowledge domain or for supporting the creation of a domain ontology. Candidate terminological expressions are usually captured with more or less shallow techniques, ranging from stochastic methods (Church and Hanks 1989; Yamamoto and Church 2001) to more sophisticated syntactic approaches (Jacquemin 1997).</Paragraph>
      <Paragraph position="1">  The architecture of OntoLearn.</Paragraph>
      <Paragraph position="2"> Obviously, richer syntactic information positively influences the quality of the result to be input to the statistical filtering. In our experiments we used the linguistic processor ARIOSTO (Basili, Pazienza, and Velardi 1996) and the syntactic parser CHAOS (Basili, Pazienza, and Zanzotto 1998). We parsed the available documents in the application domain in order to extract a list T c of syntactically plausible terminological noun phrases (NPs), for example, compounds (credit card), adjective-NPs (local tourist information office), and prepositional-NPs (board of directors). In English, the first two constructs are the most frequent.</Paragraph>
      <Paragraph position="3"> OntoLearn uses a novel method for filtering &amp;quot;true&amp;quot; terminology, described in detail in (Velardi, Missikoff, and Basili 2001). The method is based on two measures, called Domain Relevance (DR) and Domain Consensus (DC), that we introduce hereafter.  Computational Linguistics Volume 30, Number 2 High frequency in a corpus is a property observable for terminological as well as nonterminological expressions (e.g., last week or real time). We measure the specificity of a terminological candidate with respect to the target domain via comparative analysis across different domains. To this end a specific DR score has been defined. A quantitative definition of the DR can be given according to the amount of information captured within the target corpus with respect to a larger collection of corpora. More precisely, given a set of n domains {D  where the conditional probabilities (P(t|D</Paragraph>
      <Paragraph position="5"> (i.e., in its related corpus).</Paragraph>
      <Paragraph position="6"> Terms are concepts whose meaning is agreed upon by large user communities in a given domain. A more selective analysis should take into account not only the overall occurrence of a term in the target corpus but also its appearance in single documents. Domain terms (e.g., travel agent) are referred to frequently throughout the documents of a domain, while there are certain specific terms with a high frequency within single documents but completely absent in others (e.g., petrol station, foreign income). Distributed usage expresses a form of consensus tied to the consolidated semantics of a term (within the target domain) as well as to its centrality in communicating domain knowledge.</Paragraph>
      <Paragraph position="7"> A second relevance indicator, DC, is then assigned to candidate terms. DC measures the distributed use of a term in a domain D k . The distribution of a term t in documents d [?] D k can be taken as a stochastic variable estimated throughout all</Paragraph>
      <Paragraph position="9"> More precisely, the domain consensus is expressed as follows:</Paragraph>
      <Paragraph position="11"> Nonterminological (or nondomain) candidate terms are filtered using a combination of measures (1) and (2).</Paragraph>
      <Paragraph position="12"> For each candidate term the following term weight is computed:</Paragraph>
      <Paragraph position="14"> is a normalized entropy and a,b [?] (0, 1). We experimented with several thresholds for a and b, with consistent results in two domains (Velardi, Missikoff, and Basili 2001). Usually, a value close to 0.9 is to be chosen for a. The threshold for  Navigli and Velardi Learning Domain Ontologies Table 1 The first 10 terms from a tourism (left) and finance (right) domain.</Paragraph>
      <Paragraph position="15"> tourism finance travel information vice president shopping street net income airline ticket executive officer booking form composite trading bus service stock market car rental interest rate airport transfer million share contact detail holding company continental breakfast third-quarter net tourist information office chief executive  A lexicalized tree in a tourism domain. b depends upon the number N of documents in the training set of D k . When N is sufficiently large, &amp;quot;good&amp;quot; values are between 0.35 and 0.25. Table 1 shows some of the accepted terms in two domains, ordered by TW.</Paragraph>
    </Section>
    <Section position="2" start_page="157" end_page="158" type="sub_section">
      <SectionTitle>
3.2 Phase 2: Semantic Interpretation
</SectionTitle>
      <Paragraph position="0"> The set of terms accepted by the filtering method described in the previous section are first arranged in subtrees, according to simple string inclusion.</Paragraph>
      <Paragraph position="1">  of what we call a lexicalized tree T . In absence of semantic interpretation, it is not possible to fully capture conceptual relationships between concepts (for example, the taxonomic relation between bus service and public transport service in Figure 4). Semantic interpretation is the process of determining the right concept (sense) for each component of a complex term (this is known as sense disambiguation) and then identifying the semantic relations holding among the concept components, in order to build a complex concept. For example, given the complex term bus service, we would like to associate a complex concept with this term as in Figure 5, where bus#1 and service#1 are unique concept names taken from a preexisting concept inventory (e.g., WordNet, though other general-purpose ontologies could be used), and INSTR is a semantic relation indicating that there is a service, which is a type of work (service#1), operated through (instrument) a bus, which is a type of public transport (bus#1).</Paragraph>
      <Paragraph position="2"> 5 Inclusion is on the right side in the case of compound terms (the most common syntactic construct for terminology in English).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="158" end_page="170" type="metho">
    <Paragraph position="0"> Figure 5 A complex term (bus service) represented as a complex concept: bus#1 -INSTR-> service#1. This kind of semantic interpretation is indeed possible if the meaning of a new complex concept can be interpreted compositionally from its components. Clearly, this is not always possible. Furthermore, some of the component concepts may be absent from the initial ontology. In this case, other strategies can be adopted, as sketched in Section 6.</Paragraph>
    <Paragraph position="1"> To perform semantic disambiguation, we use available lexical resources, like Word-Net and annotated corpora, and a novel word sense disambiguation (WSD) algorithm called structural semantic interconnection. A state-of-art inductive learner is used to learn rules for tagging concept pairs with the appropriate semantic relation. In the following, we first describe the semantic disambiguation algorithm (Sections 3.2.1 to 3.2.4). We then describe the semantic relation extractor (Section 3.2.5).  extending and trimming a general-purpose ontology. In its current implementation, it uses a concept inventory taken from WordNet. WordNet associates one or more synsets (e.g., unique concept names) to over 120,000 words but includes very few domain terms: for example, bus and service are individually included, but not bus service as a unique term.</Paragraph>
    <Paragraph position="2"> The primary strategy used by OntoLearn to attach a new concept under the appropriate hyperonym of an existing ontology is compositional interpretation. Let</Paragraph>
    <Paragraph position="4"> be a valid multiword term belonging to a lexicalized tree T .</Paragraph>
    <Paragraph position="5"> Let w  be the syntactic head of t (e.g., the rightmost word in a compound, or the leftmost in a prepositional NP). The process of compositional interpretation associates the appropriate WordNet synset S</Paragraph>
    <Paragraph position="7"> in t. The sense of t is hence compositionally defined as</Paragraph>
    <Paragraph position="9"> corresponding to sense 1 of company (an institution created to conduct business) and sense 3 of transport (the commercial enterprise of transporting goods and materials).</Paragraph>
    <Paragraph position="10"> Compositional interpretation is a form of word sense disambiguation. In this section, we define a new approach to sense disambiguation called structural semantic interconnections (SSI).</Paragraph>
    <Paragraph position="11"> The SSI algorithm is a kind of structural pattern recognition. Structural pattern recognition (Bunke and Sanfeliu 1990) has proven to be effective when the objects to be classified contain an inherent, identifiable organization, such as image data and time-series data. For these objects, a representation based on a &amp;quot;flat&amp;quot; vector of features causes a loss of information that has a negative impact on classification per- null Two representations of the same concept: (a) as a feature vector and (b) as a semantic graph. formances. The classification task in a structural pattern recognition system is implemented through the use of grammars that embody precise criteria to discriminate among different classes. The drawback of this approach is that grammars are by their very nature application and domain specific. However, automatic learning techniques may be adopted to learn from available examples.</Paragraph>
    <Paragraph position="12"> Word senses clearly fall under the category of objects that are better described through a set of structured features. Compare for example the following two feature-vector (a) and graph-based (b) representations of the WordNet definition of coach#5 (a vehicle carrying many passengers, used for public transport) in Figure 6. The graph representation shows the semantic interrelationships among the words in the definition, in contrast with the flat feature vector representation.</Paragraph>
    <Paragraph position="13"> Provided that a graph representation for alternative word senses in a context is available, disambiguation can be seen as the task of detecting certain &amp;quot;meaningful&amp;quot; interconnecting patterns among such graphs. We use a context-free grammar to specify the type of patterns that are the best indicators of a semantic interrelationship and to select the appropriate sense configurations accordingly.</Paragraph>
    <Paragraph position="14"> In what follows, we first describe the method to obtain a graph representation of word senses from WordNet and other available resources. Then, we illustrate the disambiguation algorithm.</Paragraph>
    <Paragraph position="15"> Creating a graph representation for word senses. A graph representation of word senses is automatically built using a variety of knowledge source:  1. WordNet. In WordNet, in addition to synsets, the following information is provided: (a) a textual sense definition (gloss); (b) hyperonymy links (i.e., kind-of relations: for example, bus#1 is a kind of public transport#1); (c) meronymy relations (i.e., part-of relations: for example, bus#1 has part roof#2 and window#2); (d) other syntactic-semantic relations, as detailed later, not systematically provided throughout the lexical knowledge base.</Paragraph>
    <Paragraph position="16"> 2. Domain labels  extracted by a semiautomatic methodology described in Magnini and Cavaglia (2000) for assigning domain information (e.g., tourism, zoology, sport) to WordNet synsets.</Paragraph>
    <Paragraph position="17">  3. Annotated corpora providing examples of word sense usages in contexts: 6 Domain labels have been kindly made available by the IRST to our institution for research purposes.  is a corpus in which each word in a sentence is assigned a sense selected from the WordNet sense inventory for that word. Examples of a SemCor document are the following:  is a corpus in which each document is a collection of sentences having a certain word in common. The corpus provides a sense tag for each occurrence of the word within the document. Examples from the document focused on the noun house are the following: Ten years ago, he had come to the house#2 to be interviewed.</Paragraph>
    <Paragraph position="18"> Halfway across the house#1, he could have smelled her morning perfume.</Paragraph>
    <Paragraph position="19"> (c) In WordNet, besides glosses, examples are sometimes provided for certain synsets. From these examples, as for the LDC and SemCor corpora, co-occurrence information can be extracted.</Paragraph>
    <Paragraph position="20"> Some examples are the following: Overnight accommodations#4 are available.</Paragraph>
    <Paragraph position="21"> Is there intelligent#1 life in the universe? An intelligent#1 question.</Paragraph>
    <Paragraph position="22"> The use of other semantic knowledge repositories (e.g., FrameNet  and Verbnet  ) is currently being explored, the main problem being the need of harmonizing these resources with the WordNet sense and relations inventory.</Paragraph>
    <Paragraph position="23"> The information available in WordNet and in the other resources described in the previous section is used to automatically generate a labeled directed graph (digraph) representation of word senses. We call this a semantic graph.</Paragraph>
    <Paragraph position="24">  and 2 (conductor) of bus; in the figure, nodes represent concepts (WordNet synsets) and edges are semantic relations. In each graph in the figure, we include only nodes with a maximum distance of three from the central node, as suggested by the dashed oval. This distance has been experimentally established.</Paragraph>
    <Paragraph position="25"> The following semantic relations are used: hyperonymy (car is a kind of vehicle,  -). All these relations are explicitly encoded in WordNet, except for the last three. Topic, gloss, and domain are extracted from annotated corpora, sense definitions, and domain labels, respectively. Topic expresses a co-occurrence relation between concepts in texts, extracted from annotated corpora and usage examples. Gloss relates a concept to another concept occurring in its natural language definition. Finally, domain relates two concepts sharing the same domain label. In parsing glosses, we use a stop list to eliminate the most frequent words. The SSI algorithm. The SSI algorithm is a knowledge-based iterative approach to word sense disambiguation. The classification problem can be stated as follows:  Computational Linguistics Volume 30, Number 2</Paragraph>
    <Paragraph position="27"> are structural specifications of the possible senses for t (semantic graphs).</Paragraph>
    <Paragraph position="28"> * G is a grammar describing structural relations (semantic interconnections) among the objects to be analyzed.</Paragraph>
    <Paragraph position="29"> * Determine how well the structure of I matches that of each of</Paragraph>
    <Paragraph position="31"> Structural representations are graphs, as previously detailed. The SSI algorithm consists of an initialization step and an iterative step.</Paragraph>
    <Paragraph position="32"> In a generic iteration of the algorithm, the input is a list of co-occurring terms</Paragraph>
    <Paragraph position="34"> ], that is, the semantic interpretation of T, where S</Paragraph>
    <Paragraph position="36"> is either the chosen sense for t i (i.e., the result of a previous disambiguation step) or the empty set (i.e., the term is not yet disambiguated). A set of pending terms is also maintained, P = {t</Paragraph>
    <Paragraph position="38"> = [?]}. I is referred to as the semantic context of T and is used, at each step, to disambiguate new terms in P. The algorithm works in an iterative way, so that at each stage either at least one term is removed from P (i.e., at least one pending term is disambiguated) or the procedure stops because no more terms can be disambiguated. The output is the updated list I of senses associated with the input terms T.</Paragraph>
    <Paragraph position="39"> Initially, the list I includes the senses of monosemous terms in T. If no monosemous terms are found, the algorithm uses an initialization policy described later.</Paragraph>
    <Paragraph position="40"> During a generic iteration, the algorithm selects those terms t in P showing an interconnection between at least one sense S of t and one or more senses in I. The likelihood that a sense S will be the correct interpretation of t, given the semantic context I, is estimated by the function f</Paragraph>
    <Paragraph position="42"> all the concepts in WordNet, and defined as follows:  ) is given by the sum of the 11 Note that with S</Paragraph>
    <Paragraph position="44"> we refer interchangeably to the semantic graph associated with a sense or to the sense label (i.e., the synset).</Paragraph>
    <Paragraph position="45">  G is described in the next subsection.) Finally, the algorithm selects S</Paragraph>
    <Paragraph position="47"> to improve the robustness of the system's choices.</Paragraph>
    <Paragraph position="48"> At the end of a generic iteration, a number of terms are disambiguated, and each of them is removed from the set of pending terms P. The algorithm stops with output I when no sense S prime can be found for the remaining terms in P such that f</Paragraph>
    <Paragraph position="50"> that is, P cannot be further reduced. In each iteration, interconnections can be found only between the sense of a pending term t and the senses disambiguated during the previous iteration.</Paragraph>
    <Paragraph position="51"> If no monosemous words are found, we explore two alternatives: either we provide manually the synset of the root term h (e.g., service#1 in Figure 4: work done by one person or group that benefits another), or we fork the execution of the algorithm into as many processes as the number of senses of the root term h. Let n be such a number. For each process i (i = 1,..., n), the input is given by I</Paragraph>
    <Paragraph position="53"> Figure 8 provides pseudocode for the SSI algorithm.</Paragraph>
    <Paragraph position="54">  connecting patterns among semantic graphs representing concepts in the ontology. We define a pattern as a sequence of consecutive semantic relations e</Paragraph>
    <Paragraph position="56"> the set of terminal symbols, that is, the vocabulary of conceptual relations. Two rela-</Paragraph>
    <Paragraph position="58"> are consecutive if the edges labeled with e</Paragraph>
    <Paragraph position="60"> are incoming and/or outgoing from the same concept node, for example,</Paragraph>
    <Paragraph position="62"> In its current version, the grammar G has been defined manually, inspecting the intersecting patterns automatically extracted from pairs of manually disambiguated word senses co-occurring in different domains. Some of the rules in G are inspired by previous work in the eXtended WordNet  project. The terminal symbols e</Paragraph>
    <Paragraph position="64"> the conceptual relations extracted from WordNet and other on-line lexical-semantic resources, as described in Section 3.2.1.</Paragraph>
    <Paragraph position="65">  Computational Linguistics Volume 30, Number 2</Paragraph>
    <Paragraph position="67"> The SSI algorithm in pseudocode.</Paragraph>
    <Paragraph position="68">  (for example, in web site, the gloss of web#5 contains the word site;inwaiter service, the gloss of restaurant attendant#1, hyperonym of waiter#1, contains the word service) 8. topic, if S  (for example, in the term archeological site, in which both words are tagged with sense 1 in a SemCor file; notice that WordNet provides no mutual information about them; also consider picturesque village: WordNet provides the example &amp;quot;a picturesque village&amp;quot; for sense 1 of picturesque)  (for example, in mountain range, mountain#1 and range#5 both contain the word hill so that the right senses can be chosen) 12. hyperonymy/meronymy+gloss path, if [?]G [?] Synsets : G  and there is a parallelism path between S  and G.</Paragraph>
    <Paragraph position="69">  applied to the task of disambiguating a lexicalized tree T . With reference to Figure 4, the list T is initialized with all the component words in T , that is, [service, train, ferry, car, boat, car-ferry, bus, coach, transport, public transport, taxi, express, customer]. Step 1. In T there are four monosemous words, taxi, car-ferry, public transport, and customer; therefore, we have</Paragraph>
    <Paragraph position="71"> Step 2. During the second iteration, the following rules are matched:</Paragraph>
    <Paragraph position="73"/>
    <Paragraph position="75"> Then the algorithm stops since the list P is empty.</Paragraph>
    <Paragraph position="76">  all the terms in a lexicalized tree T are disambiguated. Subsequently, we proceed as follows: a. Concept clustering: Certain concepts can be clustered in a unique concept on the basis of pertainymy, similarity, and synonymy (e.g., manor house and manorial house, expert guide and skilled guide, bus service and coach service, respectively); notice again that we detect semantic relations between concepts, not words. For example, bus#1 and coach#5 are synonyms, but this relation does not hold for other senses of these two words.</Paragraph>
    <Paragraph position="77"> b. Hierarchical structuring: Taxonomic information in WordNet is used to replace syntactic relations with kind-of relations (e.g., ferry service kind-of  [?]boat service), on the basis of hyperonymy, rather than string inclusion as in T .</Paragraph>
    <Paragraph position="78"> 14 Notice that bus#1 and coach#5 belong to the same synset, therefore they are disambiguated by the same rule.</Paragraph>
    <Paragraph position="79">  Computational Linguistics Volume 30, Number 2 service transport service car service public transport service car service#2 boat service coach service, bus service train servicebus service#2 taxi service coach service#2 express service#2express service coach service#3 ferry service  Domain concept tree.</Paragraph>
    <Paragraph position="80"> Each lexicalized tree T is finally transformed into a domain concept tree U. Figure 9 shows the concept tree obtained from the lexicalized tree of Figure 4. For the sake of legibility, in Figure 9 concepts are labeled with the associated terms (rather than with synsets), and numbers are shown only when more than one semantic interpretation holds for a term. In fact, it is possible to find more than one matching hyperonymy relation. For example, an express can be a bus or a train, and both interpretations are valid, because they are obtained from relations between terms within the domain.</Paragraph>
    <Paragraph position="81">  volves finding the appropriate semantic relations holding among concept components. In order to extract semantic relations, we need to do the following: * Select an inventory of domain-appropriate semantic relations. * Learn a formal model to select the relations that hold between pairs of concepts, given ontological information on these concepts. * Apply the model to semantically relate the components of a complex concept.</Paragraph>
    <Paragraph position="82"> First, we selected an inventory of semantic relations types. To this end, we consulted John Sowa's (1984) formalization on conceptual relations, as well as other studies conducted within the CoreLex,  FrameNet, and EuroWordNet (Vossen 1998) projects. In the literature, no systematic definitions are provided for semantic relations; therefore we selected only the more intuitive and widely used ones. To begin, we selected a kernel inventory including the following 10 relations, which we found pertinent (at least) to the tourism and finance  -[?] hotel). This set can be easily adapted or extended to other domains.</Paragraph>
    <Paragraph position="83"> In order to associate the appropriate relation(s) that hold among the components of a domain concept, we decided to use inductive machine learning. In inductive learning, one has first to manually tag with the appropriate semantic relations a subset of domain concepts (this is called the learning set) and then let an inductive learner build a tagging model. Among the many available inductive learning programs, we experimented both with Quinlan's C4.5 and with TiMBL (Daelemans et al. 1999). An inductive learning system requires selecting a set of features to represent instances in the learning domain. Instances in our case are concept-relation-concept triples (e.g., wine OBJ -[?] production), where the type of relation is given only in the learning set.</Paragraph>
    <Paragraph position="84"> We explored several alternatives for feature selection. We obtained the best result when representing each concept component by the complete list of its hyperonyms (up to the topmost), as follows: feature [?] vector[[list of hyperonyms]  For example, the feature vector for tourism operator, where tourism is the modifier and operator is the head, is built as the sequence of hyperonyms of tourism#1: [tourism#1, commercial enterprise#2, commerce#1, transaction#1, group-action#1, act#1, humanaction#1], followed by the sequence of hyperonyms for operator#2 [operator#2, capitalist#2, causal agent#1, entity#1, life form#1, person#1, individual#1]. Features are converted into a binary representation to obtain vectors of equal length. We ran several experiments, using a tagged set of 405 complex concepts, a varying fragment of which were used for learning, the remainder for testing (we used two-fold cross-validation). Overall, the best experiment provided a 6% error rate over 405 examples and produced around 20 classification rules.</Paragraph>
    <Paragraph position="85"> The following are examples of extracted rules (from C4.5), along with their confidence factor (in parentheses) and examples: If in modifier [knowledge domain#1, knowledge base#1] = 1 then relation THEME(63%) Examples : arts festival, science center If in modifier [building material#1]=1 then relation MATTER(50%) Examples : stave church, cobblestone street If in modifier [conveyance#3, transport#1]=1 and in head[act#1,human act#1] = 1 then relation MANNER(92.2%) Examples : bus service, coach tour Selection and extraction of conceptual relations is one of the active research areas in the OntoLearn project. Current research is directed toward the exploitation of on-line resources (e.g., the tagged set of conceptual relations in FrameNet) and the automatic  Computational Linguistics Volume 30, Number 2 generation of glosses for complex concepts (e.g., for travel service we have travel#1</Paragraph>
  </Section>
  <Section position="6" start_page="170" end_page="170" type="metho">
    <Paragraph position="0"> -PURPOSE-> service#1: "a kind of service, work done by one person or group that benefits another, for travel, the act of going from one place to another"). Automatic generation of glosses (see Navigli et al. [2004] for preliminary results) relies on the compositional interpretation criterion, as well as on the semantic information provided by conceptual relations.</Paragraph>
    <Section position="1" start_page="170" end_page="170" type="sub_section">
      <SectionTitle>
3.3 Phase 3: Ontology Integration
</SectionTitle>
      <Paragraph position="0"> The domain concept forest generated by OntoLearn is used to trim and update Word-Net, creating a domain ontology. WordNet is pruned and trimmed as follows:  * After the domain concept trees are attached to the appropriate nodes in WordNet in either a manual or an automatic manner, all branches not containing a domain node can be removed from the WordNet hierarchy.</Paragraph>
      <Paragraph position="1"> * An intermediate node in WordNet is pruned whenever the following conditions all hold: 1. It has no &amp;quot;brother&amp;quot; nodes.</Paragraph>
      <Paragraph position="2"> 2. It has only one direct hyponym.</Paragraph>
      <Paragraph position="3"> 3. It is not the root of a domain concept tree.</Paragraph>
      <Paragraph position="4"> 4. It is not at a distance greater than two from a WordNet unique  beginner (this is to preserve a &amp;quot;minimal&amp;quot; top ontology). Figure 10 shows an example of pruning the nodes located over the domain concept tree rooted at wine#1. The appendix shows an example of a domain-adapted branch of WordNet in the tourism domain.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="170" end_page="174" type="metho">
    <SectionTitle>
4. Evaluation
</SectionTitle>
    <Paragraph position="0"> The evaluation of ontologies is recognized to be an open problem.</Paragraph>
    <Paragraph position="1">  Though the number of contributions in the area of ontology learning and construction has considerably increased in the past few years, especially in relation to the forthcoming Semantic Web, experimental data on the utility of ontologies are not available, other than those in Farquhar et al. (1998), in which an analysis of user distribution and requests is presented for the Ontology Server system. A better performance indicator would have been the number of users that access the Ontology Server on a regular basis, but the authors mention that regular users account for only a small percentage of the total. Efforts have recently being made on the side of ontology evaluation tools and methods, but available results are on the methodological rather than on the experimental side. The ontology community is still in the process of assessing an evaluation framework. We believe that, in absence of a commonly agreed-upon schema for analyzing the properties of an ontology, the best way to proceed is evaluating an ontology within some existing application. Our current work is precisely in this direction: The results of a terminology translation experiment appear in Navigli, Velardi, and Gangemi (2003), while preliminary results on a query expansion task are presented in Navigli and Velardi (2003).</Paragraph>
    <Paragraph position="2">  Pruning steps over the domain concept tree for wine1.</Paragraph>
    <Paragraph position="3"> In this evaluation section we proceed as follows: First, we provide an account of the feedback that we obtained from tourism experts participating in the Harmonise EC project on interoperability in the tourism domain. Then, we evaluate in detail the SSI algorithm, which is the &amp;quot;heart&amp;quot; of the OntoLearn methodology.</Paragraph>
    <Section position="1" start_page="170" end_page="172" type="sub_section">
      <SectionTitle>
4.1 OntoLearn as a Support for Ontology Engineers
</SectionTitle>
      <Paragraph position="0"> During the first year of the Harmonise project, a core ontology of about three hundred concepts was developed using ConSys and SymOntoX. In parallel, we collected a corpus of about one million words from tourism documents, mainly descriptions of travels and tourism sites. From this corpus, OntoLearn extracted an initial list of 14,383  Computational Linguistics Volume 30, Number 2 candidate terms (the first phase of terminology extraction in Section 3.1), from which the system derived a domain concept forest of 3,840 concepts, which were submitted to the domain experts for ontology updating and integration.</Paragraph>
      <Paragraph position="1"> The Harmonise ontology partners lacked the requisite expertise to evaluate the WordNet synset associations generated by OntoLearn for each complex term, therefore we asked them to evaluate only the domain appropriateness of the terms, arranged in a hierarchical fashion (as in Figure 9). We obtained a precision ranging from 72.9% to about 80% and a recall of 52.74%.</Paragraph>
      <Paragraph position="2">  The precision shift is due to the well-known fact that experts may have different intuitions about the relevance of a concept for a given domain. The recall estimate was produced by manually inspecting 6,000 of the initial 14,383 candidate terms, asking the experts to mark all the terms judged as &amp;quot;good&amp;quot; domain terms, and comparing the obtained list with the list of terms automatically filtered by OntoLearn (the phase of terminology filtering described in Section 3.1). As a result of the feedback obtained from the tourism experts, we decided that experts' interpretation difficulties could indeed be alleviated by associating a textual definition with each new concept proposed by OntoLearn. This new research (automatic generation of glosses) was mentioned in Section 3.2.5. We still need to produce an in-field evaluation of the improved readability of the ontology enriched with textual definitions.</Paragraph>
      <Paragraph position="3"> In any case, OntoLearn favored a considerable speed up in ontology development, since shortly after we provided the results of our OntoLearn tool, the Harmonise ontology reached about three thousand concepts. Clearly, the definition of an initial set of basic domain concepts is sufficiently crucial, to justify long-lasting and even heated discussions. But once an agreement is reached, filling the lower levels of the ontology can still take a long time, simply because it is a tedious and time-consuming task. Therefore we think that OntoLearn revealed itself indeed to be a useful tool within Harmonise.</Paragraph>
    </Section>
    <Section position="2" start_page="172" end_page="174" type="sub_section">
      <SectionTitle>
4.2 Evaluation of the SSI Word Sense Disambiguation Algorithm
</SectionTitle>
      <Paragraph position="0"> As we will argue in Section 5, one of the novel aspects of OntoLearn with respect to current ontology-learning literature is semantic interpretation of extracted terms.</Paragraph>
      <Paragraph position="1"> The SSI algorithm described in section 3.2 was subjected to several evaluation experiments by the authors of this article. The output of these experiments was used to tune certain heuristics adopted by the algorithm, for example, the dimension of the semantic graph (i.e., the maximum distance of a concept S prime from the central concept S) and the weights associated with grammar rules. To obtain a domain-independent tuning, tuning experiments were performed applying the SSI algorithm on standard word sense disambiguation data,  such as SemCor and Senseval all-words.</Paragraph>
      <Paragraph position="2">  However, OntoLearn's main task is terminology disambiguation, rather than plain word sense disambiguation. In complex terms, words are likely to be more tightly semantically related than in a sentence; therefore the SSI algorithm seems more appropriate.</Paragraph>
      <Paragraph position="3">  To test the SSI algorithm, we selected 650 complex terms from the set of 3,840 concepts mentioned in Section 4.1, and we manually assigned the appropriate 18 In a paper specifically dedicated to terminology extraction and evaluation (Velardi, Missikoff, and Basili 2001) we performed an evaluation also on an economics domain, with similar results. 19 In standard WSD tasks, the list T in input to the SSI algorithm is the set of all words in a sentence fragment to be disambiguated.</Paragraph>
      <Paragraph position="4"> 20 http://www.itri.brighton.ac.uk/events/senseval/ARCHIVE/resources.html#test 21 For better performance on a standard WSD task, it would be essential to improve lexical knowledge of verbs (e.g. by integrating VerbNet and FrameNet, as previously mentioned), as well as to enhance the grammar.</Paragraph>
      <Paragraph position="5">  Different runs of the semantic disambiguation algorithm when certain rules in the grammar G are removed.</Paragraph>
      <Paragraph position="6"> WordNet synset to each word composing the term. We used two annotators to ensure some degree of objectivity in the test set. In this task we experienced difficulties already pointed out by other annotators, namely, that certain synsets are very similar, to the point that choosing one or the other--even with reference to our specific tourism domain--seemed a mere guess. Though we can't say that our 650 tagged terms are a &amp;quot;gold standard,&amp;quot; evaluating OntoLearn against this test set still produced interesting outcomes and a good intuition of system performance. Furthermore, as shown by the example of Section 3.2.3, OntoLearn produces a motivation for its choices, that is, the detected semantic patterns. Though it was not feasible to analyze in detail all the output of the system, we found more than one example in which the choices of OntoLearn were more consistent  and more convincing than those produced by the annotators, to the point that OntoLearn could also be used to support human annotators in disambiguation tasks.</Paragraph>
      <Paragraph position="7"> First, we evaluated the effectiveness of the rules in G (Section 3.2.2) in regard to the disambiguation algorithm. Since certain rules are clearly related (for example, rules 4 and 5, rules 9 and 11), we computed the precision of the disambiguation when adding or removing groups of rules. The results are shown in Figure 11. The shaded bars in the figure show the results obtained when those terms containing unambiguous words are removed from the set of complex terms.</Paragraph>
      <Paragraph position="8"> We found that the grammar rules involving the gloss and hyperonym relations contribute more than others to the precision of the algorithm. Certain rules (not listed in 3.2.2 since they were eventually removed) were found to produce a negative effect. All the rules described in 3.2.2 were found to give more or less a comparable positive contribution to the final performance.</Paragraph>
      <Paragraph position="9"> 22 Consistent at least with respect to the lexical knowledge encoded in WordNet.  manually dis. head fully automatic precision recall Figure 12 Precision and recall for the terminology disambiguation task: manual disambiguation of the head and fully automatic disambiguation.</Paragraph>
      <Paragraph position="10"> The precision computed in Figure 11 refers to the case in which the head node of each term tree is sense-tagged manually. In Figure 12 the light and dark bars represent precision and recall, respectively, of the algorithm when the head (i.e., the root) of a term tree is manually assigned and when the disambiguation is fully automatic. The limited drop in performance (2%) of the fully automated task with respect to manual head disambiguation shows that, indeed, the assumption of a strong semantic interrelationship between the head and the other terms of the term tree is indeed justified.</Paragraph>
      <Paragraph position="11"> Finally, we computed a baseline, comparing the performance of the algorithm with that obtained by a method that always chooses the first synset for each word in a complex term. (We remind readers that in WordNet, the first sense is the most probable.) The results are shown in Figure 13, where it is seen, as expected, that the increment in performance with respect to the baseline is higher (around 5%) when only polysemous terms are considered. A 5% difference (3% with respect to the fully automatic disambiguation) is not striking, however, the tourism domain is not very technical, and often the first sense is the correct one. We plan in the future to run experiments with more technical domains, for example, economics or software products.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="174" end_page="175" type="metho">
    <SectionTitle>
5. Related Work
</SectionTitle>
    <Paragraph position="0"> Comprehensive ontology construction and learning has been an active research field in the past few years. Several workshops  have been dedicated to ontology learning and related issues. The majority of papers in this area propose methods for extending an existing ontology with unknown words (e.g., Agirre et al. 2000 and Alfonseca and Manandhar 2002). Alfonseca and Manandhar present an algorithm to enrich WordNet with unknown concepts on the basis of hyponymy patterns. For example, the pattern  ) captures a hyponymy relation between Shakespeare and poet in the appositive NP &amp;quot;Shakespeare, the poet.&amp;quot; This approach heavily  Comparison with a baseline.</Paragraph>
    <Paragraph position="1"> depends upon the ability of discovering such patterns, however, it appears a useful complementary strategy with respect to OntoLearn. OntoLearn, in fact, is unable to analyze totally unknown terms (though ongoing research is in progress to remedy this limitation). Berland and Charniak (1999) propose a method for extracting whole-part relations from corpora and enrich an ontology with this information. Few papers propose methods of extensively enriching an ontology with domain terms. For example, Vossen (2001) uses statistical methods and string inclusion to create lexicalized trees, as we do (see Figure 4). However, no semantic disambiguation of terms is performed. Very often, in fact, ontology-learning papers regard domain terms as concepts. A statistical classifier for automatic identification of semantic roles between co-occuring terms is presented in Gildea and Jurafsky (2002). In order to tag texts with the appropriate semantic role, Gildea and Jurafsky use a training set of fifty thousand sentences manually annotated within the FrameNet semantic labeling project. Finally, in Maedche and Staab (2000, 2001), an architecture is presented to help ontology engineers in the difficult task of creating an ontology. The main contribution of this work is in the area of ontology engineering, although machine-learning methods are also proposed to automatically enrich the ontology with semantic relations.</Paragraph>
  </Section>
class="xml-element"></Paper>