<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1005"> <Title>Identification of relevant terms to support the construction of Domain Ontologies</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 SymOntos: a symbolic Ontology management system </SectionTitle> <Paragraph position="0"> SymOntos (SymOntos 2000) is an Ontology management system under development at IASI-CNR (the interested reader may access the Web site reported in the bibliography). It supports the construction of an Ontology following the OPAL (Object, Process, and Actor modeling Language) methodology. OPAL is a methodology for the modeling and management of the Enterprise Knowledge Base; in particular, it allows the representation of the semi-formal knowledge of an enterprise. As already mentioned, an Ontology gathers a set of concepts that are considered relevant to a given domain. Therefore, in SymOntos the construction of an Ontology is performed by defining a set of concepts. In essence, a concept in SymOntos is characterized by: a term, which denotes the concept; a definition, which explains the meaning of the concept, generally in natural language; and a set of relationships with other concepts. Figure 1 shows an example of a filled concept form in the Tourism domain. The Domain Ontology is called OntoTour. Concept relationships play a key role, since they allow concepts to be inter-linked according to their semantics. 
The set of concepts, together with their links, forms a semantic network (Brachman 1979).</Paragraph> <Paragraph position="1"> In a semantically rich Ontology, both concepts and semantic relationships are categorized.</Paragraph> <Paragraph position="2"> Semantic relationships are distinguished according to three main categories, namely Broader Terms, Similar Terms, and Related Terms, which are described below.</Paragraph> <Paragraph position="3"> The Broader Terms relationship allows a set of concepts to be organized according to a generalization hierarchy (corresponding, in the literature, to the well-known ISA hierarchy). With the Similar Terms relationship, a set of concepts that are similar to the concept being defined is given, each annotated with a similarity degree. For instance, the concept Hotel can have as similar concepts Bed&Breakfast, with similarity degree 0.6, and Camping, with similarity degree 0.4.</Paragraph> <Paragraph position="4"> Finally, the Related Terms relationship allows the definition of a set of concepts that are semantically related to the concept being defined. Related concepts may be of different kinds, but they must be defined in the Ontology.</Paragraph> <Paragraph position="5"> For instance, TravelAgency, Customer, and CreditCard are concepts semantically related to the Hotel concept.</Paragraph> <Paragraph position="6"> In SymOntos, Broader relations are also referred to as &quot;vertical&quot; relations, while Related and Similar are called &quot;horizontal&quot; relations. 
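The concept structure just described (a term, a definition, and vertical/horizontal relationships) can be sketched as a small data model. The following Python sketch is illustrative only: the field names, the definition text, and the Accommodation parent are assumptions, not SymOntos's actual internal representation.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """Illustrative SymOntos-style concept form (not the actual schema)."""
    term: str                                      # the term denoting the concept
    definition: str                                # meaning, in natural language
    broader: list = field(default_factory=list)    # "vertical" ISA links
    similar: dict = field(default_factory=dict)    # term -> similarity degree
    related: list = field(default_factory=list)    # "horizontal" related terms

# The Hotel example from the text (definition and parent are hypothetical):
hotel = Concept(
    term="Hotel",
    definition="A commercial establishment providing lodging to travellers.",
    broader=["Accommodation"],
    similar={"Bed&Breakfast": 0.6, "Camping": 0.4},
    related=["TravelAgency", "Customer", "CreditCard"],
)
```

The similarity degrees live on the edges of the "horizontal" Similar relation, while the Broader list encodes the ISA hierarchy of the semantic network.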
SymOntos is equipped with functions to ensure concept management, verification, and Ontology closure (the information actually represented is in fact richer, but we omit a detailed description for the sake of space), and with a web interface to help develop consensus definitions in a given user community (Missikoff and Wang, 2000).</Paragraph> <Paragraph position="7"> These functions are not described here, since they are outside the scope of this paper.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Text Mining tools to construct a </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Domain Ontology </SectionTitle> <Paragraph position="0"> In Section 2 we illustrated the main features of the SymOntos system and provided an example of concept definition in the Tourism domain.</Paragraph> <Paragraph position="1"> The techniques described in this Section are intended to significantly improve human productivity in the process that a group of domain experts accomplishes in order to reach an agreement on: * identifying the key concepts and relationships in the domain of interest * providing an explicit representation of the conceptualization captured in the previous stage To reduce time, cost (and, sometimes, harsh discussions), it is highly advisable to refer to the documents available in the field. In this paper we show that text-mining tools may be of great help in this task.</Paragraph> <Paragraph position="2"> At the present state of the project, natural language processing tools have been used for the following tasks: 1. Identification of thesauric information, i.e. discovery of terms that are good candidate names for the concepts in the Ontology.</Paragraph> <Paragraph position="3"> 2. Identification of taxonomic relations among these terms.</Paragraph> <Paragraph position="4"> 3. 
Identification of related terms. For the sake of space, only the first method is described in this paper. Details of the other methods may be found in (Missikoff et al. 2001).</Paragraph> <Paragraph position="5"> To mine texts, we used a corpus processor named ARIOSTO (Basili et al. 1996), whose performance has been improved with the addition of a Named Entity recognizer (Cucchiarelli et al. 1998; Paliouras et al. 2000) and a chunk parser, CHAOS (Basili et al. 1998). In the following, we refer to this enhanced release of the system as ARIOSTO+. Figure 2 provides an example of the final output (simplified for the sake of readability) produced by ARIOSTO+ on a Tourism text. Interpreting the output predicates of Figure 2 is rather straightforward.</Paragraph> <Paragraph position="6"> The main principles underlying the CHAOS parsing technology are decomposition and lexicalization. Parsing is carried out in four steps: (1) POS tagging, (2) chunking, (3) verb argument structure matching, and (4) shallow grammatical analysis.</Paragraph> <Paragraph position="7"> Chunks are defined via prototypes. These are sequences of morphosyntactic labels mapped to specific grammatical functions, called chunk types. Examples of labels for the inner components are Det, N, Adj, and Prep, while types are related to traditional constituents, like NP, PP, etc.</Paragraph> <Paragraph position="8"> The definition of chunk prototypes in CHAOS is implemented through regular expressions. Chunks are the first type of output produced by the shallow parsing. Whenever the argument structure information cannot be used to link chunks, a plausibility measure is computed, which is inversely proportional to the number of colliding syntactic attachments (see the referred papers for details). 
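Chunk prototypes of this kind can be sketched as regular expressions over sequences of morphosyntactic labels. The following minimal Python sketch is illustrative only: the label inventory and the two prototypes are simplified stand-ins, not CHAOS's actual rules.

```python
import re

# Illustrative chunk prototypes: regular expressions over a string of
# morphosyntactic labels (one label per token, space-separated).
# These are simplified stand-ins for CHAOS's actual prototypes.
CHUNK_PROTOTYPES = {
    "NP": re.compile(r"(Det )?(Adj )*N( N)*"),
    "PP": re.compile(r"Prep (Det )?(Adj )*N( N)*"),
}

def find_chunks(labels):
    """Greedily match chunk prototypes left-to-right over a label sequence."""
    chunks = []
    i = 0
    while i < len(labels):
        best = None
        for ctype, proto in CHUNK_PROTOTYPES.items():
            m = proto.match(" ".join(labels[i:]) + " ")
            if m:
                span = m.group(0).split()
                if best is None or len(span) > len(best[1]):
                    best = (ctype, span)            # keep the longest match
        if best:
            chunks.append((best[0], labels[i:i + len(best[1])]))
            i += len(best[1])
        else:
            i += 1                                  # no prototype matches here
    return chunks
```

For instance, the label sequence Det Adj N Prep Det N (e.g., "the beautiful trail across the river") yields an NP chunk followed by a PP chunk.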
The first phase of the Ontology building process consists in the identification of the key concepts of the application domain.</Paragraph> <Paragraph position="9"> [Figure 2 - sample input text: &quot;The Colorado River Trail follows the Colorado River across 600 miles of beautiful Texas Country - from the pecan orchards of San Saba to the Gulf of Mexico.&quot;]</Paragraph> <Paragraph position="11"> Though concept names do not always have a lexical correspondent in natural language, especially at the most general levels of the Ontology, such a correspondence may be naturally drawn between the more specific concept names and domain-specific words and complex nominals, like: * Domain Named Entities (e.g., Gulf of Mexico, Texas Country, Texas Wildlife Association) * Domain-specific complex nominals (e.g., travel agent, reservation list, historic site, preservation area) * Domain-specific singleton words (e.g., hotel, reservation, trail, campground) We denote these singleton and complex words as Terminology.</Paragraph> <Paragraph position="12"> Terminology is the set of words or word strings that convey a single, possibly complex, meaning within a given community. In a sense, Terminology is the surface appearance, in texts, of the knowledge of a given domain. Because of their low ambiguity and high specificity, these words are particularly useful to conceptualize a knowledge domain; on the other hand, they are often not found in dictionaries. We now describe how the different types of Terminology are captured using NLP techniques.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Detection of Named Entities </SectionTitle> <Paragraph position="0"> Proper names are the instances of domain concepts; therefore, they populate the leaves of the Ontology.</Paragraph> <Paragraph position="1"> Proper names are pervasive in texts. 
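Spotting proper names in running text is commonly driven by capitalization patterns plus contextual trigger words. The following minimal Python sketch is illustrative only: the trigger lists and category labels are hypothetical, not the actual ARIOSTO+ rule base.

```python
import re

# Illustrative trigger words mapping a proper-name context to a NE category.
# These lists are hypothetical examples, not the actual ARIOSTO+ rules.
TRIGGERS = {
    "authority": "organization",
    "association": "organization",
    "lake": "location",
    "river": "location",
}

# A capitalized word sequence, e.g. "Tahoe" or "Texas Wildlife Association".
PROPER = re.compile(r"(?:[A-Z][a-z]+ )*[A-Z][a-z]+")

def classify_names(text):
    """Find capitalized sequences and classify them via trigger words."""
    entities = []
    for m in PROPER.finditer(text):
        name = m.group(0)
        last = name.split()[-1].lower()
        before = text[:m.start()].strip().split()[-1:]
        if last in TRIGGERS:                            # trigger inside the name
            entities.append((name, TRIGGERS[last]))
        elif before and before[0].lower() in TRIGGERS:  # trigger just before it
            entities.append((name, TRIGGERS[before[0].lower()]))
        else:
            entities.append((name, "unknown"))          # decision delayed
    return entities
```

A name that matches no trigger rule is left unclassified, mirroring the idea of delaying the decision to a later processing stage.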
In the Tourism domain, as in most domains, Named Entities (NE) represent more than 20% of the total occurring words.</Paragraph> <Paragraph position="2"> To detect NEs, we used a module already available in ARIOSTO+. A detailed description of the method summarized hereafter may be found in (Cucchiarelli et al. 1998; Paliouras et al. 2000). In ARIOSTO+, NEs are detected and semantically tagged according to three main conceptual categories: locations (objects in OPAL), organizations and persons (actors in OPAL). When contextual cues are sufficiently strong (e.g., &quot;lake Tahoe is located ...&quot;), names of locations are further sub-categorized (city, bank, hotel, geographic location, ...); therefore, the Ontology Engineer is provided with semantic cues to correctly place the instance under the appropriate concept node of the Ontology.</Paragraph> <Paragraph position="3"> Named Entity recognition is based on a set of contextual rules (e.g., &quot;a complex or simple proper name followed by the trigger word authority is an organization named entity&quot;). Rules are manually entered or machine-learned using decision lists. If a complex nominal does not match any contextual rule in the NE rule base, the decision is delayed until syntactic parsing, when a classification based on syntactically augmented context similarity is attempted.</Paragraph> <Paragraph position="4"> The NE tagger is also used to automatically enrich the Proper Names dictionary, thus leading to increasingly better coverage as new texts are analyzed.</Paragraph> <Paragraph position="5"> As reported in the referred papers, the F-measure (combined recall and precision with a weight factor w=0.5) of this method is consistently (i.e. 
with different experimental settings) around 89%, a performance that compares very well with other NE recognizers described in the literature.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Detection of domain-specific words and complex nominals </SectionTitle> <Paragraph position="0"> NEs are word strings partly or totally capitalized, and they often appear in well-characterized contexts. Therefore, the task of NE recognition is relatively well assessed in the literature. Other, non-named terminological patterns (which we will hereafter again refer to with the word &quot;terminology&quot;, though in principle terminology also includes NEs) are rather more difficult to capture, since the notion of term is mostly underspecified.</Paragraph> <Paragraph position="1"> In the literature (see Bourigault et al. (1998) for an overview of recent research), candidate terms are in general first captured with more or less shallow techniques, ranging from stochastic methods (Church, 1988) to more sophisticated syntactic approaches (e.g. Jacquemin, 1997), and are then ranked by statistical filtering.</Paragraph> <Paragraph position="2"> Obviously, richer syntactic information positively influences the quality of the result to be input to the statistical filtering. In our research, we used the CHAOS parser to select candidate terminological patterns. Nominal expressions usually denoting terminological items are very similar to chunk instances.</Paragraph> <Paragraph position="3"> Specific chunk prototypes have been used to match terminological structures.</Paragraph> <Paragraph position="4"> A traditional problem of purely syntactic approaches to term extraction is overgeneration: the candidates that satisfy the grammatical constraints are far more numerous than the true terminological entries. 
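The statistical filtering step typically scores a candidate multiword by the strength of association between its component words. The following Python sketch shows two standard measures, pointwise mutual information and the Dice coefficient; the corpus counts used below are toy values chosen for illustration, not figures from the Tourism corpus.

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise mutual information of a word pair:
    log2( P(x,y) / (P(x) * P(y)) ), probabilities estimated from counts
    over a corpus of n tokens."""
    p_xy = f_xy / n
    p_x, p_y = f_x / n, f_y / n
    return math.log2(p_xy / (p_x * p_y))

def dice(f_xy, f_x, f_y):
    """Dice coefficient: 2 * f(x,y) / (f(x) + f(y))."""
    return 2 * f_xy / (f_x + f_y)

# Toy counts (hypothetical): in a 100,000-word corpus, "travel" occurs 50
# times, "agent" 40 times, and the pair "travel agent" 30 times.
mi = mutual_information(30, 50, 40, 100_000)
dc = dice(30, 50, 40)

# A very frequent head word depresses both scores: if "visa" occurred
# 2,000 times, a pattern like "business visa" (30 occurrences) would score
# much lower, since the marginal probability of "visa" sits in the
# denominator of both measures.
mi_frequent_head = mutual_information(30, 50, 2_000, 100_000)
```

Note how the same pair frequency yields a much lower score when one component word is very frequent; this is the behavior discussed below for prominent domain words.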
Extensive studies suggest that statistical filters are typically faced with 50-80% of non-terminological candidates.</Paragraph> <Paragraph position="5"> Filtering of true terms can be done by estimating the strength of the association among the words in a candidate terminological expression. Commonly used association measures are Mutual Information (Fano, 1961) and the Dice factor (Smadja et al. 1996). In both formulas, the denominator combines the marginal probability of each word appearing in the candidate term. If one of these words is particularly frequent, both measures tend to be low. This is indeed not desirable, because certain very prominent domain words appear in many terminological patterns. For example, in the Tourism domain, the term visa appears both in isolation and in many multiword patterns, e.g. business visa, extended visa, multiple entry business visa, transit visa, student visa, etc. Such patterns are usually not captured by standard association measures, because of the high marginal probability of visa.</Paragraph> <Paragraph position="6"> Another widely used measure is the inverse document frequency, idf.</Paragraph> <Paragraph position="8"> idf(t) = log(N / dfi), where dfi is the number of documents in a domain Di that include a term t, and N is the total number of documents in a collection of n domains (D1, ..., Dn). The idea underlying this measure is to capture words that are frequent in a subset of documents representing a given domain, but relatively rare in a collection of generic documents. This measure also captures words that appear just once in a domain, which is in principle correct, but is also a major source of noise.</Paragraph> <Paragraph position="9"> Other corpus-driven studies suggested that pure frequency as a ranking score (i.e. 
a measure of the plausibility of any candidate to be a term) is a good metric (Daille 1994).</Paragraph> <Paragraph position="10"> However, frequency alone cannot be taken as a good indicator: several very frequent expressions (e.g. last week) are perfect candidates from a grammatical point of view, but are totally irrelevant as terminological expressions. It is worth noticing that this is true for two independent reasons. First, they are not related to specific knowledge pertinent to the target domain, but are language-specific: different languages express similar temporal or spatial expressions with different syntactic structures (adverbial vs. nominal phrases). As a result, such expressions have similar distributions in different domain corpora. True terminology, instead, is tightly related to specific concepts, so that its use in the target corpus is highly different with respect to other corpora.</Paragraph> <Paragraph position="11"> Second, common-sense expressions are only occasionally used, their meaning depending on factual rather than on conceptual information.</Paragraph> <Paragraph position="12"> They often occur only once in a document and tend not to repeat throughout the discourse. Their appearance is thus evenly spread throughout the documents of any corpus. Conversely, true terms are central elements in discourses and tend to recur in the documents where they appear. They are thus expected to show more skewed (i.e. 
low entropy) distributions.</Paragraph> <Paragraph position="13"> The above issues suggest the application of two different evaluation (utility) functions.</Paragraph> <Paragraph position="14"> Although both are related to the widely employed notion of term probability, they capture more specific aspects and provide a more effective ranking.</Paragraph> <Paragraph position="15"> As observed above, high frequency in a corpus is a property observable for terminological as well as non-terminological expressions (e.g. &quot;last week&quot; or &quot;real time&quot;).</Paragraph> <Paragraph position="16"> The specificity of a terminological candidate with respect to the target domain (Tourism in our case) is measured via comparative analysis across different domains. A specific score, called Domain Relevance (DR), has been defined.</Paragraph> <Paragraph position="17"> More precisely, given a set of n domains (D1, ..., Dn) (&quot;domains&quot; are pragmatically represented by text collections in different areas, e.g. medicine, finance, tourism, etc.), the domain relevance of a term t in a domain Dk is computed as: DR(t, Dk) = P(t|Dk) / max over j=1..n of P(t|Dj) (1)</Paragraph> <Paragraph position="19"> where the conditional probabilities P(t|Di) are estimated as the frequency of t in Di divided by the total frequency of the candidate terms in Di. Terms are concepts whose meaning is agreed upon by large user communities in a given domain. A more selective analysis should take into account not only the overall occurrence of a term in the target corpus but also its appearance in single documents.</Paragraph> <Paragraph position="20"> Domain concepts (e.g. travel agent) are referred to frequently throughout the documents of a domain,</Paragraph> <Paragraph position="21"> while there are certain specific terms with a high frequency within single documents but completely absent in others (e.g. 
petrol station, foreign income).</Paragraph> <Paragraph position="22"> Distributed usage expresses a form of consensus tied to the consolidated semantics of a term (within the target domain), as well as to its centrality in communicating domain knowledge. A second indicator to be assigned to candidate terms can thus be defined.</Paragraph> <Paragraph position="23"> Domain Consensus (DC) measures the distributed use of a term in a domain Di. The distribution of a term t in documents dj can be taken as a stochastic variable estimated throughout all dj in Di. The entropy H of this distribution expresses the degree of consensus of t in Di. More precisely, the domain consensus is expressed as follows: DC(t, Di) = sum over dj in Di of Pt(dj) log(1 / Pt(dj)) (2)</Paragraph> <Paragraph position="25"> The selection of candidate terms is performed using a combination of the measures (1) and (2). We experimented with several combinations of these two measures, with similar results. The results, discussed in the next Section, have been obtained by applying a threshold a to the set of terms ranked according to (1) and then eliminating the candidates with a score (2) lower than b.</Paragraph> </Section> </Section> </Paper>