
<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-0709">
  <Title>Using WordNet for Building WordNets</Title>
  <Section position="3" start_page="65" end_page="65" type="metho">
    <SectionTitle>
2 Our way of building WordNets
</SectionTitle>
    <Paragraph position="0"> As we have pointed out in the introduction, our aim has been to design a methodology (and a software environment supporting it) for facilitating the task of building WNs from our sources. As we are involved in EWN project (covering the Spanish part), the methodology has been defined to be compatible which the general approach, guidelines and landmarks of the whole project but also to allow a parallel development of the CtWN.</Paragraph>
    <Paragraph position="1"> The general approach for building EWN is described in \[Vossen et al. 97\]. Roughly speaking, the approach follows a top-down strategy trying to assure a high level of overlapping between languages, at least in the highest levels of the hierarchy, but reflecting the language-specific lexicalizations and providing the maximum of freedom and flexibility for building the individual WordNets. Basically it consists of three major steps: l) Construction of core-WordNets for a set of common base concepts (around 800 nouns and 200 verbs), 2) enrichment of these sets providing relational links and incorporating their direct semantic contexts and 3) top-down extension of these core-WordNets.</Paragraph>
    <Paragraph position="2"> In our case two different approaches have been followed for dealing with nouns and verbs 3.</Paragraph>
    <Paragraph position="3"> 3Although other categories can be included in EWN (and cross-category relations an be established) only nouns and verbs have been introduced until now in our WordNets except for demostration purposes. In the case of verbs most of the work has been performed manually. The main source of information has been the Pirapides database \[Castell6n et al. 97\] that consists of 3,600 English verbs forms organized around Levin's Semantic Classes connected to WN1.5 senses. The database contains the theta-Grids specifications for each verb (its semantic structure in terms of cases or thematic roles), translation to Spanish and Catalan forms 4 and diathesis information. The connections extracted from this database were cross-validated with the information provided by bilingual dictionaries in order to improve their accuracy.</Paragraph>
    <Paragraph position="4"> In the case of nouns we have followed EWN strategy in the next way: 1) The two highest levels of EnWN (top concepts and direct hyponyms) were manually translated into Spanish (including variants). The results were filtered dropping out words appearing less than five times as genus terms in our monolingual dictionary \[DGILE 87\] or occurring less than 50 times in DGILE definition corpus 5 and less than 100 times in LEXESP corpus 6.</Paragraph>
    <Paragraph position="5"> This initial set (Spanish core concepts, 361 synsets) was then compared with base concept sets of other sites of EWN (roughly the union of intersection pairs between languages was considered as the common base concepts set). The missing concepts in Spanish were manually added and vertically bottom up extended leading to the common Base Concept set (around 800 synsets).</Paragraph>
    <Paragraph position="6"> Catalan Base Concepts set was then built to cover the Spanish Base Concepts set.</Paragraph>
    <Paragraph position="7"> 2) The enrichment of the BC set has been performed in two steps. First, using bilinguals as main lexical source, and then using other sources (mainly taxonomies). These processes are described below.</Paragraph>
  </Section>
  <Section position="4" start_page="65" end_page="67" type="metho">
    <SectionTitle>
3 Using English WordNet with bilinguals
</SectionTitle>
    <Paragraph position="0"> bilinguals When trying to build a lexical taxonomy from scratch, we can take profit of a preexisting lexical taxonomy, EnWN in our case, assuming it is weU formed, as a skeleton of a taxonomy where we will fill in the lexical data. This ensures several advantages: it speeds up the construction of a large lexicon, as the only problem left is the</Paragraph>
    <Paragraph position="2"> decision where to attach the lexical data. There are also some problems: nobody ensures that the wellformedness of a lexical taxonomy for a language keeps true for another language, there must be semantic closeness between both languages. We have therefore assumed that the structure of the WN taxonomy would suffice in the earlier stages of the construction of the our WNs. So, we need to choose synonyms in Spanish 7 for the English words present in the original synsets of WN. One way to fulfil our requirements is using bilingual dictionaries (see \[Knight &amp; Luk 94\], \[Okumura &amp; Hovy 94\]). But we have to perform a sense disambiguation task in order to know which sense of both words (the Spanish and the English one) is being referred. In other words, we have to decide, for which sense of the Spanish word and for which synset in WordNet a relation of synonymy is being defined.</Paragraph>
    <Paragraph position="3"> There is also another minor problem to overcome, the unification of the two directions of the bilingual dictionary, which in few cases are symmetrical, to collect all translations together. It is true that unifying both directions of the bilingual dictionary implies loss of information potentially important (e.g. the order in which translations are written is relevant). But the lack of systematic work in the construction of the bilinguals makes this information of very doubtful utility.</Paragraph>
    <Paragraph position="4"> Thus, we have processed the bilinguals creating what we have called the homogeneous bilingual, which is a bilingual with both directions mixed. Then, for each Spanish word, we have collected all the words given as correct translations. And this has been the source for our work of attachment of Spanish words to WordNet synsets.</Paragraph>
    <Paragraph position="5"> Having collected all the translations of a Spanish word together, we have then classified the words in classes depending on their behaviour. They can be classified in three dimensions: polysemy, structural and conceptual.</Paragraph>
    <Paragraph position="6"> In the polysemy dimension, we classify the words in classes depending on the number and kind of translations. For example, all entries that have only one translation fall in the same class when this translation is monosemous in WN terms; all entries that have several translations fall in another class when these translations are polysemous.</Paragraph>
    <Paragraph position="7"> 7Although we ilustrate the methodology considering only Spanish, we performed the whole process for both Catalan and Spanish (and we provide results for both). In the structural dimension, we classify the words in classes depending on the relation that the translations owns in WN. For example, all entries which have several translations, sharing some of them a common synset in WN, fall in the same category; all entries in which one translation is a direct hyponym of other translation fall in the same category, etc.</Paragraph>
    <Paragraph position="8"> In the conceptual dimension, we apply the conceptual distance formula (which is explained in section 4.2.1.) on elements of the entries. For example, all entries with a low conceptual distance between synsets of their translations fall in the same class.</Paragraph>
    <Paragraph position="9"> Each of these classes defines a set of entries with the same behaviour. A confidence score has been assigned to each class by means of a manual validation of a significant sample extracted from them. We decided to accept the classes with a precission of 85% or more as classes of words to include in the first version of SpWN.</Paragraph>
    <Paragraph position="10"> Bilinguals can be used a step further stating a supposition: when several methods give the same result for the same Spanish word, the confidence for this attachment increases. We have carried out an experiment checking the classes in pairs, evaluating the precission of the set of intersections, and in all cases the precission increased. We have removed the cases where the precision was over 85%, the threshold applied in the previous experiment. This caused an increment of 40% of the original set of attachments.</Paragraph>
    <Paragraph position="11"> Furthermore, it is clear that if we merge more bilinguals, the homogeneous resulting will be larger, and will then generate larger classes. But, what is even more important, the classes are more precise because some bilinguals lack the inclusion of some translations for some words. Table I shows the current figures of both CtWN and SpWN following this approach (see \[Atserias et al. 97\] and \[Benitez et al. 98\] for further details of the whole process and tools used).</Paragraph>
    <Paragraph position="12">  The last point to address is the extension of the intersection method to larger number of classes. If with two classes the intersection increased the confidence an equivalent increase when</Paragraph>
    <Paragraph position="14"> intersecting larger numbers of classes can be expected.</Paragraph>
    <Paragraph position="15"> As a matter of fact, the extension of the intersection method would be nothing more than performing a multivariant statistical analysis, where each of the classes would be a factor. The interesting result of this multivariant analysis would be a formula which could be used to calculate the value of the confidence of an attachment, depending on the number of classes in which it occurs.</Paragraph>
  </Section>
  <Section position="5" start_page="67" end_page="68" type="metho">
    <SectionTitle>
4 Building Taxonomies using WordNet
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="67" end_page="67" type="sub_section">
      <SectionTitle>
4.1 Exploiting taxonomies from MRDs
</SectionTitle>
      <Paragraph position="0"> A straightforward way of obtaining a LgWN can be performed acquiring taxonomic relations from conventional dictionaries following a purely bottom up strategy. That is, 1) parsing each * definition for obtaining the genus, 2) performing a genus disambiguation procedure, and 3) building a natural classification of the concepts as a concept taxonomy with several tops. Following this purely descriptive methodology, the semantic primitives of the LgWN could be obtained by collecting those dictionary senses appearing at the top of the complete taxonomies derived from the dictionary. By characterizing each of these tops, the complete LgWN could be produced. For DGILE, the complete noun taxonomy was derived using the automatic method described by \[Rigau et al. 97\]8.</Paragraph>
      <Paragraph position="1"> However, several problems arise due to a) the source (i.e., circularity, errors, inconsistencies, omitted genus, etc.) and b) the limitation of the genus sense disambiguation techniques applied (i.e., \[Bruce et al. 92\] report 80% accuracy using automatic techniques, while \[Rigau et al. 97\] report 83%). Furthermore, the top dictionary senses do not usually represent the semantic subsets that the LgWN needs to characterize in order to represent useful knowledge for NLP systems. In other words, there is a mismatch between the knowledge directly derived from an MRD and the knowledge needed by a LgWN.</Paragraph>
      <Paragraph position="2"> To illustrate the problem we are facing, let us suppose we plan to place the FOOD concepts in the LgWN. Neither collecting the taxonomies derived from a top dictionary sense (or selecting a 8This taxonomy contains 111,624 dictionary senses and has only 832 dictionary senses which are tops of the taxonomy (these top dictionary senses have no hypernyms), and 89,458 leaves (which have no hyponyms). That is, 21,334 definitions are placed between the top nodes and the leaves. subset of the top dictionary senses of DGILE) closest tO FOOD concepts (e.g., substancia -substance-), nor collecting those subtaxonomies starting from closely related senses (e.g., bebida -drinkable liquids- and alimento -food-) we are able to collect exactly the FOOD concepts present in the MRD. The first are too general (they would cover non-FOOD concepts) and the second are too specific (they would not cover all FOOD dictionary senses because FOODs are described in many ways).</Paragraph>
      <Paragraph position="3"> All these problems can be solved using a mixed methodology. That is, by attaching selected top concepts (and its derived taxonomies) to prescribed semantic primitives represented in the LgWN. Thus, first, we prescribe a minimal ontology (represented by the semantic primitives of the LgWN) able to represent the whole lexicon derived from the MRD, and second, following a descriptive approach, we collect, for every semantic primitive placed in the LgWN, its subtaxonomies. Finally, those subtaxonomies selected for a semantic primitive are attached to the corresponding LgWN semantic category.</Paragraph>
      <Paragraph position="4"> We used as semantic primitives the 24 lexicographer's files (or semantic files) into which the 60,557 noun synsets (87,641 nouns) of WN are classified 9. Thus, we considered the 24 semantic tags of WN as the main LgWN semantic primitives to which all dictionary senses must be attached. In order to overcome the language gap we also used a bilingual Spanish/English dictionary.</Paragraph>
    </Section>
    <Section position="2" start_page="67" end_page="68" type="sub_section">
      <SectionTitle>
4.2 Attaching DGILE dictionary senses to semantic primitives
</SectionTitle>
      <Paragraph position="0"> primitives In order to classify all nominal DGILE senses with respect to WordNet semantic files, we used a similar approach to that suggested by \[Yarowsky 92\]. This task is divided into three fully automatic consecutive subtasks. First, we tag a subset (due to the difference in size between the monolingual and the bilingual dictionaries) of DGILE dictionary senses by means of a process that uses the conceptual distance formula (see 4.2.1); second, we collect salient words for each semantic file; and third, we enrich each DGILE</Paragraph>
      <Paragraph position="2"> dictionary sense with a semantic tag collecting evidence from the salient words previously computed.</Paragraph>
      <Paragraph position="3">  headwords.</Paragraph>
      <Paragraph position="4"> For each DGILE definition, the conceptual distance between headword and genus has been computed using WN1.5 as a semantic net. We obtained results only for those definitions having English translations (using a bilingual dictionary) for both headword and genus. By computing the conceptual distance between two words (wl,w2) we are also selecting those concepts (Cli,C2j) which represent them and seem to be closer with respect to the semantic net used. Conceptual distance is computed using formula (1).</Paragraph>
      <Paragraph position="6"> That is, the conceptual distance between two concepts depends on the length of the shortest path 10 that connects them and the specificity of the concepts in the path.</Paragraph>
      <Paragraph position="7"> In this way, we obtained a preliminary version of 29,20511 dictionary definitions semantically labelled (that is, with WN lexicographer's files) with an accuracy of 64% (61% at a sense level).</Paragraph>
      <Paragraph position="8"> That is, a corpus (collection of dictionary senses) classified in 24 partitions (each one corresponding to a semantic category).</Paragraph>
      <Paragraph position="9">  semantic primitive.</Paragraph>
      <Paragraph position="10"> Thus, we can collect the salient words (that is, those representative words for a particular category) using a Mutual Information-like formula (2), where w means word and SC semantic class.</Paragraph>
      <Paragraph position="12"> Intuitively, a salient word 12 appears significantly more often in the context of a I OWe only consider hypo / hypermy m relations.</Paragraph>
      <Paragraph position="13"> llDue to the different sizes of the dictionaries used we only compute the conceptual distance for 31% of the noun dictionary senses.</Paragraph>
      <Paragraph position="14"> 12Instead of word lemmas, this study has been carried out using word forms because word forms rather than lemmas semantic category than at other points in the whole corpus, and hence is a better than average indicator for that semantic category. The words selected are those most relevant to the semantic category, where relevance is defined as the product of salience and local frequency. That is to say, important words should be distinctive and frequent.</Paragraph>
      <Paragraph position="15"> We performed the training process considering only the content word forms from dictionary definitions 13 and we discarded those salient words with a negative score. Thus, we derived a lexicon of 23,418 salient words (one word can be a salient word for many semantic categories).</Paragraph>
      <Paragraph position="16"> 4.2.3 Enriching DGILE definitions with WordNet semantic primitives.</Paragraph>
      <Paragraph position="17"> Using the salient words per category (or semantic class) gathered in the previous step we labelled the DGILE dictionary definitions again.</Paragraph>
      <Paragraph position="18"> When any of the salient words appears in a definition, there is evidence that the word belongs to the category indicated. If several of these words appear, the evidence grows. We add together their weights, over all words in the definition, and determine the category for which the sum is greatest, using formula (3).</Paragraph>
      <Paragraph position="20"> Thus, we obtained a second semantically labelled version of DGILE. This version has 86,759 labelled definitions (covering more than 93% of all noun definitions) with an accuracy rate of 80% (we have gained, since the previous labelled version, 62% coverage and 16% accuracy).</Paragraph>
      <Paragraph position="21"> Although we used the 24 lexicographer's files of WordNet as semantic primitives, a more fine-grained classification could be made. For example, all FOOD synsets are classified under &lt;food, nutrient&gt; synset in file 13. However, FOOD concepts are themselves classified into 11 subclasses (i.e., &lt;yolk&gt;, &lt;gastronomy&gt;, &lt;comestible, edible, eatable, ...&gt;, etc.). Thus, if the LgWN we are planning to build needs to represent &lt;beverage, drink, potable&gt; separately from the concepts &lt;comestible, edible, eatable, ...&gt; a finer set of semantic primitives should be chosen, for instance, considering each direct hyponym of a synset belonging to a semantic file also as a new semantic primitive or even selecting usedare representatiVein dictionaries.degf .typical usages of the sublanguage 6S3After discarding functional words. for each semantic file the level of abstraction we need.</Paragraph>
    </Section>
    <Section position="3" start_page="68" end_page="68" type="sub_section">
      <SectionTitle>
4.3 Selecting the main top beginners for a semantic primitive
</SectionTitle>
      <Paragraph position="0"> primitive This section is devoted to the location of the main top dictionary senses for a given semantic primitive in order to correctly attach all its subtaxonomies to the correct semantic primitive in the LgWN.</Paragraph>
      <Paragraph position="1"> In order to illustrate this process we will locate the main top beginners for the FOOD dictionary senses. However, we must consider that many of these top beginners are structured. That is, some of them belong to taxonomies derived from other ones, and then cannot be directly placed within the FOOD type. This is the case of vino (wine), which is a zumo (juice). Both are top beginners for FOOD and one is a hyponym of the other.</Paragraph>
      <Paragraph position="2"> First, we collect all genus terms from the whole set of DGILE dictionary senses labelled in the previous section with the FOOD tag (2,614 senses), producing a lexicon of 958 different genus terms (only 309, 32%, appear more than once in the FOOD subset of dictionary senses).</Paragraph>
      <Paragraph position="3"> As the automatic dictionary sense labelling is not free of errors (around 80% accuracy) 14 we can discard some senses by using filtering criteria. * Filter I (F1) removes all FOOD genus terms not assigned to the FOOD semantic file during the mapping process between the bilingual dictionary and WN.</Paragraph>
      <Paragraph position="4"> * Filter 2 (F2) selects only those genus terms which appear more times as genus terms in the FOOD category. That is, those genus terms which appear more frequently in dictionary definitions belonging to other semantic tags are discarded.</Paragraph>
      <Paragraph position="5"> * Filter 3 (F3) discards those genus terms which appear with a low frequency as genus terms in the FOOD semantic category. That is, infrequent genus terms (given a certain threshold) are removed. Thus, F3&gt;1 means that the filtering criteria have discarded those genus terms appearing in the FOOD subset of dictionary definitions less than twice.</Paragraph>
      <Paragraph position="6"> At the same level of genus frequency, filter 2 (removing genus terms which are more frequent in other semantic categories) is more accurate than filter 1 (removing all genus terms the translation 14Most of them are not really errors. For instance, all fishes must be ANIMALs, but some of them are edible (that is, FOODs). Nevertheless. all fishes labelled as FOOD have been considered mistakes. of which cannot be FOOD). For instance, no error appears when selecting those genus terms which appear 10 or more limes (F3) and are more frequent in that category than in any other (F2), discarding only 3% of correct genus terms (see \[Rigau et aL 98\] for complete figures).</Paragraph>
    </Section>
    <Section position="4" start_page="68" end_page="68" type="sub_section">
      <SectionTitle>
4.4 Automatically building large-scale taxonomies from DGILE
</SectionTitle>
      <Paragraph position="0"> taxonomies from DGILE The automatic Genus Sense Disambiguation task in DGILE has been performed following \[Rigau et al. 97\]. This method reports 83% accuracy when selecting the correct hypemym by combining eight different heuristics using several methods and types of knowledge (two of the heuristics use WN).</Paragraph>
      <Paragraph position="1"> Once the main top beginners (relevant genus terms) of a semantic category are selected and every dictionary definition has been disambiguated, we collect all those pairs labelled with the semantic category we are working on having one of the genus terms selected. Using these pairs we finally build up the complete taxonomy for a given semantic primitive. That is, in order to build the complete taxonomy for a semantic primitive we fit the lower senses using the second labelled lexicon and the genus selected from this labelled lexicon.</Paragraph>
      <Paragraph position="2"> Although, both final taxonomic structures produce more fiat taxonomies than if the task is done manually, a few arrangements could be done at the top level of the automatic taxonomies.</Paragraph>
      <Paragraph position="3"> Studying the main top beginners we can easily discover an internal structure between them (for FOOD, 18 or 48 depending on the criteria selected).</Paragraph>
      <Paragraph position="4"> Performing the process for the whole dictionary we obtained for F2+(F3&gt;9) a taxonomic structure of 35,099 definitions and for F2+(F3&gt;4) the size grows to 40,754. Testing the results on FOOD taxonomies we achived 99% accuracy with the first criterion and 96% with the second.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML