File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/97/w97-0801_metho.xml
Size: 20,121 bytes
Last Modified: 2025-10-06 14:14:43
<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0801"> <Title>Multilingual design of EuroWordNet Piek Vossen, University of Amsterdam</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. High-level Design of the EuroWordNet Database </SectionTitle> <Paragraph position="0"> All language-specific wordnets will be stored in a central lexical database system. Each wordnet represents a language-internal system of synsets with semantic relations such as hyponymy, meronymy, cause, and roles (e.g. agent, patient, instrument, location). Equivalence relations between the synsets in different languages and WordNet1.5 will be made explicit in the so-called Inter-Lingual-Index (ILI). Each synset in the monolingual wordnets will have at least one equivalence relation with a record in this ILI. Language-specific synsets linked to the same ILI-record should thus be equivalent across the languages. The ILI starts off as an unstructured list of WordNet1.5 synsets, and will grow as new concepts are added that are not present in WordNet1.5 (note that the actual internal organization of the synsets by means of semantic relations can still be recovered from the WordNet database, which is linked to the index like any of the other wordnets). The only organization that will be provided to the ILI is via two separate ontologies which are linked to ILI-records: * the top-concept ontology, which is a hierarchy of language-independent concepts, reflecting explicit opposition relations (e.g. Object and Substance).</Paragraph> <Paragraph position="1"> * a hierarchy of domain labels which relate concepts on the basis of scripts or topics, e.g. &quot;sports&quot;, &quot;water sports&quot;, &quot;winter sports&quot;, &quot;military&quot;, &quot;hospital&quot;. 
Both the top-concepts and the domain labels can be transferred via the equivalence relations of the ILI-records to the language-specific meanings and, next, via the language-internal relations to any other meaning in the wordnets, as is illustrated in Figure 1 for the top-concepts Object and Substance. The ILI-record object is linked to the Top-Concept Object. Since the Dutch synset voorwerp has an equivalence relation to the ILI-record, the Top-Concept Object also applies to the Dutch synset. Furthermore, it can be applied to all Dutch synsets related via the language-internal relations to the Dutch voorwerp.</Paragraph> <Paragraph position="2"> Both hierarchies will enable a user to customize the database with semantic features without having to access the language-internal relations of each wordnet. Furthermore, the domain-labels can directly be used in information retrieval (also in language-learning tools and dictionary publishing) to group concepts in a different way, based on scripts rather than classification. Domains can also be used to separate the generic from the domain-specific vocabularies. This is important to control the ambiguity problem in Natural Language Processing.</Paragraph> <Paragraph position="3"> Finally, we save space by storing the language-independent information only once.</Paragraph> <Paragraph position="4"> The overall modular structure of the EuroWordNet database can then be summed up as follows: first, there are the language modules containing the conceptual lexicons of each language involved. Secondly, there is the Language Independent Module which comprises the ILI, the Domain Ontology and the Top-Concept Ontology. [Table 1: Language-internal relationships: Language Module A, Language Module A. Interlingual relationships: Language Module A, ILI Module.] Three different types of relationships are necessary in this architecture, summarized in Table 1. 
The relationships operate upon five different types of data entities: Word-Meanings, Instances, ILI-records, Domains and Top-Concepts. The Word-Meanings are senses with denotational meanings (man) while the Instances are senses with referential meanings (John Smith).</Paragraph> <Paragraph position="5"> Figure 2 gives a simplified overview of how the different modules are interconnected. In the middle the ILI is given in the form of a list of ILI-records: &quot;animal&quot;, &quot;mammal&quot;, ... &quot;mane&quot;, &quot;Bob&quot;, with relations to the language-modules, the domains, and the top-concepts. Two examples of inter-linked domains (D) and top-concepts (TC) are given above the ILI-records. The boxes with language-names (Spanish, English, Dutch, Italian and WN1.5) represent the Language Modules and are centered around the ILI. Because of space limitations, we only show a more detailed box for the Spanish module. In this box we see examples of hyponymy and meronymy relations between Spanish word-meanings and some of the equivalence-relations with the ILI-records. The full list of relations distinguished, their characteristics and assignment tests, as well as the structures of the different records can be found in the EuroWordNet deliverables D005, D006, D007 (available at: http://www.let.uva.nl/~ewn).</Paragraph> <Paragraph position="6"> [Figure 2: The language-dependent objects are connected with strings that are words. The language-independent objects are connected with strings that are labels.] In addition to the language-internal relations there are also six different types of inter-lingual relations. The most straightforward relation is EQ_SYNONYM, which applies to meanings which are directly equivalent to some ILI-record. 
In addition there are relations for complex equivalence, among which the most important are: * EQ_NEAR_SYNONYM when a meaning matches multiple ILI-records simultaneously, * HAS_EQ_HYPERONYM when a meaning is more specific than any available ILI-record: e.g.</Paragraph> <Paragraph position="7"> Dutch hoofd only refers to human head and kop only refers to animal head, while English has head for both.</Paragraph> <Paragraph position="8"> * HAS_EQ_HYPONYM when a meaning can only be linked to more specific ILI-records: e.g. Spanish dedo which can be used to refer to both finger and toe.</Paragraph> <Paragraph position="9"> The complex-equivalence relations are needed to help the relation assignment during the development process when there is a lexical gap in one language or when meanings do not exactly fit.</Paragraph> <Paragraph position="10"> As mentioned above, the ILI should be the super-set of all concepts occurring in the separate wordnets. The main reasons for this are: * it should be possible to link equivalent non-English meanings (e.g. 
Italian-Spanish) to the same ILI-record even when there is no English or WordNet equivalent.</Paragraph> <Paragraph position="11"> * it should be possible to store domain-labels for non-English meanings, e.g. all Spanish bullfighting terms should be linked to ILI-records with the domain-label bull-fighting.</Paragraph> <Paragraph position="12"> Initially, the ILI will contain only the WordNet1.5 synsets, but eventually it will be updated with language-specific concepts using a specific update policy: * a site that cannot find a proper equivalent among the available ILI-concepts will link the meaning to another ILI-record using a so-called complex-equivalence relation and will generate a potential new ILI-record (see table 2).</Paragraph> <Paragraph position="13"> * after a building-phase all potentially new ILI-records are collected and verified for overlap by one site.</Paragraph> <Paragraph position="14"> * a proposal for updating the ILI is distributed to all sites and has to be verified.</Paragraph> <Paragraph position="15"> * the ILI is updated and all sites have to reconsider the equivalence relations for all meanings that can potentially be linked to the new ILI-records.</Paragraph> <Paragraph position="16"> 3. Mismatches and language-specific semantic configurations Within the EuroWordNet database, the wordnets can be compared with respect to the language-internal relations (their lexical semantic configuration) and in terms of their equivalence relations. The following general situations can then occur (Vossen 1996).</Paragraph> <Paragraph position="17"> 1. a set of word-meanings across languages have a simple-equivalence relation and they have parallel language-internal semantic relations.</Paragraph> <Paragraph position="18"> 2. a set of word-meanings across languages have a simple-equivalence relation but they have diverging language-internal semantic relations.</Paragraph> <Paragraph position="19"> 3. 
a set of word-meanings across languages have complex-equivalence relations but they have parallel language-internal semantic relations.</Paragraph> <Paragraph position="20"> 4. a set of word-meanings across languages have complex-equivalence relations and they have diverging language-internal semantic relations.</Paragraph> <Paragraph position="22"/> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> HAS_EQ_HYPONYM </SectionTitle> <Paragraph position="0"> matches. Here we see that head-1 represents an intermediate level between human-head-1 and external body part-1 in WordNet1.5 which is missing between their Dutch equivalents lichaamsdeel-1 and hoofd-1.</Paragraph> <Paragraph position="1"> While the equivalence relations match, the hyponymy structure does not (situation 2 above). Furthermore, kop-1 does not match any synset in WordNet1.5. In the Spanish-English example we see on the other hand that apéndice-4 and dedo-1 have complex equivalence relations which are not incompatible with the structure of the language-internal relations in the Spanish wordnet and in WordNet1.5 (situation 4 above).</Paragraph> <Paragraph position="2"> In general we can state that situation (1) is the ideal case. In the case of (4), it may still be that the wordnets exhibit language-specific differences which have led to similar differences in the equivalence relations.</Paragraph> <Paragraph position="3"> Situation (2) may indicate a mistake or it may indicate that equivalent meanings have been encoded in an alternative way in terms of the language-internal relations. Situation (3) may also indicate a mistake or it may be the case that the meanings are non-equivalent and therefore show different language-internal configurations. The EuroWordNet database is developed in tandem with the Novell ConceptNet toolkit (Díez-Orzas et al. 1995). This toolkit makes it possible to directly edit and add relations in the wordnets. 
It is also possible to formulate complex queries in which any piece of information can be combined. Furthermore, the ConceptNet toolkit makes it possible to visualize the semantic relations as a tree-structure which can directly be edited. These trees can be expanded and shrunk by clicking on word-meanings and by specifying so-called filters indicating the kind and depth of relations that need to be shown.</Paragraph> <Paragraph position="4"> However, to get to grips with the multi-linguality of the database we have developed a specific interface to deal with the different matching problems. The multi-lingual interface has the following objectives: * it should offer new or better equivalence relations for a set of word-meanings * it should offer better or alternative language-internal configurations for a set of word-meanings</Paragraph> <Paragraph position="6"> For visualising these aspects we designed an interface in which two wordnets can be aligned (see Cuypers and Adriaens 1997 for further details). In the screen-dump of the interface (figure 4) we see a fragment of the Dutch wordnet in the left box and a fragment of the Spanish wordnet in the right box. The dark squares represent the meanings (WMs) in the languages, which are interconnected by lines labeled with the relation type that holds: has_hyperonym, has_mero_madeof. Each meaning is followed by the synset (as a list of variants with a sense-number) and on the next lines by the ILI-records to which it is linked (if any). These ILI-records are represented by their gloss (here all taken from WordNet1.5) and the kind of equivalence relation is indicated by a preceding icon, = for EQ_SYNONYM and ~ for EQ_NEAR_SYNONYM. By displaying the wordnets adjacently and by specifying the ILI-records separately for each synset in each tree, the matching of the ILI-records can be indicated by drawing lines between the same ILI-records. When comparing wordnets one specific language can be taken as a starting point. 
This language will be the Source Language (SL). The SL is compared with one or more other languages which will be called the Reference Languages (RLs).</Paragraph> <Paragraph position="7"> There are then two general ways in which the aligned wordnets can be accessed: * given a (set of) WM(s) in a source wordnet with their corresponding ILI-records (ILIRs), generate the same ILIRs in the adjacent wordnet box with the corresponding WMs in the reference wordnet.</Paragraph> <Paragraph position="8"> * given two comparable wordnet structures, visualise the matching of the ILIRs: i.e. draw the lines between the ILI-records that are the same.</Paragraph> <Paragraph position="9"> In the first option, a WM is first 'translated' into the second wordnet box, yielding a parallel twin-structure of ILI-records. Next the language-specific configuration of the Reference-wordnet can be generated (bottom-up). This gives the semantic structuring of a particular set of WMs according to another wordnet as compared to the Source-wordnet.</Paragraph> <Paragraph position="10"> In the second option the structures of both the Reference and the Source wordnet are compatible and the inter-lingual relations are compared relative to this structure. Each set of ILI-records represents the most direct matching of a fragment of a wordnet from the available fund of ILI-records, regardless of the matching of the other wordnet. The equivalence relations of these compatible fragments can then directly be compared. Loose ends on either side of the ILI-records can be used to detect possible ILI-records that have not been considered as translations in one wordnet but have been used in another wordnet. Differences in the kind of equivalence relations of WMs with compatible structure are suspect. 
Obviously, a comparison in this way only makes sense if the semantic scope of the language-internal relations is more or less the same.</Paragraph> <Paragraph position="11"> Both these options are illustrated in the above screen-dump. For example, the Dutch vleeswaren:1 (meat-products) has an EQ_SYNONYM relation with meat:2 (= the flesh of animals ...), where the sense numbers do not necessarily correspond with WordNet1.5 numbers, and a HAS_HYPERONYM relation to the synset voedsel:1. The latter is in its turn linked to the ILI-synset food:1 (= any substance that can be metabolized...). We then copied the ILI-record meat:2 into the Spanish wordnet, yielding carne:1 as the synset linked to it. By expanding the hyperonymy relations for carne:1 we see that the Spanish wordnet gives three hyperonyms: tejido:3 (tissue:1 = a part of an organism ...), comida:1 (fare:1 = the food and drink that are regularly consumed), and sustento:1 (nourishment:1 = a source of nourishment), all linked to ILI-records different from the Dutch case. When generating back the matching Dutch synsets for these hyperonyms it becomes clear that they are all present in this fragment, except for comida:1 (fare:1) which does not yield a corresponding Dutch synset. First of all this comparison gives us new hyperonyms that can be considered and, secondly, it gives us a new potential ILI-record fare:1 for the Dutch wordnet. Further expanding the Dutch wordnet also shows that there is a closely-related concept vlees:1 (the stuff that meat-products consist of) which matches both meat:2 and flesh:1 (= the soft tissue of the body...). This concept thus partially matches the Spanish carne:1. Since there is no matching Spanish concept related to flesh:1, the Dutch wordnet thus in its turn suggests a new potential ILI-record for the Spanish wordnet. 
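The detection of loose ends in the meat example can be illustrated with a minimal sketch. It assumes that each wordnet fragment is reduced to a mapping from synsets to the ILI-records they are linked to, and it ignores the kind of equivalence relation. The function names and the two Dutch placeholder synsets are assumptions made for illustration only; they are not part of the EuroWordNet database or of the ConceptNet toolkit.

```python
# Toy sketch: find "loose ends" between two aligned wordnet fragments.
# A fragment maps each synset to the set of ILI-records it is linked to.

def ili_records(fragment):
    """Collect every ILI-record used by any synset in the fragment."""
    records = set()
    for linked in fragment.values():
        records |= linked
    return records


def loose_ends(source, reference):
    """ILI-records used in the reference fragment but not in the source."""
    return ili_records(reference) - ili_records(source)


# Only part of the meat example is encoded here. The two "nl_..." names
# are hypothetical placeholders: the text says matching Dutch synsets
# exist for these ILI-records but does not name them.
dutch = {
    "vleeswaren:1": {"meat:2"},
    "vlees:1": {"meat:2", "flesh:1"},
    "nl_tissue_synset": {"tissue:1"},
    "nl_nourishment_synset": {"nourishment:1"},
}
spanish = {
    "carne:1": {"meat:2"},
    "tejido:3": {"tissue:1"},
    "comida:1": {"fare:1"},
    "sustento:1": {"nourishment:1"},
}

new_for_dutch = loose_ends(dutch, spanish)    # candidates for the Dutch wordnet
new_for_spanish = loose_ends(spanish, dutch)  # candidates for the Spanish wordnet
print(sorted(new_for_dutch))
print(sorted(new_for_spanish))
```

On this toy fragment the sketch reports fare:1 as a candidate for the Dutch wordnet and flesh:1 as a candidate for the Spanish wordnet, which corresponds to the two suggestions discussed in the example. 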
In this way the aligned wordnets can be used to help each other and derive a more compatible and consistent structure.</Paragraph> <Paragraph position="12"> Given the fact that we allow for a large number of language-internal relations and six types of equivalence relations, it may be clear that the number of different combinations of mismatches grows exponentially. Therefore we are differentiating the degree of compatibility of the different mismatches: some mismatches are more serious than others. First of all, some relations in EuroWordNet have deliberately been defined to give somewhat more flexibility in assigning relations. In addition to the strict synonymy-relation which holds between synset-variants there is also the possibility to encode a NEAR_SYNONYM relation between synsets which are close in meaning but cannot be substituted as easily as synset-members: e.g. machine, apparatus, tool. Despite the tests for each relation there are always border-cases where intuitions will vary. Therefore it makes sense to allow for mismatches across wordnets where the same type of equivalence relation holds between a single synset in one language and several synsets with a NEAR_SYNONYM relation in another language.</Paragraph> <Paragraph position="13"> As we have seen above, a single WM may be linked to multiple ILI-records and a single ILI-record may be linked to multiple WMs. This allows for some constrained flexibility. The former case is only allowed when the more global relation EQ_NEAR_SYNONYM has been used (see above). 
In the reverse case, the same ILI-record is linked either to synsets which have a NEAR_SYNONYM relation among them (in which case they can be linked as EQ_SYNONYM or as EQ_NEAR_SYNONYM of the same ILI-record) or via any other complex equivalence relation which parallels the relation between the WMs.</Paragraph> <Paragraph position="14"> Thus, two WMs which have a hyponymy-relation among them and which are linked to the same ILI-record should have equivalence-relations that parallel the hyponymy-relation: EQ_HAS_HYPERONYM and EQ_SYNONYM. A final type of flexibility is built in by distinguishing subtypes of relations. In addition to more specific meronymy-relations such as member-group, portion-substance there is an a-specific meronymy relation which is compatible with all the specific subtypes.</Paragraph> <Paragraph position="15"> In addition to more global or flexible relations, we also try to define the compatibility of configurations explicitly. First of all, differences in levels of generality are acceptable, although deeper hierarchies are preferred. So if one wordnet links dog to animal and another wordnet links it to mammal and only via the latter to animal, these structures are not considered as serious mismatches. Furthermore, since we allow for multiple hyperonyms it is possible that different hyperonyms may still both be valid. To make the compatibility of hyperonyms more explicit, the most frequent hyperonyms can be defined as allowable or non-allowable combinations. For example, a frequent combination such as act or result can be seen as incompatible (and therefore have to be split into different synsets), whereas object or artifact is a very common combination.</Paragraph> <Paragraph position="16"> Finally, we have experienced that some relations tend to overlap for unclear cases. For example, intuitions appear to vary on causation or hyponymy as the relation between Dutch pairs such as dichttrekken (close by pulling) and dichtgaan (become closed). 
In these cases it is not clear whether we are dealing with different events in which one causes the other or one makes up the other. The events are fully co-extensive in time: there is no time point where one event takes place and the other event does not. This makes them less typical examples of cause-relations. By documenting such border-line cases we hope to achieve consensus about the ways in which they should be treated and the severity of the incompatibility.</Paragraph> </Section> </Paper>