<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0204">
  <Title>An interlingua aiming at communication on the Web: How language-independent can it be? Ronaldo Teixeira Martins ronaldo@nilc.icmsc.sc.usp.br</Title>
  <Section position="3" start_page="24" end_page="25" type="metho">
    <SectionTitle>
2. The UNL Project
</SectionTitle>
    <Paragraph position="0"> The UNL Project was launched by the United Nations University to foster and ease international web communication by means of NLP systems. Its main strength lies in the development of the UNL as a single semantic (or meaning) representation that can be interchanged with the various languages to be integrated into the KBMT system. In the UNL Project, plug-in software to encode NL texts into UNL ones (NL-UNL encoders) and to decode UNL into NL texts (UNL-NL decoders) has been developed by R&amp;D groups for their own native languages. The modules to process Brazilian Portuguese, for example, have been developed by a team of native Portuguese speakers comprising linguists, computational linguists, and computer experts. Such packages will be made available on WWW servers and will be accessible by browsing the Internet, thus overcoming the need for people all around the world to learn the language of their interlocutors. Several linguistic groups have signed up to the Project, namely: the Indo-European (Portuguese, Spanish, French, Italian, English, German, Russian, Latvian and Hindi), the Semitic (Arabic), the Sino-Tibetan (Chinese), the Ural-Altaic (Mongolian), the Malayan-Polynesian (Indonesian), and the Japanese.</Paragraph>
    <Paragraph position="1"> On the one hand, the main strength of the Project is that knowledgeable specialists address the language-dependent issues of their own mother tongue, most of which concern R&amp;D of the encoding and decoding modules and the specification of the NL-UNL lexicon. On the other hand, this also represents a crucial problem for the project participants, since distinct groups may interpret the interlingua specification differently. There is thus a need for consensus about the UNL formalism, which calls for an assessment of its coverage, completeness, and consistency, all features that will be discussed shortly.</Paragraph>
  </Section>
  <Section position="4" start_page="25" end_page="28" type="metho">
    <SectionTitle>
3. The Universal Networking Language
</SectionTitle>
    <Paragraph position="0"> The UNL is a formal language designed to enable automatic multilingual information exchange. It is intended to be a cross-linguistic semantic representation of NL sentence meaning, and it is the core of the UNL System, the KBMT system developed by H. Uchida (1996) at the Institute of Advanced Studies, United Nations University, Tokyo, Japan.</Paragraph>
    <Paragraph position="1"> UNL subsumes a three-dimensional theory of (sentence) meaning, whose components are defined according to one of the following sets (Martins et al., 1998a): concepts (e.g., &amp;quot;cat&amp;quot;, &amp;quot;sit&amp;quot;, &amp;quot;on&amp;quot;, or &amp;quot;mat&amp;quot;), concept relations (e.g., &amp;quot;agent&amp;quot;, &amp;quot;place&amp;quot;, or &amp;quot;object&amp;quot;), and concept predicates (e.g., &amp;quot;past&amp;quot; or &amp;quot;definite&amp;quot;). These components are formally represented by three corresponding kinds of entities, namely: Universal Words (UWs), Relation Labels (RLs), and Attribute Labels (ALs). According to the UNL syntax, the information conveyed by each sentence can be represented by a hypergraph whose nodes represent UWs and whose arcs represent RLs. To make symbol processing simpler, hypergraphs are often reduced to lists of ordered binary relations between concepts, as shown in Figure 1 for the English sentence &amp;quot;The cat sat on the mat.&amp;quot; UWs are labels for concept-like information, roughly corresponding to the lexical level in the sentence structure. They comprise a large open inventory, virtually capable of denoting every non-compositional meaning to be conveyed by any speaker of any language. For the sake of representation, these atomic semantic contents are associated with English words and expressions, which play the role of semantic labels. However, there is no one-to-one mapping between the English vocabulary and the UNL lexicon, for UNL, as a multilingual representation code, is larger than the English vocabulary. To avoid unnecessary proliferation of the UNL vocabulary and to ensure that standards are observed by the UNL teams, control over the specification of the UW set is centralized at the UNL Center, in Japan.</Paragraph>
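    <Paragraph> As a minimal sketch of the list-of-binary-relations form described above, the following Python fragment parses relations written as label(source,target) into triples. The Relation type, the parse_relations function and the regular expression are illustrative assumptions, not part of the UNL specification. The sample encoding of 'The cat sat on the mat.' is likewise an assumption, reconstructed from the way the paper describes the arcs of Figure 1 (an agt arc from 'sit' to 'cat', a plc arc from 'sit' to 'on', and an obj arc from 'on' to 'mat').

```python
import re
from typing import NamedTuple


class Relation(NamedTuple):
    label: str   # Relation Label (RL), e.g. 'agt'
    source: str  # first UW, with any attribute labels still attached
    target: str  # second UW, with any attribute labels still attached


# One binary relation written as label(source,target); inner whitespace is tolerated.
RELATION_RE = re.compile(r'(\w+)\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)')


def parse_relations(text: str) -> list[Relation]:
    """Return every label(source,target) relation found in text, in order."""
    return [Relation(*match.groups()) for match in RELATION_RE.finditer(text)]


# Assumed linear encoding, reconstructed from the prose description of Figure 1.
SENTENCE_1 = 'agt(sit.@entry.@past,cat.@def) plc(sit.@entry.@past,on) obj(on,mat.@def)'

relations = parse_relations(SENTENCE_1)
for rel in relations:
    print(rel.label, rel.source, '->', rel.target)
```

Each triple corresponds to one arc of the hypergraph; the set of all sources and targets gives its nodes.</Paragraph>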
    <Paragraph position="2"> Several semantic relationships hold between UWs, namely synonymy, antonymy, hyponymy, hypernymy and meronymy, which compose the UNL Ontology. Steady semantic valencies (such as agent and object features) can also be represented, forming the UNL Knowledge-Base. Both the Ontology and the Knowledge-Base aim at constraining the scope of UW labels whenever ambiguity is to be avoided. The UNL representation of sentence (1), for example, can be ambiguous in Romance languages, for the translation of 'cat' should make the sex of the animal explicit: if male, it would be &amp;quot;gato&amp;quot; (Portuguese and Spanish), &amp;quot;gatto&amp;quot; (Italian), &amp;quot;chat&amp;quot; (French), whereas different names would have to be used for the female cat. Instead of having a unique UW 'cat', it is thus quite feasible to have a whole structure in which 'cat' is only the superordinate option.</Paragraph>
    <Paragraph position="3"> For the English-UNL association not to undermine the intended universality of the UW inventory, its semantic-orthographical correspondence has to be considered rather incidental, or even approximate. It is not always the case that the extensions 6 of a UW label and of its corresponding English word coincide. The extension of the English word &amp;quot;mat&amp;quot;, for example, does not exactly coincide with the extension of any Portuguese word, although we can find many overlaps between &amp;quot;mat&amp;quot; and, e.g., &amp;quot;capacho&amp;quot; (Portuguese). Portuguese speakers, however, would not say &amp;quot;capacho&amp;quot; for the ornamental dish mat, just as English speakers would not use the word &amp;quot;mat&amp;quot; for a fawner (still &amp;quot;capacho&amp;quot; in Portuguese). Since each language categorizes the world in a very idiosyncratic way, it would be misleading to impose a straightforward correspondence between lexical items of two different languages. In UNL, this problem has been overcome by proposing a rather analogic lexicon, instead of a digital one. Although discrete, UWs convey continuous entities, in the sense that semantic gaps between concepts are filled by the UNL Knowledge-Base, as shown for the UW 'mat' in Figure 2. Granularity thus plays an important role in UNL lexical organization and brings flexibility into cross-linguistic lexical matching.</Paragraph>
    <Paragraph position="4"> 6 Cf. Frege (1892). Extension is used here to denote the relationship between a word and the world, as opposed to intension, which refers to the relationship between a word and its meaning.</Paragraph>
    <Paragraph position="5"> [Figure 2a: UNL hypergraph partial representation for the meaning denoted by the English word &amp;quot;mat&amp;quot;.] While lexical representation in UNL comprises a set of universal concepts signaled by UWs, the cross-lexical level involves a set of ordered binary relations between UWs, which are the Relation Labels (RLs). The specification of RLs is similar to Fillmore's semantic cases (1968), with RLs corresponding to semantic-value relations linking concept-like information. There are currently 44 RLs, but this set has been continuously modified as empirical evidence reveals missing or redundant relations. The inventory of RLs can be divided into three parts, according to the functional aspects of the related concepts: ontological, event-like and logical relations. Ontological relations are used as UW constraints to reduce lexical granularity or avoid ambiguity, as shown above, and they help position UWs in a UNL lexical structure. Five different labels are used to convey ontological relations: icl (hyponymy), equ (synonymy), ant (antonymy), pof (meronymy), and fld (semantic field).</Paragraph>
    <Paragraph position="6"> UNL depicts sentence meaning as a fact composed of either a simple or a complex event, which is considered here the starting point of a UNL representation, i.e., its minimal complete semantic unit. Event-like relations are assigned by an event's external or internal structure, or by both. An event's external structure nearly always has to do with time and space boundaries. It can be referred to by a set of RLs signaling the event's co-occurrent meanings, such as 7 its environment (scn); starting place (plf), finishing place (plt), or, simply, place (plc); range (fmt); starting time (tmf), finishing time (tmt), or, simply, time (tim); and duration (dur). Action modifiers, such as manner (man) and method (met), can also qualify this structure. An event's internal structure is associated with one of the following simple frames: action, activity, movement, state, and process, each expressing different RLs in the event itself, including its actors and circumstances.</Paragraph>
    <Paragraph position="7"> Event actors are any animate or inanimate characters playing a role in an event, either as main or as supporting actors.</Paragraph>
    <Paragraph position="8"> There can be up to eight actors, signaled by the following RLs: agent (agt), co-agent (cag), object (obj), co-object (cob), object place (opl), beneficiary (ben), partner (ptn) and instrument (ins). They can also be coordinated through the RLs conjunction (and) and disjunction (or), or subordinated to each other by possession (pos), content (cnt), naming (nam), comparison (bas), proportion (per), and modification (mod). They can further be quantified (qua) or qualified by the RLs &amp;quot;property attribution&amp;quot; (aoj) and co-attribution (cao). It is possible to refer to an &amp;quot;initial actor&amp;quot; (src), a &amp;quot;final actor&amp;quot; (gol), or an &amp;quot;intermediary actor&amp;quot; (via). Finally, spatial relationships can also hold between actors: current place (plc), origin (frm), destination (to), and path (via). Besides single events, there can also be complex cross-event relationships, which express either parallel events, namely co-occurrence (coo), conjunction (and), and disjunction (or), or hierarchically ordered events, namely purpose (pur), reason (rsn), condition (con), and sequence (seq). They can all be referred to as logical relations, since they are often isomorphic to first-order logic predicates.</Paragraph>
    <Paragraph position="9"> According to the UNL authors, any sentence written in any NL can be encoded into a corresponding UNL text that expresses the sentence meaning through the use of the above RLs. This claim is still to be verified, since cases of superposition and competition between different RLs have been observed, as discussed in Section 5.</Paragraph>
    <Paragraph position="10"> In addition to UWs and RLs, UNL makes use of predicate-like information, or Attribute Labels (ALs), which are names for event and concept &amp;quot;transformations&amp;quot;, in a sense very close to that intended by Chomsky (1957, 1965). They are not explicitly represented in a UNL hypergraph, although they are used to modify its nodes. ALs can convey information about concept intensions and extensions. In the former case, ALs name information about utterers' intensions over either specific parts of a sentence (focus, topic, emphasis, theme) or the whole structure (exclamation, interrogation, invitation, recommendation, obligation, etc.). In the latter case, ALs refer to spatial (definite, indefinite, generic, plural) or temporal (past, present, future) information, or else to temporal external (begin-soon, begin-just, end-soon, end-just) or internal (perfective, progressive, imperfective, iterative) structures. To differentiate ALs from UWs, ALs are attached to UWs by the symbol &amp;quot;.@&amp;quot;. The concept expressed by the UW 'sit' in &amp;quot;sit.@entry.@past&amp;quot;, for example, is taken as the starting point (.@entry) of the corresponding hypergraph and is to be modified by temporal information (.@past).</Paragraph>
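    <Paragraph> The '.@' attachment convention just described can be illustrated with a short Python sketch that separates a node into its UW and its attribute labels. The function names split_attributes and entry_nodes are hypothetical helpers introduced only for this illustration; the sketch assumes that ALs are always attached with the exact separator '.@' and that 'entry' marks the starting point of the hypergraph, as in the example above.

```python
def split_attributes(node: str) -> tuple[str, list[str]]:
    """Split a node such as 'sit.@entry.@past' into its UW and its attribute labels."""
    universal_word, *attributes = node.split('.@')
    return universal_word, attributes


def entry_nodes(nodes: list[str]) -> list[str]:
    """Return the UWs carrying the entry attribute (the starting point of the hypergraph)."""
    return [uw for uw, attrs in map(split_attributes, nodes) if 'entry' in attrs]


print(split_attributes('sit.@entry.@past'))
print(entry_nodes(['sit.@entry.@past', 'cat.@def', 'on', 'mat.@def']))
```

For the node 'sit.@entry.@past' the first call yields the UW 'sit' with the attributes 'entry' and 'past'; a bare UW such as 'on' yields an empty attribute list.</Paragraph>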
    <Paragraph position="11"> 7 RLs names are bracketed.</Paragraph>
  </Section>
  <Section position="5" start_page="28" end_page="31" type="metho">
    <SectionTitle>
4. The UNL System
</SectionTitle>
    <Paragraph position="0"> The UNL system architecture consists of two main processes, the encoder and the decoder, and several linguistic resources, each group of which corresponds to an NL embedded in the system, as depicted in [...]. A source document (SLD) conveys written text on any subject, in any of the NLs considered. There is no constraint on the domain or structure of the SLD, but there is necessarily a loss of semantic expressiveness during NL-UNL encoding. The goal of the UNL is not, in principle, to fully preserve text meaning, but only its main components, i.e., those considered to be essential. However, there is no measurable account of what is essential in the UNL Project. By convention, this is linked to what has been called the literal meaning, which is directly derived from interpreting the sentence surface structure. Therefore, there is no room to represent content that is not directly mapped onto the syntactic-semantic structures licensed by the NL.</Paragraph>
    <Paragraph position="1"> The NL-UNL encoding tool, or UNL Encoder, is generic enough to handle all the languages included in the Project. Apart from the (supposedly) universal knowledge-base, used to fill in possible interlexical gaps when mapping is not precise, all other linguistic resources are language-dependent. The source grammar essentially guides the mapping of the sentence semantic structure onto its corresponding UNL structure, by determining RLs and ALs, always giving priority to information content.</Paragraph>
    <Paragraph position="2"> The UNL-NL decoding tool, or UNL Decoder, works in the opposite way to the Encoder. Besides the lexicon and the grammar, a co-occurrence dictionary is also used at this stage, to disambiguate lexical choice. The target grammar is responsible for the semantic-syntactic mapping, now resolving semantic organization by making syntactic and dependency choices between UWs, taking RLs and ALs into account.</Paragraph>
    <Paragraph position="3"> 5. Remarks on language-independence
The main strength of the UNL Project rests on human expertise: language-specific aspects to be included in the multilingual KBMT system are handled by native speakers of each language, in an attempt to overcome the need to represent knowledge across several languages or cultures. The Project has been successful in having researchers all around the world develop NL-driven resources and processes. For example, the BP UNL lexicon has over 65,000 entries that are categorized according to grammatical and some semantic features, and it will be extended considerably in the future to cover the Portuguese vocabulary to a greater extent. Up to the present time, only decoding systems customized to each NL have been plugged into a general decoder skeleton (provided by the UNL Center) and assessed, producing promising results. The BP decoder, for example, is able to produce outputs whose literal meaning is preserved in most cases (Martins et al., 1998b), using hand-coded UNL expressions. In fact, to decode any UNL text, NL-UNL encoding has to be done by hand, since customization of the UNL Encoder to each NL has not yet been undertaken in the project. In spite of the promising decoding results, a) output quality varies enormously with the encoding of UNL sentences, which can differ across distinct research groups; b) communicative aspects of information exchange on the web are not explored in depth, as can be seen from the list of RLs or ALs. UNL is not knowledge intensive and there are no guidelines on how to consistently recognize or extract this kind of information from the surface of the source texts.</Paragraph>
    <Paragraph position="4"> There are several reasons why interpretation and use of the UNL among the various teams are not uniform, including cultural aspects and syntactic differences among the languages involved. Using English as the lingua franca for communication and cooperation among the research groups and as the knowledge representation language has also brought limitations into the Project, since it implies an undesirable level of language-dependence. This is inevitable, however, for limitations definitely come along with the choice made. For example, attaching an NL word to a UW may be difficult, owing to the cross-references introduced by using English to convey UNL symbols. Returning to the example shown in Figure 1, this is the case of the UW &amp;quot;on&amp;quot; in (1b): the preposition 'on' fills in the position feature of the verb 'sit' and, thus, is represented in UNL correspondingly as the second term of the binary relation 'plc' and the first term of 'obj'. This, undoubtedly, is critical, for 'sit' can be combined with other prepositions leading to different meanings, which, in turn, may introduce different sets of binary relations, implying a high level of complexity in the UNL representation. As a result, languages whose syntactic structures differ deeply from the English ones may present an additional level of complexity that makes mapping to/from UNL impossible or unrealistic. In this respect, we have not faced many problems in fitting Portuguese structures to UNL ones, since Portuguese, like English, is an inflectional language that also employs prepositional constructions.</Paragraph>
    <Paragraph position="5"> However, prepositions in Portuguese may play considerably different roles compared to English. Various extensions of the English spatial prepositions &amp;quot;on&amp;quot;, &amp;quot;over&amp;quot; and &amp;quot;above&amp;quot;, for example, are subsumed in Portuguese by a single form &amp;quot;sobre&amp;quot; (which may also mean &amp;quot;about&amp;quot;). Therefore, in Portuguese, cats could be, at the same time, not only &amp;quot;on&amp;quot; but also &amp;quot;over&amp;quot; and &amp;quot;above&amp;quot; mats. Only world knowledge, associated with contextual indexes, both absent from the UNL hypergraph in question, could avoid the unsuitable encodings &amp;quot;The cat sat over the mat.&amp;quot;</Paragraph>
    <Paragraph position="6"> or &amp;quot;The cat sat above the mat.&amp;quot; from the Portuguese sentence &amp;quot;O gato sentou sobre o tapete&amp;quot;.</Paragraph>
    <Paragraph position="7"> Another problem related to the sentence &amp;quot;The cat sat on the mat.&amp;quot; concerns the existence of competing analyses: it is quite plausible that a UNL representation suggesting a noun phrase instead of a full sentence holds for this sentence. This happens when the arc between the 'sitting' and 'cat' concepts is labeled by the RL 'obj', instead of the RL 'agt' in (1), as shown in Figure 1a', yielding the UNL text shown in Figure 1b'.</Paragraph>
    <Paragraph position="8"> [Figure 1a': UNL hypergraph representation of the English sentence &amp;quot;The cat sat on the mat.&amp;quot;]
obj(sit.@entry.@past,cat.@def)
plc(sit.@entry.@past,on)
obj(on,mat.@def)
Figure 1b': UNL linear representation of the English sentence &amp;quot;The cat sat on the mat.&amp;quot;
Both analyses are equally accurate and can lead to good NL surface expressions, although they refer to different semantic facts. Indeed, to define an object relationship between &amp;quot;sitting&amp;quot; and &amp;quot;cat&amp;quot; is to say that the cat was already seated before the beginning of the event (e.g., &amp;quot;The cat sat on the mat ate the fish.&amp;quot;). In this case, the animal does not actually perform the action, but is subjected to it, the main performer position being empty, thus yielding the noun phrase mentioned above. In Figure 1, instead, the cat has taken the sitting position on its own, therefore introducing an agent relationship. These two different semantic facts may correspond, in English, to a single surface structure. Indeed, (1) is orthographically identical to (1').</Paragraph>
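    <Paragraph> The difference between the two competing analyses can be made explicit by comparing their arcs. The Python sketch below is only an illustration: the triples for the object reading are taken from Figure 1b', while those for the agent reading are an assumption reconstructed from the statement that (1) uses 'agt' where (1') uses 'obj'. The function name differing_arcs and the (label, source, target) tuple layout are hypothetical choices made for this example.

```python
# Relations as (label, source, target) triples.
AGENT_READING = {
    ('agt', 'sit.@entry.@past', 'cat.@def'),  # assumed encoding of (1)
    ('plc', 'sit.@entry.@past', 'on'),
    ('obj', 'on', 'mat.@def'),
}
OBJECT_READING = {
    ('obj', 'sit.@entry.@past', 'cat.@def'),  # as given in Figure 1b'
    ('plc', 'sit.@entry.@past', 'on'),
    ('obj', 'on', 'mat.@def'),
}


def differing_arcs(first, second):
    """Map each (source, target) pair to the two labels assigned when the analyses disagree."""
    labels_first = {(src, tgt): label for label, src, tgt in first}
    labels_second = {(src, tgt): label for label, src, tgt in second}
    all_pairs = labels_first.keys() | labels_second.keys()
    return {
        pair: (labels_first.get(pair), labels_second.get(pair))
        for pair in all_pairs
        if labels_first.get(pair) != labels_second.get(pair)
    }


print(differing_arcs(AGENT_READING, OBJECT_READING))
```

Only the arc between 'sit' and 'cat' differs, which is why the two semantic facts can share one English surface form while requiring different constructions in other languages.</Paragraph>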
    <Paragraph position="9"> However, other languages (e.g., Portuguese) do behave differently.</Paragraph>
    <Paragraph position="10"> Although it is also possible to have, in Portuguese, the same surface structure corresponding to both UNL representations (&amp;quot;sentado no tapete&amp;quot;), it is more feasible to have, for each case, completely different constructions. In the case depicted by Figure 1, the UW &amp;quot;sit&amp;quot; would be associated with the verb &amp;quot;sentar&amp;quot; (corresponding to &amp;quot;to sit&amp;quot;). Thus, the generation result should be something like &amp;quot;O gato sentou no tapete&amp;quot; or &amp;quot;O gato sentado no tapete&amp;quot;. On the other hand, for Figure 1', the same UW 'sit' would be generated in a completely different way, corresponding to the passive form of the Portuguese expression &amp;quot;colocar sentado&amp;quot; (to be put in a seated position), for which there is no adequate English surface expression.</Paragraph>
    <Paragraph position="11"> Distinguishing such situations to cope with troublesome syntactic-semantic mappings, though interesting, is a highly context-sensitive task, often surpassing sentence boundaries. UNL descriptions do not address such a fine-grained level of meaning representation, being limited to meanings derived from context-free source sentences, even when context-freeness implies insufficient information. When a unique analysis is not possible, UNL offers a default analysis for semantically ambiguous sentences, in which case we can say that the UNL representation is probabilistic, rather than deterministic.</Paragraph>
    <Paragraph position="12"> We believe that some of the UNL limitations can be overcome or minimized by designing a fully-fledged testing procedure to assess the outputs of both decoder and encoder for the various languages. Since the same encoding and decoding procedures have been delivered to the UNL teams, it is possible that part of the set of rules or translation strategies of a given team may be interchangeable with those of a team working on a different language. In this way, sharing procedures may guarantee a common ground for assessing the varied models, in which case it may be possible to make competing strategies equally available for the languages involved.</Paragraph>
    <Paragraph position="13"> As for the means UNL offers for disambiguation, reference resolution, or other discourse phenomena, most of the troublesome occurrences fall within the treatment provided by specialists and, thus, are constrained to, and handled at, the level of native speakers' use. This measure can be somewhat fruitful, provided that each signatory of the Project finds a way to trace a UNL text back onto its own NL text or vice versa, making proper use of the UNL syntax or symbols. This, in fact, can be a good method to evaluate (de)coding: once a UNL code has been produced from any NL text, this code can be the source for decoding into the same NL, in order to compare the original NL text with the automatically generated one. Evaluation, in this case, can be carried out by the same research group responsible for both processes.</Paragraph>
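    <Paragraph> The round-trip evaluation suggested above (NL text to UNL and back into the same NL, followed by a comparison) can be sketched in Python as follows. This is a sketch under stated assumptions, not the Project's procedure: the encode and decode arguments stand for whatever encoder and decoder a team has available, the toy stand-ins below are placeholders that do no real UNL processing, and the token-level similarity ratio from the standard difflib module is merely one possible comparison measure chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable


def round_trip_score(
    nl_text: str,
    encode: Callable[[str], str],
    decode: Callable[[str], str],
) -> float:
    """Encode NL text, decode it back into the same NL, and compare token sequences."""
    unl_text = encode(nl_text)
    regenerated = decode(unl_text)
    # Similarity ratio in [0, 1] over whitespace-separated tokens (assumed metric).
    return SequenceMatcher(None, nl_text.split(), regenerated.split()).ratio()


# Toy stand-ins (hypothetical): a pass-through encoder and two decoders.
def toy_encode(text: str) -> str:
    return text


def faithful_decode(unl: str) -> str:
    return unl


def lossy_decode(unl: str) -> str:
    return unl.replace('sobre o', 'no')


original = 'O gato sentou sobre o tapete'
print(round_trip_score(original, toy_encode, faithful_decode))
print(round_trip_score(original, toy_encode, lossy_decode))
```

A faithful round trip scores 1.0, while any divergence between the original and the regenerated text lowers the score; a surface measure of this kind cannot tell an acceptable paraphrase from a meaning change, so it would complement rather than replace judgment by native speakers.</Paragraph>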
    <Paragraph position="14"> Compared to other interlingua approaches (e.g., Mikrokosmos, Gazelle, or Kant), the UNL Project is at a much earlier stage (most of those are over 10 years old, while the UNL one is about 3 years old), but it is much more ambitious than most of the current systems under construction, for UNL is actually a front-end to a many-to-many communication system, with none of the constraints that are normally inherent in MT systems. Since knowledge is specified by native speakers for each NL module, grammar, semantics and world knowledge can be well founded. Its limitations, from a conceptual viewpoint, are shared by most of its counterparts, as in treating text at the sentence level only. In addition, the UNL system is by no means committed to event replication, as is the case in human translation. Automatic strategies have no psychological motivation whatsoever and are based solely upon computer efficiency principles, namely time and space.</Paragraph>
  </Section>
</Paper>