<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2106">
  <Title>Virach Sornlertlamvanich TCL, NICT Thatsanee Charoenporn TCL, NICT</Title>
  <Section position="5" start_page="827" end_page="829" type="metho">
    <SectionTitle>
* the MILE Data Categories (MDC) which
</SectionTitle>
    <Paragraph position="0"> constitute the attributes and values that adorn the structural classes and allow concrete entries to be instantiated. MDC can belong to a shared repository or be user-defined. &amp;quot;NP&amp;quot; and &amp;quot;VP&amp;quot; are data category instances of the class SyntacticPhrase, whereas &amp;quot;subj&amp;quot; and &amp;quot;obj&amp;quot; are data category instances of the class SyntacticFunction.</Paragraph>
    <Paragraph position="1"> * lexical operations, which are special lexical entities allowing the user to define multilingual conditions and perform operations on lexical entries.</Paragraph>
    <Paragraph position="2"> MILE is based on the experience derived from existing computational lexicons (e.g. LE-PAROLE, SIMPLE, EuroWordNet, etc.).</Paragraph>
    <Paragraph position="3"> Originally, in order to meet the expectations placed upon lexicons as critical resources for content processing in the Semantic Web, the MILE syntactic and semantic lexical objects were formalized in RDF(S), thus providing a web-based means to implement the MILE architecture and allowing individual lexical entries to be encoded as instances of the model (Ide et al., 2003; Bertagna et al., 2004b). In the framework of our project, by situating our work in the context of W3C standards and relying on the standardized technologies underlying this community, the original RDF schema for ISLE lexical entries has been made compliant with OWL. The whole data model has been formalized in OWL using Protégé 3.2 beta and has been extended to cover the morphological component as well (see Figure 2). Protégé 3.2 beta has also been used as a tool to instantiate the lexical entries of our sample monolingual lexicons, thus ensuring adherence to the model, encoding coherence, and inter- and intra-lexicon consistency.</Paragraph>
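The workflow described above, encoding entries as instances of the model and checking their coherence against it, can be sketched in miniature. This is an illustration only: the class name, property names, and consistency check below are invented stand-ins, not the actual ISLE/MILE OWL schema.

```python
# Toy sketch: lexical entries as instances of model classes, with a check
# that an entry only uses properties its class licenses ("adherence to the
# model"). All names here are hypothetical, not the real MILE schema.
from dataclasses import dataclass, field


@dataclass
class LexicalClass:
    name: str
    allowed_properties: set  # properties this class licenses


@dataclass
class LexicalEntry:
    cls: LexicalClass
    properties: dict = field(default_factory=dict)

    def is_consistent(self) -> bool:
        # encoding coherence: every property must be licensed by the class
        return set(self.properties) <= self.cls.allowed_properties


syntactic_unit = LexicalClass("SyntacticUnit", {"functionType", "phrase"})
entry = LexicalEntry(syntactic_unit, {"functionType": "subj", "phrase": "NP"})
print(entry.is_consistent())  # a well-formed instance of the model
```

In the real setting, Protégé and an OWL reasoner perform this kind of check over the full schema; the point here is only the instance-of-a-model discipline.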
    <Paragraph position="4"> 3 Existing problems with the MILE framework for Asian languages
In this section, we explain some problematic phenomena of Asian languages and discuss possible extensions of the MILE framework to address them.</Paragraph>
    <Paragraph position="5"> Inflection
MILE provides a powerful framework for describing information about inflection. The InflectedForm class is devoted to describing the inflected forms of a word, while InflectionalParadigm defines general inflection rules. However, several Asian languages, such as Chinese and Thai, have no inflection. For these languages, we do not use InflectedForm and InflectionalParadigm.</Paragraph>
    <Paragraph position="6"> Classifier
Many Asian languages, such as Japanese, Chinese, Thai and Korean, do not distinguish singularity and plurality of nouns, but use classifiers to denote the number of objects. The following are examples of classifiers in Japanese, where &amp;quot;CL&amp;quot; stands for a classifier. Classifiers always follow cardinal numbers in Japanese. Note that different classifiers are used for different nouns: the classifier &amp;quot;hiki&amp;quot; is used to count the noun &amp;quot;inu (dog)&amp;quot;, while &amp;quot;satsu&amp;quot; is used for &amp;quot;hon (book)&amp;quot;. The classifier is determined by the semantic type of the noun.</Paragraph>
    <Paragraph position="7"> In the Thai language, classifiers are used in various situations (Sornlertlamvanich et al., 1994). The classifier plays an important role in construction with a noun to express, for instance, an ordinal or a pronoun. The classifier phrase is syntactically generated according to a specific pattern. Here are some usages of classifiers and their syntactic patterns. Classifiers could be treated as a part-of-speech class. However, since classifiers depend on the semantic type of nouns, we need to refer to semantic features from the morphological layer, and vice versa. Some mechanism to link features across layers needs to be introduced into the current MILE framework.</Paragraph>
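The cross-layer dependency just described can be illustrated with a toy lookup: choosing a Japanese classifier in the morphological layer requires consulting the noun's semantic type in the semantic layer. The classifier facts (hiki for inu, satsu for hon) are from the text; the semantic type labels are invented for illustration.

```python
# Sketch of the cross-layer link: classifier selection depends on the noun's
# semantic type. Type labels are hypothetical; classifier pairs are from the
# examples in the text.

# semantic layer: nouns annotated with a semantic type
SEMANTIC_TYPE = {"inu": "small-animal", "hon": "bound-volume"}  # dog, book

# morphological layer: Japanese classifier selected by semantic type
CLASSIFIER = {"small-animal": "hiki", "bound-volume": "satsu"}


def count_phrase(noun: str, number: int) -> str:
    """Build '<noun> <number>-<classifier>' via the cross-layer reference."""
    cl = CLASSIFIER[SEMANTIC_TYPE[noun]]
    return f"{noun} {number}-{cl}"


print(count_phrase("inu", 3))  # -> inu 3-hiki
print(count_phrase("hon", 2))  # -> hon 2-satsu
```

The point is that neither table alone suffices: the morphological rule is only applicable after a detour through the semantic layer, which is exactly the inter-layer link MILE currently lacks.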
    <Paragraph position="8"> Orthographic variants
Many Chinese words have orthographic variants. For instance, the concept of rising can be represented by either of two character variants of sheng1. However, the free variants become non-free in certain compound forms: only one variant is allowed for 'liter', and only the other is allowed for 'to sublime'. The interaction of lemmas and orthographic variants is not yet represented in MILE.</Paragraph>
    <Paragraph position="9"> Reduplication as a derivational process
In some Asian languages, reduplication of a word derives another word, and the derived word often has a different part-of-speech. Here are some examples of reduplication in Chinese. Man4 'to be slow' is a state verb, while the reduplicated form man4-man4 is an adverb. Another example of reduplication involves verbal aspect. Kan4 'to look' is an activity verb, while the reduplicative form kan4-kan4 refers to the tentative aspect, introducing either a stage-like sub-division of the event or tentativeness of the action of the agent. This morphological process is not provided for in the current MILE standard.</Paragraph>
    <Paragraph position="10"> There are also various usages of reduplication in Thai. Some words reduplicate to add a specific aspect to the original meaning. The reduplications can be grouped into three types according to the tonal change of the original word.</Paragraph>
    <Paragraph position="11"> In fact, only reduplication of the same sound is accepted in written text, and a special symbol, /mai-yamok/, is attached to the original word to represent the reduplication. Reduplication occurs in many parts-of-speech, such as nouns, verbs, adverbs, classifiers, adjectives and prepositions. Furthermore, various aspects can be added to the original meaning of the word by reduplication, such as pluralization, emphasis and generalization. These aspects should be instantiated as features.</Paragraph>
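Reduplication as described above can be thought of as a small derivational rule table: the operation copies the base form and may change its part-of-speech or add an aspectual feature. The sketch below encodes the two Chinese examples from the text; the rule-table format itself is an invented simplification.

```python
# Sketch of reduplication as a derivational rule: base word + base POS map to
# a derived POS and an optional aspectual feature. Entries follow the Chinese
# examples in the text; the table format is illustrative only.

REDUPLICATION = {
    # (base, base POS) -> (derived POS, added feature)
    ("man4", "state-verb"): ("adverb", None),          # 'slow' -> 'slowly'
    ("kan4", "activity-verb"): ("verb", "tentative"),  # 'look' -> 'take a look'
}


def reduplicate(word: str, pos: str) -> tuple:
    """Copy the base and apply the POS/feature change from the rule table."""
    derived_pos, feature = REDUPLICATION[(word, pos)]
    return (f"{word}-{word}", derived_pos, feature)


print(reduplicate("man4", "state-verb"))     # ('man4-man4', 'adverb', None)
print(reduplicate("kan4", "activity-verb"))  # ('kan4-kan4', 'verb', 'tentative')
```

The Thai tonal-change types and the /mai-yamok/ orthography would add further dimensions to such a table; the point is only that reduplication is rule-governed derivation, not inflection.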
    <Paragraph position="12"> Change of parts-of-speech by affixes
In Thai, affixes change the part-of-speech of words (Charoenporn et al., 1997). There are three prefixes that change the part-of-speech of the original word, namely /ka:n/, /khwa:m/ and /ya:ng/.</Paragraph>
    <Paragraph position="13"> They are used in the following cases.</Paragraph>
  </Section>
  <Section position="6" start_page="829" end_page="829" type="metho">
    <SectionTitle>
* Nominalization
</SectionTitle>
    <Paragraph position="0"> /ka:n/ is used to prefix an action verb and /khwa:m/ is used to prefix a state verb in nominalization, as in /ka:n-tham-nga:n/ (working) and /khwa:m-suk/ (happiness).</Paragraph>
  </Section>
  <Section position="7" start_page="829" end_page="830" type="metho">
    <SectionTitle>
* Adverbialization
</SectionTitle>
    <Paragraph position="0"> An adverb can be derived by using /ya:ng/ to prefix a state verb such as /ya:ng-di:/ (well).</Paragraph>
    <Paragraph position="1"> Note that these prefixes are also words, and form multi-word expressions with the original word.</Paragraph>
    <Paragraph position="2"> This phenomenon is similar to derivation, which is not handled in the current MILE framework.</Paragraph>
    <Paragraph position="3"> Derivation is traditionally considered a different phenomenon from inflection, and the current MILE focuses on inflection. The MILE framework is already being extended to treat such linguistic phenomena, since they are important for European languages as well. They could be handled in either the morphological layer or the syntactic layer.</Paragraph>
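The three Thai prefixes above behave like a small derivational rule table: each one attaches to a particular verb class and yields a new part-of-speech. A minimal sketch, with romanizations taken from the text and the rule-table format an invented simplification:

```python
# Sketch of Thai prefix derivation: each prefix requires a verb class and
# produces a derived POS. Prefixes and examples are from the text; the table
# format is illustrative only.

PREFIX_RULES = {
    # prefix -> (required input POS, output POS)
    "ka:n":   ("action-verb", "noun"),    # nominalization
    "khwa:m": ("state-verb",  "noun"),    # nominalization
    "ya:ng":  ("state-verb",  "adverb"),  # adverbialization
}


def derive(prefix: str, word: str, pos: str) -> tuple:
    """Attach a prefix if its input POS matches, returning (form, new POS)."""
    in_pos, out_pos = PREFIX_RULES[prefix]
    if pos != in_pos:
        raise ValueError(f"/{prefix}/ does not attach to {pos}")
    return (f"{prefix}-{word}", out_pos)


print(derive("ka:n", "tham-nga:n", "action-verb"))  # 'working' as a noun
print(derive("ya:ng", "di:", "state-verb"))         # 'well' as an adverb
```

Since, as noted above, these prefixes are themselves words, the output forms would be represented as multi-word expressions rather than single inflected tokens.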
    <Paragraph position="4"> Function Type
Function types of predicates (verbs, adjectives, etc.) might be handled in a partially different way for Japanese. In the syntactic layer of the MILE framework, the FunctionType class denotes the subcategorization frames of predicates, with function types such as &amp;quot;subj&amp;quot; and &amp;quot;obj&amp;quot;. For example, the verb &amp;quot;eat&amp;quot; has the two FunctionType data categories &amp;quot;subj&amp;quot; and &amp;quot;obj&amp;quot;. Function types basically stand for the positions of case-filler nouns. In Japanese, however, cases are usually marked by postpositions, and case-filler positions themselves do not provide much information about case marking. For example, both of the following sentences mean the same, &amp;quot;She eats pizza&amp;quot;, with postpositions marking the nominative and accusative cases respectively.</Paragraph>
    <Paragraph position="5"> Note that the two case-filler nouns &amp;quot;she&amp;quot; and &amp;quot;pizza&amp;quot; can be exchanged. That is, the number of slots is important, but their order is not.</Paragraph>
    <Paragraph position="6"> For Japanese, we might use the set of postpositions as values of FunctionType instead of conventional function types such as &amp;quot;subj&amp;quot; and &amp;quot;obj&amp;quot;. These could be user-defined or language-dependent data categories. Furthermore, it is preferable to prepare a mapping between Japanese postpositions and conventional function types. This is interesting because, although it appears to be mostly a terminological difference, it shows that the model can also be applied to Japanese.</Paragraph>
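The proposal above, postpositions as FunctionType values plus a mapping to conventional function types, can be sketched with the standard Japanese nominative and accusative particles ga and o (the text describes the cases but does not name the particles, so treat this inventory as an assumed fragment):

```python
# Sketch: Japanese postpositions as language-dependent FunctionType values,
# mapped to conventional function types. Particle inventory is a fragment
# assumed for illustration.

POSTPOSITION_TO_FUNCTION = {"ga": "subj", "o": "obj"}


def function_types(frame: list) -> set:
    """Map a frame of postposition slots to conventional function types.

    A set is returned because, as noted above, only the number and identity
    of slots matter in Japanese, not their order."""
    return {POSTPOSITION_TO_FUNCTION[p] for p in frame}


# the verb "eat" subcategorizes for a ga-slot and an o-slot;
# either slot order yields the same frame
print(function_types(["ga", "o"]) == function_types(["o", "ga"]))  # True
print(sorted(function_types(["ga", "o"])))  # ['obj', 'subj']
```

Modeling the frame as a set rather than a sequence is exactly what distinguishes the Japanese case from position-based function types.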
  </Section>
  <Section position="8" start_page="830" end_page="832" type="metho">
    <SectionTitle>
4 Building sample lexicons
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="830" end_page="830" type="sub_section">
      <SectionTitle>
4.1 Swadesh list and basic lexicon
</SectionTitle>
      <Paragraph position="0"> The issue involved in defining a basic lexicon for a given language is more complicated than one may think (Zhang et al., 2004). The naive approach of simply taking the most frequent words of a language is flawed in many ways. First, all frequency counts are corpus-based and hence inherit the bias of corpus sampling. For instance, since it is easier to sample written formal texts, words used predominantly in informal contexts are usually under-represented. Second, the frequency of content words is topic-dependent and may vary from corpus to corpus. Last, and most crucially, the frequency of a word does not correlate with its conceptual necessity, which should be an important, if not the only, criterion for a core lexicon. The definition of a cross-lingual basic lexicon is even more complicated. The first issue is the determination of cross-lingual lexical equivalences: how do we determine that word a (and not a') in language A really is word b in language B? The second issue is the determination of what counts as a basic word in a multilingual context. Here not even frequency offers an easy answer, since lexical frequency may vary greatly among languages. The third issue involves lexical gaps: if a word meets all criteria of being a basic word in language A, yet does not exist in language D (though it may exist in languages B and C), is it still qualified for inclusion in the multilingual basic lexicon? Clearly, not all of the above issues can be unequivocally solved within the time frame of our project. Fortunately, there is an empirical core lexicon that we can adopt as a starting point. The Swadesh list was proposed by the historical linguist Morris Swadesh (Swadesh, 1952), and has been widely used by field and historical linguists for languages all over the world. The Swadesh list was first proposed as a lexico-statistical metric.</Paragraph>
      <Paragraph position="1"> That is, these are words that can reliably be expected to occur in all historical languages, and they can be used as a metric for quantifying language variation and language distance. The Swadesh list is also widely used by field linguists when they encounter a new language, since almost all of its terms can be expected to occur in any language. Note that the Swadesh list consists of terms that embody direct human experience, with culture-specific terms avoided. Swadesh started with a 215-item list, before cutting back to 200 items and then to 100 items. A standard list of 207 items is arrived at by unifying the 200-item and 100-item lists. We take these 207 terms from the Swadesh list as the core of our basic lexicon. Inclusion of the Swadesh list also gives us the possibility of covering many Asian languages for which we do not have the resources to make a full, fully annotated lexicon. For some of these languages, a Swadesh lexicon for reference is provided by a collaborator.</Paragraph>
    </Section>
    <Section position="2" start_page="830" end_page="832" type="sub_section">
      <SectionTitle>
4.2 Aligning multilingual lexical entries
</SectionTitle>
      <Paragraph position="0"> Since our goal is to build a multilingual sample lexicon, words in several Asian languages need to be aligned. In this subsection, we propose a simple method to align words in different languages. The basic idea of our multilingual alignment is to use English as an intermediary. That is, we first prepare word pairs between English and each of the other languages, then combine them to establish correspondences among words in several languages. The multilingual alignment method we currently consider is as follows:  1. Preparing the set of frequent words of each language</Paragraph>
      <Paragraph position="2"> Let FW_J, FW_C and FW_T denote the sets of frequent words of Japanese, Chinese and Thai, respectively. For now we construct a multilingual lexicon for these three languages; however, our multilingual alignment method can easily be extended to handle more languages.</Paragraph>
      <Paragraph position="3">  2. Obtaining English translations</Paragraph>
      <Paragraph position="5"> Each word Xw_i in FW_X is translated into a set of English words EXw_ij by referring to a bilingual dictionary, where X denotes one of our languages, J, C or T. We thus obtain mappings as in (1): Xw_i -&gt; {EXw_i1, EXw_i2, ...}.</Paragraph>
      <Paragraph position="7"> Notice that this procedure is performed automatically, so ambiguities remain at this stage.</Paragraph>
      <Paragraph position="8"> 3. Generating a new mapping
From the mappings in (1), a new mapping is generated by inverting the key. That is, in the new mapping a key is an English word Ew_i, and the correspondence for each key is the set of translations XEw_ij for each of the three languages, as shown in (2): Ew_i -&gt; ({JEw_ij}, {CEw_ij}, {TEw_ij}). Notice that at this stage the correspondence between different languages is very loose, since words are aligned on the basis of sharing only a single English word.</Paragraph>
      <Paragraph position="9"> 4. Refinement of alignment
Groups of English words are constructed by referring to WordNet synset information. For example, suppose that Ew_i and Ew_j belong to the same synset S_k.</Paragraph>
      <Paragraph position="11"> We then make a new alignment by intersection, as in (3). In (3), the key is the synset S_k, which is supposed to be the conjunction of Ew_i and Ew_j, and the counterpart is, for each language, the intersection of the sets of translations of Ew_i and Ew_j.</Paragraph>
      <Paragraph position="13"> This operation reduces the number of words for each language. That means we can expect the correspondence among words of different languages to become more precise. This new word alignment based on synsets is the final result.</Paragraph>
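The four steps above can be sketched end-to-end on toy data. The mini-dictionaries and the hand-made "synset" below are invented stand-ins for real bilingual dictionaries and WordNet, chosen only to show the inversion and the intersection-based refinement.

```python
# Toy sketch of the alignment method: per-language -> English dictionaries
# are inverted through English (steps 1-3), then refined by intersecting
# translation sets within a synset (step 4). All data is invented.
from collections import defaultdict

# steps 1-2: frequent words of each language with their English translations
to_english = {
    "J": {"inu": {"dog"}, "hon": {"book", "volume"}},
    "C": {"gou3": {"dog"}, "shu1": {"book"}},
    "T": {"maa": {"dog"}, "nangsue": {"book"}},
}

# step 3: invert the key; an English word maps to translation sets per language
by_english = defaultdict(lambda: defaultdict(set))
for lang, entries in to_english.items():
    for word, english_words in entries.items():
        for e in english_words:
            by_english[e][lang].add(word)

# step 4: refine with a synset by intersecting the translation sets
# of all its member English words
synset = {"book", "volume"}  # stand-in for a WordNet synset
refined = {
    lang: set.intersection(*(by_english[e].get(lang, set()) for e in synset))
    for lang in to_english
}
print(sorted(refined["J"]))  # only words aligned to the whole synset survive
```

Note how the refinement filters: the Japanese "hon" translates both synset members and survives, while words covering only one member drop out, which is exactly the precision-improving effect described above.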
      <Paragraph position="14"> To evaluate the performance of this method, we conducted a preliminary experiment using the Swadesh list. Given the Swadesh lists of Chinese, Italian, Japanese and Thai as a gold standard, we tried to replicate these lists from the English Swadesh list and bilingual dictionaries between English and these languages. In this experiment, we did not perform the refinement step with WordNet. From the 207 words in the Swadesh list, we dropped 4 words (&amp;quot;at&amp;quot;, &amp;quot;in&amp;quot;, &amp;quot;with&amp;quot; and &amp;quot;and&amp;quot;) because they have too many translation ambiguities.</Paragraph>
      <Paragraph position="15"> As a result, we obtained 181 word groups aligned across 5 languages (Chinese, English, Italian, Japanese and Thai) for 203 words. An aligned word group was judged &amp;quot;correct&amp;quot; when, for each language, it included only words from that language's Swadesh list. It was judged &amp;quot;partially correct&amp;quot; when the words of some language also included words not in the Swadesh list. Based on the correct instances, we obtain a precision of 0.497 and a recall of 0.443. These figures rise to 0.912 precision and 0.813 recall when the partially correct instances are counted as well.</Paragraph>
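As a quick arithmetic check, the reported figures are consistent with precision computed over the 181 aligned groups and recall over the 203 input words, with 90 correct and 165 at-least-partially-correct groups. These denominators and counts are inferred, not stated in the text, so treat this purely as an illustration of how the figures fit together.

```python
# Arithmetic check of the reported evaluation figures, under the ASSUMED
# reading: precision over 181 groups, recall over 203 words, with inferred
# counts of 90 correct and 165 partially-correct groups.
groups, words = 181, 203
correct, partial = 90, 165  # inferred counts, for illustration only

print(round(correct / groups, 3), round(correct / words, 3))  # 0.497 0.443
print(round(partial / groups, 3), round(partial / words, 3))  # 0.912 0.813
```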
      <Paragraph position="16"> This is quite a promising result.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="832" end_page="832" type="metho">
    <SectionTitle>
5 Upper-layer ontology
</SectionTitle>
    <Paragraph position="0"> The empirical success of the Swadesh list poses an interesting question that has not been explored before: does the Swadesh list instantiate a shared, fundamental human conceptual structure? And if there is such a structure, can we discover it? In our project these fundamental issues are associated with our quest for cross-lingual interoperability. We must make sure that the items of the basic lexicon are given the same interpretation. One measure taken to ensure this is the construction of an upper ontology based on the basic lexicon. Our preliminary work of mapping the Swadesh list items to SUMO (Suggested Upper Merged Ontology) (Niles and Pease, 2001) has already been completed. We are in the process of mapping the list to DOLCE (Descriptive Ontology for Linguistic and Cognitive Engineering) (Masolo et al., 2003). After the initial mapping, we will restructure the mapped nodes to form a genuine conceptual ontology based on the language-universal basic lexical items. One important observation we have made so far, however, is that the success of the Swadesh list is partly due to its underspecification and to the liberty it gives to compilers of the list in a new language. If this underspecification is essential for a basic lexicon of human languages, then we must resolve the apparent dilemma of specifying its items in a formal ontology that requires fully specified categories. For the time being, genuine ambiguities have resulted in the introduction of each disambiguated sense into the ontology. We are currently investigating another solution that allows the inclusion of underspecified elements in the ontology without threatening its coherence. More specifically, we introduce an underspecification relation into the structure, linking an underspecified meaning to its different specified meanings. The specified meanings are included in the taxonomic hierarchy in the traditional manner, while a hierarchy of underspecified meanings can be derived thanks to the new relation. An underspecified node inherits only from the most specific common mother of its fully specified terms. This distinction avoids the classical misuse of the subsumption relation for representing multiple meanings. This method does not reflect a dubious collapse of the linguistic and conceptual levels, but rather treats such underspecifications as truly conceptual. Moreover, we hope this proposal will provide a knowledge representation framework for the multilingual alignment method presented in the previous section.</Paragraph>
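The attachment rule described above, an underspecified node inheriting only from the most specific common mother of its fully specified senses, is essentially a lowest-common-ancestor computation over the taxonomy. A minimal sketch, with a tiny invented taxonomy (these are not actual SUMO or DOLCE categories):

```python
# Sketch of the attachment rule: the underspecified node's mother is the
# most specific common ancestor of its fully specified senses. The taxonomy
# below is invented for illustration.

PARENT = {  # child -> mother in the taxonomic hierarchy
    "Entity": None,
    "Process": "Entity",
    "Object": "Entity",
    "Motion": "Process",
    "Rising": "Motion",
    "Growing": "Process",
}


def ancestors(node: str) -> list:
    """Chain from a node up to the root, node included."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def attachment_point(specified_senses: list) -> str:
    """Most specific common mother of all fully specified senses."""
    common = set(ancestors(specified_senses[0]))
    for sense in specified_senses[1:]:
        common &= set(ancestors(sense))
    # the most specific common ancestor is the deepest node in the hierarchy
    return max(common, key=lambda n: len(ancestors(n)))


# an underspecified 'rise/grow' meaning with two fully specified senses
print(attachment_point(["Rising", "Growing"]))  # -> Process
```

Because the underspecified node attaches at this single point rather than subsuming both senses directly, the taxonomic hierarchy keeps subsumption for genuine specialization only, which is the misuse the text aims to avoid.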
    <Paragraph position="1"> Finally, our ontology will not only play the role of a structured interlingual index. It will also serve as a common conceptual base for lexical expansion, as well as for comparative studies of the lexical differences of different languages.</Paragraph>
  </Section>
</Paper>