File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/96/c96-2108_intro.xml
Size: 4,683 bytes
Last Modified: 2025-10-06 14:06:04
<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2108"> <Title>An Empirical Architecture for Verb Subcategorization Frame - a Lexicon for a Real-world Scale Japanese-English Interlingual MT</Title> <Section position="2" start_page="0" end_page="640" type="intro"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> An NLP system is expected to recognize the differences and commonalities between two sentences that use almost the same set of words in different syntactic structures, such as &quot;Mary bit the dog&quot; vs. &quot;Mary was bit by the dog.&quot; Given this requirement, the contents of verb subcategorization frames play a major role in disambiguation in many NLP systems.</Paragraph> <Paragraph position="1"> These frames should reflect the linguistic fact that lexical information is concentrated on verbs.</Paragraph> <Paragraph position="2"> There has been, on the other hand, some research that aims to treat the lexical information of nouns and that of verbs equally (see e.g. EDR94). Pustejovsky91 went further, formulating useful, intrinsic information in nouns with the notion of co-compositionality, among others, so as to recover a default verb such as 'reading' from an elliptical sentence such as 'he began the book.' 
However, in order to develop a working NLP system, even this recent research may presuppose exhaustive coding of verb subcategorization frame knowledge, so that the new lexical features can be automatically extracted and become fully functional in such systems.</Paragraph> <Paragraph position="3"> Ellipsis appears to be a much more serious problem in Japanese than in English, because all the supposedly obligatory case elements are virtually free to be dropped or to be placed anywhere in the sentence except the predicate position at the end.</Paragraph> <Paragraph position="4"> These phenomena seem to have made lexicon design difficult, so that no list of Japanese verb classes and types comparable to Grishman94, Levin93 or Hornby75 for English was readily available when our project started (see footnote 1). Thus, we decided to build one ourselves by a bootstrapping method: we made an initial classification list and let it grow by developing a working lexicon for MT systems. Based on this empirical study and the development of lexical entries for over 30,000 Japanese verbs and adjectives, we propose an architecture for verb subcategorization that represents the mapping between the surface case frame and the deep case (thematic role) frame.</Paragraph> <Paragraph position="5"> The proposal is intended as a solution to the empirical difficulties with Japanese verbs and case elements described above. 
The lexicon built on this design has comprehensive information on both the surface frame and the deep frame, and on the correspondences between them, which are embedded in a code. Yet the number of codes has been kept to a manageable figure of several hundred, so that the coding system avoids the potential combinatorial explosion.</Paragraph> <Paragraph position="6"> This is done by identifying superficially different case patterns with one another through the idea of alternative case markers and semantic roles, and by largely extending the notion and formulation of voice conversion for Japanese auxiliary verbs and their equivalents.</Paragraph> <Paragraph position="7"> The developed lexicon is used in a real-world scale interlingua-based MT system that translates between English and Japanese (Muraki87). Footnote 1: Martin75 and FM&amp;T85 contain some lists, but they are too partial for the purpose of developing an MT system.</Paragraph> <Paragraph position="8"> Our aim here is to show an empirical result of the development and analysis of the lexicon from the point of view of space complexity order (cf. Jackendoff90&amp;93). The following section describes the major linguistic requirements for the architecture, whose case elements are free in word order and can increase in number when the voice is converted. The architecture that combines the verb surface case frame and the deep case frame is described in section 3, followed by extended mechanisms for applying what we generalized from voice conversion phenomena triggered by auxiliary verbs. We then describe the lexicon structure for representing ambiguity in relation to word senses. Finally, we present some statistical figures from the results of the lexicon development and confirm that the proposed architecture and the code system can empirically constrain the potential combinatorial explosion of verb subcategorization frame representations.</Paragraph> </Section></Paper>