<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2155"> <Title>An HPSG-to-CFG Approximation of Japanese</Title> <Section position="4" start_page="1046" end_page="1046" type="metho"> <SectionTitle> 2 Japanese Grammar </SectionTitle> <Paragraph position="0"> The grammar was developed for machine translation of spoken dialogues. It is capable of dealing with spoken language phenomena and ungrammatical or corrupted input. This leads on the one hand to the necessity of robustness and on the other hand to ambiguities that must be dealt with. Being used in an MT system for spoken language, the grammar must first accept fragmentary input and be able to deliver partial analyses where no spanning analysis is available. A complete fragmentary utterance could, e.g., be: daijoubu okay This is an adjective without any noun or (copula) verb. There is still an analysis available. If an utterance is corrupted by not being fully recognized, the grammar delivers analyses for those parts that could be understood. An example would be the following transliteration of input to the MT system: sou desu ne watakushi so COP TAG I no hou wa daijoubu GEN side TOP okay desu ga kono hi COP but this day wa kayoubi desu ne</Paragraph> </Section> <Section position="5" start_page="1046" end_page="1047" type="metho"> <SectionTitle> TOP Tuesday COP TAG </SectionTitle> <Paragraph position="0"> (lit.: Well, it is okay for my side, but this day is Tuesday, isn't it?) Here, analyses for the following fragments are delivered (where the parser found opera wa in the word lattice of the speech recognizer): sou desu ne watakushi so COP TAG I no hou wa daijoubu Another necessity for partial analysis comes from real-time restrictions imposed by the MT system. If the parser is not allowed to produce a spanning analysis, it delivers the best partial fragments. The grammar must also be applicable to phenomena of spoken language. 
A typical problem is the extensive use of topicalization and even omission of particles. Also, serialization of particles occurs more often than in written language, as described in (Siegel, 1999). A well-defined type hierarchy of Japanese particles is necessary here to describe their functions in the dialogues. Extensive use of honorification is another characteristic of spoken Japanese. A detailed description is necessary for different purposes in an MT system: honorification is a syntactic restrictor in subject-verb agreement and complement sentences. Furthermore, it is a very useful source of information for the solution of zero pronominalization (Metzing and Siegel, 1994). It is finally necessary for Japanese generation in order to find the appropriate honorific forms. The sign-based information structure of HPSG (Pollard and Sag, 1994) is predestined to describe honorification on the different levels of linguistics: on the syntactic level for agreement phenomena, on the contextual level for anaphora resolution and connection to speaker and addressee reference, and via co-indexing on the semantic level. Connected to honorification is the extensive use of auxiliary and light verb constructions that require solutions in the areas of morphosyntax, semantics, and context (see (Siegel, 2000) for a more detailed description). Finally, a severe problem of the Japanese grammar in the MT system is the high potential of ambiguity arising from the syntax of Japanese itself, and especially from the syntax of Japanese spoken language. For example, the Japanese particle ga marks verbal arguments in most cases. There are, however, occurrences of ga that are assigned to verbal adjuncts. Allowing ga to mark either arguments or adjuncts in all cases would lead to a high potential of (spurious) ambiguity. 
Thus, a restriction was set on the adjunctive ga, requiring the modified verb not to have any unsaturated ga arguments.</Paragraph> <Paragraph position="1"> The Japanese language allows many verbal arguments to be optional. For example, pronouns are very often not uttered. This phenomenon is basic for spoken Japanese, such that a syntax urgently needs a clear distinction between optional and obligatory (and adjacent) arguments. We therefore used a description of subcategorization that differs from the standard HPSG description in that it explicitly states the optionality of arguments.</Paragraph> </Section> <Section position="6" start_page="1047" end_page="1047" type="metho"> <SectionTitle> 3 Basic Algorithm </SectionTitle> <Paragraph position="0"> We start with the description of the top-level function HPSG2CFG, which initiates the approximation process (cf. section 1.1 for the main idea). Let 𝓡 be the set of all rules/rule schemata, 𝓛 the set of all lexicon entries, R the rule restrictor, and L the lexicon restrictor.</Paragraph> <Paragraph position="1"> We begin the approximation by first abstracting from the lexicon entries 𝓛 with the help of the lexicon restrictor L (line 5 of the algorithm).</Paragraph> <Paragraph position="2"> This constitutes our initial set T0 (line 6). Finally, we start the fixpoint iteration by calling Iterate with the necessary parameters.</Paragraph> </Section> <Section position="7" start_page="1047" end_page="1048" type="metho"> <SectionTitle> 7 Iterate(𝓡, R, T0). </SectionTitle> <Paragraph position="0"> After that, the instantiation of the rule schemata with rule/lexicon-restricted elements from the previous iteration Ti begins (lines 11-14). Instantiation via unification is performed by Fill-Daughters, which takes into account a single rule r and Ti, returning successful instantiations (line 12) to which we apply the rule restrictor (line 13). 
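The fixpoint iteration just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: `fill_daughters` and the restrictor are passed in as assumed helpers, and restricted feature structures are represented as hashable values so that ordinary set union applies.

```python
# Hypothetical sketch of the Iterate fixpoint loop; `fill_daughters`
# and `rule_restrictor` are assumed helpers, not the paper's code.

def iterate(rules, rule_restrictor, t0, fill_daughters):
    """Compute the fixpoint of rule instantiation.

    rules           -- the rule schemata
    rule_restrictor -- maps an instantiated structure to its restricted image
    t0              -- initial set of lexicon-restricted structures
    fill_daughters  -- instantiation via unification for a single rule
    """
    t_prev, t_cur = set(), set(t0)
    while t_cur != t_prev:                    # fixpoint not yet reached
        new = set()
        for rule in rules:                    # instantiate every schema
            for inst in fill_daughters(rule, t_cur):
                new.add(rule_restrictor(inst))
        t_prev, t_cur = t_cur, t_cur | new    # set union keeps only new items
    return t_cur                              # generate CF rules from this set
```

The loop can only terminate because the restrictors map feature structures into a finite domain, so the monotonically growing set must eventually stop changing.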
The outcome of this restriction is added to the actual set of rule-restricted feature structures Ti+1 iff it is new (remember how set union works; line 14). If no new feature structures have been added during the current iteration (line 15), meaning that we have reached a fixpoint, we immediately exit with Ti (line 16), from which we generate the context-free rules as indicated in section 1.1.</Paragraph> <Paragraph position="1"> Otherwise, we proceed with the iteration. We note here that the pseudo code above is only a naïve version of the implemented algorithm. It is still correct, but not computationally tractable when dealing with large HPSG grammars. Technical details and optimizations of the actual algorithm, together with a description of the theoretical foundations, are described in (Kiefer and Krieger, 2000a). Due to space limitations, we can only give a glimpse of the actual implementation.</Paragraph> <Paragraph position="2"> Firstly, the most obvious optimization applies to the function Fill-Daughters (line 12), where the number of unifications is reduced by avoiding recomputation of combinations of daughters and rules that have already been checked. To do this in a simple way, we split the set Ti into Ti \ Ti-1 and Ti-1 and fill a rule with only those permutations of daughters which contain at least one element from Ti \ Ti-1. This guarantees checking of only those configurations which were enabled by the last iteration.</Paragraph> <Paragraph position="3"> Secondly, we use techniques developed in (Kiefer et al., 1999), namely the so-called rule filter and the quick-check method. The rule filter precomputes the applicability of rules into each other and thus is able to predict a failing unification using a simple and fast table lookup. The quick-check method exploits the fact that unification fails more often at certain points in feature structures than at others. 
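The quick-check idea can be illustrated with a small sketch. All names here (`FAILURE_PATHS`, `type_at`, `types_compatible`) are assumptions for illustration, not the actual implementation: types at a fixed set of failure-prone paths are compared first, and only if all of them are compatible is the expensive full unification attempted.

```python
# Illustrative quick-check filter; path names and helpers are assumed,
# not taken from the actual grammar or unifier.

FAILURE_PATHS = ["SYNSEM.LOCAL.CAT.HEAD", "SYNSEM.LOCAL.CAT.VAL.SUBJ"]

def quick_check(fs1, fs2, type_at, types_compatible):
    """Cheap pre-filter applied before full unification.

    Returns False as soon as the types at one recorded failure path
    cannot be unified, so most failing pairs never reach full unification.
    """
    for path in FAILURE_PATHS:
        if not types_compatible(type_at(fs1, path), type_at(fs2, path)):
            return False        # predicted failure: skip full unification
    return True                 # may still fail in the full unification
```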
In an offline stage, we parse a test corpus using a special unifier that records all failures instead of bailing out after the first one, in order to determine the most prominent failure points/paths. These points constitute the so-called quick-check vector. When executing a unification during approximation, those points are efficiently accessed and checked using type unification prior to the rest of the structure. Exactly these quick-check points are used to build the lexicon and the rule restrictor as described earlier (see fig. 1). During our experiments, nearly 100% of all failing unifications in Fill-Daughters could be quickly detected using the above two techniques.</Paragraph> <Paragraph position="4"> Thirdly, instead of using set union, we use a more elaborate operation during the addition of new feature structures to Ti+1. In fact, we add a new structure only if it is not subsumed by some structure already in the set. To do this efficiently, the quick-check vectors described above are employed here: before performing full feature structure subsumption, we pairwise check the elements of the vectors using type subsumption, and only if this succeeds do we perform a full subsumption test. If we add a new structure, we also remove all those structures in Ti+1 that are subsumed by the new structure in order to keep the set small. This does not change the language of the resulting CF grammar, because a more general structure can be put into at least those daughter positions which can be filled by the more specific one. Consequently, for each production that employs the more specific structure, there will be a (possibly) more general production employing the more general structure in the same daughter positions. 
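The subsumption-based addition just described might look as follows. This is a sketch under stated assumptions: `subsumes(g, s)` (true iff g is at least as general as s) is an assumed helper, and the quick-check pre-test is omitted for brevity.

```python
# Illustrative subsumption-based set maintenance; `subsumes` is an
# assumed helper, not the actual feature structure subsumption code.

def add_structure(structures, new, subsumes):
    """Add `new` unless it is subsumed; drop structures it subsumes.

    structures -- current set of restricted feature structures
    subsumes   -- subsumes(g, s) is True iff g is at least as general as s
    Returns the updated set.
    """
    for old in structures:
        if subsumes(old, new):          # `new` adds nothing: keep set as is
            return structures
    # keep only structures NOT subsumed by the new, more general one
    kept = {old for old in structures if not subsumes(new, old)}
    kept.add(new)
    return kept
```

Dropping the subsumed structures keeps the set small without changing the resulting CF language, for exactly the reason given above: the more general structure fits every daughter position the more specific one did.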
Extending feature structure subsumption by quick-check subsumption definitely pays off: more than 98% of all failing subsumptions could be detected early.</Paragraph> <Paragraph position="5"> Further optimizations to make the algorithm work in practice are described in (Kiefer and Krieger, 2000b).</Paragraph> </Section> </Paper>