File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/97/w97-0414_metho.xml
Size: 9,174 bytes
Last Modified: 2025-10-06 14:14:45
<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0414"> <Title>A Polish-to-English Text-to-text Translation System Based on an Electronic Dictionary</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 The dictionary </SectionTitle> <Paragraph position="0"> The process of creating a Polish-to-English electronic dictionary destined to be used in computerised text translation was performed out in the following steps: I. The preparatory phase. In this phase classification files of the inflection of Polish words as well as the coding of Polish and English inflection paradigms were prepared.</Paragraph> <Paragraph position="1"> 2. The phase of creating the dictionary of canonical forms. This phase was carried out by lexicographers aimed by an interactive computer application. Each entry in the dictionary is supplied with inflection codes of its Polish and English parts as well as other syntactic-semantic information. The inflection code of the Polish part of an entry is a reference to a set of inflection endings stored in one of classification files prepared in phase 1. The format of the inflection code of the English part is &quot;selfconstructive&quot;, i.e. it enables the generation of appropriate inflected forms from a canonical form without the necessity of a time-consuming look up of any classification file. (Designing a &quot;self-constructive&quot; code for a highly flexional language like Polish would have been a complex task).</Paragraph> <Paragraph position="2"> 3. The phase of generating the SGML.type dictionary of inflected forms. This phase is executed automatically. The inflected forms are generated on the basis of Polish inflection codes attached to all entries in the dictionary of canonical forms. Only inflected forms of Polish words (phrases) are created in this phase. Each form inherits the syntactic-semantic information from its canonical form.</Paragraph> <Paragraph position="3"> English equivalents of Polish inflected forms are not derived. The derivation of an appropriate English form is left to the morphological synthesis in the translation process. Storing the dictionary as an SGML-type document aims at comfortable browsing of its contents as well as facilitating its use in an application other than the POLENG translation algorithm.</Paragraph> <Paragraph position="4"> 4. Converting the dictionary into two modified finite-state automata. This phase is executed automatically in order to optimise the access time. The first automaton stores single words. Its alphabet coincides with the Polish orthographic alphabet. Reaching a terminal state of the automaton is equivalent to finding a Polish inflected form in the dictionary. Whenever a finite state is reached, references to the table of morphological features and the table of canonical forms are obtained. Due to the references, all morphological and syntactic-semantic information as well as the canonical forms of the English equivalents of the found word are achieved. The second automaton stores the lexical phrases. The alphabet of the automaton is the set of identifiers of the words stored in the dictionary of single words. This means that the process of searching a phrase is executed &quot;word by word&quot; (in contrast to the &quot;letter by letter&quot; search in the automaton of single words).</Paragraph> </Section> <Section position="3" start_page="0" end_page="94" type="metho"> <SectionTitle> 2 The translation algorithm </SectionTitle> <Paragraph position="0"> The translation algorithm is non-modular: its only results are the phrasal structure and the surface form of the English expression corresponding to the Polish input. The processes of syntactic parsing, semantic analysis, transfer and morphological generation are not separated. The grammar assumed in parsing Polish expressions consists mostly of DGC rules. A specific algorithm is used for parsing verbal phrases. The algorithm deals with a characteristic feature of Polish syntax: an almost arbitrary order of verb modifiers. Types and admissible orders of modifiers of a given verb are listed in the dictionary. A few English verbs may correspond to one Polish verb depending on the type and the order of its modifiers. For each type of a modifier of a Polish verb, the type of the corresponding modifier of the English equivalent is given in the dictionary. The translation algorithm searches for the constituent verb of a clause, consults the information extracted from the dictionary and first checks for the constructions admissible for the verb. This approach enables the analysis of the Polish input expression, the choice of the appropriate English verb equivalent and the synthesis of the correct English output. If verb modifiers in the clause fulfil none of the types given in the dictionary for the predicate, then default values for the English verb equivalent and the English modifier types are taken.</Paragraph> <Paragraph position="1"> The algorithm makes it possible to transfer special verbal constructions called (-T) constructions and (T-shifted) constructions, where T denotes a type of a modifier. (-T) constructions are used in analysing object questions and relative object clauses in which a modifier of the type T does not explicitly occur, although the type T-modifier is required for the verb according to the dictionary information (e.g. in the sentence &quot;He is a man I was talking to&quot; the object of the clause &quot;I was talking to&quot; appears &quot;outside&quot; the clause). The (T-shifted) constructions are used in analysing sentences in which an object of the Polish expression should be transferred into the subject of the English expression (e.g. the Polish.sentence: &quot;Nie powiedziano mi (object) o tyro&quot;; the English translation: I (subject) have not been told about that&quot;).</Paragraph> <Paragraph position="2"> in a Polish sentence verbs are characterised both by preand post-modifiers. However, the sequence of words between the subject and the predicate in a sentence is subject to a number of constraints and is therefore amenable to deterministic parsing. This deterministic part of the algorithm has a notable impact on the effectiveness of the translation process.</Paragraph> <Paragraph position="3"> A few heuristic methods have been developed in order to limit the search space and thus achieve better efficiency. The &quot;method of filters&quot; consists in checking &quot;negative rules&quot; first. The success of a negative rule is equivalent to a failure of the hypothesis. The method &quot;replace by alternative&quot; consists in replacing two rules with the same left-hand symbol and the same beginning of their right sides by one rule - e.g. two rules A---~BC, A---~BD are replaced by one rule A---~B(C or D). This improves effectiveness because of a single (rather than double) attempt to expand the same non-terminal symbol to the given string of terminals. The method &quot;from longest to shortest&quot; says that if a symbol A occurs as a non-last right-hand symbol of a rule - e.g. in the rule D--cAE - and the grammar includes more rules than one to replace the symbol A - e.g. rules: A--~BC, A--~B, then it is more effective to make the algorithm check the &quot;longer&quot; rule A---~BC before checking the &quot;shorter&quot; rule A---~B. This makes it possible to block backtracking in the rule D-+A!E. The method &quot;replace symbol by parameter&quot; may be used when right-hand sides of productions for different symbols start with the same (sequence of) symbol(s). For example, two rules A--~BC, D--~BE may be replaced by one rule: F(P)---~ B((C and P is P1) or (D and P is P2)).</Paragraph> </Section> <Section position="4" start_page="94" end_page="94" type="metho"> <SectionTitle> 3 Status of the system </SectionTitle> <Paragraph position="0"> Currently the POLENG dictionary consists of about 2000 lexemes which corresponds to about 30000 inflected forms, mainly from the domain of computer science. The research plans for near future include consulting a large corpus of computer texts in order to create a bilingual dictionary which will enable the translation of a wide range of Polish computational texts into English.</Paragraph> <Paragraph position="1"> The starting point for the translation algorithm was a parser of Polish sentences described in (Szpakowicz, 1986). Most of the expressions parsed by that system are transferable in the system POLENG. There are still a lot of grammatical, syntactic and semantic problems to be solved. The problem of assigning a correct tense of the English output (there are only 3 tenses in Polish) is currently solved on the basis of the surface structure of the Polish input expression (the solution is far from perfect).</Paragraph> <Paragraph position="2"> The problem of determiners is not solved at all (all noun groups are assumed definite). The solutions of these two problems will be sought in the near future.</Paragraph> </Section> class="xml-element"></Paper>