File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/87/e87-1019_abstr.xml
Size: 14,723 bytes
Last Modified: 2025-10-06 13:46:23
<?xml version="1.0" standalone="yes"?> <Paper uid="E87-1019"> <Title>lish-to-Czech Machine Translation Sys-</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> FAIL-SOFT (&quot;EMERGENCY&quot;) MEASURES IN A PRODUCTION-ORIENTED MT SYSTEM </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="107" type="abstr"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> A system of fail-soft (emergency) measures for a production-oriented MT system is discussed, first stating the specific purposes of such a system and then showing how these measures are used in the English-to-Czech machine translation system prepared by the group of mathematical linguistics at Charles University in Prague.</Paragraph> <Paragraph position="1"> 1. For a production-oriented system of machine translation under present-day conditions, one should keep in mind that the end-user expects to have at his disposal a complete text rather than an alternating sequence of (sentence) segments and blanks. On the other hand, everyone who has taken even a perfunctory look at the problems involved in MT would agree that there is no such thing as a &quot;complete&quot; MT system, either in its dictionary or in its grammar. Also, as is commonly accepted nowadays, any system that takes natural-language texts as input should provide some treatment of ill-formed input. 
Thus it is necessary, first, to consider what type and quality of translation can meet the demands of a prospective user and what kind of translation is realizable under the given conditions, and, second, to decide what is to be sacrificed from the system, and what added to it, to make it work in a production environment.</Paragraph> <Paragraph position="2"> In the present paper we would like to outline one aspect of the approach taken at the start of our experiment APAC3-2, the English-to-Czech machine translation system for translating INSPEC abstracts from the field of microelectronics (for a description of the history of the MT efforts of our team, see Hajičová, 1986; the APAC series is described in full detail in Kirschner, 1982; 1984; in press). We should emphasize that, under the conditions within our reach, we aim at a satisfactorily accurate rendering in the target language of the contents of a relatively simple source-language text, which would suffice for such a system to be applicable in information acquisition and would meet the main requirements of average users.</Paragraph> <Section position="1" start_page="0" end_page="104" type="sub_section"> <SectionTitle> 2.1 The specific purposes of the system </SectionTitle> <Paragraph position="0"> of fail-soft (&quot;emergency&quot;) measures to overcome anomalous input phenomena and partial failures of the MT system can be stated as follows: - The whole processing is divided into several stages so that the more probable interpretations of some structures are identified and treated preferentially, i.e., to make the &quot;preferential&quot; approach (see below, Sect. 2.2) a reality.</Paragraph> <Paragraph position="1"> The drawback is that in these circumstances the danger of a dilemmatic situation ending in an impasse increases. 
A special device has been introduced to overcome such an abnormal end: instead of eliminating such a defective string, the application of the phase in question is suspended, i.e., the program processing this string skips the phase and continues with the next one. The rules that compensate for the lacuna can be either the rules that, in the framework of the &quot;preferential&quot; approach, take up the role of their stricter predecessors, or rules added particularly for this purpose, to deal with the most undesirable consequences of such an omission.</Paragraph> <Paragraph position="2"> - In the analysis, the emergency rules interpret unrecognized elements and integrate them into more complex structures.</Paragraph> <Paragraph position="3"> - In the synthesis, they help to produce an output that makes sense, corresponds to the source language, and is easier to post-edit. - Whenever possible, they attempt to form target-language equivalents for the unidentified elements, either by adapting international words or by &quot;czechizing&quot; English dictionary forms, endowing them with qualities and forms proper to their presumptive Czech counterparts, e.g., gender, suffixes, etc.</Paragraph> <Paragraph position="4"> - With some classes of words, they serve as general dictionary rules, provided that the sets of semantic features, frame information and other necessary equipment of individual members of these classes correspond to the standard apparatus assigned to their representation in the framework of a general device, and that their orthography ensures the formation of correct equivalents in Czech.</Paragraph> </Section> <Section position="2" start_page="104" end_page="107" type="sub_section"> <SectionTitle> 2.2 The fail-soft measures can be characterized </SectionTitle> <Paragraph position="0"> as consisting of three main parts: the first two concern elements not found in the basic dictionaries and the third concerns failures to arrive at a complete parse. 
(We leave aside a discussion of the unification of orthography, such as American and British usage, different ways of spelling, use of hyphens, etc., which comes before the first device described here.) In a sense, there is another set of rules of an &quot;emergency&quot; character: general rules (which can be called &quot;sweeping rules&quot;) designed to operate after all more specific rules have failed to apply, e.g., in the formation of compounds or in nominal syntax in general; however, since this is a constitutive component of what we call the &quot;preferential&quot; approach, we shall confine ourselves to describing only the former three sets. To avoid a possible misunderstanding, we should make clear that when we call our approach &quot;preferential&quot;, it is only the name that it has in common with Wilks' &quot;preferential semantics&quot;. In our system, we apply a rather trivial and simple principle with the aid of which the different probabilities of the interpretation(s) of some parts of a string are taken into account and exploited. The most probable solutions are covered by the rules first and in as much detail as possible; the next most probable solution is offered in one of the subsequent phases, under more liberal conditions, and so on. The &quot;sweeping rules&quot; come last. That is also the reason why we write &quot;preferential&quot; with quotation marks.</Paragraph> <Paragraph position="1"> 2.21 The first device aimed at intercepting and interpreting words that failed to be found in the basic dictionaries is the so-called transducing dictionary (TD).</Paragraph> <Paragraph position="2"> Its task is to interpret the still unrecognized words according to their typical and (mostly) productive suffixes (the inflectional endings having been detached and the dictionary forms reconstructed by morphemic analysis in the preceding steps), and to assign to them part-of-speech and semantic information. 
Thus, e.g., words ending in -ER, -OR, -GRAPH, -ODE and some others are interpreted as nouns, concrete, instruments, capable of being substituted for a human actor; words ending in -CS, -CY, -ESS, -TUDE are supposed to be nouns, abstract, properties and, as distinct from those ending in -ITY, -ICS, -SM, -SHIP, -HOOD, -THM, which otherwise have the same semantic characteristics, they form adjectives in a regular manner in Czech; the endings -FY, -ATE, -ISE (-IZE), -DUCE indicate verbs that can be both transitive and intransitive, of causative and (semi)terminological character, yet not allowed to form adjectives of the purposive type. A number of adjectival suffixes are covered, too, viz. -ARY, -AL, -RSE, -IVE, -OUS, -IC, -BLE, -LESS, -ANAR, -LEAR, -NEAR, -OLAR, -ULAR.</Paragraph> <Paragraph position="3"> In all, about 50 classes of nouns, 13 of adjectives and 4 of verbs are covered by the TD device.</Paragraph> <Paragraph position="4"> Two further pieces of information should be added, the first being probably superfluous: 1) All words having such suffixes but different properties as regards their part-of-speech category, semantic features, etc., are supposed to be contained in the basic dictionaries. 2) Most of the classes of words treated by the TD are international words of Latin or Greek origin; they can easily be &quot;transduced&quot; to Czech by relatively simple procedures; some of these procedures precede the TD operation as a part of a special morphemic analysis, but most of them operate in the synthesis, as an accessory to the English-Czech dictionary. A set of rules applied recursively (in several cycles) takes over the words identified by the TD, decomposes them, replaces the English suffixes by the corresponding Czech ones, and scans the bases for spelling configurations to be transformed or adapted to Czech orthography (replacing, e.g. 
PH by F, TH by T, C preceding A, L, O, R, T, U by K; S preceded by A, E, I, N, O, R, Y and followed by A, E, I, O is replaced by Z, etc.). Thus, e.g., PHOTOLITOGRAPHIC changes into FOTOLITOGRAFICKE2, CYCLOTRON gives CYKLOTRON, ISOSMOTIC is transcribed as IZOSMOTICKE2. To give an example of solving similar problems, let us consider the word ISOSEISMIC: to preclude the second S, situated at a morphemic juncture, from becoming a Z would require either a special entry in the main dictionary (as one word, or as a combination of the prefix ISO + SEISMIC, in which case the adjective must be contained in the dictionary) or some similar preliminary treatment in the special morphemic analysis preceding the TD; the latter treatment would probably be the best solution, and it may be generalized to all or most of the typical terminological prefixes involving problems analogous to IZOSEISMICKE2, e.g., A-, INFRA-, PRE-, PERI-, SEMI-, SYN-, MESO-, MONO-, HYPER-, POLY-, etc. (needless to add that this time it would be such words as ISOSMOTIC that would require a specific treatment, e.g. to proce~only SMOTIC -- from ISO + SMOTIC - in the dictionary).</Paragraph> <Paragraph position="5"> It should be remarked that, in principle, this part of the transducing device (orthographical changes) need not be separated from the front part operating in the analysis.</Paragraph> <Paragraph position="6"> 2.22 Words that remain unaccounted for after passing the TD phases, i.e., those not found in the dictionaries and not belonging to any of the classes dealt with in the transducing device, are subjected to further analysis; those having typical verbal inflectional endings (-ING, -ED) are regarded as verbs, and those ending in -LY are taken for adverbs provided that more than 2 characters precede the ending and their tentative status is syntactically corroborated. 
The rest are first treated as proper names and, if the subsequent analysis fails to confirm this conjecture (i.e., they are not integrated into wider nominal complexes, e.g., as an apposition), they become nouns (which, by the way, happens to the tentative adverbs, too). The words identified in this tentative manner are &quot;czechized&quot;, which in some cases may result in quite acceptable formations, e.g., if the original words can be taken as &quot;international&quot; or technically and terminologically univocal terms: GETTERING --&gt; GETEROVA2NI2, ABEND --&gt; ABENDOVAT; in other cases it results in more or less comical &quot;macaronic&quot; creations. In conclusion, it should be added that the original, more ambitious idea of assigning to each unrecognized word (that does not carry any characteristic clue making the guess easier) three parallel tentative interpretations (noun, verb, adverb) and letting the syntactic analysis decide had to be abandoned, for reasons similar to those that led us to give up the use of hypersentential context. Too many possibilities, often combined with other parallel solutions, led to a combinatorial explosion that (though often not assuming the character of an infinite loop) expanded the structures to such an extent that sooner or later an overflow became inevitable. So far, there is no remedy for overflow in our system.</Paragraph> <Paragraph position="7"> 2.23 The last, relatively simple, measure concerns cases where a single parse (or several parallel single parses), i.e., trees covering individual input strings, failed to be formed in the last phase of the analysis; usually two or more partial trees are formed instead, which may be caused by an anomalous structure of the input string, by some partial failure in analyzing one or more substrings (e.g., when some element(s) or structure(s) were misinterpreted), or by some shortcoming in the program: an omission, an error, etc. 
The synthesis program is able to process even such partial and fragmentary results and to attempt to compile an acceptable output; only a special character (~ or } ) is placed in front of such output strings to signal that they have been formed on the basis of defective results of the analysis. If necessary, a set of rules of a more or less ad hoc character strips &quot;underdone&quot; (sub)trees of all auxiliary structures (category labels, parentheses, separators, features, etc.), leaving only lexical values, and thus applies the finishing touches to bring the substitute output as close to readable and acceptable results as possible.</Paragraph> <Paragraph position="8"> 3. The outputs of individual phases can be obtained in the listing. Some of these phases, esp. the last-but-one phase, which fixes the state of things before the syntactic measures have been applied, usually preserve enough information to recognize and examine the unretouched results and to diagnose the errors or shortcomings, which is necessary for further progress.</Paragraph> <Paragraph position="9"> This is to say that most of the &quot;emergency&quot; devices operate at moments and in a manner that permit the previous state of things to be examined, so that their action does not obscure the regular course of the processing and allows normal control of it. It should be added that some of the emergency devices are of a temporary character, dealing with omissions and bugs proper to a system under development. We are sure that at least some of them will become superfluous.</Paragraph> </Section> </Section> </Paper>
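Illustrative note on Sect. 2.21. The orthographic adaptation described there (suffix replacement followed by spelling rules on the base) can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the APAC implementation: the names SUFFIX_MAP, SPELLING_RULES, split_suffix and czechize are hypothetical, the only suffix pair included (-IC to -ICKE2, in the digit-coded form printed in the examples) is the one attested by the paper's examples, and the rules are applied in a single pass in an assumed order, whereas the paper speaks of several cycles.

```python
import re

# Hypothetical suffix table (English -> Czech). Only -IC -> -ICKE2 is
# attested by the examples in Sect. 2.21; a real table would be far larger.
SUFFIX_MAP = {"IC": "ICKE2"}

# Spelling rules quoted in Sect. 2.21; the order of application is an assumption.
SPELLING_RULES = [
    (re.compile(r"PH"), "F"),                         # PH -> F
    (re.compile(r"TH"), "T"),                         # TH -> T
    (re.compile(r"C(?=[ALORTU])"), "K"),              # C before A,L,O,R,T,U -> K
    (re.compile(r"(?<=[AEINORY])S(?=[AEIO])"), "Z"),  # S between listed letters -> Z
]


def split_suffix(word):
    """Split off a known English suffix; return (base, Czech suffix)."""
    for english, czech in SUFFIX_MAP.items():
        if word.endswith(english) and len(word) > len(english):
            return word[: -len(english)], czech
    return word, ""


def czechize(word):
    """Replace the suffix, then adapt the spelling of the base (single pass)."""
    base, czech_suffix = split_suffix(word.upper())
    for pattern, replacement in SPELLING_RULES:
        base = pattern.sub(replacement, base)
    return base + czech_suffix


if __name__ == "__main__":
    for w in ("CYCLOTRON", "ISOSMOTIC", "ISOSEISMIC"):
        print(w, czechize(w))
```

With these rules alone, ISOSEISMIC comes out with both S's changed to Z, which is exactly the overgeneration at a morphemic juncture that Sect. 2.21 proposes to prevent by a dictionary entry or by prefix handling in the preceding morphemic analysis.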