<?xml version="1.0" standalone="yes"?> <Paper uid="C90-3101"> <Title>PILOT IMPLEMENTATION OF A BILINGUAL KNOWLEDGE BANK</Title> <Section position="2" start_page="0" end_page="0" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Harris (1988) has called for a &quot;hyper-bitext&quot; tool for professional translators, a tool which would permit them easy on-line retrieval of bilingual equivalences, or &quot;translation units&quot;, they have used in the past. The translator's previous output would be stored as hypertext, with the parallel texts aligned as far as possible.</Paragraph> <Paragraph position="1"> A search for a given expression or term would thus display, for each occurrence in the corpus, a chunk of source language context together with the corresponding fragment in the target language.</Paragraph> <Paragraph position="2"> At the same time, but independently, the authors and their colleagues at BSO/Research have been experimenting with bilingual corpora as a potential knowledge source for the Distributed Language Translation system (for an overview of this machine translation project, see Witkam 1988). They have argued that a bilingual corpus, appropriately structured, can largely replace conventional dictionaries (Sadler 1989: 133) and grammar rules (van Zuijlen 1989) in machine translation. The aim is to automate as far as possible the acquisition of the various types of knowledge required for machine translation - from monolingual knowledge of morphology, word classes, syntactic structures etc., through bilingual knowledge of lexical equivalences and translation syntax, to purely extra-linguistic knowledge-of-the-world - by structuring the evidence explicitly and implicitly available in human translations. The structured bilingual corpus is termed a &quot;Bilingual Knowledge Bank&quot;, or BKB. 
It appears that the tools now under development for constructing a BKB may also provide the professional translator with a more sophisticated form of &quot;hyper-bitext&quot; than that envisaged by Harris.</Paragraph> <Paragraph position="3"> 2. Building a Bilingual Knowledge Bank There are basically three steps involved in building a BKB structure. First, each language version must be structured syntactically if it is to serve as a source of (monolingual and contrastive) grammatical knowledge.</Paragraph> <Paragraph position="4"> Second, semantically equivalent units (translation units) must be identified and cross-linked between the two versions. Third, referential or conceptual links must be added to identify various types of deixis and co-reference. The process can be illustrated with the following English-French example from Harris (1988).</Paragraph> <Paragraph position="5"> [1] The board of PAC unanimously confirms the mandate.</Paragraph> <Paragraph position="6"> = Le conseil du PAC est unanime dans sa confirmation du mandat.</Paragraph> <Paragraph position="7"> The Distributed Language Translation project has adopted dependency, rather than constituency, syntax (Schubert 1987; Maxwell &amp; Schubert 1989), and figure 1 shows the dependency trees for this example, cross-coded for translation units (TUs). Each ellipse corresponds to a subtree. The basic TUs are dependency (sub)trees. Each of the seven subtrees which are directly identifiable as translation units has been assigned an identification number.</Paragraph> <Paragraph position="9"> Table 1 lists the TU numbers with the corresponding equivalences. For example, TU 1 identifies the complete sentence, TU 2 is the subject noun phrase, 3 the determiner, 4 the prepositional phrase, etc. While each of the basic translation units corresponds to a (sub)tree, not every subtree corresponds to a translation unit. 
The French subtree governed by dans, for instance, does not constitute a translation unit. In the TU coding, this is shown by the identification &quot;1/2&quot; attached to dans, which indicates that this subtree is the second bound dependent in TU 1.</Paragraph> <Paragraph position="10"> Table 1: Translation units identified in figure 1.</Paragraph> <Paragraph position="11"> TU | English phrase | French phrase
1 | The ... mandate. | Le ... du mandat.
2 | the board of PAC | le conseil du PAC
3 | the | le
4 | of PAC | du PAC
5 | PAC | le PAC
6 | the mandate | le mandat
7 | the | le
</Paragraph> <Paragraph position="12"> The subtree approach to translation units allows for a process of tree subtraction which amounts to a kind of generalization. This allows the productive use of all the equivalences in the text, even if they do not constitute independent subtrees. For example, subtracting TUs 2 and 6 from TU 1 in figure 1 yields the equivalence of to unanimously confirm with être unanime dans sa confirmation de. In a machine translation application, TUs 2 and 6 can be thought of as variables in a productive translation rule. Table 2 lists the remaining possibilities and the corresponding subtractions. Once the basic TUs have been identified, these other equivalences can be automatically deduced by tree subtraction.</Paragraph> <Paragraph position="13"> The remaining step in BKB construction is the coding of references. In figure 1, TU 6 (the mandate = le mandat) will be linked by a pointer to its antecedent in a previous sentence. This link is bilingual, but other references may be language-specific. For example, the possessive pronoun in the French sentence has no correspondent in the English version, as shown by the coding &quot;1/4&quot; in figure 1. 
Nevertheless, a monolingual link must be established between sa (or its normalized form son) and the antecedent, which can be identified as unit 2 (le conseil du PAC).</Paragraph> <Paragraph position="14"> Interconnecting the various surface forms used to express a concept multiplies, for any given surface form, the contextual constraints which can be derived from the BKB, e.g. for the purposes of automatic disambiguation. It also allows the BKB structure to be regarded as a type of knowledge representation to which inference rules can be applied (Sadler 1989: 149-233).</Paragraph> <Paragraph position="15"> The building of a Bilingual Knowledge Bank entails a great deal of interactive text processing. Even after the text in each language has been correctly parsed, the conversion of the parallel dependency trees to the BKB structure cannot be performed automatically. However, it does appear that a great deal of the work can become automatic. There are two reasons for this. First, the BKB itself can provide more and more support, in a kind of boot-strapping process, the larger it becomes. Second, the information contained in one language version can support the disambiguation of the other version.</Paragraph> <Paragraph position="16"> 3. The pilot implementation In order to serve as a general provider of linguistic and world knowledge, a BKB should contain large amounts of data. When considering time-critical BKB applications, such as the BKB within a machine translation system, it is clear that efficient data storage techniques are needed. Of course, it is not possible to investigate BKB techniques on a very large scale at present, because it takes a relatively long time to process the corpus. For this reason a small-scale implementation has been designed which gives a good impression of a future large-scale BKB system. The basis for this pilot BKB is formed by three parallel 20,000-word text corpora in the field of computer manuals. 
From these corpora, two BKBs have been built: one for English/Esperanto, the other for French/Esperanto. The pilot implementation consists of three main parts: the parser, the &quot;synsemizer&quot; and the retrieval system.</Paragraph> <Paragraph position="17"> The parser is used to parse each input text. Since each sentence which is stored in the BKB should have only one meaning (i.e., should contain no syntactic ambiguities), the parser yields only one analysis per sentence. This deterministic behaviour is produced by a simple category-based grammar on the one hand, and built-in mechanisms which take care of coordination, ellipsis and uncertain syntagma attachments on the other hand. The analysis found is presented graphically to the user, and can be edited as required before it is stored in the BKB. Words are stored in their normalized forms with categories and some basic syntactic features. The parsing process is BKB-supported: with each new sentence, the information that was stored earlier is used to give clues to categories, features and normalized forms. Besides this learning capability, a future BKB system will also use the structure of sentences already parsed to resolve attachment problems that the parser was unable to resolve.</Paragraph> <Paragraph position="18"> The synsemizer is used both to define translation units by establishing bilingual relations between corresponding monolingual subtrees, and to establish monolingual referential relationships. The first part of the work is presented to the user graphically: the computer searches for probable TU constituents and displays them for the user's confirmation or correction. Subsequent proposals are influenced by the user's response. The system is self-improving, since the computer's guesses are based on the whole of the text processed so far. 
Referential relations must be identified manually in this pilot implementation.</Paragraph> <Paragraph position="19"> However, since bilingual relations (TUs) have already been established before this process begins, there is additional information available to aid the operator. The retrieval system is a tool which extracts information from a BKB that has been built using the parser and the synsemizer. On the basis of input phrases, which can be augmented with syntactic information, the BKB is queried. The resulting answers are presented to the user, either graphically or textually. Possible queries include concordance queries, translation and back-translation queries, and - to some extent - bridge translation (e.g. simulated English-to-French translation via Esperanto by &quot;chaining&quot; the two available BKBs).</Paragraph> <Paragraph position="20"> An interesting aspect of this pilot implementation is that it is not just a simplified prototype system in which decisions about various difficult issues are postponed. On the contrary, it contains the required functionality for building a real large-scale BKB. Any weaknesses of the pilot system derive from its limited size and from inefficiencies in implementation, rather than from its functionality. The system can therefore be used for examining various extrapolation-directed aspects such as linguistic and technical applicability, consistency mechanisms and also user interface presentation at the BKB building stage.</Paragraph> <Paragraph position="21"> 4. Comparison with other research The corpus-based approach to dictionary acquisition, which is part of the motivation behind the Bilingual Knowledge Bank, should not be confused with attempts made elsewhere to derive lexical equivalences from a bilingual corpus by purely probabilistic means (e.g. Brown et al. 1988). Syntactic structure is an essential BKB ingredient. 
Sumita &amp; Tsutsumi (1988) have implemented a database of equivalent sentences in Japanese and English, but no full syntactic parsing is done, and retrieval is based on patterns of function words in the Japanese text. In their tool, sentences retrieved in bilingual form serve merely as models for the human translator. Another translation aid has been described and implemented by Kjærsgaard (1987). This system allows the translator to retrieve a key word from one half of a bilingual corpus, together with its context in the source language and the corresponding chunk of text in the target language. It is up to the user, however, to decide which, if any, is the equivalent expression in the target language chunk.</Paragraph> <Paragraph position="22"> The closest comparable research appears to be that of Ogura et al. (1989), who have structured some 40,000 words of running text in Japanese and English in what they term a &quot;linguistic database&quot;. This does comprise a hierarchical syntactic and text-level structure, as well as cross-references between equivalent expressions in the two languages, although it is not clear whether all translation units have been coded.</Paragraph> <Paragraph position="23"> Their primary aim is to provide a friendly interface for the linguist, answering queries on word-class statistics, displaying the context and translations of key expressions, etc. In contrast, the present research is directed primarily towards applications in machine translation.</Paragraph> </Section> </Paper>