<?xml version="1.0" standalone="yes"?> <Paper uid="C90-3031"> <Title>Corpus-based Lexical Acquisition for Translation</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> SENTENCE ~: 3S7748 </SectionTitle> <Paragraph position="0"> The ambassador's contribution was one small party at which a number of us ended up dancing on a table.</Paragraph> <Paragraph position="1"> L'apport de l'ambassadeur s'est résumé à une petite fête où nous avons fini par danser sur une table.</Paragraph> <Paragraph position="2"> Figure One: Sample Citation. Some representative verbs which have at least one movement sense were selected. We compared the extent of the information found in the bilingual corpus with the information found in the CR machine-readable dictionary (MRD). For verbs like commute which do not have a straightforward translation, we found either (1) all the components of the verb concept, as in 'se rendre au travail quotidiennement'; (2) parts of the translation, as in 'faire le trajet'; or (3) a totally different verb from that given in the MRD, such as 'parcourir' or 'voyager'.</Paragraph> <Paragraph position="3"> We observed that not only was the MRD information incomplete, but also only a partial expression of the typical meaning of the verb was provided. In the past, since printed dictionaries have been subject to the constraints of time and space, they have not always been able to offer full information about entries. However, with electronic dictionaries and lexical data bases, this should no longer be a restriction. In fact, given more and richer information, we envision a move away from the flat hierarchical structure of dictionaries to a more network-like representation of lexical knowledge. (Footnote: This work was completed at IBM T.J. Watson Research, although the second author is currently at AT&amp;T Bell Laboratories.) 
</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. Related Research. </SectionTitle> <Paragraph position="0"> Combining linguistic and statistical methods is becoming increasingly popular in computational linguistics, especially as more corpora become available (footnote 2). Work in this vein ranges from the syntactic and semantic to the lexical. For example, Atkins 1987 demonstrates convincingly that with corpus data, the lexicographer can attack the difficult problem of word senses in a systematic way.</Paragraph> <Paragraph position="1"> Church and Hanks 1989 and Church et al. 1990 develop a battery of statistical methods to induce linguistic regularities. They identify cooccurrence relations by computing statistics (e.g. mutual information, t-score) over millions of words of text. Their approach is focussed on monolingual rather than bilingual corpus analysis, and constitutes a significant contribution to lexical research. On a more syntactic note, Dagan and Itai 1990 use statistical methods over linguistically parsed text (Jensen 1986) to resolve anaphoric reference.</Paragraph> <Paragraph position="2"> In the arena of automatic bilingual lexicon construction, Catizone et al. 1989 take two corresponding texts (English and German) and develop algorithms to determine lexical alignments by using statistical methods over texts combined with the optional support of an MRD. In contrast, Sadler 1989 proposes parsing aligned corpora into dependency trees, which form the structures upon which lexical correspondences are suggested to the user.</Paragraph> <Paragraph position="3"> The early stages of the construction of the Bilingual Knowledge Base (BKB) rely heavily on human input, but the process gradually becomes more automatic as data is collected. Using purely statistical techniques, Brown et al. 1988 make use of the Hansard bilingual corpus for the purpose of building a machine translation system. 
Such a system is a good example of using exclusively statistical non-linguistic methods to induce translations.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 5. The BICORD System - Bilingual Corpus-based </SectionTitle> <Paragraph position="0"> Dictionary. Our approach involves a combination of standard linguistic methodology using MRDs, enhanced with some statistical techniques. Dictionaries are often discounted because they are built on the basis of introspective intuition rather than purely on objective observation of data. However, our underlying assumption is that the insights that a dictionary encodes and represents should not be disregarded (although there are some limitations resulting from the structural organisation). This is a controversial assumption. Even though, in the past, dictionaries have been built solely on the basis of intuition, current trends are to use corpus-driven criteria, as, for example, in the Collins COBUILD dictionary (1987). Without question this is a step in the right direction towards completeness and accuracy of coverage of the language as it actually occurs. However, the limitation of corpus analysis is that subtle linguistic intuitions about word behavior (such as &quot;negative evidence&quot;) cannot be obtained from corpora; in other words, what is disallowed in the language may never be discovered. Thus we disagree with the claim of Garside, Leech, and Sampson 1987 that the survival of both descriptive and theoretical computational linguistics lies primarily in statistical analysis. We take the more moderate view that both approaches (linguistic and statistical) are essential if the language is to be characterized accurately and in its entirety.</Paragraph> <Paragraph position="1"> We extracted occurrences of several movement verbs (called &quot;probe&quot; strings) from the English side of the Hansard corpus. 
The criteria used to ensure that the verb was a member of this semantic class are described in Atkins, Boguraev and Klavans 1990 (in preparation). The test set of verbs was drift, dance, commute, emigrate, immigrate, ascend, descend, circle, sail and glide. The probe string was used to search in CR, both for translations and collocations under the entry itself, and also for French headwords in the French side of the dictionary with the probe as a translation. The extracted corpus, consisting of the set of English citations containing the probe string (in any morphological shape) and the corresponding French sentences, is called a &quot;probe corpus&quot;. A statistical tagger (Tzoukermann and Merialdo 1989) was used to assign parts of speech to the English side of the corpora. Translations and collocations were abstracted automatically from the parsed version of CR (see Neff and Boguraev 1989) using LQL (Neff et al. 1988). For illustration, a partial entry for dance is: +-hdw: dance</Paragraph> <Paragraph position="3"> Footnote 2: For example, the ACL Data Collection Initiative (ACL/DCI) coordinated by Dr. Mark Liberman at AT&amp;T Bell Laboratories was established to make corpora of all shapes and sizes more widely available to the research community.</Paragraph> <Paragraph position="5"> source: to dance about, to dance up and down; target: gambader, sautiller; source: the child danced away /or/ off; target: l'enfant s'est éloigné en gambadant /or/ en sautillant. Figure Two: Partial MRD entry for dance. Also, the French words 'gambiller' and 'guincher' have dance as a translation. Probes had a maximum of 1 t46 citations, with a maximum of 25 senses and collocations in CR (a rough measure of polysemy). The tagger used to preprocess the corpus was trained on 1 million words (about 42,000 sentences) tagged manually and provided by the tree bank of Lancaster University (Garside, Leech, and Sampson 1987). Our version has 81 tags, a subset of the tree bank tags. 
Of these tags, 52 are categorial (such as VV+I for the infinitival form of a non-auxiliary verb) and 29 are lexically bound, some of the latter being bound to a class of one (e.g. IO is for the preposition of), and some being bound to a small sub-class of a category (such as PP*S for &quot;personal pronoun subject&quot;). Some tags (such as N+I &quot;singular noun&quot;) provide morphological information as well as categorial information. The program, based on a trigram model, computes the probability of a word in relation to its tag and assigns the tag that corresponds to the highest likelihood. In its simplest form: $P(\mathrm{tag} \mid \mathrm{word}) \propto P(\mathrm{word} \mid \mathrm{tag}) \, P(\mathrm{tag})$</Paragraph> <Paragraph position="7"> that is, the probability of a tag given its word corresponds to the product of the probability of observing the word given its tag by the probability of observing the tag. By random sampling, we determined the error rate for part of speech tagging to be about 3%.</Paragraph> <Paragraph position="8"> In this way, uses of the sample strings as verbs were separated from the nominal uses. This is the first step in disambiguation, enabling lexical correspondences. To give an idea of size, there were 293 citations (about 12,000 words) with the string dance in its four morphological forms in English.</Paragraph> <Paragraph position="9"> The distribution by part of speech for these citations [...] pass at filtering out pre-linked pairs common to both data resources. Citations that have lexical correspondences already provided by the machine-readable dictionary are extracted from the probe corpus. For example, consider again the verb dance. The character strings in the translation and collocation fields are extracted from CR; these strings are filtered to remove function words and some common words (such as 'faire' (to make or do)), and morphological variants are generated. Some examples for dance are 'danser/dansa/dansera ..., gambader/gambadant .... 
' Probe translations and collocations from CR are then ready to be used to automatically match strings in the French side of the corpus. Each correspondence that matches one of the MRD probes is removed from the probe corpus, stored, and counted, leaving a reduced probe corpus. For example, for 109 citations of dance as a verb, 52 sentences matched the MRD correspondences, as shown in Figure One. An extended lexicon can then be built, using the structure already provided by CR, where the frequencies are computed over these matches. For example, an initial partial enhanced entry for dance is:</Paragraph> <Paragraph position="11"> Notice that dictionary nodes are now identified with a prefix &quot;d_&quot;, and corpus-motivated nodes with &quot;c_&quot;. New information is placed at the relevant node, low in the tree if there is no ambiguity of attachment or scope, and higher in the tree if necessary until evidence is found to permit the information to be moved down in the structure. For example, an additional node is added to the MRD structure to insert danser since danser is a translation both in homograph 2 and in homograph 3. Since transitivity of a verb cannot be determined automatically, there is no evidence to motivate placement so the data is inserted high in the tree, at the homograph level. In contrast, 'gambader' and 'sautiller' are always intransitive (as determined by a look-up in CR), so they can be automatically placed under homograph three. Notice also that corpus-derived information is placed under the relevant d_collocat for 'gambader' and 'sautiller' since these are cases where matches occurred on the target term, but the source is different.</Paragraph> <Paragraph position="12"> The Hansard, being the Canadian Parliamentary proceedings, contains a number of juridical and parliamentary terms, usages, and structures, a typical feature of any sublanguage. 
However, the flexibility inherent in the BICORD system would allow a repetition of the same process over different sublanguages.</Paragraph> <Paragraph position="13"> As other texts are used, frequencies can be updated in two ways, by counting all frequencies into a general score, and also by keeping separate frequencies linked to the source text. This feature allows a representation of the lexical correspondences of general and specific texts in one data structure. It also permits comparison between sublanguages. The result would be a balanced lexicon built over a balanced variety of corpora to reflect the actual uses of the words or phrases in context.</Paragraph> <Paragraph position="14"> Further analysis of the remaining probe corpus is pursued by observing cooccurrences both over tags and lexical items. For example, with dance, looking at the immediate right context over tags reveals that the majority of these cases are for the preposition to.</Paragraph> <Paragraph position="15"> Including cooccurrences over a larger window of five words, idioms are revealed like dance to ... tune, which is not found in CR, either under tune or dance. These and other patterns can be discovered by statistical analysis over tags and lexical items in the reduced probe corpora. Therefore, a new set of collocations can be inserted in the lexicon; an entry for &quot;dance&quot; enhanced further is shown as follows: +-hdw: dance +-source: to dance to +-argument: (~he) t~ loll +-freq : llZ +-target: se mettre au diapason +-target: compléter le quatuor +-c_collocat +-source: to dance around</Paragraph> <Paragraph position="17"> +-source: the child danced away /or/ off +-target: l'enfant s'est éloigné en gambadant /or/ en sautillant [...] conversely, to enhance a statistical system with data from an MRD. 
The first application can be viewed in the light of a lexicographer's workstation; it can also be viewed as a contribution to the choice of lexical item made by the component responsible for lexical transfer in a machine translation system. Translations and collocations in the original MRD are ordered by frequency, orderings which can easily be updated depending on the sub-language corpus. The enhanced MRD is more complete in containing correspondences not found in the original dictionary, and in suggesting new statistically significant translations. As for the second type of application, systems such as described in Brown et al. 1988 which use purely statistical approaches to infer translations from a bilingual corpus can benefit directly from the information already given in the MRD. This information can be used to preset values in the computation of correspondences, rather than letting the system learn values already discovered. Future work depends on testing these two applications, namely that MRD-based lexical transfer will proceed more accurately given statistical information and that statistical implementations, given enhanced MRD data, will demonstrate improved performance in determining lexical correspondences. Acknowledgements: We thank members of the Speech Recognition Group at IBM for cleaning and maintaining the Hansard corpus. In particular, we acknowledge help from Bernard Merialdo.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> References </SectionTitle> <Paragraph position="0"> It is not always the case that the remaining corpus data can be easily inserted in the lexicon and in fact, we encountered a few problems during this process.</Paragraph> <Paragraph position="1"> First, it is not straightforward to know with which field to associate the resulting correspondences. 
For example, in dance, does dance around go under a separate translation field or is it related to the collocation field with dance about? Second, some new context fields should be added to the collocation nodes, but determining the criteria for selecting them automatically is not always evident.</Paragraph> <Paragraph position="2"> Further, there is a question of locating and integrating robust new data from the corpus into the already existing structure.</Paragraph> </Section> </Paper>