<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1055"> <Title>Toward a Scoring Function for Quality-Driven Machine Translation</Title> <Section position="1" start_page="0" end_page="378" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> We describe how we constructed an automatic scoring function for machine translation quality; this function makes use of arbitrarily many pieces of natural language processing software that have been designed to process English-language text. By machine-learning values of functions available inside the software and by constructing functions that yield values based upon the software output, we are able to achieve preliminary, positive results in machine-learning the difference between human-produced English and machine-translation English. We suggest how the scoring function may be used for MT system development.</Paragraph> <Paragraph position="1"> Introduction to the MT Plateau We believe it is fair to say that the field of machine translation has been on a plateau for at least the past decade.2 Traditional, hand-built MT systems held up very well in the ARPA MT evaluation (White and O'Connell 1994).</Paragraph> <Paragraph position="2"> These systems are relatively expensive to build and generally require a trained staff working for several years to produce a mature system.</Paragraph> <Paragraph position="3"> This is the current commercial state of the art: hand-building specialized lexicons and translation rules. A completely different type of system was competitive in this evaluation, namely, the purely statistical CANDIDE system built at IBM. It was generally felt that this system had also reached a plateau, in that more data and more training were not likely to improve the quality of the output.</Paragraph> <Paragraph position="4"> Low Density Machine Translation However, in the case of &quot;Low Density Machine Translation&quot; (see Nirenburg and Raskin 1998, Jones and Havrilla 1998), commercial market forces are not likely to provide significant incentives for machine translation systems for Low Density (Non-Major) languages any time soon. Two noteworthy efforts to break past the data and labor bottlenecks for high-quality machine translation development are the following. [Footnote 1: Douglas Jones is now at the National Institute of Standards & Technology, Gaithersburg, MD 20899, Douglas.Jones@NIST.gov.] [Footnote 2: A sensible, plateau-friendly strategy may be to accumulate translation memory to improve both the long-term efficiency of human translators and the quality of machine translation systems. If we imagine that the plateau is really a kind of logarithmic function tending ever upwards, we need only be patient.]</Paragraph> <Paragraph position="5"> The NSF Summer Workshop on Statistical Machine Translation, held at Johns Hopkins University in summer 1999, developed a public-domain version intended as a platform for further development of a CANDIDE-style MT system. Part of the goal here is to improve the translation by adding levels of linguistic analysis beyond the word N-gram. 
An effort addressing the labor bottleneck is the Expedition Project at New Mexico State University, where a preliminary elicitation environment for a computational field linguistics system has been developed (the Boas interface; see Nirenburg and Raskin 1998). A Scoring Function for MT Quality Our contribution toward working beyond this plateau is to look for a way to define a scoring function for the quality of the English output such that we can use it to machine-learn a good translation grammar. The novelty of our idea for this function is that we do not have to define its internals ourselves per se. We are able to define a successful function for two reasons.</Paragraph> <Paragraph position="6"> First, there is a growing body of software worldwide that has been designed to consume English; all we need is for each piece of software to provide a metric as to how English-like its input is. Second, we can tell whether the software had trouble with the input, either by system-internal diagnosis or by diagnosing the software's output. A good illustration is the facility in current word-processing software to put red squiggly lines underneath text it thinks should be revised. We know from experience that this feature is often only annoying.</Paragraph> <Paragraph position="7"> Nevertheless, imagine that it is correct some percentage of the time, and that each piece of software we use for this purpose is correct some percentage of the time. Our strategy is to extract or create numeric values from each piece of software that correspond to the degree to which the software was happy with the input.</Paragraph> <Paragraph position="8"> That array of numbers is the heart of our scoring function for Englishness: we are calling these numeric values &quot;indicators&quot; of Englishness. We then use that array of indicators to drive the machine translation development. In this paper we will report on how we have constructed a prototype of this function; in separate work we discuss how to insert this function into a machine-learning regimen designed to maximize the overall quality of the machine translation output.</Paragraph> <Paragraph position="9"> A Reverse Turing Test People can generally tell the difference between human-produced English and machine-translation English, assuming all the obvious constraints, such as that the reader and writer have command of the language. Whether or not a machine can tell the difference depends, of course, on how good the MT system is. Can we get a machine to tell the difference? If the MT system were perfect, neither we nor the machines ought to be able to distinguish the two. MT quality being what it is, that is not a problem for us now. An essential first step toward QDMT is what we are calling a &quot;Reverse Turing Test&quot;. In the ordinary Turing Test, we want to fool a person into thinking the machine is a person. Here, we are turning that on its head. We want to define a function that can tell the difference between English that a human being has produced and English that the machine has produced.3 To construct the test, we use a bilingual parallel aligned corpus: we take the foreign-language side and send that through the MT system; then we see if we can define a scoring function that can distinguish the two versions (original English and MT English).</Paragraph>
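As a rough illustration of this construction, the sketch below collects indicator values from a parallel corpus and labels the resulting vectors for learning. It is a minimal sketch under stated assumptions: the indicator functions and the translate step are invented placeholders, not the tools or the MT system used in the paper.

```python
# Illustrative sketch of the "Reverse Turing Test" data construction described above.
# The indicator functions and the MT step are placeholders, not the paper's tools.
from typing import Callable, List, Tuple

def words_per_sentence(text: str) -> float:
    """Toy indicator: sentence length in words."""
    return float(len(text.split()))

def mean_word_length(text: str) -> float:
    """Toy indicator: average word length in characters."""
    tokens = text.split()
    return sum(len(t) for t in tokens) / max(len(tokens), 1)

# An "indicator suite" is simply an ordered list of functions from text to a number.
INDICATORS: List[Callable[[str], float]] = [words_per_sentence, mean_word_length]

def indicator_vector(sentence: str) -> List[float]:
    """Map one English sentence to its array of indicator values."""
    return [f(sentence) for f in INDICATORS]

def build_training_data(parallel_corpus, translate) -> Tuple[list, list]:
    """parallel_corpus: iterable of (foreign, original_english) pairs.
    translate: a placeholder MT function, foreign -> English.
    Returns (vectors, labels), with label 1 = human English, 0 = MT English."""
    vectors, labels = [], []
    for foreign, human_english in parallel_corpus:
        mt_english = translate(foreign)
        vectors.append(indicator_vector(human_english)); labels.append(1)
        vectors.append(indicator_vector(mt_english)); labels.append(0)
    return vectors, labels

# Toy usage with a trivial stand-in "MT system" that just echoes its input:
corpus = [("<foreign sentence>", "She was radiant with happiness.")]
X, y = build_training_data(corpus, translate=lambda f: f)
print(X, y)
```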
<Paragraph position="10"> With our current indicators and corpus, we can machine-learn a function that behaves as follows: if you hand it a human sentence, it correctly classifies it as human 74% of the time; if you hand it a machine sentence, it correctly classifies it as a machine sentence 57% of the time. In the remainder of the paper, we will step through the details of the experiment; we will also discuss why we neither expect nor require 100% accuracy for this function. [Footnote 3: Obviously the end goal here is to fail this Reverse Turing Test for a &quot;perfect&quot; machine translation system. We are very far away from this, but we would like to use this function to drive the process toward that eventual and fortunate failure.] Our boundary tests behave as expected and are shown in the final section: we use the same test to distinguish between English and (a) English word salad, (b) English alphabet soup, (c) Japanese, and (d) the identity case of more human-produced English.</Paragraph> <Section position="1" start_page="376" end_page="377" type="sub_section"> <SectionTitle> Case Study: Japanese-English </SectionTitle> <Paragraph position="0"> In this paper, we report on results using a small corpus of 2,340 sentences drawn from the Kenkyusha New Japanese-English Dictionary.</Paragraph> <Paragraph position="1"> It was important in this particular experiment to use a very clean corpus (perfectly aligned and minimally formatted). This case study is situated in a broader context: we have conducted exploratory experiments on samples from several corpora, for example the ARPA MT Evaluation corpus, samples from the European Corpus Initiative Data corpus (ECI-I), and others. Since we found that the scoring function was quite sensitive to formatting problems (for example, the presence of tables and sentence-segmentation errors causes problems), we are examining a small corpus that is free from these issues. The sentences are on average relatively short (7.0 words per sentence; 37.6 characters per sentence); this makes our task both easier and harder. It is easier because we have overcome the formatting problems. It is harder because the MT system is able to perform much better on the shorter, cleaner sentences than it does on longer sentences with formatting problems. Since the output is better, it is more difficult to define a function that can tell the difference between the original English and the machine-translation English. On balance, this corpus is a good one to illustrate our technique.</Paragraph> <Paragraph position="2"> [Figure 1. Subjective Quality Ranking: three Japanese example sentences with their human English translations and the corresponding MT output, e.g. human &quot;She was radiant with happiness&quot; versus MT &quot;she had shone happily&quot;.]</Paragraph> <Paragraph position="3"> Figure 1 shows a range of output quality. (1) is the worst: it is obviously MT output. For us this output is only partially intelligible. (2) is not so bad, but it is still not perfect English. But (3) is nearly perfect. We want to design a system that can tell the difference. We will now walk through our suite of indicators; the goal is to get the machine to see what we see in terms of quality.</Paragraph>
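To make the kind of per-class accuracy quoted above concrete, here is a minimal sketch of a memory-based (1-nearest-neighbour) classifier over indicator vectors that reports accuracy separately for human and machine sentences. The vectors, labels, and train/test split are invented for illustration; this is not the learner or data actually used in the experiment.

```python
# Minimal memory-based (1-nearest-neighbour) classifier over indicator vectors,
# with per-class accuracy of the kind reported above. All data here are invented.
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nn_classify(train_vectors, train_labels, query):
    """Return the label of the closest stored training vector (memory-based learning)."""
    best = min(range(len(train_vectors)),
               key=lambda i: distance(train_vectors[i], query))
    return train_labels[best]

def per_class_accuracy(train, train_labels, test, test_labels):
    """Accuracy computed separately for human (label 1) and machine (label 0) sentences."""
    correct = {0: 0, 1: 0}
    total = {0: 0, 1: 0}
    for vec, gold in zip(test, test_labels):
        total[gold] += 1
        if nn_classify(train, train_labels, vec) == gold:
            correct[gold] += 1
    return {label: correct[label] / total[label] for label in total if total[label]}

# Toy usage: two-dimensional indicator vectors, label 1 = human, 0 = MT.
train = [[7.0, 4.2], [6.0, 4.0], [12.0, 3.1], [11.0, 3.3]]
train_labels = [1, 1, 0, 0]
test = [[6.5, 4.1], [11.5, 3.2]]
test_labels = [1, 0]
print(per_class_accuracy(train, train_labels, test, test_labels))
```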
<Paragraph position="4"> Suite of Indicators We have defined a suite of functions that operate at various levels of linguistic analysis: syntactic, semantic, and phonological (orthographic). For each of these levels, we have integrated at least one tool for which we construct an indicator function. The task is to use these indicators to generate an array of values which we can use to capture the subjective quality we see when we read the sentences. We will step through these indicator functions one by one. In some cases, in order to get numbers, we take what amounts to debugging information from the tool (many of the tools have very nice APIs that give access to a variety of information about how they processed the input). In other cases, we define a function that yields an output based on the output of the tool (for example, we defined a function that indicated the degree to which a parse tree was balanced; it turned out that a balanced tree was a negative indicator of Englishness, probably because English is right-branching).</Paragraph> </Section> <Section position="2" start_page="377" end_page="377" type="sub_section"> <SectionTitle> Syntactic Indicators </SectionTitle> <Paragraph position="0"> Two sources of local syntactic information are (a) parse trees and (b) N-grams. Within the parsers, we looked at internal processing information as well as output structures. For example, we measured the probability of a parse and the number of edges in the parse from the Collins parser. The Apple Pie Parser provided various weights, which we used. The Appendix lists all of the indicator functions that we used. N-Gram Language Model (Cross-Perplexity) An easy number to calculate is the cross-perplexity of a given text, as calculated using an N-gram language model.4 [Footnote 4: We used the Cambridge/CMU language modeling toolkit, trained on the Wall Street Journal (4/1990 through 3/1992); LM parameters: n=4, Good-Turing discounting.] Notice that the subjective order is mirrored by the cross-perplexity scores in Figure 2.</Paragraph> </Section> <Section position="3" start_page="377" end_page="378" type="sub_section"> <SectionTitle> Collins Parser </SectionTitle> <Paragraph position="0"> The task here is to write functions that process the parse trees and return a number. We have experimented with more elaborate functions that indicate how balanced the parse tree is, and less complicated functions such as the level of embedding, the number of parentheses, and so on.</Paragraph> <Paragraph position="1"> Interestingly, the number of parentheses in the parse was a helpful indicator in conjunction with other indicators.</Paragraph> <Paragraph position="2"> Indicators of Semantic Cohesiveness For the semantic indicators, we want some indication as to how much the words in a text are related to each other by virtue of their meaning. Which words belong together, regardless of exactly how they are used in the sentence? Two resources we have begun to integrate for this purpose are WordNet and the Trigger Toolkit (measuring mutual information). The overall experimental design is roughly the same in both cases. Our method was to remove stop words, lemmatize the text, and then take a measurement of pairwise semantic cohesiveness of the lemmatized words.5 For WordNet, we are counting how many ways two words are related by the hyponymy relation (future indicators will be more sophisticated). For the Trigger Toolkit, we weighted the connections (by mutual information).</Paragraph>
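A minimal sketch of the WordNet-based cohesiveness measurement described here, assuming NLTK's WordNet interface as a convenient stand-in (the paper does not specify an interface): it counts hyponymy (hypernym/hyponym) links between pairs of already-lemmatized content words and averages the count over all pairs.

```python
# Sketch of a pairwise semantic-cohesiveness indicator in the spirit described above:
# after stop-word removal and lemmatization, count how often two content words stand
# in a hyponymy (hypernym/hyponym) relation in WordNet. NLTK is an assumption here,
# not the paper's setup. Requires: pip install nltk; nltk.download('wordnet')
from itertools import combinations
from nltk.corpus import wordnet as wn

def hyponymy_links(word1: str, word2: str) -> int:
    """Count synset pairs where one word's synset lies on the other's hypernym chain."""
    count = 0
    for s1 in wn.synsets(word1):
        ancestors1 = set(s1.closure(lambda s: s.hypernyms()))
        for s2 in wn.synsets(word2):
            ancestors2 = set(s2.closure(lambda s: s.hypernyms()))
            if s1 in ancestors2 or s2 in ancestors1:
                count += 1
    return count

def cohesiveness(content_lemmas) -> float:
    """Average hyponymy-link count over all pairs of lemmatized content words."""
    pairs = list(combinations(content_lemmas, 2))
    if not pairs:
        return 0.0
    return sum(hyponymy_links(a, b) for a, b in pairs) / len(pairs)

# Toy usage on already-lemmatized content words:
print(cohesiveness(["child", "father", "arm"]))
```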
<Paragraph position="3"> Orthographic We had two motivations for an orthographic level. One was methodological (we wanted to look at each of the traditional levels of linguistic analysis). The other was driven by the data: the MT system may leave untranslatable words alone, or transliterate them, or insert a dummy symbol, such as &quot;X&quot;. These clues were adequate to give us appropriate hints as to whether the text was produced by human or by machine. But some of our tools missed these clues because of how they were designed.</Paragraph> <Paragraph position="4"> Robust parsers often treat unknown words as nouns; so if we got an untranslated term or an &quot;X&quot;, the parser simply treats it as a noun. Five X's in a row might be a noun phrase followed by a verb. Smoothed N-gram models of words usually treat any string of letters as a possible word.</Paragraph> <Paragraph position="5"> Because the parsers and N-gram models were designed to be very robust, they are not necessarily sensitive to these obvious clues. In order to get at these hints, we built a character-based N-gram model of English. Although these indicators were not very informative on their own for distinguishing human from machine English, they boosted performance in conjunction with the syntactic and semantic indicators.</Paragraph> <Paragraph position="6"> Combined Indicators Let's come back to the three sentences from Figure 1: we want to match the subjective ranking of the sentences with appropriate indicator values. In other words, we want the machine to be able to see differences which a human might see.</Paragraph> <Paragraph position="7"> For these three examples, some scores correlate well with our subjective ranking of Englishness (e.g. cross-perplexity, Edges). However, the other scores on their own only partially correlate. The expectation is that an indicator on its own will not be sufficient to score the Englishness. It is the combined effect of all indicators which ultimately decides the Englishness. [Footnote 6: We found that we could often guess the &quot;default&quot; behavior that a parser used, and we have begun to design indicators that can tell when a parser has defaulted to these.]</Paragraph> <Paragraph position="8"> Now we have enough raw data to begin machine-learning a way to distinguish these kinds of sentences.</Paragraph> </Section> <Section position="4" start_page="378" end_page="378" type="sub_section"> <SectionTitle> Simple Machine Learning Regimen </SectionTitle> <Paragraph position="0"> We have started out with very simple memory-based machine learning techniques. Since we are defining a range of functions, we wanted to keep things relatively simple for debugging and didactic purposes.</Paragraph> </Section> </Section> </Paper>