<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1042">
<Title>Is That Your Final Answer?</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle>
1. INTRODUCTION
</SectionTitle>
<Paragraph position="0"> Machine translation evaluation and language learner evaluation have been associated for many years, for example [5, 7]. One attractive aspect of language learner evaluation that recommends it to machine translation evaluation is the expectation that the produced language is not perfect, well-formed language. Language learner evaluation systems are geared towards determining the specific kinds of errors that language learners make. Additionally, language learner evaluation, more than many MT evaluations, seeks to build models of language acquisition that could parallel (but not correspond directly to) the development of MT systems. These models are frequently feature-based and may provide informative metrics for diagnostic evaluation by system designers and users.</Paragraph>
<Paragraph position="1"> In a recent experiment along these lines, Jones and Rusk [2] present a reasonable idea for measuring intelligibility: scoring the English output of translation systems with a wide variety of metrics. In essence, they look at the degree to which a given output is English and compare this to human-produced English. Their goal was to find a scoring function for the quality of English that could enable the learning of a good translation grammar. Their method applies existing natural language processing applications to the translated data and uses their outputs to compute a numeric value indicating the degree of &quot;Englishness&quot;. The measures they utilized included syntactic indicators such as word n-grams, the number of edges in the parse (both the Collins and Apple Pie parsers were used), the log probability of the parse, the execution of the parse, the overall score of the parse, and so on. Semantic criteria were based primarily on WordNet and incorporated the average minimum hyponym path length, the path-found ratio, and the percentage of words with a sense in WordNet. Other semantic criteria utilized mutual information measures.</Paragraph>
<Paragraph position="2"> Two problems can be found with their approach. The first is that the data was drawn from dictionaries. Usage examples in dictionaries, while informative, are not necessarily representative of typical language use. In fact, they tend to highlight unusual usage patterns or cases. The second, and more relevant to our purposes, is that they were looking at the glass as half full rather than half empty. We believe that our results will show that measuring intelligibility is not nearly as useful as finding a lack of intelligibility. This is not new in MT evaluation; numerous approaches have been suggested for identifying translation errors, such as [1, 6]. In this instance, however, we are not counting errors to produce an intelligibility score so much as determining how quickly intelligibility can be measured. Additionally, we look to a field in which the essence of scoring is the examination of error cases: language learning.</Paragraph>
</Section>
</Paper>
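To make the WordNet-based semantic criteria summarized above more concrete, the following is a minimal Python sketch (using NLTK's WordNet interface) of two such features and a toy combined score. The function names, the tokenization, and the simple additive combination are illustrative assumptions, not the implementation of Jones and Rusk [2].

```python
# Illustrative sketch only: two WordNet-based features in the spirit of the
# semantic criteria described above (percentage of words with a WordNet sense,
# and an average shallowest synset depth used here as a rough stand-in for the
# minimum hyponym path length). Names, tokenization, and the additive
# combination are assumptions for this example.
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')


def percent_words_with_sense(tokens):
    """Fraction of tokens that have at least one WordNet synset."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if wn.synsets(t)) / len(tokens)


def avg_min_path_depth(tokens):
    """Average, over covered tokens, of the shallowest synset depth."""
    depths = [min(s.min_depth() for s in wn.synsets(t))
              for t in tokens if wn.synsets(t)]
    return sum(depths) / len(depths) if depths else 0.0


def englishness(sentence):
    """Toy combination of the two features into one 'Englishness' number."""
    tokens = [w.lower() for w in sentence.split() if w.isalpha()]
    return percent_words_with_sense(tokens) + avg_min_path_depth(tokens)


print(englishness("The cat sat on the mat"))
print(englishness("Mat the on sat cat the"))  # same score: lexical features ignore word order
```

As the second call illustrates, purely lexical features of this kind are insensitive to word order, which is one reason syntactic indicators such as n-grams and parse-based scores are used alongside them.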