<?xml version="1.0" standalone="yes"?> <Paper uid="P06-4014"> <Title>Re-Usable Tools for Precision Machine Translation</Title> <Section position="5" start_page="54" end_page="54" type="metho"> <SectionTitle> 3 Stochastic Components </SectionTitle> <Paragraph position="0"> To deal with competing hypotheses at all processing levels, LOGON incorporates various stochastic processes for disambiguation. In the following, we present the ones that are best developed to date.</Paragraph> <Paragraph position="1"> Training Material A corpus of some 50,000 words of edited, running Norwegian text was gathered and translated by three professional translators. Three quarters of the material are available for system development and also serve as training data for machine learning approaches. Using the discriminant-based Redwoods approach to treebanking (Oepen, Flickinger, Toutanova, & Manning, 2004), a first 5,000 English reference translations were hand-annotated and released to the public (see the LinGO Redwoods treebank in its latest release, dubbed Norwegian Growth). In on-going work on adapting the Redwoods approach to (Norwegian) LFG, we are working to treebank a sizable text segment (Rosén, Smedt, Dyvik, & Meurer, 2005; Oepen & Lønning, 2006).</Paragraph> <Paragraph position="2"> Parse Selection The XLE analyzer includes support for stochastic parse selection models, assigning likelihood measures to competing analyses (Riezler et al., 2002).</Paragraph> <Paragraph position="3"> Using a trial LFG treebank for Norwegian (of fewer than 100 annotated sentences), we have adapted the tools for the current LOGON version and are now working to train on larger data sets and evaluate parse selection performance. Despite the very limited amount of training so far, the model already appears to pick up on plausible, albeit crude, preferences (as regards topicalization, for example). 
Furthermore, to reduce fan-out in exhaustive processing, we collapse analyses that project equivalent MRSs, i.e. those that differ only in syntactic distinctions made in the grammar but not reflected in the semantics.</Paragraph> <Paragraph position="4"> Realization Ranking At an average of more [...] of the LOGON pipeline. Based on a notion of automatically derived symmetric treebanks, we have trained comprehensive discriminative, log-linear models that (within the LOGON domain) achieve up to 75 per cent exact match accuracy in picking the most likely realization among competing outputs (Velldal & Oepen, 2005). The best-performing models make use of configurational (in terms of tree topology) as well as of string-level properties (including local word order and constituent weight), both with varied domains of locality. In total, there are around 300,000 features with non-trivial distribution, and we combine the MaxEnt model with a traditional language model trained on a much larger corpus (the BNC). Used in isolation, however, the latter, more standard approach to realization ranking achieves only around 50 per cent accuracy.</Paragraph> </Section> <Section position="6" start_page="54" end_page="55" type="metho"> <SectionTitle> 4 Implementation </SectionTitle> <Paragraph position="0"> Figure 2 presents the main components of the LOGON prototype, where all component communication is in terms of sets of MRSs and, thus, can easily be managed in a distributed and (potentially) parallel client-server set-up. Both the analysis and generation grammars 'publish' their interface to transfer, i.e. the inventory and synopsis of semantic predicates, in the form of a Semantic Interface specification ('SEM-I'; Flickinger, Lønning, Dyvik, Oepen, & Bond, 2005), such that transfer can operate without knowledge about grammar internals. 
In practical terms, SEM-Is are an important development tool (facilitating well-formedness testing of interface representations at all levels), but they also have an interesting theoretical status with regard to transfer. The SEM-Is for the Norwegian analysis and English generation grammars, respectively, provide an exhaustive enumeration of legitimate semantic predicates (i.e. the transfer vocabulary) and 'terms of use', i.e. for each predicate its set of appropriate roles, corresponding value constraints, and indication of (semantic) optionality of roles. Furthermore, the SEM-I provides generalizations over classes of predicates (e.g. hierarchical relations like those depicted in Figure 3 below) that play an important role in the organization of MRS transfer rules.</Paragraph> </Section> </Paper>