<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0108"> <Title>A Comparison of Two Different Approaches to Morphological Analysis of Dutch</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> For many NLP and speech processing tasks, an extensive and rich lexical database is essential.</Paragraph> <Paragraph position="1"> Even a simple word list can often constitute an invaluable information source. One of the most challenging problems with lexicons is the issue of out-of-vocabulary words. Especially for languages with a rich morphology, such as German and Dutch, it is often infeasible to build a lexicon that covers a sufficient number of items. We can, however, go a long way toward resolving this issue by accounting for novel productions through the use of a limited lexicon and a morphological system.</Paragraph> <Paragraph position="2"> This paper describes two systems for morphological analysis of Dutch. They are conceived as part of a morpho-syntactic language model for inclusion in a modular speech recognition engine being developed in the context of the FLaVoR project (Demuynck et al., 2003). The FLaVoR project investigates the feasibility of using powerful linguistic information in the recognition process. It is generally acknowledged that more accurate linguistic knowledge sources improve speech recognition accuracy, but they are only rarely incorporated into the recognition process (Rosenfeld, 2000). 
This is because the architecture of most current speech recognition systems requires all knowledge sources to be compiled into the recognition process at run time, making it virtually impossible to include extensive language models in the process.</Paragraph> <Paragraph position="3"> The FLaVoR project tries to overcome this restriction by using a more flexible architecture in which the search engine is split into two layers: an acoustic-phonemic decoding layer and a word decoding layer. The reduction in data flow performed by the first layer allows for more complex linguistic information in the word decoding layer. Both morpho-phonological and morpho-syntactic modules function in the word decoding process. Here we focus on the morpho-syntactic model which, apart from assigning a probability to word strings, provides (scored) morphological analyses of word candidates. This morphological analysis can help overcome the previously mentioned problem of out-of-vocabulary words, as well as enhance the granularity of the speech recognizer's language model.</Paragraph> <Paragraph position="4"> Successful experiments on introducing morphology into a speech recognition system have recently been reported for the morphologically rich languages Finnish (Siivola et al., 2003) and Hungarian (Szarvas and Furui, 2003), so that significant advances can be expected for FLaVoR's target language, Dutch, as well. 
But as the modular nature of the FLaVoR architecture requires the modules to function as stand-alone systems, we are also able to evaluate and compare the modules more generally as morphological analyzers in their own right, which can be used in a wide range of natural language applications such as information retrieval or spell checking.</Paragraph> <Paragraph position="5"> In this paper, we describe and evaluate these two independently developed systems for morphological analysis: one system uses a machine learning approach for morphological analysis, while the other system employs finite state techniques. After looking at some of the issues when dealing with Dutch morphology in section 2, we discuss the architecture of the machine learning approach in section 3, followed by the finite state method in section 4. We discuss and compare the results in section 5, after which we draw conclusions.</Paragraph> </Section> </Paper>