<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0802">
  <Title>A Modern Computational Linguistics Course Using Dutch</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> This paper describes a set of exercises in computational linguistics. The material was primarily developed for two courses: a general introduction to computational linguistics, and a more advanced course focusing on natural language interfaces. Students who enter the first course have a background in either humanities computing or cognitive science. This implies that they possess some general programming skills and that they have at least some knowledge of general linguistics. Furthermore, all students entering the course are familiar with logic programming and Prolog.</Paragraph>
    <Paragraph position="1"> The native language of practically all students is Dutch.</Paragraph>
    <Paragraph position="2"> The aim of the introductory course is to provide an overview of language technology applications and of the concepts and techniques used to develop such applications, and to let students gain practical experience in developing (components of) these applications. The second course focuses on computational semantics and the construction of natural language interfaces using computational grammars. Course material for computational linguistics exists primarily in the form of text books, such as Allen (1987), Gazdar and Mellish (1989) and Covington (1994). They focus primarily on basic concepts and techniques (finite state automata, definite clause grammar, parsing algorithms, construction of semantic representations, etc.) and the implementation of toy systems for experimenting with these techniques. If course-ware is provided, it consists of the code and grammar fragments discussed in the text-material. The language used for illustration is primarily English.</Paragraph>
    <Paragraph position="3"> While attention to basic concepts and techniques is indispensable for any course in this field, one may wonder whether implementation issues need to be as prominent as they are in the text-books of, say, Gazdar and Mellish (1989) and Covington (1994). Developing natural language applications from scratch may lead to maximal control and understanding, but is also time-consuming, requires good programming skills rather than insight into natural language phenomena, and, in tutorial settings, is restricted to toy-systems. These are disadvantages for an introductory course in particular. In such a course, an attractive alternative is to skip most of the implementation issues, and focus instead on what can be achieved if one has the right tools and data available. The advantage is that the emphasis will shift naturally to a situation where students must concentrate primarily on developing accounts for linguistic data, on exploring data available in the form of corpora or word-lists, and on using real high-level tools. Consequently, it becomes feasible to consider not only toy-systems and toy-fragments, but to develop more or less realistic components of natural language applications. As the target language of the course is Dutch, this also implies that at least some attention has to be paid to specific properties of Dutch grammar, and to (electronic) linguistic resources for Dutch.</Paragraph>
    <Paragraph position="4"> Since students nowadays have access to powerful hardware and both tools and data can be distributed easily over the internet, there are no real practical obstacles.</Paragraph>
    <Paragraph position="5"> Text-books which are concerned primarily with computational semantics and natural language interfaces, such as Pereira and Shieber (1987) and Blackburn and Bos (1998), tend to introduce a toy-domain, such as a geography database or an excerpt of a movie-script, as application area. In trying to develop exercises which are closer to real applications, we have explored the possibilities of using web-accessible databases as a back-end for a natural language interface program.</Paragraph>
    <Paragraph position="6"> More specifically, we hope to achieve the following: * Students learn to use high-level tools. The development of a component for morphological analysis requires far more than what can be achieved by specifying and implementing the underlying finite state automata directly.</Paragraph>
    <Paragraph position="7"> Rather, abstract descriptions of morphological rules should be possible, and software should be provided to support development and debugging. Similarly, while a programming language such as Prolog offers possibilities for relatively high-level descriptions of natural language grammars, the advantages of specialised languages for implementing unification-based grammars and accompanying tools are obvious. Furthermore, the availability of graphical interfaces and visualisation in tutorial situations is a bonus which should not be underestimated.</Paragraph>
    <Paragraph position="8"> * Students learn to work with real data. In developing practical, robust, wide-coverage language technology applications, researchers have found that the use of corpora and electronic dictionaries is absolutely indispensable. Students should gain at least some familiarity with such sources, learn how to search large datasets, and how to deal with exceptions, errors, or unclear cases in real data.</Paragraph>
    <Paragraph position="9"> * Students become familiar with quantitative evaluation methods. One advantage of developing components using real data is that one can use the evaluation metrics dominant in most current computational linguistics research. That is, an implementation of a hyphenation rule or a grammar for temporal expressions can be tested by measuring its accuracy on a list of unseen words or utterances.</Paragraph>
    <Paragraph position="10"> This provides insight into the difficulty of solving similar problems in a robust fashion for unrestricted text.</Paragraph>
    <Paragraph position="11"> * Students develop language technology components for Dutch. In teaching computational linguistics to students whose native language is not English, it is common practice to focus primarily on the question of how the (English) examples in the text book can be carried over to a grammar for one's own language. As this may take considerable time and effort, more advanced topics are usually skipped. In a course which aims primarily at Dutch, and which also contains material describing some of the peculiarities of this language (hyphenation rules, spelling rules relevant to morphology, word order in main and subordinate clauses, verb clusters), there is room for developing more elaborate and extended components.</Paragraph>
    <Paragraph position="12"> * Students develop realistic applications. The use of tools and real data makes it easier to develop components which are robust and which have relatively good coverage. Applications in the area of computational semantics can be made more interesting by exploiting the possibilities offered by the internet. The growing amount of information available on the internet provides opportunities for accessing much larger databases (such as public transport time-tables or library catalogues), and therefore, for developing more realistic applications.</Paragraph>
    <Paragraph position="13"> The sections below are primarily concerned with a number of exercises we have developed to achieve the goals mentioned above. An accompanying text is under development. 1</Paragraph>
  </Section>
</Paper>