<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0403"> <Title>What is at stake: a case study of Russian expressions starting with a preposition</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Computational research on multiword expressions (MWEs) has mostly addressed the topic for English (Sag et al., 2001). Some research has dealt with other languages, such as French (Michiels and Dufour, 1998) or Chinese (Zhang et al., 2000), but there has been no computationally tractable research on the topic for Russian. What is more, the study of MWEs in English has been mostly devoted to the description of nominal groups or light verbs, e.g. (Calzolari et al., 2002), (Sag et al., 2001), while constructions starting with a preposition, such as in line, at large, have not been the focus of attention.</Paragraph> <Paragraph position="1"> Even though the tradition of studying Russian idiomatic expressions has resulted in many descriptions of Russian idioms and phraseological dictionaries, like (Dobrovol'skij, 2000) or (Fedorov, 1995), the studies and dictionaries often concentrate on non-decomposable colourful expressions of the 'kick-the-bucket' type, such as byt' bez carja v golove ('to have a screw loose', lit. 'to be without a tsar in one's head'), and pay no attention to their frequency. However, many expressions of this sort are relatively rare in the modern language. For example, there is not a single instance of bez carja v golove in the corpus we used. At the same time, existing Russian dictionaries of idioms often miss more frequent constructions, which are important both for translation studies and for the development of NLP applications.
The task of the current study is defined by the ongoing development of the Russian Reference Corpus (Sharoff, 2004), a general-purpose corpus of Russian that is comparable to the British National Corpus (BNC) in its size and coverage.</Paragraph> <Paragraph position="2"> The goal of the study was to identify the list of statistically important MWEs in the corpus and to use them to reduce the ambiguity in corpus analysis. Existing research on the detection of MWEs can be positioned between two extremes: linguistic and statistical. The former approaches assume syntactic parsing of source texts (sometimes shallow, sometimes deep, to identify the semantic roles of MWE components) and the ability to get information from a thesaurus. Detection results can be further improved by deep semantic analysis of source texts (Piao et al., 2003). Applying such techniques to a Russian corpus of the size of the BNC requires accurate and robust parsing tools, which do not exist for Russian.</Paragraph> <!-- Page footer: Second ACL Workshop on Multiword Expressions: Integrating Processing, July 2004, pp. 17-23 --> <Paragraph position="3"> Also, no electronic thesaurus, such as WordNet (Miller, 1990), is available for Russian. Purely statistical approaches treat multiword expressions as a bag of words and pay no attention to the possibility of variation in the inventory and order of MWE components. Given that the word order in Russian (and other Slavonic languages) is relatively free and a typical word (i.e. lemma) has many forms (typically from 9 for nouns to 50 for verbs), the sequences of exact N-grams are much less frequent than in English, thus rendering purely statistical approaches useless.</Paragraph> <Paragraph position="4"> This paper discusses a hybrid approach to the identification of a specific type of MWE in Russian, namely constructions starting with prepositional phrases, with the emphasis on those that are frequent in the corpus.
The study is also aimed at a specific task, namely disambiguating the morphological properties and syntactic functions of these constructions in a corpus. The approach assumes the development of a list of MWEs supported by computational tools, including the calculation of standard statistical measures and shallow parsing of prepositional phrases. In addition, the study is distinguished by its goal of extracting MWEs from the core lexicon on the basis of a general-purpose corpus, whereas many other MWE detection studies have concerned the extraction of technical terms specific to a particular domain.</Paragraph> </Section></Paper>