<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0403">
  <Title>What is at stake: a case study of Russian expressions starting with a preposition</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> The automatic procedure detected 4384 candidate expressions, out of which we selected 720 MWEs.</Paragraph>
    <Paragraph position="1"> The summary of prepositions and the number of their patterns identified in the study is given in Table 2. We expected that more frequent prepositions would participate in a larger number of MWEs.</Paragraph>
    <Paragraph position="2"> However, the situation is more complex. Some prepositions like u or iz occur almost exclusively in fully compositional patterns, for example, expressing location: u okna, morja (by the window, by the sea), or possession: u menja, u Ivana (I have, Ivan has). Other prepositions that are less frequent regularly produce non-compositional patterns, e.g. pod rukami ('at hand', which expresses the specific meaning of availability, not literally 'under hands'), pod konec ('at the end').</Paragraph>
    <Paragraph position="3"> The results retained in the database include well-formed prepositional phrases that function as proper idioms, as well as syntactic constructions that can take a noun or another nominal group on their right, such as v techenie ('in the course of'), which is a PP in its own right, or an incomplete combination of a preposition and an adjective such as dlja puschij ('for greater'). The latter is part of an open list of well-formed PPs, as in dlja puschej vazhnosti ('for greater importance'), soxrannosti (safety), ostrastki (frightening), but the word puschij itself occurs only in this construction. In other cases, the 'noun' from the nominal group does not even exist in the contemporary language, as in bez umolku ([to talk] without a pause), so the expression cannot be analysed correctly without knowing that it is an MWE.</Paragraph>
    <Paragraph position="4"> The resulting list also includes multiword expressions with a slightly different structure, in cases where an MWE naturally extends to the left of the preposition to form a larger pattern. One example is sudja po vsemu ('to all appearances', lit. 'judging over all'), which is an extension of the prepositional phrase po vsemu: it is by far the most frequent pattern, with 1626 instances in the corpus, while the next most frequent left neighbour, razbrosat' po vsemu ('scatter all over', followed by a spatial location), has only 34 instances. Also, the sequence of words po vsemu is ambiguous, e.g.</Paragraph>
    <Paragraph position="5"> it can be a part of larger PPs, such as po vsemu gorodu, domu, zalu (over the whole city, house, hall), so from the viewpoint of automatic detection the MWE sudja po vsemu is more reliable.</Paragraph>
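    <Paragraph> The comparison of left neighbours described above can be sketched as a simple frequency count. This is a minimal illustration, not the procedure used in the study: the function name left_neighbours and the toy token list are assumptions made for the example, and the real counts (1626 and 34) come from the corpus, not from this code.

```python
from collections import Counter


def left_neighbours(tokens, phrase):
    """Count the word immediately to the left of each occurrence of phrase."""
    n = len(phrase)
    counts = Counter()
    # Start at 1 so that every match has a left neighbour.
    for i in range(1, len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            counts[tokens[i - 1]] += 1
    return counts


# Hypothetical toy corpus built from the examples quoted in the text.
tokens = "sudja po vsemu razbrosat' po vsemu gorodu sudja po vsemu domu".split()
counts = left_neighbours(tokens, ["po", "vsemu"])

# The dominant left neighbour is the candidate for an extended MWE.
for word, freq in counts.most_common():
    print(word, freq)
```

    A strongly skewed distribution, with one left neighbour far ahead of the rest, is the signal that the pattern extends to the left.</Paragraph>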
    <Paragraph position="6"> Another example of an extended pattern is a complex reflexive expression: drug druga ('each other', lit. 'friend friend-acc'), which is a multiword expression in its own right, because no meaning of friendship is explicitly communicated here, as in nenavidet' drug druga ('to hate each other', lit. 'to hate friend friend-acc'). Even though the original pattern did not cover this structure, the expression has been detected for almost all prepositions in the form of PREP+drug-ending, because the reflexive expression allows the insertion of any preposition between the two elements, e.g. drug k drugu ('to each other', lit. 'friend to friend'). Expressions of this sort resist automatic identification by means of a simple pattern such as those used for other MWEs in the study.</Paragraph>
    <Paragraph position="7"> It is well known that ambiguity is abundant in natural languages. As discussed above, many word forms in Russian allow several morphological analyses, and this applies to forms used in MWEs. Monolingual and bilingual dictionaries can also give an estimate of semantic ambiguity by counting the number of senses and translations available for a word, though this is only a lower bound, because the senses and translations offered in dictionaries do not typically cover the full variety of possible uses: depending on the context, a word can be translated in many more ways than a dictionary suggests.</Paragraph>
    <Paragraph position="8"> It was relatively straightforward to measure the reduction of morphological ambiguity: we can compare the number of morphological analyses before and after tagging of MWEs. The reduction of semantic ambiguity can be measured only indirectly, by comparing the number of senses detected in a monolingual dictionary and the number of translations in a bilingual dictionary against the same numbers after tagging of MWEs, because we can assume that each MWE has only one sense, given the 'one-sense-per-collocation' hypothesis. Even in cases when the hypothesis does not hold, as in the case of the reflexive MWE drug druga, which can be translated in many different ways depending on the main predicate in a clause, treating the two words as a single MWE rules out translating them separately as companion, friend, mate, pal, comrade, colleague, fellow, etc.</Paragraph>
    <Paragraph position="9"> Table 3 shows the level of the ambiguity in the original texts and the estimates for its reduction using the list of MWEs. The morphological analysis was performed using Mystem (Segalovich, 2003), a high-performance analyser which is also used in Yandex, a major Russian search engine.</Paragraph>
    <Paragraph position="10"> The results show that 41% of Russian word forms are ambiguous with respect to their morphological features with an average number of 4.6 analyses per ambiguous word (1.9 on average for all words).</Paragraph>
    <Paragraph position="11"> The estimation of semantic ambiguity is based on electronic copies of the monolingual Ozhegov dictionary (Ozhegov, 1988) and the Oxford Russian bilingual dictionary (ORD, 2000). The former has 37785 entries with 1.6 senses per entry on average, while the Russian-English part of the latter has 40303 entries with 1.9 translations per entry. The dictionaries were applied to simple tagging of the running text in the corpus, whereby every word listed in the dictionaries was tagged with the respective number of its senses and translations. The experiment also showed that each of the two dictionaries covers about 70% of the running text (non-covered words are typically proper names). Since more frequent words typically exhibit greater polysemy, the polysemy in the running text is higher than the per-entry averages: a word has about 4.4 senses on average according to (Ozhegov, 1988) and 11.7 translations according to (ORD, 2000). However, these counts are slightly misleading, because about half of the words in the corpus are not ambiguous. But if a word is ambiguous, it exhibits a much greater set of possible senses and translations: for instance, (ORD, 2000) lists the word big as having 35 translations in various contexts, so if the average ambiguity in the corpus is counted for ambiguous words only, it reaches 8.8 for senses and 23.3 for translations.</Paragraph>
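    <Paragraph> The dictionary-based tagging described above can be sketched as follows. This is a minimal illustration under stated assumptions: the sense counts and the token list are hypothetical toy data, not figures from either dictionary, and tokens are assumed to be already reduced to dictionary headwords.

```python
# Hypothetical sense counts per dictionary entry (toy data, not from the paper).
sense_counts = {"po": 5, "vsemu": 1, "drug": 4, "gorod": 1}

# Hypothetical running text; "Ivan" stands for a proper name missing from the dictionary.
tokens = ["po", "vsemu", "gorod", "drug", "po", "Ivan"]

# Tag every token with its number of senses, or None if the dictionary lacks it.
tagged = [sense_counts.get(token) for token in tokens]

# Coverage: the share of running tokens found in the dictionary.
covered = [count for count in tagged if count is not None]
coverage = len(covered) / len(tokens)

# Average number of senses over all covered tokens.
avg_all = sum(covered) / len(covered)

# Average over ambiguous tokens only; counts are positive integers,
# so a count other than 1 means at least two senses.
ambiguous = [count for count in covered if count != 1]
avg_ambiguous = sum(ambiguous) / len(ambiguous)

print(f"coverage: {coverage:.2f}")
print(f"average senses, all covered tokens: {avg_all:.2f}")
print(f"average senses, ambiguous tokens only: {avg_ambiguous:.2f}")
```

    Because the averages are taken over running tokens and not over dictionary entries, frequent polysemous words weigh more heavily, which is why token-level figures exceed the per-entry averages.</Paragraph>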
    <Paragraph position="12"> The results for morphological and semantic ambiguity are summarised in Table 3. After the application of the list of MWEs (they cover only about 2% of the total corpus size), the level of ambiguity for ambiguous lexical items goes down to 4.1 for morphological analysis, 8.4 for senses and 21.7 for translations. This gives a drop of about 11% for ambiguity in morphological analysis, 4% for ambiguity of senses and 7% for translations.</Paragraph>
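    <Paragraph> As an arithmetic check, the relative drops can be recomputed from the rounded averages quoted in the text. The names before, after and relative_drop are illustrative only. Since the inputs are rounded, the results can differ slightly from the reported percentages, which may have been computed from unrounded averages.

```python
# Average ambiguity for ambiguous items, as quoted in the text,
# before and after applying the list of MWEs.
before = {"morphology": 4.6, "senses": 8.8, "translations": 23.3}
after = {"morphology": 4.1, "senses": 8.4, "translations": 21.7}


def relative_drop(old, new):
    """Reduction expressed as a percentage of the original value."""
    return 100.0 * (old - new) / old


drops = {name: relative_drop(before[name], after[name]) for name in before}

for name, value in drops.items():
    print(f"{name}: {value:.1f}%")
```

    With these rounded inputs the drops come out at roughly 10.9% for morphological analyses, 4.5% for senses and 6.9% for translations.</Paragraph>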
  </Section>
</Paper>