<?xml version="1.0" standalone="yes"?>
<Paper uid="W93-0309">
  <Title>Extraction of V-N-Collocations from Text Corpora: A Feasibility Study for German</Title>
  <Section position="3" start_page="74" end_page="75" type="metho">
    <SectionTitle>
2 Domain of Investigation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="74" end_page="74" type="sub_section">
      <SectionTitle>
2.1 What Do We Mean by 'Collocation'?
</SectionTitle>
      <Paragraph position="0"> Collocations in the sense of 'frequently cooccurring words' can quite easily be extracted from corpora by statistical means. From a linguistic point of view, however, a more restricted use of the term is preferable, one which takes into account the difference between what Sinclair (1966) called casual vs. significant collocations. Casual word combinations show a normal, free syntagmatic behaviour. In this paper, collocations shall refer only to word combinations that have a certain affinity to each other in that they follow combinatory restrictions not explainable by syntactic and semantic principles (e.g.</Paragraph>
      <Paragraph position="1"> hammer a nail into sth. rather than *beat a nail into sth.).</Paragraph>
      <Paragraph position="2"> For collocations that are based on a verb and a noun (preferably an object argument, sometimes however the subject of an intransitive verb), three types of V-N combinations are distinguished for German in the literature: verbal phrasemes (idioms) (e.g.</Paragraph>
      <Paragraph position="3"> Brundage et al. 1992), support verb constructions (SVCs) (v.Polenz 1989 or Danlos 1992) and collocations in the narrower sense (Hausmann 1989). As Brundage et al.</Paragraph>
      <Paragraph position="4"> (1992:7) and Barkema (1989:24) point out, the differences between these three types are gradual and &amp;quot;it is hard to find criteria of good selectivity to distinguish collocations from phrasemes&amp;quot;. Although our main interest lies in SVCs, we will in the following not distinguish between i) SVCs (e.g. to take into consideration), ii) lexicalized combinations with support verbs where the noun has lost its original meaning and which belong to phrasemes (e.g. to take a fancy), and iii) collocational combinations of support verbs with concrete or non-predicative nouns (e.g. to take a seat); we will refer to all these cases as V-N collocations.</Paragraph>
    </Section>
    <Section position="2" start_page="74" end_page="75" type="sub_section">
      <SectionTitle>
2.2 Why V-N Collocations?
</SectionTitle>
      <Paragraph position="0"> Collocations are well suited for statistical corpus studies. The semantics of a collocation in the narrower sense according to Fleischer (1982:63f) is &amp;quot;given by the individual semantics of its components, its meaning differs however in an unpredictable way from the pure sum of its parts. A substantial cause for this unpredictable difference is the frequency of occurrence and the probability with which the occurrence of one component determines the occurrence of the other&amp;quot; (our translation). The unpredictability of a collocation is thus partly caused by the high cooccurrence frequency of its components compared to the relative frequency of the single words. This holds even more for SVCs and phrasemes due to their (partly) non-compositional semantics.</Paragraph>
      <Paragraph position="1"> In German, common nouns, proper names and abbreviations of names start with an uppercase letter (sentence beginnings are changed to lowercase in the corpus). The verb-noun pattern was therefore chosen for our study instead of possible others, because this capitalization makes it possible to extract V-N collocations even from untagged corpora if the verb is used as the key-word. The results of extracting V-N collocations give a good indication of how promising the retrieval of collocations would be with POS-tagged corpora. Besides, N-N combinations in German are mainly restricted to proper names, and Adj-N collocations are not as extensive in our corpus due to the small number of frequent and interesting adjectives.</Paragraph>
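The capitalization heuristic described above can be sketched in a few lines of Python. This is a minimal illustration only; the function name and the example sentence are ours, not part of the study's tools, and it assumes whitespace-tokenized text with sentence beginnings already lowercased, as in the corpus.

```python
def noun_candidates(tokens):
    """Return tokens that look like German nouns by the capitalization
    test: in an untagged corpus with sentence-initial words lowercased,
    only common nouns, proper names and name abbreviations keep an
    uppercase initial."""
    return [t for t in tokens if t[:1].isupper()]

# Sentence beginnings are lowercased in the corpus, so only genuine
# nouns and names remain capitalized:
tokens = "er hat die Entscheidung gestern in Mannheim getroffen".split()
print(noun_candidates(tokens))  # ['Entscheidung', 'Mannheim']
```

With a verb as key-word, the capitalized tokens in its vicinity are thus the noun candidates for V-N pairs, with no POS tagging required.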
    </Section>
  </Section>
  <Section position="4" start_page="75" end_page="77" type="metho">
    <SectionTitle>
3 Resources and Methods Used in the Study
</SectionTitle>
    <Paragraph position="0"> Two untagged corpora were used for our study, kindly supplied by the 'Institut für deutsche Sprache' (IdS), Mannheim: the 2.7-million-word 'Mannheimer Korpus I' (MK1), which contains approx. 73% fiction and scientific/philosophical literature and about 27% newspaper texts, and the 'Bonner Zeitungskorpus' (BZK), a 3.7-million-word newspaper corpus. Except for the test of how results could differ for larger corpora, described in section 4.5, where the MK1 was combined with the BZK, the investigation was based on the MK1 on its own, for technical reasons and also because verbs occur more often on average in the MK1 than in the BZK (cf. Breidt 1993).</Paragraph>
    <Section position="1" start_page="75" end_page="76" type="sub_section">
      <SectionTitle>
3.1 Statistical Method and Tools
</SectionTitle>
      <Paragraph position="0"> MI is a function well suited for the statistical characterization of collocations because it compares the joint probability p(w1,w2) that two words occur together within a predefined distance with the independent probabilities p(w1) and p(w2) that the two words occur at all in the corpus: MI(w1,w2) = log2 [ p(w1,w2) / (p(w1) p(w2)) ] (for a more detailed description see Church et al. 1991).</Paragraph>
      <Paragraph position="2"> Several methods are possible for the calculation of probabilities (cf. Gale and Church 1990); for our purposes we use the simplest one, where the frequency of occurrence in the corpus is divided by the size N of the corpus, p(x) = f(x)/N. Distance will be defined as a window-size in which bigrams are calculated.</Paragraph>
      <Paragraph position="3"> MI does not give realistic figures for very low frequencies. If a relatively infrequent word occurs only once in a certain combination, the resulting very high MI value suggests a strong link between the words, when the combination might well have occurred simply by chance.</Paragraph>
      <Paragraph position="4"> So a lower bound of at least 3 occurrences of a word pair is necessary to calculate MI. The t-test, used to check whether the difference between the probability for a collocational occurrence and the probability for an independent occurrence of the two words is significant, is a standard significance test in statistics (e.g. Hatch and Farhady 1982). The statistical calculations were done as described in Church et al. (1991), and were performed together with KWIC queries and the creation of bigrams using tools available at the 'Institut für Maschinelle Sprachverarbeitung', University of Stuttgart.</Paragraph>
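The t-score described above, following the formulation in Church et al. (1991), can be sketched in Python as below. The observed cooccurrence probability is compared with the probability expected under independence; this is our own illustrative rendering with invented example counts, not the study's implementation.

```python
import math

def t_score(f_xy, f_x, f_y, N):
    """Approximate t-score for a word pair:
    t = (p(x,y) - p(x)p(y)) / sqrt(p(x,y) / N),
    i.e. observed minus expected cooccurrence probability, scaled by
    an estimate of the standard error (variance approximated by the
    observed probability)."""
    p_xy = f_xy / N
    p_x = f_x / N
    p_y = f_y / N
    return (p_xy - p_x * p_y) / math.sqrt(p_xy / N)

# t above roughly 1.65 indicates a significant association at the
# 95% confidence level (hypothetical counts):
t = t_score(f_xy=30, f_x=1000, f_y=300, N=2_700_000)
print(round(t, 2))  # ≈ 5.46
```

Unlike MI, the t-score grows with the absolute cooccurrence count, so it is less easily inflated by rare events, which is why the two measures are used together.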
    </Section>
    <Section position="2" start_page="76" end_page="76" type="sub_section">
      <SectionTitle>
3.2 The 'Standard' Method
</SectionTitle>
      <Paragraph position="0"> Verbs that can occur in SVCs are in the centre of our study because they provide examples for all three types of V-N collocations; besides, the chosen 'potential' support verbs belong to the most frequent verbs in the corpus anyway. V-N collocations were extracted for the following 16 verbs (no translations are given because they differ depending on the N argument): bleiben, bringen, erfahren, finden, geben, gehen, gelangen, geraten, halten, kommen, nehmen, setzen, stehen, stellen, treten, ziehen.</Paragraph>
      <Paragraph position="1"> Bigram tables of all words that occur within a certain distance of these verbs, together with their cooccurrence frequencies, form the basis for the calculation of MI.</Paragraph>
      <Paragraph position="2"> Bigram calculations were restricted to words occurring within a 6-word window to the left (cf. next section), inclusive of the verb, a span which captures 95% of significant collocations in English (Martin et al. 1983). We will refer to these as BI6. For combinations that occur at least 3 times, MI was calculated together with a t-score.</Paragraph>
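The BI6 bigram collection can be sketched roughly as follows. This is a simplified Python illustration assuming whitespace-tokenized text, with the capitalization test standing in for noun identification; the study itself used the Stuttgart tools, and the example sentence is invented.

```python
from collections import Counter

def left_window_bigrams(tokens, verbs, span=6):
    """Count (noun, verb) pairs where a capitalized token (noun
    candidate) occurs in a span-word window to the left of a
    key-word verb, window inclusive of the verb (the BI6 setup)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok in verbs:
            # 5 tokens to the left plus the verb itself = 6-word window
            for left in tokens[max(0, i - span + 1):i]:
                if left[:1].isupper():  # noun candidate
                    counts[(left, tok)] += 1
    return counts

tokens = ("er will die Frage in Betracht ziehen und eine "
          "Entscheidung treffen").split()
counts = left_window_bigrams(tokens, verbs={"ziehen", "treffen"})
# counts now contains e.g. ('Betracht', 'ziehen') -> 1
```

Over a whole corpus, pairs with a count of at least 3 would then be passed on to the MI and t-score calculations and sorted by MI.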
      <Paragraph position="3"> From these, candidates for V-N collocations were automatically extracted, sorted by MI. All of these were checked by means of KWIC listings and classified w.r.t. their collocational status by the author. The classification was in most cases very obvious. If a combination potentially formed a collocation but was not used as such in the corpus, it was not counted; a couple of times, where some of the usages were indeed collocations and others not, the decision was made in favour of the predominant case.</Paragraph>
    </Section>
    <Section position="3" start_page="76" end_page="77" type="sub_section">
      <SectionTitle>
3.3 Application for German Corpora: Some Problems
</SectionTitle>
      <Paragraph position="0"> Some properties of the German language make the task of extracting V-N collocations from German text corpora more difficult than for English corpora. A minor difference concerns the strong inflection of German verbs. Whereas in English a verb lexeme appears in 3 or 4 different forms plus one for the present participle, German verbs have 7 to 10 verb forms (without subjunctive forms) for one lexeme and an additional 4 for the present participle. This has to be considered for the evaluation of queries based on single inflection forms, because in English more usages are covered with one verb form than in German.</Paragraph>
      <Paragraph position="1"> Another point concerns the variable word order in German (see Uszkoreit 1987), which makes it more difficult to locate the parts of a V-N collocation. In a main clause (verb-second order), a noun preceding a finite verb usually is the subject, but it can also be a topicalized complement; in sentences where the main verb occurs at the end (nonfinite verb or subordinate clause) the preceding noun is mostly a direct object or other complement, or an adjunct. A noun to the right of a finite verb can be any of subject, object or other argument due to topicalization or scrambling. We restrict our search to V-N combinations where the noun precedes the verb either directly or within two to five words, because this at least definitely captures complements of main verbs in verb-final position. (We gratefully acknowledge that the work reported here would not have been possible without the supplied tools and corpora.)</Paragraph>
      <Paragraph position="2"> To find the correct argument to the right of the verb is difficult in an unparsed corpus because of the variable number of intervening constituents.</Paragraph>
      <Paragraph position="3"> As illustrated in the last paragraph, the assumption that a &amp;quot;semantic agent \[...\] is principally used before the verb&amp;quot; and a &amp;quot;semantic object \[...\] is used after it&amp;quot;, as described in Smadja (1991a:180), does not hold for German. Therefore, complicated parsing is necessary to distinguish subject-verb from object-verb combinations. The results of V-N extractions reflect this problem. In many, if not most, of the uninteresting combinations extracted, the noun to the left of the verb is the subject rather than a complement of the verb (cf. section 4.6).</Paragraph>
    </Section>
  </Section>
</Paper>