<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1141">
  <Title>Collocation Extraction Based on Modifiability Statistics</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Kinds of Collocations
</SectionTitle>
    <Paragraph position="0"> There have been various approaches to defining the notion of 'collocation'. This is by no means an easy task, especially when it comes to defining the demarcation line between collocations and free word combinations (modulo general syntactic and semantic constraints). We favor an approach which draws this line on the semantic layer, viz. the compositionality between the components of a linguistic expression.</Paragraph>
    <Paragraph position="1"> For this purpose, we distinguish between three classes of collocations based on varying degrees of semantic compositionality of the basic lexical entities involved: 1. Idiomatic Phrases. In this case, none of the lexical components involved contributes to the overall meaning in a semantically transparent way. The meaning of the expression is metaphorical or figurative. For example, the literal meaning of the German PP-verb combination '[jemanden] auf die Schippe nehmen' is 'to take [someone] onto the shovel'. Its figurative meaning is 'to lampoon somebody'.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. Support Verb Constructions/Narrow Collocations
</SectionTitle>
    <Paragraph position="0"> This second class contains expressions in which at least one component contributes to the overall meaning in a semantically transparent way and thus constitutes its semantic core. For example, in the support verb construction 'zur Verfügung stellen' (literal: 'to put to availability'; actual: 'to make available'), the noun 'Verfügung' is the semantic core of the expression, whereas the verb only has a support function with some impact on argument structure, causativity or aktionsart. There are, however, also narrow collocations in which the basic lexical meaning of the verb is the semantic core: For example, in 'aus eigener Tasche bezahlen' ('to pay out of one's own pocket') the verb 'bezahlen' is the semantic core. What unifies these two types is the fact that they function as predicates.</Paragraph>
    <Paragraph position="1"> 3. Fixed Phrases. Here, all basic lexical meanings of the components involved contribute to the overall meaning in a semantically much more transparent way. Still, they are not so completely compositional as to be classified as free word combinations. For example, all the basic lexical meanings of the different lexical components in 'im Koma liegen' (literal: 'to lie in coma'; actual: 'to be comatose') contribute to the overall meaning of the expression. Still, this is different from a completely compositional free word combination, such as 'auf der Strasse gehen' ('to walk on the street').</Paragraph>
    <Paragraph position="2"> Our goal is to consider all three types of collocations as a whole, i.e., we will not distinguish between the three different kinds of collocations. However, in order to focus our experiments, we will concentrate on a particular surface pattern in which they occur, viz. PP-verb collocations.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="5" type="metho">
    <SectionTitle>
3 Methods and Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Construction and Statistics of the Testset
</SectionTitle>
      <Paragraph position="0"> We used a 114-million-word German-language newspaper corpus extracted from the Web to acquire candidate PP-verb collocations. The corpus was first processed by means of the TnT part-of-speech tagger (Brants, 2000). Then we ran a sentence/clause recognizer and an NP/PP chunker, both developed at the Text Knowledge Engineering Lab at Freiburg University, on the POS-tagged corpus. From the XML-marked-up tree output, PP-verb complexes were automatically selected in the following way: Taking a particular PP node as a fixed point, either the preceding or the following sibling V node was taken. From such a PP-verb combination, we extracted and counted both its various heads, in terms of Preposition-Noun-Verb (PNV) triples, and all its associated supplements, i.e., in this case any additional lexical material which also occurs in the nominal group of the PP, such as articles, adjectives, adverbs, cardinals, etc. The extraction of the associated supplements is essential to the linguistic measure described in subsection 3.3 below.</Paragraph>
      <Paragraph position="1"> In order to reduce the number of candidates for evaluation and to eliminate low-frequency data, we only considered PNV triples with frequency $f > 10$. This was also motivated by the well-known fact that collocations tend to have a higher co-occurrence frequency than free word combinations.</Paragraph>
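      <Paragraph position="2"> The counting step described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used for the experiments: it assumes the chunker output has already been reduced to (preposition, noun, verb, supplement) tuples, the helper name count_pnv and the toy observations are hypothetical, and the frequency cutoff is passed in as a parameter.

```python
from collections import Counter, defaultdict


def count_pnv(observations, min_freq):
    """Count PNV triples and their supplements.

    Only triples whose frequency is strictly greater than min_freq are kept.
    """
    triple_freq = Counter()
    supplement_freq = defaultdict(Counter)
    for prep, noun, verb, supplement in observations:
        triple = (prep, noun, verb)
        triple_freq[triple] += 1
        # An empty string stands for the zero supplement
        # (no additional lexical material in the PP).
        supplement_freq[triple][supplement] += 1
    kept = {t for t, f in triple_freq.items() if f > min_freq}
    return (
        {t: triple_freq[t] for t in kept},
        {t: dict(supplement_freq[t]) for t in kept},
    )


# Hypothetical toy observations (not corpus data).
observations = [
    ("in", "Koma", "liegen", ""),
    ("in", "Koma", "liegen", ""),
    ("in", "Koma", "liegen", "tiefen"),
    ("auf", "Strasse", "gehen", "der"),
    ("auf", "Strasse", "gehen", "die belebte"),
]
triples, supplements = count_pnv(observations, min_freq=2)
print(triples)
print(supplements)
```

      With the toy cutoff of 2, only the first triple survives; on real data the cutoff would be set to the threshold used for the candidate set.</Paragraph>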
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Classification of the Testset
</SectionTitle>
      <Paragraph position="0"> Three human judges manually classified the PP-verb candidate types with $f > 10$ as to whether they were collocations or not. For this purpose, they used a manual, in which the guidelines included the linguistic properties as described in Section 1 and the three collocation classes identified in Section 2.</Paragraph>
      <Paragraph position="1"> Among the 8,644 PP-verb candidate types, 1,180 (13.7%) were identified as true collocations. The inter-annotator agreement was 94.8% (with a standard deviation of 2.1).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="5" type="sub_section">
      <SectionTitle>
3.3 The Linguistic Measure
</SectionTitle>
      <Paragraph position="0"> The linguistic property around which we built our measure for collocativity is the non- or limited modifiability of collocations with additional lexical material (i.e., supplements). The underlying assumption is that a PNV triple is less modifiable (and thus more likely to be a collocation) if it has a lexical supplement which, compared to all others, is particularly characteristic. We express this assumption in the following way: Let $n$ be the number of distinct supplements of a particular PNV triple ($\mathit{PNV}_{triple}$). The probability $P$ of a particular supplement $\mathit{Supp}_k$, $k \in [1, n]$, is described by its frequency scaled by the sum of all supplement frequencies: $$P(\mathit{PNV}_{triple}, \mathit{Supp}_k) = \frac{f(\mathit{PNV}_{triple}, \mathit{Supp}_k)}{\sum_{i=1}^{n} f(\mathit{PNV}_{triple}, \mathit{Supp}_i)}$$</Paragraph>
      <Paragraph position="2"> Then the modifiability $\mathit{mod}$ of a PNV triple can be described by its most probable supplement: $$\mathit{mod}(\mathit{PNV}_{triple}) = \max_{k \in [1, n]} P(\mathit{PNV}_{triple}, \mathit{Supp}_k)$$</Paragraph>
      <Paragraph position="4"> To define a measure of collocativity $\mathit{coll}$ for a candidate set, some factor regarding frequency has to be taken into account. Thus, besides $\mathit{mod}$, we take the relative co-occurrence frequency for a spe-</Paragraph>
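      <Paragraph position="5"> The modifiability computation can be illustrated with a short sketch. It is a hedged example rather than the original implementation: the function names and the toy counts are hypothetical, normalising the co-occurrence frequency by the summed frequency of all candidate triples is an assumption, and so is the final step of multiplying the two factors, which is shown only as one plausible way of combining them.

```python
def supplement_probabilities(supp_freq):
    """Relative frequency of each supplement of a single PNV triple."""
    total = sum(supp_freq.values())
    return {supp: freq / total for supp, freq in supp_freq.items()}


def modifiability(supp_freq):
    """Probability of the most probable supplement of a PNV triple."""
    return max(supplement_probabilities(supp_freq).values())


def relative_frequency(triple, triple_freq):
    """Frequency of one triple relative to all candidate triples.

    ASSUMPTION: the normalisation is over the whole candidate set.
    """
    return triple_freq[triple] / sum(triple_freq.values())


# Hypothetical counts; "" is the zero supplement.
supp_freq = {"": 6, "tiefen": 1, "künstlichen": 1}
triple_freq = {("in", "Koma", "liegen"): 8, ("auf", "Strasse", "gehen"): 24}
triple = ("in", "Koma", "liegen")

mod_score = modifiability(supp_freq)                # 6 / 8 = 0.75
rel_freq = relative_frequency(triple, triple_freq)  # 8 / 32 = 0.25
# ASSUMPTION: combining both factors by multiplication.
assumed_coll = mod_score * rel_freq
print(mod_score, rel_freq, assumed_coll)
```

      A triple dominated by a single supplement (here the zero supplement) receives a modifiability score close to 1, whereas a triple whose supplements are evenly spread receives a score close to 1/n.</Paragraph>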
    </Section>
    <Section position="4" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
3.4 Methods of Evaluation
</SectionTitle>
      <Paragraph position="0"> Standard procedures for evaluating the goodness of collocativity measures usually involve identifying the true positives among the $n$-highest ranked candidates returned by a particular measure. Because this is rather labor-intensive, $n$ is usually small, ranging from 50 to several hundred. Evert and Krenn (2001), however, point out the inadequacy of such methods, claiming that they usually lead to very superficial judgements about the measures to be examined. In contrast, they suggest examining various $n$-highest ranked samples, which allows plotting standard precision and recall graphs for the whole candidate set.</Paragraph>
      <Paragraph position="1"> Footnote 5: Note that the zero supplement of the PNV triple, i.e., the one for which no lexical supplements co-occur, is also included in the set of supplements.</Paragraph>
      <Paragraph position="2"> We evaluate the $\mathit{coll}$ measure against two widely used standard statistical tests (t-test and log-likelihood) and against co-occurrence frequency.</Paragraph>
      <Paragraph position="3"> The comparison to the t-test is especially interesting because it was found to achieve the best overall precision scores in other studies (see Evert and Krenn (2001)). Our baseline is defined by the proportion of true positives (13.7%; see subsection 3.2), which can be described as the likelihood of finding one by blindly picking from the candidate set.</Paragraph>
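      <Paragraph position="4"> The evaluation scheme over n-highest ranked samples can be sketched as follows. This is an illustrative example under stated assumptions: the function names, the toy scores and the gold set are hypothetical, and only the two counts passed to baseline (1,180 true collocations among 8,644 candidates) are taken from subsection 3.2.

```python
def precision_recall_at_n(ranked, true_collocations, n):
    """Precision and recall of the n highest ranked candidates."""
    top = ranked[:n]
    hits = sum(1 for candidate in top if candidate in true_collocations)
    return hits / n, hits / len(true_collocations)


def baseline(num_true, num_candidates):
    """Chance of picking a true collocation blindly from the candidate set."""
    return num_true / num_candidates


# Hypothetical scores from some association measure and a toy gold standard.
scores = {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.1}
gold = {"a", "c"}
ranked = sorted(scores, key=scores.get, reverse=True)

for n in (1, 2, 4):
    print(n, precision_recall_at_n(ranked, gold, n))

# Baseline precision in percent for the counts reported in subsection 3.2.
print(round(100 * baseline(1180, 8644), 1))  # 13.7
```

      Computing the pair for every n from 1 up to the size of the candidate set yields the points of the precision and recall graphs; a measure is informative to the extent that its precision stays above the baseline.</Paragraph>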
    </Section>
  </Section>
</Paper>