File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/c04-1141_intro.xml

Size: 5,010 bytes

Last Modified: 2025-10-06 14:02:10

<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1141">
  <Title>Collocation Extraction Based on Modifiability Statistics</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Natural language is an open and very flexible communication system. Syntax, of course, imposes constraints, e.g., on word order or the occurrence of particular phrasal types such as PPs or NPs, and lexical semantics imposes, e.g., selectional constraints on conceptually permitted sorts or types within the context of specific verbs or nouns. Nevertheless, natural language speakers usually enjoy an enormous degree of freedom to express the content they want to convey in a great variety of linguistic forms.</Paragraph>
    <Paragraph position="1"> There is, however, a significant subset of expressions which do not share this rather free combinability, so-called collocations. From a linguistic perspective, they can be characterized by at least three recurrent and prominent properties (Manning and Schütze, 1999): • Non- (or limited) compositionality. The meaning of a collocation is not a straightforward composition of the meanings of its parts. For example, the meaning of 'red tape' is completely different from the meaning of its components. • Non- (or limited) substitutability. The parts of a collocation cannot be substituted by semantically similar words. Thus, 'gut' in 'to spill gut' cannot be substituted by 'intestine' (see also Lin (1999)).</Paragraph>
    <Paragraph position="2"> • Non- (or limited) modifiability. Many collocations cannot be supplemented by additional lexical material. For example, the noun in 'to kick the bucket' cannot be modified as 'to kick the {holey/plastic/water} bucket'.</Paragraph>
    <Paragraph position="3"> From a natural language processing perspective, these observations imply that collocations should not enter, e.g., the standard syntax-semantics pipeline, so that compositional semantic readings are not assigned to expressions for which they are inappropriate. Hence, collocations need to be identified as such and then blocked, e.g., from compositional semantic interpretation.</Paragraph>
    <Paragraph position="4"> In computational linguistics, a wide variety of lexical association measures have been employed for the task of (semi-)automatic collocation identification and extraction. Almost all of these measures can be grouped into one of three categories. The corresponding metrics have been extensively discussed in the literature, both in terms of their mathematical properties (Dunning, 1993; Manning and Schütze, 1999) and their suitability for the task of collocation extraction (see Evert and Krenn (2001) and Krenn and Evert (2001) for recent evaluations). Typically, they are applied to a set of candidate lexeme pairs obtained from pre-processors varying in linguistic sophistication. (On the low end, this may just be a preset numeric window span. In order to reduce the noise among the candidates, however, more elaborate linguistic processing, such as POS tagging, chunking, or even parsing, is increasingly being applied.) The selected measure then assigns each candidate pair an association score, computed from the pair's joint and marginal frequencies, which expresses the strength of the hypothesis that the pair constitutes a collocation.</Paragraph>
    <Paragraph position="5"> While these association measures have their statistical merits in collocation identification, they have relatively little to do with the linguistic properties (such as those mentioned at the beginning) that are typically associated with the notion of collocativity. It is therefore worth investigating whether a measure can be implemented that directly incorporates linguistic criteria into the collocation identification task and, more importantly, whether such a linguistically rooted approach would fare better than some of the standard lexical association measures.</Paragraph>
    <Paragraph position="6"> In this study, we introduce such a linguistic measure for identifying PP-verb collocations in German, based on the property of non- or limited modifiability. To the best of our knowledge, this is the first work to use this kind of linguistic measure to acquire collocations automatically. By contrasting our method with previous studies that use the standard lexical association measures, we intend to emphasize a more linguistically inspired use of statistics in collocation mining. Section 2 motivates our definition of the notion of collocation, and Section 3 describes our methods, in particular the linguistically grounded collocation extraction algorithm and the experimental setup derived from it. In Section 4, we present and discuss the results of our experiments.</Paragraph>
  </Section>
</Paper>