<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1141">
  <Title>Collocation Extraction Based on Modifiability Statistics</Title>
  <Section position="7" start_page="5" end_page="5" type="relat">
    <SectionTitle>
5 Related Work
</SectionTitle>
    <Paragraph position="0"> Although many studies have addressed collocation extraction and mining using purely statistical approaches (Church and Hanks, 1990; Ikehara et al., 1996), far less work has been done on collocation acquisition that takes into account the linguistic properties typically associated with collocations.</Paragraph>
    <Paragraph position="1"> Smadja (1993), the classic work on collocation extraction, uses a two-stage filtering model: in the first step, n-gram statistics determine possible collocations; in the second step, these candidates are submitted to a syntactic validation procedure (e.g., determining verb-object collocations) in order to filter out invalid collocations. [Footnote 7: Of course, lexical material is always at least partially dependent on the domain in question. In our case, this is the news domain with all its associated subdomains (politics, economics, finance, culture, etc.).]</Paragraph>
    <Paragraph position="2"> In a single-judge evaluation of 4,000 collocation candidates, the incorporation of linguistic criteria (via tagging and predicate-argument parsing) boosts precision up to 80% and recall to 94%. These results are, of course, not comparable to ours. First, precision and recall in that study are measured at a single fixed point on a fixed, unranked candidate list, whereas we plot these values continuously over a ranked candidate list in order to obtain more reliable evaluation results. Second, our kind of syntactic preprocessing (which is standard nowadays) allows collocation extraction algorithms to better control the structural types of collocations.</Paragraph>
    <Paragraph position="3"> Lin (1998) acquires a lexical dependency database by assembling dependency relationships from a parsed corpus. An entry in this database is classified as a collocation if its log-likelihood value exceeds some threshold. Using an automatically constructed similarity thesaurus, Lin (1999) then separates compositional from non-compositional collocations by taking into account the second linguistic property described in Section 1, viz. their non- or limited substitutability. In particular, he checks the existence and mutual information values of phrases obtained by substituting the words with similar ones, and on that basis classifies the phrase as compositional or non-compositional. Although this study offers some promising results, it is better suited to the fine-grained classification of an already acquired set of collocations, e.g., according to the criteria described in Section 2, and is thus not really comparable to our work. Moreover, the linguistic property he focuses on is a semantic one, whereas ours is purely syntactic in nature.</Paragraph>
  </Section>
</Paper>