<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1006">
  <Title>Automatic Extraction of Idioms using Graph Analysis and Asymmetric Lexicosyntactic Patterns</Title>
  <Section position="3" start_page="48" end_page="48" type="intro">
    <SectionTitle>
2 Previous and Related Work
</SectionTitle>
    <Paragraph position="0"> This section describes previous work in extracting information from text, and inferring semantic or idiomatic properties of words from the information so derived.</Paragraph>
    <Paragraph position="1"> The main technique used in this paper to extract groups of words that are semantically or idiomatically related is a form of lexicosyntactic pattern recognition. Lexicosyntactic patterns were pioneered by Marti Hearst (Hearst, 1992; Hearst and Schütze, 1993) in the early 1990s to enable the addition of new information to lexical resources such as WordNet (Fellbaum, 1998). The main insight of this work is that certain regular patterns in word usage can reflect underlying semantic relationships. For example, the phrase &quot;France, Germany, Italy, and other European countries&quot; suggests that France, Germany and Italy are part of the class of European countries. Such hierarchical examples are quite sparse, and greater coverage was later attained by Riloff and Shepherd (1997) and Roark and Charniak (1998) in extracting relations not of hierarchy but of similarity, by finding conjunctions or co-ordinations such as &quot;cloves, cinnamon, and nutmeg&quot; and &quot;cars and trucks.&quot; This work was extended by Caraballo (1999), who built classes of related words in this fashion and then reasoned that if a hierarchical relationship could be extracted for any member of a class, it could be applied to all members of that class. This technique can often mistakenly reason across an ambiguous middle term, a problem that Cederberg and Widdows (2003) reduced by combining pattern-based extraction with contextual filtering using latent semantic analysis (LSA).</Paragraph>
    <Paragraph position="2"> Prior work in discovering non-compositional phrases has been carried out by Lin (1999) and Baldwin et al. (2003), who also used LSA to distinguish between compositional and non-compositional verb-particle constructions and noun-noun compounds.</Paragraph>
    <Paragraph position="3"> At the same time, work in analyzing idioms and asymmetry within linguistics has become more sophisticated, as discussed by Benor and Levy (2004), and many of the semantic factors underlying our results can be understood in light of this theoretical work.</Paragraph>
    <Paragraph position="4"> Other motivating and related themes of work for this paper include collocation extraction and example-based machine translation (EBMT). In the work of Smadja (1993) on extracting collocations, preference was given to constructions whose constituents appear in a fixed order, a similar (and more generally implemented) version of our assumption here that asymmetric constructions are more idiomatic than symmetric ones. Recent advances in EBMT have emphasized that examining patterns of language use can significantly improve idiomatic language generation (Carl and Way, 2003).</Paragraph>
  </Section>
</Paper>