<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0603">
  <Title>Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing</Title>
  <Section position="5" start_page="0" end_page="17" type="relat">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> The syntax and semantics of NCs are an active area of research; the Journal of Computer Speech and Language has an upcoming special issue on Multiword Expressions.</Paragraph>
    <Paragraph position="1"> The best-known early work on automated unsupervised NC bracketing is that of Lauer (1995), who introduces the probabilistic dependency model for the syntactic disambiguation of NCs and argues against the adjacency model proposed by Marcus (1980), Pustejovsky et al. (1993) and Resnik (1993). Lauer collects n-gram statistics from Grolier's encyclopedia, which contains about 8 million words. To overcome data sparsity problems, he estimates probabilities over conceptual categories in a taxonomy (Roget's thesaurus) rather than for individual words. Lauer evaluates his models on a set of 244 unambiguous NCs derived from the same encyclopedia (inter-annotator agreement 81.50%) and achieves 77.50% with the dependency model (baseline 66.80%). Adding POS information and further tuning allows him to reach the state-of-the-art result of 80.70%. More recently, Keller and Lapata (2003) evaluate the utility of Web search engines for obtaining frequencies for unseen bigrams. They later propose using Web counts as a baseline unsupervised method for many NLP tasks (Lapata and Keller, 2004). They apply this idea to six NLP tasks, including the syntactic and semantic disambiguation of NCs following Lauer (1995), and show that variations on bigram counts perform nearly as well as more elaborate methods. They do not use taxonomies and work with the word n-grams directly, achieving 78.68% with a much simpler version of the dependency model.</Paragraph>
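The contrast between the two models can be sketched in a few lines. For a three-noun compound n1 n2 n3, the dependency model compares the association of n1 with n2 against that of n1 with n3, while the adjacency model compares the two adjacent bigrams n1 n2 and n2 n3. The counts and the example compound below are purely illustrative, not figures from any of the cited works:

```python
def bracket_dependency(count, n1, n2, n3):
    """Dependency model: left-bracket [[n1 n2] n3] if n1 associates
    more strongly with n2 than with n3; otherwise [n1 [n2 n3]]."""
    return "left" if count[(n1, n2)] >= count[(n1, n3)] else "right"

def bracket_adjacency(count, n1, n2, n3):
    """Adjacency model: compare the two adjacent bigrams instead."""
    return "left" if count[(n1, n2)] >= count[(n2, n3)] else "right"

# Hypothetical bigram counts standing in for corpus or Web n-gram
# statistics; real systems would derive these from search-engine hits
# or an encyclopedia corpus.
counts = {
    ("liver", "cell"): 20,
    ("liver", "antibody"): 2,
    ("cell", "antibody"): 25,
}

print(bracket_dependency(counts, "liver", "cell", "antibody"))  # left
print(bracket_adjacency(counts, "liver", "cell", "antibody"))   # right
```

With these counts the two models disagree, which is exactly the kind of case that motivates Lauer's argument for the dependency model.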
    <Paragraph position="2"> Girju et al. (2005) propose a supervised model (a decision tree) for NC bracketing in context, based on five semantic features (requiring the correct WordNet sense to be given): the top three WordNet semantic classes for each noun, derivationally related forms, and whether the noun is a nominalization. The algorithm achieves an accuracy of 83.10%.</Paragraph>
  </Section>
</Paper>