<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1167">
  <Title>Statistical Language Modeling with Performance Benchmarks using Various Levels of Syntactic-Semantic Information</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Statistical language models consist of estimating the probability distributions of a word given the history of words so far used. The standard n-gram language model considers two histories to be equivalent if they end in the same n [?] 1 words. Due to the tradeoff between predictive power and reliability of estimation, n is typically chosen to be 2 (bi-gram) or 3 (tri-gram). Even tri-gram model suffers from sparse-data estimation problem, but various smoothing techniques (Goodman, 2001) have led to significant improvements in many applications. But still the criticism that n-grams are unable to capture the long distance dependencies that exist in a language, remains largely valid.</Paragraph>
    <Paragraph position="1"> In order to model the linguistic structure that spans a whole sentence or a paragraph or even more, various approaches have been taken recently. These can be categorized into two main types : syntactically motivated and semantically motivated large span consideration. In the first type, probability of a word is decided based on a parse-tree information like grammatical headwords in a sentence (Charniak, 2001) (Chelba and Jelinek, 1998), or based on part-of-speech (POS) tag information (Galescu and Ringger, 1999). Examples of the second type are (Bellegarda, 2000) (Coccaro and Jurafsky, 1998), where latent semantic analysis (LSA) (Landauer et al., 1998) is used to derive large-span semantic dependencies. LSA uses word-document co-occurrence statistics and a matrix factorization technique called singular value decomposition to derive semantic similarity measure between any two text units - words or documents. Each of these approaches, when integrated with n-gram language model, has led to improved performance in terms of perplexity as well as speech recognition accuracy.</Paragraph>
    <Paragraph position="2"> While each of these approaches has been studied independently, it would be interesting to see how they can be integrated in a unified framework which looks at syntactic as well as semantic information in the large span. Towards this direction, we describe in this paper a mathematical framework called syntactically enhanced latent syntactic-semantic analysis (SELSA). The basic hypothesis is that by considering a word alongwith its syntactic descriptor as a unit of knowledge representation in the LSA-like framework, gives us an approach to joint syntactic-semantic analysis of a document. It also provides a finer resolution in each word's semantic description for each of the syntactic contexts it occurs in. Here the syntactic descriptor can come from various levels e.g. part-of-speech tag, phrase type, supertag etc. This syntactic-semantic representation can be used in language modeling to allocate the probability mass to words in accordance with their semantic similarity to the history as well as syntactic fitness to the local context.</Paragraph>
    <Paragraph position="3"> In the next section, we present the mathematical framework. Then we describe its application to statistical language modeling. In section 4 we explain the the use of various levels of syntactic information in SELSA. That is followed by experimental results and conclusion.</Paragraph>
  </Section>
class="xml-element"></Paper>