<?xml version="1.0" standalone="yes"?> <Paper uid="J97-4005"> <Title>Stochastic Attribute-Value Grammars</Title> <Section position="3" start_page="598" end_page="599" type="intro"> <SectionTitle> Abney Stochastic Attribute-Value Grammars </SectionTitle> <Paragraph position="0"> with the frequency of proof trees in the training corpus. Eisele recognizes that this problem arises only where there are context dependencies.</Paragraph> <Paragraph position="1"> Fortunately, solutions to the context-dependency problem have been described (and indeed are currently enjoying a surge of interest) in statistics, machine learning, and statistical pattern recognition, particularly image processing. The models of interest are known as random fields. Random fields can be seen as a generalization of Markov chains and stochastic branching processes. Markov chains are stochastic processes corresponding to regular grammars and random branching processes are stochastic processes corresponding to context-free grammars. The evolution of a Markov chain describes a line, in which each stochastic choice depends only on the state at the immediately preceding time-point. The evolution of a random branching process describes a tree in which a finite-state process may spawn multiple child processes at the next time-step, but the number of processes and their states depend only on the state of the unique parent process at the preceding time-step. In particular, stochastic choices are independent of other choices at the same time-step: each process evolves independently. If we permit re-entrancies, that is, if we permit processes to re-merge, we generally introduce context-sensitivity. In order to re-merge, processes must be &quot;in synch,&quot; which is to say, they cannot evolve in complete independence of one another. Random fields are a particular class of multidimensional random processes, that is, processes corresponding to probability distributions over an arbitrary graph. The theory of random fields can be traced back to Gibbs (1902); indeed, the probability distributions involved are known as Gibbs distributions.</Paragraph> <Paragraph position="2"> To my knowledge, the first application of random fields to natural language was Mark et al. (1992). The problem of interest was how to combine a stochastic context-free grammar with n-gram language models. In the resulting structures, the probability of choosing a particular word is constrained simultaneously by the syntactic tree in which it appears and the choices of words at the n preceding positions. The context-sensitive constraints introduced by the n-gram model are reflected in re-entrancies in the structure of statistical dependencies, as in Figure 1.</Paragraph> <Paragraph position="3"> Statistical dependencies under the model of Mark et al. (1992).</Paragraph> <Paragraph position="4"> In this diagram, the choice of label on a node z with parent x and preceding word y is dependent on the label of x and y, but conditionally independent of the label on any other node.</Paragraph> <Paragraph position="5"> Della Pietra, Della Pietra, and Lafferty (1995, henceforth, DD&L) also apply random fields to natural language processing. The application they consider is the induction of English orthographic constraints--inducing a grammar of possible English words. 
DD&L describe an algorithm called Improved Iterative Scaling (IIS) for selecting informative features of words with which to construct a random field, and for setting the parameters of the field optimally, for a given set of features, so as to model an empirical word distribution.

It is not immediately obvious how to use the IIS algorithm to equip attribute-value grammars with probabilities. In brief, the difficulty is that the IIS algorithm requires the computation of the expectations, under random fields, of certain functions; in general, computing these expectations involves summing over all configurations (all possible character sequences, in the orthography application), which is infeasible when the configuration space is large. Instead, DD&L use Gibbs sampling to estimate the needed expectations.

Gibbs sampling is possible for the application that DD&L consider. A prerequisite for Gibbs sampling is that the configuration space be closed under relabeling of graph nodes. In the orthography application, the configuration space is the set of possible English words, represented as finite linear graphs labeled with ASCII characters. Every way of changing a label, that is, every substitution of one ASCII character for a different one, yields another possible English word.

By contrast, the set of graphs admitted by an attribute-value grammar G is highly constrained. If one changes an arbitrary node label in a dag admitted by G, one does not necessarily obtain another dag admitted by G. Hence, Gibbs sampling is not applicable. However, I will show that a more general sampling method, the Metropolis-Hastings algorithm, can be used to compute the maximum-likelihood estimate of the parameters of AV grammars.
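To make the contrast with Gibbs sampling concrete, here is a minimal Metropolis-Hastings sketch for estimating feature expectations, offered as an illustration of the general method under invented assumptions, not as the estimation procedure developed in the body of the paper. The sampler needs only a proposal mechanism that emits configurations with known proposal probabilities, together with the unnormalized field weight of a configuration; it never has to relabel nodes freely. The propose and features functions are placeholder stand-ins.

```python
import math
import random

# Assumptions for this sketch (not from the paper):
# - propose(x) returns a candidate configuration y together with the proposal
#   probabilities q(x -> y) and q(y -> x).
# - features(x) returns the vector of feature values for configuration x.

def field_weight(x, lambdas, features):
    """Unnormalized Gibbs weight exp(sum_i lambda_i * f_i(x)); no partition function needed."""
    return math.exp(sum(l * f for l, f in zip(lambdas, features(x))))

def metropolis_hastings_expectations(x0, lambdas, features, propose, n_samples, burn_in=1000):
    """Estimate E[f_i] under the random field by averaging feature values over an MH chain."""
    x = x0
    sums = None
    kept = 0
    for step in range(burn_in + n_samples):
        y, q_xy, q_yx = propose(x)      # candidate configuration from the proposal distribution
        # Acceptance ratio uses only unnormalized weights and proposal probabilities.
        ratio = (field_weight(y, lambdas, features) * q_yx) / (
            field_weight(x, lambdas, features) * q_xy
        )
        if random.random() < min(1.0, ratio):
            x = y                       # accept the move
        if step >= burn_in:
            fx = features(x)
            sums = fx if sums is None else [a + b for a, b in zip(sums, fx)]
            kept += 1
    return [s / kept for s in sums]

if __name__ == "__main__":
    # Toy stand-in: the "admitted" configurations are just the integers 0..9,
    # and the proposal picks any of them uniformly (so q is symmetric).
    admitted = list(range(10))

    def toy_propose(x):
        y = random.choice(admitted)
        return y, 1.0 / len(admitted), 1.0 / len(admitted)

    def toy_features(x):
        return [float(x % 2 == 0), float(x)]    # two toy features

    print(metropolis_hastings_expectations(0, [0.8, -0.2], toy_features, toy_propose, 20000))
```

The design point the last paragraph makes carries over directly: the proposal can be restricted to whatever configurations the grammar admits, so no closure under arbitrary relabeling is required.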