<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0701">
  <Title>Learning in Natural Language: Theory and Algorithmic Approaches*</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Learning Frameworks
</SectionTitle>
    <Paragraph position="0"> Generative probability models provide a principled way to the study of statistical classification in complex domains such as NL. It is common to assume a generative model for such data, estimate its parameters from training data and then use Bayes rule to obtain a classifier for this model. In the context of NL most classifters are derived from probabilistic language models which estimate the probability of a sentence 8 using Bayes rule, and then decompose this probability into a product of conditional probabilities according to the generative model.</Paragraph>
    <Paragraph position="2"> where hi is the relevant history when predicting wi, and s is any sequence of tokens, words, part-of-speech (pos) tags or other terms.</Paragraph>
    <Paragraph position="3"> This general scheme has been used to derive classifiers for a variety of natural language applications including speech applications (Rab89), pos tagging (Kup92; Sch95), word-sense ambiguation (GCY93) and context-sensitive spelling correction (Go195). While the use of Bayes rule is harmless, most of the work in statistical language modeling and ambiguity resolution is devoted to estimating terms of the form Pr(wlh ). The generative models used to estimate these terms typically make Markov or other independence assumptions. It is evident from studying language data that these assumptions are often patently false and that there are significant global dependencies both within and across sentences. For example, when using (Hidden) Markov Model (HMM) as a generative model for pos tagging, estimating the probability of a sequence of tags involves assuming that the pos tag ti of the word wi is independent of other words in the sentence, given the preceding tag ti-1. It is not surprising therefore that this results in a poor estimate of the probability density function. However, classifiers built based on these false assumptions nevertheless seem to behave quite robustly in many cases.</Paragraph>
    <Paragraph position="4"> A different, distribution free inductive principle that is related to the pac model of learning is the basis for the account developed here.</Paragraph>
    <Paragraph position="5"> In an instance of the agnostic variant of pac learning (Val84; Hau92; KSS94), a learner is given data elements (x, l) that are sampled according to some fixed but arbitrary distribution D on X x {0, 1}. X is the instance space and I E {0, 1} is the label 1. D may simply reflect the distribution of the data as it occurs &amp;quot;in nature&amp;quot; (including contradictions) without assuming that the labels are generated according to some &amp;quot;rule&amp;quot;. Given a sample, the goal of the learning algorithm is to eventually output a hypothesis h from some hypothesis class 7/ that closely approximates the data. The 1The model can be extended to deal with any discrete or continuous range of the labels.</Paragraph>
    <Paragraph position="6"> true error of the hypothesis h is defined to be errorD(h) = Pr(x,O~D\[h(x) 7~ If, and the goal of the (agnostic) pac learner is to compute, for any distribution D, with high probability (&gt; 1 -5), a hypothesis h E 7/ with true error no larger than ~ + inffhenerrorD(h).</Paragraph>
    <Paragraph position="7"> In practice, one cannot compute the true error errorD(h). Instead, the input to the learning algorithm is a sample S = {(x i,l)}i=li m of m labeled examples and the learner tries to find a hypothesis h with a small empirical error errors(h) = I{x e Slh(x) C/ l}l/ISl, and hopes that it behaves well on future examples.</Paragraph>
    <Paragraph position="8"> The hope that a classifier learned from a training set will perform well on previously unseen examples is based on the basic inductive principle underlying learning theory (Val84; Vap95) which, stated informally, guarantees that if the training and the test data are sampled from the same distribution, good performance on large enough training sample guarantees good performance on the test data (i.e., good &amp;quot;true&amp;quot; error). Moreover, the quality of the generalization is inversely proportional to the expressivity of the class 7-/. Equivalently, for a fixed sample size IsI, the quantified version of this principle (e.g. (Hau92)) indicates how much can one count on a hypothesis selected according to its performance on S. Finally, notice the underlying assumption that the training and test data are sampled from the same distribution; this framework addresses this issue. (See (GR99).) In our discussion functions learned over the instance space X are not defined directly over the raw instances but rather over a transformation of it to a feature space. A feature is an indicator function X : X ~ {0, 1} which defines a subset of the instance space - all those elements in X which are mapped to 1 by X- X denotes a class of such functions and can be viewed as a transformation of the instance space; each example (Xl,... xn) E X is mapped to an example (Xi,...Xlxl) in the new space. We sometimes view a feature as an indicator function over the labeled instance space X x {0, 1} and say that X(x, l) = 1 for examples x E x(X) with label l.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="3" type="metho">
    <SectionTitle>
3 Explaining Probabilistic Methods
</SectionTitle>
    <Paragraph position="0"> Using the abovementioned inductive principle we describe a learning theory account that explains the success and robustness of statistics based classifiers (Rot99a). A variety of methods used for learning in NL are shown to make their prediction using Linear Statistical Queries (LSQ) hypotheses. This is a family of linear predictors over a set of features which are directly related to the independence assumptions of the probabilistic model assumed. The success of these classification methods is then shown to be due to the combination of two factors: * Low expressive power of the derived classifier.</Paragraph>
    <Paragraph position="1"> * Robustness properties shared by all linear statistical queries hypotheses.</Paragraph>
    <Paragraph position="2"> Since the hypotheses are computed over a feature space chosen so that they perform well on training data, learning theory implies that they perform well on previously unseen data, irrespective of whether the underlying probabilistic assumptions hold.</Paragraph>
    <Section position="1" start_page="0" end_page="3" type="sub_section">
      <SectionTitle>
3.1 Robust Learning
</SectionTitle>
      <Paragraph position="0"> This section defines a learning algorithm and a class of hypotheses with some generalization properties, that capture many probabilistic learning methods used in NLP. The learning algorithm is a Statistical Queries(SQ) algorithm (Kea93). An SQ algorithm can be viewed as a learning algorithm that interacts with its environment in a restricted way. Rather than viewing examples, the algorithm only requests the values of various statistics on the distribution of the examples to construct its hypothesis. (E.g. &amp;quot;What is the probability that a randomly chosen example (x, l) has xi = 0 and l = 1&amp;quot;?) A statistical query has the form IX, l, 7-\], where X 6 X is a feature, l 6 {0, 1} is a further (optional) restriction imposed on the query and ~&amp;quot; is an error parameter. A call to the SQ oracle returns an estimate ~n of \[x,z,~\]</Paragraph>
      <Paragraph position="2"> which satisfies \]15x - Px\] &lt; T. (We usually omit T and/or l from this notation.) A statistical queries algorithm is a learning algorithm that constructs its hypothesis only using information received from an SQ oracle. An algorithm is said to use a query space X if it only makes queries of the form \[X, l, T\] where X 6 A'. An SQ algorithm is said to be a good learning algorithm if, with high probability, it outputs a hypothesis h with small error, using sample size that is polynomial in the relevant parameters.</Paragraph>
      <Paragraph position="3"> Given a query \[X, l, T\] the SQ oracle is simulated by drawing a large sample S of labeled examples (x, l) according to D and evaluating Prs\[x(x, l)\] = I{(x, l) : X(x, l) = ll}/ISl.</Paragraph>
      <Paragraph position="4"> Chernoff bounds guarantee that the nUmber of examples required to achieve tolerance T with probability at least 1 - 5 is polynomial in 1/T and log 1/5. (See (Zea93; Dec93; AD95)).</Paragraph>
      <Paragraph position="5"> Let X be a class of features and f : {0, 1} a function that depends only on the values ~D for E X. Given x 6 X, a Linear Statis- \[x,~\] X tical Queries (LSQ) hypothesis predicts l argmaxte{o,1} ~xeX ^ D = f\[xj\] ({P\[x,z\] })&amp;quot; X(x). Clearly, the LSQ is a linear discriminator over the feature space A', with coefficients f that are computed given (potentially all) the values ^D P\[x,t\]&amp;quot; The definition generalizes naturally to non-binary classifiers; in this case, the discriminator between predicting l and other values is linear. A learning algorithm that outputs an LSQ hypothesis is called an LSQ algorithm.</Paragraph>
      <Paragraph position="6"> Example 3.1 The naive Bayes predictor (DH73) is derived using the assumption that given the label l E L the features' values are statistically independent. Consequently, the Bayes optimal prediction is given by:</Paragraph>
      <Paragraph position="8"> where Pr(1) denotes the prior probability of l (the fraction of examples labeled l) and Pr(xill) are the conditional feature probabilities (the fraction of the examples labeled l in which the ith feature has value xi). Therefore, we get: Claim: The naive Bayes algorithm is an LSQ algorithm over a set ,.~ which consists of n + 1 features: X0 --- 1, Xi -- xi for i = 1,...,n and where f\[1J\]O = log/5\[~z\],, and f\[x,J\]O = ^D ^D logP\[x,,l\]/P\[1,l\], i = 1,... ,n.</Paragraph>
      <Paragraph position="9"> The observation that the LSQ hypothesis is linear over X' yields the first generalization property of LSQ. VC theory implies that the VC dimension of the class of LSQ hypotheses is bounded above by IXI. Moreover, if the LSQ hypothesis is sparse and does not make use of unobserved features in X (as in Ex. 3.1) it is bounded by the number of features used (Rot00). Together with the basic generalization property described above this implies: Corollary 3.1 For LSQ, the number of training examples required in order to maintain a specific generalization performance guarantee scales linearly with the number o/features used.</Paragraph>
      <Paragraph position="10">  The robustness property of LSQ can be cast for the case in which the hypothesis is learned using a training set sampled according to a distribution D, but tested over a sample from D ~. It still performs well as long as the distributional distance d(D, D') is controlled (Rot99a; Rot00).</Paragraph>
      <Paragraph position="11"> Theorem 3.2 Let .A be an SQ(T, X) learning algorithm for a function class ~ over the distribution D and assume that d(D, D I) &lt; V (for V inversely polynomial in T). Then .A is also an SQ(T, ,~') learning algorithm for ~ over D I.</Paragraph>
      <Paragraph position="12"> Finally, we mention that the robustness of the algorithm to different distributions depends on the sample size and the richness of the feature class 2C/ plays an important role here. Therefore, for a given size sample, the use of simpler features in the LSQ representation provides better robustness. This, in turn, can be traded off with the ability to express the learned function with an LSQ over a simpler set of features.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.2 Additional Examples
</SectionTitle>
      <Paragraph position="0"> In addition to the naive Bayes (NB) classifier described above several other widely used probabilistic classifiers can be cast as LSQ hypotheses. This property is maintained even if the NB predictor is generalized in several ways, by allowing hidden variables (GR00) or by assuming a more involved independence structure around the target variable. When the structure is modeled using a general Bayesian network (since we care only about predicting a value for a single variable having observed the others) the Bayes optimal predictor is an LSQ hypothesis over features that are polynomials X = IIxilxi2 * .. xik of degree that depends on the number of neighbors of the target variable. A specific case of great interest to NLP is that of hidden Markov Models. In this case there are two types of variables, state variables S and observed ones, O (Rab89).</Paragraph>
      <Paragraph position="1"> The task of predicting the value of a state variable given values of the others can be cast as an LSQ, where X C {S, O, 1} x {S, O, 1}, a suitably defined set of singletons and pairs of observables and state variables (Rot99a).</Paragraph>
      <Paragraph position="2"> Finally, Maximum Entropy (ME) models (Jay82; Rat97) are also LSQ models. In this framework, constrains correspond to features; the distribution (and the induced classifier) are defined in terms of the expected value of the features over the training set. The induced classifter is a linear classifier whose weights are derived from these expectations; the weights axe computed iteratively (DR72) since no closed form solution is known for the optimal values.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="3" end_page="4" type="metho">
    <SectionTitle>
4 Learning Linear Classifiers
</SectionTitle>
    <Paragraph position="0"> It was shown in (Rot98) that several other learning approaches widely used in NL work also make their predictions by utilizing a linear representation. The SNoW learning architecture (Rot98; CCRR99; MPRZ99) is explicitly presented this way, but this holds also for methods that are presented in different ways, and some effort is required to cast them this way. These include Brill's transformation based method (Bri95) 2, decision lists (Yax94) and back-off estimation (Kat87; CB95).</Paragraph>
    <Paragraph position="1"> Moreover, even memory-based methods (ZD97; DBZ99) can be cast in a similar way (Rot99b). They can be reformulated as feature-based algorithms, with features of types that are commonly used by other features-based learning algorithms in NLP; the prediction computed by MBL can be computed by a linear function over this set of features.</Paragraph>
    <Paragraph position="2"> Some other methods that have been recently used in language related applications, such as Boosting (CS99) and support vector machines are also making use of the same representation.</Paragraph>
    <Paragraph position="3"> At a conceptual level all learning methods are therefore quite similar. They transform the original input (e.g., sentence, sentence+pos information) to a new, high dimensional, feature space, whose coordinates are typically small conjunctions (n-grams) over the original input.</Paragraph>
    <Paragraph position="4"> In this new space they search for a linear function that best separates the training data, and rely on the inductive principle mentioned to yield good behavior on future data. Viewed this way, methods are easy to compare and analyze for their suitability to NL applications and future extensions, as we sketch below.</Paragraph>
    <Paragraph position="5"> The goal of blowing up the instance space to a high dimensional space is to increase the expressivity of the classifier so that a linear function could represent the target concepts. Within this space, probabilistic methods are the most limited since they do not actually search in the eThis holds only in cases in which the TBL conditions do not depend on the labels, as in Context Sensitive Spelling (MB97) and Prepositional Phrase Attachment (BR94) and not in the general case.</Paragraph>
    <Paragraph position="6">  space of linear functions. Given the feature space they directly compute the classifier. In general, even when a simple linear function generates the training data, these methods are not guaranteed to be consistent with it (Rot99a).</Paragraph>
    <Paragraph position="7"> However, if the feature space is chosen so that they are, the robustness properties shown above become significant. Decision lists and MBL methods have advantages in their ability to represent exceptions and small areas in the feature space. MBL, by using long and very specialized conjunctions (DBZ99) and decision lists, due to their functional form - a linear function with exponentially decreasing weights - at the cost of predicting with a single feature, rather than a combination (Go195). Learning methods that attempt to find the best linear function (relative to some loss function) are typically more flexible. Of these, we highlight here the SNoW architecture, which has some specific advantages that favor NLP-like domains.</Paragraph>
    <Paragraph position="8"> SNoW determines the features' weights using an on-line algorithm that attempts to minimize the number of mistakes on the training data using a multiplicative weight update rule (Lit88). The weight update rule is driven by the maximum entropy principle (KW95). The main implication is that SNoW has significant advantages in sparse spaces, those in which a few of the features are actually relevant to the target concept, as is typical in NLP. In domains with these characteristics, for a given number of training examples, SNoW generalizes better than additive update methods like perceptron and its close relative SVMs (Ros58; FS98) (and in general,it has better learning curves).</Paragraph>
    <Paragraph position="9"> Furthermore, although in SNoW the transformation to a large dimensional space needs to be done explicitly (rather than via kernel functions as is possible in perceptron and SVMs) its use of variable size examples nevertheless gives it computational advantages, due to the sparse feature space in NLP applications. It is also significant for extensions to relational domain mentioned later. Finally, SNoW is a multi-class classifier.</Paragraph>
  </Section>
class="xml-element"></Paper>