<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2017">
  <Title>A Classification Approach to Word Prediction*</Title>
  <Section position="3" start_page="124" end_page="126" type="metho">
    <SectionTitle>
2 Information Sources and Features
</SectionTitle>
    <Paragraph position="0"> Our goal is to learn a representation for each word in terms of features which characterize the syntactic and semantic context in which the word tends to appear. Our features are defined as simple relations over a collection of predicates that capture (some of) the information available in a sentence.</Paragraph>
    <Section position="1" start_page="124" end_page="124" type="sub_section">
      <SectionTitle>
2.1 Information Sources
</SectionTitle>
      <Paragraph position="0"> Definition 1 Let s =&lt; wl,w2,...,wn &gt; be a sentence in which wi is the i-th word. Let :PS be a collection of predicates over a sentence s. IS(s)) 1, the Information source(s) available for the sentence s is a representation ors as a list of predicates I E :r, XS(S) = {II(Wll , ...Wl,),-.., /~g(W~l , ..-Wk,)}.</Paragraph>
      <Paragraph position="1"> Ji is the arity of the predicate Ij.</Paragraph>
      <Paragraph position="2"> Example 2 Let s be the sentence &lt; John, X, at, the, clock, to, see, what, time, it, is &gt; Let ~={word, pos, subj-verb}, with the interpretation that word is a unary predicate that returns the value of the word in its domain; pos is a unary predicate that returns the value of the pos of the word in its domain, in the context of the sentence; subj- verb is a binary predicate that returns the value of the two words in its domain if the second is a verb in the sentence and the first is its subject; it returns C/ otherwise. Then,</Paragraph>
      <Paragraph position="4"> The IS representation of s consists only of the predicates with non-empty values. E.g., pos(w6) = modal is not part of the IS for the sentence above.</Paragraph>
      <Paragraph position="5"> subj - verb might not exist at all in the IS even if the predicate is available, e.g., in The ball was given to Mary.</Paragraph>
      <Paragraph position="6"> Clearly the IS representation of s does not contain all the information available to a human reading s; it captures, however, all the input that is available to the computational process discussed in the rest of this paper. The predicates could be generated by any external mechanism, even a learned one. This issue is orthogonal to the current discussion.</Paragraph>
    </Section>
    <Section position="2" start_page="124" end_page="125" type="sub_section">
      <SectionTitle>
2.2 Generating Features
</SectionTitle>
      <Paragraph position="0"> Our goal is to learn a representation for each word of interest. Most efficient learning methods known today and, in particular, those used in NLP, make use of a linear decision surface over their feature space (Roth, 1998; Roth, 1999). Therefore, in order to learn expressive representations one needs to compose complex features as a function of the information sources available. A linear function expressed directly in terms of those will not be expressive enough. We now define a language that allows 1We denote IS(s) as IS wherever it is obvious what the referred sentence we is, or whenever we want to indicate Information Source in general.</Paragraph>
      <Paragraph position="1">  one to define &amp;quot;types&amp;quot; of features 2 in terms of the information sources available to it.</Paragraph>
      <Paragraph position="2"> Definition 3 (Basic Features) Let I E Z be a k-ary predicate with range R. Denote w k = (Wjl,... , wjk). We define two basic binary relations as follows. For a e R we define:</Paragraph>
      <Paragraph position="4"> Features, which are defined as binary relations, can be composed to yield more complex relations in terms of the original predicates available in IS.</Paragraph>
      <Paragraph position="5"> Definition 4 (Composing features) Let fl, f2 be feature definitions. Then fand(fl, f2) for(f1, f2) fnot(fl) are defined and given the usual semantic:</Paragraph>
      <Paragraph position="7"> In order to learn with features generated using these definitions as input, it is important that features generated when applying the definitions on different ISs are given the same identification. In this presentation we assume that the composition operator along with the appropriate IS element (e.g., Ex. 2, Ex. 9) are written explicitly as the identification of the features. Some of the subtleties in defining the output representation are addressed in (Cumby and Roth, 2000).</Paragraph>
    </Section>
    <Section position="3" start_page="125" end_page="125" type="sub_section">
      <SectionTitle>
2.3 Structured Features
</SectionTitle>
      <Paragraph position="0"> So far we have presented features as relations over IS(s) and allowed for Boolean composition operators. In most cases more information than just a list of active predicates is available. We abstract this using the notion of a structural information source (SIS(s)) defined below. This allows richer class of feature types to be defined.</Paragraph>
      <Paragraph position="1"> 2We note that we do not define the features will be used in the learning process. These are going to be defined in a data driven way given the definitions discussed here and the input ISs. The importance of formally defining the &amp;quot;types&amp;quot; is due to the fact that some of these are quantified. Evaluating them on a given sentence might be computationally intractable and a formal definition would help to flesh out the difficulties and aid in designing the language (Cumby and Roth, 2000).</Paragraph>
    </Section>
    <Section position="4" start_page="125" end_page="126" type="sub_section">
      <SectionTitle>
2.4 Structured Instances
</SectionTitle>
      <Paragraph position="0"> Definition 5 (Structural Information Source) Let s =&lt; wl,w2, ...,Wn &gt;. SIS(s)), the Structural Information source(s) available for the sentence s, is a tuple (s, E1,... ,Ek) of directed acyclic graphs with s as the set of vertices and Ei 's, a set of edges in s.</Paragraph>
      <Paragraph position="1"> Example 6 (Linear Structure) The simplest SIS is the one corresponding to the linear structure of the sentence. That is, SIS(s) = (s,E) where (wi, wj) E E iff the word wi occurs immediately before wj in the sentence (Figure 1 bottom left part).</Paragraph>
      <Paragraph position="2"> In a linear structure (s =&lt; Wl,W2,...,Wn &gt;,E), where E = {(wi,wi+l);i = 1, ...n- 1}, we define the chain c(wj, \[l, r\]) = {w,_,,..., wj,.., n s.</Paragraph>
      <Paragraph position="3"> We can now define a new set of features that makes use of the structural information. Structural features are defined using the SIS. When defining a feature, the naming of nodes in s is done relative to a distinguished node, denoted wp, which we call the focus word of the feature. Regardless of the arity of the features we sometimes denote the feature f defined with respect to wp as f(wp).</Paragraph>
      <Paragraph position="4"> Definition 7 (Proximity) Let SIS(s) = (s, E) be the linear structure and let I E Z be a k-ary predicate with range R. Let Wp be a focus word and C = C(wp, \[l, r\]) the chain around it. Then, the proximity features for I with respect to the chain C are defined as:</Paragraph>
      <Paragraph position="6"> The second type of feature composition defined using the structure is a collocation operator.</Paragraph>
      <Paragraph position="7"> Definition 8 (Collocation) Let fl,...fk be feature definitions, col locc ( f l , f 2, . . . f k ) is a restricted conjunctive operator that is evaluated on a chain C of length k in a graph. Specifically, let C = {wj,, wj=, .. . , wjk } be a chain of length k in SIS(s). Then, the collocation feature for fl,.., fk with respect to the chain C is defined as collocc(fl, . . . , fk) = { 1 ifVi = 1,...k, fi(wj,) = 1 0 otherwise (4) The following example defines features that are used in the experiments described in Sec. 4.</Paragraph>
      <Paragraph position="8">  Example 9 Let s be the sentence in Example 2. We define some of the features with respect to the linear structure of the sentence. The word X is used as the focus word and a chain \[-10, 10\] is defined with respect to it. The proximity features are defined with respect to the predicate word. We get, for example: fc(word) ---- John; fc(word) = at; fc(word) = clock. Collocation features are defined with respect to a chain \[-2, 2\] centered at the focus word X. They are defined with respect to two basic features fl, f2 each of which can be either f(word, a) or f(pos, a). The resulting features include, for example: collocc(word, word)= {John- X}; collocc(word, word) = {X - at}; collocc(word, pos) = {at- DET}.</Paragraph>
    </Section>
    <Section position="5" start_page="126" end_page="126" type="sub_section">
      <SectionTitle>
2.5 Non-Linear Structure
</SectionTitle>
      <Paragraph position="0"> So far we have described feature definitions which make use of the linear structure of the sentence and yield features which are not too different from standard features used in the literature e.g., n-grams with respect to pos or word can be defined as colloc for the appropriate chain. Consider now that we are given a general directed acyclic graph G = (s, E) on the the sentence s as its nodes. Given a distinguished focus word wp 6 s we can define a chain in the graph as we did above for the linear structure of the sentence. Since the definitions given above, Def. 7 and Def. 8, were given for chains they would apply for any chain in any graph. This generalization becomes interesting if we are given a graph that represents a more involved structure of the sentence.</Paragraph>
      <Paragraph position="1"> Consider, for example the graph DG(s) in Figure 1. DG(s) described the dependency graph of the sentence s. An edge (wi,wj) in DG(s) represent a dependency between the two words. In our feature generation language we separate the information provided by the dependency grammar 3 to two parts. The structural information, provided in the left side of Figure 1, is used to generate SIS(s).</Paragraph>
      <Paragraph position="2"> The labels on the edges are used as predicates and are part of IS(s). Notice that some authors (Yuret, 1998; Berger and Printz, 1998) have used the structural information, but have not used the information given by the labels on the edges as we do.</Paragraph>
      <Paragraph position="3"> The following example defines features that are used in the experiments described in Sec. 4.</Paragraph>
      <Paragraph position="4"> Example 10 Let s be the sentence in Figure 1 along with its IS that is defined using the predicates word, pos, sub j, obj, aux_vrb. A sub j-verb 3This information can be produced by a functional dependency grammar (FDG), which assigns each word a spe- cific function, and then structures the sentence hierarchically based on it, as we do here (Tapanainen and Jrvinen, 1997), but can also be generated by an external rule-based parser or a learned one.</Paragraph>
      <Paragraph position="5"> feature, fsubj-verb, can be defined as a collocation over chains constructed with respect to the focus word join. Moreover, we can define fsubj-verb to be active also when there is an aux_vrb between the subj and verb, by defining it as a disjunction of two collocation features, the sub j-verb and the subj-aux_vrb-verb. Other features that we use are conjunctions of words that occur before the focus verb (here: join) along all the chains it occurs in (here: will, board, as) and collocations of obj and verb.</Paragraph>
      <Paragraph position="6"> As a final comment on feature generation, we note that the language presented is used to define &amp;quot;types&amp;quot; of features. These are instantiated in a data driven way given input sentences. A large number of features is created in this way, most of which might not be relevant to the decision at hand; thus, this process needs to be followed by a learning process that can learn in the presence of these many features.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="126" end_page="127" type="metho">
    <SectionTitle>
3 The Learning Approach
</SectionTitle>
    <Paragraph position="0"> Our experimental investigation is done using the SNo W learning system (Roth, 1998). Earlier versions of SNoW (Roth, 1998; Golding and Roth, 1999; Roth and Zelenko, 1998; Munoz et al., 1999) have been applied successfully to several natural language related tasks. Here we use SNo W for the task of word prediction; a representation is learned for each word of interest, and these compete at evaluation time to determine the prediction.</Paragraph>
    <Section position="1" start_page="126" end_page="127" type="sub_section">
      <SectionTitle>
3.1 The SNoW Architecture
</SectionTitle>
      <Paragraph position="0"> The SNo W architecture is a sparse network of linear units over a common pre-defined or incrementally learned feature space. It is specifically tailored for learning in domains in which the potential number of features might be very large but only a small subset of them is actually relevant to the decision made.</Paragraph>
      <Paragraph position="1"> Nodes in the input layer of the network represent simple relations on the input sentence and are being used as the input features. Target nodes represent words that are of interest; in the case studied here, each of the word candidates for prediction is represented as a target node. An input sentence, along with a designated word of interest in it, is mapped into a set of features which are active in it; this representation is presented to the input layer of SNoW and propagates to the target nodes. Target nodes are linked via weighted edges to (some of) the input features. Let At = {Q,... , i,~} be the set of features that are active in an example and are linked to the target node t. Then the linear unit corresponding to</Paragraph>
      <Paragraph position="3"> where w~ is the weight on the edge connecting the ith feature to the target node t, and Ot is the threshold  for the target node t. In this way, SNo W provides a collection of word representations rather than just discriminators.</Paragraph>
      <Paragraph position="4"> A given example is treated autonomously by each target subnetwork; an example labeled t may be treated as a positive example by the subnetwork for t and as a negative example by the rest of the target nodes. The learning policy is on-line and mistake-driven; several update rules can be used within SNOW. The most successful update rule is a variant of Littlestone's Winnow update rule (Littlestone, 1988), a multiplicative update rule that is tailored to the situation in which the set of input features is not known a priori, as in the infinite attribute model (Blum, 1992). This mechanism is implemented via the sparse architecture of SNOW.</Paragraph>
      <Paragraph position="5"> That is, (1) input features are allocated in a data driven way - an input node for the feature i is allocated only if the feature i was active in any input sentence and (2) a link (i.e., a non-zero weight) exists between a target node t and a feature i if and only if i was active in an example labeled t.</Paragraph>
      <Paragraph position="6"> One of the important properties of the sparse architecture is that the complexity of processing an example depends only on the number of features active in it, na, and is independent of the total number of features, nt, observed over the life time of the system. This is important in domains in which the total number of features is very large, but only a small number of them is active in each example.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="127" end_page="129" type="metho">
    <SectionTitle>
4 Experimental Study
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="127" end_page="128" type="sub_section">
      <SectionTitle>
4.1 Task definition
</SectionTitle>
      <Paragraph position="0"> The experiments were conducted with four goals in mind:  1. To compare mistake driven algorithms with naive Bayes, trigram with backoff and a simple maximum likelihood estimation (MLE) baseline. null 2. To create a set of experiments which is comparable with similar experiments that were previously conducted by other researchers. 3. To build a baseline for two types of extensions of the simple use of linear features: (i) Non-Linear features (ii) Automatic focus of attention. 4. To evaluate word prediction as a simple language model.</Paragraph>
      <Paragraph position="1">  We chose the verb prediction task which is similar to other word prediction tasks (e.g.,(Golding and Roth, 1999)) and, in particular, follows the paradigm in (Lee and Pereira, 1999; Dagan et al., 1999; Lee, 1999). There, a list of the confusion sets is constructed first, each consists of two different verbs. The verb vl is coupled with v2 provided that they occur equally likely in the corpus. In the test set, every occurrence of vl or v2 was replaced by a set {vl, v2} and the classification task was to predict the correct verb. For example, if a confusion set is created for the verbs &amp;quot;make&amp;quot; and &amp;quot;sell&amp;quot;, then the data is altered as follows: Once target subnetworks have been learned and the network is being evaluated, a decision support mechanism is employed, which selects the dominant active target node in the SNoW unit via a winner-take-all mechanism to produce a final prediction. SNoW is available publicly at http ://L2R. cs. uiuc. edu/- cogcomp, html.</Paragraph>
      <Paragraph position="2"> make the paper --+ {make,sell} the paper sell sensitive data --~ {make,sell} sensitive data The evaluated predictor chooses which of the two verbs is more likely to occur in the current sentence. In choosing the prediction task in this way, we make sure the task in difficult by choosing between  competing words that have the same prior probabilities and have the same part of speech. A further advantage of this paradigm is that in future experiments we may choose the candidate verbs so that they have the same sub-categorization, phonetic transcription, etc. in order to imitate the first phase of language modeling used in creating candidates for the prediction task. Moreover, the pretransformed data provides the correct answer so that (i) it is easy to generate training data; no supervision is required, and (ii) it is easy to evaluate the results assuming that the most appropriate word is provided in the original text.</Paragraph>
      <Paragraph position="3"> Results are evaluated using word-error rate (WER). Namely, every time we predict the wrong word it is counted as a mistake.</Paragraph>
    </Section>
    <Section position="2" start_page="128" end_page="128" type="sub_section">
      <SectionTitle>
4.2 Data
</SectionTitle>
      <Paragraph position="0"> We used the Wall Street Journal (WSJ) of the years 88-89. The size of our corpus is about 1,000,000 words. The corpus was divided into 80% training and 20% test. The training and the test data were processed by the FDG parser (Tapanainen and Jrvinen, 1997). Only verbs that occur at least 50 times in the corpus were chosen. This resulted in 278 verbs that we split into 139 confusion sets as above. After filtering the examples of verbs which were not in any of the sets we use 73, 184 training examples and 19,852 test examples.</Paragraph>
    </Section>
    <Section position="3" start_page="128" end_page="128" type="sub_section">
      <SectionTitle>
4.3 Results
</SectionTitle>
      <Paragraph position="0"> In order to test the advantages of different feature sets we conducted experiments using the following  features sets: 1. Linear features: proximity of window size 4-10 words, conjunction of size 2 using window size 4-2. The conjunction combines words and parts of speech.</Paragraph>
      <Paragraph position="1"> 2. Linear + Non linear features: using the lin null ear features defined in (1) along with non linear features that use the predicates sub j, obj, word, pos, the collocations subj-verb, verb-obj linked to the focus verb via the graph structure and conjunction of 2 linked words. The over all number of features we have generated for all 278 target verbs was around 400,000. In all tables below the NB columns represent results of the naive Bayes algorithm as implemented within SNoW and the SNoW column represents the results of the sparse Winnow algorithm within SNOW.</Paragraph>
      <Paragraph position="2"> Table 1 summarizes the results of the experiments with the features sets (1), (2) above. The baseline experiment uses MLE, the majority predictor. In addition, we conducted the same experiment using trigram with backoff and the WER is 29.3%. From  and non-linear features these results we conclude that using more expressive features helps significantly in reducing the WER. However, one can use those types of features only if the learning method handles large number of possible features. This emphasizes the importance of the new learning method.</Paragraph>
      <Paragraph position="3">  achieved using similarity methods (Dagan et al., 1999) and using the methods presented in this paper. Results are shown in percentage of improvement in accuracy over the baseline. Table 2 compares our method to methods that use similarity measures (Dagan et al., 1999; Lee, 1999). Since we could not use the same corpus as in those experiments, we compare the ratio of improvement and not the WER. The baseline in this studies is different, but other than that the experiments are identical. We show an improvement over the best similarity method. Furthermore, we train using only 73,184 examples while (Dagan et al., 1999) train using 587, 833 examples. Given our experience with our approach on other data sets we conjecture that we could have improved the results further had we used that many training examples.</Paragraph>
    </Section>
    <Section position="4" start_page="128" end_page="129" type="sub_section">
      <SectionTitle>
4.4 Focus of attention
</SectionTitle>
      <Paragraph position="0"> SNoW is used in our experiments as a multi-class predictor - a representation is learned for each word in a given set and, at evaluation time, one of these is selected as the prediction. The set of candidate words is called the confusion set (Golding and Roth, 1999). Let C be the set of all target words. In previous experiments we generated artificially subsets of size 2 of C in order to evaluate the performance of our methods. In general, however, the question of determining a good set of candidates is interesting in it own right. In the absence, of a good method, one might end up choosing a verb from among a larger set of candidates. We would like to study the effects this issue has on the performance of our method.</Paragraph>
      <Paragraph position="1"> In principle, instead of working with a single large confusion set C, it might be possible to,split C into subsets of smaller size. This process, which we call the focus of attention (FOA) would be beneficial only if we can guarantee that, with high probability,  given a prediction task, we know which confusion set to use, so that the true target belongs to it. In fact, the FOA problem can be discussed separately for the training and test stages.</Paragraph>
      <Paragraph position="2">  1. Training: Given our training policy (Sec. 3) every positive example serves as a negative example to all other targets in its confusion set. For a large set C training might become computationally infeasible.</Paragraph>
      <Paragraph position="3"> 2. Testing: considering only a small set of words  as candidates at evaluation time increases the baseline and might be significant from the point of view of accuracy and efficiency.</Paragraph>
      <Paragraph position="4"> To evaluate the advantage of reducing the size of the confusion set in the training and test phases, we conducted the following experiments using the same features set (linear features as in Table 1).</Paragraph>
      <Paragraph position="5">  Error Rate for Training and testing using all the words together against using pairs of words.</Paragraph>
      <Paragraph position="6"> &amp;quot;Train All&amp;quot; means training on all 278 targets together. &amp;quot;Test all&amp;quot; means that the confusion set is of size 278 and includes all the targets. The results shown in Table 3 suggest that, in terms of accuracy, the significant factor is the confusion set size in the test stage. The effect of the confusion set size on training is minimal (although it does affect training time). We note that for the naive Bayes algorithm the notion of negative examples does not exist, and therefore regardless of the size of confusion set in training, it learns exactly the same representations. Thus, in the NB column, the confusion set size in training makes no difference.</Paragraph>
      <Paragraph position="7"> The application in which a word predictor is used might give a partial solution to the FOA problem.</Paragraph>
      <Paragraph position="8"> For example, given a prediction task in the context of speech recognition the phonemes that constitute the word might be known and thus suggest a way to generate a small confusion set to be used when evaluating the predictors.</Paragraph>
      <Paragraph position="9"> Tables 4,5 present the results of using artificially simulated speech recognizer using a method of general phonetic classes. That is, instead of transcribing a word by the phoneme, the word is transcribed by the phoneme classes(Jurafsky and Martin, 200).</Paragraph>
      <Paragraph position="10"> Specifically, these experiments deviate from the task definition given above. The confusion sets used are of different sizes and they consist of verbs with different prior probabilities in the corpus. Two sets of experiments were conducted that use the phonetic transcription of the words to generate confusion sets.  Error Rate for Training and testing with confusion sets determined based on phonetic classes (PC) from a simulated speech recognizer. null In the first experiment (Table 4), the transcription of each word is given by the broad phonetic groups to which the phonemes belong i.e., nasals, fricative, etc. 4. For example, the word &amp;quot;b_u_y&amp;quot; is transcribed using phonemes as &amp;quot;b_Y&amp;quot; and here we transcribe it as &amp;quot;P_VI&amp;quot; which stands for &amp;quot;Plosive_Vowell&amp;quot;. This partition results in a partition of the set of verbs into several confusions sets. A few of these confusion sets consist of a single word and therefore have 100% baseline, which explains the high baseline.</Paragraph>
      <Paragraph position="11">  Error Rate for Training and testing with confusion sets determined based on phonetic classes (PC) from a simulated speech recognizer. In this case only confusion sets that have less than 98% baseline are used, which explains the overall lower baseline.</Paragraph>
      <Paragraph position="12"> Table 5 presents the results of a similar experiment in which only confusion sets with multiple words were used, resulting in a lower baseline.</Paragraph>
      <Paragraph position="13"> As before, Train All means that training is done with all 278 targets together while Train PC means that the PC confusion sets were used also in training. We note that for the case of SNOW, used here with the sparse Winnow algorithm, that size of the confusion set in training has some, although small, effect. The reason is that when the training is done with all the target words, each target word representation with all the examples in which it does not occur are used as negative examples. When a smaller confusion set is used the negative examples are more likely to be &amp;quot;true&amp;quot; negative.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>