<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3604">
  <Title>All-word prediction as the ultimate confusable disambiguation</Title>
  <Section position="4" start_page="25" end_page="27" type="metho">
    <SectionTitle>
2 Data preparation and experimental setup
</SectionTitle>
    <Paragraph position="0"> setup First, we identify the textual corpora used. We then describe the general experimental setup of learning curve experiments, and the IGTREE decision-tree induction algorithm used throughout all experiments. null</Paragraph>
    <Section position="1" start_page="25" end_page="26" type="sub_section">
      <SectionTitle>
2.1 Data
</SectionTitle>
      <Paragraph position="0"> To generate our word prediction examples, we used the &amp;quot;Reuters Corpus Volume 1 (English Language, 1996-08-20 to 1997-08-19)&amp;quot;1. We tokenized this corpus with a rule-based tokenizer, and used all 130,396,703 word and punctuation tokens for experimentation. In the remainder of the article we make no difference between words and punctuation markers; both are regarded as tokens. We separated the final 100,000 tokens as a held-out test set, henceforth referred to as REUTERS, and kept the rest as training set, henceforth TRAIN-REUTERS.</Paragraph>
      <Paragraph position="1"> Additionally, we selected two test sets taken from different corpora. First, we used the Project Gutenberg2 version of the novel Alice's Adventures in Wonderland by Lewis Carroll (Carroll, 1865), henceforth ALICE. As the third test set we selected all tokens of the Brown corpus part of the Penn Tree-bank (Marcus et al., 1993), a selected portion of the original one-million word Brown corpus (KuVcera and Francis, 1967), a collection of samples of American English in many different genres, from sources printed in 1961; we refer to this test set as BROWN.</Paragraph>
      <Paragraph position="2"> In sum, we have three test sets, covering texts from the same genre and source as the training data, a fictional novel, and a mix of genres wider than the training set.</Paragraph>
      <Paragraph position="3"> Table 1 summarizes the key training and test set statistics. As the table shows, the cross-domain coverages for unigrams and bigrams are rather low; not only are these numbers the best-case performance ceilings, they also imply that a lot of contextual information used by the machine learning method used in this paper will be partly unknown to the learner, especially in texts from other domains than the training set.</Paragraph>
      <Paragraph position="4">  in terms of numbers of tokens, and unigram and bi-gram coverage (%) of the training set on the test sets.</Paragraph>
    </Section>
    <Section position="2" start_page="26" end_page="26" type="sub_section">
      <SectionTitle>
2.2 Experimental setup
</SectionTitle>
      <Paragraph position="0"> All experiments described in this article take the form of learning curve experiments (Banko and Brill, 2001), in which a sequence of training sets is generated with increasing size, where each size training set is used to train a model for word prediction, which is subsequently tested on a held-out test set - which is fixed throughout the whole learning curve experiment. Training set sizes are exponentially grown, as earlier studies have shown that at a linear scale, performance effects tend to decrease in size, but that when measured with exponentially growing training sets, near-constant (i.e. log-linear) improvements are observed (Banko and Brill, 2001).</Paragraph>
      <Paragraph position="1"> We create incrementally-sized training sets for the word prediction task on the basis of the TRAIN-REUTERS set. Each training subset is created backward from the point at which the final 100,000-word REUTERS set starts. The increments are exponential with base number 10, and for every power of 10 we cut off training sets at n times that power, where n = 1,2,3,...,8,9 (for example, 10,20,...,80,90).</Paragraph>
      <Paragraph position="2"> The actual examples to learn from are created by windowing over all sequences of tokens. We encode examples by taking a left context window spanning seven tokens, and a right context also spanning seven tokens. Thus, the task is represented by a growing number of examples, each characterized by 14 positional features carrying tokens as values, and one class label representing the word to be predicted.</Paragraph>
      <Paragraph position="3"> The choice for 14 is intended to cover at least the superficially most important positional features. We assume that a word more distant than seven positions left or right of a focus word will almost never be more informative for the task than any of the words within this scope.</Paragraph>
    </Section>
    <Section position="3" start_page="26" end_page="27" type="sub_section">
      <SectionTitle>
2.3 IGTree
</SectionTitle>
      <Paragraph position="0"> IGTree (Daelemans et al., 1997) is an algorithm for the top-down induction of decision trees. It compresses a database of labeled examples into a lossless-compression decision-tree structure that preserves the labeling information of all examples, and technically should be named a trie according to (Knuth, 1973). A labeled example is a feature-value vector, where features in our study represent a sequence of tokens representing context, associated with a symbolic class label representing the word to be predicted. An IGTREE is composed of nodes that each represent a partition of the original example database, and are labeled by the most frequent class of that partition. The root node of the trie thus represents the entire example database and carries the most frequent value as class label, while end nodes (leafs) represent a homogeneous partition of the database in which all examples have the same class label. A node is either a leaf, or is a non-ending node that branches out to nodes at a deeper level of the trie. Each branch represents a test on a feature value; branches fanning out of one node test on values of the same feature.</Paragraph>
      <Paragraph position="1"> To attain high compression levels, IGTREE adopts the same heuristic that most other decision-tree induction algorithms adopt, such as C4.5 (Quinlan, 1993), which is to always branch out testing on the most informative, or most class-discriminative features first. Like C4.5, IGTREE uses information gain (IG) to estimate the most informative features.</Paragraph>
      <Paragraph position="2"> The IG of feature i is measured by computing the difference in uncertainty (i.e. entropy) between the situations without and with knowledge of the value of that feature with respect to predicting the class label: IGi = H(C)[?]summationtextv[?]Vi P(v)xH(C|v), where C is the set of class labels, Vi is the set of values for feature i, and H(C) = [?]summationtextc[?]C P(c)log2 P(c) is the entropy of the class labels. In contrast with C4.5, IGTREE computes the IG of all features once on the full database of training examples, makes a feature ordering once on these computed IG values, and uses this ordering throughout the whole trie.</Paragraph>
      <Paragraph position="3"> Another difference with C4.5 is that IGTREE does not prune its produced trie, so that it performs a lossless compression of the labeling information of the original example database. As long as the  database does not contain fully ambiguous examples (with the same features, but different class labels), the trie produced by IGTREE is able to reproduce the classifications of all examples in the original example database perfectly.</Paragraph>
      <Paragraph position="4"> Due to the fact that IGTREE computes the IG of all features once, it is functionally equivalent to IB1-IG (Daelemans et al., 1999), a k-nearest neighbor classifier for symbolic features, with k = 1 and using a particular feature weighting in the similarity function in which the weight of each feature is larger than the sum of all weights of features with a lower weight (e.g. as in the exponential sequence 1,2,4,8,... where 2 &gt; 1, 4 &gt; (1 + 2), 8 &gt; (1 + 2 + 4), etc.). Both algorithms will base their classification on the example that matches on most features, ordered by their IG, and guess a majority class of the set of examples represented at the level of mismatching. IGTREE, therefore, can be seen as an approximation of IB1-IG with k = 1 that has favorable asymptotic complexities as compared to IB1-IG.</Paragraph>
      <Paragraph position="5"> IGTREE's computational bottleneck is the trie construction process, which has an asymptotic complexity of O(nlg(v) f) of CPU, where n is the number of training examples, v is the average branching factor of IGTREE (how many branches fan out of a node, on average), and f is the number of features. Storing the trie, on the other hand, costs O(n) in memory, which is less than the O(n f) of IB1-IG. Classification in IGTREE takes an efficient O(f lg(v)) of CPU, versus the cumbersome worst-case O(n f) of IB1-IG, that is, in the typical case that n is much higher than f or v.</Paragraph>
      <Paragraph position="6"> Interestingly, IGTREE is functionally equivalent to back-off smoothing (Zavrel and Daelemans, 1997), with the IG of the features determining the order in which to back off, which in the case of word prediction tends to be from the outer context to the inner context of the immediately neighboring words.</Paragraph>
      <Paragraph position="7"> Like with probabilistic n-gram based models with a back-off smoothing scheme, IGTREE will prefer matches that are as exact as possible (e.g. matching on all 14 features), but will back-off by disregarding lesser important features first, down to a simple bigram model drawing on the most important feature, the immediately preceding left word. In sum, IGTREE shares its scaling abilities with n-</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="27" end_page="28" type="metho">
    <SectionTitle>
3 All-words prediction
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="27" end_page="28" type="sub_section">
      <SectionTitle>
3.1 Learning curve experiments
</SectionTitle>
      <Paragraph position="0"> The word prediction accuracy learning curves computed on the three test sets, and trained on increasing portions of TRAIN-REUTERS, are displayed in Figure 1. The best accuracy observed is 42.2% with 30 million training examples, on REUTERS. Apparently, training and testing on the same type of data yields markedly higher prediction accuracies than testing on a different-type corpus. Accuracies on BROWN are slightly higher than on ALICE, but the difference is small; at 30 million training examples, the accuracy on ALICE is 12.6%, and on BROWN 15.8%.</Paragraph>
      <Paragraph position="1"> A second observation is that all three learning curves are progressing upward with more training examples, and roughly at a constant log-linear rate.</Paragraph>
      <Paragraph position="2"> When estimating the rates after about 50,000 examples (before which the curves appear to be more volatile), with every tenfold increase of the number of training examples the prediction accuracy on REUTERS increases by a constant rate of about 8%, while the increases on ALICE and BROWN are both about 2% at every tenfold.</Paragraph>
    </Section>
    <Section position="2" start_page="28" end_page="28" type="sub_section">
      <SectionTitle>
3.2 Memory requirements and classification speed
</SectionTitle>
      <Paragraph position="0"> speed The numbers of nodes exhibit an interesting sublinear relation with respect to the number of training examples, which is in line with the asymptotic complexity order O(n), where n is the number of training instances. An increasingly sublinear amount of nodes is necessary; while at 10,000 training instances the number of nodes is 7,759 (0.77 nodes per instance), at 1 million instances the number of nodes is 652,252 (0.65 nodes per instance), and at 30 million instances the number of nodes is 15,956,878 (0.53 nodes per instance).</Paragraph>
      <Paragraph position="1"> A factor in classification speed is the average amount of branching. Conceivably, the word prediction task can lead to a large branching factor, especially in the higher levels of the tree. However, not every word can be the neighbor of every other word in finite amounts of text. To estimate the average branching factor of a tree we compute the fth root of the total number of nodes (f being the number of features, i.e. 14). The largest decision tree currently constructed is the one on the basis of a training set of 30 million examples, having 15,956,878 nodes. This tree has an average branching factor of 14[?]15,956,878 [?] 3.27; all other trees have smaller branching factors. Together with the fact that we have but 14 features, and the asymptotic complexity order of classification is O(f lg(v)), where v is the average branching factor, classification can be expected to be fast. Indeed, depending on the machine's CPU on which the experiment is run, we observe quite favorable classification speeds. Figure 2 displays the various speeds (in terms of the number of test tokens predicted per second) attained on the three test sets3. The best prediction accuracies are still attained at classification speeds of over a hundred predicted tokens per second. Two other relevant observations are that first, the classification speed hardly differs between the three test sets (BROWN is classified only slightly slower than the other two test sets), indicating that the classifier is spending a roughly comparable amount of searching through the decision trees regardless of genre differences. Second, the decrease in speed settles 3Measurements were made on a GNU/Linux x86-based machine with 2.0 Ghz AMD Opteron processors.</Paragraph>
      <Paragraph position="2">  number of classified test examples per second, measured on the three test sets, with increasing training examples. Both axes have a logarithmic scale.</Paragraph>
      <Paragraph position="3"> on a low log-linear rate after about one million examples. Thus, while trees grow linearly, and accuracy increases log-linearly, the speed of classification slowly diminishes at decreasing rates.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="28" end_page="30" type="metho">
    <SectionTitle>
4 Confusables
</SectionTitle>
    <Paragraph position="0"> Word prediction from context can be considered a very hard task, due to the many choices open to the predictor at many points in the sequence. Predicting content words, for example, is often only possible through subtle contextual clues or by having the appropriate domain or world knowledge, or intimate knowledge of the writer's social context and intentions. In contrast, certain function words tend to be predictable due to the positions they take in syntactic phrase structure; their high frequency tends to ensure that plenty of examples of them in context are available.</Paragraph>
    <Paragraph position="1"> Due to the important role of function words in syntactic structure, it can be quite disruptive for a parser and for human readers alike to encounter a mistyped function word that in its intended form is another function word. In fact, confusable errors between frequent forms occur relatively frequently. Examples of these so-called confusables in English are there versus their and the contraction they're; or the duo than and then. Confusables can arise from having the same pronunciation (homophones), or having very similar pronunciation (country or county) or spelling (dessert, desert), hav- null ing very close lexical semantics (as between among and between), or being inflections or case variants of the same stem (I versus me, or walk versus walks), and may stem from a lack of concentration or experience by the writer.</Paragraph>
    <Paragraph position="2"> Distinguishing between confusables is essentially the same task as word prediction, except that the number of alternative outcomes is small, e.g. two or three, rather than thousands or more. The typical application setting is also more specific: given that a writer has produced a text (e.g. a sentence in a word processor), it is possible to check the correctness of each occurrence of a word known to be part of a pair or triple of confusables.</Paragraph>
    <Paragraph position="3"> We performed a series of experiments on disambiguating nine frequent confusables in English adopted from (Golding and Roth, 1999). We employed an experimental setting in which we use the same experimental data as before, in which only examples of the confusable words are drawn - note that we ignore possible confusable errors in both training and test set. This data set generation procedure reduces the amount of examples considerably. Despite having over 130 million words in TRAIN-REUTERS, frequent words such as there and than occur just over 100,000 times. To be able to run learning curves with more than this relatively small amount of examples, we expanded our training material with the New York Times of 1994 to 2002 (henceforth TRAIN-NYT), part of the English Gigaword collection published by the Linguistic Data Consortium, offering 1,096,950,281 tokens.</Paragraph>
    <Paragraph position="4"> As a first illustration of the experimental outcomes, we focus on the three-way confusable there - their - they're for which we trained one classifier, which we henceforth refer to as a confusable expert. The learning curve results of this confusable expert are displayed in Figure 3 as the top three graphs. The logarithmic x-axis displays the full number of instances from TRAIN-REUTERS up to 130.3 million examples, and from TRAIN-NYT after this point. Counter to the learning curves in the all-words prediction experiments, and to the observation by (Banko and Brill, 2001), the learning curves of this confusable triple in the three different data sets flatten, and converge, remarkably, to a roughly similar score of about 98%. The convergence only occurs after examples from TRAIN-NYT are added.</Paragraph>
    <Paragraph position="5">  tion accuracy on deciding between the confusable pair there, their, and they're, by IGTREE trained on TRAIN-REUTERS, and tested on REUTERS, AL-ICE, and BROWN. The top graphs are accuracies attained by the confusable expert; the bottom graphs are attained by the all-words predictor trained on TRAIN-REUTERS until 130 million examples, and on TRAIN-NYT beyond (marked by the vertical bar).</Paragraph>
    <Paragraph position="6"> In the bottom of the same Figure 3 we have also plotted the word prediction accuracies on the three words there, their, and they're attained by the all-words predictor described in the previous section on the three test sets. The accuracies, or rather recall figures (i.e. the percentage of occurrences of the three words in the test sets which are correctly predicted as such), are considerably lower than those on the confusable disambiguation task.</Paragraph>
    <Paragraph position="7"> Table 2 presents the experimental results obtained on nine confusable sets when training and testing on Reuters material. The third column lists the accuracy (or recall) scores of the all-words word prediction system at the maximal training set size of 30 million labeled examples. The fourth columns lists the accuracies attained by the confusable expert for the particular confusable pair or triple, measured at 30 million training examples, from which each particular confusable expert's examples are extracted.</Paragraph>
    <Paragraph position="8"> The amount of examples varies for the selected confusable sets, as can be seen in the second column.</Paragraph>
    <Paragraph position="9"> Scores attained by the all-words predictor on these words vary from below 10% for relatively low-frequent words to around 60% for the more frequent confusables; the latter numbers are higher than the  set, attained by the all-words prediction classifier trained on 30 million examples of TRAIN-REUTERS, and by confusable experts on the same training set.</Paragraph>
    <Paragraph position="10"> The second column displays the number of examples of each confusable set in the 30-million word training set; the list is ordered on this column.</Paragraph>
    <Paragraph position="11"> overall accuracy of this system on REUTERS. Nevertheless they are considerably lower than the scores attained by the confusable disambiguation classifiers, while being trained on many more examples (i.e., all 30 million available). Most of the confusable disambiguation classifiers attain accuracies of well above 90%.</Paragraph>
    <Paragraph position="12"> When the learning curves are continued beyond TRAIN-REUTERS into TRAIN-NYT, about a thousand times as many training examples can be gathered as training data for the confusable experts. Table 3 displays the nine confusable expert's scores after being trained on examples extracted from a total of one billion words of text, measured on all three test sets. Apart from a few outliers, most scores are above 90%, and more importantly, the scores on ALICE and BROWN do not seriously lag behind those on REUTERS; some are even better.</Paragraph>
  </Section>
  <Section position="7" start_page="30" end_page="31" type="metho">
    <SectionTitle>
5 Related work
</SectionTitle>
    <Paragraph position="0"> As remarked in the cases reported in the literature directly related to the current article, word prediction is a core task to natural language processing, and one of the few that takes no annotation layer to provide data for supervised machine learning and probabilistic modeling (Golding and Roth, 1999; Even-Zohar  Accuracy on test set (%) Confusable set REUTERS ALICE BROWN cite - site - sight 100.0 100.0 69.0 accept - except 84.6 100.0 97.0 affect - effect 92.3 100.0 89.5 fewer - less 90.5 100.0 97.2 among - between 94.4 77.8 74.4 I - me 99.0 98.3 98.3 than - then 97.2 92.9 95.8 there - their - they're 98.1 97.8 97.3 to - too - two 94.3 93.4 92.9  set, attained by confusable experts trained on examples extracted from 1 billion words of text from TRAIN-REUTERS plus TRAIN-NYT, on the three test sets.</Paragraph>
    <Paragraph position="1"> and Roth, 2000; Banko and Brill, 2001). Our discrete, classificatio-nased approach has the same goal as probabilistic methods for language modeling for automatic speech recognition (Jelinek, 1998), and is also functionally equivalent to n-gram models with back-off smoothing (Zavrel and Daelemans, 1997). The papers by Golding and Roth, and Banko and Brill on confusable correction focus on the more common type of than/then confusion that occurs a lot in the process of text production. Both pairs of authors use the confusable correction task to illustrate scaling issues, as we have. Golding and Roth illustrate that multiplicative weight-updating algorithms such as Winnow can deal with immense input feature spaces, where for each single classification only a small number of features is actually relevant (Golding and Roth, 1999). With IGTREE we have an arguably competitive efficient, but one-shot learning algorithm; IGTREE does not need an iterative procedure to set weights, and can also handle a large feature space. Instead of viewing all positional features as containers of thousands of atomic word features, it treats the positional features as the basic tests, branching on the word values in the tree. More generally, as a precursor to the above-mentioned work, confusable disambiguation has been investigated in a string of papers discussing the application of various machine learning algorithms to the task (Yarowsky, 1994; Golding, 1995; Mangu  and Brill, 1997; Huang and Powers, 2001).</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML