<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1069">
  <Title>Towards Understanding Text with a Very Large Vocabulary</Title>
  <Section position="2" start_page="354" end_page="354" type="metho">
    <SectionTitle>
2. Probabilistic Part of Speech
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="354" end_page="354" type="sub_section">
      <SectionTitle>
Models
</SectionTitle>
      <Paragraph position="0"> One straightforward way to use probabilities is in assigning parts of speech to words. Models predicting part of speech can serve to cut down the search space a parser must consider in processing known words and can be used as one input to more complex strategies for inferring lexical and semantic information about unknown words. We have explored the use of such models in both contexts.</Paragraph>
      <Paragraph position="1"> Simple but powerful models to predict part of speech can be derived using a corpus that has been tagged (or labelled) as to part of speech [Church 1988; de Marken 19901. Using a tagged corpus to train the model is called &amp;quot;supervised training&amp;quot;, since a human has prepared the correct training data. This is in contrast to &amp;quot;unsupervised training&amp;quot; where the process is fully automated. For example, in unsupervised part of speech tagging, one can use a corpus without annotation for training, a dictionary that lists parts of speech for the most frequently occurring words, and an initial probability assignment, e.g., a uniform probability distribution or probability estimates from a previous, related domain. An iterative procedure then revises the probability estimates so as to maximize the probability over the whole corpus.</Paragraph>
      <Paragraph position="2"> Our supervised training experiments used a tri-tag model based on a corpus from the University of Pennsylvania consisting of Wall Street Journal articles in which each word or punctuation mark has been tagged with one of 47 parts of speech, as shown in the following example: A tri-tag model predicts the relative likelihood of a particular tag given the two preceding tags, e.g. how likely is the tag RB on the third word in the above example, given that the two previous words were tagged NNS and VBD.</Paragraph>
      <Paragraph position="3"> Using the UPenn corpus, we counted for each possible pair of tags, the number of times that the pair was followed by each possible third tag, and then derived from those counts a probabilistic tri-tag model. We also estimated from the training data the conditional probability of each particular word given a known tag (e.g., how likely is the work to be &amp;quot;terms&amp;quot; if the tag is NNS); this is called the &amp;quot;word emit&amp;quot; probability. Both of these probability estimates usedpadding to an arbitrary estimate to avoid setting the probability for unseen tri-tags or unseen word senses to zero.</Paragraph>
      <Paragraph position="4"> Given these probabilities, one can then predict the maximum-likelihood tag sequence for a given word sequence. Using the tri-tag probabilities, we computed the probabilities of all possible paths in the tag space through the sentence, selected the path whose overall probability was highest, and then took the tag predictions from that path.</Paragraph>
      <Paragraph position="5"> We replicated the result [Church 19881 that this process is able to predict the parts of speech with only a 3-5% error rate when the possible parts of speech of the words are known. We believe that this error rate could be reduced still further and extend the success to unknown words.</Paragraph>
      <Paragraph position="6"> Using the UPenn set of parts of speech, unknown words can be in any of the 22 open-class parts of speech. The tri-tag model can be used to estimate the most probable one.</Paragraph>
      <Paragraph position="7"> While random choice among the 22 open classes would be expected to show an error rate for new words of 91.5%, our initial results using the model showed an error rate of only 51.6%. The best previously reported error rate was 75% [Kuhn &amp; de Mori 19901. Note that the error rate should be reduced even further by using more knowledge, such as capitalization knowledge and morphology.</Paragraph>
      <Paragraph position="8"> While supervised training is shown here to be very effective, it requires a correctly tagged corpus. We have done some experiments to quantify how much tagged data is really necessary, and to suggest ways to handle new words when using such models.</Paragraph>
      <Paragraph position="9"> In these experiments, we demonstrated that the training set can, in fact, be much smaller than might have been expected. One rule of thumb suggests that the training set needs to be large enough to contain 10 instances of each type of tag sequence in order for their probabilities to be estimated with reasonable accuracy. This would imply that a tri-tag model using 47 possible parts of speech would need a bit more than a million words of training. However, we found that much less training data was necessary, as illustrated in Figure 1.</Paragraph>
      <Paragraph position="11"/>
    </Section>
  </Section>
  <Section position="3" start_page="354" end_page="354" type="metho">
    <SectionTitle>
Figure 1: Part-of-speech error rate as a function of the size of the training set
</SectionTitle>
    <Paragraph position="0"> In our experiments, the error rate for a supervised tri-tag model increased only from 3.30% to 3.87% when the size of the training set was reduced from 1 million words to 64,000 words. This is probably because most of the possible tri-tag sequences never actually appear. All that is really necessary, recalling the rule of thumb, is enough training to allow for 10 of each of the tag sequences that do occur. There were 16,170 unique mples in our training set, so the rule of thumb would suggest that 160,000 words would be sufficient training. This would explain why the degradation in performance was slight when the size of the corpus was reduced. The benefits of probabilistic modeling therefore seem applicable to new tag sets, subdomains, or languages without needing prohibitively large corpora.</Paragraph>
  </Section>
  <Section position="4" start_page="354" end_page="356" type="metho">
    <SectionTitle>
3. Probabilistic Language Model
</SectionTitle>
    <Paragraph position="0"> Probabilities can also quantify the likelihoods of alternative complete interpretations of a sentence. In these experiments, we used the grammar of the Delphi component from BBN's HARC system \[Stallard 1989\], which combines syntax and semantics in a unification formalism. We developed a context-free model, which estimates the probability of each rule in the grammar independently (in contrast to a context-sensitive model, such as the tri-tag model described above, which bases the probability of a tag on what other tags are in the adjacent context).</Paragraph>
    <Paragraph position="1"> In our context-free model, we associate a probability with each rule of the grammar. For each distinct major category (left-hand side) of the grammar, there is a set of context-free</Paragraph>
    <Paragraph position="3"> For each rule, we estimate the probability of the right-hand side given the left-hand side.</Paragraph>
    <Paragraph position="4"> The probability of a syntactic structure S, given the input string W, is then modelled by the product of the probabilities of the rules used in S. (\[Chitrao &amp; Grishman 1990\] used a similar context-free model.) Using this model, we explored the following issues:  What method of training the rule probabilities should be employed? How much (little) training data is required for reliable estimates? * How is system performance impacted? * Do the results suggest refinements in the probability model?  Our intention is to use the Treebank corpus being developed at the University of Pennsylvania as a source of correct structures for training. However, until that material becomes available, we have run initial experiments using small training sets taken from an existing question-answering corpus of sentences about a personnel database. To our surprise, we found that as little as 100 sentences of supervised training (in which a person, using graphical tools, identifies the correct parse) is sufficient to improve the ranking of the interpretations found. In our tests, the NLP system produces all interpretations satisfying all syntactic and semantic constraints. From that set, the intended interpretation must be chosen. The context-free probability model reduced the error rate on an independent test set by a factor of two to four, compared to random selection from the interpretations satisfying all knowledge-based constraints.</Paragraph>
    <Paragraph position="5"> We tested the predictive power of rule probabilities using this model both in unsupervised and in supervised mode. In the former case, the input is all parse trees (whether correct or not) for the sentences in the training set. In the latter case, the training data included a specification of the correct parse as hand picked by the grammar's author from among the parse trees produced by the system.</Paragraph>
    <Paragraph position="6"> The detailed results from using a training set of 81 sentences appear in the histogram in Figure 2.</Paragraph>
    <Paragraph position="7">  The &amp;quot;best possible&amp;quot; error rates for each test indicates the percentage of cases for which none of the interpretations produced by the system was judged correct, so that no selection scheme could achieve a lower error rate than that. The &amp;quot;chance&amp;quot; score gives the error rate that would be expected with random selection from all interpretations produced. The &amp;quot;test&amp;quot; column shows the error rate with the supervised or unsupervised probability model in question. The first supervised test had an 81.4% improvement, and the second a 50.8% improvement, and the third a 56% improvement. These results state how much better than chance the given model did as a percentage of the maximum possible improvement.</Paragraph>
    <Paragraph position="8"> We expect to improve the model's performance by recording probabilities for other features in addition to just the set of rules involved in producing them. For example, in the grammar used for this test, two different attachments for a prepositional phrase produced trees with the same set of rules, but differing in shape. Thus the simple, context-free model based on the product of rule probabilities could not capture preferences concerning such attachment. By adding to the model probabilities for such additional features, we expect that the power of the probabilisfic model to  automatically select the correct parse can be substantially increased.</Paragraph>
  </Section>
  <Section position="5" start_page="356" end_page="356" type="metho">
    <SectionTitle>
4. Learning Lexical Syntax
</SectionTitle>
    <Paragraph position="0"> One purpose for probabilistic models is to contribute to handling new words or partially understood sentences. We have done preliminary experiments that show that there is promise in learning lexical syntactic and semantic features from context when probabilistic tools are used to help control the ambiguity.</Paragraph>
    <Paragraph position="1"> In our experiments, we used a corpus of sentences each with one word that the system did not know. To create the corpus, we began with a corpus of sentences known to parse from the personnel question-answering domain (our goal, again, is to use the Treebank data from the University of Pennsylvania for such training when it becomes available).</Paragraph>
    <Paragraph position="2"> We then replaced one word in each sentence with an undefmed word.</Paragraph>
    <Paragraph position="3"> For example, in the following sentence, the word &amp;quot;contact&amp;quot; is undefined in the system: Who in Division Four is the contact for MIT? That word has both a noun and a verb part of speech; however, the pattern of parts of speech of the words surrounding &amp;quot;contact&amp;quot; causes the tri-tag model to return a high probability that the word is a noun. Using unification variables for all possible features of a noun, the parser produces multiple parses. Applying the context-free rule probabilities to select the most probable of the resulting parses allows the system to conclude both syntactic and semantic facts about &amp;quot;contact&amp;quot;. Syntactically, the system discovers that it is a count noun, with third person singular agreement. Semantically, the system learns (from the use of who) that contact is in the semantic class PERSONS.</Paragraph>
    <Paragraph position="4"> Furthermore, the partially-specified semantic representation for the sentence as a whole also shows the semantic relation to SCHOOLS, which is expressed here by the for phrase. Thus, even a single use of an unknown word in context can supply useful data about its syntactic and semantic features.</Paragraph>
    <Paragraph position="5"> Probalistic modelling plays a key role in this process.</Paragraph>
    <Paragraph position="6"> While context sensitive techniques for inferring lexical features can contribute a great deal, they can still leave substantial ambiguity. As a simple example, suppose the word &amp;quot;list&amp;quot; is undefined in the sentence &amp;quot;List the employees.&amp;quot; The tri-tag model predicts both a noun and a verb part of speech in that position. Using an underspecified noun sense combined with the usual definitions for the rest of the words yields no parses. However, an underspecified verb sense yields three parses, differing in the subcategorization frame of the verb &amp;quot;list&amp;quot;. For more complex sentences, even with this very limited protocol, the number of parses for the appropriate word sense can reach into the hundreds.</Paragraph>
    <Paragraph position="7"> Using the rule probabilities acquired through supervised training (described in the previous section), the likelihood of the ambiguous interpretations resulting from a sentence with an unknown word was computed. Then we tested whether the tree ranked most highly matched the tree previously selected by a person as the correct one. This tree equivalence test was based on the trees' smcture and on the rule applied at each node; while an underspecified tree might have some less-specified feature values than the chosen fully-specified tree, it would still be equivalent in the sense above.</Paragraph>
    <Paragraph position="8"> Of 160 inputs with an unknown word, in 130 cases the most likely tree matched the correct one, for an error rate of 18.75%, while picking at random would have resulted in an error rate of 63.14%, for an improvement by better than a factor of 3. This suggests that probabilistic modeling can be a powerful tool for controlling the high degree of ambiguity in efforts to automatically acquire lexical data.</Paragraph>
    <Paragraph position="9"> We have also begun to explore heuristics for combining lexical data for a single word acquired from a number of partial parses. There are some cases in which the best approach is to unify the two learned sets of lexical features, so that the derived sense becomes the sum of the information learned from the two examples. For instance, the verb subcategorization information learned from one example could be thus combined with agreement information learned from another. On the other hand, there are many cases, including alternative subcategorization frames, where each of the encountered options needs to be included as separate alternatives.</Paragraph>
  </Section>
  <Section position="6" start_page="356" end_page="356" type="metho">
    <SectionTitle>
5. Conclusions
</SectionTitle>
    <Paragraph position="0"> In trying to address the problems inherent in understanding text using very large vocabularies, we found that the use of probabilistic models was crucial in obtaining useful results.</Paragraph>
    <Paragraph position="1"> The three main problems addressed by this paper were (1) reducing ambiguity resulting from multiple parts of speech, (2) reducing parse ambiguity, and (3) learning lexical information of new words encountered in the text.</Paragraph>
    <Paragraph position="2"> Using supervised training for tri-tag probabilities, we achieved a 3-5% error rate on a test set in picking the correct part of speech. Our experiments showed that a smaller training set than previously expected (64,000 words rather than 1 million) was needed in order to achieve a good level of performance.</Paragraph>
    <Paragraph position="3"> For reducing interpretation ambiguity, our context-free probability model, trained in supervised mode on only 81 sentences, was able to reduce the error rate for selecting the correct parse on independent test sets by a factor of 2-4.</Paragraph>
    <Paragraph position="4"> For the problem of processing new words in the text, the tri-tag model reduced the error rate for picking the correct part of speech for such words from 91.5% to 51.6%. And once the possible parts of speech for a word are known (or hypothesized using the tri-tag model), the probabilistic language model proved useful in indicating which parses (obtained using the unknown word) should be looked at for learning more complex lexical information about the word.</Paragraph>
  </Section>
class="xml-element"></Paper>