<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1065">
  <Title>STUDIES IN PART OF SPEECH LABELLING</Title>
  <Section position="4" start_page="0" end_page="332" type="metho">
    <SectionTitle>
TAG PART OF SPEECH
</SectionTitle>
    <Paragraph position="0"> Predicting the part of speech of a word is one straightforward way to use probabilities. Many words are several ways ambiguous, such as the following: a round table: adjective a round of cheese: noun to round out your interests: verb to work the year round: adverb Even in context, part of speech can be ambiguous, as in the famous example: &amp;quot;Time flies.&amp;quot; where both words are two ways ambiguous, resulting in two grammatical interpretations as sentences.</Paragraph>
    <Paragraph position="1"> Models predicting part of speech can serve to cut down the search space a parser must consider in processing known words and make the selection among alternatives more accurate. Furthermore, they can be used as one input to more complex strategies for inferring lexical and semantic information about unknown words.</Paragraph>
    <Section position="1" start_page="0" end_page="331" type="sub_section">
      <SectionTitle>
2.1 The n-gram model
</SectionTitle>
      <Paragraph position="0"> If we want to determine the most likely syntactic part of speech or tag for each word in a sentence, we can formulate a probabilisfic tagging model. Let us assume that we want to know the most likely tag sequence, T, given a particular word sequence, W. Using Bayes' rule we can write the a posteriori probability of tag sequence T given word sequence as rcrl</Paragraph>
      <Paragraph position="2"> where P(T) is the a priori probability of tag sequence T, P(WIT) is the conditional probability of word sequence W occurring given that a sequence of tags T occurred, and P(W) is the unconditioned probability of word sequence W. Then, in principle, we can consider all possible tag sequences, evaluate the a posteriori probability of each, and choose the one that is highest. Since W is the same for all hypothesized tag sequences, we can disregard P(W).</Paragraph>
      <Paragraph position="3"> We can rewrite the probability of each sequence as a product of the conditional probabilities of each word or tag given all of the previous tags.</Paragraph>
      <Paragraph position="5"> Now, we can make the approximation that each tag depends only the immediately preceding tags (say the two preceding tags for a tri-tag model), and that the word depends only on the tag.</Paragraph>
      <Paragraph position="7"> That is, once we know the tag that will be used, we gain no further information about the likely word from knowing the previous tags or words. This model is called a Markov model, and the assumption is frequently called the Markov independence assumption.</Paragraph>
      <Paragraph position="8"> If we have sufficient training data then we can estimate the tag n-gram sequence probabilities and the probability of each word given a tag (lexical probabilities). We use robust estimation techniques that take care of the cases of unobserved events (i.e. sequences of tags that have not occurred in the training data). However, in real-world problems, we also are likely to have words that were never observed at all in the training data. The model given above can still be used, simply by defining a generic new word called &amp;quot;unknown-word&amp;quot;. The system can then guess at the tag of the unknown word primarily using the tag sequence probabilities. We return to the problem of unknown words in Section 3.</Paragraph>
      <Paragraph position="9"> Using a tagged corpus to train the model is called &amp;quot;supervised training&amp;quot;, since a human has prepared the correct training data. We conducted supervised training to derive both a bi-tag and a tri-tag model based on a corpus from the University of Pennsylvania. The UPenn corpus, which was created as part of the TREEBANK project (Santorini 1990) consists of Wall Street Journal (WSJ) articles. Each word or punctuation mark has been tagged with one of 47 parts of speech 5, as shown in the following example: Terms/NNS were/VBD not/RB disclosed/VBN. /.6 A bi-tag model predicts the relative likelihood of a particular tag given the preceding tag, e.g. how likely is the tag VBD on the second word in the above example, given that the previous word was tagged NNS. A tri-tag model predicts the relative likelihood of a particular tag given the two preceding tags, e.g. how likely is the tag RB on the third word in the above example, given that the two previous words were tagged NNS and VBD. While the bi-tag model is faster at processing time, the tri-tag model has a lower error rate.</Paragraph>
      <Paragraph position="10"> 5 Of the 47 parts of speech, 36 are word tags and 11 punctuation tags. Of the word tags, 22 are tags for open class words and 14 for closed class words.</Paragraph>
      <Paragraph position="11"> 6 NNS is plural noun; VBD is past tense verb; RB is adverbial; VBN is past participle verb.</Paragraph>
      <Paragraph position="12"> The algorithm for supervised training is straightforward.</Paragraph>
      <Paragraph position="13"> One counts for each possible pair of tags, the number of times that the pair was followed by each possible third tag, and then derived from those counts a probabilistic tri-tag model. One also estimates from the training data the conditional probability of each particular word given a known tag (e.g., how likely is the word &amp;quot;terms&amp;quot; if the tag is NNS); this is called the &amp;quot;word emit&amp;quot; probability. The probabilities were padded to avoid setting the probability for unseen M-tags or unseen word senses to zero.</Paragraph>
      <Paragraph position="14"> Given these probabilities, one can then find the most likely tag sequence for a given word sequence. Using the Viterbi algorithm, we selected the path whose overall probability was highest, and then took the tag predictions from that path. We replicated the result (Church 1988) that this process is able to predict the parts of speech with only a 3-4% error rate when the possible parts of speech of each the words in the corpus are known. This is in fact about the rate of discrepancies among human taggers on the TREEBANK project (Marcus, Santorini &amp; Magerman 1990).</Paragraph>
    </Section>
    <Section position="2" start_page="331" end_page="332" type="sub_section">
      <SectionTitle>
2.2 Quantity of training data
</SectionTitle>
      <Paragraph position="0"> While supervised training is shown here to be very effective, it requires a correctly ta~ed corpus. We have done some experiments to quantify how much tagged data is really necessary.</Paragraph>
      <Paragraph position="1"> In these experiments, we demonstrated that the training set can, in fact, be much smaller than might have been expected. One rule of thumb suggests that the training set needs to be large enough to contain on average 10 instances of each type of tag sequence in order for their probabilities to be estimated with reasonable accuracy. This would imply that a M-tag model using 47 possible parts of speech would need a bit more than a million words of training. However, we found that much less training data was necessary.</Paragraph>
      <Paragraph position="2"> It can be shown that a good estimate of the probability of a new event is the sum of the probability of all the events that occurred just once. However, if the average number of tokens of each event that as been observed is 10, then the lower bound on the probability of new events is 1/10. Thus the likelihood of a new tri-gram is fairly low. In a M-gram model of part of speech, an event is a particular sequence of tags. While theoretically the set of possible events is all permutations of the tags, in practice only a relatively small number of tag sequences actually occur. We found only 6,170 unique triples in our training set, out of a possible 97,000. This would suggest that only 60,000 words would be sufficient for training.</Paragraph>
      <Paragraph position="3"> In our experiments, the error rate for a supervised tri-tag model increased only from 3.30% to 3.87% when the size of the training set was reduced from 1 million words to 64,000 words. This is probably because most of the possible tri-tag sequences never actually appear. All that is  really necessary, recalling the rule of thumb, is enough training to allow for 10 of each of the tag sequences that do OCCUr.</Paragraph>
      <Paragraph position="4"> This result is applicable to new tag sets, subdomains, or languages. By beginning with a measure of the number of events that actually occur in the data, we can more precisely determine the amount of data needed to train the probabilistic models. In applications such as tagging, where a significant number of the theoretically possible events do not occur in practice, we can use supervised training of probabilistic models without needing prohibitively large corpora.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="332" end_page="333" type="metho">
    <SectionTitle>
3. UNKNOWN WORDS
</SectionTitle>
    <Paragraph position="0"> Sources of open-ended text, such as a newswire, present natural language processing technology with a major challenge: what to do with words the system has never seen before. Current technology depends on handcrafted linguistic and domain knowledge. For instance, the system that performed most successfully in the evaluation of software to extract data from text at the 2nd Message Understanding Conference held at the Naval Ocean Systems Center, June, 1989, would simply halt processing a sentence when a new word was encountered.</Paragraph>
    <Paragraph position="1"> Determining the part of speech of an unknown word can help the system to know how the word functions in the sentence, for instance, that it is a verb stating an action or state of affairs, that it is a common noun stating a class of persons, places, or things, that it is a proper noun naming a particular person, place, or thing, etc. If it can do that well, then more precise classification and understanding is feasible.</Paragraph>
    <Paragraph position="2"> Using the UPenn set of parts of speech, unknown words can be in any of the 22 open-class parts of speech. The tri-tag model can be used to estimate the most probable one. Random choice among the 22 open classes would be expected to show an error rate for new words of 95%. The best previously reported error rate was 75% (Kuhn &amp; de Moil 1990).</Paragraph>
    <Paragraph position="3"> In our first tests using the tri-tag model we showed an error rate of only 51.6%. However, this model only took into account the context of the word, and no information about the word itself. In many languages, including English, the word endings give strong indicators of the part of speech. Furthermore, capitalization information, when available, can help to indicate whether a word is a proper noun.</Paragraph>
    <Paragraph position="4"> We developed a probabilistic model that takes into account features of the word in determining the likelihood of the word given a part of speech. This was used instead of the &amp;quot;word emit&amp;quot; probabilities for known words that the system obtained from training. To develop the model, we first determined the features we thought would distinguish parts of speech. There are four independent 7 categories of features: inflectional endings, denvational endings, hyphenation, and  capitalization. Our initial test had three inflectional endings (-ed, -s, -ing), and 32 denvational endings, (including -ion, al, -ive, -ly). Capitalization has four values, in our system (+ initial + capitalized, - initial + capitalized, etc.) in order  to take into account the first word of a sentence. We can incorporate these features of the word into the probability that this particular word will occur given a particular tag using p(wj I t i = p(unknown-word I ti) * p(Capital - feature I t i) * p(endings/hyph I t i ) We estimate the probability of each ending for each tag based on the training data. While these probabilities are not strictly independent, the approximation is good enough to make a marked difference in classification of unknown words. As the results in Figure 1 shows, the use of the orthographic endings of the words reduces the error rate on the unknown words by a factor of 3.</Paragraph>
    <Paragraph position="5"> We tested capitalization separately, since some data, such as that in the Third Message Understanding Conference is upper case only. Titles and bibliographies will cause similar distortions in a system trained on mixed case and using capitalization as a feature. Interestingly, the capitalization feature contributed very little to the reduction in error rates, whereas using the word features contributed a great deal.</Paragraph>
    <Paragraph position="6">  them as such for our tests.</Paragraph>
    <Paragraph position="7">  In sum, adding a probability model of typical endings of words to the tri-tag model has yielded an accuracy of 82% for unknown words. Adding a model of capitalization to the other two models further increased the accuracy to 85%. The total effect of BBN's model has been a reduction of a factor of five in the error rate of the best previously reported performance.</Paragraph>
  </Section>
  <Section position="6" start_page="333" end_page="333" type="metho">
    <SectionTitle>
4. K-BEST TAG SETS
</SectionTitle>
    <Paragraph position="0"> An alternative mode of running POST is to return the set of most likely tags for each word, rather than a single tag for each.</Paragraph>
    <Paragraph position="1"> In our first test, the system returned the sequence of most likely tags for the sentence. This has the advantage of eliminating ambiguity; however, even with a rather low error rate of 3.7%, there are cases in which the system returns the wrong tag, which can be fatal for a parsing system trying to deal with sentences averaging more than 20 words in length.</Paragraph>
    <Paragraph position="2"> We addressed this problem by adding the ability of the tagger to return for each word an ordered list of tags, marked by their probability using the Forward Backward algorithm. The Forward Backward algorithm is normally used in unsupervised training to estimate the model that finds the maximum likelihood of the parameters of that model. We use it in determining the k-best tags for each word by computing for each tag the probability of the tag occurring at that position and dividing by the probability of the word sequence given this model.</Paragraph>
    <Paragraph position="3"> The following example shows k-best tagging output, with the correct tag for each word marked in bold. Note that the probabilities are in natural log base e. Thus for each difference of 1, there is a factor of 2.718 in the probability. Bailey Controls, based in Wickliffe Ohio, makes computerized industrial controls systems.</Paragraph>
    <Paragraph position="4"> Bailey (NP. -1.17) (RB. -1.35) (FW. -2.32) (NN. -2.93) (NPS. -2.95) (JJS. -3.06) (JJ. -3.31) (LS.-3.41) (JJR.</Paragraph>
    <Paragraph position="5"> -3.70) (NNS.-3.73) (VBG.-3.91)...</Paragraph>
    <Paragraph position="6"> Controls (VBZ.-0.19) (NNS. -1.93) (NPS. -3.75) (NP. - null In two of the words (&amp;quot;Controls&amp;quot; and &amp;quot;computerized&amp;quot;) the first tag is not the correct one. However, in all instances the correct tag is included in the set. Note the first word, &amp;quot;Bailey&amp;quot;, is unknown to the system, therefore, all of the open class tags are possible.</Paragraph>
    <Paragraph position="7"> In order to reduce the ambiguity further, we tested various ways to limit how many tags were returned based on their probabilities. Often one tag is very likely and the others, while possible, are given a low probability, as in the word &amp;quot;in&amp;quot; above. Therefore, we tried removing all tags whose probability is more than e 2 less likely than the most likely tag. So only tags within the threshold 2.0 of the most likely would be included (i.e. if the most likely tag had a log probability of -0.19, only tags with a log probability greater than -2.19 would be included). This reduced the ambiguity for known words from 1.93 tags per word to 1.23, and for unknown words, from 15.2 to 2.0.</Paragraph>
    <Paragraph position="8"> However, the negative side of using cut offs is that the correct tag may be excluded. Note that a cut off of 2.0 would exclude the correct tag for the word &amp;quot;Controls&amp;quot; above. By changing the cut off to 4.0, we are sure to include all the correct tags in this example, but the ambiguity for known words raises from 1.23 to 1.24 and for unknown words from 2.0 to 3.7, for an ambiguity rating of 1.57 overall.</Paragraph>
    <Paragraph position="9"> We are continuing experiments to determine the most effective way of limiting the number of tags returned, and hence decreasing ambiguity, while ensuring that the correct tag is likely to be in the set.</Paragraph>
  </Section>
  <Section position="7" start_page="333" end_page="334" type="metho">
    <SectionTitle>
5. MOVING TO A NEW DOMAIN
</SectionTitle>
    <Paragraph position="0"> In all of the tests discussed so far, we both trained and tested on sets of articles in the same domain, the Wall Street Journal texts used in the Penn Treebank Project. However, an important measure of the usefulness of the system is how well it performs in other domains. While we would not expect high performance in radically different kinds of text, such as transcriptions of conversations or technical manuals, we would hope for similar performance on newspaper articles from different sources and on other topics.</Paragraph>
    <Paragraph position="1"> We tested this hypothesis using data from the Third Message Understanding Conference (MUC-3). The goal of MUC-3 is to extract data from texts on terrorism in Latin American countries. The texts are mainly newspaper articles, although there are some transcriptions of interviews and speeches. The University of Pennsylvania TREEBANK project tagged four hundred MUC messages (approximately 100,000 words), which we divided into 90% training and 10% testing.</Paragraph>
    <Paragraph position="2">  For our first test, we used the original probability tables trained on the Wall Street Journal articles. We then retrained the probabilities on the MUC messages and ran a second test, with an average improvement of three percentage points in both bi- and tri- tags. The full results are shown below: BITAGS: TEST 1 TEST 2 Overall error rate: 8.5 5.6  While the results using the new tables are an improvement in these first-best tests, we saw the best results using K-best mode, which obtained a .7% error rate. We ran several tests using our K-best algorithm with various thresholds. As described in Section 4, the threshold limits how many tags are returned based on their probabilities. While this reduces the ambiguity compared to considering all possibilities, it also increases the error rate. Figure 4 shows this tradeoff from effectively no threshold, thresholds for K- Best on the right hand side of the g~aph (shown in the figure as a threshold of 12), which has a .7% error rate and an ambiguity of 3, through a cut off of 2, which has a error rate of 2.9, but an ambiguity of nearly zero--i.e, one tag pre word. (Note the far left of the graph is the error rate for a cut off of 0, that is, only consideering the first of the k-best tags, which is approximately the same as the bi-tag error rate shown'in Figure 3.)</Paragraph>
  </Section>
  <Section position="8" start_page="334" end_page="335" type="metho">
    <SectionTitle>
6. USING DICTIONARIES
</SectionTitle>
    <Paragraph position="0"> In all of the results reported here, we are using word/part of speech tables derived from training, rather than on-line dictionaries to determine the possible tags for a given word.</Paragraph>
    <Paragraph position="1"> The advantage of the tables is that the training provides the probability of a word given a tag, whereas the dictionary makes no distinctions between common and uncommon uses of a word. The disadvantage of this is that uses of a word that did not occur in the ~aining set will be unknown to the system. For example, in the training portion of the WSJ corpus, the word &amp;quot;put&amp;quot; only occurred as verb.</Paragraph>
    <Paragraph position="2"> However, in our test set, it occurred as a noun in the compound &amp;quot;put option&amp;quot;. Since for efficiency reasons, we only consider those tags known to be possible for a word, this will cause an error.</Paragraph>
    <Paragraph position="3"> We are currently integrating on-line dictionaries into the system, so that alternative word senses will be considered,  while still not opening the set of tags considered for a known word to all open class tags. This will not completely eliminate the problem, since words are often used in novel ways, as in this example from a public radio plea for funds: &amp;quot;You can Mastercard your pledge.&amp;quot;. We will be rerunning the experiments reported here to evaluate the effect of using on-line dictionaries.</Paragraph>
  </Section>
class="xml-element"></Paper>