<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0608">
  <Title>Improving POS Tagging Using Machine-Learning Techniques</Title>
  <Section position="5" start_page="53" end_page="53" type="metho">
    <SectionTitle>
2 Tree-based Taggers
</SectionTitle>
    <Paragraph position="0"> Decision trees have been successfully applied to a number of different NLP problems and, in particular, in POS tagging they have proven to be an efficient and compact way of capturing the relevant information for disambiguating. See (MSxquez, 1999) for a broad survey on this issue. null In this approach to tagging, the ambiguous words in the training corpus are divided into classes corresponding to the sets of tags they can take (i.e, 'noun-adjective', 'noun-adjectiveverb', etc.). These sets are called ambiguity classes and a decision tree is acquired for each of them. Afterwards, the tree-base is applied in a particular disambiguation algorithm.</Paragraph>
    <Paragraph position="1"> Regarding the learning algorithm, we use a particular implementation of a top-down induction of decision trees (TDIDT) algorithm, belonging to the supervised learning family. This algorithm is quite similar to the well-known CART (Breiman et al., 1984), and C4.5 (Quinlan, 1993), but it incorporates some particularities in order to better fit the domain at hand. Training examples are collected from annotated corpora and they consist of the target word to be disambiguated and some information of its local context in the sentence.</Paragraph>
    <Paragraph position="2"> All words not present in the training corpus are considered unknown. In principle, we have to assume that they can take any tag corresponding to open categories (i.e., noun, proper noun, verb, adjective, adverb, cardinal, etc.), which sum up to 20 in the Penn Treebank tagset. In this approach, an additional ambiguity class for unknown words is considered, and so, they are treated exactly in the same way as the other ambiguous words, except by the type of information used for acquiring the trees, which is enriched with a number of morphological features.</Paragraph>
    <Paragraph position="3"> Once the tree-model has been acquired, it can be used in many ways to disambiguate a real text. In the following sections, 2.1 and 2.2, we present two alternatives.</Paragraph>
    <Section position="1" start_page="53" end_page="53" type="sub_section">
      <SectionTitle>
2.1 RTT: A Reductionistic Tree-based
Tagger
</SectionTitle>
      <Paragraph position="0"> RTT is a reductionistic tagger in the sense of Constraint Grammars (Karlsson et al., 1995).</Paragraph>
      <Paragraph position="1"> In a first step a word-form frequency dictionary provides each input word with all possible tags with their associated lexical probability. After that, an iterative process reduces the ambiguity (discarding low probable tags) at each step until a certain stopping criterion is satisfied.</Paragraph>
      <Paragraph position="2"> More particularly, at each step and for each ambiguous word (at a sentence level) the work performed in parallel is: 1) The target word is &amp;quot;passed&amp;quot; through its corresponding decision tree; 2) The resulting probability distribution is used to multiplicatively update the probability distribution of the word; and 3) The tags with very low probabilities are filtered out.</Paragraph>
      <Paragraph position="3"> For more details, we refer the reader to (Mgrquez and Rodrfguez, 1997).</Paragraph>
    </Section>
    <Section position="2" start_page="53" end_page="53" type="sub_section">
      <SectionTitle>
2.2 STT: A Statistical Tree-based
Tagger
</SectionTitle>
      <Paragraph position="0"> The aim of statistical or probabilistic tagging (Church, 1988; Cutting et al., 1992) is to assign the most likely sequence of tags given the observed sequence of words. For doing so, two kinds of information are used: the lexical probabilities, i.e, the probability of a particular tag conditional on the particular word, and the contextual probabilities, which describe the probability of a particular tag conditional on the surrounding tags.</Paragraph>
      <Paragraph position="1"> Contextual (or transition) probabilities are usually reduced to the conditioning of the preceding tag (bigrams), or pair of tags (trigrams), however, the general formulation allows a broader definition of context. In this way, the set of acquired statistical decision trees can be seen as a compact representation of a rich contextual model, which can be straightforwardly incorporated inside a statistical tagger. The point here is that the context is not restricted to the n-1 preceding tags as in the n-gram formulation. Instead, it is extended to all the contex- null tual information used for learning the decision trees.</Paragraph>
      <Paragraph position="2"> The Viterbi algorithm (described for instance in (Deaose, 1988)),. in which n-gram probabilities are substituted by the application of the corresponding decision trees, allows the calculation of the most-likely sequence of tags with a linear cost on the sequence length. However, one problem appears when applying conditionings on the right context of the target word, since the disambiguation proceeds from left to right and, so, the right hand side words may be ambiguous. Although dynamic programming can be used to Calculate the most likely sequence of tags to the right (in a forward-backward approach), we use a simpler approach which consists of calculating the contextual probabilities by a weighted average of all possible tags for the right context.</Paragraph>
      <Paragraph position="3"> Additionally, the already presented tagger allows a straightforward incorporation of n-gram probabilities, by linear interpolation, in a back-off approach including, from most general to most specific, unigrams, bigrams, trigrams and decision trees. From now on, we will refer to STT as STT + when using n-gram information.</Paragraph>
      <Paragraph position="4"> Due to the high ambiguity of unknown words, their direct inclusion in the statistical tagger would result in a severe decreasing of performance. To avoid this situation, we apply the tree for unknown words in a pre-process for filtering low probable tags. In this way, when entering to the tagger the average number of tags per unknown word is reduced from 20 to 3.1.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="53" end_page="55" type="metho">
    <SectionTitle>
3 Evaluation of the Taggers
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="53" end_page="53" type="sub_section">
      <SectionTitle>
3.1 Domain of Application
</SectionTitle>
      <Paragraph position="0"> We have used a portion of about 1,17 Mw of the Wall Street Journal (WSJ) corpus, tagged according to the Penn Treebank tag set (45 different tags). The corpus has been randomly partitioned into two subsets to train (85%) and test (15%) the system. See table 1 for some details about the used corpus.</Paragraph>
      <Paragraph position="1"> The training corpus has been used to create a word form lexicon --of 45,469 entries-- with the associated lexical probabilities for each word.</Paragraph>
      <Paragraph position="2"> The training corpus contains 239 different ambiguity classes, with a number of examples ranging from few dozens to several thousands (with a maximum of 34,489 examples for the preposition-adverb-particle ambiguity). It is noticeable that only the 36 most frequent ambiguity classes concentrate up to 90% of the ambiguous occurrences of the training corpus.</Paragraph>
      <Paragraph position="3"> Table 2 contains more information about the number of ambiguity classes necessary to cover a concrete percentage of the training corpus.</Paragraph>
      <Paragraph position="4"> Training examples for the unknown-word ambiguity class were collected from the training corpus in the following way: First, the training corpus is randomly divided into twenty parts of equal size. Then, the first part is used to extract the examples which do not occur in the remaining nineteen parts, that is, taking the 95% of the corpus as known and the remaining 5% to extract the examples. This procedure is repeated with each of the twenty parts, obtaining approximately 22,500 examples from the whole corpus. The choice of dividing by twenty is not arbitrary. 95%-5% is the proportion that results in a percentage of unknown words very similar to the test set (i.e., 2.25%) 2 .</Paragraph>
      <Paragraph position="5"> Finally, the test set has been used as completely fresh material to test the taggers. All results on tagging accuracy reported in this paper have been obtained against this test set.</Paragraph>
    </Section>
    <Section position="2" start_page="53" end_page="55" type="sub_section">
      <SectionTitle>
3.2 Results
</SectionTitle>
      <Paragraph position="0"> In this experiment we used six basic discrete-valued features to disambiguate know n ambiguous words, which are: the part-of-speech tags of the three preceding and two following words, and the orthography of the word to be disambiguated. null For tagging unknown words, we used 20 attributes that can be classified into three groups: * Contextual information: part-of-speech tags of the two preceding and following words.</Paragraph>
      <Paragraph position="1">  * Orthographic and Morphological information (about the target word): prefixes (first two symbols) and suffixes (last three symbols); Length; Multi-word?; Capitalized?; Other capital letters?; Numerical characters?; Contain dots? * Dictionary-related information: Does the  target word contains any known word as a prefix (or a suffix)?; Is the target word :See (MPSrquez, 1999) for a discussion on the appropriateness of this procedure.</Paragraph>
      <Paragraph position="2">  of words; W/S: average number of words per sentence; AW: number and percentage of ambiguous words; T/W: average number of tags per word; T/AW: average number of tags per ambiguous t:nown word; T/DW: average number of tags per ambiguous word (including unknown words); and U: number and percentage of Unknown words Classes I 8 11 14 18 36 57 111 239 J  the prefix (or the suffix) of any word in the lexicon? The last group of features are inspired in those applied by Brill (1995) when addressing unknown words.</Paragraph>
      <Paragraph position="3"> The learning algorithm 3 acquired, in about thirty minutes, a base of 191 trees (the other ambiguity classes had not enough examples) which required about 0,68 Mb of storage.</Paragraph>
      <Paragraph position="4"> The results of the taggers working with this tree-base is presented in table 3. MFT stands for a baseline most-frequent-tag tagger. RTT, STT, and STT + stand for the basic versions of the taggers presented in section 2. The over-all accuracy is reported in the first column. Columns 2, 3, and 4 contain the tagging accuracy on some specific groups of words: unknown words, ambiguous words (excluding unknown words) and known words which is the complementary of the set of unknown words.</Paragraph>
      <Paragraph position="5"> Column 5 shows the speed of each tagger 4 and, finally, the 'Memory' column reflects the size of the used language model (the lexicon is not considered). null Three main conclusions can be extracted: * RTT and STT approaches obtain almost the same results in accuracy, however RTT is faster.</Paragraph>
      <Paragraph position="6"> ZThe programs were implemented using PERL-5.0 and they were run on a SUN UltraSparc2 machine with  porates bigrams and trigrams, with a slight time-space penalty.</Paragraph>
      <Paragraph position="7"> The accuracy of all taggers is comparable to the best state-of-the art taggers under the open vocabulary assumption (see section 5.2).</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="55" end_page="57" type="metho">
    <SectionTitle>
4 Machine-Learning-based Improvements
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="55" end_page="56" type="sub_section">
      <Paragraph position="0"> Our purpose is to improve the performance on two types of ambiguity classes, namely: Most frequent ambiguity classes. We focused on the 26 most representative classes, which concentrate the 86% of the ambiguous occurrences. From these, eight (24.1%) were already resolved at almost 100% of accuracy, while the remaining eighteen (61.9%) left some room for improvement. Section 4.1 explain which methods have been applied to construct ensembles for these eighteen classes plus the unknown-word ambiguity class.</Paragraph>
      <Paragraph position="1"> Ambiguity classes with few examples. We considered the set of 82 ambiguity classes with a number of examples between 50 and 3,000 and an accuracy rate lower than 95%.</Paragraph>
      <Paragraph position="2"> They agglutinate 48,322 examples (14.24% of the total ambiguous occurrences). Section 4.2 explains the applied method to increase the number of examples of these classes.</Paragraph>
    </Section>
    <Section position="2" start_page="56" end_page="57" type="sub_section">
      <SectionTitle>
4.1 Ensembles of Decision Trees
</SectionTitle>
      <Paragraph position="0"> The general methods for constructing ensembles of classifiers are based on four techniques: 1) Resampling the training data, e.g.</Paragraph>
      <Paragraph position="1"> Boosting (Freund and Schapire, 1995), Bagging (Breiman, 1996), and Cross-validated Committees (Parmanto et al., 1996); 2) Combining different input features (Cherkauer, 1996; Tumer and Ghosh, 1996); 3) Changing output representation, e.g. ECOC (Dietterich and Bakiri, 1995) and PWC-CC (Moreira and Mayoraz, 1998); and 4) Injecting randomness (Dietterich, 1998).</Paragraph>
      <Paragraph position="2"> We tested several of the preceding methods on our domain. Below, we briefly describe those that reported major benefits.</Paragraph>
      <Paragraph position="3"> 4.i.1 Bagging (BAG) From a training set of n examples, severaI samples of the same size are extracted by randomly drawing, with replacement, n times. Such new training sets are called bootstrap replicates. In each replicate, some examples appear multiple times, while others do not appear. A classifier is induced from each bootstrap replicate and then they are combined in a voting approach. The technique is called bootstrap aggregation, from which the acronym bagging is derived. In our case, the bagging approach was performed following the description of Breiman (1996), constructing 10 replicates for each data set 5.</Paragraph>
      <Paragraph position="4">  Criteria, (FSC) In this case, the idea is to obtain different classifiers by applying several different functions for feature selection inside the tree induction algorithm. In particular, we have selected a set of seven functions that achieve a similar accuracy, namely: Gini Impurity Index, Information Gain and Gain Ratio, Chi-square statistic (X2), Symmetrical Tau criterion, RLM (a distance-based method), and a version of RELIEF-F which uses the Information Gain function to assign weights to the features. The first five are described, for instance, in (Sestito and Dillon, 1994), RLM is due to LSpez de MPSntaras (1991), and, finally, RELIEF-F is described in (Kononenko et al., 1995). Since the applied feature selection functions are based on different principles, we expect to obtain biased classifiers with complementary information.</Paragraph>
      <Paragraph position="5">  We have extended the basic set of six features with lexical information about words appearing in the local context of the target word, and with the ambiguity classes of the same words.</Paragraph>
      <Paragraph position="6"> In this way, we consider information about the surrounding words at three different levels of specificity: word form, POS tag, and ambiguity class.</Paragraph>
      <Paragraph position="7"> Very similar to Brill's lexical patterns (Brill, 1995), we also have included features to capture collocational information. Such features are obtained by composition of the already described single attributes and they are sequences of contiguous words and/or POS tags (up to three items).</Paragraph>
      <Paragraph position="8"> The resulting features were grouped according to their specificity to generate ensembles of eight trees 6. The idea here is that specific information (lexical attributes and collocational patterns) would produce classifiers that cover concrete cases (hopefully, with a high precision), while more general information (POS tags) would produce more general (but probably less precise) trees. The combination of both type of trees should perform better because of the complementarity of the information.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="57" end_page="57" type="metho">
    <SectionTitle>
4.2 Generating Pseudo-Examples
(CPD)
</SectionTitle>
    <Paragraph position="0"> Breiman (1998), describes a simple and effective method for generating new pseudo-examples fl'om existing data and incorporating them into a tree-based learning algorithm to increase prediction accuracy in domains with few training exalnples. We call this method CPD (standing for generation of Convex Pseudo-Data).</Paragraph>
    <Paragraph position="1"> The method for obtaining new data from the old is similar to the process of gene combination to create new generations in genetic algorithms.</Paragraph>
    <Paragraph position="2"> First, two examples of the same class are selected at random from the training set. Then, a new example is generated from them by selecting attributes from one or another parent according to a certain probability. This probability depends on a single generation parameter (a real number between 0 and 1), which regulates the amount of change allowed in the combination step.</Paragraph>
    <Paragraph position="3"> In the original paper, Breiman does not propose any optimization of the generation parameter, instead, he performs a limited amount of trials with different values and simply reports the best result. In our domain, we observed a big variance on the results depending on the concrete values of the generation parameter. Instead of trying to tune it, we generate several training sets using different values of the generation parameter and we construct an ensemble of decision trees. In this way, we make the global classifier independent of the particular choice, and we generally obtain a combined result which is more accurate than any of the individuals.</Paragraph>
  </Section>
class="xml-element"></Paper>