<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1031">
  <Title>A Flexible POS Tagger Using an Automatically Acquired Language Model*</Title>
  <Section position="4" start_page="0" end_page="238" type="metho">
    <SectionTitle>
2 Language Model
</SectionTitle>
    <Paragraph position="0"> We will use a hybrid language model consisting of an automatically acquired part and a linguist-written part.</Paragraph>
    <Paragraph position="1">  The automatically acquired part is divided in two kinds of information: on the one hand, we have bi-grams and trigrams collected from the annotated training corpus (see section 5 for details). On the other hand, we have context constraints learned from the same training corpus using statistical decision trees, as described in section 3.</Paragraph>
    <Paragraph position="2"> The linguistic part is very small --since there were no available resources to develop it further-- and covers only very few cases, but it is included to illustrate the flexibility of the algorithm.</Paragraph>
    <Paragraph position="3"> A sample rule of the linguistic part:</Paragraph>
    <Paragraph position="5"> This rule states that a tag past participle (VBN) is very compatible (10.0) with a left context consisting of a %vauxiliar% (previously defined macro which includes all forms of &amp;quot;have&amp;quot; and &amp;quot;be&amp;quot;) provided that all the words in between don't have any of the tags in the set \[VBN IN , : JJ JJS J JR\]. That is, this rule raises the support for the tag past participle when there is an auxiliary verb to the left but only if there is not another candidate to be a past participle or an adjective inbetween. The tags \[IN , :\] prevent the rule from being applied when the auxiliary verb and the participle are in two different phrases (a comma, a colon or a preposition are considered to mark the beginning of another phrase).</Paragraph>
    <Paragraph position="6"> The constraint language is able to express the same kind of patterns than the Constraint Grammar formalism (Karlsson et al., 1995), although in a different formalism. In addition, each constraint has a compatibility value that indicates its strength. In the middle run, the system will be adapted to accept CGs.</Paragraph>
  </Section>
  <Section position="5" start_page="238" end_page="240" type="metho">
    <SectionTitle>
3 Constraint Acquisition
</SectionTitle>
    <Paragraph position="0"> Choosing, from a set of possible tags, the proper syntactic tag for a word in a particular context can be seen as a problem of classification. Decision trees, recently used in NLP basic tasks such as tagging and parsing (McCarthy and Lehnert, 1995: Daelemans et al., 1996; Magerman, 1996), are suitable for performing this task.</Paragraph>
    <Paragraph position="1"> A decision tree is a n-ary branching tree that represents a classification rule for classifying the objects of a certain domain into a set of mutually exclusive classes. The domain objects are described as a set of attribute-value pairs, where each attribute measures a relevant feature of an object taking a (ideally small) set of discrete, mutually incompatible values.</Paragraph>
    <Paragraph position="2"> Each non-terminal node of a decision tree represents a question on (usually) one attribute. For each possible value of this attribute there is a branch to follow. Leaf nodes represent concrete classes.</Paragraph>
    <Paragraph position="3"> Classify a new object with a decision tree is simply following the convenient path through the tree until a leaf is reached.</Paragraph>
    <Paragraph position="4"> Statistical decision trees only differs from common decision trees in that leaf nodes define a conditional probability distribution on the set of classes.</Paragraph>
    <Paragraph position="5"> It is important to note that decision trees can be directly translated to rules considering, for each path from the root to a leaf, the conjunction of all questions involved in this path as a condition and the class assigned to the leaf as the consequence. Statistical decision trees would generate rules in the same manner but assigning a certain degree of probability to each answer.</Paragraph>
    <Paragraph position="6"> So the learning process of contextual constraints is performed by means of learning one statistical decision tree for each class of POS ambiguity -~ and converting them to constraints (rules) expressing compatibility/incompatibility of concrete tags in certain contexts.</Paragraph>
    <Section position="1" start_page="238" end_page="238" type="sub_section">
      <SectionTitle>
Learning Algorithm
</SectionTitle>
      <Paragraph position="0"> The algorithm we used for constructing the statistical decision trees is a non-incremental supervised learning-from-examples algorithm of the TDIDT (Top Down Induction of Decision Trees) family. It constructs the trees in a top-down way, guided by the distributional information of the examples, but not on the examples order (Quinlan, 1986). Briefly.</Paragraph>
      <Paragraph position="1"> the algorithm works as a recursive process that departs from considering the whole set of examples at the root level and constructs the tree ina top-down way branching at any non-terminal node according to a certain selected attribute. The different values of this attribute induce a partition of the set of examples in the corresponding subsets, in which the process is applied recursively in order to generate the different subtrees. The recursion ends, in a certain node, either when all (or almost all) the remaining examples belong to the same class, or when the number of examples is too small. These nodes are the leafs of the tree and contain the conditional probability distribution, of its associated subset, of examples, on the possible classes.</Paragraph>
      <Paragraph position="2"> The heuristic function for selecting the most useful attribute at each step is of a crucial importance in order to obtain simple trees, since no backtracking is performed. There exist two main families of attribute-selecting functions: information-based (Quinlan, 1986: Ldpez, 1991) and statistically--based (Breiman et al., 1984; Mingers, 1989).</Paragraph>
    </Section>
    <Section position="2" start_page="238" end_page="239" type="sub_section">
      <SectionTitle>
Training Set
</SectionTitle>
      <Paragraph position="0"> For each class of POS ambiguity the initial example set is built by selecting from the training corpus Classes of ambiguity are determined by the groups of possible tags for the words in the corpus, i.e, nounadjective, noun-adjective-verb, preposition-adverb, etc.  all the occurrences of the words belonging to this ambiguity class. More particularly, the set of attributes that describe each example consists of the part-of-speech tags of the neighbour words, and the information about the word itself (orthography and the proper tag in its context). The window considered in the experiments reported in section 6 is 3 words to the left and 2 to the right. The following are two real examples from the training set for the words that can be preposition and adverb at the same time (IN-RB conflict).</Paragraph>
      <Paragraph position="1"> VB DT NN &lt;&amp;quot;as&amp;quot; ,IN&gt; DT JJ NN IN NN &lt;&amp;quot;once&amp;quot;,RB&gt; VBN TO Approximately 90% of this set of examples is used for the construction of the tree. The remaining 10% is used as fresh test corpus for the pruning process.</Paragraph>
    </Section>
    <Section position="3" start_page="239" end_page="239" type="sub_section">
      <SectionTitle>
Attribute Selection Function
</SectionTitle>
      <Paragraph position="0"> For the experiments reported in section 6 we used a attribute selection function due to L6pez de Mintaras (L6pez. 1991), which belongs to the information-based family. Roughly speaking, it defines a distance measure between partitions and selects for branching the attribute that generates the closest partition to the correc* partaion, namely the one that joins together all the examples of the same class.</Paragraph>
      <Paragraph position="1"> Let X be aset of examples, C the set of classes and Pc(X) the partition of X according to the values of C. The selected attribute will be the one that generates the closest partition of X to Pc(X). For that we need to define a distance measure between partitions. Let PA(X) be the partition of X induced by the values of attribute A. The average information of such partition is defined as follows:</Paragraph>
      <Paragraph position="3"> where p(X. a) is the probability for an element of X belonging to the set a which is the subset of X whose examples have a certain value for the attribute .4, and it is estimated bv the ratio ~ This average * IXl ' information measure reflects the randomness of distribution of the elements of X between the classes of the partition induced by .4.. If we consider now the intersection between two different partitions induced by attributes .4 and B we obtain</Paragraph>
      <Paragraph position="5"> It is easy to show that the measure</Paragraph>
      <Paragraph position="7"> with values in \[0,1\].</Paragraph>
      <Paragraph position="8"> So the selected attribute will be that one that minimizes the measure: d.v(Pc(X), PA(X)).</Paragraph>
    </Section>
    <Section position="4" start_page="239" end_page="240" type="sub_section">
      <SectionTitle>
Branching Strategy
</SectionTitle>
      <Paragraph position="0"> Usual TDIDT algorithms consider a branch for each value of the selected attribute. This strategy is not feasible when the number of values is big (or even infinite). In our case the greatest number of values for an attribute is 45 --the tag set size-- which is considerably big (this means that the branching factor could be 45 at every level of the tree 3). Some s.vsterns perform a previous recasting of the attributes in order to have only binary-valued attributes and to deal with binary trees (Magerman, 1996). This can always be done but the resulting features lose their intuition and direct interpretation, and explode in number. We have chosen a mixed approach which consist of splitting for all values and afterwards joining the resulting subsets into groups for which we have not enough statistical evidence of being different distributions. This statistical evidence is tested with a X ~&amp;quot; test at a 5% level of significance. In order to avoid zero probabilities the following smoothing is performed. In a certain set of examples, the probability of a tag ti is estimated by I~,l+-~ ri(4) = ,+~ where m is the number of possible tags and n the number of examples.</Paragraph>
      <Paragraph position="1"> Additionally. all the subsets that don't imply a reduction in the classification error are joined together in order to have a bigger set of examples to be treated in the following step of the tree construction. The classification error of a certain node is simply: I - maxt&lt;i&lt;m (t)(ti)).</Paragraph>
      <Paragraph position="2"> Experiments reported in (.\I&amp;rquez and Rodriguez. 1995) show that in this way more compact and predictive trees are obtained.</Paragraph>
      <Paragraph position="3"> Pruning the Tree Decision trees that correctly classify all examples of the training set are not always the most predictive ones. This is due to the phenomenon known as o,'erfitting. It occurs when the training set has a certain amount of misclassified examples, which is obviously the case of our training corpus (see section 5). If we  force the learning algorithm to completely classify the examples then the resulting trees would fit also the noisy examples.</Paragraph>
      <Paragraph position="4"> The usual solutions to this problem are: l) Prune the tree. either during the construction process (Quinlan. 1993) or afterwards (Mingers, 1989); 2) Smooth the conditional probability distributions using fresh corpus a (Magerman, 1996).</Paragraph>
      <Paragraph position="5"> Since another important, requirement of our problem is to have small trees we have implemented a post-pruning technique. In a first step the tree is completely expanded and afterwards it is pruned following a minimal cost-complexity criterion (Breiman et al.. 1984). Roughly speaking this is a process that iteratively cut those subtrees producing only marginal benefits in accuracy, obtaining smaller trees at each step. The trees of this sequence are tested using a, comparatively small, fresh part of the training set in order to decide which is the one with the highest degree of accuracy on new examples. Experimental tests (M&amp;rquez and Rodriguez, 1995) have shown that the pruning process reduces tree sizes at about 50% and improves their accuracy in a 2-5%.</Paragraph>
      <Paragraph position="6"> An Ezample Finally, we present a real example of the simple acquired contextual constraints for the conflict IN-RB</Paragraph>
      <Paragraph position="8"> The tree branch in figure 2 is translated into the following constraints: -5.81 &lt;\[&amp;quot;as .... As&amp;quot;\],IN&gt; (\[RB'I) (\[IN\]); 2.366 &lt;\[&amp;quot;as .... As&amp;quot;\],RS&gt; (\[RB\]) (\[IN\]); which express the compatibility (either positive or negative) of the word-tag pair in angle brackets with the given context. The compatibility value for each constraint is the mutual information between the tag and the context (Cover and Thomas, 1991). It is directly&amp;quot; computed from the probabilities in the tree. ~Of course, this can be done only in the case of statistical decision trees.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="240" end_page="241" type="metho">
    <SectionTitle>
4 Tagging Algorithm
</SectionTitle>
    <Paragraph position="0"> Usual tagging algorithms are either n-gram oriented -such as Viterbi algorithm (Viterbi. 1967)- or ad-hoc for every case when they must deal with more complex information.</Paragraph>
    <Paragraph position="1"> We use relaxation labelling as a tagging algorithm. Relaxation labelling is a generic name for a family of iterative algorithms which perform function optimization, based on local information. See (Torras. 1989) for a summary. Its most remarkable feature is that it can deal with any kind of constraints, thus the model can be improved by adding any constraints available and it makes the tagging algorithm independent of the complexity of the model.</Paragraph>
    <Paragraph position="2"> The algorithm has been applied to part-of-speech tagging (Padr6, 1996), and to shallow parsing (Voutilainen and Padro. 1997).</Paragraph>
    <Paragraph position="3"> The algorithm is described as follows: Let. V = {Vl.t'2 ..... v,} be a set of variables (words).</Paragraph>
    <Paragraph position="4"> Let ti = {t\].t~ ..... t~,} be the set of possible labels (POS tags) for variable vi.</Paragraph>
    <Paragraph position="5"> Let CS be a set of constraints between the labels of the variables. Each constraint C E CS states a &amp;quot;compatibility value&amp;quot; C, for a combination of pairs variable-label. Any number of variables may be involved in a constraint.</Paragraph>
    <Paragraph position="6"> The aim of the algorithm is to find a weighted labelling 5 such that &amp;quot;global consistency&amp;quot; is maximized. Maximizing &amp;quot;global consistency&amp;quot; is defined i is as maximizing for all vi, ~i P} x Sii, where pj the weight for label j in variable vi and Sij the support received by the same combination. The support for the pair variable-label expresses how compatible that pair is with the labels of neighbouring variables. according to the constraint set. It is a vector optimization and doesn't maximize only the sum of the supports of all variables. It finds a weighted labelling such that any other choice wouldn't increase the support for any variable.</Paragraph>
    <Paragraph position="7"> The support is defined as the sum of the influence of every constraint on a label.</Paragraph>
    <Paragraph position="9"> where: l~ij is the set of constraints on label j for variable i, i.e. the constraints formed by any combination of variable-label pairs that includes the pair (ci. t i ).</Paragraph>
    <Paragraph position="11"> uct of the current weights ~ for the labels appearing 5A weighted labelling is a weight assignment for each label of each variable such that the weights for the labels of the same variable add up to one.</Paragraph>
    <Paragraph position="12"> Gp~(rn) is the weight assigned to label k for variable r at time m.</Paragraph>
    <Paragraph position="13">  in the constraint except (vi,t}) (representing how applicable the constraint is in the current context) multiplied by Cr which is the constraint compatibility value (stating how compatible the pair is with the context).</Paragraph>
    <Paragraph position="14"> Briefly, what the algorithm does is:  i. Start with a random weight assignment r. 2. Compute the support value for each label of each variable.</Paragraph>
    <Paragraph position="15"> 3. Increase the weights of the labels more compat null ible with the context (support greater than 0) and decrease those of the less compatible labels (support less than 0) s, using the updating function: null</Paragraph>
    <Paragraph position="17"> 4. If a stopping/convergence criterion 9 is satisfied,  stop, otherwise go to step 2.</Paragraph>
    <Paragraph position="18"> The cost of the algorithm is proportional to the product of the number of words by the number of constraints.</Paragraph>
  </Section>
  <Section position="7" start_page="241" end_page="241" type="metho">
    <SectionTitle>
5 Description of the corpus
</SectionTitle>
    <Paragraph position="0"> We used the Wall Street Journal corpus to train and test the system. We divided it in three parts: 1,100 Kw were used as a training set, 20 Kw as a model-tuning set, and 50 Kw as a test set.</Paragraph>
    <Paragraph position="1"> The tag set size is 45 tags. 36.4% of the words in the corpus are ambiguous, and the ambiguity ratio is 2.44 tags/word over the ambiguous words, 1.52 overall.</Paragraph>
    <Paragraph position="2"> We used a lexicon derived from training corpora, that contains all possible tags for a word, as well as their lexical probabilities. For the words in test corpora not appearing in the train set, we stored all possible tags, but no lexical probability (i.e. we assume uniform distribution) ldeg.</Paragraph>
    <Paragraph position="3"> The noise in the lexicon was filtered by manually checking the lexicon entries for the most frequent 200 words in the corpus 11 to eliminate the tags due to errors in the training set. For instance the original ZWe use lexical probabilities as a starting point.</Paragraph>
    <Paragraph position="4"> SNegative values for support indicate incompatibility. 9We use the criterion of stopping when there are no more changes, although more sophisticated heuristic procedures are also used to stop relaxation processes (Eklundh and Rosenfeld, 1978; Richards et hi. , 1981). 1degThat is, we assumed a morphological analyzer that provides all possible tags for unknown words.</Paragraph>
    <Paragraph position="5"> l~The 200 most frequent words in the corpus cover over half of it.</Paragraph>
    <Paragraph position="6"> lexicon entry (numbers indicate frequencies in the training corpus) for the very common word the was ~he CD i DT 47715 JJ 7 NN I NNP 6 VBP 1 since it appears in the corpus with the six different tags: CD (cardinal), DT (determiner), JJ (adjective), NN (noun). NNP (proper noun) and VBP (verb-personal form). It is obvious that the only correct reading for the is determiner.</Paragraph>
    <Paragraph position="7"> The training set was used to estimate bi/trigram statistics and to perform the constraint learning.</Paragraph>
    <Paragraph position="8"> The model-tuning set was used to tune the algorithm parameterizations, and to write the linguistic part of the model.</Paragraph>
    <Paragraph position="9"> The resulting models were tested in the fresh test set.</Paragraph>
  </Section>
class="xml-element"></Paper>