File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/96/c96-2148_metho.xml

Size: 25,069 bytes

Last Modified: 2025-10-06 14:14:10

<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2148">
  <Title>POS Tagging Using Relaxation Labelling</Title>
  <Section position="4" start_page="0" end_page="877" type="metho">
    <SectionTitle>
2 Relaxation Labelling Algorithm
</SectionTitle>
    <Paragraph position="0"> Relaxation labelling is a generic name for a family of iterative algorittuns which perform function optimization, based (m local infi~rmation. See (Torras 89) for a clear exposition.</Paragraph>
    <Paragraph position="1"> Let V = {vl, v2,..., v,,~} be a set of variables Let t = , ,,,~ } be the set of possilfle labels for variable vi.</Paragraph>
    <Paragraph position="2"> Let Cb' be a set: of constraints between the labels of the variables. Each constraint C C CS states a &amp;quot;compatibility value&amp;quot; C,. ibr a colnbinalion of pairs variable-label. Constraints can be of any order (that is, any number of variables may be involved in a constraint).</Paragraph>
    <Paragraph position="3"> The aim of the algorithm is to find a weighted labelling such that &amp;quot;global consistency&amp;quot; is maximized. A weighted labelling is a weight assignation for each possibh', label of each variable. Maxinfizing &amp;quot;Global consistency&amp;quot; is defined as maxi)i )i is the weight mizing ~j t j x Sij , Vvi. Where I j for label j in wtriable vi and Sij the support received by the same combination. The support for a pair w~riable-label expresses how compatible is that pair with the labels of neighbouring variables, according to the constraint set.</Paragraph>
    <Paragraph position="4"> The relaxation algorithm consists of: * start in a randoln weighted labelling.</Paragraph>
    <Paragraph position="5"> * fbr each variable, compute the &amp;quot;support&amp;quot; that each label receives froln the current .weights for the labels of the other variabh;s.</Paragraph>
    <Paragraph position="6"> * Update the weight of each variable label ac(:ording to the support obtained.</Paragraph>
    <Paragraph position="7"> * iterate the process until a convergence criterion is met.</Paragraph>
    <Paragraph position="8"> The support computing and label weight changing must be perfornmd in parallel, to avoid that changing the a variable weights would affect t;he support colnputation of the others.</Paragraph>
    <Paragraph position="9"> The algorithm requires a way to compute which is the support for a wn'iable label given the others  and the constraints. This is called the &amp;quot;support function&amp;quot;.</Paragraph>
    <Paragraph position="10"> Several support, functions are used in tire literature to define the support received by label j of variable i (Sij).</Paragraph>
    <Paragraph position="11"> Being: 1&amp;quot;1 ?'d R~j = {,&amp;quot; I r -- \[(v,,, tk~),..., (~, *}),..., (v,.,, t.k,,)\] tile set of constraints on label j for variable i, i.e. the constraints formed by any coinbination of pairs variable-label that includes the pair (vi, t}). rl l)k, (m) the weight assigned to label t~.~ for variable v,,~ at time m.</Paragraph>
    <Paragraph position="12"> TO(V) the set of all possible subsets of variables in V.</Paragraph>
    <Paragraph position="13"> R~ (for G E Tdeg(V)) the set of constraints on tag i ieor word j in which the involved variables are exactly those of G.</Paragraph>
    <Paragraph position="14"> Usual support flnmtions are based on coinputing, for each constraint r involving (vi,t}), tile &amp;quot;constraint influence&amp;quot;, Inf(r) = C,. x p~'(m) x ... x p~Z., (m), which is the product of tile current weights for the labels appearing the constraint except (vi,t}) (representing how applicable is tile constraint in the current context) multiplied by C.,. which is the constraint compatibility value (stating how compatible is the pair with the context). The first formula combines influences just adding them:</Paragraph>
    <Paragraph position="16"> The next fornmla adds the constraint influences grouped according to the variables they involve, then multiplies the results of each group to get the final value: (1.2) &amp;-- 11 The last formula is tile same than the previous one, but instead of adding the constraint influences in the same group, just picks tile maximum. (1.3) Sij = II max {Inf(r)} The algorithm also needs art &amp;quot;updating function&amp;quot; to compute at each iteration which is tile new weight for a variable label, arrd this computation must be done in such a way that it can be proven to meet a certain convergence criterkm, at least under appropriate conditions 1 Several formulas have been proposed and some of them have been proven to be approximations of a gradient step algorithin.</Paragraph>
    <Paragraph position="17"> Usual updating flmctions are the following.</Paragraph>
    <Paragraph position="18"> ~Convergence has been proven under certain conditions, but in a complex application such as POS gagging we will lind cases where it is not necessarily achieved. Alternative stopping criterions will require further attention.</Paragraph>
    <Paragraph position="19"> Tile first formula increases weights for labels with support greater than 1, and decreases those with support smaller than 1. The denonfinator expression is a normalization factor.</Paragraph>
    <Paragraph position="21"> The second formula increases weight for labels with support greater than 0 and decreases weight, for those with support smaller than 0.</Paragraph>
    <Paragraph position="23"> Advantages of the algorithm are:  * Its irighly local character (only the state at, previous time step is needed to compute each new weight). This makes the algorithm highly parallelizable.</Paragraph>
    <Paragraph position="24"> * Its expressivity, since we state the problem in terms of constraints between labels.</Paragraph>
    <Paragraph position="25"> * Its flexibility, we don't have to check absolute coherence of constraints.</Paragraph>
    <Paragraph position="26"> * Its robustness, sin(:(,' it can give an answer to problenls without an exact solution (incompatible constraints, insufficient data...) * Its ability to find local-optima solutions to  NP problems in a non-exponential time.</Paragraph>
    <Paragraph position="27"> (Only if we have an upper bound for the nun&gt; ber of iterations, i.e. convergence is fast, or the algorithm is stopped after a fixed number of iterations. See section 4 for further details) Drawbacks of tire algorithm are: * Its cost. Being n the number of variables, v the average number of possible labels per variable, c the average number of constraints per label, and I tire average number of iterations until convergence, tile average cost is n x v x c x i, an expression in which the inulgi~ plying terms ,night; be much bigger than n if we deal with probh',ms with many values and constraints, or if convergence is not quickly achieved.</Paragraph>
    <Paragraph position="28"> * Since it acts as an approximation of gradient step algorithms, it has similar weakness: Found optima are local, and convergence is not always guaranteed.</Paragraph>
    <Paragraph position="29"> * In ge, ne, ral, constraints must be written mannally, since they at(', the modelling of the problem. This is good for easily modelable or reduced constraint-set problems, but in the</Paragraph>
  </Section>
  <Section position="5" start_page="877" end_page="879" type="metho">
    <SectionTitle>
3 Application to POS Tagging
</SectionTitle>
    <Paragraph position="0"> In this section we expose our application of relaxation labelling to assign 1);u't of speech tags to the words in a sentenc, e.</Paragraph>
    <Paragraph position="1"> Addressing tagging problems through ot)timization methods has been done in (Schmid 94) (POS tagging using neural networks) and in (Cowie et al. 92) (WSD using sinmlated annealing). (Pelillo &amp; I{efice 94) use a toy POS tagging l)i'oblenl to ext)eriment their methods to improve the quality of eoInt)atibility coeflh:ients for the constraints used by a relaxation labelling algorithm.</Paragraph>
    <Paragraph position="2"> The model used is lie tblh)wing: each word ill the text is a variable and may take several hfl)els, which are its POS tags.</Paragraph>
    <Paragraph position="3"> Since the number of variabh~s lind word position will vary from one senten(:e to another, constraints are expressed in relative terms (e.g.</Paragraph>
    <Paragraph position="4"> \[(vi, Determiner)(v.i ,,, Adjective)(vi ,2, Nou'r0\]). The Conshnint Set l{elaxation labelling is a.bh~ to deal wil;h constraints 1)etween any subset of wn'ial)les.</Paragraph>
    <Paragraph position="5"> Any rehttionship between any subset of words and tags may 1)e expressed as constraint and used l;o feed th(: algorithm. So, linguisl;s are fre(, to express ;my kind of constraint an(l are not restricted I:o previously decided patl;erns like in (Brill 92). Constraints for subsets of two and three variables are automati(:ally acquired, and any other subsets are left, to the linguists' criterion. That is, we are establishing two classes of constraints: the autoinatically acquired, and the mmmally written. This means that we ha.ve a great model flexibility: we can choose among a completely hand written model, where, a linguist has written all l;he constraint;s, a comph~tely mm)mat, ically lierived model, or ally interinediate (:olnl)ination of (',onstrailfl;s fl'om ea, ch (;ype.</Paragraph>
    <Paragraph position="6"> We can use the same information than HMM taggers to ot)tain automatic (:onstraints: the 1)robability 2. of transition fl'om one tag to another (bigram -or binary constraint- probability) will give us an idea of how eomt)atible they are in the positions i and i + 1, ;rod the same for l;rigrain -or ternary cbnstraint- probabilities. Extending ~Esl;imated fi'om occurrences in tagged (:ort)or~t.</Paragraph>
    <Paragraph position="7"> W(: prefer tll(: use of supervis(:d training (sin(:e large enough corpora arc available) because of the difficulty of using an unsut)ervised method (such as Bmm&gt; Welch re-estimation) when dealing, as in our case, with heterogeneous constraints.</Paragraph>
    <Paragraph position="8"> this to higher order constraints is possil)le, but; would result in prohibitive comtmt;ational costs.</Paragraph>
    <Paragraph position="9"> l)ealing with han(l-written constraints will not be so easy, since it; is not obvious \]low to compute &amp;quot;transition probabilities&amp;quot; for a comph:x constraint null Although accurate-but costly- methods to estimate comt)al;ibility values have been proposed in (Pelillo &amp; Hetice 94), we will choose a simpler an(t much (:heaptw (:Olntmtationally solution: (JOHll)llting the compatibility degree fl)r the manually written constraints using the number of occurr('nees of the consl;raint pattern in the training (:orIms to comtmte the prol)ability of the restricted word-tag pair given the contexl; defined by the constraint a II.elaxation doesn't need -as HMMs (h)- the prior prot)at)ility of a certain tag for a word, since it is not a constraint, but il; Call \])e llSCd to SOt; the initial st;at(; to a 11ot templet;ely rall(lol\[I OllC. hfitially we will assign to each word il;s most I)ro/)able tag, so we start optimization in a biassed point.</Paragraph>
    <Paragraph position="10">  The sut)port functions described in section 2 are traditionally used in relaxation algorithnts, it seems better for our purt)ose to choose an additive one, since the multiplicative flm(:tions might yiehl zero or tiny values when -as in Ollr cose- for ,q (:crtain val'iable or tag no constraints are available for a given subsel; of vm'ial)les.</Paragraph>
    <Paragraph position="11"> Since that fllnt:tions are general, we may try to lind ;~ suI)I)ort flmctkm more speciiic tbr our t)rol)h:m. Sin(:e IIMMs lind the maxinmm sequ(:n(:e probat)ility and relaxation is a maximizing algorii;hm, we (:an make relaxation maximize th(,' se(lllenc(? t)robability an(l we should gel; tile same results. To a(:hieve this we define a new Sul)port flmc, l;ion, which is the sequence i)robability: Being: t k tile tag for varial)h: 'vk with highest weight value a~ the current tilne step.</Paragraph>
    <Paragraph position="12"> 7r(Vt, t 1) \[;he probal)ility for t~he sequence to sl;art in tag t I.</Paragraph>
    <Paragraph position="13"> P(v,t) the lexical probability for the word represe\]tted by v to have t;ag t.</Paragraph>
    <Paragraph position="14"> T(tl, I2) the probability of tag t2 given that I;he previous one is tl.</Paragraph>
    <Paragraph position="15"> ~itj the set of all ternm'y constrainl;s on tag j for</Paragraph>
    <Paragraph position="17"> aThis is an issue that will require fitrtl,er ati:enlion, since as constraints can be expressed in several degrees of g(merality, l;he estimated probabilities may vary greatly del)ending on how t;he constraint was expressed.</Paragraph>
    <Paragraph position="18">  probabilities may be good for n-gram models, but it is dubious whether it can be generalized to higher degree constraints. In addition we can question the appropriateness of using probability values to express compatibilities, and try to find another set of values that fits better our needs. We tried several candidates to represent compatibility: Mutual Information, Association Ratio and Relative Entropy.</Paragraph>
    <Paragraph position="19"> This new compatibility measures are not limited to \[0, 1\] as probabilities. Since relaxation updating functions (2.2) and (2.1) need support values to be normalized, we must choose some function to normalize compatibility values.</Paragraph>
    <Paragraph position="20"> Although the most intuitive and direct scaling would be the linear function, we will test as well some sigmoid-shaped hmctions widely used in neural networks and in signal theory to scale free-ranging values in a finite interval.</Paragraph>
    <Paragraph position="21"> All this possibilities together with all the possibilities of the relaxation algorithm, give a large amount of combinations and each one of them is a possible tagging algorithm.</Paragraph>
  </Section>
  <Section position="6" start_page="879" end_page="880" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> To this extent, we have presented the relaxation labelling algorithm family, and stated soine considerations to apply them to POS tagging.</Paragraph>
    <Paragraph position="1"> In this section we will describe the experiments performed on applying this technique to our partieular problem.</Paragraph>
    <Paragraph position="2"> Our experiments will consist of tagging a corpus with all logical combinations of the following parameters: Support function, Updating function, Compatibility values, Normalization function and Constraints degree, which can be binary, ternary, or hand-written constraints, we will experiment with any combination of them, as well as with a particular combination consisting of a back-off technique described below.</Paragraph>
    <Paragraph position="3"> In order to have a comparison reference we will evaluate the pertbrmance of two tuggers: A blind most-likely-tag tagger and a HMM tagger (Elworthy 93) performing Viterbi algorithm. The training and test corpora will be the same for all taggerm null All results are given as precision percentages over ambiguous words.</Paragraph>
    <Section position="1" start_page="879" end_page="880" type="sub_section">
      <SectionTitle>
4.1 Results
</SectionTitle>
      <Paragraph position="0"> We performed the same experiments on three different corpora: Corpus SN (Spanish Novel) train: 15Kw, test: 2Kw, tag set size: 70. This corpus was chosen to test the algorithm in a language distinct than English, and because previous work (Moreno-Torres 94) on it provides us with a good test bench and with linguist written constraints.</Paragraph>
      <Paragraph position="1"> Corpus Sus (Susanne) train: 141Kw, test: 6Kw, tag set, size: 150. The interest of this corpus is to test the algorithm with a large tag set. Corpus WSJ (Wall Street Journal) train: 1055Kw, test: 6Kw, tag set size: 45 The interest of this corpus is obviously its size, which gives a good statistical evidence for automatic constraints acquisition.</Paragraph>
      <Paragraph position="2"> Baseline results.</Paragraph>
      <Paragraph position="3"> Results obtained by the baseline tuggers are found in table 1.</Paragraph>
      <Paragraph position="4">  First; row of table 2 shows the best results obtained by relaxation when using only binary constraints (B). That is, in the same conditions than HMM taggers. In this conditions, relaxation only performs better than HMM for the small corpus SN, and tile bigger the corpus is, tile worse results relaxation obtains.</Paragraph>
      <Paragraph position="5"> Adding hand-written constraints (C).</Paragraph>
      <Paragraph position="6"> Relaxation can deal with more constraints, so we added between 30 and 70 hand-written constraints depending on the corpus. The constraints were derived ~malyzing the most frequent errors committed by tile HMM tagger, except for SN where we adapted the context constraints proposed by (Moreno-Torres 94).</Paragraph>
      <Paragraph position="7"> The constraints do not intend to be a general language model, they cover only some common error cases. So, experiments with only hand-written constraints are not performed.</Paragraph>
      <Paragraph position="8"> The compatibility value for these constraints is coinputed from their occurrences in the corpus, and may be positive (compatible) or negative (incompatible). null Second row of table 2 shows the results obtained when using binary plus hand-written constraints. In all corpora results improve when adding hand-written constraints, except in WSJ. This is because the constraints used in this case are few (about 30) and only cover a few specific error cases (mainly tile distinction past/participle following verbs to have or to be).</Paragraph>
      <Paragraph position="9"> Using trigram information (T).</Paragraph>
      <Paragraph position="10"> We have also available ternary constraints, extracted from trigram occurrences. Results ob- null tion of constraint kinds.</Paragraph>
      <Paragraph position="11"> tained using ternary constraints in combination with other kinds of information are shown in rows T, BT, TC and BTC in table 2.</Paragraph>
      <Paragraph position="12"> There seem to be two tendencies in this table: First, using trigrmns is only helpflfl in WSJ. This is becmme the training cortms for WSJ is much bigger than in the other cases, and so the trigrmn model obtained is good, while, for the ()tiler c&lt;)rpora, the training set; seems to t)e too small to provide a good trigram iniormation.</Paragraph>
      <Paragraph position="13"> Secondly, we can observe that there is a general tendency to &amp;quot;the more information, the better resuits&amp;quot;, that ix, when using BTC we get l)etter resuits that with B~, which is in turn better than T alone.</Paragraph>
      <Paragraph position="14"> Stopping before eonve~yenee.</Paragraph>
      <Paragraph position="15"> All above results at'(; obtaine.d stopt)ing the relaxation ;algorithm whim it reaches convergence (no significant cbmges are l)rodu(:ed fl'om one iteration to the next), but relaxation algorithms not necessarily give their l)est results at convergence 4, or not always need to achieve convergence to know what the result will be (Zucker et al. 81). So they are often stoplmd after a few iterations. Actually, what we arc (loing is changing our convergen('e criterion to one more sophisticated than &amp;quot;sto 1) when dlere are no Inore changes&amp;quot;.</Paragraph>
      <Paragraph position="16"> The results l)resented in table 3 are tit(; best overall results dmt we wouM obtain if we had a criterion which stopped tit(; iteration f)rocess when the result obtained was an optimum. The number in parenthesis is the iteration at, which the algorithm should be stopped. Finding such a criterion is ~ point that will require fllrther research.  of tit(*, supI)ort function doesn't correspond ea;actly to the best solution for the problem, that is, the chosen flmction is only a,n approximation of the desired one. And (2) performing too much iterations can produce a more probable solution, which will not necessarily be the correct one.</Paragraph>
      <Paragraph position="17"> These results are clearly better than those obtained at; relaxation convergence, and they also outperform HMM taggers.</Paragraph>
      <Paragraph position="18"> Searching a more specific support flLnction.</Paragraph>
      <Paragraph position="19"> We have t)een using support fimctions that are traditionally used in relaxation, but we might try to st)ecialize relaxation labelling to POS tagging. Results obtained with this specific sut)t)ort fun(:tion (3.1) are sumntarize.d in table 4</Paragraph>
    <Paragraph position="1"> Using this new supt)ort fun(:tion we obtain resuits slightly below those of the IIMM tagger, Our sut)i)ort fun(:tion is tim sequence 1)robal)ility, which is what Viterbi maxinfizes, 1)ut we get worse, results. Tlmrc are two main reasons for that. The first one is that relaxation does not maximize the sui)t)ort; flln('tion but the weigh, ted support for each variable, so we' are not doing exactly the same than a HMM tagger. Second reason is that relaxation is not an algorithm that finds global optima an(1 can be trapl)ed in local maxima.</Paragraph>
    <Paragraph position="2"> Combining information in a llack-off h, ierarchy. Wh can confl)ine bigram and ti'igranl infi'omation in a. back-off mechanism: Use trigrams if available and bigrmns when not.</Paragraph>
    <Paragraph position="3">  The results he, re point to the same conclusions than the use of trigrams: il! we have a good trigrmn model (as in WSJ) then the back-off&amp;quot; technique is usefifl, and we get here the best overall result for tiffs corlms. If the trigram model ix not so good, results are not better than the obtained with l)igrams ahme.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="880" end_page="881" type="metho">
    <SectionTitle>
5 Application to Word Sense
</SectionTitle>
    <Paragraph position="0"> Disambiguation We can apply the same algorithm to the task of disambiguating tile sense of a word in a certain context. All we need is to state tile &lt;',onslxaints between senses of neighbour words. We can coinbine this task with POS tagging, since t, here~ are also constraints between the POS tag of a word attd its sense, or the sense of a neighbour word.  Preliminary experiments have been performed on SemCor (Miller et al. 93). The problem consists in assigning to each word its correct POS tag and the WordNet file code for its right sense. A most-likely algorithm got 62% (over nouns apperaring in WN). We obtained 78% correct, only adding a constraint stating that the sense chosen for a word must be compatible with its POS tag.</Paragraph>
    <Paragraph position="1"> Next steps should be adding more constraints (either hand written or automatically derived) on word senses to improve performance and tagging each word with its sense in WordNet instead of its file code.</Paragraph>
  </Section>
  <Section position="9" start_page="881" end_page="881" type="metho">
    <SectionTitle>
6 Conclusions
</SectionTitle>
    <Paragraph position="0"> We have applied relaxation labelling algorithm to the task of POS tagging. Results obtained show that the algorithm not only can equal markovian taggers, but also outperform them when given enough constraints or a good enough model.</Paragraph>
    <Paragraph position="1"> The main advantages of relaxation over Markovian taggers are the following: First of all, relaxation can deal with more information (constraints of any degree), secondly, we can decide whether we want to use only automatically acquired constraints, only linguist-written constraints, or any combination of both, and third, we can tune the model (,~dding or changing constraints or compatibility coefficients).</Paragraph>
    <Paragraph position="2"> We can state that in all experiments, the refinement of the model with hand written constraints led to an improvement in performance.</Paragraph>
    <Paragraph position="3"> We improved performance adding few constraints which were not linguistically motiwtted. Probably adding more &amp;quot;linguistic&amp;quot; constraints would yield more significant improvements.</Paragraph>
    <Paragraph position="4"> Several parametrizations for relaxation have been tested, and results seem to indicate that: * support function (1.2) produces clearly worse results than the others. Support flmction (1.1) is slightly ahead (1.3).</Paragraph>
    <Paragraph position="5"> * using mutual information as compatibility values gives better results.</Paragraph>
    <Paragraph position="6"> * waiting for convergence is not a good policy, and so alternative stopping criterions must be studied.</Paragraph>
    <Paragraph position="7"> * the back-off technique, as well as the trigram model, requires a really big training corpus.</Paragraph>
  </Section>
class="xml-element"></Paper>