<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1080">
  <Title>Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset</Title>
  <Section position="4" start_page="484" end_page="485" type="metho">
    <SectionTitle>
2 The Training Data
</SectionTitle>
    <Paragraph position="0"> Our training data consists of about 130,000 tokens of newspaper and magazine text, manually doubletagged and then corrected by a single judge.</Paragraph>
    <Paragraph position="1"> Our training data consists of about 130,000 tokens of newspaper and magazine text, manually tagged using a special-purpose tool which allows for easy disambiguation of morphological output. The data has been tagged twice, with manual resolution of discrepancies (the discrepancy rate being about 5%, most of them being simple tagging errors rather than opinion differences).</Paragraph>
    <Paragraph position="2"> One data item contains several fields: the input word form (token), the disambiguated tag, the set of all possible tags for the input word form, the disambiguated lemma, and the set of all possible lemmas with links to their possible tags. Out of these, we are currently interested in the form, its possible tags and the disambiguated tag. The lemmas are ignored for tagging purposes. 4 The tag from the &amp;quot;disambiguated tag&amp;quot; field as well as the tags from the &amp;quot;possible tags&amp;quot; field are further divided into so called subtags (by morphological category). In the set &amp;quot;possible tags field&amp;quot;, 4In fact, tagging helps in most cases to disambiguate the lemmas. Lemma disambiguation is a separate process following tagging. The lemma disambiguation is a much simpler problem - the average number of different lemmas per token (as output by the morphological analyzer) is only 1.15. We do not cover the lemma disambiguation procedure here.</Paragraph>
    <Paragraph position="3">  ~--s ........ IRIRI-I-1461-1-1-1-1-1-I-I-IIoa AAIS6 .... tA N I AIAIIMNISlSI-I-I-I-I t/A/-/-/Ipoetta,&amp;quot;ov&amp;~ milS6 ..... A--lNINII/S12361-/-I-I-I-IAl-I-/Imodelu z: ........... \[Zl :l-l-l-l-l-l-l-l-l-l-l-l\] , P4YS1 ........ \[P/4/IY=/S/14/-/-/-/-/-/-/-/-/\]kZ,r~ VpYS---IR-A A-lV/p/Y/S/-/-/-II/P,I-/A/-/-/lsi~uloval ~IS4 ..... A--\[N/N/I/S/14/-/-/-/-/-/A/-/-/\[v~rvoj AANS2 .... IA--\[A/A/IMN/S/24/-/-/-/-/i/A/-/-/Isv~zov4ho h~NS2 ..... A-- \[N/N/N/S/236/-/-/-/-/-/A/-/-/\]kllma~u \]~--8 ........ I~IRI-1-1461-I-I-I-I-I-I-I-311 v AAIm8 .... IA--IAIAIFI~IP1281-1-1-1-111Al-l-llP~i~tlch IaWIP6 ..... A--INININIPlSl-l-l-l-l-lAl-l-lldea,tiletlch  model, which was-simulating development of-world climate in next decades the ambiguity on the level of full (combined) tags is mapped onto so called &amp;quot;ambiguity classes&amp;quot; (AC-s) of subtags. This mapping is generally not reversible, which means that the links across categories might not be preserved. For example, the word form jen for which the morphology generates three possible tags, namely, TT ........... (particle &amp;quot;only&amp;quot;), and NNISI ..... A-- and NNIS4 ..... A-- (noun, masc.</Paragraph>
    <Paragraph position="4"> inanimate, singular, nominative (1) or accusative (4) case; &amp;quot;yen&amp;quot; (the Japanese currency)), will be assigned six ambiguous ambiguity classes (NT, NT, -I, -S, -14, -h, for POS, subpart of speech, gender, number, case, and negation) and 7 unambiguous ambiguity classes (all -). An example of the training data is presented in Fig. 1. It contains three columns, separated by the vertical bar 0): 1. the &amp;quot;truth&amp;quot; (the correct tag, i.e. a sequence of 13 subtags, each represented by a single character, which is the true value for each individual category in the order defined in Fig. 1 (lst column: POS, 2nd: SUBPOS, etc.)  2. the 13-tuple of ambiguity classes, separated by a slash (/), in the same order; each ambiguity class is named using the single character subtags used for all the possible values of that category; 3. the original word form.</Paragraph>
    <Paragraph position="5">  Please note that it is customary to number the seven grammatical cases in Czech: (instead of naming them): &amp;quot;nominative&amp;quot; gets 1, &amp;quot;genitive&amp;quot; 2, etc. There are four genders, as the Czech masculine gender is divided into masculine animate (M) and inanimate (I).</Paragraph>
    <Paragraph position="6"> Fig. 1 is a typical example of the ambiguities encountered in a running text: little POS ambiguity, but a lot of gender, number and case ambiguity (columns 3 to 5).</Paragraph>
  </Section>
  <Section position="5" start_page="485" end_page="486" type="metho">
    <SectionTitle>
3 The Model
</SectionTitle>
    <Paragraph position="0"> Instead of employing the source-channel paradigm for tagging (more or less explicitly present e.g. in (Merialdo, 1992), (Church, 1988), (Hajji, Hladk~, 1997)) used in the past (notwithstanding some exceptions, such as Maximum Entropy and rule-based taggers), we are using here a &amp;quot;direct&amp;quot; approach to modeling, for which we have chosen an exponential probabilistic model. Such model (when predicting an event 5 y E Y in a context x) has the general form PAC,e (YIX) = exp(~-~in----1 Aifi (y, x)) Z(x) (3) where fi (Y, x) is the set (of size n) of binary-valued (yes/no) features of the event value being predicted and its context, hi is a &amp;quot;weigth&amp;quot; (in the exponential sense) of the feature fi, and the normalization factor</Paragraph>
    <Paragraph position="2"> ~,Ve use a separate model for each ambiguity class AC (which actually appeared in the training data) of each of the 13 morphological categories 6. The final PAC (Yix) distribution is further smoothed using unigram distributions on subtags (again, separately</Paragraph>
    <Paragraph position="4"> Such smoothing takes care of any unseen context; for ambiguity classes not seen in the training data, for which there is no model, we use unigram probabilities of subtags, one distribution per category.</Paragraph>
    <Paragraph position="5"> In the general case, features can operate on any imaginable context (such as the speed of the wind over Mt. Washington, the last word of yesterday TV news, or the absence of a noun in the next 1000 words, etc.). In practice, we view the context as a set of attribute-value pairs with a discrete range of values (from now on, we will use the word &amp;quot;context&amp;quot; for such a set). Every feature can thus be represented by a set of contexts, in which it is positive. There is, of course, also a distinguished attribute for the value of the variable being predicted (y); the rest of the attributes is denoted by x as expected. Values of attributes will be denoted by an overstrike (~, 5).</Paragraph>
    <Paragraph position="6"> The pool of contexts of prospective features is for the purpose of morphological tagging defined as a Sa subtag, i.e. (in our case) the unique value of a morphological category.</Paragraph>
    <Paragraph position="7"> 6Every category is, of course, treated separately. It means that e.g. the ambiguity class 23 for category CASE (meaning that there is an ambiguity between genitive and dative  cases) is different from ambiguity class 23 for category GRADE or PEI~0N.</Paragraph>
    <Paragraph position="8"> full cross-product of the category being predicted (y) and of the x specified as a combination of: 1. an ambiguity class of a single category, which may be different from the category being predicted, or 2. a word form and 1. the current position, or 2. immediately preceding (following) position in text, or 3. closest preceding (following) position (up to  four positions away) having a certain ambiguity class in the POS category  where Cat E Categories and CatAC is the ambiguity class AC (such as AN, for adjective/noun ambiguity of the part of speech category) of a morphological category Cat (such as POS). For example, the function fPOSaN,A,-~ is well-defined (A E {A,N}), whereas the function fCASE145,6,-PS is not (6 C/~ {1, 4, 5}). We will introduce the notation of the context part in the examples of feature value computation below. The indexes may be omitted if it is clear what category, ambiguity class, the value of the category being predicted and/or the context the feature belongs to.</Paragraph>
    <Paragraph position="9"> The value of a well-defined feature 7 function fca~Ac,y,~(Y, x) is determined by fCa~ac.y,~(Y, x) = 1 ~=~ ~ = y A * C x. (7) This definition excludes features which are positive for more than one y in any context x. This property will be used later in the feature selection algorithm. As an example of a feature, let's assume we are predicting the category CASE from the ambiguity class 145, i.e. the morphology gives us the possibility to assign nominative (1), accusative (4) or vocative  (5) case. A feature then is e.g.</Paragraph>
    <Paragraph position="10">  The resulting case is nominative (1) and the following word form is pracuje (lit.</Paragraph>
    <Paragraph position="11">  is negative (lit. (at the) Prague castle).</Paragraph>
    <Paragraph position="12"> denoted as fCASE145,1,(FORM+1=pracuje), or The resulting case is accusative (4) and the closest preceding preposition's case has the ambiguity class 46 denoted as fCASEa4s,4,(CASE-pos=R=46).</Paragraph>
    <Paragraph position="13"> The feature fPOSNv,N,(POS_l=A,CASE_l=145) will be positive in the context of Fig. 2, but not in the context of Fig. 3.</Paragraph>
    <Paragraph position="14"> The full cross-product of all the possibilities outlined above is again restricted to those features which have actually appeared in the training data more than a certain number of times.</Paragraph>
    <Paragraph position="15"> Using ambiguity classes instead of unique values of morphological categories for evaluating the (context part of the) features has the advantage of giving us the possibility to avoid Viterbi search during tagging. This then allows to easily add lookahead (right) context. 8 There is no &amp;quot;forced relationship&amp;quot; among categories of the same tag. Instead, the model is allowed to learn also from the same-position &amp;quot;context&amp;quot; of the subtag being predicted. However, when using the model for tagging one can choose between two modes of operation: separate, which is the same mode used when training as described herein, and VTC (Valid Tag Combinations) method, which does not allow for impossible combinations of categories. See Sect. 5 for more details and for the impact on the tagging accuracy.</Paragraph>
  </Section>
  <Section position="6" start_page="486" end_page="487" type="metho">
    <SectionTitle>
4 Training
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="486" end_page="486" type="sub_section">
      <SectionTitle>
4.1 Feature Weights
</SectionTitle>
      <Paragraph position="0"> The usual method for computing the feature weights (the Ai parameters) is Maximum Entropy (Berger 8It remains to be seen whether using the unique values at least for the left context - and employing Viterbi would help. The results obtained so far suggest that probably not much, and if yes, then it would restrict the number of features selected rather than increase tagging accuracy.</Paragraph>
      <Paragraph position="1"> &amp; al., 1996). This method is generally slow, as it requires lot of computing power.</Paragraph>
      <Paragraph position="2"> Based on our experience with tagging as well as with other projects involving statistical modeling, we assume that actually the weights are much less important than the features themselves.</Paragraph>
      <Paragraph position="3"> We therefore employ very simple weight estimation. It is based on the ratio of conditional probability of y in the context defined by the feature fy,~ and the uniform distribution for the ambiguity class AC.</Paragraph>
    </Section>
    <Section position="2" start_page="486" end_page="487" type="sub_section">
      <SectionTitle>
4.2 Feature Selection
</SectionTitle>
      <Paragraph position="0"> The usual guiding principle for selecting features of exponential models is the Maximum Likelihood principle, i.e. the probability of the training data is being maximized. (or the cross-entropy of the model and the training data is being minimized, which is the same thing). Even though we are eventually interested in the final error rate of the resulting model, this might be the only solution in the usual source-channel setting where two independent models (a language model and a &amp;quot;translation&amp;quot; model of some sort - acoustic, real translation etc.) are being used.</Paragraph>
      <Paragraph position="1"> The improvement of one model influences the error rate of the combined model only indirectly.</Paragraph>
      <Paragraph position="2"> This is not the case of tagging. Tagging can be seen as a &amp;quot;final application&amp;quot; problem for which we assume to have enough data at hand to train and use just one model, abandoning the source-channel paradigm. We have therefore used the error rate directly as the objective function which we try to minimize when selecting the model's features. This idea is not new, but as far as we know it has been implemented in rule-based taggers and parsers, such as (Brill, 1993a), (Brill, 1993b), (Brill, 1993c) and (Ribarov, 1996), but not in models based on probability distributions.</Paragraph>
      <Paragraph position="3"> Let's define the set of contexts of a set of features:</Paragraph>
      <Paragraph position="5"> where F is some set of features of interest.</Paragraph>
      <Paragraph position="6"> The features can therefore be grouped together based on the context they operate on. In the current implementation, we actually add features in &amp;quot;batches&amp;quot;. A &amp;quot;batch&amp;quot; of features is defined as a set of features which share the same context Z (see the definition below). Computationaly, adding features in batches is relatively cheap both time- and spacewise. null For example, the features</Paragraph>
      <Paragraph position="8"> share the context (POS_I = A, CASE_, = 145).</Paragraph>
      <Paragraph position="9"> Let further * FAC be the pool of features available for selection. null * SAC be the set of features selected so far for a model for ambiguity class AC, * PSac (Yl d) the probability, using model (3-5) with features SAC, of subtag y in a context defined by position d in the training data, and * FAC,~ be the set (&amp;quot;batch&amp;quot;) of features sharing the same context ~, i.e.</Paragraph>
      <Paragraph position="10"> FAc, = {f FAc: : S = (9) Note that the size of AC is equal to the size of any batch of features (\[AC\[ = \[FAc,~\[ for any z).</Paragraph>
      <Paragraph position="11"> The selection process then proceeds as follows:  1. For all contexts ~ E X(FAc) do the following: 2. For all features f = fy,~ E FAc,5 compute their associated weights AI using the formula:</Paragraph>
      <Paragraph position="13"> 3. Compute the error rate of the training data by going through it and at each position d selecting the best subtag by maximizing PSacUFAc.~(Yid) over all y E AC.</Paragraph>
      <Paragraph position="14"> 4. Select such a feature set FAC,~ which results in the maximal improvement in the error rate of the training data and add all f e FAC,~ permanently to SAC; with SAC now extended, start from the beginning (unless the termination condition is met), 5. Termination condition: improvement in error  rate smaller than a preset minimum.</Paragraph>
      <Paragraph position="15"> The probability defined by the formula (11) can easily be computed despite its ugly general form, as the denominator is in fact the number of (positive) occurrences of all the features from the batch defined by the context ~ in the training data. It also helps if the underlying ambiguity class AC is found only in a fraction of the training data, which is typically the case. Also, the size of the batch (equal to \[AC\[) is usually very small.</Paragraph>
      <Paragraph position="16"> On top of rather roughly estimating the Af parameters, we use another implementation shortcut here: we do not necessarily compute the best batch of features in each iteration, but rather add all (batches of) features which improve the error rate by more than a threshold 6. This threshold is set to half the number of data items which contain the ambiguity class AC at the beginning of the loop, and then is cut in half at every iteration. The positive consequence of this shortcut (which certainly adds some unnecessary features) is that the number of iterations is much smaller than if the maximum is regularly computed at each iteration.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>