<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1065">
  <Title>Memory-Based Learning of Morphology with Stochastic Transducers</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Fisher Kernels and Information
Geometry
</SectionTitle>
    <Paragraph position="0"> The method used is a simple application of the information geometry approach introduced by (Jaakkola and Haussler, 1998) in the field of bio-informatics.</Paragraph>
    <Paragraph position="1"> The central idea is to use a generative model to extract finite-dimensional features from a symbol sequence. Given a generative model for a string, one can use the sufficient statistics of those generative models as features. The vector of sufficient statistics can be thought of as a finite-dimensional representation of the sequence in terms of the model.</Paragraph>
    <Paragraph position="2"> This transformation from an unbounded sequence of atomic symbols to a finite-dimensional real vector is very powerful and allows the use of Support Vector Machine techniques for classification. (Jaakkola and Haussler, 1998) recommend that instead of using the sufficient statistics, that the Fisher scores are used, together with an inner product derived from the Fisher information matrix of the model. The Fisher scores are defined for a data point a10 and a particular model as</Paragraph>
    <Paragraph position="4"> The partial derivative of the log likelihood is easy to calculate as a byproduct of the E-step of the EM algorithm, and has the value for HMMs (Jaakkola et al., 2000) of</Paragraph>
    <Paragraph position="6"> where a27 a45 is the indicator variable for the parameter a41 , and a8a7a47 is the indicator value for the state a43 where</Paragraph>
    <Paragraph position="8"> that the sum of the parameters must be one.</Paragraph>
    <Paragraph position="9"> The kernel function is defined as</Paragraph>
    <Paragraph position="11"> where a33 a36 is the Fisher information matrix.</Paragraph>
    <Paragraph position="12"> This kernel function thus defines a distance be- null This distance in the feature space then defines a pseudo-distance in the example space.</Paragraph>
    <Paragraph position="13"> The name information geometry which is sometimes used to describe this approach derives from a geometrical interpretation of this kernel. For a parametric model with a0 free parameters, the set of all these models will form a smooth a0 -dimensional manifold in the space of all distributions. The curvature of this manifold can be described by a Riemannian tensor - this tensor is just the expected Fisher information for that model. It is a tensor because it transforms properly when the parametrization is changed.</Paragraph>
    <Paragraph position="14"> In spite of this compelling geometric explanation, there are difficulties with using this approach directly. First, the Fisher information matrix cannot be calculated directly, and secondly in natural language applications, unlike in bio-informatic applications we have the perennial problem of data sparsity, which means that unlikely events occur frequently.</Paragraph>
    <Paragraph position="15"> This means that the scaling in the Fisher scores gives extremely high weights to these rare events, which can skew the results. Accordingly this work uses the unscaled sufficient statistics. This is demonstrated below.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Details
</SectionTitle>
    <Paragraph position="0"> Given a transducer that models the transduction from uninflected to inflected words, we can extract the sufficient statistics from the model in two ways. We can consider the statistics of the joint model a59 a17 a30 a31 a36 a20a2a1 a22 or the statistics of the conditional model a59 a17 a36 a20a30 a31 a1 a22 . Here we have used the conditional model, since we are interested primarily in the change of the stem, and not the parts of the stem that remain unchanged. It is thus possible to use either the features of the joint model or of the conditional model, and it is also possible to either scale the features or not, by dividing by the parameter value as in Equation 2. The second term in Equation 2 corresponding to the normalization can be neglected.</Paragraph>
    <Paragraph position="1"> We thus have four possible features that are compared on one of the data sets in Table 4. Based on the performance here we have chosen the unscaled conditional sufficient statistics for the rest of the experiments presented here, which are calculated thus:</Paragraph>
    <Paragraph position="3"> tense of apply (6pl3). This example shows that the most likely transduction is the suffix Id, which is incorrect, but the MBL approach gives the correct result in line 2.</Paragraph>
    <Paragraph position="4"> Given an input string a30 we want to find the string a36 such that the pair a30  a36 is very close to some element of the training data. We can do this in a number of different ways. Clearly if a30 is already in the training set then the distance will be minimized by choosing a36 to be one of the outputs that is stored for input a36 ; the distance in this case will be zero. Otherwise we sample repeatedly (here we have taken 100 samples) from the conditional distribution of each of the submodels. This in practice seems to give good results, though there are more principled criteria that could be applied.</Paragraph>
    <Paragraph position="5"> We give a concrete example using the LING English past tense data set described below. Given an unseen verb in its base form, for example apply, in phonetic transcription 6pl3, we generate 100 samples from the conditional distribution. The five most likely of these are shown in Table 1, together with the conditional probability, the distance to the closest example and the closest example.</Paragraph>
    <Paragraph position="6"> We are using a a0 -nearest-neighbor rule with a0 a26 a28 , since there are irregular words that have completely idionsyncratic inflected forms. It would be possible to use a larger value of a0 , which might help with robustness, particularly if the token frequency was also used, since irregular words tend to be more common.</Paragraph>
    <Paragraph position="7"> In summary the algorithm proceeds as follows: a9 We train a small Stochastic Transducer on the pairs of strings using the EM algorithm.</Paragraph>
    <Paragraph position="8"> a9 We derive from this model a distance function between two pairs of strings that is sensitive to the properties of this transduction.</Paragraph>
    <Paragraph position="9"> a9 We store all of the observed pairs of strings.</Paragraph>
    <Paragraph position="10"> a9 Given a new word, we sample repeatedly from the conditional distribution to get a set of possible outputs.</Paragraph>
    <Paragraph position="11"> a9 We select the output such that the input/output pair is closest to one of the oberved pairs.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Data Sets
</SectionTitle>
      <Paragraph position="0"> The data sets used in the experiments are summarized in Table 2. A few additional comments follow.</Paragraph>
      <Paragraph position="1"> LING These are in UNIBET phonetic transcription.</Paragraph>
      <Paragraph position="2"> EPT In SAMPA transcription. The training data consists of all of the verbs with a non-zero lemma spoken frequency in the 1.3 million word CO-BUILD corpus. The test data consists of all the remaining verbs. This is intended to more accurately reflect the situation of an infant learner.</Paragraph>
      <Paragraph position="3"> GP This is a data set of pairs of German nouns in singular and plural form prepared from the CELEX lexical database.</Paragraph>
      <Paragraph position="4"> NAKISA This is a data set prepared for (Plunkett and Nakisa, 1997). Its consists of pairs of singular and plural nouns, in Modern Standard Arabic, randomly selected from the standard Wehr dictionary in a fully vocalized ASCII transcription. It has a mixture of broken and sound plurals, and has been simplified in the sense that rare forms of the broken plural have been removed.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Evaluation
</SectionTitle>
      <Paragraph position="0"> Table 4 shows a comparison of the four possible feature sets on the Ling data. We used 10-fold cross validation on all of these data sets apart from the EPT data set, and the SLOVENE data set; in these cases we averaged over 10 runs with different random seeds. We compared the performance of the models evaluated using them directly to model the transduction using the conditional likelihood (CL) and using the MBL approach with the unscaled conditional features. Based on these results, we used  LING data set with 10 fold cross validation, 1 10state model trained with 10 iterations. Mean in % with standard deviation in brackets.</Paragraph>
      <Paragraph position="1"> the unscaled conditional features; subsequent experiments confirmed that these performed best.</Paragraph>
      <Paragraph position="2"> The results are summarized in Table 3. Run-times for these experiments were from about 1 hour to 1 week on a current workstation. There are a few results to which these can be directly compared; on the LING data set, (Mooney and Califf, 1995) report figures of approximately 90% using a logic program that learns decision lists for suffixes. For the Arabic data sets, (Plunkett and Nakisa, 1997) do not present results on modelling the transduction on words not in the training set; however they report scores of 63.8% (0.64%) using a neural network classifier. The data is classified according to the type of the plural, and is mapped onto a syllabic skeleton, with each phoneme represented as a bundle of phonological features. for the data set SLOVENE, (Manandhar et al., 1998) report scores of 97.4% for FOIDL and 96.2% for CLOG. This uses a logic programming methodology that specifically codes for suffixation and prefixation alone. On the very large and complex German data set, we score 70.6%; note however that there is substantial disagreement between native speakers about the correct plural of nonce words (K&amp;quot;opcke, 1988). We observe that the MBL approach significantly outperforms the conditional likelihood method over a wide range of experiments; the performance on the training data is a further difference, the MBL approach scoring close to 100%, whereas the CL approach scores only a little better than it does on the test data. It is certainly possible to make the conditional likelihood method work rather better than it does in this paper by paying careful attention to convergence criteria of the models to avoid overfitting, and by smoothing the models carefully. In addition some sort of model size selection must be used. A major advantage of the MBL approach is that it works well without re- null are in the mixture, CL gives the percentage correct using the conditional likelihood evaluation and MBLSS, using the Memory-based learning with sufficient statistics, with the standard deviation in brackets. quiring extensive tuning of the parameters.</Paragraph>
      <Paragraph position="3"> In terms of the absolute quality of the results, this depends to a great extent on how phonologically predictable the process is. When it is completely predictable, as in SLOVENE the performance approaches 100%; similarly a large majority of the less frequent words in English are completely regular, and accordingly the performance on EPT is very good. However in other cases, where the morphology is very irregular the performance will be poor. In particular with the Arabic data sets, the NAKISA data set is very small compared to the complexity of the process being learned, and the MCCARTHY data set is rather noisy, with a large number of erroneous transcriptions. With the German data set, though it is quite irregular, and the data set is not frequency-weighted, so the frequent irregular words are not more likely to be in the training data, there is a lot of data, so the algorithm performs quite well.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Cognitive Modelling
</SectionTitle>
      <Paragraph position="0"> In addition to these formal evaluations we examined the extent to which this approach can account for some psycho-linguistic data, in particular the data collected by (Prasada and Pinker, 1993) on the mild productivity of irregular forms in the English past tense. Space does not permit more than a rather crude summary. They prepared six data sets of 10 pairs of nonce words together with regular and irregular plurals of them: a sequence of three data sets that were similar to, but progressively further away from sets of irregular verbs (prototypicalintermediate- and distant- pseudoirregular - PPI IPI and DPI), and another set that were similar to sets of regular verbs (prototypical-, intermediate- and distant- pseudoregular PPR, IPR and DPR). Thus the first data sets contained words like spling which would have a vowel change form of splung and a regular suffixed form of splinged, and the second data sets contained words like smeeb with regular smeebed and irregular smeb. They asked subjects for their opinions on the acceptabilities of the stems, and of the regular (suffixed) and irregular (vowel change) forms. A surprising result of this was that subtracting the rating of the past tense form from the rating of the stem form (in order to control for the varying acceptability of the stem) gave different results for the two data sets. With the pseudoirregular forms the irregular form got less acceptable as the stems became less like the most similar irregular stems, but with the pseudo-regulars the regular form got more acceptable. This was taken as evidence for the presence of two qualitatively distinct modules in human morphological processing.</Paragraph>
      <Paragraph position="1"> In an attempt to see whether the models presented here could account for these effects, we transcribed the data into UNIBET transcription and tested it with the models prepared for the LING data set. We calculated the average negative log probability for each of the six data sets in 3 ways: first we calculated the probability of the stem alone to model the acceptability of the stem; secondly we calculated the conditional probability of the regular (suffixed form), and thirdly we calculated the conditional probability of the irregular (vowel change) form of the word. Then we calculated the difference between the figures for the appropriate past tense form from the stem form. This is unjustifiable in terms of probabilities but seems the most natural way of modelling the effects reported in (Prasada and Pinker, 1993). These results are presented in Table 5. Interestingly we observed the same effect: a decrease in &amp;quot;acceptability&amp;quot; for irregulars, as they became more distant, and the opposite effect for regulars. In our case though it is clear why this happens - the probability of the stem decreases rapidly, and this overwhelms the mild decrease in the conditional probability.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> The productivity of the regular forms is an emergent property of the system. This is an advantage over previous work using the EM algorithm with SFST, which directly specified the productivity as a parameter. null</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6.1 Related work
</SectionTitle>
    <Paragraph position="0"> Using the EM algorithm to learn stochastic transducers has been known for a while in the biocomputing field as a generalization of edit distance (Allison et al., 1992). The Fisher kernel method has not been used in NLP to our knowledge before though we have noted two recent papers that have some points of similarity. First, (Kazama et al., 2001) derive a Maximum Entropy tagger, by training a HMM and using the most likely state sequence of the HMM as features for the Maximum Entropy tagging model.</Paragraph>
    <Paragraph position="1"> Secondly, (van den Bosch, 2000) presents an approach that is again similar since it uses rules, induced using a symbolic learning approach as features in a nearest-neighbour approach.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML