<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0602">
  <Title>Unsupervised Learning of Morphology Using a Novel Directed Search Algorithm: Taking the First Step</Title>
  <Section position="7" start_page="0" end_page="0" type="concl">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> Several factors, which we are currently investigating, may explain the higher fscore of our Directed Search system relative to the Linguistica system. It must be noted that Linguistica is designed to exploit word frequency in a corpus, and its performance may improve if it is given a corpus of words rather than just a lexicon. The Linguistica model and our Directed Search model use similar distributions: Rissanen's universal prior for integers is frequently used in Linguistica, whereas the inverse squared distribution is used in our model.</Paragraph>
    <Paragraph position="1"> Experiments replacing the inverse squared distribution with the universal prior have shown no significant empirical difference in performance. We are currently working on a more detailed comparison of the two systems.</Paragraph>
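    <Paragraph> As a rough illustration of how close the two priors are, the sketch below evaluates both over small positive integers. It is not the code used in either system. It assumes that the inverse squared distribution is the normalized form 6/(pi^2 n^2) and that the universal prior is 2^(-log*(n))/c with c approximately 2.865064; the function names are hypothetical.
<![CDATA[
```python
import math

# Approximate normalising constant for Rissanen's universal prior (assumed value).
RISSANEN_C = 2.865064


def log_star(n):
    """Sum the positive terms of log2(n) + log2(log2(n)) + ..."""
    total = 0.0
    term = math.log2(n)
    while term > 0:
        total += term
        term = math.log2(term)
    return total


def universal_prior(n):
    """Rissanen's universal prior over the positive integers (sketch)."""
    return 2.0 ** (-log_star(n)) / RISSANEN_C


def inverse_squared(n):
    """Inverse squared distribution, normalised so that it sums to 1."""
    return 6.0 / (math.pi ** 2 * n ** 2)


# Print both priors side by side for a few integers.
for n in (1, 2, 5, 10, 100):
    print(n, universal_prior(n), inverse_squared(n))
```
]]>
    Both functions assign most of their probability mass to small integers and decay slowly, which is consistent with the observation that swapping one for the other makes little empirical difference.</Paragraph>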
    <Paragraph position="2"> The results obtained from the Directed Search algorithm can be significantly improved by incorporating the hill climbing search detailed in Snover and Brent (2001). The hill climbing search attempts to move stems from one paradigm to similar paradigms to increase the probability of the hypothesis. Experiments in which the hypothesis output by the Directed Search system is used as the starting hypothesis for the hill climbing search, using the probability model detailed in this paper, show an increase in performance, most notably in recall and fscore, over using the Directed Search in isolation.</Paragraph>
    <Paragraph position="3"> Many of the stem relations predicted by the Directed Search algorithm result from postulating stem and suffix breaks in words that are actually morphologically simple. This occurs when the endings of these words resemble other, correct, suffixes. In an attempt to deal with this problem, we have investigated incorporating semantic information into the probability model, since morphologically related words also tend to be semantically related. A successful implementation of such information should eliminate errors such as &quot;capable&quot; breaking down as &quot;cap&quot;+&quot;able&quot;, since &quot;capable&quot; is not semantically related to &quot;cape&quot; or &quot;cap&quot;.</Paragraph>
    <Paragraph position="4"> Using latent semantic analysis, Schone and Jurafsky (2000) have previously demonstrated the success of using semantic information in morphological analysis. Preliminary results on our datasets using a similar technique based on co-occurrence data, which represents each word as a vector of frequencies of co-occurrence with other words, indicate that much semantic, as well as morphological, information can be extracted.</Paragraph>
    <Paragraph position="5"> When the cosine measure is used to compare pairs of words in the corpus, the highest-scoring pairs are for the most part morphologically or semantically related. We are currently working on correctly incorporating this information into the probability model.</Paragraph>
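    <Paragraph> The co-occurrence representation and cosine comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our experimental setup: the window size, the toy sentences, and the function names are all hypothetical.
<![CDATA[
```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, invented purely for illustration.
sentences = [
    "the dog walks in the park".split(),
    "the dog walked in the park".split(),
    "a cat sleeps on the mat".split(),
]
vectors = cooccurrence_vectors(sentences)
related = cosine(vectors["walks"], vectors["walked"])
unrelated = cosine(vectors["walks"], vectors["sleeps"])
print(related, unrelated)
```
]]>
    In this toy corpus the morphologically related pair &quot;walks&quot; and &quot;walked&quot; shares its contexts and therefore scores higher than the unrelated pair, which is the kind of signal that could be used to penalize analyses such as &quot;cap&quot;+&quot;able&quot;.</Paragraph>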
    <Paragraph position="6"> The Directed Search algorithm does not currently handle multiple suffixation or any prefixation; however, some ideas for future work involve extending the model to capture these processes. While such an extension would be a significant one, it would not change the fundamental nature of the algorithm. Furthermore, the output of the present system is potentially useful in discovering spelling change rules, which could then be bootstrapped to aid in discovering further morphological structure.</Paragraph>
    <Paragraph position="7"> Yarowsky and Wicentowski (2000) have developed a system that learns such rules given a preliminary morphological hypothesis and part of speech tags.</Paragraph>
    <Paragraph position="8"> While the experiments reported here are based on an input lexicon of orthographic representations, there is no reason why the Directed Search algorithm could not be applied to phonetically transcribed data. In fact, especially in the case of the English language, where the orthography is particularly inconsistent with the phonology, our algorithm might be expected to perform better at discovering the internal structure of phonologically transcribed words. Furthermore, phonetically transcribed data would eliminate the problems introduced by the lack of one-to-one correspondence of letters to phonemes.</Paragraph>
    <Paragraph position="9"> Namely, the algorithm would not mistakenly treat sibilants, such as the /ch/ sound in &quot;chat&quot;, as two separate units, although these phonemes are often represented orthographically by a two-letter sequence.</Paragraph>
    <Paragraph position="10"> A model of morphology incorporating phonological information such as phonological features could capture morphological phenomena that bridge the morphology-phonology boundary, such as allomorphy, or the existence of multiple variants of morphemes. Simply running the algorithm on phonetic data might not improve performance, though, as some structures that were more straightforward in the orthographic data might be more complex in the phonetic representation. Finally, for those interested in the question of whether the language learning environment provides children with enough information to discover morphology with no prior knowledge, an analysis of phonological, not orthographic, data would be necessary.</Paragraph>
    <Paragraph position="11"> The goal of the Directed Search model was to produce a preliminary description, with very few false positives, of the final suffixation, both inflectional and derivational, in a language-independent manner.</Paragraph>
    <Paragraph position="12"> For the most part, the Directed Search algorithm achieved a higher fscore than Linguistica; more importantly, the precision of Linguistica does not approach the precision of our algorithm, particularly on the larger corpus sizes. Overall, we feel the Directed Search algorithm has attained the goal of producing an initial estimate of suffixation that could aid other models in discovering higher level structure.</Paragraph>
  </Section>
</Paper>