<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0604">
  <Title>Unsupervised Learning of Morphology Without Morphemes</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Morphological learning
</SectionTitle>
    <Paragraph position="0"> Since its inception in the mid 1950s, the field of computational morphology has been characterized by a paucity of procedures for generation. Notwithstanding the impressive body of literature on the shortcomings of traditional Paninian morphology, most computational research projects also rely on a traditional notion of the morpheme and ignore all non-compositional aspects of morphology. These observations are obviously not unrelated and are in part inherited from the field of computational syntax where applications traditionally were designed to assign a syntactic structure to a given string of words, though this is less true today.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 Segmentation and morpheme identification
</SectionTitle>
      <Paragraph position="0"> Word formation and the population of the lexicon, while central to morphological theory, are noticeably absent from the field of computational morphology. Most computational work in the field of morphology has focused on the identification of morphemes or morphological parsing while paying little or no attention to generation. While these applications find a common goal in the automatic acquisition of morphology, it is helpful to distinguish between two types of analysis in light of the often very different results sought by various morphological learners.</Paragraph>
      <Paragraph position="1"> On the one hand, some applications focus exclusively on the segmentation of words or longer strings into smaller units. In other words, their July 2002, pp. 31-40. Association for Computational Linguistics.</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ACL Special Interest Group in Computational Phonology (SIGPHON), Philadelphia,
</SectionTitle>
    <Paragraph position="0"> Morphological and Phonological Learning: Proceedings of the 6th Workshop of the function is to identify morpheme boundaries within words and, as such, they only indirectly identify morphemes as linguistic units. Zellig Harris's (Harris, 1955; Harris, 1967) pioneering work suggests that morpheme boundaries can be determined by counting the number of letters that follow a given substring within a corpus (v. (Hafer and Weiss, 1974) for a further development of Harris's ideas).</Paragraph>
    <Paragraph position="1"> Janssen (1992) and Flenner (1994; 1995) also work towards segmenting words but use training corpora in which morpheme boundaries have been manually inserted. Recent work by Kazakov and Manandhar (1998) combines unsupervised and supervised learning techniques to generate a set of segmentation rules that can further be applied to previously unseen words.</Paragraph>
    <Paragraph position="2"> On the other hand, some computational morphological applications are designed solely to identify morphemes based on a training corpus and not to provide a morphological analysis for each word of that corpus. Brent (1993), for example, aims at finding the right set of suffixes from a corpus, but the algorithm cannot double as a morphological parser.</Paragraph>
    <Paragraph position="3"> More recently, efforts have been developing which identify morphemes and perform some sort of analysis. Schone and Jurafsky (2001) employ a great many sophisticated post-hoc adjustments to obtain the right conflation sets for words by pure corpus analysis without annotations. Their procedure uses a morpheme-based model, provides an analysis of the words, and does in a sense discover morphological relations. Goldsmith (2001b; 2001a), inspired by de Marcken's (1995) thesis on minimum description length, attempts to provide both a list of morphemes and an analysis of each word in a corpus. Also, Baroni (2000) aims at finding a set of prefixes from a corpus, together with an affix-stem parse of each of the words.</Paragraph>
    <Paragraph position="4"> While they might differ in their methods or objectives, all of the above morphological applications share a common characteristic in that they are learners designed exclusively for the acquisition of morphological facts from corpora and do not generate new words based on the information they acquire.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.2 Parsing and generation
</SectionTitle>
      <Paragraph position="0"> Only a handful of programs can both parse and generate words. Once again, these programs fall into two very distinct categories. In view of the disparity between these programs, it is useful to distinguish between genuine morphological learners able to generate from acquired knowledge and generators/parsers that implement a man-made analysis.</Paragraph>
      <Paragraph position="1"> The latter group is perhaps the most well known, so let us begin with them.</Paragraph>
      <Paragraph position="2"> Kimmo-type applications of two-level morphology (Koskenniemi, 1983; Antworth, 1990; Karttunen et al., 1992; Karttunen, 1993; Karttunen, 1994) can provide a morphological analysis of the words in a corpus and generate new words based on a set of rules; but these programs must first be provided with that set of rules and a lexicon containing morphemes by the user. Similar work in oneand two-level morphology has been done using the Attribute-Logic Engine (Carpenter, 1992). Some of these systems (e.g. (Karttunen et al., 1987)) have a front-end that compiles more traditional linearly ordered morphological rules into the finite-state automata of two-level morphology. Once again, these applications require a set of man-made lexical rules to function. While the practical uses of such applications as PC-Kimmo are incontestable, it is clear that they are part of a different endeavour, and should not be confused with genuine morphological learners.</Paragraph>
      <Paragraph position="3"> The other relevant group of computational applications can, as mentioned, both acquire morphological knowledge from corpora and generate new words based on that knowledge. Albright and Hayes (2001a; 2001b) tackle the wider task of acquiring morphology and (morpho)phonology based on a small paradigm list and their learner is able to generate particular inflected forms given a related word. DVzeroski and Erjavec (1997) work towards learning morphological rules for forming particular inflectional forms given a lemma (a set of related words). Their learner produces a set of rules relating all the members of a paradigm to a base form. The program can then produce a member of that paradigm on command given the base form. While the methods used by Albright and Hayes and DVzeroski and Erjavec radically differ, both use a form of supervised learning which significantly reduces the amount of information their learner has to acquire. Albright and Hayes train their program using a paradigm list in which each entry contains, for example, both the present and past tense forms of an English verb.</Paragraph>
      <Paragraph position="4"> Similarly, the training data used by DVzeroski and Erjavec similarly has a base form, or lexeme, associated to each and every word so that all the words of a given paradigm share a common label. The distinctions between the two methods are immaterial, what matters is that both learners are being told which words are related to which and are left with the task of describing that relation in the form a rule.</Paragraph>
      <Paragraph position="5"> In other words, the algorithms they use cannot discover that words are morphologically related.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.3 What's morphology?
</SectionTitle>
      <Paragraph position="0"> In the above algorithms, the task of determining whether one word is related to another in a morphological sense is most frequently left to the linguist, as this information has to be encoded in the training data for these algorithms. (Some of the most recent work such as (Schone and Jurafsky, 2001) and (Goldsmith, 2001b) are notable exceptions to this paradigm.) This is perhaps not surprising, since no serious attempt at defining a morphological relation has been made in the last few decades. American structuralists of the forties and fifties proposed what have been referred to as discovery procedures (v. (Nida, 1949), for example) for the identification of morphemes but since the mid fifties (Chomsky, 1955), it has been customary for morphological theory to ignore this aspect of morphology and relegate it to studies on language acquisition. But, since a morphological learner like that presented here is designed to model the acquisition of morphology, it seems that it should above all be able to determine for itself whether two words are morphologically related or not, whether there is anything morphological to acquire at all.</Paragraph>
      <Paragraph position="1"> Another important thing to note about the vast majority of computational morphology learners is their reliance on a traditional notion of the morpheme as a lexical unit and their exclusive focus on concatenative morphology. There is a panoply of recent publications devoted to the empirical shortcomings of traditional so-called &amp;quot;Itemand-Arrangement&amp;quot; morphology (Hockett, 1954; Bochner, 1993; Ford and Singh, 1991; Anderson, 1992; Ford et al., 1997), and the list of phenomena that fall out of reach of a compositional approach is rather impressive: zero-morphs, ablaut-like processes, templatic morphology, class markers, partial suppletion, etc. Still, seemingly every documented morphological learner relies on a Bloomfieldian notion of the morpheme and produces an Item-and-Arrangement analysis; this description applies to all of the computational papers cited above.</Paragraph>
      <Paragraph position="2">  2 An alternative theory Whole Word Morphologizer (henceforth WWM) is the first implementation of the theory of Whole Word Morphology. The theory, developed by Alan Ford and Rajendra Singh at Universit'e de Montr'eal,  seeks to account for morphological relations in a minimalist fashion. Ford and Singh published a series of papers dealing with various aspects of the theory between 1983 and 1990. Drawing on these papers, they published a full outline of it in 1991 (Ford and Singh, 1991) and an even fuller defense of it in 1997 (Ford et al., 1997). Since then, aspects of it have been taken up in a series of publications by Agnihotri, Dasgupta, Ford, Neuvel, Singh, and various combinations of these authors. The central mechanism of the theory, the Word Formation Strategy (WFS), is a sort of non-decomposable morphological transformation that relates full words with full words (or helps one fashion a full word from another full word) and parses any complex word into a variable and a non-variable component. Neuvel and Singh (In press) offer a strict definition of morphological relatedness and, based on this definition, suggest guidelines for the acquisition of Word Formation Strategies.</Paragraph>
      <Paragraph position="3"> In Whole-Word Morphology, any morphological relation can be represented by a rule of the following form:  (1) |X|a-|Xprime|b in which the following conditions and notations are employed: 1. |X|a and|Xprime|b are statements that words of the form X and Xprime are possible in the language, and X and Xprime are abbreviations of the forms of classes of words belonging to categories a and b (with which specific words belonging to the right category can be unified in form); 2. prime represents all the form-related differences between X and Xprime; 3. a and b are categories that may be represented as feature-bundles; 4. -represents a bi-directional implication; 5. Xprime and X are semantically related.</Paragraph>
      <Paragraph position="4">  There are several ramifications of (1). First, there is only one morphology; no distinction, other than a functional one, is made between inflection and derivation. Second, morphology is relational and not compositional. The program thus makes no reference to theoretical constructs such as 'root', 'stem', and 'morpheme', or devices such as 'levels' and 'strata' and relies exclusively on the notion of morphological relatedness. And since its objective is not to assign a probability to a given word or string, it must rely on a strict formal definition of a morphological relation. Ultimately, the theory takes the Saussurean view that words are defined by the differences amongst them and argues that some of these differences, namely those that are found between two or more pairs of words, constitute the domain of morphology. In other words, two words of a lexicon are morphologically related if and only if all the differences between them are found in at least one other pair of words of the same lexicon.</Paragraph>
    </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Overview of the method
</SectionTitle>
    <Paragraph position="0"> Under the assumption that the morphology of a language resides exclusively in differences that are exploited in more than one pair of words within its lexicon, WWM (Algorithm 1 in the next section) compares every word of a small lexicon and determines the segmental differences found between them. The input to the current version of the program is a small text file that contains anywhere from 1000 to 5000 words. Each word appears in orthographic form and is followed by its syntactic and morphological categories, as in the example below: (2) cat, Ns (Noun, singular) catch, V catches, V3s (Verb, (pres.) 3rd pers.</Paragraph>
    <Paragraph position="1"> sing.) decided, Vp (Verb, past) The algorithm simply compares each letter from word A to the corresponding one from word B to produce a comparison record, which can be viewed as a data structure. Currently, it works on orthographic representations. This means it would as easily work on phonemic transcriptions, but it will require empirical evaluation to see whether the results from these can improve upon those obtained using spellings, and we have not yet gone through such an exercise. It starts on either the left or right edge of the words if the two words share their first (few) segments or their last (few) segments, respectively (the forward version is presented in Algorithm 2 in the next section). This is just a simple-minded way of aligning the similar parts of the words for the comparison; a more sophisticated implementation in the future could use a more general sequence alignment procedure. The segments are placed in one of two lists in the comparison structure (differences or similarities) based on whether or not they are identical. Each comparison structure also contains the categories of both words, and is kept in a large list of all comparison structures found from analyzing the entire corpus. The example below shows the information in the comparison structure produced from the English words receive and reception. It includes the differences and similarities between the two words, from the perspective of each word in turn, as well as the lexical categories of the words.</Paragraph>
    <Paragraph position="2">  Matching character sequences in the difference section are replaced with a variable. The result is then set against comparisons generated by other pairs of words and duplicate differences are recognized. In the example below, the comparisons produced by the pairs receive/reception, conceive/conception and deceive/deception are shown.</Paragraph>
    <Paragraph position="3">  The three comparisons in (4) share the same formal and grammatical differences, and so the theory indicates they should be merged into one morphological strategy. Since the differences are the same, it is only the similarities that are actually merged. Each new morphological strategy is also restricted to apply in as narrow an environment as possible.</Paragraph>
    <Paragraph position="4"> Neuvel and Singh (Neuvel and Singh, In press) suggest that any morphological strategy must be maximally restricted at all times; this is accomplished by specifying as constant all the similarities found, not between words, but between the similarities found between words. In (4), all three sets of similarities end with the sequence of letters &amp;quot;ce.&amp;quot; These similarities between similarities are specified as constant in each strategy and the length of each word is also factored in. The merge routine called in Algorithm 2 carries out this procedure; we don't show it because it is tedious but not especially interesting. The restricted morphological strategy relating the words in  (4) is as follows: (5) Differences  For the sake of clarity, we can represent the information contained in (5) in a more familiar fashion using the formalism described in (1). The vertical brackets '|*|' are used for orthographic forms so as not to confuse them with phonemic representations.</Paragraph>
    <Paragraph position="5"> (6) |[?]##ceive|V-|[?]##ception|Ns The '#' signs in the above representations stand for letters that must be instantiated but are not specified; the '[?]' symbol stands for a letter that is not specified and that may or may not be instantiated.</Paragraph>
    <Paragraph position="6"> Strategy (6) can therefore be interpreted as follows: (6prime) If there is a verb that ends with the sequence &amp;quot;ceive&amp;quot; preceded by no less than two and no more than three characters, there should also be a singular noun that ends with the sequence &amp;quot;ception&amp;quot; preceded by the same two or three characters.</Paragraph>
    <Paragraph position="7"> After performing the comparisons and merging, WWM extracts a list of morphological strategies, which are those comparison structures whose count is more than some fixed threshold. Table 1 contains a few strategies found from the first few chapters of Moby Dick. These strategies result from merging comparison structures which have the same differences--merging the similarities of several unifiable word pairs, and so many have no specified letters at all.</Paragraph>
    <Paragraph position="8"> WWM then goes through the lexicon word by word and attempts to unify each word in form and category with the left or right side of this strategy. If it succeeds, WWM replaces all the segments fully specified on the side of the strategy the word is unified with, with the segments fully specified on the other side. For example, given the noun perception in the corpus and strategy (6), WWM will map the word onto the right hand side of (6), take out the sequence &amp;quot;ception&amp;quot; from the end and replace it with the sequence &amp;quot;ceive&amp;quot; to produce the new word perceive. The category of the word will also be changed from singular noun to verb. New words can thus be generated in a rather obvious fashion by taking each word in the original lexicon and applying any strategies that can be applied, i.e. whose orthographic form and part of speech can be unified with the word at hand. Algorithm 3 shows the basic generation procedure; once again the routines called unify and create which implement the nitty-gritty details of the above description are not given because they are more tedious than interesting, and will certainly need to be changed in more general future versions of WWM. Table 2 gives some of the new words WWM creates using text from Le petit prince as its base lexicon.</Paragraph>
    <Paragraph position="9">  The output from the algorithm is a list of words,1 much as in Table 2, which are generated from the input corpus using the morphological relations (strategies) discovered. The method described above will clearly force WWM to create words that were already part of its original lexicon; in fact, each and every word involved in licensing the discovery of a morphological strategy will be duplicated by the program. Generated words that were not part of WWM's original lexicon are then added to a sepa1By word we mean an orthographic form together with the part of speech. Further work in this vein would add meanings as well.</Paragraph>
    <Paragraph position="10"> rate word list containing only new words. If desired, this new word list can be merged with the original lexicon for another round of discovery to formulate new strategies based on a larger dataset. Additionally, each of the new words can simply be put through another cycle of word creation by applying the same strategies as before a second time.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Implementation
</SectionTitle>
    <Paragraph position="0"> This section contains some pseudocode showing several basic components of the Whole Word Morphologizer. Algorithm 1 shows the main procedure, which takes a POS-tagged lexicon as input and outputs a list of all words that are possible given the morphological relations present in the lexicon.</Paragraph>
    <Paragraph position="1"> The two procedures compforward and compbackward are symmetrical, so Algorithm 2 shows just the first of these. This algorithm provides the data structure which includes the differences and similarities between each pair of words in the lexicon, in similar fashion to the examples in the preceding section. In practice, only those pairs of words which are by some heuristic sufficiently similar in the first place are compared. Additionally, the two similarities sequences for each word pair are actually represented as one sequence which encodes the information found in the two sequences of the examples in the preceding; this is just for convenience of Algorithm 1 WWM(lexicon) Require: lexicon to be a list of POS-tagged words.</Paragraph>
    <Paragraph position="2"> Ensure: a list newwords is generated for all tagged words wi do for all tagged words w j do if wi and w j share a beginning sequence</Paragraph>
    <Paragraph position="4"> for all comparison structures in the list do if count(comparison) &gt; Threshold then append comparison to the list strategies null</Paragraph>
    <Paragraph position="6"> storage and computation.</Paragraph>
    <Paragraph position="7"> Algorithm 3 shows the outline of the final stage, which generates an output list of words from the input lexicon and the morphological strategies. The strategy list is simply a list of all comparison structures that occurred more frequently than some arbitrary threshold number.</Paragraph>
  </Section>
class="xml-element"></Paper>