5 Accomplishments and prospects

5.1 Initial results

Whole Word Morphologizer has been tested on a limited basis using English and French lexicons of approximately 3000 entries, garnered from the POS-tagged versions of Le petit prince and Moby Dick. Without any post-hoc corrections, the program initially achieved between 70% and 82% accuracy in generation; these figures measure the percentage of the new words beyond the original lexicon that are possible words of the language. The figures thus measure a kind of precision value, in terms of the precision/recall tradeoff, and they are fair values in that they do not include generated words that are already in the lexicon.

Algorithm 2 compforward(w1, w2)
Require: w1 and w2 to be (word, category) pairs.
Ensure: a data structure comparison, documenting the different and similar letters between w1 and w2, is merged into the global list of comparisons. comparison is a structure of 5 lists: w1dif, w1cat, w2dif, w2cat, sim.
  for x = 1 to length(w2) do
    if characters w1(x) = w2(x) then
      append w1(x) to list sim
      if list w1dif does not end with 'X' then
        append 'X' to both lists w1dif and w2dif
      end if
    else
      append w1(x) to w1dif, append w2(x) to w2dif, append '#' to sim
    end if
  end for
  for x = length(w2)+1 to length(w1) do
    append w1(x) to w1dif
  end for
  if the dif lists and categories match a comparison already in the list comps then
    merge the comparisons and increment count(comparison)
  else
    append comparison to comps
  end if

Algorithm 3 generate(lexicon, strategies)
Ensure: a list newwords is generated using lexicon and strategies.
  for all words in lexicon do
    for all strategies do
      if unify(lexicon[x], strategies[x]) says the word and strategy match with either left or right alignment then
        newword ← create(lexicon[x], strategies[x])
        if newword is not in the lexicon or the list newwords then
          append newword to newwords
        end if
      end if
    end for
  end for
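To make the pseudocode concrete, the following is a minimal Python sketch of the left-aligned comparison and the generation loop. The function names follow the pseudocode, but the internals of unify and create are our own simplified assumptions (string patterns with 'X' standing for the shared material), since the paper does not spell them out.

# A sketch of Algorithms 2 and 3; the strategy representation is assumed.
from collections import defaultdict

def compforward(w1, w2, comps):
    """Left-aligned comparison of two (word, category) pairs; merge into comps."""
    (word1, cat1), (word2, cat2) = w1, w2
    w1dif, w2dif, sim = [], [], []
    for x in range(len(word2)):
        if x < len(word1) and word1[x] == word2[x]:
            sim.append(word1[x])
            if not w1dif or w1dif[-1] != 'X':   # 'X' stands for a shared run
                w1dif.append('X')
                w2dif.append('X')
        else:
            if x < len(word1):
                w1dif.append(word1[x])
            w2dif.append(word2[x])
            sim.append('#')                     # '#' marks a difference
    for x in range(len(word2), len(word1)):     # leftover letters of word1
        w1dif.append(word1[x])
    key = (tuple(w1dif), cat1, tuple(w2dif), cat2)
    comps[key] += 1   # merging identical comparisons amounts to counting them

def unify(word, cat, pattern, pcat):
    """Bind 'X' in a pattern like 'Xer' (suffixal) or 'unX' (prefixal)."""
    if cat != pcat:
        return None
    if pattern.startswith('X'):
        suffix = pattern[1:]
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:len(word) - len(suffix)]
    elif pattern.endswith('X'):
        prefix = pattern[:-1]
        if word.startswith(prefix) and len(word) > len(prefix):
            return word[len(prefix):]
    return None

def create(stem, pattern):
    """Instantiate the other side of a strategy with the bound stem."""
    return pattern.replace('X', stem)

def generate(lexicon, strategies):
    """Algorithm 3: every new (word, category) the strategies license."""
    newwords = []
    known = set(lexicon)
    for word, cat in lexicon:
        for (pat_a, cat_a), (pat_b, cat_b) in strategies:
            stem = unify(word, cat, pat_a, cat_a)
            if stem is not None:
                newword = (create(stem, pat_b), cat_b)
                if newword not in known and newword not in newwords:
                    newwords.append(newword)
    return newwords

comps = defaultdict(int)
compforward(('walker', 'NN'), ('walk', 'VB'), comps)
# comps now maps (('X','e','r'), 'NN', ('X',), 'VB') to 1, i.e. the
# walker/walk relation; talker/talk would increment the same key.
print(generate([('walk', 'VB'), ('walker', 'NN'), ('talk', 'VB')],
               [(('X', 'VB'), ('Xer', 'NN'))]))   # -> [('talker', 'NN')]

Merging comparisons that share dif lists and categories, and counting how often they recur, is what lets a recurrent pattern graduate into a morphological strategy for generation.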
A satisfactory recall metric seems impossible to devise in its usual sense here. First of all, there are generally an indefinite number of possible words in a language. One therefore cannot give a precise set of words that we would wish the system to generate from a specific lexicon, so there seems to be no way to measure the percentage of "desired words" that are in fact generated. Even if we were to make such a list by hand from the current small corpora to use as a gold standard (as has been suggested by a referee), it must also be remembered that WWM discovers strategies (morphological relations) for creating new words from given ones; it cannot be expected to discover strategies that are not evident in a corpus. Indeed, WWM will never discover that, for example, 'am' and 'be' are related, because according to the theory of morphology being applied these words are only related by convention, not by morphology. "Nonproductive morphology" is not really morphology.

The real point is that we do not want to hold WWM's performance up against our own ideas about morphological relations among words, since it would be practically impossible to determine not merely a large set of possible words that linguists think are related to those in the corpus, but rather the set of possible words that WWM ought to generate according to its theory. This would amount to trying to beat WWM at its own game in pursuit of a gold standard, which could only be obtained using a better implementation of WWM's theory. A perfect implementation of Whole Word Morphology would have perfect recall, in view of our eventual goal of using this theory to inform us about the morphology of a language--about what ought to be recalled. We are not trying to learn something that we feel is already known.

5.2 What's learning?

It is worth considering the endeavor of learning morphology in terms of formal learning theory, as presented in Osherson et al. (1986) or Kanazawa (1998), for example. In the classical framework, the problem of learning a language from positive example data is approached by considering the successive guesses at the target language that a purported learner makes when presented with a sequentially increasing learning sample drawn from that language. Considering just morphology, the target language is the set of all possible words of the natural language at hand, a possibly infinite (or at least indefinite) set. WWM's output is a list of generated words subsuming the corpus, which are supposed to be all the words creatable by applying its idea of morphology to that corpus. It can thus be viewed as making a guess about the target language, given a certain learning sample; if the learning sample is increased, its guess increases in size as well. The errors in precision of course mean that at the current corpus sizes its guesses are, for the moment, not even subsets of the target language.

According to one classic paradigm, a system is held to be a successful learner if it can be proven to home in on the target language as the learning sample increases in size indefinitely. This is Gold's (1967) criterion of identification in the limit.
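For reference, one standard way to state the criterion (following the textbook presentations cited above; the notation is ours, not the paper's): a text t for a language L is an infinite sequence of positive examples whose range is exactly L, and a learner \varphi identifies L in the limit when

\[
  \forall t \text{ for } L \;\; \exists g,\, n_0 \quad \text{such that} \quad
  L(g) = L \;\text{ and }\; \varphi(t_1, \ldots, t_n) = g \;\text{ for all } n \ge n_0 .
\]

A class of languages (here, of total lexicons) is learnable when a single \varphi identifies every member of the class; the open question raised below is whether the class of natural-language lexicons, presented as POS-tagged words, is learnable in this sense by a perfect implementation of Whole Word Morphology.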
In this framework, an empirical analysis cannot be used to decide the adequacy of a learner, and we would like to deemphasize the importance of the empirical results for this purpose. That said, the empirical results are for now all we have to show, but eventually we hope to produce a mathematical proof of just what WWM can learn, and of just what kinds of lexicons are learnable in Gold's sense. To our knowledge, it has never been proven whether the total lexicon of a natural language is identifiable in the limit from the sort of data we provide (i.e., POS-tagged words), using in particular the theory of Whole Word Morphology in a perfect fashion.

Still, it is interesting that nothing about this language learning paradigm says anything about morphological analysis. The current crop of true morphological learners, e.g. (Goldsmith, 2001b), endeavor to learn to analyze the morphology of the language at hand in the manner of a linguist. Goldsmith has even called his Linguistica system a "linguist in a box." This is perhaps an interesting and worthwhile endeavor, but it is not one that is undertaken here. WWM instead attempts to learn the target language more directly from the data, without first constructing the intermediary of a traditional morphological analysis. We are thus not learning the linguist's notion of morphology, but rather the result of morphology, i.e. the word forms of the language together with the other information that goes into a word.2

2 In this theory, a word's form cannot be usefully divorced from the other information that allows its proper use, and in our implementation the POS tags (poor substitutes for what should be a richer database of information) are crucial to the discovery of the strategies.

5.3 Post-hoc fixes and future developments

A significant proportion of errors in generation result from the application of competing, ambiguous morphological strategies. For example, when using the (French) text of Le petit prince as its base lexicon, WWM produces two strategies relating 2nd person verb forms to their infinitives. Given the verb conjugues 'conjugate' (pres. 2nd sing.), one strategy produces the correct -er class infinitive conjuguer, while the other creates the non-word *conjuguere, based on the relation among -re verb forms like fais/faire 'do' and vends/vendre 'sell.' This is because of an inherent ambiguity among various word pairs which do not fully indicate the paradigms of which they are a part. WWM then adds to its lexicon not only the correct form, but all the outputs warranted by its grammar.

To try to correct this problem, a form of lexical blocking has been implemented in the current version of the program. WWM creates every possible word, even when different strategies yield the same one, and lets lexical lookup take precedence over productive morphology. The knowledge WWM possesses about its lexicon increases considerably during the creation of morphological strategies: the program learns not only which strategies are licensed by a given lexicon, but also which words of its lexicon are related to one another. WWM can therefore assign a number to every lexical entry, giving the same "paradigm" number to related words. Before adding a newly created word to its lexicon, the program looks for an existing word with the same paradigm number and category. For example, if WWM maps the word decoction, assigned to, say, paradigm 489, onto a strategy creating plural nouns, it will look for a plural noun belonging to paradigm 489 in its lexicon before it adds decoctions to the list of new words.

Preliminary results are encouraging, with WWM reaching up to 92% accuracy in generation after the blocking modification. Obviously the program needs to be systematically tested on multiple lexica from different languages, but these results strongly suggest that it is possible to model the acquisition of morphology as a component of learning to generate language directly, rather than to treat computational learning as the acquisition of linguistic theory, as several current approaches do, e.g. (Goldsmith, 2001b).
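The paradigm-number bookkeeping can be sketched as a small extension of the generate routine given earlier, reusing its unify and create helpers; the triple layout of the lexicon and the name generate_with_blocking are ours, illustrating the described behavior rather than reproducing the program's code.

# Each lexical entry now carries a paradigm number; a generated word is
# blocked when the lexicon already fills that (paradigm, category) cell.
# The data layout is an assumption; the paper does not specify one.

def generate_with_blocking(lexicon, strategies):
    """lexicon: list of (word, category, paradigm-number) triples."""
    # index attested words by their (paradigm, category) cell
    cell = {(par, cat): word for word, cat, par in lexicon}
    newwords = []
    for word, cat, par in lexicon:
        for (pat_a, cat_a), (pat_b, cat_b) in strategies:
            stem = unify(word, cat, pat_a, cat_a)
            if stem is None:
                continue
            candidate = create(stem, pat_b)
            # lexical lookup takes precedence over productive morphology:
            # an attested word of category cat_b in this paradigm blocks
            # the newly created form
            if (par, cat_b) not in cell:
                cell[(par, cat_b)] = candidate
                newwords.append((candidate, cat_b, par))
    return newwords

On the paper's example, if paradigm 489 already contains a plural noun, the (489, plural noun) cell is occupied and decoctions is never added to the list of new words.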
Although the principles of whole word morphology allow one to contemplate versions of WWM that would work on templatic morphologies, polysynthetic languages, and a host of other recalcitrant phenomena, the current instantiation of the program is not so ambitious. The comparison algorithm detailed in the previous section compares words letter by letter, either from left to right or from right to left. No other alignments between words are considered, so WWM is in its current state only capable of grasping prefixal and suffixal morphology. We are currently developing a more sophisticated sequence alignment routine which will allow the program to handle infixing, circumfixing, and templatic morphologies of the Semitic type, as well as the word-internal changes typified by Germanic strong verb ablaut.
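By way of illustration of what a more general alignment buys, here is a toy sketch built on Python's standard difflib; this is our stand-in for the idea, not the alignment routine under development.

# Illustrative only: an unrestricted alignment can line up shared material
# anywhere in the two words, not just at their left or right edges.
from difflib import SequenceMatcher

def compare_general(word1, word2):
    """Return (w1dif, w2dif, sim) lists from an unrestricted alignment."""
    w1dif, w2dif, sim = [], [], []
    matcher = SequenceMatcher(a=word1, b=word2)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == 'equal':
            sim.extend(word1[i1:i2])
            w1dif.append('X')
            w2dif.append('X')
        else:           # 'replace', 'delete', 'insert'
            w1dif.extend(word1[i1:i2])
            w2dif.extend(word2[j1:j2])
            sim.extend('#' * max(i2 - i1, j2 - j1))
    return w1dif, w2dif, sim

# For a Semitic-style pair this finds the word-internal correspondences
# that an edge-anchored comparison misses, e.g.:
#   compare_general('kataba', 'kutiba')
#   -> (['X', 'a', 'X', 'a', 'X'], ['X', 'u', 'X', 'i', 'X'],
#       ['k', '#', 't', '#', 'b', 'a'])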