<?xml version="1.0" standalone="yes"?>
<Paper uid="J01-1003">
  <Title>Machine Learning</Title>
  <Section position="3" start_page="0" end_page="60" type="relat">
    <SectionTitle>
2. Related Work
</SectionTitle>
    <Paragraph position="0"> Machine learning techniques are widely employed in many aspects of language processing. The availability of large, annotated corpora has fueled a significant amount of work applying machine learning techniques to language processing problems such as part-of-speech tagging, grammar induction, and sense disambiguation, as witnessed by recent workshops and journal issues dedicated to this topic. The current work attempts to contribute to this literature by describing a human-supervised machine learning approach to the induction of morphological analyzers, a problem that, surprisingly, has received little attention.</Paragraph>
    <Paragraph position="1"> There have been a number of studies on inducing morphographemic rules from a list of inflected words and a root word list. Johnson (1984) presents a scheme for inducing phonological rules from surface data, mainly in the context of studying certain aspects of language acquisition. The premise is that languages have a finite number of alternations to be handled by morphographemic rules and a fixed number of contexts in which they appear; so if there is enough data, phonological rewrite rules can be generated to account for the data. Rules are ordered by some notion of &quot;surfaciness&quot;, and at each stage the most surfacy rule (the rule with the most transparent context) is selected. Golding and Thompson (1985) describe an approach for inducing rules of English word formation from a corpus of root forms and the corresponding inflected forms. The procedure described there generates a sequence of transformation rules, each specifying how to perform a particular inflection.</Paragraph>
    <!-- Oflazer, Nirenburg, and McShane: Bootstrapping Morphological Analyzers -->
    <Paragraph position="2"> More recently, Theron and Cloete (1997) have presented a scheme for obtaining two-level morphology rules from a set of aligned segmented and surface pairs. They use the notion of string edit sequences, assuming that only insertions and deletions are applied to a root form to get the inflected form. They determine the root form associated with an inflected form (and consequently the suffixes and prefixes) by exhaustively matching the inflected form against all root words. The motivation is that &quot;real&quot; suffixes will appear frequently in the corpus of inflected forms. Once common suffixes and prefixes are identified, the segmentation for an inflected word can be determined by choosing the segmentation with the most frequently occurring affix segments; the remainder is then considered the root. While this procedure seems to be reasonable for a small root word list, the potential for &quot;noisy&quot; or incorrect alignments is quite high when the corpus of inflected forms is large and the procedure is not given any prior knowledge of possible segmentations. As a result, automatically selecting the &quot;correct&quot; segmentation becomes nontrivial. An additional complication is that allomorphs show up as distinct affixes and their counts in segmentations are not accumulated, which might lead to actual segmentations being missed due to fragmentation. The rules are not induced via a learning scheme: aligned pairs are compressed into a special data structure, and traversals over this data structure generate morphographemic rules. Theron and Cloete have experimented with pluralization in Afrikaans, and the resulting system has shown about 94% accuracy on unseen words.</Paragraph>
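    <Paragraph> The frequency-based segmentation heuristic described above can be illustrated with a minimal Python sketch. It is not Theron and Cloete's implementation: it assumes that a root occurs as a contiguous substring of the inflected form (so only affix insertions are modeled, not deletions), and the function names candidate_segmentations, affix_counts, and best_segmentation, as well as the toy word lists, are hypothetical.</Paragraph>

```python
from collections import Counter


def candidate_segmentations(word, roots):
    """Return every (prefix, root, suffix) split in which a listed root occurs inside word.

    Assumption: the root survives unchanged inside the inflected form.
    """
    candidates = []
    for root in roots:
        start = word.find(root)
        while start != -1:
            candidates.append((word[:start], root, word[start + len(root):]))
            start = word.find(root, start + 1)
    return candidates


def affix_counts(words, roots):
    """Count how often each non-empty prefix and suffix appears over all candidate splits."""
    counts = Counter()
    for word in words:
        for prefix, _root, suffix in candidate_segmentations(word, roots):
            if prefix:
                counts[("prefix", prefix)] += 1
            if suffix:
                counts[("suffix", suffix)] += 1
    return counts


def best_segmentation(word, roots, counts):
    """Pick the candidate split whose affixes are most frequent; None if no root matches."""
    candidates = candidate_segmentations(word, roots)
    if not candidates:
        return None

    def score(candidate):
        prefix, _root, suffix = candidate
        return counts[("prefix", prefix)] + counts[("suffix", suffix)]

    return max(candidates, key=score)


# Hypothetical toy data: "king" is a distractor root that also matches inside "walking".
roots = ["walk", "talk", "jump", "king"]
words = ["walking", "talking", "jumping", "walked", "jumped"]
counts = affix_counts(words, roots)
print(best_segmentation("walking", roots, counts))
```

    <Paragraph> On this toy data, walking matches both walk and king. The split with the suffix ing wins because ing is counted three times, whereas the spurious prefix wal is counted only once. The sketch also exhibits the fragmentation problem noted above: allomorphs such as s and es would be tallied as unrelated suffixes.</Paragraph>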
    <Paragraph position="3"> Goldsmith (1998) has used an unsupervised learning method based on the minimum description length principle to learn the &quot;morphology&quot; of a number of languages. What is learned is a set of root words and affixes, and common inflectional-pattern classes. The system requires just a corpus of words in a language. In the absence of any root word list to use as a scaffolding, the shortest forms that appear frequently are assumed to be roots, and observed surface forms are then generated either by the concatenative affixation of suffixes or by rewrite rules. Since the system has no notion of what the roots and their part-of-speech values really are, and what morphological information is encoded by the affixes, this information needs to be retrofitted manually by a human, who has to weed through a large number of noisy rules. We feel that this approach, while quite novel, can be used to build real-world morphological analyzers only after substantial modifications are made.</Paragraph>
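    <Paragraph> The minimum description length idea can be illustrated with a toy two-part cost comparison in Python. This is only a sketch under simplifying assumptions, not Goldsmith's coding scheme: each letter is charged log2(26) bits, each word is charged one pointer into the stem list and one into the suffix list, and the names verbatim_cost and factored_cost are hypothetical.</Paragraph>

```python
import math

# Assumption: a uniform code over a 26-letter alphabet.
BITS_PER_LETTER = math.log2(26)


def verbatim_cost(words):
    """Bits needed to spell out every distinct word letter by letter."""
    return sum(len(word) for word in set(words)) * BITS_PER_LETTER


def factored_cost(analyses):
    """Bits for a stem list, a suffix list, and one (stem, suffix) pointer pair per word.

    analyses maps each word to a hypothesized (stem, suffix) pair.
    """
    if not analyses:
        return 0.0
    stems = {stem for stem, _suffix in analyses.values()}
    suffixes = {suffix for _stem, suffix in analyses.values()}
    # Model part: the letters of every distinct stem and suffix.
    lexicon_bits = (sum(len(s) for s in stems) + sum(len(s) for s in suffixes)) * BITS_PER_LETTER
    # Data part: each word selects one stem and one suffix.
    pointer_bits = len(analyses) * (math.log2(len(stems)) + math.log2(len(suffixes)))
    return lexicon_bits + pointer_bits


# Hypothetical toy corpus: three stems, each seen with four suffixes (one of them empty).
analyses = {
    stem + suffix: (stem, suffix)
    for stem in ("walk", "jump", "talk")
    for suffix in ("", "s", "ed", "ing")
}
print(round(verbatim_cost(analyses)), round(factored_cost(analyses)))
```

    <Paragraph> For this corpus the factored description is far cheaper (roughly 128 bits against roughly 310 bits for the verbatim list), so a learner guided by description length would prefer the stem-plus-suffix analysis. The cost says nothing about whether the stems are true roots or what the suffixes mean, which is why such output still has to be interpreted by a human.</Paragraph>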
  </Section>
</Paper>