<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3246"> <Title>Learning Hebrew Roots: Machine Learning with Linguistic Constraints</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The standard account of word-formation processes in Semitic languages describes words as combinations of two morphemes: a root and a pattern. (An additional morpheme, vocalization, is used to abstract the pattern further; for the present purposes, this distinction is irrelevant.) The root consists of consonants only, by default three (although longer roots are known), called radicals.</Paragraph> <Paragraph position="1"> The pattern is a combination of vowels and, possibly, consonants too, with 'slots' into which the root consonants can be inserted. Words are created by interdigitating roots into patterns: the first radical is inserted into the first consonantal slot of the pattern, the second radical fills the second slot and the third fills the last slot. See Shimron (2003) for a survey.</Paragraph> <Paragraph position="2"> Identifying the root of a given word is an important task. Although existing morphological analyzers for Hebrew only provide a lexeme (which is a combination of a root and a pattern), for other Semitic languages, notably Arabic, the root is an essential part of any morphological analysis simply because traditional dictionaries are organized by root, rather than by lexeme.</Paragraph> <Paragraph position="3"> Furthermore, roots are known to carry some meaning, albeit vague. We believe that this information can be useful for computational applications and are currently experimenting with the benefits of using root and pattern information for automating the construction of a WordNet for Hebrew.</Paragraph> <Paragraph position="4"> We present a machine learning approach, augmented by limited linguistic knowledge, to the problem of identifying the roots of Hebrew words.
To the best of our knowledge, this is the first application of machine learning to this problem. While there exist programs that can extract the root of words in Arabic (Beesley, 1998a; Beesley, 1998b) and Hebrew (Choueka, 1990), they all depend on the labor-intensive construction of large-scale lexicons, which are components of full-scale morphological analyzers. Note that Tim Buckwalter's Arabic morphological analyzer only uses &quot;word stems - rather than root and pattern morphemes - to identify lexical items. (The information on root and pattern morphemes could be added to each stem entry if this were desired.)&quot; The challenge of our work is to automate this process, avoiding the bottleneck of having to laboriously list the root and pattern of each lexeme in the language, and thereby gain insights that can be used for more detailed morphological analysis of Semitic languages.</Paragraph> <Paragraph position="5"> As we show in section 2, identifying roots is a non-trivial problem even for humans, due to the complex nature of Hebrew derivational and inflectional morphology and the peculiarities of the Hebrew orthography. From a machine learning perspective, this is an interesting test case of interactions among different yet interdependent classifiers. After presenting the data in section 3, we discuss a simple baseline learning approach (section 4) and then propose two methods for combining the results of interdependent classifiers (section 5), one which is purely statistical and one which incorporates linguistic constraints, demonstrating the improvement of the hybrid approach. We conclude with suggestions for future research.</Paragraph> </Section> </Paper>