<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1001"> <Title>A word-grammar based morphological analyzer for agglutinative languages</Title> <Section position="2" start_page="0" end_page="0" type="abstr"> <SectionTitle> 1) Morphographemics (also called morphophonology). </SectionTitle> <Paragraph position="0"> This term covers orthographic variations that occur when linking morphemes.</Paragraph> <Paragraph position="1"> 2) Morphotactics. Specification of which morphemes can or cannot combine with each other to form valid words.</Paragraph> <Paragraph position="2"> 3) Feature-combination. Specification of how these morphemes can be grouped and how their morphosyntactic features can be combined.</Paragraph> <Paragraph position="3"> The system presented here adopts, on the one hand, the two-level formalism to deal with morphographemics and sequential morphotactics (Alegria et al., 96) and, on the other hand, a unification-based word-grammar (footnote 2) to combine the grammatical information defined in morphemes and to tackle complex morphotactics. This design allowed us to develop a full-coverage analyzer that efficiently processes unrestricted texts in Basque. The remainder of this paper is organized as follows. After a brief description of Basque morphology, section 2 describes the architecture for morphological processing, where the morphosyntactic component is included. Section 3 specifies the phenomena covered by the analyzer, explains its design criteria, and presents implementation and evaluation details. Section 4 compares the system with previous works. [Footnote 1: This has also been called morphosyntactic parsing. When we use the term morphosyntax we will always refer to the hierarchical structure at word level, combining morphology and syntax.] [Footnote 2: The term word-grammar should not be confused with the syntactic theory presented in (Hudson, 84).] 
Finally, the paper ends with some concluding remarks.</Paragraph> <Paragraph position="4"> 1 Brief description of Basque morphology These are the most important features of Basque morphology (Alegria et al., 96): * As prepositional functions are realized by case suffixes inside word-forms, Basque presents a relatively high power to generate inflected word-forms. For instance, from a single noun a minimum of 135 inflected forms can be generated. Therefore, the number of simple word-forms covered by the current 70,000 dictionary entries would not be less than 10 million.</Paragraph> <Paragraph position="5"> * 77 of the inflected forms are simple combinations of number, determination, and case marks, not capable of further inflection, but the other 58 word-forms ending in one of the two possible genitives (possessive and locative) can be further inflected with the 135 morphemes. This kind of recursive construction reveals a noun ellipsis inside a noun phrase and could theoretically be extended ad infinitum; however, in practice it is not usual to find more than two levels of this kind of recursion in a word-form. Taking into account a single level of noun ellipsis, the number of word-forms could be estimated at over half a billion.</Paragraph> <Paragraph position="6"> * Verbs offer a lot of grammatical information. A verb form conveys information about the subject, the two objects, as well as the tense and aspect. For example: diotsut (Eng.: I am telling you something).</Paragraph> <Paragraph position="7"> * Word-formation is very productive in Basque. 
It is very usual to create new compounds as well as derivatives.</Paragraph> <Paragraph position="8"> As a result of this wealth of information contained within word-forms, complex structures have to be built to represent complete morphological information at word level.</Paragraph> <Paragraph position="9"> 2 An architecture for the full morphological analyzer The framework we propose for the morphological treatment is shown in Figure 1. The morphological analyzer is the front-end to all present applications for the processing of Basque texts. It is composed of two modules: the segmentation module and the morphosyntactic analyzer.</Paragraph> <Paragraph position="10"> The segmentation module was previously implemented in (Alegria et al., 96). This system applies two-level morphology (Koskenniemi, 83) for the morphological description and obtains, for each word, its possible segmentations (one or many) into component morphemes. The two-level system has the following components: * A set of 24 morphographemic rules, compiled into transducers (Karttunen, 94).</Paragraph> <Paragraph position="11"> * A lexicon made up of around 70,000 items, grouped into 120 sublexicons and stored in a general lexical database (Aduriz et al., 98a).</Paragraph> <Paragraph position="12"> This module has full coverage of free-running texts in Basque, giving an average of 2.63 different analyses per word. The result is the set of possible morphological segmentations of a word, where each morpheme is associated with its corresponding features in the lexicon: part of speech (POS), subcategory, declension case, number, definiteness, as well as syntactic function and some semantic features. Therefore, the output of the segmentation phase is very rich, as shown in Figure 2 with the word amarengan (Eng.: on the mother).</Paragraph> <Paragraph position="13"> Figure 2. 
Morphosyntactic analysis of amarengan (Eng.: on the mother). The architecture is a modular environment that allows different types of output depending on the desired level of analysis. The foundation of the architecture lies in the fact that TEI-conformant SGML has been adopted for the communication among modules (Ide and Veronis, 95). Feature structures coded according to TEI are used to represent linguistic information, including the input and output of the morphological analyzer. This representation enables the use of SGML-aware parsers and tools, and can be easily filtered into different formats (Artola et al., 00).</Paragraph> </Section></Paper>