File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/97/a97-2015_metho.xml
Size: 5,412 bytes
Last Modified: 2025-10-06 14:14:33
<?xml version="1.0" standalone="yes"?> <Paper uid="A97-2015"> <Title>CATMORF: Multi two-level steps for Catalan morphology</Title> <Section position="2" start_page="0" end_page="25" type="metho"> <SectionTitle> 2 Internal structure of CATMORF </SectionTitle> <Paragraph position="0"> CATMORF's internal structure (figure 2) conforms to the two-level paradigm. In the two-level framework, as is well known, morphographemics is modelled in two-level rules (TLR) and morphotactics either in continuation classes or in unification word grammars (WG). Our system models morphotactics in a (DCG-like) WG and morphographemics in TLRs. The main characteristic of the formalism is that it allows the linguist to express both the morphographemic and morphotactical contexts, thus constraining the application of TLRs.</Paragraph> <Paragraph position="1"> Thus a rule in CATMORF may make use of the following data structures: the Surface Left and Right morphographemic contexts; the Lexical Left and Right morphographemic contexts; the Morphological Left and Right contexts; and the Application context (i.e., a feature structure which keeps track of the application of rules and which must unify with the application-FS associated with every morph found in the lexicon).</Paragraph> <Paragraph position="2"> As is customary, the surface and lexical descriptions in rules are related by four types of operators. 
Note that some of the facilities in SEGMORF were not available in the Alep formalism: the specification of the morphotactical context, the possibility of mapping single characters onto multiple ones, and the ability to cross morpheme boundaries.</Paragraph> <Section position="1" start_page="0" end_page="25" type="sub_section"> <SectionTitle> 2.2 The Word Grammar </SectionTitle> <Paragraph position="0"> Due to the expressivity of the TLRs, the WG can be very simple: it is a DCG-style grammar which builds a word out of the morphemes into which the surface string has been divided and provides the morphosyntactic information at the word level.</Paragraph> </Section> <Section position="2" start_page="25" end_page="25" type="sub_section"> <SectionTitle> 2.3 CATMORF's lexicon </SectionTitle> <Paragraph position="0"> The items in our lexicon contain information on the word form and lemma; the inflection paradigm of verbs, nouns and adjectives (needed for both the WG and the TLR components); and the blocking of rules by several classes of stems.</Paragraph> <Paragraph position="1"> All this information, including that concerning the inflection paradigms and the blocking of rules, has been obtained semi-automatically from an MRD (a conventional &quot;human-reader-in-mind&quot; dictionary available in electronic form): the IEC dictionary (IEC, 1996), which is a recent normative dictionary for Catalan.</Paragraph> </Section> </Section> <Section position="3" start_page="25" end_page="26" type="metho"> <SectionTitle> 3 Technical details </SectionTitle> <Paragraph position="0"> The main technical characteristics of our analyzer are as follows: * The system has been written in SICStus Prolog.</Paragraph> <Paragraph position="1"> * The system covers nominal and verbal inflection fully. A few nominal derivation processes are also covered. 
114 rules cover nominal inflection; 10 rules cover verbal inflection.</Paragraph> <Paragraph position="2"> * The WG has 1 rule for verbal inflection and 15 rules for nominal processes.</Paragraph> <Paragraph position="3"> * The original MRD contains 67567 entries. Our lexicon contains 70543 entries: 11092 verbs (around 9000 stems and 2000 lexicalized verb forms), 386 verbal suffixes, 56275 nouns and adjectives, 3 nominal suffixes and 2555 adverbs. The rest of the entries are prepositions, conjunctions, etc.</Paragraph> <Paragraph position="4"> * Only around 800 nouns and around 2000 verb forms have been added to the system by hand.</Paragraph> <Paragraph position="5"> The rest of the entries (around 60000) have been added automatically.</Paragraph> <Paragraph position="6"> * The system is currently being used in the analysis of Catalan newspapers.</Paragraph> </Section> <Section position="4" start_page="26" end_page="26" type="metho"> <SectionTitle> 4 The Multi two-level steps framework </SectionTitle> <Paragraph position="0"> In Catalan, TLRs depend on word formation processes. 114 rules cover nominal inflection and derivation processes, whereas only 10 rules cover verbal inflection; thus, few rules can be considered as applicable to both inflections.</Paragraph> <Paragraph position="1"> This shows that Catalan morphology can be more efficiently accounted for in a multi two-level steps framework, in which different TLR and WG rule sets are available, depending on the type of word formation process to cover (as depicted in figure 2). Morphemes (prefixes, noun stems, verbal stems, etc.) do not point to continuation classes (or sublexicons); instead, word formation processes (according to the WG) select their appropriate sublexicons.</Paragraph> <Paragraph position="2"> Note that this framework does not avoid the specification of morphotactical contexts for those morphographemic changes which involve interaction between TLRs and the WG. 
It simply specifies that for some word formation processes only a subset of TLRs should be considered. See (Badia &amp; Tuells, 1997) for further considerations.</Paragraph> </Section> </Paper>