<?xml version="1.0" standalone="yes"?> <Paper uid="C90-3049"> <Title>A FINITE-STATE MORPHOLOGICAL PROCESSOR FOR SPANISH</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The arc-list compiler </SectionTitle> <Paragraph position="0"> The arc-list compiler starts with a list of lexical items and their morphological classes, and applies morphophonological transformations to generate the arc list. For instance, each verb headword in the Collins dictionary is given an index that specifies one of 62 conjugation classes. Based on this information, the arc-list compiler calculates the set of stem allomorphs necessary for that verb's inflection, along with the set of endings that each stem allomorph selects. Spanish verbs have from one to five orthographic stem allomorphs. A regular verb has only one stem, such as &quot;cambi-&quot; in &quot;cambiar&quot; (to change). An irregular verb may have up to five stems, such as &quot;ten-&quot;, &quot;teng-&quot;, &quot;tien-&quot;, &quot;tend-&quot;, &quot;tuv-&quot; for the verb &quot;tener&quot; (to have). This is common in Romance languages (see Tzoukermann 1986 for French). These different stems are the result of morphophonological changes occurring during verbal inflection, usually related to the stress implications of the verbal ending or to the features of its initial vowel.</Paragraph> <Paragraph position="1"> Depending on the conjugation class, the character string corresponding to the verb lemma is subjected to one or more rewriting rules. 
These rewriting rules are of different types: * they can be the consequence of a stress change during verbal inflection: (a) e → ie when the last syllable is not stressed, as in querer / quiero.</Paragraph> <Paragraph position="2"> * they can be a morphographic change that is general to Spanish orthography: (b) c → qu before &quot;e&quot; and &quot;i&quot;, as in sacar / saque.</Paragraph> <Paragraph position="3"> or the reverse rule (c) qu → c before &quot;a&quot;, &quot;o&quot;, &quot;u&quot;, as in delinquir / delinco.</Paragraph> <Paragraph position="4"> Some verbs are subject to one type of rewriting rule, such as (a) - (c) above, and consequently produce one additional stem allomorph. The verb &quot;sacar&quot; (to take / pull out) generates &quot;sac-&quot; and &quot;saqu-&quot;; likewise, &quot;delinquir&quot; (to offend) generates &quot;delinqu-&quot; and &quot;delinc-&quot;.</Paragraph> <Paragraph position="5"> Some other verbs, fewer in number but more frequent in actual use, are subject to two rewriting rules and need a more complex treatment.</Paragraph> <Paragraph position="6"> In &quot;forzar&quot; (to force), the morphophonological rule combines with the orthographic one and produces a distribution of four stems: &quot;forz-&quot;, &quot;forc-&quot;, &quot;fuerc-&quot;, &quot;fuerz-&quot;. The same phenomenon occurs for &quot;rogar&quot; (to beg) with the stems &quot;rog-&quot;, &quot;rogu-&quot;, &quot;rueg-&quot;, &quot;ruegu-&quot;. For some verbs of the second group in &quot;-er&quot;, the stem production is less predictable; for instance, &quot;tener&quot; presents five stems: &quot;ten-&quot;, &quot;teng-&quot;, &quot;tien-&quot;, &quot;tend-&quot;, &quot;tuv-&quot;. 
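As an illustration only, the interaction of rules (a) - (c) can be sketched in Python. This is a minimal sketch under stated assumptions, not the compiler described here: the function names, the rule tables and the diphthongizes flag are hypothetical; the z/c and g/gu spellings and the o to ue change are inferred from the forzar and rogar examples rather than stated as rules; and the choice of which rules apply would in practice come from the conjugation class index.

```python
# Spelling alternations applied to the end of a stem (hypothetical table).
# ("qu", "c") is rule (c) and ("c", "qu") is rule (b) of the text; the z/c
# and g/gu pairs are inferred from the forzar and rogar examples.
SPELLING = [("qu", "c"), ("c", "qu"), ("z", "c"), ("g", "gu")]

# Stress-conditioned vowel changes; only e -> ie is stated as rule (a).
DIPHTHONG = {"e": "ie", "o": "ue"}
VOWELS = "aeiou"


def spelling_variant(stem):
    """Return the alternate spelling of the stem, or None if no rule applies."""
    for ending, replacement in SPELLING:
        if stem.endswith(ending):
            return stem[: len(stem) - len(ending)] + replacement
    return None


def diphthongize(stem):
    """Rewrite the last vowel of the stem if it is e or o, else return None."""
    for i in reversed(range(len(stem))):
        if stem[i] in VOWELS:
            if stem[i] in DIPHTHONG:
                return stem[:i] + DIPHTHONG[stem[i]] + stem[i + 1:]
            return None
    return None


def stem_allomorphs(infinitive, diphthongizes=False):
    """List the orthographic stem allomorphs of a verb given its infinitive."""
    stems = [infinitive[:-2]]  # drop the -ar / -er / -ir ending
    if diphthongizes:
        changed = diphthongize(stems[0])
        if changed is not None:
            stems.append(changed)
    for stem in list(stems):  # iterate over a copy while appending
        variant = spelling_variant(stem)
        if variant is not None:
            stems.append(variant)
    return stems
```

Under these assumptions, stem_allomorphs("sacar") gives the two stems sac and saqu, and stem_allomorphs("forzar", diphthongizes=True) gives the four stems forz, fuerz, forc and fuerc. Irregular stems such as those of tener are not covered by this sketch and would have to be listed.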
Notice that some of the stems of &quot;tener&quot;, such as &quot;teng-&quot;, do not follow the type of morphophonological rules mentioned above.</Paragraph> <Paragraph position="7"> Because of Spanish orthographic conventions connected with the notation of stress, some nouns and adjectives also acquire more than one stem allomorph in a rule-governed way. In addition, of course, there must be a list of cases where the allomorphy is simply unique to the word in question.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The arc list </SectionTitle> <Paragraph position="0"> Using a state labeled 1 by convention as the start state, and a state labeled 0 by convention as the (unique) final state, we express all of the information needed to define our automaton A by enumerating the arcs in H, which can now be represented as a list of 4-tuples (qi, qj, u, v), where qi and qj are arbitrary identifiers for states, u is a substring of an inflected form, and v is a substring of the corresponding lemma + morphosyntactic category.</Paragraph> <Paragraph position="1"> Whether used for analysis or for generation, our program interprets this same arc list. The arc list can be conceptually divided into two parts: one contains the stems of the verbs, nouns and adjectives; the other contains a number of sub-lexicons that provide the endings for these lexical categories as well as the clitics. Our Spanish system is defined by a set of about 58,000 such 4-tuples, most of which are generated by rule from headwords and category information extracted from the typographer's tape for the Collins Spanish Dictionary. Affixes, assorted null-string transitions and clitics account for about 1000 elements of this set; the remainder are stems or stem allomorphs. Since we have about 55,000 lemmas, the overhead for compiling out predictable aspects of allomorphy is at worst the approximately 2,500 stem allomorphs and affix arcs, i.e. less than 5%. 
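A minimal Python sketch of how such a list of 4-tuples might be interpreted in both directions is given here. The arcs, the state numbers other than 1 and 0, the category strings and the function names are invented for illustration, and the depth-first search is an assumption rather than the interpreter actually used; the sketch also assumes that the arc list contains no cycle of empty-string arcs.

```python
from collections import defaultdict

START, FINAL = 1, 0  # start and final states, by the convention of the text

# Toy arc list of 4-tuples (qi, qj, u, v); all entries are hypothetical.
ARCS = [
    (1, 2, "cambi", "cambiar V"),      # a verb stem
    (2, 150, "", ""),                  # empty-string arc to a tense node
    (150, 500, "o", " 1sg pres ind"),  # endings
    (150, 500, "as", " 2sg pres ind"),
    (500, 0, "", ""),                  # empty-string arc to the final state
]


def index_arcs(arcs):
    """Group the arcs by their source state."""
    by_state = defaultdict(list)
    for qi, qj, u, v in arcs:
        by_state[qi].append((qj, u, v))
    return by_state


def transduce(arcs, text, analyse=True):
    """Map an inflected form to lemma + category, or the reverse.

    With analyse=True the u strings are consumed and the v strings emitted;
    with analyse=False the roles of u and v are swapped (generation).
    """
    by_state = index_arcs(arcs)
    results = []
    stack = [(START, 0, "")]  # (state, position in text, output so far)
    while stack:
        state, pos, out = stack.pop()
        if state == FINAL and pos == len(text):
            results.append(out)
        for qj, u, v in by_state[state]:
            consume, emit = (u, v) if analyse else (v, u)
            if text.startswith(consume, pos):
                stack.append((qj, pos + len(consume), out + emit))
    return results
```

With this toy list, transduce(ARCS, "cambias") returns the single analysis "cambiar V 2sg pres ind", and transduce(ARCS, "cambiar V 1sg pres ind", analyse=False) returns "cambio". The same data serves both directions because only the roles of u and v change.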
There are about 225 states in total.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Verbal stems </SectionTitle> <Paragraph position="0"> The verbal stem lexicon was obtained by extracting the verb headwords (about 6,800 Spanish verbs) from the Collins dictionary.</Paragraph> <Paragraph position="1"> Once the grammar provides the stems, a state pair is associated with each of them. The first state is always the initial state &quot;1&quot;; the second depends on the type of stem and its endings throughout the conjugation (digits or character strings can be used interchangeably for labelling the states). For example, for the first verb conjugation, whose infinitives end in &quot;-ar&quot;, the second states are spread out among 10 different states.</Paragraph> <Paragraph position="3"> Two verb stems x and y will share the same second state number if and only if: * x has the same number of stems as y, * x has the same ending distribution as y.</Paragraph> <Paragraph position="4"> This permits a compression of the database, since the sets of stems are gathered under a common second state number. Other arguments in favor of this choice of representation are given in section 4.1. For the 62 conjugation classes, grouped in three verb conjugations, the number of stems combined with the various ending distributions creates a number of verb-stem-final states close to 150.</Paragraph> <Paragraph position="5"> Defective verbs, due to their idiosyncrasies, are listed separately.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 The adjective stems </SectionTitle> <Paragraph position="0"> The adjective base forms (about 10,500) were derived from the masculine singular forms listed in the dictionary. 
The lexical representation of a regular adjective is an entry in the lexicon of the form: 1 300 buen bueno where &quot;buen-&quot; is the stem and &quot;bueno&quot; (good) the dictionary base form. Special attention had to be paid to stressed adjectives like &quot;musulmán&quot; (Muslim) or &quot;mandón&quot; (bossy), where the inflected form does not keep the accent. Therefore, both forms (stressed and unstressed) had to be stored.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 The noun stems </SectionTitle> <Paragraph position="0"> About 30,700 nouns were extracted from the dictionary. These nouns are not inflected for gender, but are simply listed as masculine or feminine. Thus the arc label for a noun contains the complete form of the singular. Some examples of arcs for nouns are: In the above examples, (a) can either generate a singular form or acquire the plural form in a further step, whereas (b), which occurs only in the plural, can have no further inflection added.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 The affixes </SectionTitle> <Paragraph position="0"> Besides the stems, various sublexicons containing &quot;intermediary states&quot; and affixes of different types constitute the other part of the Spanish arc list.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Intermediary nodes or continuation classes </SectionTitle> <Paragraph position="0"> The regrouping of the verbal arc list by stem and person allows reduction of the number of states and, therefore, of arcs. For instance, an intermediary state was added for the tenses only. 
The arc marked &quot;#&quot; shows a transition on an empty string.</Paragraph> <Paragraph position="2"> This arc takes any verb stem whose final state is 2 and links it to the indicative present node (labeled here 150) of the &quot;-ar&quot; verbs. Consequently, there are as many nodes of that kind as tenses for each group and verb category.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Endings </SectionTitle> <Paragraph position="0"> A series of sublexicons lists the inflections for the verbs, nouns and adjectives. Verbal inflections are of the form: 150 500 o 1st singular present indicative; 150 500 as 2nd singular present indicative. In the same way, the regular endings for the adjectives are of the form: Each transition corresponds to the gender or number feature of the adjective.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Clitics </SectionTitle> <Paragraph position="0"> The eleven Spanish clitics can occur either alone or in combination ([1]). Over sixty-five combinations can be formed, such as &quot;seles&quot;, &quot;noslas&quot;, etc. The infinitive, gerund and imperative are the only forms in which they can occur, for instance, &quot;hacerlo&quot; (to do it) or &quot;diciéndooslo&quot; (saying it to you). 
Nevertheless, they are sometimes subject to orthographic rules of the type: deletion of &quot;s&quot; for first person plural imperative verbs in front of the enclitic &quot;nos&quot;, such as in &quot;amémonos&quot;.</Paragraph> <Paragraph position="1"> Consequently, about 300 arcs were listed to handle the general cases as well as the idiosyncrasies.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Reflexive verbs </SectionTitle> <Paragraph position="0"> In the case of reflexive verbs such as &quot;afiliarse&quot; (to affiliate, to join) or &quot;abstenerse&quot; (to abstain, to refrain), a special treatment is warranted. Such verbs have a paradigm like: (a) me afilio (I affiliate), te afilias (you affiliate), me afiliaba (I was affiliating), te afiliabas (you were affiliating); (b) afiliándome (affiliating myself), afíliate (affiliate!). The reflexive pronouns generally precede the verb form, separated from it by white space as shown in (a), except for the infinitive, imperative and present participle (example (b) above). [Footnote 3: Note that some verbs (e.g. &quot;afiliarse&quot;) occur only reflexively, while others (e.g. &quot;lavar&quot; (to wash, to clean)) may be used reflexively or non-reflexively. Note also that object pronouns in general are cliticized, not only the reflexive ones.] For the preceding reflexive pronouns, there is a dependency between the person-and-number of the pronoun and the person-and-number of the verbal ending, spanning the intervening verb stem. To capture such dependencies in a single automaton of the kind that we are using, we would have to use a separate path for each person-number combination, duplicating the verb stem (and its allomorphs, if any) six times. This seems like a bad idea. 
A better alternative, in such cases, is to set up the automaton to permit all reflexive pronouns to co-occur with all endings, and to filter the resulting set of tuples to remove the ones that do not match. This can be done, for example, by passing the output through a second automaton that does nothing but check person and number agreement in reflexive verbs.</Paragraph> <Paragraph position="1"> We find it interesting that precisely those aspects of Spanish morphology that require such a treatment are those whose formatives are written as separate words.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.5 Prefixes and suffixes </SectionTitle> <Paragraph position="0"> About 60 suffixes and 90 prefixes were added to the arc list for handling derivational morphology. Only the very productive ones were selected. The prefixes are of the form &quot;aero-&quot;, &quot;ante-&quot;, &quot;auto-&quot;, &quot;bio-&quot;, occurring with or without the hyphen; the suffixes are of the form &quot;-ejo&quot;, &quot;-eta&quot;, &quot;-zuela&quot;, &quot;-uelo&quot;, etc. The resulting arc list, in addition to supporting an efficient computation of relations between surface and lexical forms, provides a good overview of the morphological structure of the Spanish verbal system, permitting easy access to the sets of verbs that behave in a similar way.</Paragraph> </Section> </Section> </Paper>