<?xml version="1.0" standalone="yes"?>
<Paper uid="C90-3049">
  <Title>A FINITE-STATE MORPHOLOGICAL PROCESSOR FOR SPANISH</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> A finite transducer that processes Spanish inflectional and derivational morphology is presented.</Paragraph>
    <Paragraph position="1"> The system handles both generation and analysis of tens of millions of inflected forms. Lexical and surface (orthographic) representations of the words are linked by a program that interprets a finite directed graph whose arcs are labelled by n-tuples of strings. Each of about 55,000 base forms requires at least one arc in the graph. Representing the inflectional and derivational possibilities for these forms imposed an overhead of only about 3000 additional arcs, of which about 2500 represent (phonologically predictable) stem allomorphy, so that we pay a storage price of about 5% for compiling these forms offline. A simple interpreter for the resulting automaton processes several hundred words per second on a Sun4.</Paragraph>
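The storage figure can be checked with a line of arithmetic. The sketch below assumes exactly one arc per base form, whereas the abstract says "at least one", so the percentage is only an approximation; the variable names are illustrative.

```python
# Figures quoted in the abstract; "one arc per base form" is an assumption.
base_arcs = 55_000        # about one arc per base form (a lower bound)
extra_arcs = 3_000        # inflectional and derivational overhead
allomorphy_arcs = 2_500   # the part of extra_arcs devoted to stem allomorphy

overhead = extra_arcs / base_arcs
allomorphy_share = allomorphy_arcs / extra_arcs

print(f"overhead: {overhead:.1%}")                  # overhead: 5.5%
print(f"allomorphy share: {allomorphy_share:.1%}")  # allomorphy share: 83.3%
```

Under that assumption the overhead is about 5.5%, in line with the "about 5%" quoted above.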
    <Paragraph position="2"> 1 Introduction. One useful way to look at computational morphology and phonology is in terms of transductions, that is, n-ary word relations definable by the element-wise concatenation of n-tuple labels along paths in a finite directed labeled graph. For instance, we can take one member of such a relation to be the spelling of an inflected form, another member to be the corresponding lemma, another to be a string representing its morphosyntactic features, another to represent its pronunciation, and so forth.</Paragraph>
    <Paragraph position="3"> Inspired by the (unpublished) work of Kaplan and Kay (ongoing since the late 1970's), and that of Koskenniemi in [12], many researchers have used binary word relations to represent &quot;underlying&quot; and &quot;surface&quot; forms in the morphophonology of words. Much of the interest of this work has been focused on methods to combine multiple two-tape automata, which may be composed or run in parallel in order to compute the desired binary relation.</Paragraph>
    <Paragraph position="4"> In this paper, we take a somewhat different approach to defining and computing word relations, and discuss its application in a morphological processor for Spanish orthographic words that covers more than forty million forms generable from the approximately 55,000 basic words in the Collins Spanish Dictionary ([3]).¹ The main advantage of this approach is the extreme simplicity both of its data structures and of their interpretation. As a result, an interpreter is easy to implement, and time and/or space optimization issues in the implementation are straightforward to define.</Paragraph>
    <Paragraph position="5"> At the same time, it is extremely easy to compile traditional morphological information into the required form, at least for languages like Spanish that can be fairly well modeled in terms of the concatenation of stems and affixes. As is usually the case in automata-based approaches, the system treats analysis and generation symmetrically, and the same description can be run with equal facility in either direction.</Paragraph>
    <Paragraph position="6"> Define² an n-ary nondeterministic finite automaton as a 5-tuple A = (Q, q1, F, Σ, H), where Q is a finite non-empty set of states, q1 is a designated start state, F is a set of designated final states, Σ is a finite non-empty alphabet, and H is a finite subset of Q × (Σ*)^n × Q, where (Σ*)^n is the set of n-tuples of (possibly empty) words over Σ. A can be thought of as a labeled directed graph, whose nodes are elements of Q, and whose edges are elements of H, each such edge being labeled with the appropriate n-tuple of words. The component-wise concatenation of labels along every path that begins in q1 and ends in an element of F defines a set of n-tuples, R ⊆ (Σ*)^n, which is the relation accepted by A.</Paragraph>
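The definition maps directly onto a small data structure. The following is a minimal sketch, assuming an acyclic graph so that the relation is finite and can be enumerated; the class and field names and the toy entries are illustrative and do not come from the paper.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    source: int
    labels: tuple[str, ...]   # one (possibly empty) word per tape
    target: int


class Automaton(NamedTuple):
    start: int
    finals: frozenset[int]
    arcs: tuple[Arc, ...]     # the arc list H
    n: int                    # number of tapes


def relation(a: Automaton) -> set[tuple[str, ...]]:
    """Enumerate the accepted relation; assumes the graph has no cycles."""
    found = set()
    stack = [(a.start, ("",) * a.n)]
    while stack:
        state, acc = stack.pop()
        if state in a.finals:
            found.add(acc)
        for arc in a.arcs:
            if arc.source == state:
                # Component-wise concatenation of the labels along the path.
                joined = tuple(x + y for x, y in zip(acc, arc.labels))
                stack.append((arc.target, joined))
    return found


# Toy 2-tape automaton (surface, lexical); the entries are illustrative only.
toy = Automaton(
    start=0,
    finals=frozenset({2}),
    arcs=(
        Arc(0, ("cant", "cantar"), 1),
        Arc(1, ("o", " 1st singular present indicative"), 2),
        Arc(1, ("amos", " 1st plural present indicative"), 2),
    ),
    n=2,
)
print(sorted(relation(toy)))
```

The alphabet is left implicit here, since Python strings already range over a fixed character set.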
    <Paragraph position="7"> As a practical matter, we generally want to run a program that (explicitly or implicitly) searches this graph in order to find all the n-tuples in R with some interesting property, say those corresponding to forms whose surface spelling is the string w, or those corresponding to the first person plural imperfect subjunctive of such-and-such a verb.
¹ The present set can be increased almost exponentially by adding new derivational affixes.
² The name and the basic idea of these automata come from [5]. For simplicity of exposition we gloss over various authors' attempts to distinguish variously among machines, automata and transducers, as well as the profusion of precursors and descendants in ([15], [16], [2], [7], etc.). Our notation is eclectic.</Paragraph>
    <Paragraph position="9"> Depending on the structure of H and the property selected, the search will be harder or easier. For the stem-and-affix kind of morphology exemplified by Spanish, the natural structure for H is quite easy to search. We do not have space to discuss search methods here, but will simply observe that a non-optimal method devised for convenience in another experiment ([6]) processes several hundred Spanish words per second on a Sun4.</Paragraph>
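Since the search method itself is not described, the following is only one plausible sketch of a search for all tuples whose surface spelling is a given string w. It assumes two-tape arcs and indexes them by source state so that each step inspects only outgoing arcs; the names `index_arcs` and `analyze` and the sample arcs are hypothetical.

```python
from collections import defaultdict


def index_arcs(arcs):
    """Group arcs by source state so a search step sees only outgoing arcs."""
    by_source = defaultdict(list)
    for source, surface, lexical, target in arcs:
        by_source[source].append((surface, lexical, target))
    return by_source


def analyze(by_source, start, finals, word):
    """Return the lexical side of every accepting path whose surface side is word.

    Assumes there is no cycle of arcs with empty surface labels.
    """
    results = []
    stack = [(start, 0, "")]   # (state, characters consumed, lexical so far)
    while stack:
        state, pos, lexical = stack.pop()
        if state in finals and pos == len(word):
            results.append(lexical)
        for surface, lex, target in by_source.get(state, ()):
            if word.startswith(surface, pos):
                stack.append((target, pos + len(surface), lexical + lex))
    return sorted(results)


# Hypothetical arcs: (source, surface, lexical, target).
ARCS = [
    (0, "cambi", "cambiar", 1),
    (1, "aran", " 3rd plural imperfect subjunctive", 2),
    (1, "o", " 1st singular present indicative", 2),
]
print(analyze(index_arcs(ARCS), 0, {2}, "cambiaran"))
# ['cambiar 3rd plural imperfect subjunctive']
```

With a stem-and-affix layout, a path is only a few arcs long, which is one reason such a structure is easy to search.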
    <Paragraph position="10"> For the application discussed in this paper, we want to relate inflected forms, lemmas, and morphosyntactic features, so that the elements of R should be 3-tuples like: (cambiaran, cambiar, 3rd plural imperfect subjunctive). Since most Spanish words consist of a stem, which mainly specifies the lemma, and a set of affixes that mainly specify the morphosyntactic features, it is appropriate to use 2-tuples made by concatenating the second and third elements.</Paragraph>
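The reduction from 3-tuples to 2-tuples can be sketched as follows. The single-space separator and the function names are assumptions made for illustration, and the round trip relies on the lemma itself containing no space.

```python
def to_pair(triple, sep=" "):
    """Fold (surface, lemma, features) into (surface, lemma + sep + features)."""
    surface, lemma, features = triple
    return (surface, lemma + sep + features)


def split_pair(pair, sep=" "):
    """Recover the 3-tuple; assumes the lemma does not contain the separator."""
    surface, lexical = pair
    lemma, _, features = lexical.partition(sep)
    return (surface, lemma, features)


TRIPLE = ("cambiaran", "cambiar", "3rd plural imperfect subjunctive")
pair = to_pair(TRIPLE)
print(pair)   # ('cambiaran', 'cambiar 3rd plural imperfect subjunctive')
print(split_pair(pair) == TRIPLE)   # True
```

Because the stem arc contributes the lemma and the affix arcs contribute the features, concatenating the lexical labels along a path yields this combined string with no extra bookkeeping.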
    <Paragraph position="11"> The basis of our run-time system is the arc list H.</Paragraph>
    <Paragraph position="12"> For a large lexicon, it is inconvenient to write this list by hand, and so we compile it from a lexical table that reflects more directly the way that morphological information is represented in a standard dictionary, such as the Collins dictionary we began with. The program interprets recursively all the possible arcs of the lists. Therefore, more than one analyzed or generated form may be given. For instance, the analysis for the input word &quot;retirada&quot; is of the form:
retirar past participle feminine singular
retirado adjective feminine singular
retirada noun feminine singular</Paragraph>
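Compiling an arc list from a lexical table might look like the sketch below. The table layout, the paradigm class names, and the state numbering are all hypothetical, since the format of the actual lexical table is not given here; the point is only that each entry contributes one stem arc while the endings are shared per class.

```python
# Hypothetical paradigm table: class name to (suffix, lexical label) pairs.
PARADIGMS = {
    "verb-ar": [("ada", " past participle feminine singular"),
                ("ado", " past participle masculine singular")],
    "adj-o": [("a", " adjective feminine singular"),
              ("o", " adjective masculine singular")],
    "noun-f": [("", " noun feminine singular"),
               ("s", " noun feminine plural")],
}

# Hypothetical lexical table: (stem, lemma, paradigm class).
LEXICON = [
    ("retir", "retirar", "verb-ar"),
    ("retirad", "retirado", "adj-o"),
    ("retirada", "retirada", "noun-f"),
]

START, FINAL = 0, 1


def compile_arcs(lexicon, paradigms):
    """Build the arc list: one stem arc per entry, one set of ending arcs per class."""
    class_state = {name: i + 2 for i, name in enumerate(sorted(paradigms))}
    arcs = []
    for stem, lemma, cls in lexicon:
        arcs.append((START, stem, lemma, class_state[cls]))
    for cls, endings in paradigms.items():
        for suffix, label in endings:
            arcs.append((class_state[cls], suffix, label, FINAL))
    return arcs


def analyze(arcs, word):
    """Follow every path whose surface labels spell word; collect lexical strings."""
    out, stack = [], [(START, 0, "")]
    while stack:
        state, pos, acc = stack.pop()
        if state == FINAL and pos == len(word):
            out.append(acc)
        for source, surface, lexical, target in arcs:
            if source == state and word.startswith(surface, pos):
                stack.append((target, pos + len(surface), acc + lexical))
    return sorted(out)


for analysis in analyze(compile_arcs(LEXICON, PARADIGMS), "retirada"):
    print(analysis)
# retirada noun feminine singular
# retirado adjective feminine singular
# retirar past participle feminine singular
```

Because every matching path is followed, the three readings of "retirada" are all returned, in sorted order rather than the order listed in the text.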
  </Section>
</Paper>