<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1025"> <Title>Compounding and derivational morphology in a finite-state setting</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 The locus of word formation rules in grammars for NLP </SectionTitle> <Paragraph position="0"> In English orthography, compounds following productive word formation patterns are spelled with spaces or hyphens separating the components (e.g., classic car repair workshop). This is convenient from an NLP perspective, since most aspects of word formation can be ignored from the point of view of the conceptually simpler token-internal processes of inflectional morphology, for which standard finite-state techniques can be applied. (Let us assume that, to a first approximation, spaces and punctuation are used to identify token boundaries.) It also makes it very easy to access one or more of the components of a compound (like classic car in the example), which is required in many NLP techniques (e.g., in a vector space model).</Paragraph> <Paragraph position="1"> If an NLP task for English requires detailed information about the structure of compounds (as complex multi-token units), it is natural to use the formalisms of computational syntax for English, i.e., context-free grammars, or possibly unification-based grammars. This makes it possible to deal with the bracketing structure of compounding, which would be impossible to cover in full generality in the finite-state setting.</Paragraph> <Paragraph position="2"> In languages like German, spelling conventions for compounds do not support such a convenient split between sub-token processing based on finite-state technology and multi-token processing based on context-free grammars or beyond--in German, even very complex compounds are written without spaces or hyphens: words like Verkehrswegeplanungsbeschleunigungsgesetz ('law for speeding up the planning of traffic routes') appear in corpora. So, for a fully adequate and general account, the token-level analysis in German has to be done at least with a context-free grammar:1 For checking the selection features of derivational affixes, in the general case a tree or bracketing structure is required. For instance, the prefix Fehl- combines with nouns (compare (1)); however, it can appear linearly adjacent to a verb (including the verb's own prefix), and only then do we get the suffix -ung, which turns the verb into a noun.</Paragraph> <Paragraph position="3"> In English, too, the token-level analysis has to go beyond finite-state means: the prefix non- in nonrealizability combines with the complex derived adjective realizable, not with the verbal stem realize (and non- could combine with an even more complex form). However, since in English there is much less token-level interaction between derivation and compounding, a finite-state approximation of the relevant facts at the token level is more straightforward than in German.</Paragraph> <Paragraph position="4"> Furthermore, context-free power is required to parse the internal bracketing structure of complex words like (2), which occur frequently and productively.</Paragraph> (2) Gesund.heits.ver.trag.lich.keits.pruf.ung (gesund 'healthy', trag 'bear', pruf 'examine') 'check for health compatibility' As the results of the DeKo project on derivational and compositional morphology of German show (Schmid et al. 2001), an adequate account of the word formation principles has to rely on a number of dimensions (or features/attributes) of the morphological units.
An affix's selection of the element it combines with is based on these dimensions. Besides part-of-speech category, the dimensions include the origin of the morpheme (Germanic vs. classical, i.e., Latinate or Greek2), the complexity of the unit (simplex/derived), and the stem type (for many lemmata, different base stems, derivation stems and compounding stems are stored; e.g., trag in (2) is a derivational stem for the lemma trag(en) ('bear'); heits is the compounding stem for the affix -heit).</Paragraph> <Paragraph position="6"> Given these dimensions in affix feature selection, we need a unification-based (attribute) grammar to capture the word formation principles explicitly in a formal account. A slightly simplified such grammar is given in (3), presented in a PATR-II-style notation:3 (3) a. X0 → X1 X2 (prefixation) b. X0 → X1 X2 (suffixation) c. X0 → X1 X2 (compounding) where each rule comes with feature equations over the dimensions just introduced. Applying the suffixation rule, we can derive intellektual.isier- (the stem of 'intellectualize') from the two sample lexicon entries in (4). Note how the selection feature (SELECTION) of prefixes and suffixes is unified with the selected category's features (triggered by the last feature equation in the prefixation and suffixation rules (3a,b)).</Paragraph> <Paragraph position="11"> Context-freeness Since the range of all atomic-valued features is finite, and we can exclude lexicon entries specifying the SELECTION feature embedded in their own SELECTION value, the three attribute-grammar rewrite rules can be compiled out into an equivalent context-free grammar.</Paragraph>
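To make the compile-out step concrete, here is a minimal sketch of how such an attribute grammar can be expanded into a plain context-free grammar by enumerating the finite feature ranges. The feature inventory and the sample entry for -isier (modeled on the intellektual.isier- example) are illustrative assumptions, not the DeKo grammar itself.

```python
from itertools import product

# Illustrative (assumed) feature ranges; the real DeKo grammar distinguishes
# further dimensions, e.g. stem types.
CATS = ("N", "V", "A")
ORIGINS = ("germanic", "classical")
COMPLEXITY = ("simplex", "derived")

# Hypothetical lexicon entry: -isier selects classical-origin adjectives
# and yields a derived verb (cf. intellektual + isier).
SUFFIXES = [
    {"form": "isier", "selection": {"cat": "A", "origin": "classical"},
     "result": {"cat": "V", "origin": "classical"}},
]

def compile_suffixation(suffixes):
    """Compile the suffixation schema (3b) into context-free rules by
    enumerating all fully instantiated feature bundles; unification of the
    SELECTION value is replaced by a simple membership test."""
    rules = []
    for s in suffixes:
        for cat, origin, cmplx in product(CATS, ORIGINS, COMPLEXITY):
            sel = s["selection"]
            if cat == sel["cat"] and origin == sel["origin"]:
                lhs = (s["result"]["cat"], s["result"]["origin"], "derived")
                rules.append((lhs, ((cat, origin, cmplx), s["form"])))
    return rules

for lhs, rhs in compile_suffixation(SUFFIXES):
    print(lhs, "->", rhs)
# ('V', 'classical', 'derived') -> (('A', 'classical', 'simplex'), 'isier')
# ('V', 'classical', 'derived') -> (('A', 'classical', 'derived'), 'isier')
```

Since every feature has a finite range, and SELECTION values cannot be embedded in their own SELECTION value, this enumeration terminates; that is exactly why the three rule schemata compile into a (large but finite) context-free grammar.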
</Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Arguments for a finite-state word formation component </SectionTitle> <Paragraph position="0"> While there is linguistic justification for a context-free (or unification-based) model of word formation, there are a number of considerations that speak in favor of a finite-state account. (A basic assumption made here is that a morphological analyzer is typically used in a variety of different system contexts, so broad usability, consistency, simplicity and generality of the architecture are important criteria.) First, there are a number of NLP applications for which a token-based finite-state analysis is standardly used as the only linguistic analysis. It would be impractical to move to a context-free technology in these areas; at the same time, it is desirable to include an account of word formation in these tasks. In particular, it is important to be able to break complex compounds down into their individual components, in order to achieve an effect similar to the way compounds are treated in English orthography.</Paragraph> <Paragraph position="1"> Second, inflectional morphology has mostly been treated in the finite-state two-level paradigm. Since any account of word formation has to be combined with inflectional morphology, using the same technology for both parts guarantees consistency and reusability.4 Third, when a morphological analyzer is used in a linguistically sophisticated application context, there will typically be other linguistic components, most notably a syntactic grammar. In these components, more linguistic information will be available to address derivation/compounding. Since the necessary generative capacity is available in the syntactic grammar anyway, it seems reasonable to leave the more sophisticated aspects of morphological analysis to this component (very much like the syntax-based account of English compounds we discussed initially). Given the first two arguments, we will nevertheless aim for maximal exactness of the finite-state word formation component.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Previous strategies of addressing compounding and derivation </SectionTitle> <Paragraph position="0"> Naturally, existing morphological analyzers of languages like German include a treatment of compositional morphology (e.g., Schiller 1995). An over-generation strategy has been applied to ensure coverage of corpus data. Exactness was aimed at for the inflected head of a word (which is always right-peripheral in German), but not for the non-head part of a complex word. The non-head may essentially be a flat concatenation of lexical elements or even an arbitrary sequence of symbols. Clearly, an account making use of morphological principles would be desirable. While the internal structure of a word is not relevant for the identification of the part-of-speech category and morphosyntactic agreement information, it is certainly important for information extraction, information retrieval, and higher-level tasks like machine translation.</Paragraph> <Paragraph position="1"> 4 An alternative is to construct an interface component between a finite-state inflectional morphology and a context-free word formation component. While this can conceivably be done, it restricts the applicability of the resulting overall system, since many higher-level applications presuppose a finite-state analyzer; this is for instance the case for the Xerox Linguistic Environment (http://www.parc.com/istl/groups/nltt/xle/), a development platform for syntactic Lexical-Functional Grammars (Butt et al. 1999).</Paragraph> <Paragraph position="2"> An alternative strategy--putting the emphasis on a linguistically satisfactory account of word formation--is to compile out a higher-level word formation grammar into a finite-state automaton (FSA), assuming a bound on the depth of recursive self-embedding. This strategy was used in a finite-state implementation of the rules in the DeKo project (Schmid et al. 2001), based on the AT&T Lextools toolkit by Richard Sproat.5 The toolkit provides a compilation routine which transforms a certain class of regular-grammar-equivalent rewrite grammars into finite-state transducers. Full context-free recursion has to be replaced by an explicit cascading of special category symbols (e.g., N1, N2, N3, etc.).</Paragraph> <Paragraph position="3"> Unfortunately, the depth of embedding occurring in real examples is at least four, even if we assume that derivations like ver.trag.lich ('compatible'; in (2)) are stored in the lexicon as complex units: in the initially mentioned compound Verkehrs.wege.planungs.beschleunigungs.gesetz ('law for speeding up the planning of traffic routes'), we might assume that Verkehrs.wege ('traffic routes') is stored as a unit, but the remainder of the analysis is rule-based. With this depth of recursion (and a realistic morphological grammar), we get an unmanageable explosion of the number of states in the compiled (intermediate) FSA. 5 Lextools: a toolkit for finite-state linguistic analysis, AT&T Labs Research; http://www.research.att.com/sw/tools/lextools/</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Proposed strategy </SectionTitle> <Paragraph position="0"> We propose a refinement of finite-state approximation techniques for context-free grammars, as they have been developed for syntax (Pereira and Wright 1997, Grimley-Evans 1997, Johnson 1998, Nederhof 2000).
Our strategy assumes that we want to express and develop the morphological grammar at the linguistically satisfactory level of a (context-free-equivalent) unification grammar. In processing, a finite-state approximation of this grammar is used. Exploiting specific facts about morphology, the number of states of the constructed FSA can be kept relatively low, while still being in a position to cover realistic corpus examples in an exact way.</Paragraph> <Paragraph position="1"> The construction is based on the following observation: Intuitively, context-free expressiveness is not needed to constrain grammaticality for most word formation combinations. This is because in most cases, either (i) morphological feature selection is performed between string-adjacent terminal symbols, or (ii) there are no categorial restrictions on possible combinations. (i) is always the case for suffixation, since German morphology is exclusively right-headed.6 So the head of the unit selected by the suffix is always adjacent to it, no matter how complex the unit is: (5) [X [Y . . . Y] Xaff] (i) is also the case for prefixes combining with a simple unit. (ii) is the case for compounding: while affix-derivation is sensitive to the mentioned dimensions like category and origin, no such grammatical restrictions apply in compounding.7 So the fact that in compounding, the heads of the two combined units may not be adjacent (since the right unit may be complex) does not imply that context-freeness is required to exclude impossible combinations. The only configuration requiring context-freeness to exclude ungrammatical examples is the combination of a prefix with a complex morphological unit.</Paragraph> <Paragraph position="3"> As (1) showed, such examples do occur; so they should be given an exact treatment. However, the depth of recursive embeddings of this particular type (possibly with other embeddings intervening) in realistic text is limited. So a finite-state approximation keeping track of prefix embeddings in particular, but leaving the other operations unrestricted, seems well justified. We will show in sec. 6 how such a technique can be devised, building on the algorithm reviewed in sec. 5.</Paragraph> <Paragraph position="4"> 6 This may appear to be falsified by examples like ver- (Vaff) + Urteil (N, 'judgement') = verurteilen (V, 'convict'); however, in this case, a noun-to-verb conversion precedes the prefix derivation. Note that the inflectional marking is always right-peripheral. 7 Of course, when speakers disambiguate the possible bracketings of a complex compound, they can exclude many combinations as implausible. But this is a defeasible, world knowledge-based effect, which should not be modeled as strict selection in a morphological grammar.</Paragraph> <SectionTitle> 5 RTN-based approximation techniques </SectionTitle> <Paragraph position="5"> A comprehensive overview and experimental comparison of finite-state approximation techniques for context-free grammars is given in (Nederhof 2000). In Nederhof's approximation experiments based on an HPSG grammar, the so-called RTN method provided the best trade-off between exactness and the resources required in automaton construction. (Techniques that involve a heavy explosion of the number of states are impractical for non-trivial grammars.)
More specifically, a parameterized version of the RTN method, in which the FSA keeps track of possible derivational histories, was considered most adequate.</Paragraph> <Paragraph position="6"> The RTN method of finite-state approximation is inspired by recursive transition networks (RTNs). RTNs are collections of sub-automata. For each rule A → X1 ... Xn in a context-free grammar, a sub-automaton with n+1 states is constructed (8). As a symbol is processed in the A automaton (say, X1), the RTN control jumps to the initial state of the respective sub-automaton (so, from q0 in (8) to the initial state of the sub-automaton for X1), keeping the return address on a stack representation. When the sub-automaton is in its final state, control jumps back to the next state in the A automaton: q1. In the RTN-based finite-state approximation of a context-free grammar (which does not have an unlimited stack representation available), the jumps to sub-automata are hard-wired, i.e., transitions for non-terminal symbols, like the X1 transition from q0 to q1, are replaced by direct ε-transitions to the initial state and from the final state of the respective sub-automaton (9). (Of course, the resulting non-deterministic FSA is then determinized and minimized.) The technique is approximative, since on jumping back, the automaton "forgets" where it had come from; so if there are several rules with a right-hand side occurrence of, say, Xn, the automaton may non-deterministically jump back to the wrong rule. For instance, if our grammar consists of a recursive production B → a B c for category B and a production B → b, we get the approximation FSA (10). The approximation loses the original balancing of a's and c's, so "abcc" is incorrectly accepted.</Paragraph> <Paragraph position="7"> In the parameterized version of the RTN method that Nederhof (2000) proposes, the state space is enlarged: different copies of each state are created to keep track of what the derivational history was at the point of entering the present sub-automaton. For representing the derivational history, Nederhof uses a list of "dotted" productions, as known from Earley parsing. So, for state q1 in (10), we would get copies like q1,⟨⟩ and q1,⟨[B → a • B c]⟩, and likewise for the other states. The ε-transitions for jumping to and from embedded categories observe the laws for legal context-free derivations, as far as recorded by the dotted rules.8 Of course, the window for looking back in history is bounded; there is a parameter (which Nederhof calls d) for the size of the history list in the automaton construction. Beyond the recorded history, the automaton's approximation will again become inexact. (11) shows the parameterized variant of (10), with parameter d = 1, i.e., a maximal length of one element for the history (I is used as a short-hand for the item [B → a • B c]). (11) will not accept "abcc" (but it will accept "aabccc").</Paragraph>
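The following minimal sketch (with assumed state and transition representations; this is not Nederhof's code) builds the unparameterized RTN approximation for the toy grammar B → a B c | b and reproduces the overgeneration just described.

```python
# Minimal sketch of the unparameterized RTN approximation; the grammar,
# state naming and NFA representation are illustrative assumptions.

GRAMMAR = {"B": [["a", "B", "c"], ["b"]]}   # generates a^n b c^n

def build_rtn_approximation(grammar, start):
    """One chain of states per rule; an arc labeled with a nonterminal is
    hard-wired as epsilon jumps into and out of that nonterminal's
    sub-automaton (the jump back "forgets" the call site)."""
    counter = iter(range(1_000_000))
    entry = {nt: next(counter) for nt in grammar}
    exit_ = {nt: next(counter) for nt in grammar}
    trans = []                                # (state, label, state); None = epsilon
    for nt, rules in grammar.items():
        for rhs in rules:
            prev = entry[nt]
            for sym in rhs:
                nxt = next(counter)
                if sym in grammar:            # nonterminal: epsilon in / out
                    trans.append((prev, None, entry[sym]))
                    trans.append((exit_[sym], None, nxt))
                else:                         # terminal: ordinary transition
                    trans.append((prev, sym, nxt))
                prev = nxt
            trans.append((prev, None, exit_[nt]))
    return trans, entry[start], exit_[start]

def accepts(trans, start, final, word):
    """Naive NFA simulation with epsilon closure."""
    def closure(states):
        todo, seen = list(states), set(states)
        while todo:
            s = todo.pop()
            for p, lab, q in trans:
                if p == s and lab is None and q not in seen:
                    seen.add(q)
                    todo.append(q)
        return seen
    current = closure({start})
    for ch in word:
        current = closure({q for p, lab, q in trans if p in current and lab == ch})
    return final in current

trans, s0, f0 = build_rtn_approximation(GRAMMAR, "B")
print(accepts(trans, s0, f0, "abc"))    # True: in the language
print(accepts(trans, s0, f0, "aabcc"))  # True: in the language
print(accepts(trans, s0, f0, "abcc"))   # True: overgeneration, a/c balance lost
```

Because the return ε-transition from a sub-automaton's final state is shared by all call sites, the automaton can "return" to a rule it never entered from; this is exactly the inexactness that the history lists of the parameterized method are designed to limit.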
<Paragraph position="8"> The number of possible histories (and thus the number of states in the non-deterministic FSA) grows exponentially with the depth parameter, but only polynomially with the size of the grammar. Hence, with parameter d = 2 ("RTN2"), the technique is usable for non-trivial syntactic grammars.</Paragraph> <Paragraph position="9"> Nederhof (2000) discusses an important additional step for avoiding an explosion of the size of the intermediate, non-deterministic FSA: before the described approximation is performed, the context-free grammar is split up into subgrammars of mutually recursive categories (i.e., categories which can participate in a recursive cycle); in each subgrammar, all other categories are treated as terminal symbols. For each subgrammar, the RTN construction and FSA minimization is performed separately, so in the end, the relatively small minimized FSAs can be reassembled.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 A selective history-based RTN method </SectionTitle> <Paragraph position="0"> In word formation, the split of the original grammar into subgrammars of mutually recursive (MR) categories has no great complexity-reducing effect (if any), contrary to the situation in syntax. Essentially, all recursive categories are part of a single large equivalence class of MR categories. Hence, the size of the grammar that has to be effectively approximated is fairly large (recall that we are dealing with a compiled-out unification grammar). For a realistic grammar, the parameterized RTN technique is unusable with parameter d = 2 or higher. Moreover, a history of just two previous embeddings (as we get it with d = 2) is too limited in a heavily recursive setting like word formation: recursive embeddings of depth four occur in realistic text.</Paragraph> <Paragraph position="1"> However, we can exploit more effectively the "mildly context-free" characteristics of morphological grammars (at least of German) discussed in sec. 4. We propose a refined version of the parameterized RTN method, with a selective recording of derivational history. We stipulate a distinction between two types of rules: "historically important" h-rules (written A →h X1 ... Xn) and non-h-rules (written A →¬h X1 ... Xn). The h-rules are treated as in the parameterized RTN method. The non-h-rules are not recorded in the construction of history lists; they are, however, taken into account in the determination of legal histories. For instance, an item [A →h ... • B ...] will appear as a legal history for the sub-automaton for some category D only if there is a derivation B ⇒*¬h D β (i.e., a sequence of rule rewrites making use of non-h-rules). By classifying certain rules as non-h-rules, we can concentrate record-keeping resources on a particular subset of rules.</Paragraph>
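A minimal sketch of the two ingredients just described, selective recording and the legal-history check, under assumed rule and item representations; the rule skeletons and helper names are illustrative, not the paper's implementation.

```python
from collections import namedtuple

Rule = namedtuple("Rule", "lhs rhs h")   # h=True: "historically important"

# Illustrative category-level rule skeletons for German word formation:
RULES = [
    Rule("N", ("Npref", "N"), True),     # prefixation (3a): the only h-rules
    Rule("N", ("V", "Nsuf"), False),     # suffixation (3b): head-adjacent
    Rule("N", ("N", "N"), False),        # compounding (3c): unrestricted
    Rule("V", ("Vpref", "V"), True),
]

Item = namedtuple("Item", "rule dot")    # a dotted production

def push_history(history, item, d):
    """Entering a sub-automaton extends the history only for h-rule items,
    truncated to the d most recent entries; non-h-rules leave it untouched."""
    if not item.rule.h:
        return history
    return (history + [item])[-d:]

def legal_entry(item, target):
    """[A ->h ... . B ...] may head the history of `target`'s sub-automaton
    only if B derives a unit with `target` at its left edge via non-h-rules."""
    seen, todo = set(), [item.rule.rhs[item.dot]]
    while todo:
        cat = todo.pop()
        if cat == target:
            return True
        if cat in seen:
            continue
        seen.add(cat)
        todo.extend(r.rhs[0] for r in RULES if not r.h and r.lhs == cat)
    return False

i = Item(RULES[0], 1)                 # [N ->h Npref . N]
print(push_history([], i, d=2))       # recorded: prefixation is an h-rule
print(legal_entry(i, "N"))            # True: trivially at the left edge
print(legal_entry(i, "V"))            # True: N ->(non-h) V Nsuf exposes V
```

Suffixation and compounding steps never enter the history, so the bounded list is spent exclusively on the one configuration (prefix over a complex unit) where the plain RTN approximation overgenerates.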
<Paragraph position="2"> In sec. 4, we saw that for most rules in the compiled-out context-free grammar for German morphology (all rules compiled from (3b) and (3c)), the inexactness of the RTN approximation does not have any negative effect (either due to head-adjacency, which is preserved by the non-parametric version of RTN, or due to the lack of category-specific constraints, which means that no context-free balancing is checked). Hence, it is safe to classify these rules as non-h-rules. The only rules in which the inexactness may lead to overgeneration are the ones compiled from the prefix rule (3a). Marking these rules as h-rules and doing selective history-based RTN construction gives us exactly the desired effect: we will get an FSA that accepts a free alternation of all three word-formation types (as far as compatible with the lexical affixes' selection), but stacking of prefixes is kept track of. Suffix derivations and compounding steps do not increase the length of our history list, so even with d = 1 or d = 2, we can get very far in exact coverage.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Additional optimizations </SectionTitle> <Paragraph position="0"> Besides the selective history list construction, two further optimizations were applied to Nederhof's (2000) parameterized RTN method. First, Earley items with the same remainder to the right of the dot were collapsed ([A →h α1 • γ] and [A →h α2 • γ]). Since they are indistinguishable in terms of future behavior, making a distinction results in an unnecessary increase of the state space. (Effectively, only the material to the right of the dot was used to build the history items.) Second, for immediate right-peripheral recursion, the history list was collapsed; i.e., if the current history has the form ⟨[A →h ... • A], ...⟩, and the next item to be added would again be [A →h ... • A], the present list is left unchanged. This is correct because completion of [A →h ... • A] will automatically result in the completion of all immediately stacked such items.</Paragraph> <Paragraph position="5"> Together, the two optimizations help to keep the number of different histories small, without losing relevant distinctions. Especially the second optimization is very effective in a selective history setting, since the "immediate" recursion need not be literally immediate: an arbitrary number of non-h-rules may intervene. So if we find a noun prefix [N →h Npref • N], i.e., we are looking for a noun, we need not pay attention (in terms of coverage-relevant history distinctions) to whether we are running into compounds or suffixations: we know, when we find another noun prefix (with the same selection features, i.e., origin etc.), that one analysis will always be to close off both prefixations with the same noun (12). Note, however, that in terms of grammatically legal continuations, this configuration is "subsumed" by (13b), which is compatible with (12) (the top '?' category will be accessible using ε-transitions back from a completed N--recall that suffixation and compounding are not controlled by any history items).</Paragraph>
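The two optimizations can be sketched as a normalization of the history list; the tuple representations and helper names below are assumptions for illustration, not the authors' code.

```python
def collapse_item(item):
    """Optimization 1: items differing only to the left of the dot are
    collapsed -- keep just the remainder to the right of the dot."""
    lhs, rhs, dot = item
    return (lhs, rhs[dot:])            # e.g. ("N", ("N",)) for [N ->h Npref . N]

def right_peripheral(collapsed):
    """True if the dot stands immediately before a final symbol = LHS."""
    lhs, rest = collapsed
    return rest == (lhs,)

def push(history, item, d):
    """Optimization 2: immediate right-peripheral recursion does not grow
    the list -- if the collapsed item repeats the most recent entry, the
    history is left unchanged (completing one completes them all)."""
    c = collapse_item(item)
    if history and history[-1] == c and right_peripheral(c):
        return history
    return (history + [c])[-d:]

# Stacking two prefix items with the same selection features is book-kept
# once (cf. vor.ver.arbeiten in the text):
PREF_ITEM = ("N", ("Npref", "N"), 1)   # [N ->h Npref . N]
h = push([], PREF_ITEM, d=2)
h = push(h, PREF_ITEM, d=2)            # collapsed: no second entry
print(h)                                # [('N', ('N',))] -- a single entry
```

In the selective setting, any run of compounding and suffixation steps between the two prefixes leaves the history untouched, which is why the "immediate" repetition test above fires even when the recursion is not literally immediate.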
non-deterministic fsa minimized fsa categ./hist. list # states # a0 -trans. # a29 -trans. # states # trans. for a different set of features) is greater than our parameter a35 . Thanks to the second optimization, the relatively frequent case of stacking of two verbal prefixes as in vor.ver.arbeiten 'preprocess' counts as a single prefix for book-keeping purposes.</Paragraph> </Section> class="xml-element"></Paper>