<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1033"> <Title>Dependency Parsing with an Extended Finite State Approach</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Dependency Syntax </SectionTitle> <Paragraph position="0"> Dependency approaches to syntactic representation use the notion of syntactic relation to associate surface lexical items. The book by Mel'čuk (1988) presents a comprehensive exposition of dependency syntax. Computational approaches to dependency syntax have recently become quite popular (e.g., a workshop dedicated to computational approaches to dependency grammars was held at the COLING/ACL'98 Conference). Järvinen and Tapanainen have demonstrated an efficient wide-coverage dependency parser for English (Tapanainen and Järvinen, 1997; Järvinen and Tapanainen, 1998).</Paragraph> <Paragraph position="1"> The work of Sleator and Temperley (1991) on link grammar, an essentially lexicalized variant of dependency grammar, has also proved to be interesting in a number of aspects. Dependency-based statistical language modeling and analysis have also become quite popular in statistical natural language processing (Lafferty et al., 1992; Eisner, 1996; Chelba et al., 1997).</Paragraph> <Paragraph position="2"> Robinson (1970) gives four axioms for well-formed dependency structures, which have been assumed in almost all computational approaches. In a dependency structure of a sentence (i) one and only one word is independent, i.e., not linked to some other word, (ii) all others depend directly on some word, (iii) no word depends on more than one other, and (iv) if a word A depends directly on B, and some word C intervenes between them (in linear order), then C depends directly on A or on B, or on some other intervening word.
This last condition of projectivity (or various extensions of it; see, e.g., Lau and Huang (1994)) is usually assumed by most computational approaches to dependency grammars as a constraint for filtering configurations, and has also been used as a simplifying condition in statistical approaches for inducing dependencies from corpora (e.g., Yüret (1998)).</Paragraph> </Section> <Section position="4" start_page="0" end_page="254" type="metho"> <SectionTitle> 3 Turkish </SectionTitle> <Paragraph position="0"> Turkish is an agglutinative language where a sequence of inflectional and derivational morphemes gets affixed to a root (Oflazer, 1993). Derivations are very productive, and the syntactic relations that a word is involved in as a dependent or head element are determined by the inflectional properties of the</Paragraph> <Paragraph position="2"> [Figure 1: Links and Inflectional Groups] one or more (intermediate) derived forms. In this work, we assume that a Turkish word is represented as a sequence of inflectional groups (IGs hereafter), separated by ^DBs denoting derivation boundaries, in the following general form: root+Infl1^DB+Infl2^DB+...^DB+Infln</Paragraph> <Paragraph position="3"> where Infli denote relevant inflectional features including the part-of-speech for the root, or any of the derived forms.
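Segmenting an analysis in the general form above into its IGs is a one-line operation; a minimal Python sketch (the analysis string below is our own illustrative construction, not output of the paper's actual analyzer):

```python
# Sketch: split a morphological analysis into inflectional groups (IGs)
# at the ^DB derivation-boundary markers.

def to_igs(analysis):
    return analysis.split("^DB")

# Hypothetical analysis of kitaptaki ("the one on the book"):
analysis = "kitap+Noun+A3sg+Pnon+Loc^DB+Adj"
print(to_igs(analysis))
# ['kitap+Noun+A3sg+Pnon+Loc', '+Adj']
```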
For instance, the derived determiner sağlamlaştırdığımızdaki1 would be represented as:2 sağlam+Adj^DB+Verb+Become^DB+Verb+Caus+Pos ... A sentence would then be represented as a sequence of the IGs making up the words.</Paragraph> <Paragraph position="4"> An interesting observation that we can make about Turkish is that, when a word is considered as a sequence of IGs, syntactic relation links only emanate from the last IG of a (dependent) word, and land on one of the IGs of the (head) word on the right (with minor exceptions), as exemplified in Figure 1. With minor exceptions, the dependency links between the IGs, when drawn above the IG sequence, do not cross. Figure 2 shows a dependency tree for a sentence laid on top of the words segmented along IG boundaries.</Paragraph> </Section> <Section position="5" start_page="254" end_page="257" type="metho"> <SectionTitle> 4 Finite State Dependency Parsing </SectionTitle> <Paragraph position="0"> The approach relies on augmenting the input with "channels" that (logically) reside above the IG sequence and "laying" links representing dependency relations in these channels, as depicted in Figure 3 a).</Paragraph> <Paragraph position="1"> The parser operates in a number of iterations: At each iteration of the parser, a new empty channel 1Literally, "(the thing existing) at the time we caused (something) to become strong". Obviously this is not a word that one would use every day.
Turkish words found in typical text average about 3-4 morphemes including the stem.</Paragraph> <Paragraph position="2"> 2The morphological features other than the obvious POS are: +Become: become verb, +Caus: causative verb, +PastPart: derived past participle, +P1pl: 1pl possessive agreement, +A3sg: 3sg number-person agreement, +Zero: zero derivation with no overt morpheme, +Pnon: no possessive agreement, +Loc: locative case, +Pos: positive polarity.</Paragraph> <Paragraph position="3"> is stacked on top of the existing channels, so that new links can be added. An abstract view of this is presented in parts b) through e) of Figure 3.</Paragraph> <Paragraph position="4"> [Figure 3: a) The input sequence of IGs is augmented with symbols to represent channels. b) Links are embedded in channels. c) New channels are "stacked on top of each other", d) so that links that cannot be accommodated in lower channels can be established.]</Paragraph> <Section position="1" start_page="254" end_page="255" type="sub_section"> <SectionTitle> 4.1 Representing Channels and Syntactic Relations </SectionTitle> <Paragraph position="0"> The sequence (or the chart) of IGs is produced by a morphological analyzer FST, with each IG being augmented by two pairs of delimiter symbols, as <(IG)>.
Word-final IGs, the IGs that links will emanate from, are further augmented with a special marker, @.</Paragraph> <Paragraph position="1"> Channels are represented by pairs of matching symbols that surround the <...( and the )...> pairs.</Paragraph> <Paragraph position="2"> Symbols for new channels (upper channels in Figure 3) are stacked so that the symbols for the topmost channels are those closest to the (...).3 The channel symbol 0 indicates that the channel segment is not used, while 1 indicates that the channel is used by a link that starts at some IG on the left and ends at some IG on the right, that is, the link is just crossing over the IG. If a link starts from an IG (ends on an IG), then a start (stop) symbol denoting the syntactic relation is used on the right (left) side of the IG. The syntactic relations (along with the symbols used) that we currently encode in our parser are the following:4 S (Subject), O (Object), M (Modifier, adv/adj), P (Possessor), C (Classifier), D (Determiner), T (Dative Adjunct), L (Locative Adjunct), A (Ablative Adjunct) and I (Instrumental Adjunct). For instance, with three channels, the two IGs of bahçedeki in Figure 2 would be represented as <MD0(bahçe+Noun+A3sg+Pnon+Loc)000> ... The M and D on the left of the first IG indicate the incoming modifier and determiner links, and the D on the right of the second IG indicates the outgoing determiner link.</Paragraph> </Section> <Section position="2" start_page="255" end_page="255" type="sub_section"> <SectionTitle> 4.2 Components of a Parser Stage </SectionTitle> <Paragraph position="0"> The basic strategy of a parser stage is to recognize by a rule (encoded as a regular expression) a dependent IG and a head IG, and link them by modifying the "topmost" channel between those two. To achieve this: 1.
we put temporary brackets to the left of the dependent IG and to the right of the head IG, making sure that (i) the last channel in that segment is free, and (ii) the dependent is not already linked (at one of the lower channels), 2. we mark the channels of the start, intermediate and ending IGs with the appropriate symbols encoding the relation thus established by the brackets, 3. we remove the temporary brackets.</Paragraph> <Paragraph position="1"> A typical linking rule looks like the following:5 [LL IG1 LR] [ML IG2 MR]* [RL IG3 RR] (->) "{S" ... "S}" This rule says: (optionally) bracket (with {S and S}) any occurrence of morphological pattern IG1 (dependent), skipping over any number of occurrences of pattern IG2, finally ending with a pattern IG3 (governor). The symbols L(eft)L(eft), LR, ML, MR, RL and RR are regular expressions that encode constraints on the bounding channel symbols. For instance, LR is the pattern "@" ")" "0" ["0" | "1"]* ">" which checks that (i) this is a word-final IG (has an "@"), (ii) the right side "topmost" channel is empty (the channel symbol nearest to ")" is "0"), and (iii) the IG is not linked to any other in any of the lower channels (the only symbols on the right side are 0s and 1s). For instance, the example rule [LL NominativeNominalA3pl LR] [ML AnyIG MR]* [RL [FiniteVerbA3sg | FiniteVerbA3pl] RR] (->) "{S" ... "S}" 5We use the XRCE Regular Expression Language Syntax; see http://www.xrce.xerox.com/research/mltt/fst/fssyntax.html for details.</Paragraph> <Paragraph position="2"> is used to bracket a segment starting with a plural nominative nominal, as subject of a finite verb on the right with either +A3sg or +A3pl number-person agreement (both allowed in Turkish).
The regular expression NominativeNominalA3pl matches any nominal IG with nominative case and A3pl agreement, while the regular expression [FiniteVerbA3sg | FiniteVerbA3pl] matches any finite verb IG with either A3sg or A3pl agreement. The regular expression AnyIG matches any IG.</Paragraph> <Paragraph position="3"> All the rules are grouped together into a parallel bracketing rule defined as follows:</Paragraph> <Paragraph position="5"> which will produce all possible bracketings of the input IG sequence.6</Paragraph> </Section> <Section position="3" start_page="255" end_page="256" type="sub_section"> <SectionTitle> 4.3 Filtering Crossing Link Configurations </SectionTitle> <Paragraph position="0"> The bracketings produced by Bracket contain configurations that may have crossing links. This happens when the left side channel symbols of the IG immediately to the right of an open bracket contain the symbol 1 for one of the lower channels, indicating a link entering the region, or when the right side channel symbols of the IG immediately to the left of a close bracket contain the symbol 1 for one of the lower channels, indicating a link exiting the segment, i.e., either or both of the following patterns appear in the bracketed segment: (i) {S < ... 1 ... 0 ( ... ) ...</Paragraph> <Paragraph position="1"> (ii) ... ( ... ) 0 ... 1 ... > S} Configurations generated by bracketing are filtered by FSTs implementing suitable regular expressions that reject inputs having crossing links.</Paragraph> <Paragraph position="2"> A second configuration that may appear is the following: A rule may attempt to put a link in the topmost channel even though the corresponding segment is not utilized in a previous channel, e.g., the corresponding segment in one of the previous channels may be all 0s. This constraint filters such cases to 6{Reli and Reli} are pairs of brackets; there is a distinct pair for each syntactic relation to be identified by these rules.
prevent redundant configurations from proliferating in later iterations of the parser.7 For these two configuration constraints we define FilterConfigs as8 FilterConfigs = [ FilterCrossingLinks .o.</Paragraph> <Paragraph position="3"> FilterEmptySegments ]; We can now define one phase (of one iteration) of the parser as:</Paragraph> <Paragraph position="5"/> The transducer MarkChannels modifies the channel symbols in the bracketed segments to either the syntactic relation start or end symbol, or a 1, depending on the IG. Finally, the transducer RemoveTempBrackets removes the brackets.9 The formulation up to now does not allow us to bracket an IG on two consecutive non-overlapping links in the same channel. We would need a bracketing configuration like ... {S < ... > {M < ... > S} ... < ... > M} ... but this would not be possible within Bracket, as patterns check that no other brackets are within their segment of interest. Simply composing the Phase transducer with itself without introducing a new channel solves this problem, giving us a one-stage parser, i.e.,</Paragraph> <Paragraph position="8"/> </Section> <Section position="4" start_page="256" end_page="257" type="sub_section"> <SectionTitle> 4.4 Enforcing Syntactic Constraints </SectionTitle> <Paragraph position="0"> The rules linking the IGs are overgenerating in that they may generate configurations that violate some general or language-specific constraints.</Paragraph> <Paragraph position="1"> For instance, more than one subject or one object may attach to a verb, more than one determiner or possessor may attach to a nominal, an object may attach to a passive verb (conjunctions are handled in the manner described in Järvinen and Tapanainen (1998)), or a nominative pronoun may be linked as a direct object (which is not possible in Turkish), etc.
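Returning briefly to the crossing-link patterns (i) and (ii) of Section 4.3, the check they implement can be sketched procedurally. The function, the regex, and the flattened segment format below are our own simplified rendering, not the paper's FST:

```python
# Sketch of the crossing-link check: inside a candidate bracketed segment,
# a '1' on a lower channel of the leftmost IG's left-side symbols (a link
# entering the region) or of the rightmost IG's right-side symbols (a link
# exiting it) means the proposed new link would cross an existing one.
import re

def has_crossing_link(segment):
    # segment: a bracketed substring such as "{S<10(ig1)00> <00(ig2@)00>S}"
    first_left = re.search(r"<([01A-Z]*)\(", segment).group(1)
    last_right = re.findall(r"\)([01A-Z]*)>", segment)[-1]
    # topmost channel is nearest the parentheses: last char on the left
    # side, first char on the right side; a '1' below it crosses.
    return "1" in first_left[:-1] or "1" in last_right[1:]

print(has_crossing_link("{S<10(a)00> <00(b@)00>S}"))  # True: link enters
print(has_crossing_link("{S<00(a)00> <00(b@)00>S}"))  # False
```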
Constraints preventing these can be encoded in the bracketing patterns, but doing so results in complex and unreadable rules. Instead, each can be implemented as a finite state filter which operates on the outputs of Parse by checking the symbols denoting the relations. For instance, we can define the following regular expression for filtering out configurations where two determiners are attached to the same IG:10</Paragraph> <Paragraph position="3"> The FST for this regular expression makes sure that all configurations that are produced have at most one D symbol among the left channel symbols.11 Many other syntactic constraints (e.g., only one object to a verb) can be formulated similarly.</Paragraph> <Paragraph position="4"> All such constraints Cons1, Cons2, ..., ConsN can then be composed to give one FST, SyntacticConstraints, that enforces all of these. Let LastChannelNotEmpty be a transducer which detects if any configuration has at least one link established in the last channel added (i.e., not all of the "topmost" channel symbols are 0s). Let MorphologicalDisambiguator be a reductionistic finite state disambiguator which performs accurate but very conservative local disambiguation and multi-word construct coalescing, to reduce morphological ambiguity without making any errors.</Paragraph> <Paragraph position="5"> The iterative application of the parser can now be given (in pseudo-code) as: # Map sentence to a transducer representing a chart of IGs</Paragraph> <Paragraph position="7"> This procedure iterates until the most recently added channel of every configuration generated is unused (i.e., the (lower regular) language recognized by M .o. LastChannelNotEmpty is empty).
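The outer loop of the procedure above can be sketched in Python. The real steps are FST compositions; here plain functions stand in for them, so all names and the toy stand-ins below are illustrative, not the paper's actual transducers:

```python
# Sketch of the parser's outer loop: add an empty channel, run one parsing
# phase, and stop when the newest channel of every configuration is unused.

def parse(configs, add_channel, phase, last_channel_used):
    while True:
        configs = [phase(add_channel(c)) for c in configs]
        if not any(last_channel_used(c) for c in configs):
            return configs   # newest channel unused everywhere: stop

# Toy stand-ins: a config is a string of channel symbols; each phase
# fills the newest channel (writing '1') at most twice overall.
add_channel = lambda c: c + "0"
phase = lambda c: c[:-1] + "1" if c.count("1") < 2 else c
last_channel_used = lambda c: c.endswith("1")
print(parse([""], add_channel, phase, last_channel_used))
# ['110']
```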
The step after the loop, M = M .o.</Paragraph> <Paragraph position="8"> OnlyOneUnlinked, enforces the constraint that in a correct dependency parse all except one of the word-final IGs have to link as a dependent to some head. 11The crucial portion at the beginning says "For any IG it is not the case that there is more than one substring containing D among the left channel symbols of that IG." This transduction filters out all configurations violating the constraint (and usually there are many of them due to the optionality in the bracketing step). Then Parses, defined as the (lower) language of the resulting FST, has all the strings that encode the IGs and the links.</Paragraph> </Section> <Section position="5" start_page="257" end_page="257" type="sub_section"> <SectionTitle> 4.6 Robust Parsing </SectionTitle> <Paragraph position="0"> It is possible that either because of grammar coverage, or ungrammatical input, a parse with only one unlinked word-final IG may not be found. In such cases Parses above would be empty. One may however opt to accept parses with k > 1 unlinked word-final IGs when there are no parses with < k unlinked word-final IGs (for some small k). This can be achieved by using the lenient composition operator (Karttunen, 1998). Lenient composition, notated as .O., is used with a generator-filter combination.</Paragraph> <Paragraph position="1"> When a generator transducer G is leniently composed with a filter transducer, F, the resulting transducer, G .O. F, has the following behavior when an input is applied: If any of the outputs of G in response to the input string satisfies the filter F, then G .O. F produces just these as output. Otherwise, G .O. F outputs what G outputs.</Paragraph> <Paragraph position="2"> Let Unlinked_i denote a regular expression which accepts parse configurations with less than or equal to i unlinked word-final IGs.
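The generator-filter behavior of lenient composition can be mimicked at the level of output sets; a minimal Python sketch (function and data names are ours, standing in for the FST operation):

```python
# Sketch of lenient composition's behavior (Karttunen, 1998) on output
# sets: keep the generator outputs that satisfy the filter if there are
# any; otherwise fall back to all generator outputs.

def leniently_filter(outputs, passes_filter):
    kept = [o for o in outputs if passes_filter(o)]
    return kept if kept else list(outputs)

parses = ["parse-with-1-unlinked", "parse-with-3-unlinked"]
print(leniently_filter(parses, lambda p: "1-unlinked" in p))
# ['parse-with-1-unlinked']
print(leniently_filter(parses, lambda p: "0-unlinked" in p))
# ['parse-with-1-unlinked', 'parse-with-3-unlinked']
```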
For instance, for i = 2, this would be defined as follows: ~[[$[ "<" LeftChannelSymbols* "(" AnyIG "@" ")" ["0" | "1"]* ">" ]]^>2]; which rejects configurations having more than 2 word-final IGs whose right channel symbols contain only 0s and 1s, i.e., they do not link to some other IG as a dependent.</Paragraph> <Paragraph position="3"> Replacing the line M = M .o. OnlyOneUnlinked with, for instance, M = M .O. Unlinked_1 .O.</Paragraph> <Paragraph position="4"> Unlinked_2 .O. Unlinked_3; will have the parser produce outputs with up to 3 unlinked word-final IGs, when there are no outputs with a smaller number of unlinked word-final IGs. Thus it is possible to recover some of the partial dependency structures when a full dependency structure is not available for some reason. The caveat, however, is that since Unlinked_1 is a very strong constraint, any relaxation would increase the number of outputs substantially.</Paragraph> </Section> </Section> <Section position="6" start_page="257" end_page="258" type="metho"> <SectionTitle> 5 Experiments with dependency parsing of Turkish </SectionTitle> <Paragraph position="0"> Our work to date has mainly consisted of developing and implementing the representation and finite state techniques involved here, along with a non-trivial grammar component. We have tested the resulting system and grammar on a corpus of 50 Turkish sentences, 20 of which were also used for developing and testing the grammar. These sentences had 4 to 24 words with an average of about 12 words.</Paragraph> <Paragraph position="1"> The grammar has two major components. The morphological analyzer is a full coverage analyzer built using XRCE tools, slightly modified to generate outputs as a sequence of IGs for a sequence of words.
When an input sentence (again represented as a transducer denoting a sequence of words) is composed with the morphological analyzer (see the pseudo-code above), a transducer for the chart representing all IGs for all morphological ambiguities (remaining after morphological disambiguation) is generated. The dependency relations are described by a set of about 30 patterns much like the ones exemplified above. The rules are almost all non-lexical, establishing links of the types listed earlier. Conjunctions are handled by linking the left conjunct to the conjunction, and linking the conjunction to the right conjunct (possibly at a different channel). There is an additional set of about 25 finite state constraints that impose various syntactic and configurational constraints. The resulting Parser transducer has 2,707 states and 27,713 transitions, while the SyntacticConstraints transducer has 28,894 states and 302,354 transitions. The combined transducer for morphological analysis and (very limited) disambiguation has 87,475 states and 218,082 arcs.</Paragraph> <Paragraph position="2"> Table 1 presents our results for parsing this set of 50 sentences. The number of iterations also counts the last iteration, where no new links are added. Inspired by Lin's notion of structural complexity (Lin, 1996), measured by the total length of the links in a dependency parse, we ordered the parses of a sentence using this measure. In 32 out of 50 sentences (64%), the correct parse was either the top ranked parse or among the top ranked parses with the same measure. In 13 out of 50 sentences (26%), the correct parse was not among the top ranked parses, but was ranked lower. Since smaller structural complexity requires, for example, verbal adjuncts to attach to the nearest verb wherever possible, topicalization of such items, which brings them to the beginning of the sentence, will generate a long(er) link to the verb (at the end), increasing complexity.
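The ranking by Lin's structural complexity can be sketched as follows; the link-list format (dependent position, head position) is our own illustration:

```python
# Sketch: structural complexity (Lin, 1996) of a dependency parse is the
# total length of its links; candidate parses are ordered by increasing
# complexity, favoring attachments to the nearest possible head.

def structural_complexity(links):
    return sum(abs(h - d) for d, h in links)

def rank_parses(parses):
    # parses: one list of (dependent, head) pairs per candidate parse
    return sorted(parses, key=structural_complexity)

near = [(0, 1), (1, 3), (2, 3)]   # total link length 1 + 2 + 1 = 4
far  = [(0, 3), (1, 3), (2, 3)]   # total link length 3 + 2 + 1 = 6
print([structural_complexity(p) for p in rank_parses([far, near])])
# [4, 6]
```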
In 5 out of 50 sentences (10%), the correct parse was not available among the parses generated, mainly due to grammar coverage. The parses generated in these cases used other (morphological) ambiguities of certain lexical items to arrive at some parse within the confines of the grammar.</Paragraph> <Paragraph position="3"> The finite state transducers compile in about 2 minutes on an Apple Macintosh 250 MHz PowerBook. Parsing takes about a second per iteration, including lookup in the morphological analyzer. With completely (and manually) morphologically disambiguated input, parsing is instantaneous. Figure 4 presents the input and the output of the parser for a sample Turkish sentence. Figure 5 shows the output of the parser processed with a Perl script to provide a more human-consumable presentation:</Paragraph> </Section> </Paper>