<?xml version="1.0" standalone="yes"?> <Paper uid="J00-1006"> <Title>Multitiered Nonlinear Morphology Using Multitape Finite Automata: A Case Study on Syriac and Arabic</Title> <Section position="11" start_page="13121" end_page="13121" type="ackno"> <SectionTitle> Acknowledgments </SectionTitle> <Paragraph position="0"> This research was supported by a St. John's Benefactors' scholarship and was carried out under the supervision of Dr. Stephen Pulman (University of Cambridge). Thanks are due to the Master and Fellows of St. John's College and the Computer Laboratory, Cambridge, for various grants. Much revision and enhancement took place at Bell Laboratories. Comments by the anonymous reviewers helped in reshaping the presentation of this paper. Ken Beesley kindly made a forthcoming paper available to me and answered many questions. Martin Jansche performed the test detailed in Appendix B.2 and provided comments. Christine Nakatani provided many useful editorial comments.</Paragraph> <Paragraph position="1"> The algorithms in Section 4 were implemented by the author in SICStus Prolog using a finite-state library provided by E. Grimley-Evans. The library allows the creation, manipulation, and destruction of multitape finite-state machines with an easy algebraic interface to n-way regular expressions. The Prolog term regexp_to_fsa(+RegExp, ?Automaton) constructs the machine Automaton for the extended regular expression RegExp; e.g., expression 3 (Section 4.2.1) is turned into a four-tape machine with the expression regexp_to_fsa(t(c,c,c,c)^([t(k,c,k,0), ...]^t(c,c,c,c))*, Centers). The predicate t(+terminal) denotes a terminal tuple, infix ^ denotes concatenation, postfix * denotes Kleene star, and a list denotes union over its elements. Primitive Prolog procedures (e.g., union/3, kleene/2, etc.)
are also provided (Kiraz and Grimley-Evans 1998).</Paragraph> <Paragraph position="2"> The algorithms given in Section 4 are closely followed. First, the lexical compiler is invoked, creating a multitape machine for each sublexicon and then combining the machines with the cross product operator. The rule compiler then compiles all the rules into another machine. The entire language description is then created with the operation:</Paragraph> <Paragraph position="4"> where π_s denotes the set of all surface symbols. The first component maps the lexicon to all surface symbols. The intersection leaves Language with the valid expressions only. (One can also use composition instead of intersection, depending on the nature of the rules.) There is some room to enhance the implementation, especially in its time and memory overhead. While the finite-state library is capable of handling automata for small-scale grammars, larger grammars would stretch its capabilities. The library was successfully used, however, to implement the algorithms in the current work as well as those presented by Grimley-Evans (1997) and Kiraz (1997a). It must be stressed that the finite-state calculus library and the rule and lexicon compilers are prototype implementations written for research purposes. An interpreter version of the work presented here was described earlier (Kiraz 1996). The interpreter works on rules directly and can handle larger grammars.</Paragraph> <Paragraph position="5"> Appendix B: Beesley's Bracketing Algorithm vs. Kiraz's Multitape Algorithm B.1 Using Bell Labs' Lextools This test was performed using the Lextools lexical compiler. Each line in the input file is an extended regular expression. The compiler turns each line into an FSA, then takes the union of all FSAs. The file may contain a header, enclosed between two XXs, for alias definitions.
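The paper's implementation is in SICStus Prolog; purely as an illustration of the intersection step described above (restricting the lexicon's expressions to those the rules license), the following is a minimal Python sketch using the textbook product construction. The DFA encoding and all names here are invented for the sketch and are not the paper's library API:

```python
from itertools import product

# A DFA is a triple (delta, start, finals), where delta maps
# (state, symbol) -> state. (Representation invented for this sketch.)

def intersect(d1, d2):
    """Product construction: the result accepts exactly the strings
    accepted by BOTH input DFAs."""
    delta1, s1, f1 = d1
    delta2, s2, f2 = d2
    delta = {}
    for (q1, a), r1 in delta1.items():
        for (q2, b), r2 in delta2.items():
            if a == b:  # the two machines must read the same symbol
                delta[((q1, q2), a)] = (r1, r2)
    return delta, (s1, s2), set(product(f1, f2))

def accepts(dfa, word):
    delta, q, finals = dfa
    for a in word:
        if (q, a) not in delta:
            return False
        q = delta[(q, a)]
    return q in finals

# Toy "lexicon": strings over {a, b} that end in a.
lex = ({(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 1, (1, 'b'): 0}, 0, {1})
# Toy "rules" filter: strings of even length.
rules = ({(0, 'a'): 1, (0, 'b'): 1, (1, 'a'): 0, (1, 'b'): 0}, 0, {0})

language = intersect(lex, rules)
```

Only strings passing both machines survive, mirroring how the intersection leaves Language with the valid expressions only; when the rules are transducers rather than acceptors, composition plays the analogous role.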
In Beesley's case, the following source file was used (only two roots and two vocalisms are shown):</Paragraph> <Paragraph position="7"> Each line in the header consists of $(name), followed by =, followed by an extended regular expression. The compiler builds an FSA for the expression and stores it under name. In regular expressions, curly brackets are used to denote multicharacter symbols (e.g., {Sigma}) or special symbols (e.g., < and >, which otherwise are used for weights in weighted FSAs). The operators are: - for subtraction, | for union, ~ for insert or ignore, & for intersection, and $(name) for reference to a machine already defined in the header.</Paragraph> <Paragraph position="8"> The implementation of the Lextools insert operator was modified to provide for a fair representation of Beesley's method. The initial runs took quite some time (many hours for 100 roots!) since the insert operator was initially implemented by the expression Range(A ∘ (σ ∪ ε:B)*) for inserting B into A (Kaplan and Kay 1994). The new algorithm inserts B directly into A by iterating over states and adding new states and arcs. Additionally, since Beesley's algorithm makes heavy use of alias definitions, access to them was enhanced by applying a binary search mechanism rather than Lextools's original linear search.</Paragraph> <Paragraph position="9"> For Kiraz's method, two source files were used. The root file is simply a list of all roots, e.g.,</Paragraph> <Paragraph position="11"> and the pattern file is a list of all the patterns. The two FSAs generated from the two files were fed into a regular expression compiler to compute expression (2) (Section 4.1).</Paragraph> <Paragraph position="12"> Table 3 (Section 6.2) shows the times spent on compiling up to 3,000 roots with the 24 patterns described in (Beesley, forthcoming). The test was performed on a Unix system with a 180 MHz MIPS R5000 processor and 64 MB of memory.
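The direct insertion algorithm mentioned above, which splices B into A by iterating over A's states instead of computing Range(A ∘ (σ ∪ ε:B)*), can be sketched as follows. This is a Python illustration with an invented epsilon-NFA encoding, not the modified Lextools code:

```python
from itertools import count

# An epsilon-NFA is (trans, start, finals): trans maps a state to a list
# of (symbol, next_state) arcs, with symbol None meaning epsilon.
# (Representation invented for this sketch.)

def states_of(nfa):
    trans, start, finals = nfa
    qs = set(trans) | {start} | set(finals)
    for arcs in trans.values():
        qs |= {t for _, t in arcs}
    return qs

def insert_freely(a, b):
    """Accept A's strings with any number of B's strings inserted at any
    position, built directly by iterating over A's states and splicing a
    fresh copy of B at each one."""
    trans = {q: list(arcs) for q, arcs in a[0].items()}
    fresh = count(10**6)  # state ids disjoint from A's
    for q in states_of(a):
        ren = {s: next(fresh) for s in states_of(b)}  # private copy of B
        for s, arcs in b[0].items():
            trans.setdefault(ren[s], []).extend((sym, ren[t]) for sym, t in arcs)
        trans.setdefault(q, []).append((None, ren[b[1]]))   # detour into B...
        for f in b[2]:
            trans.setdefault(ren[f], []).append((None, q))  # ...and back to q
    return trans, a[1], a[2]

def eclose(trans, qs):
    stack, seen = list(qs), set(qs)
    while stack:
        for sym, t in trans.get(stack.pop(), []):
            if sym is None and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def accepts(nfa, word):
    trans, start, finals = nfa
    cur = eclose(trans, {start})
    for a in word:
        cur = eclose(trans, {t for q in cur for sym, t in trans.get(q, [])
                             if sym == a})
    return bool(cur & finals)

root = ({0: [('k', 1)], 1: [('t', 2)], 2: [('b', 3)]}, 0, {3})  # "ktb"
vowel = ({0: [('a', 1)]}, 0, {1})                               # "a"
padded = insert_freely(root, vowel)
```

Each A-state gets its own copy of B reachable and returnable by epsilon arcs, so e.g. the vocalism symbol may appear between any two root consonants; the state-count grows linearly in |A|·|B| rather than going through composition and range extraction.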
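The alias-lookup speedup described above is a plain swap of linear search for binary search over a sorted table; a small Python sketch follows (the alias names and definitions are invented for illustration, not Lextools's actual alias table):

```python
import bisect

# Hypothetical alias table, kept sorted by name so lookups can use
# binary search instead of a linear scan.
aliases = sorted([("pattern", "[{C},u,{C},i,{C}]"),
                  ("root", "[k,t,b]"),
                  ("vowel", "[a|u|i]")])
names = [name for name, _ in aliases]

def lookup(name):
    """O(log n) alias lookup via bisect, standing in for the original
    O(n) linear search."""
    i = bisect.bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return aliases[i][1]
    return None
```

With many alias references per rule, as in Beesley's grammar, the per-lookup saving compounds over the whole compilation.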
A caveat is required here: to the disadvantage of Kiraz, the two FSAs resulting from the root and vocalism files are saved on disk, then loaded again by the regular expression compiler to compute their cross product, adding unnecessarily expensive I/O operations. Otherwise, the results would be even more in favor of the multitiered model.</Paragraph> <Paragraph position="13"> B.2 Using van Noord's FSA Utilities (by Martin Jansche) An independent comparison between the approach described here and the one in (Beesley, forthcoming) was carried out using van Noord's FSA Utilities (van Noord 1997) (we use its notation for regular expressions throughout this section). We compared the time spent on compiling lexica of various sizes for both frameworks. Given a set of roots and a set of patterns, each compiled into a finite automaton as outlined in Section 4.1, the key difference between the work of Kiraz and that of Beesley is that the former uses the cross product operation to compile the full lexicon, while the latter uses intersection. Since these two operations are very similar, one would not expect much of a difference if the elements in each sublexicon were the same. However, the different formal renderings of roots and patterns in the two approaches result in a significant difference.</Paragraph> <Paragraph position="14">
                 Beesley (lexicon by intersection)         Kiraz (lexicon by cross product)
root, e.g.       ignore([<,k,>,<,t,>,<,b,>], ? - {<,>})    [k,t,b]
pattern, e.g.    [<,?,>,u,<,?,>,i,<,?,>]                   ['C',u,'C',i,'C']
Using Beesley's approach directly made the task of compiling the root lexicon intractable for more than 10 roots. After modifications to Beesley's version--such as intersecting the disjunction of roots with the disjunction of patterns once (we realize that since roots do not apply to all patterns, but to lexically defined subsets, applying intersection once does not work in practice; it was applied here to enhance the performance of Beesley's algorithm),</Paragraph> <Paragraph position="15"> delimiting root consonants with one special symbol only, and inserting other symbols only between root consonants rather than at arbitrary positions, so that a typical root now has the shape [sym*,<,k,sym*,<,t,sym*,<,b,sym*], where sym expands to (? - <)--it became tractable, but was still much slower than the alternative discussed here. Both approaches were implemented as Prolog code and macros on top of the FSA Utilities. The two implementations share common code that contains, among other things, a database of roots and patterns in the form pattern(['C',a,'C',a,'C']).</Paragraph> <Paragraph position="16"> pattern(['C',u,'C',i,'C']).</Paragraph> <Paragraph position="17"> %% etc.</Paragraph> <Paragraph position="18"> root(k,t,b).</Paragraph> <Paragraph position="19"> root(p,n,q).</Paragraph> <Paragraph position="20"> root([sym*,<,X,sym*,<,Y,sym*,<,Z,sym*]) :- root(X,Y,Z).</Paragraph> <Paragraph position="21"> macro(sym, ? - <). %% Beesley's "nonRoot" There is a common macro sublexicon(Pred,N) that calls a metapredicate like findall/3 to find the first N solutions to a call to Pred(W) and collects all the Ws thus obtained into a big disjunction. The compilation of the lexicon is then very simple:</Paragraph> </Section> </Paper>