Morphological Analyzer as Syntactic Parser
Gábor Prószéky
MorphoLogic
Németvölgyi út 25, Budapest, H-1126 Hungary
h6109pro@ella.hu
Abstract. We describe how a simple parser can be built on
the basis of morphology and a morphological analyzer. Our
initial conditions have been the techniques and principles
of Humor, a reversible, string-based unification tool
(Prószéky 1994). Parsing is performed by the same engine
as morphological analysis. It is useful when there is not
enough space to add a new engine to an existing morphol-
ogy-based application (e.g. a spell-checker), but you would
like to handle sentence-level information as well (e.g. a
grammar checker). The morphological analyzer breaks up
words into several parts, all of which are stored in the main
lexicon. Each part has a feature structure, and the validity of
the input word is checked by unifying them. The mor-
phological analyzer returns various information about a
word, including its categorization. In a sentence, the cate-
gory of each word (or morpheme) is considered a meta-
letter, and the sentence itself can be transformed into a
meta-word that essentially behaves like a real one. Thus the
set of sentences recognized by the parser called HumorESK
can form a lexicon of meta-words that are processed much
the same way as lexicons of real words (morphology). This
means that algorithmic parsing steps are substituted by lexi-
con look-up, which, by definition, is performed following
the surface order of string elements. Both the finitizer that
transforms formal grammars into finite lexicons and the
run-time parser of the proposed model have running im-
plementations.1
1 INTRODUCTION
Lexical entries in a morphology-based system are words.
Because of the similarity, syntactic constructions occurring
as entries in a meta-lexicon can be called meta-words.
Meta-letters, that is, letters of a meta-word, are morpho-
syntactic categories having an internal structure that de-
scribes the syntactic behavior of the entry in higher level con-
structions. The system called HumorESK (Humor Enhanced
with Syntactic Knowledge, where Humor stands for High-
speed Unification Morphology) to be shown here consists
of numerous meta-lexicons. Each of them has a name: the
syntactic category it describes. Categories like S', S, NP,
VP, etc. are described in separate lexicons. Meta-lexicons
form a hierarchy, that is, letters in a meta-lexicon can refer
to other (but only lower level) lexicons. Parsing on each
level, therefore, can be realized as lexical look-up. Neither
backtracking, look-ahead, nor other time-consuming pars-
ing steps are needed in order to get the analysis of a sen-
tence. The only on-line operation is a unifiability check for
each possible lexical entry that matches the sentence in
question.
1 This work was partially supported by the Hungarian National
Scientific Fund (OTKA).
Grammars are compiled into a multi-level pattern
structure. On a lower level, parsing a word results in a
meta-letter, that is, part of a meta-word on a higher level.
Such structures, for example, NP and VP, are meta-letters
coming from lower levels and form a meta-word that can
be parsed as a sentence, because of the existence of a rule S
-> NP VP in the original grammar. A complex sentence
grammar can be broken up into non-recursive grammars
describing smaller grammatical units on different levels.
These grammars are, of course, much simpler than the
original one. Recursive transition networks (RTN) can also
be made according to similar principles, but their recursive
nature cannot be found in our method. In other words: the
output symbol of any level does not occur in the actual or
lower level dictionaries.
The whole lexicon cascade can be generated from arbi-
trary grammars written in any usual (for the time being,
CF, but in the near future any feature-based) formalism.
We call this step grammar learning. The software tool we
have developed for this purpose takes the grammar as input,
creates the largest regular subset of the language it de-
scribes regarding the string-completion limit of Kornai
(1985), then forms a finite pattern structure by depth limit
and length limit from the above regular description.
2 PARSING WITH PATTERNS
Parsers are (computational) tools that read and analyze a
sentence, and return a wide range of information about it,
that is, they
(1) recognize whether the input is a valid sentence (according
to the rules of the object language),
(2) segment the input sentence in as many ways as possible,
and
(3) provide some custom information.
The latter custom information can be a simple 'OK' sign
indicating that the sentence is well-formed (grammar
checker), but it can also be the same sentence in another
language (translation tool), or, in case of a (grammatically)
incorrect sentence, it can be a list of suggestions how it
may be corrected (grammar corrector). In the present im-
plementation we use morpho-syntactic categories as output
information on every level (parser).
For the input sentence
The dog sings.
the English module of Humor returns the following mor-
phological categorization:
The[DET] dog[N] sing[V] +s[3SG] .[END]
Let us now strip off the actual words from the morphologi-
cal information (from now on we call them morphological
codes or morph-codes). Writing only the morph-codes, we
get
DET N V 3SG END.
The problem is now how we recognize this as a sentence.
This sequence must somehow be stored in another lexicon
describing phrases and phrase structures. It is quite clear
that in the above string, DET, N, and V are simple symbols
that can easily be encoded as single letters like d, n, v, x
and e. Transforming the sequence of morph-codes we get
the word dnvxe. Earlier we said that the Humor engine is
lexicon-independent, so if we have another lexicon, we can
easily switch to it and instruct Humor to analyze the actual
word. Humor returns something like dnvxe[S] where 'S' is
now the category of the input word indicating that it is a
sentence.
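This look-up step can be sketched in a few lines of Python (a minimal illustration; the plain dictionary and the function names are ours, standing in for Humor's compressed lexicon and its engine call):

```python
# Map morph-codes to single meta-letters, as described in the text.
CODE_TO_LETTER = {"DET": "d", "N": "n", "V": "v", "3SG": "x", "END": "e"}

# A toy sentence-level lexicon: in HumorESK this would be a compressed
# Humor lexicon; here a plain dict of accepted meta-words suffices.
SENTENCE_LEXICON = {"dnvxe": "S"}

def encode(morph_codes):
    """Turn a sequence of morph-codes into a meta-word."""
    return "".join(CODE_TO_LETTER[c] for c in morph_codes)

def lookup(meta_word):
    """Lexicon look-up in place of algorithmic parsing steps."""
    cat = SENTENCE_LEXICON.get(meta_word)
    return f"{meta_word}[{cat}]" if cat else None

print(lookup(encode(["DET", "N", "V", "3SG", "END"])))  # dnvxe[S]
```

The point of the sketch is that recognition is a single dictionary access in surface order, with no backtracking.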
The meta-level, of course, can be split up into further lev-
els. Let us use, for the sake of simplicity, a simple toy
grammar of two levels for the nominal phrase and the sen-
tence:
(Level 2) S -> NP S, S -> S NP, S -> V (3SG)
(Level 1) NP -> DET NG, NG -> ADJ NG, NG -> N
Now we feel a need for a tool that generates a set of finite
patterns out of this grammar description. We, therefore,
developed a tool that finds the largest regular subset of a
context-free language (regarding a special parameter set)
and then uses a recursive generator to produce the finite
patterns. For the above toy grammar a possible lexicon can
be the following:
(Level 2): V END, V 3SG END, NP V END, NP V 3SG END, NP
V NP END, NP V 3SG NP END, V NP END, V 3SG NP END, ...
(Level 1): DET N, DET ADJ N, DET ADJ ADJ N, ...
If we use letters v, m, x, n, a and d for V, NP, 3SG, N,
ADJ and DET, respectively, we get the following lexicons:
(Level S) ve, vxe, mve, mvxe, mvme, mvxme, vme, vxme, ...
(Level NP) dn, dan, daan, ...
If the appropriate lexicons are built from the pattern lists
for grammars of both levels, the parser is ready to run. The
parsing algorithm can be outlined as follows. The parser
runs a morphological analysis on each word in the input
sentence and encodes the morph-codes into meta-letters.
Using our example, The dog sings (DET N V 3SG
END), the parser will find that the string 'DET N' forms a
noun phrase, because dn can be found in the NP lexicon.
The meta-morphological analysis (a search in the lexicon
of the patterns of Level 1) returns dn[m], that is, DET N
[NP]. For Level 2, the parser exchanges the substring 'DET
N' with the meta-letter 'NP'. So the new meta-word is mvxe,
that is, 'NP V 3SG END', which is accepted by the Level 2
grammar (sentences). In fact, we have another meta-word
here, namely, a single n (= 'N') that can also be categorized
as a noun phrase (m); and this yields dmvxe, that is, 'DET
NP V 3SG END', which is not accepted by the Level 2 grammar.
Giving these two as input to the Level 2 meta-
morphological analysis, the system will reject dmvxe 'DET
NP V 3SG END' but will accept mvxe 'NP V 3SG END' by returning
mvxe[S], that is, NP V 3SG END [S].
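The cascaded look-up just described can be sketched as follows (a toy illustration of the technique, with plain Python sets in place of compressed Humor lexicons; the real system records candidates in a parsing graph rather than rewriting strings):

```python
# Two-level cascaded look-up for the toy grammar above. Level 1 maps
# NP patterns to the meta-letter 'm'; Level 2 accepts sentence patterns.
NP_LEXICON = {"dn", "dan", "daan", "n"}   # a bare noun is also an NP
S_LEXICON = {"ve", "vxe", "mve", "mvxe", "mvme", "mvxme", "vme", "vxme"}

def level1_candidates(word):
    """Replace every NP-lexicon substring with the meta-letter 'm'."""
    out = set()
    for i in range(len(word)):
        for j in range(i + 1, len(word) + 1):
            if word[i:j] in NP_LEXICON:
                out.add(word[:i] + "m" + word[j:])
    return out

def parse(word):
    """Accept if some Level-1 rewriting is a Level-2 sentence pattern."""
    return sorted(c for c in level1_candidates(word) if c in S_LEXICON)

print(parse("dnvxe"))  # ['mvxe'] (dmvxe is also generated but rejected)
```

For the meta-word dnvxe ('The dog sings.'), both rewritings mvxe and dmvxe are produced, and only mvxe survives the Level-2 look-up, mirroring the walkthrough above.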
It is clear that no backtracking is possible in our run-
time system, that is, a meta-word cannot be categorized by
a symbol that is a meta-letter of meta-words on the same or
lower level. It is an important restriction: category symbols
must be meta-letters used only on higher levels. This con-
straint provides us with another advantage: any set of cate-
gory symbols (higher level meta-letters or meta-morph-
codes) is disjoint from the set of lower level meta-letters
(or meta-letters used on the level of morphology); there-
fore, parsing lexicons can be unified: meta-words
(morphological or any set of phrase structure patterns) for
all levels can be stored in a single lexicon.
In the explanation of the parsing techniques we have ex-
cluded one aspect until this point, and this is unification.
Without feature structures and unification, however, nu-
merous incorrectly formed sentences are accepted by the
parser. If a meta-word is not found, it is rejected and the
process goes on to the next meta-word. If the meta-word is
found, then it may still be incorrect. This is checked
through the unifiability-checking of the feature structures
of its meta-letters. For instance, in a noun phrase 'DET N',
the unifiability of the feature structures assigned to DET
and N is checked. If they are not unifiable, the meta-word
is rejected and the process goes on to the next meta-word.
If they are unifiable, the output is passed on to the next
level. The last level is responsible for providing the user
with the proper analysis, that is, all the information col-
lected so far.
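A minimal sketch of such a unifiability check, assuming flat attribute-value pairs (the actual Humor feature structures are richer and string-based, so the feature names here are purely illustrative):

```python
# Two flat feature structures are unifiable when they agree on every
# feature they share; features present in only one of them are ignored.
def unifiable(fs1, fs2):
    return all(fs1[f] == fs2[f] for f in fs1.keys() & fs2.keys())

det = {"num": "sg", "def": "+"}    # e.g. English 'this'
n_sg = {"num": "sg", "cat": "N"}
n_pl = {"num": "pl", "cat": "N"}

print(unifiable(det, n_sg))  # True  -> 'this dog' passes
print(unifiable(det, n_pl))  # False -> 'this dogs' is rejected
```

This is the only on-line operation mentioned in the Introduction: a found meta-word is kept only if all its meta-letters' feature structures pass this check.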
3 FROM GRAMMARS TO LEXICAL
PATTERNS
All infinite structures generated by recursion can be re-
stricted by limiting the recursion depth. This means a con-
straint on the depth of the derivation tree of a sentence in a
language. We can also restrict the direction of branching in
the derivation tree. This means that we could generate
(finite) patterns directly from the original (context-free)
language imposing various limits on embedding; but these
methods can be too weak or too strong and, most of all, ir-
relevant to the object language. There is, however, a
slighter constraint that helps transforming context-free
grammars. According to Kornai's hypothesis (Kornai
1985), any string that can be the beginning of a grammati-
cal string can be completed with k or fewer terminal symbols,
where k is a small integer. This k is called the string com-
pletion limit (SCL). A grammar transformation device can
be instructed to discard sentence beginnings that have a
minimal SCL larger than specified (by the user). SCL lim-
its center-embedding but allows arbitrarily deep right-
branching structures (easily defined by right regular gram-
mars). Left branching is also limited, but this limitation is
less pronounced than that of center-embedding.
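Under Kornai's definition, the SCL can be read off a finite-state approximation of the grammar: compute, for each state, the length of the shortest completion leading to an accepting state, and take the longest of these shortest completions. A hypothetical sketch (not the actual GRAM2LEX implementation):

```python
from collections import deque

def scl(transitions, accepting):
    """transitions: {state: {terminal: next_state}}."""
    # Reverse BFS from the accepting states: dist[s] is the minimal
    # number of terminals still needed to finish a grammatical string.
    preds = {}
    for s, arcs in transitions.items():
        for t in arcs.values():
            preds.setdefault(t, []).append(s)
    dist = {a: 0 for a in accepting}
    queue = deque(accepting)
    while queue:
        t = queue.popleft()
        for s in preds.get(t, []):
            if s not in dist:
                dist[s] = dist[t] + 1
                queue.append(s)
    return max(dist.values())

# A right-branching NP loop, DET (N P DET)* N, flattened to an FSA:
fsa = {0: {"DET": 1}, 1: {"N": 2}, 2: {"P": 0}}
print(scl(fsa, {2}))  # 2: at worst, 'DET N' completes the string
```

Note how the loop through P keeps the SCL at 2 no matter how deep the right-branching goes, which is exactly why the constraint tolerates such structures.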
Our special tool, GRAM2LEX, takes a CF grammar as
input. As a first step, it reads the grammar and creates the
appropriate RTNs from it. Goldberg and Kálmán (1992)
describe an algorithm unifying recursive transition net-
works. We have improved their algorithm, and its implementa-
tion is incorporated into the GRAM2LEX tool as a second
processing phase. The algorithm creates the largest regular
subset of a context-free language that respects the SCL. In
terms of finite state automata, the SCL is the number of
branches in the longest path from a non-accepting state to
an accepting one (regarding all such paths). The process re-
sults in a finite state automaton. In order to get a finite de-
scription from this FSA, we introduced two independent pa-
rameters. The first is the maximum length of the output string
(in terms of terminal symbols): if the current string reaches the
maximum length, the recursion is cut and the process immediately
tracks back a level. The second is the maximum number of passes
through the same branch during the generation of an output
string. In the current implementation, this
maximum is global to a whole output string. There is, how-
ever, another approach: this number can be related to the
current recursion level, so that if a certain iteration occurs at
more than one position in a sentence, the maximum length
of the iteration is the same at both positions but the actual
lengths are independent.
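The two limits can be illustrated with a small recursive generator over an FSA (a sketch under our own parameter names, not GRAM2LEX's actual interface; as in the current implementation, the branch-pass maximum here is global to a whole output string):

```python
# Enumerate finite patterns from an FSA under a maximum output length
# and a maximum number of passes through the same branch.
def patterns(transitions, start, accepting, max_len=4, max_passes=1):
    results = []
    def walk(state, string, branch_counts):
        if state in accepting and string:
            results.append(" ".join(string))
        if len(string) >= max_len:
            return                      # length limit: cut and backtrack
        for sym, target in transitions.get(state, {}).items():
            branch = (state, sym)
            if branch_counts.get(branch, 0) >= max_passes:
                continue                # branch-pass limit (global here)
            counts = dict(branch_counts)
            counts[branch] = counts.get(branch, 0) + 1
            walk(target, string + [sym], counts)
    walk(start, [], {})
    return results

# The Level-1 NP grammar flattened to an FSA with an ADJ loop:
np_fsa = {0: {"DET": 1}, 1: {"ADJ": 1, "N": 2}}
print(sorted(patterns(np_fsa, 0, {2}, max_len=4, max_passes=2)))
# ['DET ADJ ADJ N', 'DET ADJ N', 'DET N']
```

With two passes allowed through the ADJ loop, the generator produces exactly the Level-1 pattern list shown in Section 2 (DET N, DET ADJ N, DET ADJ ADJ N).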
The GRAM2LEX tool takes all three parameters (the
SCL, the maximum string length and the maximum itera-
tion length) as user-defined ones. The set of finite patterns
can be compiled into a compressed lexicon with MorphoL-
ogic's lexicon compiler. The GRAM2LEX tool produces a
file in the input format required by this compiler.
Levels of the parser are individual processes that com-
municate with each other. The most important medium is
the internal parsing table that represents the parsing graph
described below. Based on that graph, the process of a par-
ticular level is able to execute its main functional modules,
namely
• create the appropriate input to call the morphology
engine,
• switch to the phrase pattern lexicon of the current
level,
• run the morphology engine and process its output,
and
• if possible, insert new branches into the parsing graph
for the next level.
Each level is an independent process communicating
with the others (including level 0, the morphological analy-
sis). The medium of communication is the parsing graph,
of which there is only one copy, and it is generally accessed by
all levels. The parsing process on each level can be decom-
posed into three layers. All levels have the same function-
ality; it is only the internal operation of the first layer that
differs in the case of the lowest level (morphology) and the
highest one (sentences):
• pre-process that, based on the current structure of the
parsing graph (if it exists), produces the set of the pos-
sible phrase structures,
• search that checks all the elements of the set generated
by Layer 1 to see if they are acceptable by the current level,
using the Humor engine equipped with the current level's
parsing lexicon,
• post-process that, based on the patterns accepted by
Layer 2, inserts new nodes and branches into the pars-
ing graph.
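The three layers of a single level can be sketched over a simple span-based parsing graph (an illustrative reconstruction; edge and function names are ours, plain sets stand in for the internal parsing table, and the graph is assumed acyclic):

```python
# Edges of the parsing graph are (start, end, label) spans over the
# sentence; one level runs its three layers and returns the new graph.
def run_level(graph, lexicon, n_positions):
    # Layer 1 (pre-process): enumerate label sequences along graph paths.
    def paths(pos, labels):
        for s, e, lab in graph:
            if s == pos:
                yield e, labels + [lab]
                yield from paths(e, labels + [lab])
    candidates = {(i, end, tuple(labs))
                  for i in range(n_positions)
                  for end, labs in paths(i, [])}
    # Layer 2 (search): keep candidates found in this level's lexicon.
    accepted = {(s, e, lexicon[" ".join(labs)])
                for s, e, labs in candidates
                if " ".join(labs) in lexicon}
    # Layer 3 (post-process): insert accepted spans as new edges.
    return graph | accepted

# Level 0 output for 'The dog sings .' (simplified) and an NP lexicon:
morph = {(0, 1, "DET"), (1, 2, "N"), (2, 3, "V"), (3, 4, "END")}
np_lexicon = {"DET N": "NP", "N": "NP"}
graph = run_level(morph, np_lexicon, 5)
print(sorted(graph - morph))  # [(0, 2, 'NP'), (1, 2, 'NP')]
```

Both NP readings (DET N and the bare N) are inserted as new edges, so the next level can consider each of them, just as the dnvxe walkthrough in Section 2 did.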
The different levels are connected to each other like the
layers of a single level. The structure of our present
(demonstrational) 0-1-2-level parser for Hungarian is the
following:
• Morphology (Preprocess Words, Search Morphology
Lexicon, Create/Modify Parsing Graph),
• Noun Phrases (Create Patterns, Search Level 1 Pattern
Lexicon, Modify Parsing Graph),
• Sentences (Create Patterns, Search Level 2 Pattern
Lexicon, Modify Parsing Graph).
4 IMPLEMENTING THE RUN-TIME
PARSER
In the current implementation, the parsing levels are exe-
cuted sequentially, but they can be made concurrent: dur-
ing one session, level 0 reads a word from the input sen-
tence, analyzes it and inserts the appropriate nodes and
branches into the parsing graph. Further on, the system has
a self-driving structure: the level that made changes to the
parsing graph sends an indication to the next level, which
then starts the same processing phase. The changes in the
parsing graph are thus spread upwards in the level struc-
ture. When the last level (usually the highest) has finished up-
dating the graph, it sends a 'ready for next' signal to level 0,
which starts the next session.
Termination is controlled by level 0: when it has finished ana-
lyzing the last word (morpheme) of the sentence, it sends a
'terminate' signal to the next level. Receiving this signal,
intermediate levels pass it on to the next level after finishing
processing the changes that were made to the parsing
graph. The last level (usually the highest) then terminates
all levels and passes the parsing graph to the output gen-
erator.
Let us see an example:
Patterns: S: NP VP END
    NP: N | N N | DET N | DET ADJ N |
    DET ADJ ADJ N
    VP: V | V 3SG | V NP | BE VING | BE VING
    ADV | V NP
    END: . | !
Input: Professor Smith is coming home.
Output: S -> [NP VP END]
    NP -> [N N]
    N -> Professor[N]
    N -> Smith[PROP]
    VP -> [BE VING ADV]
    BE -> is[BE]
    VING -> come[V] +ing[ING]
    ADV -> home[ADV]
    END -> .
This is the inherent tagging of the sentence built from the
information stored directly in the phrase structure patterns.
We have begun, however, the development of another type
of tagging where phrases correspond to the source gram-
mars' non-terminal symbols, like this:
(S
  (NP
    (N professor)
    (N Smith))
  (VP
    (BE is)
    (VG
      (VING
        (V come)
        (ING ing))
      (ADV home))))
The current average speed of this multi-level system
(even for dictionaries with 100,000 entries) is around 50
inputs/sec for each module on a Pentium/75 machine, where
input can mean either a sentence or a phrase or a word to be
analyzed.
5 USER INTERFACE
The current implementation of the HumorESK parser allows
the run-time expansion of the user-defined lexicon file.
This was achieved by developing a small user interface that
performs the following functions:
• Works in both batch and interactive mode.
• Users can review all the different taggings of a sentence.
• Users can view the internal parsing table from which the
parser output was generated. This means the review of
the analysis of each morpheme and the meta-words gen-
erated from them.
• Users can view both the morpho-lexical and the syntacti-
cal part of the user-defined lexicons.
• Users can add new entries to the user-defined lexicon
file on any level. The changes take effect immediately, that
is, when processing the next sentence or re-parsing the
last one.
6 CONCLUSION
We have developed a parser called HumorESK that is quite
powerful (even in its present form, without feature struc-
tures) and has several important features:
1. unified processing method for every linguistic level
2. possible parallel processing of the levels (morphology,
phrase levels, sentence level, etc.)
3. morphological, phrasal and syntactic lexicons can be en-
hanced, even at run-time
4. easy handling of unknown elements (with re-analysis)
5. easy correction of grammatical errors
6. reversibility (generation with 'synthesis by analysis')
7. the same system can be used both for corpus tagging and
fine-grained parsing
Feature 1 seems important if there is not enough space
to add a new engine to an existing morphology-based
application (e.g. a spell-checker), but you must handle
sentence-level information as well (e.g. a grammar
checker). Real parallelism, indicated in 2, has not yet been
implemented. The usefulness of attributes 3-6 is going to be
proven in practice, because we have just finished the first
version of the first Hungarian grammar checker, called Hely-
esebb. It uses the spelling corrector and morphological ana-
lyzer/generator modules relying on the Humor morphologi-
cal system (the basis of HumorESK) that are widely used by
tens of thousands of both professional and non-
professional end-users (Prószéky 1994, Prószéky et al.
1994). We have results in proving the first part of feature 7,
namely corpus tagging. Fine-grained parsing would need
the extended use of features. This system, as we men-
tioned earlier, is under development.
7 REFERENCES
[1] Goldberg, J. and L. Kálmán, 'The First BUG Report',
Proceedings of COLING-92, Nantes (1992).
[2] Kis, B., 'Parsing Based on Morphology', Unpublished
Master's Thesis, Budapest Technical University (1995).
[3] Kornai, A., 'Natural Languages and the Chomsky Hier-
archy', Proceedings of the 2nd Conf. of the EACL, Ge-
neva, 1-7 (1985).
[4] Prószéky, G., 'Industrial Applications of Unification
Morphology', Proceedings of ANLP-94, Stuttgart
(1994).
[5] Prószéky, G., M. Pál and L. Tihanyi, 'Humor-based Ap-
plications', Proceedings of COLING-94, Kyoto (1994).
A Duna után a Tisza a legnagyobb folyónk.
[0:00:02.9] HumorESK 2.0 REVIEW
SS -> [DP Cas DP DP End]
DP -> [Det N]
Det -> a[Article]
N -> Duna[ProperNoun]
Cas -> után[PostPosition]
DP -> [Det N]
Det -> a[Article]
N -> Tisza[Noun]
DP -> [Det Adj\ Adj \Adj N PSfx]
Det -> a[Article]
Adj\ -> leg[Superlative]
Adj -> +nagy[Adjective]
\Adj -> +obb[Comparative]
N -> folyó[Noun]
PSfx -> +nk[PersSuffPlurFirst]
End -> .
First[^HOME] Last[^END] [G]o to Syntax exceptions[Alt+S]
[P]arse again Accept[ENTER] [R]eject Word exceptions[Alt+W]
Exit[ESC] Internal parsing table[F10]
Figure 1. HumorESK analysis of the sentence
"A Duna után a Tisza a legnagyobb folyónk."
(After the Danube, our biggest river is the Tisza.)
