FRAGMENTATION AND PART OF SPEECH DISAMBIGUATION l 
Jean-Louis Binot 
BTM. 
Kwikstraat, 4 
B3078 Everberg, Belgium 
ABSTRACT 
That at least some syntax is necessary to support semantic process- 
ing is fairly obvious. To know exactly how much syntax is needed, 
however, and how and when to apply it, is still an open and crucial, 
albeit old, question. This paper discusses the solutions used in a 
semantic analyser of French called SABA, developed at the Uni- 
versity of Liege, Belgium. Specifically, we shall argue in favor of the 
usefulness of two syntactic processes: fragmentation, which can he 
interleaved with semantic processing, and part-of-speech 
disambiguation, which can be performed as a preprocesslng step. 
1. Introduction 
The role of syntax is one of these issues in natural language proc- 
essing which, albeit old and often (hotly) debated, have yet to re- 
ceive a definitive answer. (Lytinen 86) distinguishes two approaches 
to NI, processing. Followers of the "modular" approach believe 
usually in the autonomy of syntax and in the usefulness and cost- 
effectiveness of a purely syntactic stage of processing. Results of this 
approach include the development of new grammatical formalisms 
(Weir et al. 86) (Ristad 86), and of large syntactic grammars (Jensen 
et al. 86). 
Followers of the "integrated" approach, on the contrary, believe 
that semantics should be used as soon as possible in the parsing 
process. An "integrated" system would have no discemable stages 
of parsing, and would build directly a meaning representation with- 
out building an intermediate syntactic structure, tlow much syntax 
is needed to support this semantic processing, however, and how 
should the integration between syntax and semantics be done are 
still open and crucial questions. Some integrated systems, such as 
IPP (Schank et al. 80) and Wilks" Preference Semantics system 
(Wilks 75), were trying to reduce the role of syntax as much as 
possible. Lytinen proposes a more moderate option in which sepa- 
rate syntactic and semantic rules are dynamically combined at 
parsing time. Another kind of integration is used in (Boguraev 79), 
where an ATN is combined with Wilks' style semantic procedures. 
And, lastly, one might consider that unification-based grammars 
(Shieber 86) offer yet another approach where syntactic and se- 
mantic constraints can be specified simultaneously in functional 
structures and satisfied in parallel. 
The research presented in this paper was entirely performed while the 
author was working at the Computer Sciences department of University of Liege, Belgium. 
In this paper, we wish to present our arguments in favors of in- 
tegration, and then to discuss two specific technical proposals. Our 
general position can be stated as follows: 
1. That at least some form of syntax is necessary for natural lan- 
guage processing should be by now fairly obvious and should 
need no further argumentation. 
2. Syntax, however, is not a goal per se. The basic goal of NLP, 
at least from the point of view of AI, is to give a computer a 
way to "understand" natural language input, and this dearly 
requires a semantic component. The utility or necessity of 
syntax should only be evaluated in the light of the help it can 
provide to this semantic component. 
3. Grammaticality is not an essential issue, except in language 
generation and in specific applications like CRITIQUE (Jensen 
et al. 86), where the purpose is to correct the syntax and the 
style of a writer. For the general task of understanding, 
achieving comprehension, even in the face of incorrect or unu- 
sual input, is more important than enforcing some grammatical 
standards. And we believe that robustness is more easily 
achieved in the context of a semantic system than in the pre- 
dictive paradigm of the grammatical approach. 
If we want to avoid the use of a full male grammar, the syntactic 
processes necessary to support the semantic module must be im- 
plemented by special dedicated procedures. This paper describe the 
.solutions used in a semantic analyser of French called SABA, de- 
veloped at the Computer Sciences department of University of 
Liege, Belgium. Specifically, we shall argue in favor of two syntactic 
processes: fragmentation, which can be interleaved with semantic 
processing, and part of speech disambiguation, which is usefully 
performed in a preprocessing step. We shall start by a brief de- 
scription of the SABA system. 
2. Overview of the SABA system. 
SABA ("Semantic Analyser, Backward Approach", (Binot, 1985), 
(Blnot et al., 1986)) is a robust and portable semantic parser of 
written French sentences. A prototype of this parser is running in 
MACLISP and in ZETAI.1SP; it has been tested successfully on a 
corpus of about 125 French sentences. This system is not based 
on a French grammar, but on semantic procedures which, with 
some syntactic support, build directly a semantic dependency graph 
from the natural language input. The following example is typical 
of the level of complexity that can be handled by the system: 
(l) Le pont que le eonvoi a passe quand il a quitte New York ¢e 
matin etait fort long. 
(The bridge that the convoy crossed when it left New York 
this morning was very long.) 
284 
To allow for portability, the SABA parser translates its natural 
language input into an ~mtermediate" semantic network formalism 
called SF (for "Sentence Formalism'), presented in details in (Binot, 
1984, 1985). Before generating the SF output, SABA builds a 
simplified semantic graph expressing all the semantic dependencies 
established between the meaningful terms of the sentence. The 
graph established for sentence (1) is shown in figure (2). 
(2) 
pont w 
LR 
que *. 
11 
BENEFICIARY VALUE INTENSITY 
QUAL long fort 
OBJECT AGENT 
0~ * convol 
MOMENT I passer 
AGENT OBJECT 
I~0~ * 
quitter New York 
MOMENT 
* matin 
These kinds of dependencies are established by using the ~dual 
frames" method described in (Binot and Ribbens 86). Dual frames 
is a general method for establishing binary semantic dependencies 
between all possible types of meaningfull terms. This method sup- 
ports also a hierarchy of semantic classes and an inheritance mech- 
anism allowing the designer to specify generic semantic frames at a 
general level. However, we are not cone*reed here by the specifics 
of a particular semantic method, but by the kind of syntactic sup- 
port necessary to establish such dependencies (or, to put it another 
way, by the kind of syntactic support needed to identify accurately 
the arguments fdling the role slots of various meaningfull tenns). 
3. Fragmentation 
3.1 General discussion 
Consider again sentence (1) and suppose that a purely semantic 
system were to understand it by establishing semantic dependencies 
between words. There would be no reason for such a system to re- 
frain from attempting to connect "was long" to "convoy', for ex- 
ample, And, if the attempt is made, no amount of semantic or 
pragmatic knowledge will be able to prevent the connection, which 
is perfectly valid as such. Note also that a simple proximity principle 
would not work in this case. 
Thus, any natural language processing system must take into 
account, in some way, the structure of a sentence, ttowever, we 
don't necessarily need to build an intermediate syntactic structure, 
such as a parse tree, showing the detailed "phrase structure" of the 
input. The most crucial structural information needed for an accu- 
rate semantic processing concerns "boundaries" across which se- 
mantic processing should not be allowed to relate words. These 
boundaries can be identified by a fragmentation process which will 
cut a sentence into useful fragments by looking for specific types of 
words. 
Except maybe in Wilks" system fragmentation has not received 
the attention it deserves as a faster alternative to full syntactic pars- 
ing. Wilks" fragmentation process, however, was by his own ad- 
mission too simple. In his system, fragmentation was performed 
only once as a preprocessing step, and was designed around the size 
of his notion of "template'. Both of these characteristics, we think, 
give rise to problems. 
Performing fragmentation as a single preprocessing step is obvi- 
ously insufficient for garden path sentences and for all the structural 
ambiguities that cannot be solved without the help of the semantic 
module. Although Wilks said something about involving some se- 
mantic processing at the fragmentation stage, notably for handling 
the ambiguity about "that% he never presented, to our knowledge, 
a systematic procedure to integrate fragmentation and semantics. 
On the other hand, we believe that template sized fragments are 
more troublesome and less us,full than clause sized fragments. Even 
in straightforward active declarative sentences, two distinct mech- 
anisms must be provided to establish semantic dependencies in 
Wilks system: template matching, which identifies ~agent-action- 
object" triples, and paraplates, which are used to tie these templates 
together. A prepositional phrase constitutes a separate template. 
One problem with that approach is that in sentences such as "The 
old man / in the comer / left', fragmented by Wilks as shown by the 
./, the agent ends up in a different fragment than the action and 
an additionnal step will be required to relate the two. The same 
problem seems to arise in passive structures ('John is loved / by 
Mary'). To avoid these kinds of problems, we decided to use clause 
sized fragments and to establish semantic dependencies directly at 
the clause level. 
A third difference between the two approaches is that, while 
Wilks never provided a systematic method to solve part of speech 
ambiguities, SABA makes use of a part of speech disambiguation 
preprocessor, which will be described in the second part of this pa- 
per. This module being applied before fragmentation, we shall as- 
sume in the following discussion of the fragmentation mechanism 
that each word has a single part of speech. 
3.2 The fragmentation mechanism. 
We have implemented in the SABA system a fragmentation mech- 
anism which uses the clause as the fundamental fragmentation unit 
and which is repetitively applied and interleaved with the semantic 
processing. We start by presenting the basic algorithm, then, in the 
next sections, we shall discuss some more difficult problems and 
show how the introduction of two additionnal mechanisms, ejection 
and backtracking, can solve them. 
Fragmentation algorithm: 
Repeat the following until success or dead end 
I. Fragment the sentence into clauses; 
2. Select the innermost clause; 
3. Process the selected clause, which includes: 
a. The fragmentation of the clause into groups; 
b. The establishement of semantic dcpendancies inside each 
group; 
c. The establishement of semantic dcpendancies at the clause 
level; 
4. If the processing is suecessfull, erase the text of the clau~ from 
the input and replace it by a special non terminal symbol. 
This algorithm follows a bottom-up strategy in which the in- 
nermost clause (the most dependent one) is always processed first. 
Ties are resolved by a left to right prefercnce rule. The special sym- 
bols used in step 4 are PP CProposition Principal,") for a main 
'clause, PR for a relative clausc, PC for a conjunctive subordinate 
clause and PINF for an infinitive clause. Participe clauses are proc- 
essed as special kinds of relatives, as we explain in section 4.2. 
Success in the above algorithm means that the input has been 
reduced to the PP symbol or to a string of such symbols and con- 
junctions. A dead end is reached if fragmentation can find no new 
clause or if the selected clause cannot be processed. What happens 
then will be discussed in the next sections. 
285 
As can be seen in the above algorithm, fragmentation in SABA 
is in fact a two level process: sentences are fragmented into clauses 
and clauses into groups. Fragmentation into groups, wich gives far 
less problems than fragmentation into clauses, will not be discussed 
at all in this paper. 
Fragmentation of a sentence into clauses proceeds by extending 
to the left and to the right of each verb 2 and checking each en- 
countered word looking for clause delimiters. The checks are per- 
formed by heuristic rules based on the part of speech of each word. 
Other rules will then look at the delimiters to fred the innermost 
clause. 
The rules checking if a given word is a delimiter are given below. 
The term "explicit clause boundaries" used in the rules denotes the 
following kinds of words: relative or interrogative pronouns, rela- 
tive or interrogative adjectives, subordinate conjunctions and coor- 
dinate conjunctions. Coordinate conjunctions, which raise special 
problems, will not be discussed before section 3.5. 
Clause fragmentation rules. 
1. Explicit clause boundaries other than coordinate conjunctions 
are always clause delimiters; they are included in the clause on 
the left and excluded on the right) 
2. The special symbols PR, PC, PINF are never clause delimiters. 
3. Sentence boundaries are always clause delimiters. 
4. Another verb and the symbol PP are always clause delimiters, 
and are always excluded from the clause. 
5. Negation particles ('ne', "n") are considered as (excluded) 
clause delimiters when expanding to the right of the verb of the 
clause. 
Rules 1 to 4 are rather immediate. Rule 5 takes into account the fact 
that negation particles in French are always placed before the ne- 
gated verb. 
The basic clause selection rules (for choosing the innermost 
clause) are equally simple. A clause is subordinate if its left bound 
is a relative or interrogative pronoun (or adjective), or a subordinale 
conjunction, or if its verb is an infinitive. A clause is said to be free 
(meaning that it is not qualified by other subordinate clauses which 
should be processed first) if its right bound is not one of these terms. 
The leftmost free and subordinate clause, or, if none, the leftmost 
free clause will be chosen. 
Let us illustrate the effect of the above rules on example (1). The 
figure (3) below shows the successive states of the input text. In 
each state, the last fragmentation remit is indicated by underlining 
the identified clauses. The semantic l~'ocessing of the innermost 
clause selected at each step leads to the building of the correspond- 
ing part of the graph of figure (2). 
(3) Le pont,que le convoi a passeoquand il a quitte INeW- York ce 
matin~ etait fort long~ 
Le pont,que le convoi a passelPCjetait fort long; 
iLe pont PR etait fort long i 
PP 
As can be seen, a single fragmentation pass will often yield 
imperfect results. There will be holes (sentence fragments which are 
not included in any clause, like ~Le pont" in the first two steps) and 
overlappings (fragments which could be included in two clauses, like 
"New-York ce matin" in the first step). This is where the repetitive 
nature of the fragmentation process comes into play. Successive 
2 
Except auxiliaries that are part of a compound verbal form. 
a 
If the lea clause bound is a relative pronoun preceeded by a preposi- 
tion. the preposition will also be included in the clause. 
erasing of the innermost clauses from the input text, once they have 
been processed by the semantic module, will gradually cause the 
holes to disappear, and thus reveal the content of the main clause(s). 
Terms in overlapping areas will be automatically tried ftrst in the 
innermost clause to which they could belong, in effect implementing 
a kind of deepest attachment preference. What happens when that 
ftrst try is semantically inacceptable is discussed in the next section. 
Another interesting feature of the bottom-up algorithm is that the 
special symbol representing a processed subordinate clause will be 
naturally included, in later fragmentation steps, in the clause quali- 
fied by this subordinate, thus permitting to process correctly inter 
clause dependencies. 
3.3 The ejection mechanism. 
A ftrst class of problems for which the above fragmentation algo- 
rithm is not sufficient concerns cases when the deepest attachment 
preference fails. This problem occurs typically when a clause has 
no explicit clause boundary on one side, as in the examples (4) and 
(5~, below: 
(4) rl'aime I'homme,~lue /e presente a mon pere.~ 
(I love the man whom I introduce to my father) 
(5) rle presente rhomrne, flue /'aline a mon pere. I 
(I introduce the man wh'om 1 love to my father) 
In both eases the relative clause has no explicit fight boundary, and 
the attachment problem concerns the group "a mon pete". The 
fragmentation result (shown by underlines) will in both cases in- 
elude this group in the relative clause, which is wrong for (5). In 
such cases, the fragmentation will be automatically corrected, after 
the semantic processing of the relative clause, by a "right-ejection" 
mechanism: 
Right ejection mechanism 
If a group G on the right of the verb remains unconnected 
after the semantic processing of a clause, and if there is no 
other term on the right of G which has been connected to a 
term on its left, then G and all terms on its right will be ex- 
cluded from the current clause. 
In the case of example (5), assuming reasonnably that no semantic 
dependency can be established between "aime" and "a mon pere", 
this last group will be ejected froin the relative clause, giving the 
situation shown in (6): 
(6) iJe presente t homme que j'airae l a mon pere. 
Since fragmentation is interleaved with the semantic processing, the 
next fragmentation step will automatically pick up the discarded 
term after the processing of the relative clause, and insert it at the 
correct level: 
(7) Je presente rhomrae PR a rnon pere, 
The same mechanism applies to overlapping cases, such as in ex- 
ample (8): 
., , I (8) L'homrne ique I at rencontre sur la place rnla off err un care. I 
(The man that I met in the square bought me a coffee) 
Here, two groups appear in the overlapping fragment. The first one, 
"sur la place" ('on the squarer), can easily be connected to the rel- 
ative verb (as a location argument) and will remain in the relative 
clause. The second, "m" ("me") cannot be connected to "rencontre" 
('met"), the object slot of that verb being already fdled by the rela- 
tive pronoun "que". ~m" will thus be ejected from the relative 
clause, and included correctly in the main clause during the next 
fragmentation step. 
It is worth mentiorming that this mechanism involves no back- 
tracking and is extremely cheap in computational ressources. The 
only processing required is the displacement of the right clause 
boundary before erasing the text of the processed clause. 
286 
3.4. Infinitive clauses and backtracking. 
Infinitive clauses without an explicit left boundary (such as a sub- 
ordinate conjunction) give rise to several interesting problems con- 
ceming both fragmentation itself and the selection of the innermost 
clause. Consider the following examples: 
(9) ~J'irai 'ce soir a Parislvoir \[exposition'. 
(I will go this evening to Paris to see the exposition) 
(I0) r/e n'ai \]amats" vu Jacquesjl travadler." 
(I never saw Jacques working) 
In both eases, there is an attachment problem for the terms in the 
overlapping area. In (9), all the terms in that area belong to the 
relative clause, while in (10) Jacques is the subject of the infinitive 
clause. One might want to define here a "left-ejection" mechanism 
similar to the one described in the last section; however it would 
almost never work properly. Indeed, if terms such as "this evening" 
or ~to Paris ~ are tried in the infinitive clause first, there would be 
no reason to reject them during the semantic processing of that 
clause, and they will never be ejected. Things work out better if 
we try first the terms in balance in the main clause. This choice will 
be wrong when one of these terms is in fact the subject of the 
infinitive verb; but in that case, as we shall see, this term will conflict 
with the infinitive verb for Idling the OBJECT slot of the main verb, 
and the system will have a reason to reject the wrong choice. Ac- 
cordingly, we apply the following strategy: 
1, try first to place the terms of the overlapping area in the main 
clause; in effect, this consists in preventing the infinitive clause 
to extend to the let~ of its verb; 
2. if the choice made at point I fails, use a backtracking mech- 
anism that will restore the proper state of the analysis and try 
to extend, one group at a time, the left bound of the infinitive 
clause. 
With this strategy, (9) will be processed correctly at the frost try. (10) 
will lead to the following (erroneous) state of the analysis: 
(I1) iJe n'ai jamais vu Jacques PIN F. t 
where "Jacques" and PINF compete for the object slot of the main 
verb. The term PINF will then be ejected by the mechanism of the 
last section, giving the following state: 
(12) PP PINF 
This is a dead end state, since the sentence is not reduced to a PP 
symbol, and yet no further clause to process can be found. The 
backtracking mechanism will then restore the state shown in (10) 
with the following fragmentation, which leads to a successfull anal- 
ysis: 
(13) t Je n'ai jamais vu tlacques travailler.j 
Infinitive clauses raise also problems concerning the selection of the 
innermost clause. Consider the following examples: 
(14) J" ai vu un homme, qui voulai~ (tormir sur le trottoir~ 
(I saw a man who wanted to sleep on the street) 
(15) r\]" ai vu un hommet~qui avait bt~fdormir sur le trottoir.j 
(I saw a man who was drunk sleep on the street) 
In both cases, the selection rules will choose to process the infinitive 
clause fu'st. This choice is wrong for (15): if the relative relative 
clause is not processed first, its presence will prevent the system to 
fred out that the group "un homme" is in fact the subject of the 
infinitive clause. Processing the infinitive fu'st, the system will reach 
a dead end after the following steps: 
(16) tl" ai vu un homme tqui avait bu PINF I (ejection of PINF) 
iJ'ai vu un homme PR PINFi(ejection of PINF) 
PP PINF (dead end) 
This problem is again handled by backtracking. Let us note fu'st that 
the problem arises only when the subject of the infmitive verb is 
separated from that verb by a relative clause. In such a case, the 
system will try to process the infinitive ftrst, but will save the current 
state of the analysis so that it can later backtrack and process the 
relative first. In the case of our example, backtracking to (15) from 
the dead end state in (16), and processing the relative clause first, 
we obtain a correct analysis, as shown in (17): 
(17) iJ'ai VUl Un homme PRidormir sur le trottoir. 
iJ'ai vu PINF I 
3.5 Coordinate conjunctions 
Fragmenting sentences with coordinate conjunctions requires to 
make a decision regarding the scope of these conjunctions; specif- 
ically we need to distinguish between the conjunctions which coor- 
dinate clauses and the ones which coordinate groups inside a same 
clause. The following rules are used: 
Clause delimiter rules for coordinate conjunctions 
1. If the word to the right of the conjunction is a right de- 
limiter, or if next word in the current direction is the 
special symbol PP, the conjunction is taken as delimiter 
(excluded). 
2. If the next clause delimiter in the current direction is an 
explicit clause boundary or a sentence boundary, the 
conjunction is not taken as delimiter. 
3. Otherwise choose to consider first the conjunction as a 
delimiter (excluded); this choice can be undone by back- 
tracking. 
Rule I is based on the fact that there must always be at least one 
• conjunct to each side of a conjunction. If a delimiter is found im- 
mediately to the right, then the conjunction must connect clauses. 
The same is true if the conjunction is adjacent to the PP symbol. 
The following example illustrates the use of this rule: 
(18) iJ'aime les ehien.~qui m'obeissent; et aui ne mordent pas r 
(I love the dogs which obey me and which do not bite) 
~'aime les chiens PR D et ~ui ne mordent pas! 
J'aime les chiens PR et PR; 
PP 
If the next dellmitor is an explicit clause boundary, then there is no 
verb between the conjunction and this delimiter, and thus the 
conjuncts cannot be clauses. This fact, captured by rule 2, can be 
illustrated by the following example: 
(19) ~fque les pommes et les poires etaient cheres.j 
(I learned that apples and pears were expensive) 
J" ai appris PC 
Finaly, if the next delimitor is a verb, the scope ambiguity cannot 
be resolved at this stage. The conjunction could be a clause delim- 
iter, as in (20), or not, as in (21): 
(20) Connors a vaincu Lendl et McEnroe a vaincu Connors. 
(Connors defeated Lendl and McEnroe defeated Connors) 
287 
(21) Les hommes qui aiment les potatoes et les poires aiment aussi 
les oranges. 
(People who like apples and pears like also oranges) 
In such cases, the system will choose to take the conjunction as a 
delimiter, and record the state of the analysis, so that the choice c,'m 
be modified by backtracking. The choice will be correct for sen- 
tence (20). For sentence (2 l), the incorrect choice will lead to a dead 
end, as shown in (22), when the semantic module will try to coor- 
dinate ~hommes" and "poires" as agents of "aiment'. Backtracking 
to the choice point, followed by a new fragmentation, leads to the 
correct solution. 
(22) Les hommes tqui aiment les pommesjet /es poires aiment att¢si 
les oranges. 
iLes hommes PR et les poires aiment aussi les oranges; 
BACKTRACKING 
Les horames ~lui aiment lies potatoes et les poiresj aiment aussi 
les oranges. I 
¢Les hommes PR aiment aussi les oranges. I 
PP 
4. Part of speech disambiguation 
4.1 General discussion 
Many lexically ambiguous words can have different parts of speech 
(hereafter POS). The following table enumerates the main POS 
ambiguities for example (1). 
Le (occurs twice): article or personal pronoun (the, him, it) 
que: subordinate conjunction, relative or interrogative pronoun, 
particle (that, which, what, than) 
quand: subordinate conjunction or adverb (when) 
feet: noun or adverb (castle, very). 
The ambiguity problem is further compounded by an accentuation 
problem. "Passe', third person of the present of the indicative of 
the verb "passer', is quite different in French from "passe", past 
participle of the same verb? Similarly, "a", indicative of avoir ("to 
have'), has nothing to do with the preposition "a% llowever, for- 
getting an accent is one of the most common spelling mistakes. A 
robust system such as SABA must consider words such as "a", 
"passe" and "quiRe" as ambiguous. This would give at least 1024 
possible POS combinations for example (1)! 
Part of speech ambiguity is, of course, part of the more general 
problem of lexical ambiguity. Thus, one could argue that it doesn't 
need an independent solution. However, in the context of a frag- 
mentation system such as the one presented here, a POS 
disambiguation preprocessor is necessary. To give a simple example, 
the relative pronoun and subordinate conjunction senses of "que" 
are clause boundaries, while the (comparative or restrictive) particle 
sense is not. Many other problems of semantic processing need a 
prior decision regarding the POS of the words involved. Thus the 
French word "or ~ can be a noun ("gold'), and as such can fill a se- 
mantic role slot of some verb, or can be a coordinate conjunction 
("however'); qe" can be pronoun ('him', "it') and as such induce a 
search for a pronoun reference, or can be a determiner ("the-). 
Many other examples could easily be found. 
Other works have already investigated the usefullness of a POS. 
disambiguation preprocessor, but for syntactic parsers. (Klein and 
Simmons 63) presented very early a table based system for English 
Verb mood ambiguities can usefully be considered at the same level as POS ambiguities. 
where the emphasis was on the capability to classify "unknown 
words', and thereby to reduce the size of the dictionnary. Much 
more recently, (Merle 82) described a rule based POS disambiguator 
for French, its main objective being a gain of performance obtained 
by the reduction of combinatorial explosion during syntatic parsing. 
Mede's rules, however, were rather unwieldy for two reasons: 
1. each rule must make a final decision regarding the POS of one 
word; the designer must ensure himself the absence of contra- 
dictions between the rules. 
2. The rules permitted only to test for fixed patterns in the input. 
In contrast to that, we have developped a method permitting the 
use of cumulative rules and providing the possibility to test variable 
patterns through the use of a search function. 
4.2. The part of speech preprocessor for the SABA 
system. 
We have developped a part of speech disambiguation preprocessor 
for French, which is used as the first stage of the SABA system. 
This preprocessor consists of heuristic rules which are applied to 
each word in order to assign to every possible part of speech a cer- 
tainty factor. The different combinations of possible parts of 
speechs are then tried in decreasing order of likeliness. 
The heuristic rules are based on the well known fact that it is not 
necessary to scan the entire sentence to choose correctly the appro- 
priate part of speech for most words. The local context" (i.e. the 
few surrounding words) proves often enough to provide an accurate 
indication. Thus, if a word like "passe" is closely preceeded by an 
auxiliary, it is almost certainly a participe. As another example, 
"fort", if closely preceeded by a determiner, is more likely to be a 
noun than an adverb. 
We have captured such insights into heuristic rules which assign to 
each possible part of speech a certainty factor, according to the local 
context. Two of these rules, relating to the examples just 
mentionned, are given in natural language form below: 
Rule 2 
If the current word can be a past participe and has other 
possible POS, then 
1. If the current word is preceeded by a word that could 
be an auxiliary, and is only separated from that word 
by words that could be adverbs, personal pronouns 
or particles, then 
past participle CF = 0.7; other possibles POS 
CF = 0.3; 
2. Else: 
relative participe s CF = 0.7; other possible POS 
CF = 0.3. 
Rule 5 
If the current word can be a noun and has other possible 
POS, then 
I. If it is preceeded by a word that could be a 
determiner, and is only separated from it by words 
that could be adjectives or adverbs, then 
noun CF = 0.9; other possible POS CF = 0.1; 
2. else: 
noun CF = 0.4; other possible POS CF = 0.6; 
We distinguish between a participe used in a complex verbal form and 
a participe clause, as in "1he man defeated by Connors was ill'. In the 
later case, the participe will receive a POS called PPAREL ('relative 
partJcipe') because the participe clause is then processed exactly like a 
relative clause: in fact, when the POS PPAREL is assigned to a 
participe, a relative pronoun is inserted just before it. 
288 
These rules need several comments: 
I. Each rule can be seen as a production rule with a condition and 
an action. The condition is the clause starting with the first ~tf" 
of the rule; if it is not satisfied, this particular rule is not applied 
to the current word. The action is often itself a conditionnal 
statement, each branch of which must include a certainty factor 
assigment statement. 
2. The certainty factors that we are using range from 0 (absolute 
uncertainty) to 1 (absolute certainty). They can be compared 
to the belief factors used in the MYCIN system (Shortliffe 76). 
3. The application of any rule must result in one assigrnent of 
certainty factors to all possible POS of the current word. 
llowever, a given word could possess other possible POS than 
those that need to be explicitly mentionned in a given rule. 
These are refered to by the formal expression "other possible 
parts of speecht 
4. The intermediate words tested by a rule can also have several 
possible parts of speech. The expression ~ff such word could 
be of part of speech x" denotes a test bearing on all possible 
parts of speech of that word. 
5. We must be able to specify rules at varying levels of details. 
Sometimes, we will need to test if a word is a personal pro- 
noun; at another time, knowing that it is a pronoun of any kind 
is sufficient. The system offers the possibility to specify a hier- 
archy of parts of speech, which is taken into account by the 
rules. 
The part of speech disambignation preprocessor works in the fol- 
lowing way. It processes successively all the words of the input. For 
each word, it checks the conditions of all rules and fires all applica- 
ble rules. If several rules are applied to a same word, certainty fac- 
tors are combined by the following formula: 
CF = I - ((I - CFI)'(I - CF2)) 
where CFI and CF2 ate the certainty factors to be combined. Vdhen 
this is done, possible POS combinations are ordered by decreasing 
order of likeliness. Tile likeliness of a combination is simply defined 
as the product of the certainty factors of the parts of speech included 
in that combination. 
Although each rule is considered for every word, the resulting 
process is very fast. The ftrst reason for that is that there are very few 
rules: 14 in the current implementation. This is nothing compared 
to the size of the rule base needed for a large grammar, and yet these 
few rules are sufficient to choose the correct POS at the ftrst try in 
more than 80% of our test sentences. The second reason is that 
each rule is garded by a short, easy to check and very selective 
condition, so that most of the rules are immediately discarded for a 
given word. 
4.3. Implementation of the rules. 
The rules are implemented in a "semi-declarative" way: they can be 
specified separately, each being described as a condition-action pair. 
However, both condition and action can be any evaluable LISP 
form. in order to ease the task of rule specification, we have defmed 
a set of primitive operations. The figure (23) gives the formal 
specification of Rule 2. 
tlOMOGRAPH checks ff a word has more than one possible parts 
of speech. POSSIBLE-STYPE checks ff the specified part of 
speech is one of the possible parts of speech of the word. 
DEFINE-PTYPELIST assigns to each part of speech of the word 
a specific certainty factor. EXISTWORD, lastly, is a highly pa- 
rametered function performing searches in the input sentence. Its 
parameters are: 
1. POSITION: the starting word for the search; 
2. DIRECTION: the direction of the search (LEFT or RIGIIT); 
3. LIMIT: the ending word, beyond which the search should be 
stopped; 
4. GOAL-NAMES: admissible names for the target word 
5. GOAL-TYPES: admissibles parts of speech for the target 
word; 
6. GOAL-CLASSES: admissible semantic classes for the target 
word; 
7. BETWEEN-NAMES: admissible names for intermediate 
wolds 
8. BETWEEN-TYPES: admissible parts of speech for intermedi- 
ate words; 
9. BETWEEN-CLASSES: admissible semantic classes for'inter- 
mediate words; 
10. EXCLUDED-NAMES: excluded names for intermediate 
words; 
11. EXCLUDED-TYPES: excluded parts of speech for interme- 
diate words; 
12. EXCLUDED-CLASSES: excluded semantic classes for inter- 
mediate words. 
Parameters 3 through 12 are optional. The default value for LIMIT 
is the sentence boundary. The default value for parameters 4 
through 9 is "(ALL)", denoting that all values are accepted. The 
default value for parameters 10 through 12 is NIL (no value is ex- 
cluded). 
5. Results and conclusions 
We have presented two syntactic processes which offer useful and 
necessary support for semantic processing, syntactic parser. Both 
are based on simple heuristic rules assisted by a backtracking 
mechanism. Both have been implemented in the SABA system and 
tested on a corpus of about 125 sentences. Less than 5% of these 
required a backtracking of the fragmentation process. Since we tried 
to characterize precisely the situations in which a backtracking could 
arise, in most sentences there is not only no backtracking, but also 
no bookkeeping of the intermediate steps. 
(23) 
(ADD-SYNT-RULE R R2 
Condition 
(AND (POSSIBLE-STYPE WORD) 'PPA) (HOMOGRAPH WORD)) 
Action 
(COND ((EXISI~/ORD position (LEFT WORD) 
direction 'LEFT 
goal-classes '(AUX) 
between-types '(PR ADV PT)) 
(DEFINE-PTYPELIST WORD '((PPA . .7)(OI~{ERS . .3)))) 
(T (DEFINE-PTYPELIST WORD '((PPAREL . .7)(OTHERS . .3)))))) 
289 
As for the part of speech disambiguation preprocessor, the 14 rules 
that we implemented were sufficient to make the right choice in 
more than 80% of the cases. The very small size of this preprocessor 
is an important advantage if we think at the high human and com- 
putational costs involved in developing and using large size gram. 
mars. 
Although the specific rules that we implemented .were designed for 
French, we believe that the approach could be applied to other 
languages as well. 
ACKNOWLEDGMENTS 
Thanks are due to Professor D. Ribbens for his numerous helpfull 
comments and for his active support. 
REFERENCES 
1. Binot, J-L. 1984. A Set-oriented semantic network formalism 
for the representation of sentence meaning. In Proc. ECAI84, 
Pisa, September 1984. 
Binot, J-L. 1985. SABA: vers un systeme portable d'analyse du 
francais ecrit. Ph.D. dissertation, University of Liege, Belgium. 
Binot J-L. and Ribbens D. 1986. Dual frames: a new tool for 
semantic parsing. In Proc. AAAI86, Philadelphia, August 1986. 
Binot J-L, Gailly P-J. and Ribbens D. 1986. Elements d'une 
interface portable et robuste pour le francais ecrit. In Proc. 
liuitiemes .lournees de glnformatique Francophone, Grenoble, 
January 1986. 
Boguraev B.K. 1979. Automatic resolution of linguistic ambi- 
guities. Ph.D. thesis, University of Cambridge, England, 1979. 
Jensen K., Heidom G.E., Richardson S. and ttaas N., PLNLP, 
PEG and CRITIQUE: three contributions to computing in the 
Humanities. In Proc. of the conf., on Computers and tiumani- 
ties, Toronto, April 1986. 
Klein S. and Simmons R.F. A computational approach to 
grammatical coding of English words. Journal of the ACM. 10, 
March 1963. 
Lytinen S.L. 1986. Dynamically combining syntax and seman- 
tics in natural language processing. In Proc. of AAAI86, 
Philadelphia, August 1986. 
Merle A. 1982. Un analyseur presyntaxique pour la levee des 
ambiguites darts des documents ecrits en langue naturelle: ap- 
plication a gindexation automatique".Ph.D, thesis, Institut Na- 
tional Polytechnique de Grenoble. 
10. Ristad E.. 1986. Defining natural language grammars in GPSG. 
In Proc. of the 24th meeting of the ACL, New-York, June 1986. 
I1. Schank R.C., Leibowitz M. and Bimbaum L. 1980. An inte- 
grated understander. In Journal of the A CL, 6: I. 
12. Shieber S. 1986. An introduction to unification-bct¢ed ap- 
proaches to grammar, University of Chicago Press. 
13. Shortliffe E.H. 1976. CorrqTuter-based medical consultation: 
M YCIN Elsevier. 
14. Weir D.J., Vijay-Shanker K. and Joshi A.K. 1986. "File re- 
lationship between Tree adjoining grammars and head gram- 
mars. In Proc. of the 24th meeting of the ACL, New-York, June 
1986. 
15. Wilks Y. 1975. An intelligent analyser and understander of 
English. CACM 18:5, May 1975. 
2. 
3. 
4. 
5. 
6. 
7. 
8. 
9. 
290 
