An Endogeneous Corpus-Based Method for Structural Noun Phrase 
Disambiguation 
Didier Bourigault 
Centre d'Analyse et de Math6matiques Sociales 
(EHESS - Paris Sorbonne - CNRS) 
and 
Electricit6 de France - Direction des Etudes et Recherches 
Service Informatique et Math6matiques Appliqu6es 
1, avenue du G6n6ral de Gaulle 
92141 Clamart Cedex 
FRANCE 
Abstract 
In this paper, we describe a method for structural 
noun phrase disambiguation which mainly relies 
on the examination of the text corpus under 
analysis and doesn't need to integrate any 
domain-dependent lexico- or syntactico-semantic 
information. This method is implemented in the 
Terminology Extraction Sotware LEXTER. We 
first explain why the integration of LEXTER in 
the LEXTER-K project, which aims at building 
a tool for knowledge extraction from large 
technical text corpora, requires improving the 
quality of the terminolgy extracted by LEXTER. 
Then we briefly describe the way LEXTER 
works and show what kind of disambiguation it 
has to perform when parsing "maximal-length" 
noun phrases. We introduce a method of 
disambiguation which relies on a very simple 
idea : whenever LEXTER has to choose among 
several competing noun sub-groups in order to 
disambiguate a maximal-length noun phrase, it 
checks each of these sub-groups if it occurs 
anywhere else in the corpus in a non-ambiguous 
situation, and then it makes a choice. The 
half-a-million words corpus analysis resulted in 
an efficient strategy of disambiguation. The 
average rates are : 
27 % no disambiguation 
70 % correct disambiguation 
3 % wrong disambiguation 
The LEXTER-K project : 
knowledge extraction from large 
technical text corpora 
LEXTER is a Terminology Extraction Software 
(Bourigault, 1992a, 1992b). A corpus of 
French-language texts on any (technical) subject is fed 
in. LEXTER performs a grammatical analysis of this 
corpus and yields a list of noun phrases which are 
likely to be terminological units, representing the 
concepts of the subject field. This list together with the 
corpus it has been extracted from is then passed on to 
an expert for validation by the means of a 
terminological hypertext web. LEXTER has been 
developped in an industrial context, in the Research and 
Development Division of Electricit6 de France. It was 
previously designed to deal with the problem of 
creating or updating thesauri used by an Automatic 
Indexing System. 
We are integrating LEXTER in a text analysis 
tool to aid knowledge acquisition in the framework of 
Knowledge-Based System construction. This tool 
(LEXTER-K) will propose a structured list of candidate 
terms, rather than a flat list, which could be considered 
as a first coarse-grained modelisation of the information 
conveyed by the texts under analysis. 
Structuring of the terminology will be 
performed in two ways : on the one hand, by a 
structural analysis of the terminological noun phrases 
extracted by LEXTER; on the other hand, by an 
analysis of the sentences in which the candidate terms 
occur. This analysis will focus on the most relevant 
terms, determined by a statistical processing based on 
the assumption that the most frequent terms are 
probably the most relevant. 
We plan a two-stage architecture for 
LEXTER-K, that is, (1) the extraction of the 
terminology of the subject field, by a robust 
grammatical analysis (LEXTER), (2) the syntactic 
analysis of the sentences by a parser using this 
terminology. The syntactic structures of the sentences 
in a text, and the syntactic structures of the 
terminological units have to be placed on two different 
organisational levels. As the terminological unit is a 
semantic unit, it should be treated as such on the 
syntactic level, as well. Dissociating these two 
81 
analysis, though one taking advantages of the results 
given by the other, will guarantee a better efficiency for 
the parser, in particular by limiting the combinatory 
explosion of structural ambiguities. 
It is well known that Natural Language 
systems usually require considerable knowledge 
acquisition, especially in building a specifically 
oriented field vocabulary in the case of systems which 
have to analyse technical texts. We think that this 
two-stage analysis (extracting the terminology of the 
domain with a robust superficial analysis, and 
analysing the texts with a more in-depth parser using 
this terminology), may lighten the expensive burden of 
hand-coding a specialized language and may lead to 
more generic and domain-independant Natural Language 
systems. 
As long as the terminology extracted during 
the first step has not been validated by an expert, as is 
now the case, and it directly feeds the syntactic 
analyser, this two-stage architecture requires a better 
quality of the terminology extracted by LEXTER. This 
is the reason why we tried to improve the precision rate 
in the detection of the terminological noun phrases by 
implementing an efficient strategy for structural noun 
phrase disambiguation. This strategy is described in the 
following sections. 
2 The issue of parsing the 
ambiguous "Maximal-Length" 
Noun Phrases 
In this section, we briefly describe the type of 
grammatical analysis performed by LEXTER to extract 
likely terminological units and show what kind of 
disambiguation LEXTER has to perform. 
As we already pointed out, LEXTER has been 
achieved in an industrial context; from the beginning of 
the project, we had decided to focus upon a strongly 
restrictive criterium : applying and testing the system 
over a wide range of texts. The texts to be analysed are 
unrestricted texts gathered in large corpora. We had then 
to choose a fast and well-proved method. Moreover, we 
argued that, given the restricted grammatical structures 
of complex terminological units, it was not necessary 
to go into a complete syntactic analysis of the 
sentences to extract the terminology from a corpus 
(Bourigault, 1992b). 
First, a morphological analyser tags the texts, 
using a large lexical database and rules of lexical 
disambiguation. LEXTER treats texts in which each 
word is tagged with a grammatical category (noun, 
verb, adjective, etc.). LEXTER works in two main 
phases : (1) splitting and (2) parsing. 
(1) At the splitting stage, LEXTER takes 
advantage of "negative" knowledge about the form of 
terminological units, by identifying those string level 
patterns which never go to make up these units and 
which can thus be considered as potential 
terminological limits. Such patterns are made up by, 
say, conjugated verbs, pronouns, conjonctions, certain 
strings of preposition + determiner. The splitting 
module is thus set up with a base of about 60 rules for 
identifying frontier markers, which it uses to split the 
texts. The splitting phase produces a series of text 
sequences, most often noun phrases. These noun 
phrases may well be likely terminological units 
themselves, but more often than not, they contain 
sub-groups which are also likely units. That is why it 
is preferable at the splitting stage to refer to the noun 
phrases identified as "maximal-length noun phrases". 
Here is an example of a real maximal-length noun 
phrase : MESURE DU DEBIT DU VENTILATEUR 
D'EXTRACTION AVEC TRAPPE EN POSITION 
FERMEE (noun prep det noun prep det noun prep 
noun prep noun prep noun adj). 
(2) At the parsing stage, LEXTER parses the 
maximal-length noun phrases (henceforth MLNP) in 
order to generate sub-groups, in addition to the MLNP, 
which are likely terminological units by virtue of their 
grammatical structure and their position in the MLNP. 
The LEXTER parsing module is made up of parsing 
rules which indicate which sub-groups to extract from a 
MLNP on the basis of grammatical structure. 
Some of the MLNP structures are 
non-ambiguous : given such a structure it can be stated 
with a very high rate of certainty (Bourigault 1992a) 
that only one parsing is valid. The corresponding 
parsing rules are called non-ambiguous rules. For 
example, structure (1) is non-ambiguous, and parsing 
rule \[a\] is a non-ambiguous rule. 
(1) noun1 adj prep noun2 
parsing rule \[a\] 
noun1 adj prep noun2 
noun1 adj 
example of parse 
FUSIBLE THERMIQUE DE FERMETURE 
FUSIBLE THERMIQUE 
Some of the MLNP structures are ambiguous, 
that is, given such a structure it cannot be stated with a 
sufficient rate of certainty that only one parsing is 
valid. Several sub-structures compete. The 
corresponding ambiguous parsing rules generate several 
competing sub-groups. For example, when information 
about gender or number agreement are not available or 
of no help, structure (2) is ambiguous, that is, either 
82 
the adjective attaches the head noun1 of the noun 
sub-group noun1 prep noun2, or it attaches noun2, 
constituting the noun sub-group (noun2 adj); the 
competing noun sub-groups (noun1 prep noun2) and 
(noun2 adj) will be generated by the ambiguous parsing 
rule \[b\]. Structure (3) and parsing rule \[c\] are other 
examples of ambiguous structure and rule. 
(2) noun1 prep noun2 adj 
parsing rule \[bl 
noun1 prep noun2 adj 
# nounl prep noun2 
# noun2 adj 
example of parse 
REGISTRE D' EQUILIBRAGE MANUEL 
# REGISTRE D'EQUILIBRAGE 
# EQU1LIBRAGE MANUEL 
(3) noun1 prep noun2 prep noun3 
parsing rule \[c\] 
noun1 prep noun2 prep noun3 
# noun1 prep noun2 
# noun2 prep noun3 
example of parse 
ALARME DE SYNTHESE DE DEFAUT 
# ALARME DE SYNTHESE 
# SYNTHESE DE DEFAUT 
The issue is how to disambiguate in cases of 
MLNP with ambiguous structures, that means, 
whenever an ambiguous rule applies, how to choose 
among the competing generated sub-groups. The 
strategy of disambiguation is described in the next 
section. 
3 Strategy of disambiguation : 
looking at non ambiguous 
situations anywhere else in the 
corpus 
The strategy of disambiguation relies on a very simple 
idea, that of looking for non-ambiguous situations 
elsewhere in the corpus. Whenever an ambiguous rule 
applying to a MLNP with an ambiguous structure 
generates competing sub-groups, LEXTER (1) checks 
each of them to ascertain if it has been detected in a 
non-ambiguous situation (i.e. generated by a 
non-ambiguous rule) somewhere else in the corpus, and 
(2) chooses among the competing sub-groups using a 
set of disambiguation rules. 
There is one specific set of disambiguation 
rules for each ambiguous structure, which covers all 
the possibles situtations, that is, all, some, only one, 
none of the competing sub-group non-ambiguously 
detected. 
3.1 Situations where none of the competing 
sub-groups has been non-ambiguously 
detected 
Given an ambiguous Maximal-Length Noun Phrase, if 
none of the competing sub-groups has been detected in 
a non-ambiguous situation, LEXTER proposes only 
this MLNP, without any sub-group. No 
disambiguation is performed. On our half-a-million 
words test corpus, the total number of non-ambiguous 
MLNPs is 13,591, the total number of ambiguous 
MLNPs is 3,230, among which 880 are not 
disambiguated. The average rate of no-disambiguation 
is 27%. Rates of no-disambiguation for the ten most 
frequent ambiguous structures are shown in Table 1. 
(1) 
noun prep noun adj 
noun prep noun prep det noun 
noun adj noun 
noun prep noun noun 
noun prep noun prep noun 
noun noun adj 
noun noun noun 
noun noun prep noun 
noun prep noun prep noun adj 
noun prep noun adj adj 
(2) 
573 
331 
294 
260 
241 
193 
160 
82 
42 
33 
(3) 
110 
91 
74 
53 
55 
73 
47 
27 
7 
2 
(4) 
19 % 
27 % 
25 % 
20 % 
23 % 
38 % 
29 % 
33 % 
17 % 
6% 
Table 1 Rates of no-disambiguation for the ten most frequent ambiguous structures 
(1) ambiguous Maximal-Length Noun Phrase (MLNP) structure 
(2) total number of MLNP with this structure on the half-a-million words test corpus 
(3) number of cases where none of the competing subgroups has been detected in a non-ambiguous situation 
(4) rate of no-disambiguation 
83 
We are investigating rules that could perform a 
correct disambiguation in some of these cases. 
Choosing the right sub-group can be done by checking 
for each competing sub-group if it has been generated 
from the analysis of other ambiguous MLNP. For 
example, RE JET D'AIR FROID and CIRCUIT D'AIR 
FROID are two ambiguous MLNP extracted from the 
test corpus and parsed by parsing rule \[lo\] above : 
REJET D'AIR FROID 
# REJET D'AIR 
# AIR FROID 
CIRCUIT D'AIR FROID 
# CIRCUIT D'AIR 
# AIR FROID 
Since none of the sub-groups RE JET D'AIR 
and AIR FROID on the one hand, and CIRCUIT D'AIR 
and AIR FROID on the other hand, have been detected 
in non-ambiguous situations, the MLNPs RE JET 
D'AIR FROID and CIRCUIT D'AIR FROID have not 
been disambiguated by LEXTER. But comparing the 
parsings of these ambiguous MLNPs (AfR FROID 
generated in both cases) can lead to the hypothesis that 
extracting AIR FROID is the correct way of 
disambiguating them. This hypothesis is reinforced by 
the fact that the pattern AIR + adj. is very productive in 
the corpus (AIR EXTERIEUR, AIR FRAIS, AIR 
NEUF, AIR AMBIANT, AIR RECYCLE, etc.). 
Our experiments show that such situations 
(sub-groups never non-ambiguously detected but 
generated from different ambiguous MLNPs) are very 
rare and this explains why we have no specific 
treatment for them yet. 
nounl prep noun2 
detected 
nounl prep noun2 
not detected 
number of occurrences : 141 number of occurrences : 132 
noun2 adj noun2 adj noun2 adj 
detected number of wrong disamb. : 16 number of wrong disamb. : 1 
number of occurrences : 190 number of occurrences : 110 
noun2 adj noun1 prep noun2 
not detected number of wrong disamb. : 15 
Table 2 Set of disambiguation rules for the ambiguous parsing rule \[b\] : 
noun1 prep noun2 adj 
-> 
# nounl prep noun2 
# noun,? adj 
nounl prep noun2 
detected 
nounl prep noun2 
not detected 
number of occurrences : 52 number of occurrences : 81 
noun2 prep noun3 noun2 prep noun3 noun2 prep noun3 
detected number of wrong disamb. : 2 number of wrong disamb. : 0 
number of occurrences : 53 number of occurrences : 55 
noun2 prep noun3 
not detected 
Table 3 Set of disambiguation rules for the ambiguous parsing rule \[c\] : 
nounl prep noun2 prep noun3 
-> 
# nounl prep noun2 
# noun2 prep noun3 
84 
3.2 Situations where at least one competing 
sub-group has been non-ambiguously 
detected 
Given an ambiguous MLNP, systematically keeping 
(all) the competing sub-group(s) detected elsewhere in 
the corpus in a non-ambiguous situation is not a 
satisfying principle of disambiguation. We need more 
precise rules of disambiguation. 
For example, for each of the parsing rules \[b\] 
and \[c\], in more than 20 % of the cases on our test 
corpus (see top left cells of Table 2 and Table 3), both 
competing sub-groups have been non-ambiguously 
detected. That means that both sub-groups are attested 
valid noun phrases as such. However, only one of them 
corresponds to a correct parsing of the MLNP they have 
been extracted from. In these cases keeping both 
sub-groups would alter the precision rate since one of 
them is not grammatically valid. On the contrary, 
generating none of them would alter the recall rate. We 
chose to build a set of disambiguation rules for each of 
the ambiguous parsing rules. 
To work out the disambiguation rules, we 
adopted an empirical approach based on large-scale 
corpus experimentation. For each of the ambiguous 
structures, we examined all the different situations of 
disambiguation (only one, more than one, all the 
competing sub-groups non-ambiguously detected) and 
for each of them, we parsed by hand a significative 
number of ambiguous noun phrases extracted from a 
reference test corpus. Applied to the ambiguous parsing 
rules \[b\] and \[c\], this approach led us to the following 
set of disambiguation rules (see Table 2 and Table 3) : 
where both competing sub-groups have been 
non-ambiguously detected, we checked that most 
often (125 cases/141 for rule \[b\], 50 cases/52 for rule 
\[c\]) the correct parsing isolates the second 
sub-group,noun2 adj for rule \[b\], noun2 prep noun3 
for rule \[c\] (see the top left cells of Table 2 and 
Table 3). 
where only the second sub-group has been 
non-ambiguously detected, it always corresponds to 
the correct parsing and so it is systematically kept 
(see the top right cells of Table 2 and Table 3). 
parsing rules \[b\] and \[c\] differs for the situations 
where only the first sub-group (nounl prep noun2 
for both rules) has been non-ambiguously detected. 
For rule \[b\], this sub-group is kept since it most 
often corresponds to a correct parsing of the MLNP 
(175 cases/190, see the bottom left cell of Table 2). 
On the contrary, for rule \[c\], no systematic rule can 
be stated since the correct parsing sometimes isolates 
this non-ambiguously detected sub-group, but often 
isolates the second one (noun2 prep noun3), altough 
it appears nowhere else in the corpus in a 
non-ambiguous situation (see the bottom left cell of 
Table 3). This mainly happens in cases of "elliptical 
denominations", that is, a concept is first designated 
in a text by a "complete" term (for example, 
CIRCUIT D'ASPERSION D'ENCEINTE), and then 
is systematically refered to with an "elliptical" term 
(for example, CIRCUIT D'ASPERSION). 
The results we obtained with such sets of 
desambiguation rules (see Table 4) are satisfactory and 
show that the strategy described in this paper is 
efficient. This is partly due to the fact that 
terminological noun phrases are fixed never 
disconnected sequences of words with constrained 
grammatical structures. Our strategy was not designed 
to deal with adjective and prepositional phrase 
attachment in unrestricted noun phrases. 
rule \[b\] 
noun1 prep noun2 adj 
--> 
# noun1 prep noun2 
# noun2 ad, i 
rate of no-disambiguation 
rate of correct disambiguation 
rate of wrong-disambi~uation 
rule \[c\] 
noun1 prep noun2 prep noun3 --> 
# noun1 prep noun2 
# noun2 prep noun3 
rate of no-disambiguation 
rate of correct disambi$uation 
rate of wrong-disambiguation 
19 % 
75 % 
6% 
45 % 
54 % 
1% 
Table 4 Rates of disambiguation for parsing rules \[b\] 
and\[c\] 
4 Related works 
Some of the methodological ideas of LEXTER are 
similar to those expressed in (Andreewsky et al., 1977) 
for the extraction and disambiguation of lexical 
sequences in a method of building lexical semantic 
relations dictionaries. More recently a great deal of 
work in computational linguistics has been devoted to 
ambiguity resolution. Many rule-based approaches have 
been proposed which require a huge amount of 
hand-coded knowledge. (Jensen and Binot, 1987) 
proposes a method to disambiguate prepositional phrase 
attachment in English sentences which is similar to 
ours in that it aims at eliminating the hand coding of 
semantic information by exploiting an already available 
source of information. This method differs from ours in 
that the source of information is a machine-readable 
dictionary, and it uses complex heuristics, which are 
different according to the preposition under analysis and 
85 
it makes use of some semantic relationships signaled 
by prepositions. 
With regards to the specific problem of 
structural noun phrase disambiguation, (Wermter, 
1989) describes some experiments on article titles for 
scientific and technical domains. These noun phrases 
have very similar grammatical structures to those of 
our "maximal-length noun phrases". The proposed 
approach, based upon the integration of semantic and 
syntactic constraints, is quite different from ours since 
it requires hand-coding of head nouns with semantic 
features. 
(Zernik, 1992a, 1992b) describes a method to 
distinguish thematic relations (e.g., expressed 
concerns), which take on syntactic variations in the 
corpus, and sentential relations (e.g., preferred stocks) 
that do not. It consists of testing how many times a 
given ambiguous relation occurs anywhere else in the 
corpus in different syntactic configurations. This 
method ("Asking the right questions of the corpus") is 
similar to the strategy described in this paper, and we 
agree with Zernik's general line of thinking : "In order 
for a program to interpret natural language text, it must 
train on and exploit word connections in the text under 
interpretation (Zernik, 1992a)." 
Proceedings of the 2nd symposium of TernuVet, 
Avignon, May 1992 
\[Bourigault, 1992b\] Didier Bourigault. Surface 
Grammatical Analysis for the Extraction of 
Terminological Noun Phrases. In Proceedings of 
COLING-92, Nantes, August 1992 
\[Jensen and Binot, 1987\] Karen Jensen and Jean-Louis 
Binot. Disambiguating Prepositional Phrase 
Attachments by Using On-Line Dictionary 
Definition. Computational Linguistics 13 (3-4), 
1987 
\[Wermter, 1989\] Stefan Wermter. Integration of 
Semantic and Syntactic Constraints for Structural 
Noun Phrases Disambiguation. In Proceedings of the 
l l th IJCAI, 1989, Detroit 
\[Zernik, 1992a\] Lid Zernik. Shipping Departments vs. 
Shipping Pacemakers : Using Thematic Analysis to 
Improve Tagging Accuracy. In Proceedings of 
AAAI-92, July 1992, San Jose 
\[Zernik, 1992b\] Uri Zernik. Closed Yesterday and 
Closed Minds: Asking the Right Questions of the 
Corpus To Distinguish Thematic from Sentential 
Relations. In Proceedings of COLING-92, August 
1992, Nantes 
5 Conclusion 
We described a method for the structural 
disambiguation of complex terminological noun 
phrases, which relies on a simple idea : looking for 
non-ambiguous situations anywhere else in the corpus. 
We call this method Endogeneous since it does not need 
to integrate any domain-dependent, lexico- or 
syntactico-semantic information. We called it 
Corpus-Based since it exploits the experience the 
system has acquired after a first reading of the corpus 
under analysis. In that sense, our method can be viewed 
as a particular implementation of a model of language 
Performance (Bod, 1992), and may belong to the family 
of Data-Oriented methods in Computational 
Linguistics. 
References 
\[Andreewsky et al., 1977\] Alexandre Andreewsky, 
Christian Fluhr and Fathi Debilli. Computational 
Learning of Semantic Lexical Relations for the 
Generation and Automatical Analysis of Content. 
Information Processing 77, 1977 
\[Bod, 1992\] Rens Bod. A Computational Model of 
Language Performance : Data Oriented Parsing. In 
Proceedings of COLING-92, Nantes, August 1992 
\[Bourigault, 1992a\] Didier Bourigault. LEXTER, un 
Logiciel d'EXtraction de TERminologie. In 
86 
