Noun Phrasal Entries in the EDR English Word Dictionary
A. Koizumi, M. Arioka, C. Harada, M. Sugimoto
Japan Electronic Dictionary Research Institute, Ltd.
Tokyo, Japan
L. Guthrie, C. Watts
Computing Research Laboratory
New Mexico State University
Las Cruces, NM, USA
R. Catizone, Y. Wilks
Department of Computer Science
University of Sheffield
Sheffield, UK
Keywords: Lexicon construction, universal features,
resources for CL.
1. Introduction 
The dictionary construction project at the Japan Electronic
Dictionary Research Institute, Ltd. (EDR) in Tokyo began
in 1986 and is almost certainly the largest lexicon
construction project for computational purposes in the world.
This paper describes some aspects of the construction of the
English language dictionary, in particular a project to verify
and enhance information on noun phrases in the English
Word Dictionary undertaken by the Computing Research
Laboratory at New Mexico State University and the University
of Sheffield. We believe the work so far raises issues of
wider linguistic interest which require practical solutions so
that the large scale lexicon project can proceed. We hope
that this paper will show the complexity, diversity, and
richness of the content of the EDR English Word Dictionary.
The key idea has been to construct a system of features,
categories and structures for encoding English words and
phrases that is, at the same time, universal, or at least
sufficiently universal to code both English and Japanese, two
very different languages indeed. This is particularly evident
in the use of left and right "adjacency attributes" in both the
English and Japanese dictionaries. This general idea is a
very natural outcome of the general state of linguistic
theory, at least in the generative tradition, in its broadest
sense: one which emphasises universality in its feature sets
and structural constraints, but which has also evolved by a
long and tortuous route to the current position where the
lexicon is primary in a linguistic system, and all other levels
of linguistic analysis can be seen as a projection from that
level. The alphabet-soup grammar theories that are now
current all share that assumption to some degree.
Thus, a practical attempt to construct a lexicon on principles
as universal as possible for computational use is indeed a
project broadly consistent with the state of generative
theory. Almost all other lexicon construction projects under
way with computation as a main goal (e.g. COMLEX, CUP,
Procter 1992 and see Wilks, Slator and Guthrie, in press) are
designed principally for English, although CUP intends to
augment its structures from non-English corpora as soon as
is feasible, and a COMLEX for Spanish is already under
discussion. Nonetheless, the sheer scale of the EDR
enterprise (see below) and its explicitly universalist assumptions
do make it unique. We will now outline briefly the general
structure of the dictionaries in the project and then proceed
directly to some of the theoretical and computational
choices that have been made in the English lexicon.
2. The EDR Dictionaries 
The EDR Electronic Dictionary (EDR, 1993; Yokoi, 1990)
is designed as the first true machine-readable dictionary that
contains, in a readily accessible form, the information
required for a computer to understand and generate natural
language. As such, the EDR Electronic Dictionary is
intended to be universally applicable and is not restricted to a
particular application system. The part of the dictionary that
handles surface information is kept separate from the
section that handles semantic information: surface information
that is heavily dependent upon a particular language is
stored in the Word Dictionary, and semantic information is
stored in the Concept Dictionary.
There are four different dictionaries that comprise the EDR
Electronic Dictionary: the Word Dictionary, the Concept
Dictionary, the Co-Occurrence Dictionary and the Bilingual
Dictionary. The different dictionaries that make up the
EDR Electronic Dictionary and the EDR Corpus are set in a
structure of mutual interrelatedness. Four types of
constituent data are contained in the EDR electronic dictionaries:
word entries, concept entries, co-occurrence entries and
bilingual entries. Word entries consist of headwords,
grammatical information that indicates the grammatical
characteristics of the word, and concept identifiers that indicate the
concepts represented by a given word in different contexts.
Concept entries represent the relationship between two
different concepts. Co-occurrence entries use co-occurrence
relation labels to describe the possible co-occurrence
relations between headwords. Bilingual entries describe the
word correspondences between headwords in different
languages. Thus, each of the EDR electronic dictionaries is
related to the others, and by using the different component
dictionaries as a single entity, they can usefully be applied
to many forms of natural language processing.
3. The EDR Word Dictionary 
The role of the Word Dictionary is to provide morphological,
syntactic and some semantic information. The Word
Dictionary is divided into a General Vocabulary Dictionary
and a Technical Terminology Dictionary, and the
former is further subdivided into a Japanese General
Vocabulary Dictionary and an English General Vocabulary
Dictionary, each of which contains 200,000 words. The
vocabulary covers words, compounds, and idioms used in
ordinary documents. The Technical Terminology Dictionary
covers words or terms that are specific to information
processing and related fields, and is also split into a
Japanese Technical Terminology Dictionary and an English
Technical Terminology Dictionary. Each contains 100,000 words.
The main characteristics of the General Vocabulary Dic- 
tionary are: 
(1) surface level information and deep (semantic) level 
information are stored separately; 
(2) surface level information is described independent of 
any specific application system or algorithm; 
(3) a large-scale vocabulary contains lexical items used
in general writing.
The Word Dictionary is a collection of word entries that
contain entry information as shown in Fig. 1.
Fig. 1 Structure and Content of Word Entries

Headword Information
    Headword
        Notation
        Adjacency Attributes
    Extra Notation
    Pronunciation
Grammatical Information
    Part of speech
    Syntax tree
    Inflection
    Grammatical attributes
    Function word information
Semantic Information
    Concept identifier
    Concept illustration
Supplementary Information
    Usage
    Frequency

The Headword Information provides headword, extra
notation, and pronunciation. A headword consists of notation
(the orthographic spelling of a word - containing all the
characters common to all inflected forms of the word) and
adjacency attributes. For phrasal entries, the headword is a
list of the pairs of notation and adjacency attributes of each
constituent of the phrasal entry. The adjacency attributes
indicate the possibility of joining one morpheme to another
and are used to create adjacency rules for morphological
analysis and generation. EDR employs a bidirectional
connection grammar which divides the adjacency
constraint attributes into possible connectivity to the left of the
word and possible connectivity to the right of the word.
This information is not normally described in this form for
English, but EDR employs the same method in both Word
Dictionaries so that morphological analysis of Japanese
and English can be made by the same algorithm. The extra
notation information stores headwords in kana for entries
in the Japanese Word Dictionary and in a character string
form with syllable markers for hyphenation for entries in
the English Word Dictionary.
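
The left/right adjacency idea can be sketched in a few lines of Python. This is only an illustration of a bidirectional connection check: the attribute names (WORD_START, VERB_STEM_END and so on) and the rule table are hypothetical and are not EDR's actual codes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    notation: str     # orthographic form of the morpheme
    left_attr: str    # connectivity to the left (hypothetical code)
    right_attr: str   # connectivity to the right (hypothetical code)


# Hypothetical adjacency rules: each pair is (right attribute of the
# left-hand morpheme, left attribute of the right-hand morpheme).
ADJACENCY_RULES = {
    ("VERB_STEM_END", "VERB_ENDING_START"),
    ("NOUN_STEM_END", "NOUN_ENDING_START"),
}


def can_join(left: Morpheme, right: Morpheme) -> bool:
    """Return True if the rule table allows `right` to follow `left`."""
    return (left.right_attr, right.left_attr) in ADJACENCY_RULES


travel = Morpheme("travel", "WORD_START", "VERB_STEM_END")
ing = Morpheme("ing", "VERB_ENDING_START", "WORD_END")
office = Morpheme("office", "WORD_START", "NOUN_STEM_END")

print(can_join(travel, ing))   # True: verb stem + verb ending
print(can_join(office, ing))   # False: no rule links a noun stem to a verb ending
```

Because the check only compares attribute codes, the same function would serve for Japanese or English morphemes, which is the point of using one scheme in both Word Dictionaries.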
Grammatical information consists of part of speech, syntax
tree, inflection information, grammatical attributes and
function word information. The grammatical information
can be used to find the syntactic structure of a sentence in
syntactic analysis. A syntax tree is provided for compound
words or idioms consisting of multiple words. The
function word code corresponding to the notation of the
headword is provided for function words.
A concept (listed under semantic information), in addition
to being a fundamental component of the Concept
Dictionary, describes the semantic content of any word entry in
the Word Dictionary. If the same headword has two or
more different concepts, separate word entries are used in
the Word Dictionaries. This information is used to
distinguish between the various meanings a given word may
have. The concept is the link between the Word Dictionary
and the Concept Dictionary.
Supplementary information provides information on the
usage as well as the frequency of the headword entry.
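
As a rough illustration of the entry layout described in this section, the following Python sketch models a word entry as a record with the four groups of information. The field names, the adjacency codes "L0"/"R0" and the concept identifiers "c001"/"c002" are placeholders invented for the example, not values from the EDR dictionary.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Headword:
    notation: str
    left_adjacency: str    # placeholder code
    right_adjacency: str   # placeholder code


@dataclass
class WordEntry:
    # Headword information (one item per constituent for phrasal entries)
    headword: List[Headword]
    # Grammatical information
    part_of_speech: str
    # Semantic information: exactly one concept per entry
    concept_id: str
    extra_notation: str = ""
    pronunciation: str = ""
    syntax_tree: Optional[str] = None
    inflection: str = ""
    grammatical_attributes: List[str] = field(default_factory=list)
    function_word_code: Optional[str] = None
    concept_illustration: str = ""
    # Supplementary information
    usage: str = ""
    frequency: int = 0


def concepts_for(entries: List[WordEntry], notation: str) -> List[str]:
    """Collect the concept identifiers of all single-word entries for a notation."""
    return [e.concept_id for e in entries
            if len(e.headword) == 1 and e.headword[0].notation == notation]


# Two meanings of the same headword are stored as two separate entries.
entries = [
    WordEntry([Headword("bank", "L0", "R0")], "EN1", "c001",
              concept_illustration="financial institution"),
    WordEntry([Headword("bank", "L0", "R0")], "EN1", "c002",
              concept_illustration="side of a river"),
]
print(concepts_for(entries, "bank"))  # ['c001', 'c002']
```

Keeping one concept identifier per entry mirrors the rule that a headword with several concepts receives several entries, and the identifier is what would link the entry to the Concept Dictionary.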
4. Noun Phrase Entries in the EDR English Word Dictionary
Portions of the English Word Dictionary have been
subjected to rigorous verification through projects at the
Computing Research Laboratory (CRL) at New Mexico State
University, and the University of Sheffield. Following is a
report on one phase of the verification project at CRL
which was aimed at describing the grammatical
information as well as verifying the morphological information.
The objects of this phase of the verification project
consisted of 37,039 entries initially coded by EDR as noun
phrase expressions. Among these entries, 2,389 were
treated as single word entries and 34,650 were treated as
phrasal entries (see below for the distinction between
single word and phrasal entries).
4.1 Phrasal Entries vs. Single Word Entries in the EDR English
Word Dictionary
In the EDR English Word Dictionary, headwords are
treated as either single word entries or phrasal entries:
'single word' refers to a word that is composed of a single
lexical item, while 'phrasal' refers to a word that is
composed of more than one lexical item.
In addition to this distinction based on lexical units, some
words are 'treated as' single word entries even though they
are composed of more than one lexical item. The type of
information that is provided for headwords varies
according to the type of headword. Phrasal entries are given the
same information given to single word entries but they are
also coded with additional information that indicates their
internal syntactic structure. The adjacency attributes and
the grammatical attributes are given to each of the
constituents of the phrasal. Phrasal expressions treated as single
word entries are not segmented into constituent words.
Included under those words that are treated as single word
entries are the following types of words:
-foreign words
-proper nouns
-common nouns derived from proper nouns ("New
Mexican")
-idiomatic expressions which do not fit into a
generalized phrase structure pattern ("on the cheap," "open
sesame")
-function word equivalents
4.2 Information Provided for Noun Phrase Entries
For noun phrase entries in the English Word Dictionary
information is provided for the phrase as a whole as well as
for the individual constituents that comprise the phrase.
Whole phrase information includes designation as either a
common noun or proper noun (proper nouns are treated as
single word entries and constituents are not separately
analyzed), countability, collectivity, gender, verb agreement
and article usage. In addition, the head noun is designated.
The constituent information provided for phrasal entries
includes left and right adjacency attributes, part of speech,
inflection information, and grammatical attributes. The
grammatical attributes that are provided for each of the
constituents vary according to the part of speech of the
constituent. Information regarding collectivity and
countability is provided for nouns and information on possible
comparative, superlative, or positive degree forms is given
for both adjectives and adverbs that appear as constituents
of the phrase. Constituent information is provided within
the context of the whole phrase. Syntax trees are also
described for noun phrase entries.
4.3 Aim of Verification
4.3.1 Coding of Grammatical Information
The primary objective of the verification project was to
code the grammatical information for the noun phrase
entries. The specific information coded for the noun phrase
entries included the intra-phrasal structure of the phrasal,
the grammatical attributes of the constituents and also the
grammatical attributes of the entire noun phrase.
4.3.1.1 Syntactic Relationship Between Constituents
The basic principle used in coding intra-phrasal syntactic
information is that the information should clarify the
syntactic structure of the phrasal entry. For example, the
following phrases look similar on the surface, i.e. adjective +
noun + noun, but actually the internal syntactic structure of
each phrasal is different.
(iii) traveling post office (iv) dead letter box 
The adjective "traveling" modifies the noun phrase "post
office" in the phrase "traveling post office" while the noun
phrase "dead letter", composed of an adjective and a noun,
modifies the noun "box" in the phrase "dead letter box."
For building a source of lexical information to be used in
language processing, indication of the head noun of a
phrase is useful as hypernym information. Location of the
head noun cannot be determined automatically from the
phrase structure. That is to say, it is often the case that the
head noun is the noun occurring in the final position of a
phrasal composed of two (or more) lexical items, but this
rule does not always apply as there are also cases in which
the head noun is the first noun of the phrasal, e.g., court
martial.
During the actual task, the syntactic relationships between
constituents were distinguished by indicating the
intra-phrasal syntax, parenthesizing the immediate constituents
with categorical labels. The categorical labels used to mark
the grouping of the phrasal are shown in the example below:
EAJ(traveling)/EN1(post)/EN1(office)
=> EAJ(traveling)/EN1(EN1(post)/EN1(@office))
In this syntactic notation a slash (/) divides constituents at
the same level, EN1 is an English common noun, EAJ an
English adjective, etc.; and the bracketing structure is a
linearized tree in a standard form, e.g., in (iii) above the tree
expands to the right, while in (iv) it expands to the left. The
symbol "@" indicates the head noun.
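
To make the linearized tree concrete, here is a minimal Python sketch of a reader for this bracket notation. The parser, its names (Node, parse, head_word) and the bracketing given for "dead letter box" are assumptions made for illustration; only the "traveling post office" string is taken from the text above, and this is not EDR's own tool.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    label: str                      # categorical label such as EAJ or EN1
    word: Optional[str] = None      # set for leaves only
    is_head: bool = False           # True if the leaf was marked with "@"
    children: List["Node"] = field(default_factory=list)


# A label is a run of capitals, digits or colons directly followed by "(".
_LABEL = re.compile(r"[A-Z][A-Z0-9:]*(?=\()")


def parse(text: str) -> List[Node]:
    """Parse a whole bracketed entry into a list of top-level constituents."""
    nodes, pos = _parse_seq(text, 0)
    if pos != len(text):
        raise ValueError(f"unexpected text at position {pos}")
    return nodes


def _parse_seq(text: str, pos: int) -> Tuple[List[Node], int]:
    """Parse constituents separated by "/" at one level."""
    nodes = []
    while True:
        node, pos = _parse_node(text, pos)
        nodes.append(node)
        if pos < len(text) and text[pos] == "/":
            pos += 1
        else:
            return nodes, pos


def _parse_node(text: str, pos: int) -> Tuple[Node, int]:
    match = _LABEL.match(text, pos)
    if match is None:
        raise ValueError(f"expected a label at position {pos}")
    node = Node(label=match.group())
    pos = match.end() + 1                     # step over "("
    if _LABEL.match(text, pos):               # nested constituents
        node.children, pos = _parse_seq(text, pos)
    else:                                     # a leaf: optional "@" plus the word
        end = text.index(")", pos)
        word = text[pos:end]
        node.is_head = word.startswith("@")
        node.word = word[1:] if node.is_head else word
        pos = end
    if pos >= len(text) or text[pos] != ")":
        raise ValueError(f"expected ')' at position {pos}")
    return node, pos + 1


def head_word(nodes: List[Node]) -> Optional[str]:
    """Return the word marked with "@", searching depth first."""
    for node in nodes:
        if node.is_head:
            return node.word
        found = head_word(node.children)
        if found is not None:
            return found
    return None


right = parse("EAJ(traveling)/EN1(EN1(post)/EN1(@office))")
left = parse("EN1(EAJ(dead)/EN1(letter))/EN1(@box)")   # assumed bracketing
print(head_word(right), len(right[1].children))  # office 2
print(head_word(left), len(left[0].children))    # box 2
```

In the first tree the nested group is the second top-level constituent (the tree expands to the right); in the second it is the first (the tree expands to the left).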
4.3.1.2 Grammatical Information for the Constituents
Once the intra-phrasal syntax structure of the phrasal has
been determined, the inflection information and
grammatical attributes of the constituents are determined. The
grammatical attributes of the constituents are determined by
considering the constituent as part of the phrase. Given the
information of the constituent words coded as separate
dictionary entries in the EDR English Word Dictionary, the
coding is given based on the behavior of the constituent
when it is used in the phrase.
traveling: EAPOS;EANOCMP;EANOSUP
-> EAPOS;EANOCMP;EANOSUP
post: ENSG;ECN1;ENC -> ENSG;ENU
office: ENSG;ECN1;ENC -> ENSG;ECN1;ENC
The coding ECN1 (takes plural ending -s) and ENC
(Countable) is changed to ENU (Uncountable) to indicate
that the word "post", when used in the context of the
phrase, does not inflect.
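
The recoding step can be sketched as follows. The attribute strings are those of the example above; the function name and the idea of passing in the set of non-inflecting constituents are assumptions of this sketch, since in the project that judgment was made by the human coder.

```python
# Attributes of each word as a separate dictionary entry (from the example).
STANDALONE = {
    "traveling": ["EAPOS", "EANOCMP", "EANOSUP"],
    "post": ["ENSG", "ECN1", "ENC"],
    "office": ["ENSG", "ECN1", "ENC"],
}


def code_in_phrase(words, non_inflecting):
    """Return the in-phrase attributes of each constituent.

    `non_inflecting` names the constituents judged (by a human coder)
    not to inflect inside this phrase.
    """
    coded = {}
    for word in words:
        attrs = list(STANDALONE[word])
        if word in non_inflecting and "ENC" in attrs:
            # Drop the plural-ending and countable codes, mark as uncountable.
            attrs = [a for a in attrs if a not in ("ECN1", "ENC")] + ["ENU"]
        coded[word] = attrs
    return coded


result = code_in_phrase(["traveling", "post", "office"], non_inflecting={"post"})
for word, attrs in result.items():
    print(f"{word}: {';'.join(attrs)}")
# traveling: EAPOS;EANOCMP;EANOSUP
# post: ENSG;ENU
# office: ENSG;ECN1;ENC
```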
4.3.1.3 Grammatical Information for the Noun Phrase Unit
The final process in the coding of the syntactic information
for noun phrasals involves marking the grammatical
attributes for the noun phrase as a whole. The grammatical
attributes marked for the whole phrase include: part of
speech, countability, collectivity, gender, verb agreement
and article usage. The example given in the previous
section, "traveling post office", was coded as a common
countable noun that may be preceded by both the definite
and indefinite articles and is referred to by the pronoun 'it'.
Since the phrase "traveling post office" does not have any
special requirements on verb agreement, that is, when the
noun is used in the singular form it is followed by a
singular verb and conversely, when it is used in the plural form it
is followed by a plural form verb, the verb agreement
marking is left blank for the entry.
4.3.2 Verifying Morphological Information
Although decisions for the descriptions of the intra-phrasal
syntax structure were based on initial coding phases of the
EDR Word Dictionary development, verification and
correction of that morphological information during the
verification project was essential. The coding of the syntactic
information for the phrasal may affect the morphological
information for the entry thus requiring the verification of
morphological information as well. The decisions
regarding the morphological information including segmentation
and part of speech of the constituents could be made with
more precision if the syntactic structure of the phrase was
taken into consideration.
The basic principles of headword determination and
segmentation are as follows:
(1) a headword unit should be determined on the basis of
whether the phrasal expression comprises a single unit
of meaning;
(2) phrasal headwords should be segmented into those
constituents which are also found as single headwords
in EDR's English Word Dictionary.
In view of the second basic principle, the part of speech of
a phrasal constituent is decided according to the lexical
part of speech consistent with the part of speech of the
constituent as a single word entry in the dictionary. For
example, nouns functioning as adjectives in phrases like
"corn stalk" are coded as nouns and verbs modifying nouns
as in the phrase "jam session" are coded as verbs.
The treatment of hyphenated words as single words or
words which should be broken down into separate
constituents is a significant segmentation issue. Hyphenated
words which are used on their own in Standard English
should be treated as single constituents and hyphenated
words which are not used on their own should be broken
down into separate constituents. In the examples below,
the constituents are separated by the slash (/) notation.
"X-ray//spectroscopy" 
"deep-sea//angler" 
"directed/-/energy//weapon" 
"Bose/-/Einstein//statistics" 
A decision on segmentation for some hyphenated words in
phrasal entries is difficult to judge purely by intuition from
looking at the individual phrasal entries. These types of
entries have to be looked at as a whole with attention being
given to wider usage, and in particular to consistency with
other headwords in the dictionary. For example, a decision
to correct "yellow/-/green" to "yellow-green" cannot be
made purely by intuition. The decision here is more an
issue of the selection of headwords rather than one of
hyphenation. The verification task of morphological
information also included raising possible additional
headwords to be added to the EDR English Word Dictionary
through the analysis of the entries. After a decision has
been made regarding entering the hyphenated word of the
phrasal as a headword in the dictionary, the phrasal is fed
back to the segmentation process.
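
The hyphenation rule described in this section can be sketched as a lookup against the headword list. This is a simplified illustration: the HEADWORDS set below is a hypothetical stand-in for the single-word headwords of the dictionary, and the function returns a plain list of constituents rather than the slash notation used in the examples.

```python
# Hypothetical stand-in for the single-word headwords of the dictionary.
HEADWORDS = {"X-ray", "deep-sea", "spectroscopy", "angler",
             "directed", "energy", "weapon"}


def segment(phrase, headwords):
    """Split a phrase into constituents.

    A hyphenated word that is itself a headword stays whole; any other
    hyphenated word is broken at its hyphens, with each hyphen kept as a
    separate constituent.
    """
    constituents = []
    for token in phrase.split():
        if "-" in token and token not in headwords:
            for i, part in enumerate(token.split("-")):
                if i > 0:
                    constituents.append("-")
                constituents.append(part)
        else:
            constituents.append(token)
    return constituents


print(segment("X-ray spectroscopy", HEADWORDS))
# ['X-ray', 'spectroscopy']
print(segment("directed-energy weapon", HEADWORDS))
# ['directed', '-', 'energy', 'weapon']
```

Adding a hyphenated word such as "yellow-green" to the headword set and re-running the function corresponds to feeding the phrasal back to the segmentation process after a headword decision has been made.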
4.4 Some Results of the Work 
4.4.1 Syntactic Patterns 
The result of the coding shows that 98% of the 34,650
phrasal entries could be covered in approximately 40
different patterns. The following seven patterns are the most
frequent and cover over 80% of the total entries.
#   Entries  Pattern                          Example
1.  17422    EN1()/EN1(@)                     hammock chair
2.  10847    EAJ()/EN1(@)                     blue jay
3.    993    *EN2()/EN1(@)                    Doppler effect
4.    707    *EN1(@)/EPP(EPR()/EN1())         piece of cake
5.    456    EAJ(EVE()/EEV())/EN1(@)          circulating library
6.    326    EN1(EVE()/EEV())/EN1(@)          changing room
7.    294    *EN1(EN1()/EEN:ENPOS())/EN1(@)   teacher's pet
*EN1 denotes a common noun, EN2 denotes a proper
noun, EEN:ENPOS a noun possessive ending ('s and '),
EPR a preposition, and EPP a prepositional phrase. The
location of the head of the phrasal is indicated by the @
notation.
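
As a check on the coverage figure, the entry counts of the seven most frequent patterns can be summed and divided by the number of phrasal entries. The short Python calculation below uses only the counts listed above and the total of 34,650 phrasal entries; the variable names are ours.

```python
# Entry counts of the seven most frequent patterns, in table order.
counts = [17422, 10847, 993, 707, 456, 326, 294]
total_phrasal_entries = 34650

covered = sum(counts)
share = covered / total_phrasal_entries
top_two_share = sum(counts[:2]) / total_phrasal_entries

print(covered)                    # 31045
print(f"{share:.1%}")             # 89.6%
print(f"{top_two_share:.1%}")     # 81.6%
```

On these counts the seven patterns account for about 89.6% of the phrasal entries, and the two simplest patterns (noun + noun and adjective + noun) already exceed 80% on their own.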
4.4.2 Grammatical Attributes of the Constituents 
As mentioned earlier, one of the tasks of the coding was to
indicate the grammatical attributes of the constituents. The
data show that in the adjective + noun pattern (EAJ()/
EN1()), the adjective constituent of the noun phrase did not
inflect to form the superlative or comparative degree
forms, but rather most often occurred in the positive degree
form.
The grammatical attributes for nouns other than the head
noun also showed some interesting results. Nouns other
than those designated as the head noun do not inflect in
most of the cases. One of the exceptional cases is "the time
of w#one's life," where "w#one's" is a word class name
for any noun in the possessive form. In this example, "life"
inflects in accordance with the content of "w#one's" word
class, though it is not the head noun of the phrase. Since
phrases like this are very rare, it is also possible to treat
"the time of w#one's life" and "the time of w#one's lives"
as individual headwords and not as the inflected forms of
the same headword. Another exceptional case in which
more than one constituent could inflect would be phrases
containing the conjunction 'and.' However, most of the
phrasal entries in the form of 'A and B' are uncountable
and the final noun inflects if the phrase is countable, such
as "gin and tonics."
Therefore, we can assume that the grammatical behavior of
constituents of noun phrase entries can be properly
described by indicating the head noun and coding the
inflection information and grammatical attributes of the head
noun.
4.4.3 Grammatical Attributes for the Noun Phrase Unit
The coding of grammatical attributes for the entire noun
phrase unit also provided some interesting results on
countability and the usage of articles with the noun phrase.
As expected, the most typical combinations of countability
and article usage are the following:
If the noun is countable it may be preceded by the definite
article or the indefinite article; if the noun is uncountable it
may be preceded by the definite article or no article.
Approximately 10% of the nouns coded as countable
showed a variation on the aforementioned pattern. These
countable nouns were coded as allowing the definite
article, indefinite article as well as no article. Nouns with this
type of coding included mass nouns, names of plants and
animals, metals, food, titles etc. or other nouns which
could refer to both the group or a member of the group.
Examples of such nouns included "Leconte's sparrow",
"Madagascar jasmine", "assembler language", and
"atomic weight". Though this held for the majority of these
types of nouns, it was not universally applicable; the use of
no article with "Nubian goat", "Oregon grape" and "arctic
loon" is questionable.
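
The typical pairing of countability and article usage, and the deviation found in roughly 10% of the countable nouns, can be expressed as a simple check. In this Python sketch ENC and ENU are the countability codes used in this paper, while the article labels ("definite", "indefinite", "none") and the function name are hypothetical.

```python
# Typical article usage for each countability code.
TYPICAL_ARTICLES = {
    "ENC": {"definite", "indefinite"},   # countable
    "ENU": {"definite", "none"},         # uncountable
}


def deviates_from_typical(countability, articles):
    """Return True if the coded articles differ from the typical set."""
    return set(articles) != TYPICAL_ARTICLES[countability]


# "traveling post office": countable, definite or indefinite article.
print(deviates_from_typical("ENC", ["definite", "indefinite"]))          # False
# "atomic weight": coded countable but also allowing no article.
print(deviates_from_typical("ENC", ["definite", "indefinite", "none"]))  # True
```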
The significance of this data is that it implies perhaps a
new code is necessary to cover cases of countable (ENC)
nouns becoming uncountable (ENU) nouns and vice versa.
Instead of coding a single entry as both, or providing two
entries which correspond to the ENC and ENU usage we
might better express the grammatical behaviors which are
commonly shared by particular types of nouns by using a
new code.
4.4.4 Verification of Morphological Information
In the morphological data some entries of the original data
were segmented into constituents and some were not. This
was particularly the case with '-ing' and '-ed' forms of
words. The segmentation was not always consistent. But
through syntactic analysis, verification of the
segmentation and part of speech assignment could be carried out.
The EDR English Word Dictionary does not contain
gerunds or participle forms of a verb as separate headword
entries (except for irregular inflected forms). If a word in
the '-ing' form is regarded as a gerund or a present
participle, it is to be segmented into a verb and a verb ending.
There are some cases where gerund forms or participle
forms have been accepted as lexical items and not as
inflected forms of a verb. In such cases, they are identified
not as verbs, but as nouns or adjectives.
Noun phrases consisting of a word in the '-ing' form and
another noun are treated by using one of the following four
patterns, where EVE denotes an English verb and EEV a
verb ending:
(1) EN1()/EN1(@)
"hunting knife"
(2) EAJ()/EN1(@)
"flying fox"
"man-eating shark"
(3) EAJ(EVE()/EEV())/EN1(@)
"intervening sequence"
"circulating medium"
(4) EN1(EVE()/EEV())/EN1(@)
"changing room"
"participating insurance"
If a phrasal in the form of '-ing + noun' could be reworded 
as 'a noun that is v-ing' or 'a noun that v-s' the entry was 
coded using either pattern 2 or pattern 3. 
Through the verification of the morphological information 
we were able to gain more consistency in the segmentation 
of the constituents of phrasal headwords. Also we were 
able to indicate possible additional headword entries 
through the verification of the constituents that comprise 
the phrasal. 
5. Conclusion 
The syntactic structure of noun phrasal entries is described 
in a relatively small number of patterns. By coding a large 
number of noun phrasal entries it is possible to obtain an 
exhaustive list of syntactic patterns for noun phrases that 
would be listed as headword entries for English dictionar- 
ies. By describing the syntactic structure it is possible to 
obtain the syntactic information which is necessary to iden- 
tify the internal structure of the phrasal as well as confirm 
and improve upon the segmentation of constituents and 
part of speech assignment to each constituent of the phrasal
entry. 
The vast majority of additional vocabulary, not only in the
EDR English Word Dictionary, but in dictionaries in
general will most likely be noun phrases. By utilizing the
results from the current improvement project, the list of
syntactic patterns for noun phrase entries can be used to check
the appropriateness of the phrases as dictionary headwords
as well as provide screening in order to prevent the
recording of ill-formed structures, and finally to indicate syntactic
ambiguity in the noun phrase itself.
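
A screening step of this kind might look like the following Python sketch. It reduces a coded entry to its syntactic pattern by emptying each word slot and then checks the result against a list of known patterns. The function names are ours, and the list holds only the seven most frequent patterns reported in Section 4.4.1 (written with empty parentheses for each word slot), not the full set of approximately 40.

```python
import re

# The seven most frequent patterns from Section 4.4.1.
KNOWN_PATTERNS = {
    "EN1()/EN1(@)",
    "EAJ()/EN1(@)",
    "EN2()/EN1(@)",
    "EN1(@)/EPP(EPR()/EN1())",
    "EAJ(EVE()/EEV())/EN1(@)",
    "EN1(EVE()/EEV())/EN1(@)",
    "EN1(EN1()/EEN:ENPOS())/EN1(@)",
}


def to_pattern(coded_entry):
    """Empty every innermost bracket, keeping only the head marker "@"."""
    return re.sub(r"\((@?)[^()]*\)", r"(\1)", coded_entry)


def needs_review(coded_entry):
    """Return True if the entry's pattern is not in the known list."""
    return to_pattern(coded_entry) not in KNOWN_PATTERNS


print(to_pattern("EAJ(blue)/EN1(@jay)"))       # EAJ()/EN1(@)
print(needs_review("EAJ(blue)/EN1(@jay)"))     # False
print(to_pattern("EAJ(traveling)/EN1(EN1(post)/EN1(@office))"))
# EAJ()/EN1(EN1()/EN1(@))
print(needs_review("EAJ(traveling)/EN1(EN1(post)/EN1(@office))"))  # True
```

An entry flagged here is not necessarily ill-formed; with only seven patterns loaded, it simply falls outside the most frequent ones and would be passed to a human coder.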
References
EDR Electronic Dictionary Technical Guide (EDR, 1993)
Procter, P. (1992) The Cambridge Language Survey.
Cambridge: Cambridge University Press.
Suematsu, H., Sugiura, M., Arioka, M. (1992) "A Distributive
Representational Framework for English Collocations
in an Electronic Dictionary," in: Lingvisticae
Investigationes, XVI:2. John Benjamins, Amsterdam.
Pages 373-394.
Wilks, Y., Slator, B., Guthrie, L. (in press) Electric words:
dictionaries, computers and meanings. Cambridge, MA:
MIT Press.
Yokoi, T. (1990) Towards information technology.
Kyoritsu Shuppan.
