LEXICAL ACCESS IN CONNECTED SPEECH RECOGNITION 
Ted Briscoe 
Computer Laboratory 
University of Cambridge 
Cambridge, CB2 3QG, UK. 
ABSTRACT 
This paper addresses two issues concerning lexical 
access in connected speech recognition: 1) the nature of 
the pre-lexical representation used to initiate lexical look- 
up 2) the points at which lexical look-up is triggered off 
this representation. The results of an experiment are 
reported which was designed to evaluate a number of 
access strategies proposed in the literature in conjunction 
with several plausible pre-lexical representations of the 
speech input. The experiment also extends previous work 
by utilising a dictionary database containing a realistic 
rather than illustrative English vocabulary. 
THEORETICAL BACKGROUND 
In most recent work on the process of word 
recognition during comprehe~ion of connected speech 
(either by human or machine) a distinction is made 
between lexical access and-word recognition (eg. 
Marslen-Wilsun & Welsh, 1978; Klan, 1979). Lexlcal 
access is the process by which contact is made with the 
lexicon on the basis of an initial aconstlo-phonetlc or 
phonological representation of some portion of the 
speech input. The result of lexical sccess is a cohort of 
potential word candidates which are compatible with this 
initial analysis. (The term cohort is used de__ccriptively in 
this paper and does not represent any commitment to the 
perticular account of lexical access end word recognition 
provided by any version of the cohort theory (e.g. 
Marslen-Wilsun, 1987).) Most theories assume that the 
candidates in this cohort are successively whittled down 
both on the basis of further acoustic-phonetic or 
phonological information, as more of the speech input 
becomes available, end on the basis of the candidates' 
compatibility with the linguistic and extralingulstie 
context of utterance. When only one candidate remains, 
word recognition is said to have taken place. 
Most psycholinguistlc work in this area has focussed 
on the process of word recognition after a cohort of 
candidates has been selected, emphasising the role of 
further lexical or 'higher-level' linguistic constraints such 
as word frequency, lexical semantic relations, or 
syntactic and semantic congruity of candidates with the 
linguistic context (e.g. Bradley & Forster, 1987; Marslen- 
Wilson & Welsh. 1978). The few explicit and well- 
developed models of lexical access and word recognition 
in continuous speech (e.g. TRACE, McCleliand & 
Elman, 1986) have small and tmrealistic lexicons of. at 
most, a few hundred words and ignore phonological 
processes which occur in fluent speech. Therefore, they 
tend to ove~.stlmatz the amount and reliability, of 
acoustic information which can be directly extracted 
from the speech signal (either by human or machine) and 
make unrealistic and overly-optimistic assumptions 
concerning the size and diversity of candidates in a 
typical cohort. This, in turn, casts doubt on the real 
efficacy of the putative mechanisms which are intended 
to select the correct word from the cohort. 
The bulk of engineering systems for speech 
recognition have finessed the issues of lexical access and 
word recognition by attempting to map directly from the 
acoustic signal to candidate words by pairing words with 
acoustic representations of the canonical pronunciation of 
the word in the lexicon and employing pattern-matching, 
best-fit techniques to select the most likely candidate 
(e.g. Sakoe & Chiba, 1971). However, these techniques 
have only proved effective for isolated word recognition 
of small vocabularies with the system trained to an 
individual speaker, as, for example, Zue & Huuonlocher 
(1983) argue. Furthermore, any direct access model of 
this type which does not incorporate a pre-lexical 
symbolic representation of the input will have di£ficulty 
capturing many rule-governed phonological processes 
which affect the ~onunciation of words in fluent speech. 
since these processes can only be chazacteris~ 
adequately in terms of operations on a symbolic, 
phonological representation of the speech input (e.g. 
Church. 1987; Frazier, 1987; Wiese, 1986). 
The research reported here forms part of an ongoing 
programme to develop a computationally explicit account 
of lexical access and word recognition in connected 
s1~e-~_~, which is at least informed by experimental 
results concerning the psychological processes and 
mechanisms which underlie this task. To guide research. 
we make use of a substantial lexical database of English 
derived from machine-readable versions of the Longman 
Dictionary of Contonporary English (see Boguracv et 
aL, 1987; Boguraev & Briscoe, 1989) and of the Medical 
Research Council's psycholinguistic database (Wilson, 
1988), which incorporates word frequency information. 
This specialised database system provides flexible and 
powerful querying facilities into a database of 
approximately 30,000 English word forms (with 60,000 
separate entries). The querying facilities can be used to 
explore the lexical structure of English and simulate 
different approaches to lexical access and word 
recognition. Previous work in this area has often relied 
on small illustrative lexicons which tends to lead to 
overestimation of the effectiveness of various 
approaches. 
There are two broad questions to ask concerning the 
process of lexical access. Firstly, what is the nature of 
the initial representation which makes contact with the 
lexicon? Secondly, at what points during the (continuous) 
analysis of the speech signal is lexical look-up triggered? 
84 
We can illustrate the import of these questions by 
considering an example like (1) (modified from Klan via 
Church. 1987). 
(1) 
a) Did you hit it to Tom? 
b) \[dlj~'~dI?mum~\] 
(Where 'I' represents a high, front vowel, 'E' schwa, 'd' 
a flapped or neutralised stop, and '?' a glottal stop.) The 
phonetic trmmcriptlon of one possible utterance of (la) in 
(lb) demonstrates some of the problems involved in any 
'dL,~ct' mapping from the speech input to lexical enu'ies 
not mediated by the application of phonological rules. 
For example, the palatalisation of final/d/before/y/in 
/did/means that any attempt to relate that portion of the 
W'~e¢___h input to the lexicel entry for d/d is h'kely to fail. 
Sitrfi/ar points can be made about the flapping and 
glottalisadon of the B/phonemes in/hit/and/It/, and the 
vowel reductions to schwa. In addition. (1) illustrates the 
wen-known point that there are no 100% reliable 
phonetic or phonological cues to word boundaries in 
connected speech. Without further phonological and 
lexical analysis there is no indication in a transcrilxlon 
like (lb) of where words begin or end; for example, how 
does the lexical access system distinguish word.initial/I/ 
in/17/fzom word-inlernal /I/ in /hid/? 
In this paper, I shall argue for a model which splits 
the lexical access process into a pre-lexical phonological 
parsing stage and then a lexicel enn7 retrieval stage. The 
model is simil~ to that of Church (1987), however I 
argue, firstly, that the initial phonological representation 
recovered from the speech input is more variable and 
often less detailed than that assumed by Church and, 
secondly, that the lexical entry retrieval stage is more 
directed and ~. in order to ~ce the 
number of spurious lexical enuies accessed and to 
cernp~z~te for likely indetenninacies in the initial 
representation. 
THE PRE-LEXICAL 
PHONOLOGICAL REPRESENTATION 
Several researchers have argued that phonological 
processes, such as the palatallsation of/d/in (1), create 
problems for the word recognition sysmn because they 
'distort' the phonological form of the word. Church 
(1987) and Frazier (1987) argue persuasively that, far 
fxom creating problems, such phonological processes 
provide imporu~ clues to the correct syllabic 
segmentation of the input and thus, to the locadon of 
word bounderies. However, this argument only goes 
through on ~ assump6on that quire derailed 'narrow' 
phonetic information is recovered from the signal, such 
as aspiration of M in/rE/ and /tam/ in (1) in order m 
recoguise tim preceding syllable botmdsrles. It is only in. 
terms of this represer~,tion that phonological processes 
c~m be recoguised and their effects 'undone' in order to 
allow correct matching of the input against the canonical 
phonological represenU~ons contained in lexical entries. 
Other researchers (e.g. Shipman & Zne, 1982)have 
argued (in the context of isolated word recogu/tion) that 
the initial representation which contacts the lexicon 
should be a broad mmmer-class transcription of the 
stressed syllables in the speech signal. The evidence in 
favot~ of this approach is, firstly, that extraction of more 
detailed information is nouniously diffic~dt and, 
secondly, that a broad transcription of this type appears 
to be vexy effective in partit/oning the English lexicon 
into small cohom. For example, Huttenlocher (1985) 
reports an average cohort size of 21 words for a 20,000 
word lexicon using a six-camgory manner of articulation 
transcription scheme (employing the categories: Stop, 
Strong-Fricative, Weak-Fricative, Nasal, Glide-Liquid, 
and Vowel). 
This claim suggests that the English lexicon is 
functionally organised to favour a system which initiates 
lex/cal access from a broad manner class pre-lexical 
representation, because most of the discriminatory 
iv.formation between different words is concentra~i in 
the manner articulation of stressed syllables. Elsewhere, 
we have argued that these ideas are mis|-~d;_ngly 
presented and that there is, in fact, no significant 
advantage for manner information in suessed syllables 
(e.g. Carter et al., 1987; Caner, 1987, 1989). We found 
that there is no advantage per s~ to a manner class 
analysis of stressed syllables, since a similar malysis of 
unstressed syllables is as discriminatory and yields as 
good a partitioning of the English lexicon. However, 
concantrating on a full phonemic malysis of stressed 
syllables provides about 10% more information them a 
similer analysis of tmstressed syllables. This research 
suggests, then, that the pre-lexical represenw.ion used to 
initiate lexical access can only afford m concentram 
exclusively on stressed syllables ff these are analysed (at 
least) phonemically. None of these studies consider the 
extracud~ility of the classifications fxom speech input 
however, whilst there is a g~m~ral belief that it is easier 
to extract infonnation from stressed portions of the 
signal, the~ is little reason to believe that mariner class 
infm'mation is, in general, more or less accessible than 
other phonologically relevant features. 
A second argument which can be made against the 
use of broad represmUstions to contact the lexicon (in 
the context of conn~ speech) is that such 
representations will not support the phonological parsing 
n~essary to 'undo" such processes as palatallsation. For 
example, in (1) the final/d/of d/d will be realised as/j/ 
and camgurised as a sarong-fricative followed by liquid- 
glide using the proposed broad manner ~ransoripfion. 
Therefore. palamlisadon will need m be recoguised 
before the required stop-vowel-stop represenr~ion can be 
recovered and used to initiate lexical access. However, 
applying such phonological rules in a constrained and 
useful manner requires a more detailed input 
transcription. Palamllsation inustra~es this point very 
cle~ly; not all sequences which will be transcribed as 
strong-fl'lcative followed by liquid-glide can undergo this 
process by any means (e.g. /81/), but there will be no 
way of preventing the rule oven-applying in many 
inappropriate conmxts and thus presumably leading to 
the get.ration of many spurious word candidates. 
85 
A third argument against the use of exclusively 
broad representations is that these representations will 
not support the effective recognition of syllable- 
boundaries and some word-boundaries on the basis of 
phonotactic and other phonological sequencing 
constraints. For example, Church (1987) proposes an 
initial syllabification of the input as a prerequisite to 
l~dcal access, but his sylla "bificafion of the speech input 
exploits phonotactic constraints and relies on the 
extraction of allophonic features, such as aspiration, to 
guide this process. Similarly, Harringmn et al. (1988) 
argue that approximately 45% of word boundaries are, in 
principle, recognisable because they occur in phoneme 
sequences which are rare or forbidden word-internally. 
However, exploitation of these English phonological 
constraints would be considerably impaired if the pre- 
lexical representation of the input is restricted to a broad 
classification. 
h might seem self-evident that people are able to 
recognise phonemes in speech, but in fact the 
psychological evidence suggests that this ability is 
mediated by the output of the word recognition process 
rather than being an essential prerequisite to its success. 
Phoneme-monimrin 8 experiments, in which subjects 
listen for specified phonemes in speech, are sensitive to 
lexical effects such as word frequency, semmfic 
association, and so forth (see Cutler et al., 1987 for a 
summary of the expemnen~ literature and putative 
explmation of the effect), suggesting that information 
concemm 8 at least some of the phonetic contain of a 
word is not available until after the word is recoguised. 
Thus, people's ability to recognise phonemes tells us 
very little about the nann~ of the representation used to 
initiate lexical access. Better (but still indireoO evidence 
comes from mispronunciation monitoring and phoneme 
confusion experiments (Cole, 1973; Miller & Nicely, 
1955; Sheperd, 1972) which suggest that tlsteners eere 
likdy to confuse or ~ phonemes along the 
dimensions predicted by distinctive feature theory. Most 
e~rcn result in reporting phonemes which differ in only 
one feanu~ from the target, This result suggests that 
listenexs are actively considering detailed phonetic 
information along a munber of dimemions (rather than 
simply, say, manner of articulation). 
Theoretical and experimental considerations suggest 
then that, regardless of the current capabilities of 
automated acoustic-phonetic fxont-ends, sysmms must be 
developed to extract as phonetically detailed a pm-lexical 
phonological represemation as possible. Without such a 
representation, phonological processes cannot be 
effectively recoguL~i and compensated for in the word 
recognition process and the 'extra' information conveyed 
in stressed syllables cannot be exploited. Nevertheless in 
fluent connected speech, unstressed syllables often 
undergo phonological processes which render them 
highly indemmlinam; for example, the vowel reductions 
in (I). Therefore, it is implausible m assume that my 
(human or machine) front-end will always output an 
accurate narrow phonetic, phonemic of perhaps even 
broad (say, manner class) mmscription of the speech 
input. For this reason, fur~er processes involved in 
lexical access will need to function effectively despim 
the very variable quality of information extracted from 
the speech signal. 
This last point creates a serious difficulty for the 
design of effective phonological parsers. Church (1987), 
for example, allows himself the idealisation of an 
accurate 'nsrmw' phonetic transcription. It remains to be 
demonstramd that any parsing mclmiques developed for 
determlnam symbolic input will transfer effectively to 
real speech input (and such a test may have to await 
considerably better automated front-ends). For the 
purposes of the next section. I assume that some such 
account of phonological parsing can be developed and 
that the pre-lexical representation used to initiate lexical 
access is one in which phonological processes have been 
'undone' in order to consuuct a representation close to 
the canonical (phonemic) representation of a word's 
pronunciation. However, I do not assume that this 
representation will necessarily be accuram to the same 
degree of detail throughout the input. 
LEXICAL ACCESS STRATEGIES 
Any theory of word recognition must provide a 
mechanism for the segmentation of connected speech 
into words. In effect, the theory must explain how the 
process of lexical access is triggered at appropriate 
points in the speech signal in the absence of completely 
reliable phonetic/phonological cues to word boundaries. 
The various theories of lexical access and word 
recognition in conneomd speech propose mechanisms 
which appear to cover the full specumm of logical 
possibilities. Klan (1979) suggests that lexicai access is 
triggered off each successive spectral frame derived from 
the signal (i.e. approximately every 5 msecs.), 
McClelland & Elman (1986) suggest each successive 
phoneme, Church (1987) suggests each syllable onset, 
Grosjean & Gee (1987) suggest each stressed syllable 
onset, aud Curler & Norris (1985) suggest each 
pmsodiceliy smmg syllable onset. Finally, Maralan- 
Wilson & Welsh (1978) suggest that segmentation of the 
speech input and recognition of word boundaries is an 
indivisible process in which the endpoint of the previous 
word defines the point at which lexical access is 
Iriggered again. 
Some of these access strategies have been evaluated 
with respect to three input transcriptions (which are 
plausible candidates for the pre-lexical represen~uion on 
the basis of the work discussed in the previous section) 
in the context of a realistic sized lexicon. The 
experiment involved one sentence taken from a reading 
of the 'Rainbow passage' which had been analysed by 
several phoneticians for independent purposes. This 
sentence is reproduced in (2a) with the syllables which 
were judged to be strong by the phoneticians underlined. 
(2) 
a) The rainbow is a divis.._ion of whim light into 
many beautiful col.__ours 
b) WF-V reln bEu V-SF V S-V vI SF-V-N V-SF 
walt Idt V-N S-V men V bju: S-V WF-V-G K^I 
V-SF 
86 
This utterance was transcribed: 1) fine class, using 
phonemic U-ensoription throughout; 2) mid class, using 
phonemic transcription of strong syllables and a six- 
category intoner of articulation tranm'ipdon of weak 
syllables; 3) broad class, as mid class but suppressing 
voicing disK, ations in the strong syllable transcriptions. 
(2b) gives the mid class transcription of the utterance. In 
this transcription, phonemes are represented in a manner 
compatible with the scheme employed in the Longman 
Dictionary of Contonporary English and the manner 
class categories in capitals are Stop, Strong-Fricative, 
Weak-Fricative, Nasal, Glide-liquid, end Vowel as in 
Hunmlocher (1982) end elsewhe=e. The terms, fine, mid 
end broad, for each transcription scheme are intended 
purely descriptively and are not necessarily related to 
other uses of these terms in the literature. Each of the 
schemes is intended to represent a possible behaviour of 
an acoustic-phonetic front-end. The less determinate 
transoriptions can be viewed either as the result of 
transcription errors and indatermlnacies or as the output 
of a less ambitious front-end design. The definition of 
syllable boundary employed is, of necessity, that built 
into the syllable parser which acts as the interface to the 
dictionary d~t-_bese (e.g. Carter, 1989). The parser 
syllabifies phonemic Iranscriptions according to the 
phonotactiz constraints given in Ghnson (1980) emd 
utilis~ the maximal onset principle (Selkirk, 1978) 
where this leads to ambiguity. 
Each of the three transcriptions was used as a 
putative pre-lexical representation to test some of the 
different access slrategies, which were used to initiate 
lexieal look-up into the dictionary database. The four 
access strategies which were tested were: 1) phoneme, 
using each mr..eessive phoneme to trigger an access 
amnnp~ 2) word. using the offset of the previous 
(correct) word in the input to control access attempts; 3) 
syllable, attempting look-up at each syllable boundary; 4) 
strong syllable, attemptin 8 look-up at earh strong 
syllable boundary. That is, the first smuegy assumes a 
word may begin at any p*'umeme boendary, the second 
that a word may only begin, at tlm end of the previous 
one, the third that a word may begin at any syllable 
boundary, end the fourth that a word may begin at a 
seron 8 syllable boundary. 
The strong syllable strategy uses a separate look-up 
process for typically urmtreimad grammatical, clor, ad-clus 
vocabulary end allows the possibility of extending look- 
up 'backwards' over one preceding weak syllable. It was 
assumed, for the purposes of the experiment, that look- 
up off weak syllables would be restricted to closed-class 
vocabulary, would not extend into a strong syllable, and 
that this process would precede attempts to incorporate a 
weak syllable *backwards' into an open-class word. 
The direct access approach was not considered 
because of its implausibility in the light of the discussion 
in the previous section. The stressed syllable account is 
v=y slmilar to the strong syllable approach, but given 
the problem of stress shift in fluent speech, a formulation 
in unms of strong syllables, which are defined in terms 
of the absence of vowel reduction, is preferable. 
Work by Marslen-Wilson and his colleagues (e.g. 
Marslen-Wilson & Warren. 1987) suggests that, whatever 
access strategy is used, there is no delay in the 
availability of information derived fi'om the speech signal 
to furth= select from the cohort of word candidates. This 
suggests that s model in which units (say syllables) of 
the pre-lexical representation are 'pre-packaged' and then 
used to wlgser a look-up attempt are implausible. Rathe~ 
the look-up process must involve the continuous 
integration of information from the pre-lexical 
representation immediately it becomes available. Thus 
the question of access strategy concerns only the points 
at which this look-up process is initiated. 
In order to simulate the continuous aspect of lexlcel 
access using the dictionary database, d~:__M3_ase look-up 
queries for each strategy were initiated using the two 
phonemes/segments Horn the trigger point and then again 
with three phonemes/segmonts and so on until no hu~er 
English words in the database were compatible with the 
look-up query (except for closed-class access with the 
strong syllable strategy where a strong syllable boundary 
terminated the sequence of accesses). The size of the 
resulting cohorts was measured for each successively 
larger query;, for example, using a fine class transcription 
and triggering access from the /r/ of rainbow yields an 
initial cohort of 89 cmdidams compatible with/re//. This 
cohort drops to 12 words when /n/ is added and to 1 
word when /b/ is also included and finally goes to 0 
when the vowel of/s is -dO,'d= Each sequence of queries 
of this type which all begin at the same point in the 
signal will be refened to as an access path. The 
differ, tee between the access strategies is mostly in the 
number of distinct access paths they generate. 
Simulating access attempts using the dictionary 
d~tnbasc involves generating database queries consisting 
of partial phonological representatious which return sere 
of words and enlries which satisfy the query. For 
example, Figure 1 relxesents the query corresponding to 
the complete broad-class trenscription of appoint. This 
qu=y matches 37 word forms in the database. 
\[ \[pron 
\[nsylls 2 \] 
\[el 
\[peak ?\] 
\[-.2 
\[etreee 2\] 
\[onzet (OR b d g k p t)\] 
\[peak ?\] 
\[coda (OR m n N) 
(OR b d g k p t)\]\]\]\] 
Figure 1 - Da'-bue query for 'aR?omt'. 
The ex~riment involved 8enera~8 s~uen~ of 
queries of this type and recording the number of words 
found in the database which matched each query. Figure 
2 shows the partial word lattice for the mid class 
trauscription of th, e ra/nbow /s. using the strong syllable 
access strategy. In this lattice access paths involving 
r~o'~sively larger portions of the signal are illustrated. 
The m=nber under each access attempt represents the 
size of the set of words whose phonology is compatible 
87 
with the query. Lines preceded by an arrow indicate a 
query which forms part of an access path, adding a 
further segment to the query above it. 
Th o 
14 
r ai n b ow i s a 
---I ---I --I -I 89 59 5 8 
" 
>-I >---I 
12 3 
>---I >-I 
1 o 
>--I I 
1 0 
>---I 
o 
Fisum 2 - Partial Word Lmi¢~ 
The corresponding complete word lattice for the 
same portion of input using a mid-class tr~cription and 
the strong syllable strategy is shown in Figure 3. In this 
lattice, only words whose complete phonology is 
compatible with the input are shown. 
Th e r ai n b ow i s a 
I--I I--I I--I I-I I 
14 1 2 5 8 
I .... I 
3 
I I 
Ir~re 3 - Complete Word 
The different strategies ware evaluated relative to the 
3 trensc6ption schemes by summing the total number of 
partial words matched for the test scmtence under each 
strategy and trans=ipdon and also by looking at the total 
number of complete words matched. 
RESULTS 
Table 1 below gives a selection of the more 
important results for each strategy by transcription 
scheme for the test umtence in (2). Column 1 shows the 
total number of access paths initiated for the test 
sentence under each strategy. Columns 2 to 6 shows the 
number of words in all the cohorts produced by the 
particular access strategy for the test sentence after 2 to 
6 phonemes/segments of the transcription have been 
incorporated into each access path. Column 7 shows the 
total number of words which achieve a complete match 
during the application of the particular access strategy to 
the test sentence. 
Table 1 provides m index of the efficiency of each 
access strategy in terms of the overall number of 
candidate words which appear in cohorts and also the 
overall number of words which receive a full match for 
the test sentence. In addition, the relative performance of 
each strategy as the ~ption scheme becomes less 
determinate is clear. 
The test sentence contains 12 words, 20 syllables, 
end 45 phonemes; for the purposes of this experiment 
the word a in the test sentence does not trigger a look- 
up attempt with the word strategy because cohort sizes 
were only recorded for sequences of two or more 
phonemes/segments. Assuming a fine class trmls=iption 
serving as lxe-lexical input, the phoneme strategy 
produces 41 full matches as compared to 20 for the 
strong syllable strategy. This demonstrates that the strong 
syllable strategy is more effective at ruling out spurious 
word candidates for the test sentence. Furthermore, the 
total number of candidates considered using the phoneme 
strategy is 1544 (after 2 phonemes/segments) but only 
720 for the strong syllable strategy, again indicafng the 
greater effectiveness of the lanef strategy. When we 
A _c¢~___- Access 
Strategy Paths 
Fine Class 
Phoneme 45 
Word 11 
Syllable 20 
StrongS 17 
Mld Class 
Word 11 
Synable 20 
StrongS 17 
Broad Class 
Syllable 20 
$trongS 17 
No. of words after x segments: 
2 3 4 
1544 251 46 
719 193 32 
1090 210 36 
720 105 24 
4701 1738 802 54 
12995 3221 1530 103 
760 232 89 13 
13744 3407 1591 140 
1170 228 100 18 
Table I 
Complete 
5 6 Matches 
6 2 41 
5 2 25 
6 2 28 
5 2 20 
8 249 
9 380 
4 80 
9 117 
88 
consider the less determinate tran.scriptlons it becomes 
even clearer that only the strung syllable slrategy 
remains reasonably effective and does not result in a 
ma~ive increase in the rmmber of spurious candidates 
accessed and fully matched. (The phonmne strategy 
resets are not reporud for mid end broad class 
tramcrlptlons because the cohort sizes were too large for 
the database query facilities to cope reliably.) 
The word candidates recovered using the phoneme 
strategy with a fine class transcription include 10 full 
matches resulting from accesses triggered at non-syllabic 
boundaries; for example arraign is found using the 
second phoneme of the and rain. This problem becomes 
considerably worse when moving to a less determinate 
transcription, illustrating very clearly the undesirable 
consequences of ignoring the basic linguistio constraint 
that word boundaries occur at syllable boundaries. 
Systems such as TRACE (McClelland & Elman. 1986) 
which use this strategy appear to compensate by using a 
global best-fit evaluation metric for the entire utterance 
which s~rongly disfavours 'unattached' input. However. 
these models still make the implausible claim that 
candid~_!e~ llke arraign will be highly-activated by the 
speech input. 
The results concerning the word based strategy 
presume that it is possible to determinately recognise the 
endpuint of the preceding word. This essmnption is 
based on the Cohort theory claim (e.g. Marslan-Wilsun 
& Welsh, 1978) that words can be recogulsed before 
their acoustic offset, using syntactic and semantic 
expectations to filter the cohort. This claim has been 
challenged experimentally by Grosjean (1985) and Bard 
et al. (1988) who demcmstrate that many monosyllabic 
words in context are not recognised until after their 
acoustic offset. The experiment reported here supports 
this expesimental result because even with the fine class 
transcription there are 5 word candM~t_~ which extend 
beyond the correct word boundary end 11 full matches 
which end before the correct boundary. With the mid 
clam tran.un'iption, ~e~ numbers rise to 849 end 57. 
respectively. It seems implausible that expectation-based 
corm~ainm could be powerful enough to correcdy select 
a unique candidate before its acoustic offset in all 
contexts. Therefore, the results for the word strategy 
reported here are overly-optim.isdc, because in order to 
guarantee that the correct sequence of words are in the 
cohorts recovered from the input, a lexical access system 
based on the word strategy would need to operate non- 
demrministically; that is, it would need to consider 
several pumndal word boundaries in most cases. 
Therefore, the results for a practicM syr.em based on Otis 
approach am likely to be significantly worse. 
The syllable strategy is effective under the 
assumption of • determinate and accurate phonemic pre- 
lexieal representation, but once we abandon this 
idealisation, the effectiveness of this strategy declines 
~trply. Under the plaus~le assumption that the pre- 
lexical input reprmemation is likely to be least 
accurate/deanminate for tmslressed/weak syllables, the 
sw~ng syllable strategy is far more robust. This is a 
direct consequence of triggering look-up attempts off the 
more determinate parts of the pre-lexical representation. 
Further theoretical evidence in support of the strong 
syllable strategy is provided by Cutler & Carter (1987) 
who demmmtrate that a listener is six times more likely 
to e~mter a word with a prosodically strong initial 
syllable than one with a weak initial syllable when 
listening to English speech. Experimental evidence is 
provided by Cutler & Norris (1988) who report results 
which suggest that listeners tend to treat strong, but not 
weak, syllables as appropriate points at which to 
undertake pre-lexical segmentation of the speech input. 
The architecture of a lexical access system based on 
the syllable strategy can be quite simple in terms of the 
organisation of the lexicon and its access routines. It is 
only n~essary to index the lexicon by syllable types 
(Church, 1987). By contrast, the strong syllable strategy 
requires a separate closed.class word lexicon end access 
system, indexing of the open-class vocabulary by strong 
syllable and a more complex matching procedure capable 
of inhering preceding weak syllables for words such 
as d/v/s/on. Nevertheless, the experimental results 
reported here suggest that the extra complexity is 
warranted because the resulting system will be 
considerably more robust in the face of inacct~rate or 
indeterminate input concerning the nature of the weak 
syllables in the input utterance. 
CONCLUSION 
The experiment reported above suggests that the 
strong syllable access strategy will provide the most 
effective technique for producing minimal cohorts 
gu~anteed to contain the correct word candidate from a 
pre-lexical phonological representation which may be 
partly inaccurate or indeterminate. Further work to be 
undertaken includes the rerunning of the experiment with 
further input transcriptions containing pseudo-random 
typical phoneme perception errors and the inclusion of 
further test sentences designed to yield a 'phonetically- 
balanced' corpus. In addition, the relative internal 
dlscriminability (in tmmm of further phonological and 
'higher-lever syntactic and semantic constraims) of the 
word candidates in the varying cohorts generated with 
the different strategies should be exandned. 
The importance of mai~ng use of a dictionary 
database with a realistic vocabulary size in order to 
evaluate proposals concerning lexlcal access and word 
recognition systems is hlghligh~d by the results of this 
experiment, which demonstrate the theoretical 
implausibility of many of the proposals in the literature 
whea we consider the consequences in a simulation 
involving more than a few hundred illustrative words. 
89 
ACKNOWLEDGEMENTS 
I would like to thank Longman Group Ltd. for 
making the typesetting tape of the Longmcat Dictionary 
of Contemporary English available to m for research 
purposes. Part of the work reported here was supported 
by SERC gram GR/D/4217. I also thank Anne Cuder, 
Francis Nolan and Tun Sholicar for useful comments and 
advice. All erroPs remain my own. 
REFERENCES 
Bard, E., Shillcock, R. & Altmann, G. (1988). The 
recognition of words after their acoustic offsets in 
spontaneous speech: effects of subsequent context. 
Perception & Psychophysic$, 44, 395-408. 
Boguraev, B. & Briscoe, E. (1989). Computational 
Lexicography for Natural Language Processing. 
Longman Limited, London. 
Boguraev, B., Carter, D. & Briscoe, E. (1987). A multi- 
purpose interface to an on-line dictionary. 3rd 
Conference of Eur. Assoc. for Computational Linguistics, 
Copenhagen. 
Bradley, D. & Forster, K. (1987). A reader's view of 
listeffmg. Cognition, 25, 103-34. 
Carter, D. (1987). An information-theoretic analysis of 
phonetic dictionary access. Computer Speech and 
Language, 2, 1-11. 
Carter, D., Boguraev, B. & BrL~oe, E. (1987). Lexical 
sUess and phonzfiz information: which szSments are 
most informative. Proc. of £ur. Conference on Speech 
Technology, Edinhoxgh. 
Carter, D. (1989). LIX)CE and speech recognition. In 
Boguraev & Briscoo (1989) pp. 135-52. 
Church, K. (1987). Phonological parsing and lexical 
muievaL Cognition, 25, 53-69. 
Cole, R. (1973). Listening for mispronunciations: a 
measure of what we hear during speech. Perception & 
Psychophysic~, 1, 153-6. 
Cutler, A. & Carter, D. (1987). The Ira:dominance of 
smm 8 initial syllables in the English vocabulary. Computer 
Speech and Language, 2, 133-42. 
Cuder, A., Mehler, J., Norris, D. & Segui, J. (1987). 
Phoneme identification and the lexicon. Cogni:ive 
Psychology, 19, 141-77. 
Cuder, A. & Norris. D. (1988). The role of slxong 
syllables in segmentation for lexical access. J. of 
Experimental Psychology: Human Perception and 
Performance, 14, 113-21. 
Frazier, L. (1987). Slrucmre in auditory word 
recognition. Cognition, 25, 15%87. 
Gimson, A. (1980). An Introduction to the Pronunciation 
of English. 3rd F.~tion, Edw~l Arnold, London. 
Gmsjean, F. & Gee, L (1987). Prosodic su-ucmre and 
spoken word recognition. Cognition, 25, 135-155. 
Harrington, J., Watson, G. & Cooper, M. (1988). Word 
hound~y identification from phoneme sequence 
~mtraims in automatic c~dnuons speech recognition. 
Proc. of 12th Int. Co~. on Computational Linguistics, 
Budapest, pp. 225-30. 
Huttanlocher, D. (1985). Exploiting sequential phonetic 
constraints in recognizing spoken words. MIT. AI. Lab. 
Memo 867. 
Klatt, D. (1979). Speech perceptiom a model of acoustic- 
phonetic analysis and lexical access. Journal of 
Pho~t/es, 7, 279-312. 
Maralen-WiLson, M. (1987). Functional parallelism in 
spoken word recognition. Cognition, 25, 71-i02. 
Marden-WiLson, W. & Warren, P. (1987). Continuous 
uptake of acoustic cues in spoken word recognition. 
Perception & Psychophy$ics, 41, 262-75. 
Marslen-Wilson, W. & WeLsh, A. (1978). Processing 
interactions and lexical access during word recognition in 
continuous speech. Cognitive Psychology, 10, 29-63. 
Mcclelland, J. & Elman, I. (1986). The TRACE model 
of speech perception. Cognitive Psychology, 18, 1-86. 
Miller. G. & Nicely, P. (1955). Analysis of some 
perceptual confusions among some English consonants. 
Journal of Acoustical Society of America, 27, 338-52. 
Sakoe, H. & Chiba, S. (1971). A dynatrdc programming 
optimization for spoken word recognition. IEEE 
Transactions, Acoustics, Speech and Signal Processing, 
ASSP-26, 43-49. 
Selkirk, E. (1978). On prosodic structure and its relation 
to syntactic su'ucmre. Indiana University Linguistics 
Club, Bloomington, Indiana. 
Sheperd, R. (1972) Psychological representation of 
speech sounds. In David, E. & Denes, P. Human 
Communication: A Unified View, New York: McGraw- 
Hill 
Shipman, D. & Zue, V. (1982). Properties of large 
lexicons: implications for advanced isolated word 
reco~don systan~. IEEE ICASSP, Paris, 546-549. 
Wiese, R. (1986). The role of phonology in speech processing. 
Proc. of llth Int. Conf. on Computational 
Linguistics, Bonn, pp. 608-11. 
WiLson. M. (1988). MRC psycholinguisfic database: 
machine-usable dictionary, version 2.0 Behaviour 
Research Methods, Instrumentation & Computers, 20, 
6-10. 
Zue, V. & Huttenlocher, D. (1983). Computer 
recognition of isolated words from large vocabularies. 
IEEE Conference on Trends and Applications. 
9O 
