Tagging Spoken Language Using Written Language Statistics 
Joakim Nivre Leif Gr6nqvist Malin Gustafsson TorbjSrn Lager Sylvana Sofkova 
Dept. of Linguistics 
G5teborg University 
S-41298 GSteborg 
Sweden 
{j oakim, leifg, malin, lager, sylvana}@ling, gu. se 
Abstract 
This paper reports on two experiments 
with a probabilistic part-of-speech tag- 
ger, trained on a tagged corpus of writ- 
ten Swedish, being used to tag a corpus 
of (transcribed) spoken Swedish. The re- 
sults indicate that with very little adap- 
tations an accuracy rate of 85% can 
be achieved, with an accuracy rate for 
known words of 90%. In addition, two 
different treatments of pauses were ex- 
plored but with no significant gain in ac- 
curacy under either condition. 
1 Introduction 
What happens when we take a probabilistic part- 
of-speech tagger trained on written language and 
try to use it on spoken language transcriptions? 
The answer to this question is interesting from 
several points of view, some more practical and 
some more theoretically oriented. From a practi- 
cal point of view, it, is interesting to know how well 
a written language tagger can perform on spoken 
language, because it may save us a lot of work if 
we can reuse existing taggers instead of develop- 
ing new ones for spoken language. Front a more 
theoretical point of view, the results of such an 
experiment may tell us something about the ways 
in which the strncture of spoken language is dif- 
ferent (or not so different;) from that of written 
language. 
In this paper, we report on experimental work 
dealing with the part-of-speech tagging of a corpus 
of (transcribed) spoken Swedish. The tagger used 
implements a standard probabilistic biclass model 
(see, e. g., (DeRose 1988)) trained on a tagged 
subset of the Stockhohn-Ume£ Corpus of written 
Swedish (Ejerhed et al 1992). Given that the tran- 
scriptions contain many modifications of standard 
orthography (in order to capture spoken language 
variants, reductions, etc.) a special lexicon had 
to be developed to map spoken langnage variants 
onto their canonical written language forms. In 
addition, a special tokenizer had to be developed 
to handle "recta-symbols" in the transcriptions, 
such as markers for pauses, overlapping speech, 
inaudible speech, etc. One of the interesting is- 
sues in this context is what use (if any) should be 
made of information about panses, interruptions, 
etc. In the experiment reported here, we com- 
pare two different treatments of pauses and evalu- 
ate the performance of the tagger under these two 
different conditions. 
2 Background 
2.1 Probabilistie Part-of-speech Tagging 
The problem of (automatically) assigning parts of 
speech to words in context has received a lot of 
attention within computational corpus linguistics. 
A variety of diffexent methods have been investi- 
gated, most of which fall into two broad classes: 
• Probabilistic methods, e. g. (DeRose 1988; 
Cutting et al 1992; Merialdo 1994). 
• Rule-based methods, e. g. (Brodda 1982; 
Karlsson 1990; Koskennienfi 1990; Brill 
1992). 
Probabilistic taggers have typically been imple- 
mented as hidden Markov models, using proha- 
bilistic models with two kinds of' basic probabili- 
ties: 
• The lexical probability of seeing the word w 
given the part-of-speech t: P(w I t). 
• The contextual pwbability of seeing the 
part-of-speech ti given the context of n - 1 
parts-of-speech: P(ti I ti-(,~-,),...,ti 1). 
Models of this kind are usually referred to as n- 
class models, the most common instances of which 
are the biclass (n = 2) and triclass (n = 3) models. 
The lexical and contextual probabilities of an n- 
class tagger are usually estimated using one of two 
methods: ~ 
1The terms 'RF training' and 'ML training' are 
taken from Merialdo 1994. It should be pointed out, 
though, that the use of relative frequencies to esti- 
mate occurrence probabilities is also a case of maxi- 
mmn likelihood estimation (MLE). 
1078 
• Relative l,Yequency (RF) training: Given a 
tagged training corpus, the i)rohabilities (:an 
be estimated with relative frequencies. 
• Maxinnun Likelihood (ML) training: Given 
an untagged training corpus, the probabilities 
can be estimated using the Bauin-Welch algo- 
rithm (also known as tile Forward-Backward 
algorithin) (Baron 1972). 
Of these two methods, R.F training seelns to give 
better estilnations while t)eing more labor inten- 
sive (Merialdo 1994). With proper training, r> 
class taggers typically readt all accuracy rate of 
about 95% \['or English texts (Charniak 1993), and 
similar results have been reported for other lan- 
guages such as lh'ench and Swedish (Chanod & 
Tapanainen 1995; Brants & Samuelsson 1995). 
2.2 Tagging Spoken Language 
Spoken language transcrit)tions are essentially a 
Mud of text, and can therefore be tagged with 
the methods used for otller kinds of text,. IIow- 
ever, sin(:(; t, he transcription of spoken language 
is a fairly labor-intensive tasks, the availability 
of suitable training corpora is much more limitexl 
than for ordinary written texts. One way to cir- 
cuinvent this problem is to use taggers trained on 
written texts to tag spoken language also. This 
has apparently been done successflllly for the spo- 
ken language part of the British National Corpus, 
using the CLAWS tagger (Garsi(te). 
However, the application of writte, n language 
taggers to spol(en language is not entirely unprob- 
lematic. First of all, spoken language transcrip- 
tions are typically produced ill a different format 
and with different conventions than ordinary writ- 
ten texts. For example, a transcription is likely to 
contain markers tbr pauses, (aspects of) t)rosody, 
overlapping speech, etc. Moreover, they do not 
usually contain the pun(:tuation marks found in 
ordinary texts. This means that the application of 
a written language tagger to spoken language min- 
imally requires a special tokenizer, i. e., a prepro- 
cessor segmenting the text into appropriate coding 
units (words). 
A second type of ditficulty arises from tile fact 
that spoken language is otten transcribed using 
non-standard orthograI)hy. Even if no phonetic 
t;ranscrit)tion is used, most transcription eonven- 
lions support the use of modified orthography to 
capture typical features of st)oken language (such 
as gem instead of going, kinda instead of kind 
of, etc.). Thus, the application of a written lan- 
guage tagger to spoken language typically requires 
a special lexicon, mapt)ing spoken language vari- 
ants onto their canonical written language forms, 
in addition to a special tokenizer. 
The problems considered so far may be seen 
as problems of a practical nature, but there is 
also a more filndmnentat problem with tile use 
of written language statistics to analyze spoken 
language, namely that the probability estimates 
derived from written language may not be rcp- 
resentative for spoken language. In the extreme 
case, some st)oken language phenomena (such as 
hesitation markers) Inay l)e (nearly) non-existent; 
in written language. But even for words and collo- 
cations that occur both ill written and ill spoken 
language, t;he occurrence probabilities may vary 
greatly between tile two media. How riffs affects 
the performance of taggers and what methods can 
be use(l to over(;olne or circunlvent tile I)rol)lems 
m-e issues that, surprisingly, do not seem to have 
t)een discussed in the literature at all. The I)resent 
paper can be seen as a first attempt to ext)lore this 
area. 
2.3 Tagging Swedish 
As far as we know, the methods for mltomatic 
part-of-speech tagging have not before been ap- 
plied (;o (transcribed) spoken Swedish. For writ- 
ten Swedish, there are a few tagge, d corpora avail- 
at)le, such as the Teleman tort)us (see, e. g., 
(Brants &, Samuelsson 1995)) and the Stockholnl- 
Urneh Corpus (Ejerhed et al 1992). A subpart of 
tim latter has been used as training dal;a in the 
experiments reported t)elow. 
3 Method 
3.1 The Tagger 
The tagger used fl)r tile experiments is a standard 
ItMM tagger using tile Viterbi algorithm to calcu- 
late the most probable sequence of parts-ofspee(:h 
for each string of words actor(ling to the following 
prol)al)ilistic t)iclass modeh 
(1) l'(,,,1,...,,,,,~,t,,...,~,,,) = 
P(t,)P(w, Itt)II'¢~ 2 P(til t/--1)I)(WJ I*,~) 
The tagger is coupled with a tokenizer that seg- 
ments a transcription into utterances (strings of 
words), that are fed to the tagger one by one. Be- 
sides ordinary words, the utterances may also con- 
tain markers for pauses and inaudibh: stretches of 
speech. ~ 
3.2 Training the Tagger 
Tile lexical and contextual probabilities were esti- 
mated with relative frequencies ill a tagged corpus 
of written Swedish, a subpart of the Stockholm- 
Ume'£ Cortms (SUC) containing 122,377 word to- 
kens (1.8,343 word types). Tile tagset included 27 
parts-of-speech. 3 
2Tile original transcriptions also contain inibrma- 
tion about overlapping speech, marking of certain as- 
pects of prosody, and various colninmlts. This infor- 
mation is currently disregarded by the tokenizer. 
3For a lnore detailed description of the linguistic 
annotation system of the Stockhohn-Ume£ Cort)us, 
see (Ejerhed et al 1992). 
1079 
3.3 The Spoken Language Lexicon 
As noted earlier, the spoken language transcrip- 
tions contain many deviations fl'om standard or- 
thography. Therefore, in order to inake optimal 
use of tile written language statistics, a special 
lexicon is required to map spoken language vari- 
ants onto their canonical written forms. For the 
present experiments we have developed a lexicon 
covering 2113 spoken language variants (which are 
mapped onto 1764 written language forms). We 
know, however, that this lexicon has less than to- 
tal coverage and that many regular spoken lan- 
guage reductions are not currently covered. 4 
3.4 Unknown Words and Collocations 
The occurrence of "unknown words", i. e., words 
not occurring in the training corpus, is a notorious 
problem in (probabilistic) part-of-speech tagging. 
In our case, this problem is even more serious, 
since we know beforehand that some words will 
be treated as unknown although they do in fact 
occur in the training corpus (because of deviations 
Dom standard orthography). In the experiments 
reported below, we have allowed unknown words 
to belong to any part-of-speech (which is possible 
in the given context), but with different weight- 
ings for different parts-of-speech. More precisely, 
when a word cannot be found in the lexicon, we 
replace the product in (2) (cf. equation 1 above) 
with the product in (3), where TTR(ti) is the type- 
token ratio of ti (in the training corpus). 
(2) p(t  I I td 
(3) P(t{ I t{_l)P(ti)TTR(t{) 
In this way, we favor parts-of-speech with high 
probability and high type-token ratio. In practice, 
this favors open classes (such as nouns, verbs, ad- 
jectives) over closed classes (determiners, conjunc- 
tions, etc.), and more frequent ones (e. g., nouns) 
over less frequent ones (e. g., adjectives). 
In addition to "unknown words", we have to 
deal with "unknown collocations", i. e., biclasses 
that do not occur in the training data. If these bi- 
classes are simply assigned zero probability, then 
in tile extreme case a word which is in the 
lexicon may fail to get a tag because the contex- 
tual probabilities of all its known parts-of-speech 
are zero in the given context. In order to prevent 
this, we use the following formula to assign con- 
textual probabilities to unknown collocations: 
(4) P(ti l t{_l) = P(ti)K 
The constant K is chosen in such a way that tile 
contextual probabilities defined by equation (4) 
are significantly lower than the "real" contextual 
probabilities derived from the training corpus, so 
4A common example is the ending -igt, which ap- 
pears in many adjectives (neuter singular) and adverbs 
and which is usually reduced to -it in ordinary speech. 
that they only come into play when no known col- 
location is possible. 
3.5 Pauses and Inaudible Speech 
As indicated earlier, the utterances to be tagged 
included markers for pauses and inaudible speech, 
since these were thought to contain information 
relevant for tile tagging process. The symbol for 
inaudible (and therethre untranscribed) speech 
(...) -was simply added to the lexicon and 
assigned the "t)art-of-speech" major delimiter 
(mad), which is the category assigned to full stops, 
etc. in written texts. The result is that the tag- 
ger will not treat the last, word before tile untran- 
scribed passage as immediate context for tile first 
word after tile passage. 
For pintoes we have experimented with two dif- 
ferent treatments, which are compared below. We 
refer to these different treatments as tagging con- 
dition 1 and 2, respectively: 
• Condition 1: Pauses are simply ignored in tile 
tagging process, which means that the last 
word before a pause is treated as immediate 
context for the first word after the pause. 
• Condition 2: Pause symbols are added to the 
lexicon, where short pauses are categorized 
as minor delimiters (mid) (commas, etc.), 
while long pauses are categorized as mad (fllll 
stops, etc.), which means that the contextual 
probabilities of words occurring before and 
after pauses in spoken language will be mod- 
elled on the probabilities of words occurring 
before and after certain punctuation marks in 
written language. 
It was hypothesized that, in certain cases, the tag- 
ger might perform better under condition 2, since 
pauses in spoken language often though by no 
means always indicate major phrase boundaries 
or even breaks in the grammatical structure. 
3.6 Test Corpus 
The test corpus was composed of a set of 47 ut- 
terances, chosen randomly from a corpus of tran- 
scribed spoken Swedish containing 267,206 words. 
The utterance length varied from 1 word to 688 
words (not counting pauses as words), with a 
mean length of 29 words. The test corpus con- 
tained 1360 word tokens and 498 word types. 
4 Results 
The number of correctly tagged word tokens under 
condition 1 was 1153 out of a total of 1360, i. e., 
84.8°./o. The results for condition 2 were slightly 
better: 1248/1457 = 85.7%. However, the latter 
figures also include the tagged imuses, for which 
only one category was possible. If these tokens 
are subtracted, the results for condition 2 are: 
1151/1360 = 84.6%. 
1080 
5 Discussion 
The overall ac(:uracy rate for the I,agger is al'Omld 
85%, which is not too imi)ressive wh(m (:oInl)m'e(| 
to the results reporte, d for writt;cn laitguage. How- 
ever, if we take a closer look at the results, it; seems 
that an imt)ortant source of error is the lack of cov- 
erage of the, lexicon m,t the training corpus. Of 
the |;we lmndred or so errors made 1)y the tagger, 
more than eighty con(:ern tokens that could not 
be matched with any word form occurring in the 
training corpus. The most; common tyt)e of error 
in this class is that a word is (~rroneously tagge, d 
as a noun. \[t is likely that this is an artifact of 
the way we assign lexical prol)abilities to unknown 
words and that a more Sol)histi(:ated method may 
lint)rove the results for this class of words. More 
importantly, though, if we only (:oilsi(ler the re- 
suits for words that were known to the tagger, the 
accuracy rate goes up to about 90%, mid most of 
the errors relnailfii~g concern classes that are noto- 
riously difficult even un(ter norlnal cir(:umstmLces, 
such as adverbs vs verb particles and prepositions 
vs sut)ordinating conjunctions. Taken togedmr, 
these results seen~ to indicate that with a more 
e.xtensive lexicon, a larger training corpus of writ- 
ten language, and l)erhat)s a more sot)histi(:ated 
treatment of mtknown words, it should |)e possi- 
ble to el)Cain results al)proa<',hing those, ()I>taine<l 
for written language. 
As regards the two treatments ()\[' \[)allses, the 
results are virtually identi(:al in terms of overall 
accuracy rate. If we look at individual words, 
however, we find that the part-of-st)eech assign- 
illellt differs in 25 cases, hi 10 of these (:ases, 
the corrc(:t part-of-st)eech is assigned under con- 
dition 1; in 9 cases, the corre, ct ttLg is tbund under 
(:ondition 2; ittl(t in 6 cases, l)oth conditions yield 
an incorrect assignlnent. The conclusion to draw 
from the.se results is i)robably that the. tre&tmcnt 
of pauses as delimiters yields it t)etter analysis in 
cases where the pause, marks an interruption or 
major phrase t)omldary, while it is better t() ig- 
nore pauses when they do iloi-, mark any break in 
grmnlnatical structure. Unfortunately, these two 
tyl)eS of t)auses seem to 1)e equally (:ommon, whi(:h 
means that neither treatment results in any gain 
in overall accuracy. However, preliminary obser- 
vations seem to in(ticate thai, it may be possible 
to get better results if a more line-grained analysis 
o\[" t)ause length is taken into account. This pre-- 
supposes, of course, that lifts kind of informal;ion 
is available in the transcriptions. 
6 Conclusion 
in this I)aper we, have ret)orted on an experiinent 
using a probabilistic part-of-speech tagger trained 
on written language to analyze (transcril)ed) spo- 
ken language. The results indicate that, with little 
or no adaptations, an overall accuracy rate of 85o/o 
c:ml 1)e a,chio, vcd, with ~1,i1 itC(;llFO, cy r;~te of 90% \['()r 
known words. ()n the negative side, we, found that 
the treatment of pauses as delimiters (a,s ot)t)osed 
to siml)ly ignoring them) did not result in a 1)ctlx!r 
performance, of the tagger. 
References 
MeriMdo, B. (71995) Tagging English Text with a 
I'robabilistic Model. Uomp'utatio'nal Linguistics 
20, t55 \[71. 
lbmm, L. E. (1972) An \[no, quality and Associated 
Maximizm;iou Technique, in Statistical \[';sLima-- 
lion for Probabilistic lqm(:tions of it Markov 
Process. l;ncqualil, ies 3~ \] 8. 
Brants, T. & Samuelsson, C. (1995) Tagging the 
Teleman Corl)uS. In I'rocccdin9 s of the lOI, h 
Nordic Conference of Computational Lin quis- 
tics, NODALLDA-95, flelsinki, 7 20. 
lbill, E. (1.992) A Simple Rule-based Part of 
Sl)eech Tagger. In 7'h, ird Conference of Applied 
Natural Lan.quage l~roccssing, ACL. 
lbo(hla, 13. (11982) Problems with T~l,gging and a 
Solution. Nordic .lowrnal of Lin q'n, istics 5, 93 
I 16. 
Chanod, 3.-P. & Tapanahmn, P. (19!)5)Tag- 
zing l,'rench Coinparing a Statistical and 
a Constraint-t)ased Method. In Seventh Con- 
flerc'nce of the Europeo, n Ch, aptcr" of the Asso- 
ciation for Computational Lin.quistics, l)ublin, 
149 \]56. 
Charniak, E. (1!)93) Statistical Language Lea'rn.. 
rag. Cambridge, MA: MIT I'ress. 
Cutting, l)., Kupiec, J., Pedersen, ,I. 82 Sibun, 
P. (1992) A Practical Part-of-st)(!ech Tagger. In 
Th, ird Confc.,vncc on Applied Natural Language. 
Processing, ACL, 133 140. 
DeRose, S. ,1. (1988) Grammati(:al Cat(gory l)i- 
amt)iguation t)y Statisti(;al Optimization. Co'm,- 
putational lAnguistic.s 14, 31 39. 
i.\]jerhe(l, E., Kgllgren, G., Wennstedt, O. 
)~strSm, M. (I992) The Linguistic Annotation 
Syst(:m of (;he Stockholm-Ume/~, Corpus Project. 
Report 33. University of Ume£: Department of 
Linguistics. 
Garside, R., Using (\]LAWS to Annotate the 
lh'itish National Corpus. \[httt)://info.ox.ac.uk: 
80/bnc/garside_ alk:.html\]. 
Kartsson, F. (1990) Constraint Grmnmar as a Sys-- 
tern for Parsing l{,unning Text. In Procecdings of 
COLING-90, Helsinki, 168 173. 
Koskenniemi, K. (\[990)Finite-state Parsing an(l 
I)isambiguation. In Proceedings of CO I, ING- 
00, lIelsinki, 229 232. 
1081 
