Keyword Extraction using Term-Domain Interdependence for 
Dictation of Radio News 
Yoshimi Suzuki Fumiyo Fukumoto Yoshihiro Sekiguchi 
Dept. of Computer Science and Media Engineering 
Yamanashi University 
4-3-11 Takeda, Kofu 400 Japan 
{ysuzuki@suwa, fukumoto@skyo, sekiguti@saiko}.esi.yamanashi.ac.jp
Abstract 
In this paper, we propose a keyword extraction
method for dictation of radio news that spans
several domains. In our method, newspaper
articles that are automatically classified into
suitable domains are used to calculate feature
vectors. The feature vectors express term-domain
interdependence and are used to select a
suitable domain for each part of the radio news.
Keywords are then extracted using the selected
domain. The results of keyword extraction
experiments showed that our method is robust
and effective for dictation of radio news.
1 Introduction 
Recently, many speech recognition systems
have been designed for various tasks. However,
most of them are restricted to a single task, for
example, tourist information or ordering at a
hamburger shop. Speech recognition systems
that cover several domains are required for
some tasks, e.g. a closed caption system for TV
and a transcription system for public
proceedings. In order to recognize spoken
discourse that spans several domains, a speech
recognition system needs a large vocabulary.
It is therefore necessary to limit the word
search space using linguistic restrictions, e.g.
domain identification.
There have been many studies of domain
identification that use term weighting
(McDonough et al., 1994; Yokoi et al., 1997).
McDonough proposed a topic identification
method on the Switchboard corpus. He reported
that the result was best when the keyword
dictionary contained about 800 words. In his
setting, the discourses of the Switchboard
corpus are rather long and contain many
keywords. A short discourse, however, contains
few keywords. Yokoi also proposed a topic
identification method that uses co-occurrence
of words (Yokoi et al., 1997). He classified each
dictated sentence of news into 8 topics. In TV
or radio news, however, it is difficult to segment
each sentence automatically. Sekine proposed a
method for selecting a suitable sentence from
the sentences extracted by a speech recognition
system, using a statistical language model
(Sekine, 1996). However, if the statistical model
is used for extraction of sentence candidates,
we will obtain higher recognition accuracy.
Some initial studies on transcription of
broadcast news have been reported (Bakis et
al., 1997). However, some problems remain, e.g.
speaking styles and domain identification.
We previously conducted a domain
identification and keyword extraction
experiment for radio news (Suzuki et al., 1997).
In that experiment, we classified radio news
into 5 domains (i.e. accident, economy,
international, politics and sports). The
problems we faced were:
1. Classification of newspaper articles into
suitable domains could not be performed
automatically.
2. Many incorrect keywords were extracted,
because the number of domains was small.
In this paper, we propose a method for keyword
extraction using term-domain interdependence
in order to cope with these two problems.
The results of the experiments demonstrate
the effectiveness of our method.
2 An overview of our method 
Figure 1 shows an overview of our method. 
Our method consists of two procedures. In the 
procedure of term-domain interdependence cal- 
culation, the system calculates feature vectors 
of term-domain interdependence using an
encyclopedia of current terms and newspaper
articles. In the procedure of keyword extraction
in radio news, the system first divides the radio
news into segments according to the length of
pauses. We call these segments units. For each
unit, the domain whose feature vector has the
largest similarity to the unit is selected as the
domain of the unit. Finally, the system extracts
keywords in each unit using the feature vector
of the domain selected by domain identification.
[Figure 1 (diagram). Panel "Calculation of
term-domain interdependence": explanations of
an encyclopedia and newspaper articles; feature
vectors (FeaVe) and feature vectors (FeaVa) for
domains D1, ..., D7, ..., D141. Panel "Keyword
extraction": radio news, domain identification,
keyword extraction; example keywords
"President" and "Democratic party".]
Figure 1: An overview of our method 
3 Calculating feature vectors 
In the procedure of term-domain
interdependence calculation, we calculate the
likelihood of appearance of each noun in each
domain. Figure 2 shows how the feature vectors
of term-domain interdependence are calculated.
In our previous experiments, we used 5
domains, which were sorted manually, and
calculated 5 feature vectors for classifying the
domain of each unit of radio news and for
extracting keywords. Our previous system
could not extract some keywords because of
many noisy keywords. In our method, newspaper
articles and units of radio news are classified
into many domains. For each domain, a feature
vector is calculated from an encyclopedia of
current terms and newspaper articles.
3.1 Sorting newspaper articles
according to their domains
[Figure 2 (flow diagram). Encyclopedia of
current terms (141 domains, 10,236
explanations): sorting explanations, extracting
nouns, calculating frequency vectors (FreqVe),
calculating $\chi^2$ values of each noun on
domains, 141 feature vectors (FeaVe).
Newspaper articles (about 110,000 articles):
separating articles, extracting nouns,
calculating frequency vectors (FreqVa),
calculating similarity between FeaVe and
FreqVa, sorting articles into domains according
to similarity, calculating $\chi^2$ values of each
noun on domains, 141 feature vectors (FeaVa).]
Figure 2: Calculating feature vectors
First, all sentences in the encyclopedia are
morphologically analyzed by ChaSen (Matsumoto
et al., 1997), and nouns that appear frequently
are extracted. A feature vector is calculated
from the frequency of each noun in each domain.
We call this feature vector FeaVe. Each element
of FeaVe is a $\chi^2$ value (Suzuki et al., 1997).
Then, nouns are extracted from newspaper
articles by the same morphological analysis
system (Matsumoto et al., 1997), and the
frequency of each noun is counted. Next, the
similarity between the FeaVe of each domain
and each newspaper article is calculated by
formula (1). Finally, a suitable domain for each
newspaper article is selected by formula (2).
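$$\mathrm{Sim}(i, j) = \mathrm{FeaVe}_j \cdot \mathrm{FreqVa}_i \qquad (1)$$
$$\mathrm{Domain}_i = \arg\max_{1 \le j \le N} \mathrm{Sim}(i, j) \qquad (2)$$
where $i$ denotes a newspaper article, $j$
denotes a domain, and $N$ is the number of
domains. The operator $\cdot$ denotes the inner
product of two vectors.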
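Formulas (1) and (2) can be illustrated with a
minimal Python sketch. It assumes that feature
vectors are stored as sparse dictionaries from
nouns to weights; the function names, the two
toy domains and all numeric weights are
hypothetical and are not taken from the
experiments.

```python
from collections import Counter


def inner_product(fea_ve, freq_va):
    """Sim(i, j) of formula (1): inner product of two sparse vectors."""
    return sum(weight * freq_va[noun]
               for noun, weight in fea_ve.items() if noun in freq_va)


def classify_article(nouns, fea_ve_by_domain):
    """Formula (2): return the domain whose FeaVe is most similar to the article."""
    freq_va = Counter(nouns)  # FreqVa_i: noun frequencies of article i
    return max(fea_ve_by_domain,
               key=lambda d: inner_product(fea_ve_by_domain[d], freq_va))


# Toy feature vectors with hypothetical chi-square weights.
fea_ve = {
    "election": {"vote": 4.0, "candidate": 3.0},
    "baseball": {"pitcher": 5.0, "inning": 2.5},
}
article_nouns = ["vote", "candidate", "vote", "inning"]
print(classify_article(article_nouns, fea_ve))  # election
```

Here the article scores 4.0 x 2 + 3.0 x 1 = 11.0
for "election" and 2.5 for "baseball", so
"election" is selected.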
3.2 Term-domain interdependence 
represented by feature vectors 
First, for each newspaper article, fewer than
5 domains with the largest similarities to the
article are selected. Then, for each selected
domain, the frequency vector is modified
according to the similarity value and the
frequency of each noun in the article. For
example, if the selected domains of an article
are "political party" and "election", and the
similarities between the article and "political
party" and between the article and "election"
are 100 and 60, respectively, the frequency
vectors are updated by formula (3) and
formula (4).
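$$\mathrm{FreqV}_{\text{political party}} = \mathrm{FreqV}_{\text{political party}} + \mathrm{FreqVa}_i \times \frac{100}{160} \qquad (3)$$
$$\mathrm{FreqV}_{\text{election}} = \mathrm{FreqV}_{\text{election}} + \mathrm{FreqVa}_i \times \frac{60}{160} \qquad (4)$$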
where i means a newspaper article. 
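The update of the frequency vectors can be
sketched as follows. This is only an
illustration: it assumes that each article's noun
counts are shared among the selected domains in
proportion to their similarities, normalized by
the sum of the selected similarities, which is
one reading of the example with similarities 100
and 60. The function name and the parameter
max_domains are hypothetical.

```python
from collections import Counter, defaultdict


def update_freq_vectors(freq_v, article_nouns, similarities, max_domains=4):
    """Add an article's noun counts to each selected domain, weighted by similarity."""
    # Keep the domains with the largest similarities (fewer than 5 in the text).
    selected = sorted(similarities, key=similarities.get, reverse=True)[:max_domains]
    total = sum(similarities[d] for d in selected)
    if total == 0:
        return freq_v  # nothing to distribute
    freq_va = Counter(article_nouns)  # FreqVa_i
    for domain in selected:
        weight = similarities[domain] / total  # assumed normalization
        for noun, count in freq_va.items():
            freq_v[domain][noun] += count * weight
    return freq_v


freq_v = defaultdict(lambda: defaultdict(float))
update_freq_vectors(freq_v, ["party", "party", "vote"],
                    {"political party": 100, "election": 60})
print(freq_v["political party"]["party"])  # 1.25
print(freq_v["election"]["vote"])          # 0.375
```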
Then, we calculate the feature vectors FeaVa
from FreqV using the method described in our
previous paper (Suzuki et al., 1997). Each
element of a feature vector is the $\chi^2$ value
of the domain and $word_k$. All $word_k$
($1 \le k \le M$, where $M$ is the number of
elements of a feature vector) are put into the
keyword dictionary.
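The exact $\chi^2$ definition is given in the
previous paper and is not repeated here. As a
rough illustration only, the sketch below
assumes the usual $\chi^2$ contribution
(observed minus expected frequency, squared,
divided by the expected frequency) and
additionally assumes that nouns occurring less
often than expected receive 0. Both choices, the
function name and the toy counts are
assumptions, not the authors' specification.

```python
def chi_square_vectors(freq_v):
    """Return {domain: {noun: chi-square}} from {domain: {noun: frequency}}."""
    total = sum(sum(nouns.values()) for nouns in freq_v.values())
    domain_sum = {d: sum(nouns.values()) for d, nouns in freq_v.items()}
    noun_sum = {}
    for nouns in freq_v.values():
        for noun, x in nouns.items():
            noun_sum[noun] = noun_sum.get(noun, 0.0) + x
    fea_v = {}
    for d, nouns in freq_v.items():
        fea_v[d] = {}
        for noun, x in nouns.items():
            # Expected frequency if the noun were independent of the domain.
            expected = noun_sum[noun] * domain_sum[d] / total
            # Assumption: only over-represented nouns get a positive weight.
            fea_v[d][noun] = (x - expected) ** 2 / expected if x > expected else 0.0
    return fea_v


freq_v = {"election": {"vote": 8, "game": 2},
          "sports": {"vote": 2, "game": 8}}
fea_va = chi_square_vectors(freq_v)
print(fea_va["election"])
```

In this toy input every expected frequency is 5,
so "vote" in "election" gets (8 - 5)^2 / 5 = 1.8
and "game" in "election" gets 0.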
4 Keyword extraction 
Input news stories are represented by
phoneme lattices. There are no marks for word
boundaries in the input news stories. Phoneme
lattices are segmented at pauses longer than
0.5 second in the recorded radio news. The
system selects a domain for each unit, which is
a segmented phoneme lattice. At each frame of
the phoneme lattice, the system selects at most
20 words from the keyword dictionary.
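The segmentation into units can be sketched as
below, assuming that the recognizer output is
available as a time-ordered list of (start time,
end time, label) tuples in seconds. This input
format and the function name are hypothetical;
only the 0.5 second threshold comes from the
text.

```python
def split_into_units(segments, min_pause=0.5):
    """Group (start, end, label) segments into units at pauses longer than min_pause."""
    units, current = [], []
    previous_end = None
    for start, end, label in segments:
        # A silence longer than min_pause closes the current unit.
        if previous_end is not None and start - previous_end > min_pause and current:
            units.append(current)
            current = []
        current.append(label)
        previous_end = end
    if current:
        units.append(current)
    return units


segments = [(0.0, 0.4, "a"), (0.5, 0.9, "b"), (1.6, 2.0, "c"), (2.1, 2.5, "d")]
print(split_into_units(segments))  # [['a', 'b'], ['c', 'd']]
```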
4.1 Similarity between a domain and
a unit
We define the words whose $\chi^2$ values in
the feature vector of $domain_j$ are large as
keywords of $domain_j$. In a unit of radio news
about "political party", there are many
keywords of "political party", and the $\chi^2$
values of these keywords in the feature vector
of "political party" are large. Therefore, the
sum of $\chi^2_{w,\text{political party}}$ tends
to be large ($w$: a word in the unit). In our
method, the system selects the word path in the
word lattice for which the sum of $\chi^2_{k,j}$
is maximized for $domain_j$. The similarity
between $unit_i$ and $domain_j$ is calculated by
formula (5).
$$\mathrm{Sim}(i, j) = \max_{\text{all paths}} \mathrm{Sim}'(i, j) = \max_{\text{all paths}} \sum_{k} np(word_k) \times \chi^2_{k,j} \qquad (5)$$
In formula (5), $word_k$ is a word in the
word lattice, and each selected word does not
share any frames with any other selected word.
$np(word_k)$ is the number of phonemes of
$word_k$, and $\chi^2_{k,j}$ is the $\chi^2$ value
of $word_k$ for $domain_j$.
The system selects the word path whose
$\mathrm{Sim}'(i, j)$ is the largest among all
word paths for $domain_j$.
Figure 3 shows the method of calculating the
similarity between $unit_i$ and domain D1. The
system selects the word path whose
$\mathrm{Sim}'(unit_i, D1)$ is larger than that
of any other word path.
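[Figure 3 (diagram): phoneme lattice of $unit_i$
with its word candidates.]
$$\mathrm{Sim}(unit_i, D1) = \max(3.2 \times 3 + 0.5 \times 6,\ 3.2 \times 3 + 4.3 \times 4 + 0.7 \times 2,\ 3.2 \times 3 + 4.3 \times 4 + 4.3 \times 3,\ 1.2 \times 3 + 0.3 \times 4,\ \ldots)$$
Figure 3: Calculating similarity between
$unit_i$ and D1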
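The maximization over all word paths in
formula (5) does not require enumerating every
path. One possible realization, sketched below,
is a dynamic program over word candidates
sorted by end frame (weighted interval
scheduling). The paper does not state which
search procedure was used, so this is an
assumption; the candidate format, the function
name and the word labels w1 to w5 are
hypothetical, and the numeric values merely
reuse the kind of terms shown in Figure 3.

```python
from bisect import bisect_right


def best_word_path(candidates, chi2, n_phonemes):
    """Return (score, words) of the best set of non-overlapping word candidates.

    candidates: list of (start_frame, end_frame, word), end_frame exclusive.
    The score of a path is the sum of n_phonemes[word] * chi2[word].
    """
    cands = sorted(candidates, key=lambda c: c[1])
    ends = [c[1] for c in cands]
    # best[k] is the best (score, path) using only the first k candidates.
    best = [(0.0, [])]
    for k, (start, end, word) in enumerate(cands):
        # Earlier candidates that end at or before this start share no frames.
        p = bisect_right(ends, start, 0, k)
        gain = n_phonemes[word] * chi2.get(word, 0.0)
        take = (best[p][0] + gain, best[p][1] + [word])
        skip = best[k]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]


n_phonemes = {"w1": 3, "w2": 6, "w3": 4, "w4": 2, "w5": 3}
chi2 = {"w1": 3.2, "w2": 0.5, "w3": 4.3, "w4": 0.7, "w5": 4.3}
candidates = [(0, 3, "w1"), (3, 9, "w2"), (3, 7, "w3"),
              (7, 9, "w4"), (7, 10, "w5")]
score, words = best_word_path(candidates, chi2, n_phonemes)
print(words, round(score, 1))  # ['w1', 'w3', 'w5'] 39.7
```

Running the same function once per domain, each
time with that domain's $\chi^2$ values, gives
the similarity of the unit to every domain.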
4.2 Domain identification and keyword 
extraction 
In the domain identification process, the
system assigns each unit to a domain using
formula (5). If $\mathrm{Sim}(i, j)$ is larger
than the similarities between $unit_i$ and all
other domains, $domain_j$ is taken to be the
domain of $unit_i$. That is, the system selects,
among the $N$ domains, the domain with the
largest similarity as the domain of the unit
(formula (6)). The words in the selected word
path for the selected domain are extracted as
the keywords of the unit.
$$\mathrm{Domain}_i = \arg\max_{1 \le j \le N} \mathrm{Sim}(i, j) \qquad (6)$$
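Formula (6) and the keyword selection that
follows it can be sketched as below. The sketch
assumes that the best path and its similarity
have already been computed for every domain and
are passed in as a dictionary; this data layout,
the function name and the example scores are
hypothetical.

```python
def identify_domain(path_by_domain):
    """Pick the domain with the largest similarity and return its path as keywords.

    path_by_domain: {domain: (similarity, words)} for one unit.
    """
    domain = max(path_by_domain, key=lambda d: path_by_domain[d][0])
    _similarity, keywords = path_by_domain[domain]
    return domain, keywords


results = {
    "political party": (39.7, ["president", "democratic party"]),
    "election": (21.3, ["president", "vote"]),
    "baseball": (4.8, ["pitch"]),
}
print(identify_domain(results))
```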
5 Experiments 
5.1 Test data 
The test data is radio news selected from
the NHK 6 o'clock radio news broadcast in
August and September of 1995. Some news
stories are hard for humans to classify into a
single domain. For the evaluation of the domain
identification experiments, we selected news
stories that two persons classified into the
same domains. The units used as test data are
segmented at pauses longer than 0.5 second.
We selected 50 units of radio news for the
experiments. The 50 units consisted of 10
units of each domain. We used two kinds of test
data. One is described by the correct phoneme
sequence. The other is a phoneme lattice
obtained by a phoneme recognition system
(Suzuki et al., 1993). In each frame of the
phoneme lattice, the number of phoneme
candidates did not exceed 3. The following
equations show the results of phoneme
recognition.
$$\frac{\text{the number of correct phonemes in phoneme lattice}}{\text{the number of uttered phonemes}} = 95.6\%$$
$$\frac{\text{the number of correct phonemes in phoneme lattice}}{\text{the number of phoneme segments in phoneme lattice}} = 81.2\%$$
5.2 Training data 
In order to classify newspaper articles into
small domains, we used an encyclopedia of
current terms, "Chiezo" (Yamamoto, 1995). The
encyclopedia has 141 domains grouped into 9
large domains, and contains 10,236 head-words
with their explanations. In order to calculate
the feature vectors of the domains,
morphological analysis was performed on all
explanations in the encyclopedia by ChaSen
(Matsumoto et al., 1997). 9,805 nouns which
appeared more than 5 times in the same domain
were selected, and a feature vector of each
domain was calculated. Using the 141 feature
vectors calculated from the encyclopedia, we
automatically identified the domains of 110,000
newspaper articles in order to calculate feature
vectors. We selected 61,727 nouns which
appeared at least 5 times in the newspaper
articles of the same domain and calculated 141
feature vectors.
5.3 Domain identification experiment 
The system selects a suitable domain for each
unit for keyword extraction. Table 1 shows
the results of domain identification. We
conducted domain identification experiments
using two kinds of input data, i.e. the correct
phoneme sequence and the phoneme lattice, and
two kinds of domain sets, i.e. 141 domains and
9 large domains. We also compared these
results with the result of our previous method
(Suzuki et al., 1997). For this comparison, we
also ran our method with the 5 domains used by
the previous method. In the previous method,
we used a keyword dictionary of 4,212 words.
Table 1: The result of domain identification
method            number of   correct    phoneme
                  domains     phoneme    lattice
our method        141         62%        40%
our method        9           78%        54%
our method        5           90%        82%
previous method   5           86%        78%
5.4 Keyword extraction experiment 
We conducted a keyword extraction
experiment using the method with 141 feature
vectors (our method), the method with 5 feature
vectors (previous method), and a method without
domain identification. Table 2 shows the recall
and precision, which are defined by formula (7)
and formula (8), respectively, when the input
data was the phoneme lattice.
$$\text{recall} = \frac{\text{the number of correct words in MSKP}}{\text{the number of selected words in MSKP}} \qquad (7)$$
$$\text{precision} = \frac{\text{the number of correct words in MSKP}}{\text{the number of correct nouns in the unit}} \qquad (8)$$
MSKP: the most suitable keyword path for the
selected domain
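The two measures can be computed as in the
following sketch, which follows formulas (7) and
(8) exactly as stated here: recall is normalized
by the number of selected words in MSKP, and
precision by the number of correct nouns in the
unit. The function name and the example words
are hypothetical.

```python
def recall_precision(selected_words, correct_nouns):
    """Recall and precision as defined in formulas (7) and (8)."""
    correct_set = set(correct_nouns)
    # Correct words in MSKP: selected words that are correct nouns of the unit.
    n_correct = sum(1 for w in selected_words if w in correct_set)
    recall = n_correct / len(selected_words) if selected_words else 0.0
    precision = n_correct / len(correct_nouns) if correct_nouns else 0.0
    return recall, precision


mskp = ["president", "party", "rain", "vote"]
nouns_in_unit = ["president", "party", "vote", "cabinet", "budget"]
print(recall_precision(mskp, nouns_in_unit))  # (0.75, 0.6)
```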
6 Discussion 
6.1 Sorting newspaper articles 
according to their domains 
Because we use $\chi^2$ values in the feature
vectors, we obtained good results for domain
identification of newspaper articles. Even
when a newspaper article is classified into
several domains, the suitable domains are
selected correctly.
6.2 Domain identification of radio news 
Table 1 shows that when we used 141 domains
and the phoneme lattice, 40% of the units were
identified as the most suitable domains by our
method, and that when we used 9 domains and
the phoneme lattice, 54% of the units were
identified as the most suitable domains by our
method. When the number of domains was 5,
the results of our method are better than those
of our previous experiment. The reason is that
we use small domains. With small domains, the
number of words whose $\chi^2$ values for a
certain domain are high is smaller than when
large domains are used.
For further improvement of domain
identification, it is necessary to use a larger
newspaper corpus in order to calculate the
feature vectors precisely, and to improve
phoneme recognition.
Table 2: Recall and precision of keyword
extraction
Method            R/P   Correct    Phoneme
                        phoneme    lattice
our method        R     88.5%      48.9%
(141 domains)     P     69.0%      38.1%
previous method   R     80.0%      24.0%
(5 domains)       P     63.1%      33.0%
without DI        R     77.0%      12.2%
(1 domain)        P     60.1%      9.5%
R: recall  P: precision  DI: domain identification
6.3 Keyword extraction of radio news 
When we applied our method to the phoneme
lattice, recall was 48.9% and precision was
38.1%. We compared this result with the result
of our previous experiment (Suzuki et al.,
1997). The result of our method is better than
our previous result. The reason is that we used
domains which are precisely classified, so that
we can limit the keyword search space. However,
recall was only 48.9% with our method. This
shows that about 50% of the selected keywords
were incorrect words, because the system tries
to find keywords for all parts of the units. In
order to raise recall, the system has to use
co-occurrence between keywords in the most
suitable keyword path.
7 Conclusions 
In this paper, we proposed keyword
extraction for radio news using term-domain
interdependence. With our method, a large
corpus sorted according to domains could be
obtained automatically for keyword extraction.
Using our method, the number of incorrect
keywords among the extracted words was smaller
than with the previous method.
In the future, we will study how to select
correct words from the extracted keywords in
order to apply our method to dictation of radio
news.
8 Acknowledgments 
The authors would like to thank Mainichi 
Shimbun for permission to use newspaper arti- 
cles on CD-Mainichi Shimbun 1994 and 1995, 
Asahi Shimbun for permission to use the data 
of the encyclopedia of current terms "Chiezo 
1996" and Japan Broadcasting Corporation 
(NHK) for permission to use radio news. The 
authors would also like to thank the anonymous 
reviewers for their valuable comments. 

References 
Raimo Bakis, Scott Chen, Ponani Gopalakrishnan,
Ramesh Gopinath, Stephane Maes, and Lazaros
Polymenakos. 1997. Transcription of broadcast
news - system robustness issues and adaptation
techniques. In Proc. ICASSP'97, pages 711-714.
J. McDonough, K. Ng, P. Jeanrenaud, H. Gish, and
J. R. Rohlicek. 1994. Approaches to topic
identification on the switchboard corpus. In
Proc. IEEE ICASSP'94, volume 1, pages 385-388.
Yuji Matsumoto, Akira Kitauchi, Tatuo Yamashita,
Osamu Imaichi, and Tomoaki Imamura. 1997.
Japanese Morphological Analysis System ChaSen
Manual. Matsumoto Lab., Nara Institute of
Science and Technology.
Satoshi Sekine. 1996. Modeling topic coherence
for speech recognition. In Proc. COLING 96,
pages 913-918.
Yoshimi Suzuki, Chieko Furuichi, and Satoshi Imai.
1993. Spoken Japanese sentence recognition
using dependency relationship with systematical
semantic category. Trans. of IEICE J76 D-II,
11:2264-2273. (in Japanese).
Yoshimi Suzuki, Fumiyo Fukumoto, and Yoshihiro
Sekiguchi. 1997. Keyword extraction of radio
news using term weighting for speech recognition.
In NLPRS97, pages 301-306.
Shin Yamamoto, editor. 1995. The Asahi
Encyclopedia of Current Terms 'Chiezo'. Asahi
Shimbun.
Kentaro Yokoi, Tatsuya Kawahara, and Shuji
Doshita. 1997. Topic identification of news
speech using word cooccurrence statistics. In
Technical Report of IEICE SP96-105, pages 71-78.
(in Japanese).
