Application of NLP technology to production of closed-caption 
TV programs in Japanese for the hearing impaired 
Takahiro Wakao, Telecommunications Advancement Organization (TAO) of Japan
Terumasa Ehara, NHK Science and Technical Research Labs. / TAO
Eiji Sawamura, TAO
Yoshiharu Abe, Mitsubishi Electric Corp Information Technology R&D Center / TAO
Katsuhiko Shirai, Waseda University Department of Information and Computer Science / TAO
1 Introduction 
The Telecommunications Advancement Organization (TAO) of Japan, with
the support of the Ministry of Posts and Telecommunications, has
initiated a project in which electronically available text of TV news
programs is summarized and automatically synchronized with the speech
and video, then superimposed on the original programs for the benefit
of hearing-impaired people in Japan. This kind of service is provided
for more than 70% of the TV programs in the United States and in
Europe, but for only 10% of the TV programs in Japan. Most of the
closed captions are literal transcriptions of what is being said.
Availability is low for two reasons: first, thousands of characters
are used in the Japanese language; second, closed captions are at
present produced manually, which is a time-consuming and costly task.
The project started in 1996 and will end in 2001. 
Its annual budget is 200 million yen. The main aim 
of the project is to establish the technology of pro- 
ducing closed captions for TV programs efficiently 
using natural language processing and speech recog- 
nition technology. 
We describe the main research issues and the project schedule, and
present the results of preliminary research.
2 Research Issues 
Main research issues in the project are as follows: 
• automatic text summarization 
• automatic synchronization of text and speech 
• building an efficient closed caption production 
system 
Based on research into these issues, we aim to build the system
outlined in Figure 1.
Although all types of TV programs are to be han- 
dled in the project, the first priority is given to TV 
news programs. 
The outline of each research issue is described 
next. 
2.1 Automatic Text Summarization 
For most TV news programs today, scripts (written text) are available
before they are read out by newscasters. Japanese news text is read at
a speed of four hundred characters per minute, which is too fast: if
every character of what is said is shown on the screen, there are too
many characters to read (Komine et al., 1996). We therefore need to
summarize the news program text before showing it on the TV screen.
The aim of the research on automatic text summarization is to
summarize the text, fully or partially automatically, to a suitable
size in order to assist closed caption production.
2.2 Automatic Synchronization of Text and 
Speech 
Once the original news program text is summarized, it should be
synchronized with the actual sound, that is, the speech of the
programs. At present this is done by hand when the closed captions are
produced. We would like to use speech recognition technology to help
with the task of synchronizing text with speech. Note that our aim is
to synchronize the original text, rather than the summarized text,
with the speech.
2.3 Efficient Closed Caption Production 
System 
We will create a system by integrating the summarization and
synchronization techniques with techniques for superimposing
characters. We also need to study other aspects, such as the best way
to show the characters on the screen for hearing-impaired viewers.
[Diagram not reproduced. Recoverable labels: VTR; audio & time code; automatic summarization; automatic recognition; synchronization; original text; time code; program with closed captions.]
Figure 1: System Outline 
3 Project Schedule 
The project is divided into two stages: the first three years and the
remaining two years. In the first stage we conduct research on the
above issues and create a prototype system. The prototype system will
also be used to produce closed captions, and its capability and
functions will be evaluated. In the second stage we will improve the
prototype system.
In 1996 and 1997, the following research has been 
conducted and will be continued. 
• Automatic text summarization 
- method for dividing a sentence into smaller 
sections 
- key word extraction 
- method for connecting sentence sections 
• Automatic synchronization of text and speech 
- transcription, speech model integration
system
- maximum likelihood matching system
- speech database
• Efficient closed caption production system
- integrated simulation system for closed
caption production
4 Preliminary Research Results 
We have conducted preliminary research for auto- 
matic text summarization and synchronization of 
text and speech, and the results are as follows. 
4.1 Automatic Text Summarization 
Text summarization research in the past may be grouped into three
approaches. The first is to generate summarized sentences based on an
understanding of the text. This is desirable, but at present it is not
a practical method for summarizing actual TV news program text.
The second is to digest the text by making use of text structures such
as paragraphs. It has been applied to newspaper articles in Japanese
(Yamamoto et al., 1994). In this approach, the important parts of the
text, which are to be kept in the summary, are determined by their
locations, i.e. where they appear in the text. For example, if nouns
or proper nouns appear in the headline, they are considered
'important' and may be used to measure how important the other parts
of the text are. As we describe later, TV news text differs from
newspaper articles in that it does not have obvious structures, i.e.
the TV news text has fewer sentences and usually only one paragraph,
without titles or headlines. Thus the second approach is not suitable
for TV news text.
The third is to detect important (or relevant) words (segments in the
case of Japanese), determine which sections of the text are important,
and then put those sections together to form a 'summarization' of the
text. This is probably the most robust of the three approaches, and it
is the one we are currently using (for a summary of various
summarization techniques, see (Paice, 1990)).
To illustrate the difference between TV news pro- 
gram text and newspaper articles, we compared one 
thousand randomly selected articles from both do- 
mains. The results are shown in Fig 2 and Fig 3. 
[Plot not reproduced. Horizontal axis: number of sentences per article (10 to 70); vertical axis: 0.0 to 0.3. Solid line: TV, dotted line: newspaper.]
Figure 2: Number of sentences per article 
[Plot not reproduced. Horizontal axis: characters per sentence (0 to 600); vertical axis: up to 0.2. Solid line: TV, dotted line: newspaper.]
Figure 3: Number of characters per sentence 
Fig 2 and Fig 3 show that in comparison with 
newspaper text, the TV news program text has the 
following features: 
• Fewer sentences per text 
• Longer sentences 
If we summarize TV news program text by selecting 'sentences' from the
text, the result will be a 'rough' summarization. On the other hand,
if we can divide long sentences into smaller sections and thus
increase the number of 'sentences (sections)' in the text, then we may
obtain a better summarization (Kim and Ehara, 1994).
As a method of summarization, we are using the third approach. To find
important words in the text, we adopted the high-frequency key word
method and the TF-IDF (Term Frequency - Inverse Document Frequency)
method, and evaluated the two methods automatically on a large scale
in our preliminary research. We used ten thousand (10000) TV news
texts from 1992 to 1995 (2500 texts each year) for the evaluation. One
of the features of TV news texts is that the first sentence is the
most important. We conducted the evaluation by taking advantage of
this feature.
Key words used in the high-frequency key word method are content words
which appear more than twice in a given text (Luhn, 1957; Edmundson,
1969). To determine the importance of a sentence, we count the number
of key words in the sentence and divide it by the number of words in
the sentence (including both function and content words). In the
TF-IDF method, the weight of each word is first computed by
multiplying its frequency in the text (TF) by its IDF in a given text
collection. The importance of a sentence is then computed by summing
the weights of all the words in the sentence and dividing by the
number of words (Sparck Jones, 1972; Salton, 1971).
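The two sentence-importance measures described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the text is assumed to be already segmented into words, `is_content_word` is a hypothetical predicate, and the IDF form log(N/df) is an assumption, since no formula is given here.

```python
import math
from collections import Counter

def keyword_scores(sentences, is_content_word):
    """Score each sentence by the share of its words that are key words.

    sentences: list of sentences, each a list of already segmented words.
    is_content_word: hypothetical predicate separating content words
    from function words.
    """
    counts = Counter(w for s in sentences for w in s if is_content_word(w))
    # "More than twice" is read literally: a frequency of three or more.
    key_words = {w for w, c in counts.items() if c > 2}
    return [sum(w in key_words for w in s) / len(s) if s else 0.0
            for s in sentences]

def tfidf_scores(sentences, doc_freq, n_docs):
    """Score each sentence by the mean TF-IDF weight of its words.

    doc_freq: word -> number of texts in the collection containing it.
    n_docs: number of texts in the collection.
    The IDF form log(n_docs / df) is an assumption of this sketch.
    """
    tf = Counter(w for s in sentences for w in s)

    def weight(w):
        df = doc_freq.get(w, 1)  # unseen words are treated as df = 1
        return tf[w] * math.log(n_docs / df)

    return [sum(weight(w) for w in s) / len(s) if s else 0.0
            for s in sentences]
```

Both functions return one score per sentence, so their outputs can be ranked in the same way.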
The evaluation details are as follows. First, the importance of each
sentence is calculated by the high-frequency key word method or the
TF-IDF method. Then the sentences are ranked according to their
importance. We computed the accuracy of each method by checking
whether the first sentence is ranked first, or ranked either first or
second.
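The evaluation procedure just described can be sketched as below. This is an illustrative sketch, not the authors' code: the function names are hypothetical, and breaking ties in favour of earlier sentences is an assumption, since tie handling is not described.

```python
def rank_of_first_sentence(scores):
    """Return the 1-based rank of sentence 0 under descending score.

    Ties go to the earlier sentence (an assumption of this sketch);
    Python's sort is stable, so equal scores keep their original order.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order.index(0) + 1

def first_sentence_accuracy(all_scores):
    """Compute the two accuracy figures over a collection of texts.

    all_scores: one list of sentence importance scores per text.
    Returns (percent ranked first, percent ranked first or second).
    """
    ranks = [rank_of_first_sentence(s) for s in all_scores if s]
    if not ranks:
        raise ValueError("no non-empty texts to evaluate")
    n = len(ranks)
    first = 100.0 * sum(r == 1 for r in ranks) / n
    first_or_second = 100.0 * sum(r <= 2 for r in ranks) / n
    return first, first_or_second
```

The input is the per-sentence score list produced by either scoring method, so the same evaluation applies to both.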
The evaluation results are shown in Table 1. The high-frequency key
word method produced better results than the TF-IDF method.
Method                     First (%)   First or Second (%)
High-frequency key word    68.86       88.95
TF-IDF                     54.02       80.67
Table 1: Sentence Extraction Accuracy 
4.2 Automatic Synchronization of Text and 
Speech 
As the next step, we need to synchronize the text and the speech.
First, the written TV news text is converted into a stream of phonetic
transcriptions, and then synchronization is done by detecting the time
points of the text sections and their corresponding speech sections.
At the same time, we have started to create a news speech database. In
1996, we collected speech data by simulating news programs, i.e. the
TV news texts were read and recorded in a studio, rather than recorded
from actual TV news programs on the air. We collected seven and a half
hours of recordings from twenty people (both male and female). In
1997, we plan to record actual programs as 'real' data in addition to
the simulation recordings. The real data will be taken from both radio
and TV news programs.
Preliminary research on the detection of synchronization points was
conducted using the data we have created. A speech model was produced
using three hours of recordings (four male and four female speakers)
as training data. For each speaker, a two-loop,
four-mixture-distribution phonetic HMM was trained. Based on the HMMs,
key-word pair models were obtained from the phonetic transcription.
The key-word pair model is shown in Fig 4. The model consists of two
strings of words (keywords1 and keywords2) before and after the
synchronization point (point B).
[Diagram not reproduced. Recoverable labels: NULL ARC; B.]
Figure 4: key-word pair model 
When speech is fed to the model, the non-synchronizing input data
travel through the garbage arc, while the synchronizing data go
through the key words, so that the likelihood at point B increases.
Thus, if we observe the likelihood at point B and it exceeds a certain
threshold, we decide that this is the synchronization point for the
input data. Twenty-one key-word pairs were selected for evaluation
from data which was not used in the training. In the evaluation, we
fed the speech of one male and one female speaker to the model. The
result is shown in Table 2.
As we decrease the threshold, the detection rate increases, but the
false alarm rate increases rapidly.
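The threshold decision and the two rates it trades off can be sketched as follows. This is a hedged illustration only: the per-frame score sequence, the function names, the matching `tolerance`, and the reading of FA/KW/Hour as false alarms per key-word pair per hour of speech are all assumptions of this sketch, not details given in the text.

```python
def detect_sync_points(likelihoods, threshold):
    """Return frame indices where the point-B score rises above threshold.

    likelihoods: per-frame score observed at point B (hypothetical input,
    e.g. a log-likelihood; the exact quantity is an assumption).
    Each excursion above the threshold is reported once, at its first frame.
    """
    hits = []
    above = False
    for t, score in enumerate(likelihoods):
        if score > threshold and not above:
            hits.append(t)  # upward crossing of the threshold
        above = score > threshold
    return hits

def score_detections(hits, true_points, tolerance, n_keywords, hours):
    """Return (detection rate in percent, false alarms per keyword per hour).

    A true point counts as detected if some hit lies within `tolerance`
    frames of it; a hit near no true point counts as a false alarm.
    """
    detected = sum(any(abs(h - p) <= tolerance for h in hits)
                   for p in true_points)
    false_alarms = sum(all(abs(h - p) > tolerance for p in true_points)
                       for h in hits)
    detection_rate = 100.0 * detected / len(true_points)
    fa_rate = false_alarms / n_keywords / hours
    return detection_rate, fa_rate
```

Sweeping `threshold` over a range of values and calling `score_detections` at each one gives the kind of detection versus false-alarm trade-off that the text describes.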
5 Conclusion 
We have described a national project in which the 'speech' of TV
programs is changed into captions and superimposed on the original
programs for the benefit of hearing-impaired people in Japan. We also
showed the results of preliminary research on TV news text
summarization and on the synchronization of text and speech. We will
continue to integrate natural language processing and speech
processing technology into an efficient closed caption production
system, and put it to practical use as soon as possible.
Threshold   Detection Rate (%)   False Alarm Rate (FA/KW/Hour)
-10         69.05                2.78
-20         76.19                9.17
-30         85.71                39.72
-40         90.48                131.93
-50         92.86                409.97
-60         97.62                975.75
-70         97.62                1774.30
-80         97.62                2867.55
-90         97.62                4118.56
-100        97.62                5403.45
Table 2: Synchronization Detection 

References 
K. Komine, H. Hoshino, H. Isono, T. Uchida, Y. Iwahana. 1996.
Cognitive Experiments of News Captioning for Hearing Impaired
Persons. Technical Report of IEICE (The Institute of Electronics,
Information and Communication Engineers), HCS96-23, pages 7-12.
In Japanese.
H.P. Luhn. 1957. A statistical approach to the mechanized encoding
and searching of literary information. In IBM Journal of Research
and Development, 1(4), pages 309-317.
H.P. Edmundson. 1969. New Methods in Automatic Extracting. In
Journal of the ACM, 16(2), pages 264-285.
Chris D. Paice. 1990. Constructing literature abstracts by computer:
techniques and prospects. In Information Processing & Management,
26(1), pages 171-186. Pergamon Press plc.
Yeun-Bae Kim, Terumasa Ehara. 1994. An Automatic Sentence Breaking
and Subject Supplement Method for J/E Machine Translation.
Information Processing Society of Japan, Ronbun-shi, Vol 35, No. 6.
In Japanese.
Gerard Salton (Ed). 1971. The Smart Retrieval System - Experiments
in Automatic Document Retrieval. Englewood Cliffs, NJ: Prentice
Hall Inc.
Karen Sparck Jones. 1972. A statistical interpretation of term
specificity and its application in retrieval. In Journal of
Documentation, 28(1), pages 11-21.
Kazuhide Yamamoto, Shigeru Masuyama, Shozo Naito. 1994. GREEN: An
Experimental System Generating Summary of Japanese Editorials by
Combining Multiple Discourse Characteristics. NL-99-3, Information
Processing Society of Japan. In Japanese.
