File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/98/p98-2207_intro.xml

Size: 3,952 bytes

Last Modified: 2025-10-06 14:06:38

<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2207">
  <Title>Keyword Extraction using Term-Domain Interdependence for Dictation of Radio News</Title>
  <Section position="2" start_page="0" end_page="1272" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Recently, many speech recognition systems have been designed for various tasks. However, most of them are restricted to specific tasks, for example, tourist information or ordering at a hamburger shop. Some applications, e.g. a closed-caption system for TV or a transcription system for public proceedings, require speech recognition that covers various domains. To recognize spoken discourse that spans several domains, the speech recognition system must have a large vocabulary. It is therefore necessary to limit the word search space using linguistic restrictions, e.g. domain identification.</Paragraph>
    <Paragraph position="1"> There have been many studies of domain identification using term weighting (McDonough et al., 1994; Yokoi et al., 1997). McDonough proposed a topic identification method for the Switchboard corpus and reported that the result was best when the keyword dictionary contained about 800 words. The discourses in the Switchboard corpus are rather long, so each contains many keywords; a short discourse, however, contains few. Yokoi proposed a topic identification method using co-occurrence of words (Yokoi et al., 1997) and classified each dictated sentence of news into 8 topics. In TV or radio news, however, it is difficult to segment sentences automatically. Sekine proposed a method for selecting a suitable sentence from candidate sentences extracted by a speech recognition system, using a statistical language model (Sekine, 1996).</Paragraph>
    <Paragraph position="2"> However, if the statistical model is used for the extraction of sentence candidates, higher recognition accuracy should be obtained.</Paragraph>
    <Paragraph position="3"> Some initial studies on the transcription of broadcast news are under way (Bakis et al., 1997). However, some problems remain, e.g. speaking styles and domain identification.</Paragraph>
    <Paragraph position="4"> We conducted a domain identification and keyword extraction experiment (Suzuki et al., 1997) for radio news. In the experiment, we classified radio news into 5 domains (i.e. accident, economy, international, politics and sports). The problems we faced were:</Paragraph>
    <Paragraph position="5"> 1. Classification of newspaper articles into suitable domains could not be performed automatically.</Paragraph>
    <Paragraph position="6"> 2. Many incorrect keywords were extracted, because the number of domains was small.</Paragraph>
    <Paragraph position="7"> In this paper, we propose a method for keyword extraction using term-domain interdependence in order to cope with these two problems. The results of the experiments demonstrate the effectiveness of our method.</Paragraph>
    <Paragraph position="8"> 2 An overview of our method Figure 1 shows an overview of our method.</Paragraph>
    <Paragraph position="9"> Our method consists of two procedures. In the first, term-domain interdependence calculation, the system calculates feature vectors of term-domain interdependence using an encyclopedia of current terms and newspaper articles. In the second, keyword extraction in radio news, the system first divides the radio news into segments according to the length of pauses. We call these segments units. For each unit, the domain whose feature vector has the largest similarity to the unit is selected as the domain of the unit. Finally, the system extracts keywords in each unit using the feature vector of the domain selected by domain identification.</Paragraph>
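    <Paragraph position="10"> As an illustration only, the keyword extraction procedure described above (pause-based segmentation into units, domain selection by largest similarity, keyword extraction with the selected feature vector) might be sketched in Python as follows. Several details are assumptions that the text above does not specify: cosine similarity as the similarity measure, raw term counts as the unit representation, the 0.5 second pause threshold, the rule that a word is a keyword when its weight in the selected domain vector exceeds a minimum, and all names and toy weights in the code.

```python
import math
from collections import Counter

# Hypothetical pause threshold in seconds; the source does not state a value.
PAUSE_THRESHOLD = 0.5


def split_into_units(timed_words, pause_threshold=PAUSE_THRESHOLD):
    """Split (word, pause_before_seconds) pairs into units at long pauses."""
    units, current = [], []
    for word, pause_before in timed_words:
        if current and pause_before >= pause_threshold:
            units.append(current)
            current = []
        current.append(word)
    if current:
        units.append(current)
    return units


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts (assumed measure)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def identify_domain(unit, domain_vectors):
    """Return the domain whose feature vector is most similar to the unit."""
    unit_vector = Counter(unit)  # raw term counts stand in for unit weights
    return max(domain_vectors, key=lambda d: cosine(unit_vector, domain_vectors[d]))


def extract_keywords(unit, domain_vector, min_weight=0.0):
    """Keep unit words whose weight in the selected domain exceeds min_weight."""
    keywords = []
    for word in unit:
        if domain_vector.get(word, 0.0) > min_weight and word not in keywords:
            keywords.append(word)
    return keywords


# Toy feature vectors and toy news input; all values are invented for illustration.
DOMAIN_VECTORS = {
    "economy": {"yen": 0.9, "market": 0.8, "bank": 0.7},
    "sports": {"match": 0.9, "team": 0.8, "goal": 0.7},
}
NEWS = [
    ("the", 0.0), ("yen", 0.1), ("rose", 0.1), ("on", 0.05),
    ("the", 0.05), ("market", 0.1),
    ("the", 1.2), ("team", 0.1), ("won", 0.1), ("the", 0.1), ("match", 0.1),
]

if __name__ == "__main__":
    for unit in split_into_units(NEWS):
        domain = identify_domain(unit, DOMAIN_VECTORS)
        print(domain, extract_keywords(unit, DOMAIN_VECTORS[domain]))
```

    In this sketch the feature vectors are given as input; computing them from an encyclopedia of current terms and newspaper articles belongs to the first procedure and is not modeled here.</Paragraph>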
  </Section>
</Paper>