<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2207">
  <Title>Keyword Extraction using Term-Domain Interdependence for Dictation of Radio News</Title>
  <Section position="3" start_page="1272" end_page="1273" type="metho">
    <SectionTitle>
3 Calculating feature vectors
</SectionTitle>
    <Paragraph position="0"> In the term-domain interdependence calculation, we calculate the likelihood of appearance of each noun in each domain. Figure 2 shows how the feature vectors of term-domain interdependence are calculated.</Paragraph>
    <Paragraph position="1"> In our previous experiments, we used 5 manually sorted domains and calculated 5 feature vectors, both for classifying the domain of each unit of radio news and for extracting keywords. Our previous system could not extract some keywords because of the many noisy keywords.</Paragraph>
    <Paragraph position="2"> In our method, newspaper articles and units of radio news are classified into many domains. For each domain, a feature vector is calculated from an encyclopedia of current terms and newspaper articles.</Paragraph>
    <Paragraph position="3"> 3.1 Sorting newspaper articles according to their domains. First, all sentences in the encyclopedia are morphologically analyzed by Chasen (Matsumoto et al., 1997), and nouns are extracted. A feature vector is calculated from the frequency of each noun in each domain. We call this feature vector FeaVe. Each element of FeaVe is a $\chi^2$ value (Suzuki et al., 1997). [Figure 2: an encyclopedia of current terms (141 domains, 10,236 explanations)]</Paragraph>
    <Paragraph position="4"> Then, nouns are extracted from newspaper articles by a morphological analysis system (Matsumoto et al., 1997), and the frequency of each noun is counted. Next, the similarity between the FeaVe of each domain and each newspaper article is calculated by formula (1). Finally, a suitable domain for each newspaper article is selected by formula (2).</Paragraph>
    <Paragraph position="6"> where i denotes a newspaper article and j denotes a domain. (·) denotes the inner product of vectors.</Paragraph>
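The two steps above can be sketched as follows. This is a minimal illustration, assuming that formula (1) is the inner product between an article's noun-frequency vector and a domain's FeaVe, and that formula (2) picks the domain with the largest similarity. The function names and the toy $\chi^2$ values are hypothetical and do not come from the paper.

```python
from collections import Counter

def similarity(article_nouns, feave):
    """Inner product of an article's noun-frequency vector and one domain's FeaVe."""
    freq = Counter(article_nouns)
    return sum(count * feave.get(noun, 0.0) for noun, count in freq.items())

def select_domain(article_nouns, feave_by_domain):
    """Return the domain whose FeaVe has the largest inner product with the article."""
    return max(feave_by_domain,
               key=lambda d: similarity(article_nouns, feave_by_domain[d]))

# Hypothetical chi-square values for two domains (illustration only).
feave_by_domain = {
    "election": {"vote": 4.0, "party": 1.0},
    "economy": {"yen": 5.0, "party": 0.5},
}
article = ["vote", "vote", "party"]
print(select_domain(article, feave_by_domain))  # election
```

Sparse dictionaries are used here instead of dense vectors because most nouns have no entry for a given domain.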
    <Section position="1" start_page="1272" end_page="1273" type="sub_section">
      <SectionTitle>
3.2 Term-domain interdependence represented by feature vectors
</SectionTitle>
      <Paragraph position="0"> First, for each newspaper article, fewer than 5 domains with the largest similarities to the article are selected. Then, for each selected domain, the frequency vector is modified according to the similarity value and the frequency of each noun in the article. For example, if the selected domains of an article are "political party" and "election", and the similarities between the article and these domains are 100 and 60 respectively, each frequency vector is calculated by formula (3) and formula (4).</Paragraph>
      <Paragraph position="2"> where i means a newspaper article.</Paragraph>
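Formulas (3) and (4) are not reproduced in this text, so the following is only a sketch of one plausible reading. It assumes that each selected domain's frequency vector is incremented by the article's noun counts, scaled by that domain's share of the total similarity (100/160 and 60/160 in the example above). The weighting scheme and all names are assumptions.

```python
from collections import Counter, defaultdict

def update_freq_vectors(freq_by_domain, article_nouns, selected):
    """Add the article's noun counts to each selected domain, weighted by similarity share."""
    total = sum(selected.values())
    counts = Counter(article_nouns)
    for domain, sim in selected.items():
        weight = sim / total  # assumed normalisation, not confirmed by the source
        for noun, count in counts.items():
            freq_by_domain[domain][noun] += weight * count
    return freq_by_domain

freq_by_domain = defaultdict(lambda: defaultdict(float))
selected = {"political party": 100.0, "election": 60.0}
update_freq_vectors(freq_by_domain, ["cabinet", "cabinet", "vote"], selected)
print(freq_by_domain["political party"]["cabinet"])  # 2 * 100/160 = 1.25
print(freq_by_domain["election"]["cabinet"])         # 2 * 60/160 = 0.75
```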
      <Paragraph position="3"> Then, we calculate the feature vectors FeaVe from FreqV using the method described in our previous paper (Suzuki et al., 1997). Each element of a feature vector is the $\chi^2$ value of the domain and word_k. All word_k ($1 \le k \le M$, where M is the number of elements of a feature vector) are put into the keyword dictionary.</Paragraph>
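The exact $\chi^2$ definition is deferred to Suzuki et al. (1997) and is not given here. As a rough sketch, the code below assumes the common contingency-table form, (observed - expected)^2 / expected, where the expected frequency is the domain total times the noun total divided by the grand total. The function name and toy counts are hypothetical.

```python
def chi_square_vectors(freq):
    """freq[domain][noun] -> chi-square-style score for every (domain, noun) cell."""
    nouns = sorted({n for row in freq.values() for n in row})
    domain_total = {d: sum(row.values()) for d, row in freq.items()}
    noun_total = {n: sum(row.get(n, 0.0) for row in freq.values()) for n in nouns}
    grand_total = sum(domain_total.values())
    feave = {}
    for d, row in freq.items():
        feave[d] = {}
        for n in nouns:
            expected = domain_total[d] * noun_total[n] / grand_total
            observed = row.get(n, 0.0)
            # Guard against an empty domain, whose expected frequency is zero.
            feave[d][n] = (observed - expected) ** 2 / expected if expected > 0 else 0.0
    return feave

freq = {"election": {"vote": 8.0, "yen": 2.0},
        "economy": {"vote": 2.0, "yen": 8.0}}
feave = chi_square_vectors(freq)
print(feave["election"]["vote"])  # (8 - 5)^2 / 5 = 1.8
```

This plain form gives the same score to a noun that is over-represented and to one that is equally under-represented in a domain. A practical variant might zero the score when the observed count is below the expected one; whether the original method does so is not stated in this text.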
    </Section>
  </Section>
  <Section position="4" start_page="1273" end_page="1273" type="metho">
    <SectionTitle>
4 Keyword extraction
</SectionTitle>
    <Paragraph position="0"> Input news stories are represented as phoneme lattices. There are no marks for word boundaries in the input news stories. Phoneme lattices are segmented at pauses longer than 0.5 seconds in the recorded radio news. The system selects a domain for each unit, where a unit is a segmented phoneme lattice. At each frame of the phoneme lattice, the system selects at most 20 words from the keyword dictionary.</Paragraph>
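The pause-based segmentation can be sketched as below. This assumes a hypothetical input of time-stamped phoneme segments (label, start and end in seconds); the actual lattice format of the recognizer is not described in this text.

```python
def segment_units(segments, min_pause=0.5):
    """Split (label, start_sec, end_sec) segments into units at pauses longer than min_pause."""
    units, current = [], []
    prev_end = None
    for label, start, end in segments:
        # A silence longer than min_pause closes the current unit.
        if prev_end is not None and start - prev_end > min_pause and current:
            units.append(current)
            current = []
        current.append((label, start, end))
        prev_end = end
    if current:
        units.append(current)
    return units

segments = [("k", 0.0, 0.1), ("a", 0.1, 0.25), ("s", 1.0, 1.1), ("e", 1.1, 1.3)]
print([len(u) for u in segment_units(segments)])  # [2, 2]
```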
    <Section position="1" start_page="1273" end_page="1273" type="sub_section">
      <SectionTitle>
4.1 Similarity between a domain and a unit
</SectionTitle>
      <Paragraph position="0"> We define the words whose $\chi^2$ values in the feature vector of domain_j are large as keywords of domain_j. In a unit of radio news about "political party", there are many keywords of "political party", and the $\chi^2$ values of these keywords in the feature vector of "political party" are large. Therefore, the sum of $\chi^2_{w,\text{political party}}$ tends to be large (w: a word in the unit). In our method, the system selects the word path in the word lattice that maximizes the sum of $\chi^2_{k,j}$ for domain_j. The similarity between unit_i and domain_j is calculated by formula (5).</Paragraph>
      <Paragraph position="1"> $$Sim(i, j) = \max_{\text{all paths}} Sim'(i, j) = \max_{\text{all paths}} \sum_{k} np(word_k) \times \chi^2_{k,j} \quad (5)$$ In formula (5), word_k is a word in the word lattice, and each selected word does not share any frames with any other selected word.</Paragraph>
      <Paragraph position="2"> np(word_k) is the number of phonemes of word_k.</Paragraph>
      <Paragraph position="3"> $\chi^2_{k,j}$ is the $\chi^2$ value of word_k for domain_j.</Paragraph>
      <Paragraph position="4"> The system selects the word path whose Sim'(i, j) is the largest among all word paths for domain_j.</Paragraph>
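The path search in formula (5) is not specified in detail here. One way to realize it is the standard weighted interval scheduling dynamic program, sketched below. It assumes that each word candidate carries integer, inclusive start and end frames and a phoneme count. The candidate words and $\chi^2$ values are hypothetical.

```python
from bisect import bisect_right

def best_path(candidates, chi2):
    """candidates: (word, start_frame, end_frame, n_phonemes), frames inclusive.
    Returns (score, words) for the non-overlapping selection that maximises
    the sum of n_phonemes * chi2[word]."""
    cands = sorted(candidates, key=lambda c: c[2])
    ends = [c[2] for c in cands]
    best = [(0.0, [])]  # best[i]: optimum over the first i candidates
    for i, (word, start, end, n_ph) in enumerate(cands):
        # Number of earlier candidates that end strictly before this one starts.
        p = bisect_right(ends, start - 1, 0, i)
        take_score = best[p][0] + n_ph * chi2.get(word, 0.0)
        if take_score > best[i][0]:
            best.append((take_score, best[p][1] + [word]))
        else:
            best.append(best[i])
    return best[-1]

# Hypothetical lattice candidates and chi-square values for one domain.
candidates = [("seito", 0, 4, 5), ("sei", 0, 2, 3),
              ("toshi", 3, 7, 5), ("senkyo", 5, 10, 6)]
chi2 = {"seito": 3.0, "sei": 0.5, "toshi": 1.0, "senkyo": 2.0}
print(best_path(candidates, chi2))  # (27.0, ['seito', 'senkyo'])
```

Sorting by end frame lets each candidate reuse the best score over everything that finishes before it starts, so the search avoids enumerating all paths explicitly.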
      <Paragraph position="5"> Figure 3 shows the method of calculating the similarity between unit_i and domain D1. The system selects the word path whose Sim'(unit_i, D1) is larger than that of any other word path.</Paragraph>
      <Paragraph position="6"> [Figure 3: phoneme lattice of unit_i]</Paragraph>
    </Section>
    <Section position="2" start_page="1273" end_page="1273" type="sub_section">
      <SectionTitle>
4.2 Domain identification and keyword extraction
</SectionTitle>
      <Paragraph position="0"> In the domain identification process, the system assigns each unit to a domain using formula (5). If Sim(i, j) is larger than the similarities between the unit and all other domains, domain_j is taken to be the domain of unit_i. The system selects the domain with the largest similarity among the N domains as the domain of the unit (formula (6)). The words in the selected word path for the selected domain are selected as keywords of the unit.</Paragraph>
      <Paragraph position="2"/>
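Formula (6) is not reproduced in this text. The sketch below assumes it is an argmax of Sim(i, j) over the N domains, followed by reading the keywords off the winning path. The input dictionary of per-domain scores and paths is a hypothetical stand-in for the output of the path search.

```python
def identify_domain(path_by_domain):
    """path_by_domain: domain -> (Sim score, selected words).
    Returns the best domain, its score, and its keywords."""
    domain = max(path_by_domain, key=lambda d: path_by_domain[d][0])
    score, keywords = path_by_domain[domain]
    return domain, score, keywords

# Hypothetical best paths for one unit under two domains.
path_by_domain = {
    "political party": (27.0, ["seito", "senkyo"]),
    "economy": (9.5, ["toshi"]),
}
print(identify_domain(path_by_domain))
# ('political party', 27.0, ['seito', 'senkyo'])
```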
    </Section>
  </Section>
  <Section position="5" start_page="1273" end_page="1274" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1273" end_page="1274" type="sub_section">
      <SectionTitle>
5.1 Test data
</SectionTitle>
      <Paragraph position="0"> The test data are radio news stories selected from the NHK 6 o'clock radio news broadcast in August and September 1995. Some radio news stories are hard for humans to classify into a single domain. For the evaluation of the domain identification experiments, we selected news stories that two persons classified into the same domain. The units used as test data are segmented at pauses longer than 0.5 seconds. We selected 50 units of radio news for the experiments, consisting of 10 units for each domain. We used two kinds of test data. One is described with the correct phoneme sequence. The other is a phoneme lattice obtained by a phoneme recognition system (Suzuki et al., 1993). In each frame of the phoneme lattice, the number of phoneme candidates did not exceed 3. The following equations show the results of phoneme recognition.</Paragraph>
      <Paragraph position="1"> $$\frac{\text{the number of correct phonemes in phoneme lattice}}{\text{the number of uttered phonemes}}$$ $$\frac{\text{the number of correct phonemes in phoneme lattice}}{\text{the number of phoneme segments in phoneme lattice}}$$</Paragraph>
      <Paragraph position="3"/>
    </Section>
    <Section position="2" start_page="1274" end_page="1274" type="sub_section">
      <SectionTitle>
5.2 Training data
</SectionTitle>
      <Paragraph position="0"> In order to classify newspaper articles into small domains, we used an encyclopedia of current terms, "Chiezo" (Yamamoto, 1995). The encyclopedia contains 141 domains grouped into 9 large domains, and 10,236 headwords with their explanations. In order to calculate the feature vectors of the domains, all explanations in the encyclopedia were morphologically analyzed by Chasen (Matsumoto et al., 1997). We selected 9,805 nouns which appeared more than 5 times in the same domain and calculated a feature vector for each domain. Using the 141 feature vectors calculated from the encyclopedia, we automatically identified the domains of 110,000 newspaper articles in order to calculate feature vectors. We then selected 61,727 nouns which appeared at least 5 times in the newspaper articles of the same domain and calculated 141 feature vectors.</Paragraph>
    </Section>
    <Section position="3" start_page="1274" end_page="1274" type="sub_section">
      <SectionTitle>
5.3 Domain identification experiment
</SectionTitle>
      <Paragraph position="0"> The system selects a suitable domain for each unit for keyword extraction. Table 1 shows the results of domain identification. We conducted domain identification experiments using two kinds of input data (the correct phoneme sequence and the phoneme lattice) and two kinds of domains (141 domains and 9 large domains).</Paragraph>
      <Paragraph position="1"> We also compared these results with the results of our previous method (Suzuki et al., 1997). For comparison, we also ran our method with the 5 domains used by the previous method. The previous method used a keyword dictionary of 4,212 words.</Paragraph>
    </Section>
    <Section position="4" start_page="1274" end_page="1274" type="sub_section">
      <SectionTitle>
5.4 Keyword extraction experiment
</SectionTitle>
      <Paragraph position="0"> We conducted keyword extraction experiments using the method with 141 feature vectors (our method), the method with 5 feature vectors (previous method), and a method without domain identification. Table 2 shows the recall and precision, defined in formula (7) and formula (8) respectively, when the input data was the phoneme lattice.</Paragraph>
      <Paragraph position="1"> $$\text{recall} = \frac{\text{the number of correct words in MSKP}}{\text{the number of selected words in MSKP}} \quad (7)$$ $$\text{precision} = \frac{\text{the number of correct words in MSKP}}{\text{the number of correct nouns in the unit}} \quad (8)$$ MSKP: the most suitable keyword path for the selected domain.</Paragraph>
    </Section>
  </Section>
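The two measures can be computed as in the sketch below. It follows formulas (7) and (8) as printed in this paper, where recall divides by the number of selected words and precision divides by the number of correct nouns in the unit; many other works assign these two names the other way round. The function name and the example words are hypothetical.

```python
def evaluate(selected_words, correct_nouns):
    """Recall and precision as defined in formulas (7) and (8) of this paper."""
    correct_selected = [w for w in selected_words if w in correct_nouns]
    recall = len(correct_selected) / len(selected_words) if selected_words else 0.0
    precision = len(correct_selected) / len(correct_nouns) if correct_nouns else 0.0
    return recall, precision

# Hypothetical keyword path and reference nouns for one unit.
selected = ["seito", "senkyo", "toshi", "kai"]
correct = {"seito", "senkyo", "naikaku", "giin", "shusho"}
print(evaluate(selected, correct))  # (0.5, 0.4)
```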
  <Section position="7" start_page="1274" end_page="1275" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1274" end_page="1274" type="sub_section">
      <SectionTitle>
6.1 Sorting newspaper articles according to their domains
</SectionTitle>
      <Paragraph position="0"> By using $\chi^2$ values in the feature vectors, we obtained good results for the domain identification of newspaper articles. Even when a newspaper article could be classified into several domains, suitable domains were selected correctly.</Paragraph>
    </Section>
    <Section position="2" start_page="1274" end_page="1275" type="sub_section">
      <SectionTitle>
6.2 Domain identification of radio news
</SectionTitle>
      <Paragraph position="0"> Table 1 shows that when we used 141 domains and the phoneme lattice, 40% of the units were assigned to the most suitable domain by our method, and that when we used 9 domains and the phoneme lattice, 54% of the units were assigned to the most suitable domain. When the number of domains was 5, the results of our method were better than those of our previous experiment. The reason is that we use small domains. With small domains, the number of words whose $\chi^2$ values for a certain domain are high is smaller than when large domains are used. [Table 2 legend: R: recall, P: precision, DI: domain identification]</Paragraph>
      <Paragraph position="1"> To further improve domain identification, it is necessary to use a larger newspaper corpus so that the feature vectors can be calculated more precisely, and to improve phoneme recognition.</Paragraph>
    </Section>
    <Section position="3" start_page="1275" end_page="1275" type="sub_section">
      <SectionTitle>
6.3 Keyword extraction of radio news
</SectionTitle>
      <Paragraph position="0"> When we applied our method to the phoneme lattice, recall was 48.9% and precision was 38.1%. We compared this result with the result of our previous experiment (Suzuki et al., 1997). The result of our method is better than our previous result. The reason is that we used precisely classified domains, which limits the keyword search space. However, recall was only 48.9% with our method. This shows that about 50% of the selected keywords were incorrect, because the system tries to find keywords for all parts of the units. In order to raise recall, the system has to use co-occurrence between keywords in the most suitable keyword path.</Paragraph>
    </Section>
  </Section>
</Paper>