<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2225">
  <Title>Aligning Articles in TV Newscasts and Newspapers</Title>
  <Section position="3" start_page="0" end_page="1382" type="metho">
    <SectionTitle>
2 TV Newscasts and Newspapers
2.1 TV Newscasts
</SectionTitle>
    <Paragraph position="0"> In a TV newscast, events are generally reported in the following modalities: (1) image information, (2) speech information, and (3) text information (telops).</Paragraph>
    <Paragraph position="1"> In TV newscasts, image and speech information are the main modalities. However, it is difficult to obtain precise information from these modalities. Text information, on the other hand, is a secondary modality in TV newscasts, and gives us: (1) explanations of image information, (2) summaries of speech information, and (3) information which is not concerned with the reports (e.g. a time signal).</Paragraph>
    <Paragraph position="2"> Of these three types of text information, the first and second represent the contents of the reports. Moreover, text information is not difficult to extract from TV newscasts, because a lot of work has been done on character recognition and layout analysis (Sakai 93) (Mino 96) (Sato 98). Consequently, we use this textual information to align TV newscasts with the corresponding newspaper articles. The method for extracting the textual information is discussed in Section 3.1. However, we do not treat character recognition in detail, because it is beyond the main subject of this study.</Paragraph>
    <Section position="1" start_page="0" end_page="1382" type="sub_section">
      <SectionTitle>
2.2 Newspapers
</SectionTitle>
      <Paragraph position="0"> The text of a newspaper article may be divided into four parts: (1) headline, (2) explanation of pictures, (3) first paragraph, and (4) the rest. In a newspaper article, information is generally given in order of importance. In other words, the headline and the first paragraph give us the most important information, whereas the rest of the article gives us additional information. Consequently, headlines and first paragraphs contain more significant words (keywords) for representing the contents of the article than the rest.</Paragraph>
      <Paragraph position="1">  crash point, the fourth day [Croatian Minister of Domestic Affairs] &quot;All passengers were killed&quot; [Pentagon] The plane was off course. &quot;accident under bad weather condition&quot;.</Paragraph>
      <Paragraph position="2">  On the other hand, an explanation of a picture in an article tells us which persons and things in the picture are concerned with the report. For example, in Figure 2, the text in bold letters under the picture is an explanation of the picture. Consequently, explanations of pictures, like headlines and first paragraphs, contain many keywords.</Paragraph>
      <Paragraph position="3"> In this way, keywords in a newspaper article are distributed unevenly. In other words, keywords appear more frequently in the headline, the explanation of the pictures, and the first paragraph. In addition, these keywords are shared between the newspaper article and the TV newscasts. For these reasons, we align articles in TV newscasts and newspapers using the following clues: (1) the location of words in each article, (2) the frequency of words in each article, and (3) the length of words.</Paragraph>
      <Paragraph position="5"> Summary of this article: On Apr 4, the Croatian Government confirmed that Commerce Secretary Ronald H. Brown and 32 other people were all killed in the crash of a US Air Force plane near the Dubrovnik airport in the Balkans on Apr 3, 1996. It was raining hard near the airport at that time.</Paragraph>
      <Paragraph position="6"> A Pentagon spokesman said there were no signs of a terrorist act in this crash. The passengers included members of Brown's staff, private business leaders, and a correspondent for the New York Times.</Paragraph>
      <Paragraph position="7"> President Clinton, speaking at the Commerce Department, praised Brown as 'one of the best advisers and ablest people I ever knew.' On account of this accident, Vice Secretary Mary Good was appointed acting Secretary. In the Balkans, three U.S. officials on a peace mission and two U.S. soldiers were killed in Aug 1995 and Jan 1996, respectively.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="1382" end_page="1384" type="metho">
    <SectionTitle>
3 Aligning Articles in TV Newscasts and Newspapers
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1382" end_page="1382" type="sub_section">
      <SectionTitle>
3.1 Extracting Nouns from Telops
</SectionTitle>
      <Paragraph position="0"> An article in a TV newscast generally shares many words, especially nouns, with the newspaper article which reports the same event. Making use of these nouns, we align articles in the TV newscast and in the newspaper. For this purpose, we extract nouns from the telops as follows: Step 1 Extract texts from the TV images by hand.</Paragraph>
      <Paragraph position="1"> For example, we extract &quot;Okinawa ken Ohta chiji&quot; from the TV image of Figure 3. When the text is a title, we mark it as a title. It is not difficult to find title texts because they have specific expression patterns, for example, an underline (Figure 4 and a top left picture in [...] lines. Then, segment these lines at the points where the character size or the distance between characters changes. For example, the text in Figure 3 is divided into &quot;Okinawa ken [...] shows several kinds of information which are explained by telops in TV newscasts (Watanabe 96). In (Watanabe 96), a method for the semantic analysis of telops was proposed, and its rate of correct recognition was 92%.</Paragraph>
      <Paragraph position="2"> We use this method and obtain the semantic interpretation of each telop.</Paragraph>
      <Paragraph position="3"> Step 5 Extract nouns from the following kinds of telops.</Paragraph>
      <Paragraph position="4"> (1) telops which explain the contents of TV images (except &quot;time of photographing&quot; and &quot;image data&quot;), and (2) telops which explain a fact. This is because these kinds of telops may contain words suitable for aligning articles. In contrast, we do not extract nouns from the other kinds of telops. For example, we do not extract nouns from telops which are categorized as a quotation of a speech in Step 4, because a quotation of a speech is used as additional information and may contain words unsuitable for aligning articles. Figure 6 shows an example of a quotation of a speech.
      Kinds of information explained by telops:
      1. explanation of contents of a TV image: (a) explanation of a scene; (b) explanation of an element: i. person, ii. group and organization, iii. thing; (c) bibliographic information: i. time of photographing, ii. place of photographing, iii. image data
      2. quotation of a speech
      3. explanation of a fact: (a) titles of TV news; (b) diagram and table; (c) other
      4. information which is not concerned with a report: (a) current time; (b) broadcasting style; (c) names of an announcer and re-
      Example of a quotation of a speech: &quot;[...] hoshii (Give me a chance to develop our country)&quot;</Paragraph>
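The selection rule of Step 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the category and subtype labels ("image", "fact", "quotation", and so on) and the Telop record are hypothetical names introduced here, and the nouns are assumed to have been extracted from each telop beforehand.

```python
from dataclasses import dataclass

# Subtypes of image-explaining telops that Step 5 excludes.
EXCLUDED_IMAGE_SUBTYPES = {"time of photographing", "image data"}


@dataclass
class Telop:
    category: str  # hypothetical label: "image", "quotation", "fact", or "unrelated"
    subtype: str   # hypothetical label, e.g. "person" or "time of photographing"
    nouns: list    # nouns already extracted from the telop text


def nouns_for_alignment(telops):
    """Collect nouns only from telops that explain image contents or a fact."""
    selected = []
    for telop in telops:
        if telop.category == "image" and telop.subtype not in EXCLUDED_IMAGE_SUBTYPES:
            selected.extend(telop.nouns)
        elif telop.category == "fact":
            selected.extend(telop.nouns)
        # Quotations of a speech and unrelated telops contribute no nouns.
    return selected
```

Keeping the excluded subtypes in one set makes it easy to see which telops are dropped and to change the rule if other subtypes turn out to be unhelpful.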
    </Section>
    <Section position="2" start_page="1382" end_page="1384" type="sub_section">
      <SectionTitle>
3.2 Extraction of Layout Information in
Newspaper Articles
</SectionTitle>
      <Paragraph position="0"> For alignment with articles in TV newscasts, we use newspaper articles which are distributed on the Internet. The reasons are as follows: (1) the articles are created in electronic form, and (2) the articles are created by authors using HTML, which offers embedded codes (tags) to designate headlines, paragraph breaks, and so on. Taking advantage of the HTML tags, we divide newspaper articles into four parts: (1) headline, (2) explanation of pictures, (3) first paragraph, and (4) the rest. The procedure for dividing a newspaper article is as follows.</Paragraph>
      <Paragraph position="1">  1. Extract a headline using tags for headlines. 2. Divide an article into paragraphs using tags for paragraph breaks.</Paragraph>
      <Paragraph position="2"> 3. Extract paragraphs which start with the word &quot;写真&quot; (shashin, picture) as the explanation of pictures. 4. Extract the top paragraph as the first paragraph. The others are classified into the rest.</Paragraph>
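The four-step procedure above might be sketched as follows. The sketch assumes that steps 1 and 2 (tag-based extraction of the headline and paragraphs) have already been done, that the picture marker is the string "写真" (an assumption based on the gloss "shashin"), and that the "top paragraph" means the first paragraph remaining after picture explanations are removed. The function name and the dictionary keys are hypothetical.

```python
def divide_article(headline, paragraphs, picture_marker="写真"):
    """Split an article into headline, picture explanations, first paragraph, and the rest.

    headline: text already extracted with the headline tags (step 1).
    paragraphs: list of paragraph texts split at paragraph-break tags (step 2).
    picture_marker: assumed prefix of picture explanations (step 3).
    """
    pictures = [p for p in paragraphs if p.startswith(picture_marker)]
    body = [p for p in paragraphs if not p.startswith(picture_marker)]
    return {
        "headline": headline,
        "pictures": pictures,
        # Step 4: the top remaining paragraph is the first paragraph.
        "first_paragraph": body[0] if body else "",
        "rest": body[1:],
    }
```

Returning the parts under fixed keys lets a later scoring step look up word frequencies per location without re-parsing the article.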
    </Section>
    <Section position="3" start_page="1384" end_page="1384" type="sub_section">
      <SectionTitle>
3.3 Procedure for Aligning Articles
</SectionTitle>
      <Paragraph position="0"> Before aligning articles in TV newscasts and newspapers, we choose corresponding TV newscasts and newspapers. For example, an evening TV newscast is aligned with the evening paper of the same day and with the morning paper of the next day. We align articles within these pairs of TV newscasts and newspapers.</Paragraph>
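The pairing rule for an evening newscast can be written down directly. The sketch below covers only the evening case given as the example; the function name and the "evening"/"morning" edition labels are hypothetical, and no rule for other newscasts is implied.

```python
from datetime import date, timedelta


def papers_for_evening_newscast(broadcast_day):
    """Return the (date, edition) pairs searched for an evening newscast."""
    return [
        (broadcast_day, "evening"),                      # evening paper, same day
        (broadcast_day + timedelta(days=1), "morning"),  # morning paper, next day
    ]
```

For a newscast broadcast on date(1996, 4, 4), this yields the evening paper of April 4 and the morning paper of April 5.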
      <Paragraph position="1"> The alignment process consists of two steps. First, we calculate reliability scores for an article in the TV newscasts with each article in the corresponding newspapers. Then, we select the newspaper article with the maximum reliability score as the corresponding one. If the maximum score is less than the given threshold, the articles are not aligned.</Paragraph>
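The selection step can be illustrated with a short Python sketch. It assumes the reliability scores have already been computed and are passed in as a dictionary from a newspaper article identifier to its score; the function name and this data layout are hypothetical.

```python
def select_alignment(scores, threshold):
    """Return the id of the best-scoring newspaper article.

    Returns None when there are no candidates or when the maximum
    score is less than the threshold, i.e. the articles are not aligned.
    """
    if not scores:
        return None
    best_id = max(scores, key=scores.get)
    if scores[best_id] >= threshold:
        return best_id
    return None
```

Returning None rather than the best candidate makes the "not aligned" outcome explicit, so a TV news article with no sufficiently similar newspaper article is left unpaired.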
      <Paragraph position="2"> As mentioned earlier, we calculate the reliability scores using the following clues: (1) the location of words in each article, (2) the frequency of words in each article, and (3) the length of words.</Paragraph>
      <Paragraph position="3">  If we are given a TV news article x and a newspaper article y, we obtain the reliability score by using the words k (k = 1, ..., N) which are extracted from the TV news article x:</Paragraph>
      <Paragraph position="5"> where w(i, j) is the weight assigned according to the location of word k in each article. We fixed the values of w(i, j) as shown in Table 1. As Table 1 shows, we divided a newspaper article into four parts: (1) title, (2) explanation of pictures, (3) first paragraph, and (4) the rest. Also, we divided the text in a TV newscast into two parts: (1) title and (2) the rest. This is because keywords are distributed unevenly in newspaper articles and TV newscasts. f_paper(i, k) and f_TV(j, k) are the frequencies of the word k in location i of the newspaper article and in location j of the TV news article, respectively, and length(k) is the length of the word k.</Paragraph>
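The score equation itself is not reproduced in this text, so the following Python sketch rests on an assumption: that the score is the sum, over every word k and every pair of locations (i, j), of w(i, j) * f_paper(i, k) * f_TV(j, k) * length(k), which is one form consistent with the definitions above. The weights are supplied by the caller because the values of Table 1 are not given here; the function name, the part labels, and the use of character count for length(k) are likewise hypothetical.

```python
from collections import Counter

PAPER_PARTS = ("title", "pictures", "first_paragraph", "rest")  # locations i
TV_PARTS = ("title", "rest")                                    # locations j


def reliability_score(tv_parts, paper_parts, weights):
    """Assumed form: sum of w(i, j) * f_paper(i, k) * f_TV(j, k) * length(k).

    tv_parts / paper_parts: dict from location label to a list of words.
    weights: dict from (paper location, TV location) to w(i, j).
    """
    tv_counts = {j: Counter(tv_parts.get(j, [])) for j in TV_PARTS}
    paper_counts = {i: Counter(paper_parts.get(i, [])) for i in PAPER_PARTS}

    # The words k are those extracted from the TV news article.
    tv_words = set()
    for counts in tv_counts.values():
        tv_words.update(counts)

    score = 0.0
    for k in tv_words:
        for i in PAPER_PARTS:
            for j in TV_PARTS:
                # A Counter returns 0 for absent words, so unmatched terms vanish.
                score += weights[(i, j)] * paper_counts[i][k] * tv_counts[j][k] * len(k)
    return score
```

Under this assumed form, a word contributes only when it occurs in both articles, and longer words shared between title locations contribute most when w("title", "title") is the largest weight.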
    </Section>
  </Section>
</Paper>