<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2097">
  <Title>Unsupervised Topic Identification by Integrating Linguistic and Visual Information Based on Hidden Markov Models</Title>
  <Section position="7" start_page="758" end_page="760" type="evalu">
    <SectionTitle>
5 Experiments and Discussion
5.1 Data
</SectionTitle>
    <Paragraph position="0"> To demonstrate the effectiveness of our proposed method, we made experiments on two kinds of cooking TV programs: NHK &amp;quot;Today's Cooking&amp;quot;  and NTV &amp;quot;Kewpie 3-Min Cooking&amp;quot;. Table 4 presents the characteristics of the two programs. Note that time stamps of closed captions synchronize themselves with the video stream. Extracted &amp;quot;pseudo-labeled&amp;quot; data by the expression mentioned in Section 4.2 are 525 clauses out of 13564 (3.87%) in &amp;quot;Today's Cooking&amp;quot;, and 107 clauses out of 1865 (5.74%) in &amp;quot;Kewpie 3-Min Cooking&amp;quot;.</Paragraph>
    <Section position="1" start_page="760" end_page="760" type="sub_section">
      <SectionTitle>
5.2 Experiments and Discussion
</SectionTitle>
      <Paragraph position="0"> We conducted the experiment of the topic identification. We first trained HMM parameters for each program, and then applied the trained model to five videos each, in which, we manually assigned appropriate topics to clauses. Table 5 gives the evaluation results. The unit of evaluation was a clause. The accuracy was improved by integrating linguistic and visual information compared to using linguistic / visual information alone. (Note that &amp;quot;visual information&amp;quot; uses pseudo-labeled data.) In addition, the accuracy was improved by using various discourse features.</Paragraph>
      <Paragraph position="1"> The reason why silence did not contribute to accuracy improvement is supposed to be that closed captions and video streams were not synchronized precisely due to time lagging of closed captions.</Paragraph>
      <Paragraph position="2"> To deal with this problem, an automatic closed caption alignment technique (Huang et al., 2003) will be applied or automatic speech recognition will be used as texts instead of closed captions with the advance of speech recognition technology. null Figure 3 illustrates an improved example by adding visual information. In the case of using only linguistic information, this topic was rec-First, saute and body.</Paragraph>
      <Paragraph position="3">  ognized as sauteing, but this topic was actually preparation, which referred to the next topic. By using the visual information that background color was white, this topic was correctly recognized as preparation.</Paragraph>
      <Paragraph position="4"> We conducted another experiment to demonstrate the validity of several linguistic processes, such as utterance-type recognition and word sense disambiguation with case frames, for extracting linguistic information from closed captions described in Section 3.1.1. We compared our method to three methods: a method that does not perform word sense disambiguation with case frames (w/o cf), a method that does not perform utterance-type recognition for extracting actions (uses all utterance-type texts) (w/o utype), a method, in which a sentence is emitted according to a state-specific language model (bigram) as Barzilay and Lee adopted (bigram). Figure 6 gives the experimental result, which demonstrates our method is appropriate.</Paragraph>
      <Paragraph position="5"> One cause of errors in topic identification is that some case frames are incorrectly constructed. For example, kiru:1 (cut) contains &amp;quot;J ~(cut a vegetable)&amp;quot; and &amp;quot; ~(drain oil)&amp;quot;. This leads to incorrect parameter training. Other cause is that some verbs are assigned to an inaccurate case frame by the failure of case analysis.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>