<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2097"> <Title>Unsupervised Topic Identification by Integrating Linguistic and Visual Information Based on Hidden Markov Models</Title> <Section position="5" start_page="756" end_page="758" type="metho"> <SectionTitle> 3 Features for Topic Identification </SectionTitle> <Paragraph position="0"> First, we describe the features that we use for topic identification, which are listed in Table 1.</Paragraph> <Paragraph position="1"> They come from three modalities: linguistic, visual, and audio.</Paragraph>

Table 1: Features for topic identification.
Modality    | Feature                            | Dependence          | Clue
linguistic  | case frame                         | domain dependent    | utterance generalization
linguistic  | cue phrases                        | domain independent  | topic change
linguistic  | noun chaining                      | domain independent  | topic persistence
linguistic  | verb chaining                      | domain independent  | topic persistence
visual      | background image (bottom of image) | domain dependent    | --
audio       | silence                            | domain independent  | topic change

<Paragraph position="2"> We utilize as linguistic information the instructor's utterances in the video, which can be divided into various types such as actions, tips, and even small talk. Among them, actions, such as cut, peel, and grease a pan, are dominant and are expected to be useful for topic identification, while the others can be noise. When analyzing utterances in video, it is natural to utilize visual information as well as linguistic information for robust analysis. We utilize the background image as visual information. For example, frying and boiling are usually performed on a gas range, and preparation and dishing up are usually performed on a cutting board.</Paragraph> <Paragraph position="3"> Furthermore, we utilize cue phrases and silence as clues to a topic shift, and noun/verb chaining as a clue to topic persistence.</Paragraph> <Paragraph position="4"> We describe these features in detail in the following sections.</Paragraph> <Section position="1" start_page="756" end_page="758" type="sub_section"> <SectionTitle> 3.1 Linguistic Features </SectionTitle> <Paragraph position="0"> Closed captions of Japanese cooking TV programs are used as a source for extracting linguistic features. Each sentence is processed with the Japanese morphological analyzer JUMAN (Kurohashi et al., 1994), and syntactic/case analysis and anaphora resolution are performed with the Japanese analyzer KNP (Kurohashi and Nagao, 1994). Then, we perform the following process to extract linguistic features.</Paragraph> <Paragraph position="1"> Considering a clause as a basic unit, utterances referring to an action are extracted in the form of a case frame, which is assigned by case analysis. This procedure is performed for generalization and word sense disambiguation. For example, utterances meaning "add salt" and "add sugar into a pan" are assigned to case frame ireru:1 (add), and one meaning "carve with a knife" is assigned to case frame ireru:2 (carve). We describe this procedure in detail below.

Utterance-type recognition: To extract utterances referring to actions, we classify utterances into the types listed in Table 2. Note that actions have two levels: [action declaration] is a declaration of beginning a series of actions, and [individual action] is a single action at the finest level. In this paper, [ ] denotes an utterance type.

Table 2: Utterance types (recoverable rows): [action declaration] ex. "Then, we'll cook a steak.", "OK, we'll fry."; [individual action] ex. "Cut off the stem of this eggplant."

Input sentences are first segmented into clauses, and the utterance type of each clause is recognized. Among the utterance types, [individual action], [food/tool presentation], [substitution], [note], and [small talk] can be recognized by clause-end patterns. We prepare approximately 500 patterns for recognizing the utterance type. As for [individual action] and [food state], considering the portability of our system, we use general rules: intransitive verbs or adjective + "become" are treated as [food state], and the others as [individual action].</Paragraph>
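To make the clause-end pattern matching concrete, here is a minimal Python sketch. The romanized patterns and the fallback flag are hypothetical stand-ins for the roughly 500 handcrafted Japanese patterns, not the actual rules:

```python
import re

# Hypothetical clause-end patterns; the actual system uses ~500 handcrafted
# Japanese patterns. Each entry maps a clause-end regex to an utterance type.
CLAUSE_END_PATTERNS = [
    (re.compile(r"mashou$"), "[action declaration]"),     # "let's ..."
    (re.compile(r"te kudasai$"), "[individual action]"),  # "please ..."
    (re.compile(r"demo ii desu$"), "[substitution]"),
    (re.compile(r"ni narimasu$"), "[food state]"),        # "... becomes ..."
]

def recognize_utterance_type(clause: str, is_intransitive: bool = False) -> str:
    """Assign an utterance type to a clause by matching clause-end patterns."""
    for pattern, utype in CLAUSE_END_PATTERNS:
        if pattern.search(clause):
            return utype
    # General fallback rule from the text: intransitive verbs (or
    # adjective + "become") are [food state]; the rest, [individual action].
    return "[food state]" if is_intransitive else "[individual action]"
```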
<Paragraph position="2"> Action extraction: We extract clauses whose utterance type is recognized as an action ([action declaration] or [individual action]). For example, "peel" and "cut" are extracted from the following sentence: (4) (We'll peel this carrot and cut it in half.)</Paragraph> <Paragraph position="4"> We make two exceptions to reduce noise. One is that clauses are not extracted from a sentence whose sentence-end clause's utterance type is not recognized as an action; in one such example, "simmer" and "cut" are not extracted for this reason. The other involves the relation between clauses, as in "We cut in this cherry tomato, because we'll fry it in oil." Note that relations between clauses are recognized by clause-end patterns.</Paragraph> <Paragraph position="5"> Verb sense disambiguation by assignment to a case frame: In general, a verb has multiple meanings/usages. For example, the verb "ireru" has multiple usages, "add salt" and "carve with a knife", which appear in different topics. We therefore extract not the surface form of a verb but its case frame, which is assigned by case analysis. Case frames are automatically constructed from Web cooking texts (12 million sentences) by clustering similar verb usages (Kawahara and Kurohashi, 2002). An example of an automatically constructed case frame is shown in Table 3. For example, "add salt" is assigned to case frame ireru:1 (add), and "carve with a knife" is assigned to case frame ireru:2 (carve).</Paragraph> <Paragraph position="6"> As Grosz and Sidner (1986) pointed out, cue phrases such as now and well serve to indicate a topic change. We use approximately 20 domain-independent cue phrases, such as those meaning "then" and "next".</Paragraph> <Paragraph position="7"> In text segmentation algorithms such as TextTiling (Hearst, 1997), lexical chains are widely utilized for detecting a topic shift. We utilize such a feature as a clue to topic persistence.</Paragraph> <Paragraph position="8"> When two consecutive actions are performed on the same ingredient, their topics are often identical. For example, because "grate" and "raise" are performed on the same ingredient, "turnip", the topics (in this instance, preparation) of the two utterances are identical: (5) a. (We'll grate a turnip.) b. (Raise this turnip on this basket.) However, in spoken language there are many omissions, so noun chaining often cannot be detected by surface word matching. Therefore, we detect noun chaining using the anaphora resolution results for verbs and nouns, obtained with the methods of Kawahara and Kurohashi (2004) and Sasano et al. (2004), respectively. Similarly, when the verb of a clause is identical to that of the previous clause, the two clauses are likely to have the same topic; we use whether two adjoining clauses contain an identical verb as an observed feature.</Paragraph>
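As an illustration of the noun/verb chaining features, the following sketch assumes each clause has already been reduced (by case analysis and anaphora resolution) to a verb and a set of ingredient arguments; the class and field names are ours, not the paper's:

```python
from dataclasses import dataclass, field

@dataclass
class Clause:
    verb: str                                       # verb after analysis
    ingredients: set = field(default_factory=set)   # anaphora-resolved arguments

def chaining_features(prev: Clause, curr: Clause) -> dict:
    """Binary topic-persistence features between two adjoining clauses.

    Noun chaining fires when the clauses share an ingredient (computed on
    anaphora-resolved arguments, since surface matching misses omissions);
    verb chaining fires when the two clauses contain an identical verb.
    """
    return {
        "noun_chaining": bool(prev.ingredients & curr.ingredients),
        "verb_chaining": prev.verb == curr.verb,
    }

# Example (5): both clauses act on the same ingredient "turnip", so noun
# chaining fires, suggesting the topic (preparation) persists.
a = Clause(verb="grate", ingredients={"turnip"})
b = Clause(verb="raise", ingredients={"turnip", "basket"})
print(chaining_features(a, b))  # {'noun_chaining': True, 'verb_chaining': False}
```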
</Section> <Section position="2" start_page="758" end_page="758" type="sub_section"> <SectionTitle> 3.2 Image Features </SectionTitle> <Paragraph position="0"> It is difficult for current image processing techniques to extract what object appears or what action is being performed in a video unless a detailed object/action model for a specific domain is constructed by hand. Therefore, following Hamada et al. (2000), we focus our attention on the color distribution at the bottom of the image, which is comparatively easy to exploit. As shown in Figure 1, we utilize the centroid of the RGB values in the bottom of the image at each clause.</Paragraph> </Section> <Section position="3" start_page="758" end_page="758" type="sub_section"> <SectionTitle> 3.3 Audio Features </SectionTitle> <Paragraph position="0"> A cooking video contains various types of audio information, such as the instructor's speech, cutting sounds, and sizzling sounds. If cutting or sizzling sounds could be distinguished from other sounds, they could aid topic identification, but they are difficult to recognize.</Paragraph> <Paragraph position="1"> As Galley et al. (2003) pointed out, a longer silence often appears when the topic changes, and we can utilize it as a clue to a topic change. In this study, silence is automatically extracted by finding periods in which the amplitude stays below a certain level for more than one second.</Paragraph> </Section> </Section> <Section position="6" start_page="758" end_page="758" type="metho"> <SectionTitle> 4 Topic Identification based on HMMs </SectionTitle> <Paragraph position="0"> We employ HMMs for topic identification, where a hidden state corresponds to a topic and the various features described in Section 3 are observed.</Paragraph> <Paragraph position="1"> In our model, taking the case frame as the basic unit, the case frame and the background image are observed from a state, and the discourse features indicating topic shift/persistence (cue phrases, noun/verb chaining, and silence) are observed when the state transits.</Paragraph> <Section position="1" start_page="758" end_page="758" type="sub_section"> <SectionTitle> 4.1 Parameters </SectionTitle> <Paragraph position="0"> The HMM parameters are as follows:

* initial state probability $\pi_i$: the probability that state $s_i$ is a start state.
* state transition probability $a_{ij}$: the probability that state $s_i$ transits to state $s_j$.
* observation probability $b_i(o)$: the probability that state $s_i$ emits symbol $o$ (the case frame and the background image).
* transition observation probability $b_{ij}(o_d)$: the probability that the discourse features $o_d$ are emitted when state $s_i$ transits to state $s_j$. This probability is defined as the product of the observation probabilities of the individual features (cue phrase, noun chaining, verb chaining, silence). The observation probability of each feature does not depend on the particular states $s_i$ and $s_j$, but only on whether they are the same or different. For example, in the case of a cue phrase $c$, the probability is given by:

$$ b_{ij}(c) = \begin{cases} P(c \mid \text{same topic}) & (i = j) \\ P(c \mid \text{different topic}) & (i \neq j) \end{cases} $$</Paragraph> </Section>
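A minimal sketch of this transition-time observation probability, assuming binary discourse features; the parameter names and probability values below are illustrative stand-ins (in the model they are learned, as described in Section 4.2):

```python
# Illustrative probabilities for each binary discourse feature, conditioned
# only on whether the two states (topics) are the same or different; the
# numbers here are made up for demonstration.
P_SAME = {"cue": 0.05, "noun_chain": 0.40, "verb_chain": 0.30, "silence": 0.10}
P_DIFF = {"cue": 0.50, "noun_chain": 0.05, "verb_chain": 0.05, "silence": 0.45}

def transition_obs_prob(i: int, j: int, observed: dict) -> float:
    """b_ij(o_d): product over features of p(feature fired | same/different)."""
    table = P_SAME if i == j else P_DIFF
    prob = 1.0
    for feat, p_fire in table.items():
        prob *= p_fire if observed[feat] else (1.0 - p_fire)
    return prob

# A cue phrase plus a long silence: much more likely under a topic change.
obs = {"cue": True, "noun_chain": False, "verb_chain": False, "silence": True}
print(transition_obs_prob(0, 0, obs))  # same topic: small
print(transition_obs_prob(0, 1, obs))  # different topic: larger
```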
<Section position="2" start_page="758" end_page="758" type="sub_section"> <SectionTitle> 4.2 Parameter Estimation </SectionTitle> <Paragraph position="0"> We apply the Baum-Welch algorithm to estimate these parameters. To achieve high accuracy with the Baum-Welch algorithm, which is an unsupervised learning method, some labeled data are usually required, or proper initial parameters must be set based on domain-specific knowledge. These requirements, however, make it difficult to extend the method to other domains. We therefore automatically extract "pseudo-labeled" data by focusing on the following linguistic expression: if a clause has the utterance type [action declaration] and the original form of its verb corresponds to a topic, the clause's topic is set to that topic. Recall that [action declaration] is a declaration of starting a series of actions. For example, in Figure 1, the topic of the clause "We'll saute." is set to sauteing, because its utterance type is recognized as [action declaration] and the original form of its verb corresponds to the topic sauteing.</Paragraph> <Paragraph position="1"> Using a small amount of "pseudo-labeled" data as well as the unlabeled data, we train the HMM parameters. Once the HMM parameters are trained, topic identification is performed with the standard Viterbi algorithm.</Paragraph> </Section> </Section> </Paper>
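For illustration, a sketch of the pseudo-labeling rule under the assumption that topics are named by verbs; the verb-to-topic table and the clause representation here are hypothetical:

```python
# A sketch of the pseudo-labeling rule of Section 4.2; the verb-to-topic
# mapping below is a hypothetical example, not the paper's actual table.
TOPIC_VERBS = {"saute": "sauteing", "boil": "boiling", "fry": "frying"}

def pseudo_label(clauses):
    """Assign a topic only to [action declaration] clauses whose verb's
    original form directly names a topic; all other clauses stay unlabeled
    and contribute to Baum-Welch training as unlabeled data."""
    return [
        TOPIC_VERBS[verb]
        if utype == "[action declaration]" and verb in TOPIC_VERBS
        else None
        for utype, verb in clauses
    ]

clauses = [("[action declaration]", "saute"), ("[individual action]", "cut")]
print(pseudo_label(clauses))  # ['sauteing', None]
```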