Time Period Identification of Events in Text
Taichi Noro, Takashi Inui, Hiroya Takamura
Sydney, July 2006. (c) 2006 Association for Computational Linguistics

3 Corpus

In this section, we describe a corpus built from blog entries. The corpus is used as training and test data for the machine learning methods described in Section 4.

The blog entries were collected by the method of Nanno et al. (2004). All entries are written in Japanese, and all entries were split into sentences automatically by heuristic rules. In the next section, we explain the "time-slot" tag attached to every sentence.

3.1 Time-Slot Tag

The "time-slot" tag represents when an event occurs, using five classes: "morning", "daytime", "evening", "night", and "time-unknown". "Time-unknown" means that the sentence carries no temporal information. We set the criteria for the time-slot tags as follows:

Morning: 04:00-10:59; from early morning until before noon; breakfast
Daytime: 11:00-15:59; from noon until before dusk; lunch
Evening: 16:00-17:59; from dusk until before sunset
Night: 18:00-03:59; from sunset until dawn; dinner

Note that the above criteria are only rough standards; we consider the time-slot recognized by the author to be more important. For example, given the expression "about 3 o'clock this morning", we judge the sentence as "morning" (not "night") because the author wrote "this morning".

To annotate sentences, we used two different clues. One is an explicit temporal expression or time-associated word in the sentence to be judged. The other is contextual information around the sentence to be judged. The following examples illustrate the former case:

Example 1
a. I went to the post office by bicycle in the morning.
b. I had spaghetti at a restaurant at noon.
c. I cooked stew for dinner that day.

Example 2
1. I went to X by bicycle in the morning.
2. I went to a shop on the way back from X.

Suppose that the two sentences in Example 2 appear successively in a document. We first judge the first sentence as morning. We then also judge the second sentence as morning, using contextual information (the preceding sentence was judged as morning), although we cannot determine the time period from the content of the second sentence itself.
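When an explicit clock time is available, the criteria above are straightforward to operationalize. The following is a minimal sketch (the function name and hour-based interface are our own; as the paper notes, an author's own wording such as "this morning" overrides these boundaries, and sentences with no temporal clue fall into "time-unknown" instead):

```python
def time_slot(hour: int) -> str:
    """Map an hour of day (0-23) to one of the four time-slot classes.

    Follows the rough criteria of Section 3.1; it cannot assign
    "time-unknown", which applies when no time clue exists at all.
    """
    if 4 <= hour <= 10:
        return "morning"   # 04:00-10:59
    if 11 <= hour <= 15:
        return "daytime"   # 11:00-15:59
    if 16 <= hour <= 17:
        return "evening"   # 16:00-17:59
    return "night"         # 18:00-03:59

assert time_slot(9) == "morning"
assert time_slot(2) == "night"   # 18:00-03:59 wraps past midnight
```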
3.2 Corpus Statistics

We manually annotated the corpus. The corpus contains 7,413 blog entries and 70,775 sentences; of these 70,775 sentences, 14,220 express an event. The frequency distribution of the time-slot tags is shown in Table 1. The table shows that the number of time-unknown sentences is much larger than that of any other class. This bias would affect our classification process, and we therefore propose a method for tackling the problem.

4 Proposed Method

The aim of this study is time-slot classification of events; we therefore treat only sentences expressing an event.

4.1 Basic Idea

Suppose, for example, that "breakfast" is a strong clue for the morning class, i.e., that it is a time-associated word of morning. We can then classify the sentence "I have cereal for breakfast." into the morning class, after which "cereal" becomes a time-associated word of morning and can itself be used as a clue for time-slot classification. By iterating this process, we can obtain many time-associated words through a bootstrapping method, improving sentence classification performance at the same time.

To realize the bootstrapping method, we use the EM algorithm. This algorithm has a theoretical basis in likelihood maximization of incomplete data and can enhance supervised learning methods. We specifically adopted the combination of the Naive Bayes classifier and the EM algorithm.

4.2 Naive Bayes Classifier

In this section, we describe the multinomial model, a variant of the Naive Bayes classifier. The occurrence of a sentence is modeled as a set of trials in which a word is drawn from the whole vocabulary. The generative probability of an example x given a category c has the form:

  P(x \mid c; \theta) = P(|x|)\, |x|! \prod_w \frac{P(w \mid c)^{N(w,x)}}{N(w,x)!}    (1)

where P(|x|) denotes the probability that a sentence of length |x| occurs, and N(w,x) denotes the number of occurrences of word w in text x.

In time-slot classification, x corresponds to a sentence and c corresponds to one of the time-slots in {morning, daytime, evening, night}. The features are the words in the sentence; a detailed description of the features is given in Section 4.5.
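As an illustration, here is a minimal bag-of-words sketch of such a multinomial classifier. The identifiers and the add-one smoothing are our choices, not the paper's; the length and factorial terms of Eq. (1) are dropped because they do not depend on c:

```python
import math
from collections import Counter

CLASSES = ["morning", "daytime", "evening", "night"]

def train(labeled):
    """labeled: list of (list_of_words, class) pairs."""
    prior = Counter(c for _, c in labeled)
    counts = {c: Counter() for c in CLASSES}
    vocab = set()
    for words, c in labeled:
        counts[c].update(words)
        vocab.update(words)
    return prior, counts, vocab

def classify(words, prior, counts, vocab):
    total = sum(prior.values())
    def log_posterior(c):
        # log P(c) + sum over tokens of log P(w|c), with add-one smoothing.
        lp = math.log((prior[c] + 1) / (total + len(CLASSES)))
        denom = sum(counts[c].values()) + len(vocab)
        return lp + sum(math.log((counts[c][w] + 1) / denom) for w in words)
    return max(CLASSES, key=log_posterior)
```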
This combination has been proven effective in text classification (Nigam et al., 2000).

4.3 Combining the Naive Bayes Classifier with the EM Algorithm

The EM algorithm (Dempster et al., 1977) is a method for estimating a model that has the maximal likelihood of the data when some variables cannot be observed (these variables are called latent variables). Nigam et al. (2000) proposed a combination of the Naive Bayes classifier and the EM algorithm.

Ignoring the factors of Eq. (1) that are irrelevant to classification, we express the model parameters as \theta. If we regard c as a latent variable and introduce a Dirichlet distribution as the prior distribution for the parameters, the Q-function (i.e., the expected log-likelihood) of this model is defined as:

  Q(\theta \mid \bar{\theta}) = \log P(\theta) + \sum_{x \in D} \sum_c P(c \mid x; \bar{\theta}) \log \{ P(c \mid \theta) P(x \mid c; \theta) \}    (4)

where P(\theta) is the Dirichlet prior with a user-given parameter and D is the set of examples used for model estimation.

From this Q-function we obtain the EM update equations. The E-step is

  P(c \mid x; \bar{\theta}) = \frac{P(c \mid \bar{\theta}) P(x \mid c; \bar{\theta})}{\sum_{c'} P(c' \mid \bar{\theta}) P(x \mid c'; \bar{\theta})}    (5)

and the M-step re-estimates the word probabilities as

  P(w \mid c; \theta) = \frac{1 + \sum_{x \in D} N(w,x) P(c \mid x; \bar{\theta})}{|W| + \sum_{w'} \sum_{x \in D} N(w',x) P(c \mid x; \bar{\theta})}

where |W| denotes the number of feature varieties (the vocabulary size). For a labeled example x, Eq. (5) is not used; instead, P(c \mid x; \bar{\theta}) is set to 1.0 if c is the category of x, and to 0 otherwise.

Instead of the usual EM algorithm, we use the tempered EM algorithm (Hofmann, 2001), which allows us to coordinate the complexity of the model. We realize this algorithm by substituting the following equation for Eq. (5) at the E-step:

  P(c \mid x; \bar{\theta}) \propto \{ P(c \mid \bar{\theta}) P(x \mid c; \bar{\theta}) \}^{\beta}

where \beta is a positive hyper-parameter coordinating the complexity of the model. By decreasing \beta, we can reduce the influence of intermediate classification results when those results are unreliable.

Too much influence from unlabeled data sometimes deteriorates the model estimation. We therefore introduce another hyper-parameter \lambda (0 \le \lambda \le 1) that acts as a weight on unlabeled data, replacing the second term on the right-hand side of Eq. (4) with

  \sum_{x \in D_l} \sum_c P(c \mid x; \bar{\theta}) \log \{ P(c \mid \theta) P(x \mid c; \theta) \} + \lambda \sum_{x \in D_u} \sum_c P(c \mid x; \bar{\theta}) \log \{ P(c \mid \theta) P(x \mid c; \theta) \}

where D_l and D_u denote the labeled and unlabeled data, respectively. We can reduce the influence of unlabeled data by decreasing the value of \lambda. We derived new update rules from this new Q-function. The EM computation stops when the difference in values of the Q-function becomes smaller than a threshold.
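To make the tempered, \lambda-weighted procedure concrete, here is a sketch of the two halves of one EM iteration. The identifiers, the add-one smoothing, and the data layout are our assumptions; the paper's exact update rules are derived from its modified Q-function:

```python
import math
from collections import Counter

CLASSES = ["morning", "daytime", "evening", "night"]

def tempered_e_step(log_joint, beta):
    """E-step of tempered EM: P(c|x) proportional to {P(c)P(x|c)}**beta.

    log_joint maps each class to log{P(c)P(x|c)} for one example x.
    """
    m = max(log_joint.values())
    unnorm = {c: math.exp(beta * (lj - m)) for c, lj in log_joint.items()}
    z = sum(unnorm.values())
    return {c: u / z for c, u in unnorm.items()}

def m_step(labeled, unlabeled, posteriors, lam, vocab):
    """M-step with the unlabeled contribution down-weighted by lam (λ).

    labeled: list of (Counter_of_words, class); for these, P(c|x) is
    fixed to 1.0 for the true class. unlabeled: list of Counter_of_words
    with matching E-step posteriors.
    """
    word_counts = {c: Counter() for c in CLASSES}
    class_mass = {c: 1.0 for c in CLASSES}   # add-one smoothing (our choice)
    for words, c in labeled:
        word_counts[c].update(words)
        class_mass[c] += 1.0
    for words, post in zip(unlabeled, posteriors):
        for c, p in post.items():
            class_mass[c] += lam * p
            for w, n in words.items():
                word_counts[c][w] += lam * p * n
    total = sum(class_mass.values())
    prior = {c: class_mass[c] / total for c in CLASSES}
    cond = {}
    for c in CLASSES:
        denom = len(vocab) + sum(word_counts[c].values())
        cond[c] = {w: (1.0 + word_counts[c][w]) / denom for w in vocab}
    return prior, cond
```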
4.4 Class Imbalance Problem

We have two problems with respect to the "time-unknown" tag.

The first is the class imbalance problem (Japkowicz, 2000). As shown in Table 1, the number of time-unknown sentences is much larger than that of the other classes: there are more than ten times as many time-unknown sentences as sentences of the other time-slots.

Second, the sentences categorized as "time-unknown" contain no time-associated words, so their feature distribution differs markedly from that of the other classes. They can therefore be expected to adversely affect the proposed method.

Several methodologies have been proposed for the class imbalance problem, such as Zhang and Mani (2003), Fan et al. (1999), and Abe et al. (2004). In our case, however, we must resolve the second problem in addition to the class imbalance problem. To deal with both problems simultaneously and precisely, we developed a cascaded classification procedure.

4.5 Time-Slot Classification Method

To avoid the problems described above, it is desirable for the NB+EM process to handle only "time-known" sentences. For that purpose, we prepare another classifier that filters out time-unknown sentences before the NB+EM process. We thus propose a two-step classification method (Method A). The flow of the two-step classification is shown in Figure 1; in the figure, ovals represent classifiers and arrows represent the flow of data.

The first classifier (hereafter, the "time-unknown" filter) classifies sentences into two classes: "time-unknown" and "time-known". The "time-known" class is a coarse class consisting of the four time-slots (morning, daytime, evening, and night). We use Support Vector Machines as this classifier, with all words included in the sentence to be classified as features.

The second classifier (the time-slot classifier) classifies "time-known" sentences into the four classes. We use the Naive Bayes classifier backed up with the Expectation Maximization (EM) algorithm described in Section 4.3.

The features for the time-slot classifier are words whose part of speech is noun or verb. This feature set is called NORMAL in the rest of this paper. In addition, we use information from the previous and following sentences in the blog entry: the words included in those sentences are also used as features. This feature set is called CONTEXT. The CONTEXT features should be effective for estimating the time-slot of sentences such as the second sentence of Example 2 in Section 3.1. A minimal sketch of the cascade appears below.

We also use a simple classifier (Method B) for comparison. Method B classifies sentences into all five time-slots (morning through night, plus time-unknown) in a single step. We again use the Naive Bayes classifier backed up with the EM algorithm, with the words (nouns and verbs) included in the sentence to be classified as features.
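The following sketch shows the two-step Method A pipeline, assuming scikit-learn's LinearSVC as a stand-in for the paper's SVM implementation and a pre-trained NB+EM time-slot classifier; the identifiers and the bag-of-words feature extraction are our simplifications:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def train_unknown_filter(sentences, labels):
    """Step 1 trainer; labels are 'time-known' vs. 'time-unknown'."""
    vectorizer = CountVectorizer()            # all words as features
    X = vectorizer.fit_transform(sentences)
    return vectorizer, LinearSVC().fit(X, labels)

def method_a(sentence, vectorizer, svm_filter, timeslot_classifier):
    """Two-step cascade: filter time-unknown, then classify time-known."""
    if svm_filter.predict(vectorizer.transform([sentence]))[0] == "time-unknown":
        return "time-unknown"
    # Step 2: the NB+EM classifier of Section 4.3 over the four slots.
    return timeslot_classifier(sentence)
```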