<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1145"> <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics Time Period Identification of Events in Text Taichi Noro + Takashi Inui ++ Hiroya Takamura ++</Title> <Section position="3" start_page="0" end_page="1153" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In recent years, the spread of the internet has accelerated, and documents on the internet have become increasingly important as targets of business marketing. These circumstances have prompted many studies on information extraction from text, especially text on the internet, such as sentiment analysis and extraction of location information.</Paragraph> <Paragraph position="1"> In this paper, we focus on the extraction of temporal information. Authors of documents on the web often write about events in their daily lives. Identifying when these events occur provides valuable information. For example, we can use temporal information as a new axis in information retrieval. From time-annotated text, companies can figure out when customers use their products. We can also explore users' activities for market research, answering questions such as &quot;What do people eat in the morning?&quot; and &quot;What do people spend money on in the daytime?&quot; Most previous work on temporal processing of events in text has dealt only with newswire text. Those studies assume that temporal expressions indicating the time period of events are often written explicitly in the text. Examples of explicit temporal expressions include &quot;on March 23&quot; and &quot;at 7 p.m.&quot;.</Paragraph> <Paragraph position="2"> However, other types of text, including web diaries and blogs, contain few explicit temporal expressions. Therefore, one cannot acquire sufficient temporal information from them using existing methods.
Although dealing with text such as web diaries and blogs is a hard problem, these types of text are excellent information sources because of their sheer volume.</Paragraph> <Paragraph position="3"> In this paper, we propose a method for estimating the occurrence time of events expressed in informal text. In particular, we classify each sentence into one of four time slots: morning, daytime, evening, and night. To this end, we focus on expressions associated with a time slot (hereafter called time-associated words), such as &quot;commute (morning)&quot;, &quot;nap (daytime)&quot; and &quot;cocktail (night)&quot;. Explicit temporal expressions carry more certain information than time-associated words, but they are rare in ordinary text. Time-associated words, on the other hand, provide only indirect information for estimating the occurrence time of events, but they appear frequently in ordinary text. Indeed, Figure 2 (discussed again in Section 5.2) shows the number of sentences that include explicit temporal expressions and time-associated words, respectively. The numbers are obtained from the corpus used in this paper. There are many more time-associated words than explicit temporal expressions in blog text. In other words, by using time-associated words, our method can cover a wide range of sentences in informal text.</Paragraph> <Paragraph position="4"> However, listing all the time-associated words is impractical, because there are numerous such expressions. Therefore, we use a semi-supervised method with a small amount of labeled data and a large amount of unlabeled data, because preparing a large quantity of labeled data is costly, while unlabeled data is easy to obtain. Specifically, we adopt the Naive Bayes classifier backed up with the Expectation Maximization (EM) algorithm (Dempster et al., 1977) for semi-supervised learning.
In addition, we propose using Support Vector Machines to filter out noisy sentences that degrade the performance of the semi-supervised method.</Paragraph> <Paragraph position="5"> In our experiments using blog data, we obtained an accuracy of 0.864, which demonstrates the effectiveness of the proposed method.</Paragraph> <Paragraph position="6"> This paper is organized as follows. Section 2 briefly describes related work. Section 3 describes the details of our corpus. Section 4 presents the proposed method. Section 5 describes experimental results and discussion. Section 6 concludes the paper.</Paragraph> </Section> </Paper>