<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2313"> <Title>Towards Automatic Identification of Discourse Markers in Dialogs: The Case of Like</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Role of Discourse Markers in Dialog </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Definition </SectionTitle> <Paragraph position="0"> Despite the wide research interest that discourse markers (DMs) have attracted for many years, there is no generally agreed-upon definition of the term. The first difficulty arises from the fuzzy terminology used to designate these elements. Even though in English they are most often referred to as discourse markers, a variety of other names are also used, such as discourse particles, discourse connectives, pragmatic markers, etc. The main problem for the study of DMs, however, is that there seems to be no agreement on which elements should be included in this class. For instance, for English, Fraser (1990) proposes a list of 32 DMs, whereas Schiffrin (1987) lists only 23, and the two lists have only five elements in common. The lack of agreement on what counts as a DM reflects the great diversity of approaches used to investigate them, resulting from divergent research interests, methods and goals.</Paragraph> <Paragraph position="1"> At a very general level, it is nevertheless possible to formulate a rather consensual definition of DMs. Following Andersen (2001, p. 
39), discourse markers are &quot;a class of short, recurrent linguistic items that generally have little lexical import but serve significant pragmatic functions in conversation.&quot; Items typically featured in this class include (in English): actually, and, but, I mean, like, so, you know, and well.</Paragraph> <Paragraph position="2"> Our study of DMs and its application to natural language processing are related to a wider-scope investigation of DMs which is grounded in relevance theory (Sperber &amp; Wilson 1986/1995). In this framework, DMs encode a procedure whose role is to constrain the inferential part of communication, by restricting the number of hypotheses the hearer has to consider in order to understand the speaker's meaning (see footnote 3).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Importance of Discourse Markers for NLP </SectionTitle> <Paragraph position="0"> The analysis of DMs for language processing is often inspired by discourse analysis theories such as Rhetorical Structure Theory (RST; Mann &amp; Thompson 1988). In this context, DMs are used to detect coherence relations automatically (Marcu 2000). For example, so, therefore and then are supposed to indicate a relation of conclusion between two segments. However, this analysis of DMs is not fine-grained enough: for instance, if the three markers above imply the same type of relation, why can they not be interchanged in every context? More recently, DMs have also been used as cues to detect dialog acts and conversational moves. For example, oh implies a response to a new piece of information and well implies a correction (Heeman, Byron &amp; Allen 1998). 
However, DMs are only partial cues in this setting, since there is no one-to-one mapping between the use of a marker and the presence of a given relation (see for instance Taboada 2003).</Paragraph> <Paragraph position="1"> In order to provide a more precise and comprehensive framework for the use of DMs in natural language processing, we derived elsewhere a three-step resolution procedure from a relevance-theoretic analysis (Zufferey 2004). These steps can be summarized as follows: 1. detect the occurrences of DMs; 2. attach an inferential procedure to every marker; 3. determine the scope of each procedure. [Footnote 3: For a more detailed explanation of the role of DMs in relevance theory, we refer the interested reader to Blakemore (2002) for a recent survey.]</Paragraph> <Paragraph position="2"> In the remainder of this paper, we will focus only on the first step, i.e. the detection of DMs. The difficulty of this task comes from the fact that DMs are highly ambiguous items. Typically, words like well, now or like can fulfill multiple functions. The first step towards a correct use of DMs for language processing is therefore to disambiguate them, i.e. to extract only the occurrences of the respective lexical item that function as a DM, in other words the pragmatic occurrences (see their definition for like in section 3 below). Sections 6, 7 and 8 below describe various automatic methods to accomplish this task. Note that even though we have grounded our approach in relevance theory, this first task is of paramount importance to any theory of discourse. For instance, in an RST framework, DMs can be used to infer coherence relations only if their pragmatic occurrences have previously been identified.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Overview of DM Frequencies </SectionTitle> <Paragraph position="0"> The manual annotation of DMs in a subpart of the ICSI meeting corpus (ca. 
6 hours and 60,000 words) shows large differences in the frequency of occurrence of various DMs. The most frequent ones are but (543 times), like (89), and well (287). Others are moderately frequent, e.g., actually (43), basically (21) or now (19), while others are very rare: furthermore (2), however (1), moreover (0). The frequency of each DM is relatively stable across the meetings.</Paragraph> <Paragraph position="1"> The frequency of DMs depends strongly on the type of discourse. For example, the DM however is found much more frequently in written than in spoken language.</Paragraph> <Paragraph position="2"> There are about 50 occurrences of however in the London-Lund Corpus (500,000 words, transcription of spoken language) and about 550 occurrences in the Lancaster-Oslo/Bergen (LOB) corpus (1 million words, written texts), i.e. roughly 0.1 vs. 0.55 occurrences per 1,000 words. Like most other DMs, however is also much more frequent in dialogs than in monologs.</Paragraph> <Paragraph position="3"> Another bias comes from the type of activity recorded: however is more frequent in formal settings (e.g. interviews as opposed to telephone conversations). Finally, the regional variety of English, e.g. American vs. British, can influence the results. According to Lenk (1998), &quot;however is not used in spoken American English&quot;. The conclusion is that the frequencies above cannot be taken to be universal. But in the type of data we are interested in, namely dialogs, there is a high proportion of the DM like. Moreover, in a larger part of the ICSI-MR corpus (ca. 50 hours), 37% of the 2,116 occurrences of like correspond to its use as a DM. Hence the need to disambiguate it correctly becomes obvious, not only to obtain a better pragmatic analysis of its occurrences but also to improve parsing and POS tagging (see footnote 4).</Paragraph> </Section> </Section> </Paper>