<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2328">
  <Title>Multi-level Dialogue Act Tags</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2.3 Requirements for the Definition of a
DA Tagset
</SectionTitle>
    <Paragraph position="0"> In this paper, our goal is to design a new DA tagset for our application, with the following constraints in mind (see also the analysis by D. Traum (2000)): Relation to one or more existing theories (descriptive, explanatory, etc.).</Paragraph>
    <Paragraph position="1"> Compatibility with the observed functions of actual utterances in context, in a given domain.</Paragraph>
    <Paragraph position="2"> Empirical validation: reliability of human application of the tagset to typical data (high inter-annotator agreement, at least potentially).</Paragraph>
    <Paragraph position="3"> Possibility of automatic annotation (this requirement is speci c to NLP).</Paragraph>
    <Paragraph position="4"> Relevance to the targeted NLP application: there are numerous possible functions of utterances, but only some of them are really useful to the application. Within our IM2.MDM project, a study has been conducted on the relevance of dialogue acts (in particular) to typical user queries on meeting recordings (Lisowska, 2003)2.</Paragraph>
    <Paragraph position="5"> Mapping (at least partially) to existing tagsets, so that useful insights are preserved, and data can be reused.</Paragraph>
    <Paragraph position="6"> 2Many other potential uses of dialogue act information have been hypothesized, such as their use to increase ASR accuracy (Stolcke et al., 2000), or to locate \hot spots&amp;quot; in meetings (Wrede and Shriberg, 2003).</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Available Data and Annotations:
ICSI Meeting Recorder
</SectionTitle>
    <Paragraph position="0"> The volume of available annotated data su ers from the diversity of DA tagsets (Klein and Soria, 1998).</Paragraph>
    <Paragraph position="1"> One of the most signi cant resources is the Switchboard corpus mentioned above, but telephone conversations have many di erences with multi-party meetings. Apart from the data recently available in the IM2 project, results reported in this paper make use of the ICSI Meeting Recording (MR) corpus of transcribed and annotated dialogues (Morgan et al., 2003; Shriberg et al., 2004)3.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Overview of ICSI MR Corpus
</SectionTitle>
      <Paragraph position="0"> The ICSI-MR corpus consists of 75 one-hour recordings of sta meetings, each involving up to eight speakers on separate mike channels. Each channel was manually transcribed and timed, then annotated with dialogue act and adjacency pair information (Shriberg et al., 2004). Following a preliminary release in November 2003 (sound les, transcriptions, and annotations), the full corpus was released in February 2004 to IM2 partners.</Paragraph>
      <Paragraph position="1"> The dialogue act annotation makes use of the pre-existing segmentation of each channel into (prosodic) utterances, sometimes segmented further into functional utterances, each of them bearing a separate dialogue act. There are about 112,000 prosodic utterances, and about 7,200 are segmented into two functional utterances (only one is segmented in three).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Discussion of the ICSI-MR DA Tagset
</SectionTitle>
      <Paragraph position="0"> Each functional utterance from the ICSI-MR corpus is marked with a dialogue label, composed of one or more tags from the ICSI-MR tagset (Dhillon et al., 2004). The tagset, which is well documented, is based on SWBD-DAMSL, but unlike SWBD-DAMSL, it allows one utterance to be marked with multiple tags. Also, the SWBD-DAMSL tagset was extended, for instance with disruption tags such as 'interrupted', 'abandoned', etc. Utterances can also be marked as 'unintelligible' or 'non-speech'. An ICSI-MR label is made of a general tag, followed by zero or more speci c tags, followed or not by a disruption tag: gen_tag [^spec_tag_1 ... ^spec_tag_n] [.d] Our formalization of the guidelines using rewriting rules (Popescu-Belis, 2003) shows that few tags are mutually exclusive. The number of possible combinations (DA labels) reaches several millions. For instance, even when not considering disruption marks,  the labels are a combination of one general tag out of 11, and one or more speci c tags out of 39. If up to ve speci c tags are allowed (as observed empirically in the annotated data), there are more than 7,000,000 possible labels; if speci c tags are limited to four, there are about 1,000,000 possible labels.</Paragraph>
      <Paragraph position="1"> Some studies acknowledge the di culties of annotating precisely with ICSI-MR, but also the ne-grained distinctions it allows for, e.g. between the possible functions of four related discourse particles ('yeah', 'right', 'okay', and 'uhhuh'): agreement/acceptance, acknowledgment, backchannel, oor grabber (Bhagat et al., 2003). Conversely, inter-annotator agreement on such ne-grained distinctions (speci c tags) is lower than agreement on major classes, though the kappa-statistic normally used to measure agreement adjusts to a certain extent for this. In fact, ICSI-MR also provided a set of ve 'classmaps' that indicate how to group tags into categories which reduce the number of possible labels. For instance, the simplest one reduces all DA labels to only ve classes: statement, question, backchannel, oor holder/grabber, disruption. Our MALTUS proposal (see 4.1 below) could be viewed as a classmap too: it preserves however more ICSI-MR tags than the existing classmaps, and assigns in addition conditions of mutual exclusiveness.</Paragraph>
      <Paragraph position="2"> We also note that, while SWBD-DAMSL was an attempt to reduce the dimensionality of the DAMSL tagset (which had a clear theoretical base), the ICSI-MR tagset allows SWBD tags to be combined again instead of going back to DAMSL tags. Although our proposal that we proceed to describe (MALTUS) remains close to ICSI-MR for reusability reasons, we are also working on a more principled DA tagset that departs from ICSI-MR (Popescu-Belis, 2003).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Some Figures for the ICSI-MR Data
</SectionTitle>
      <Paragraph position="0"> In the process of conversion to MALTUS (see 4.2 below), we validated the ICSI-MR data and made several observations. Detected incoherent combinations of tags (e.g., two general tags in a label) and other remarks have also been sent back to ICSI.</Paragraph>
      <Paragraph position="1"> We rst separate prosodic utterances into functional utterances, so that each utterance has one DA label (and not two, separated by 'j'), thus obtaining 120,205 utterances. Also at this stage, we split utterances that correspond to reported speech (marked with ':'). We then discard the disruption marks to focus on the DA labels only { about 12,000 labels out of ca. 120,000 are disruption marks, or contain one. We are left with 113,560 utterances with DA labels, with 776 observed types of labels. An important parameter is the number of occurring vs. possible labels, Nb. of Nb. of Nb. of Nb. of tags in theoretical occurring tokens label comb. comb.</Paragraph>
      <Paragraph position="2">  tags): theoretical vs. actual.</Paragraph>
      <Paragraph position="3"> Maximal nb. Maximal theoretical of tags accuracy on ICSI-MR  ICSI-MR data that could be reached using a limited number of tags per label.</Paragraph>
      <Paragraph position="4"> which depends a lot on the number of speci c tags in a label, as summarized in table 1. The maximum observed in the available data is ve speci c tags in a label (hence six tags in all).</Paragraph>
      <Paragraph position="5"> There is no guarantee that meaningful labels cannot have more than six tags. However, such labels are probably very infrequent, and a reasonable option for automatic tagging is to limit the number of tag combinations, which is the main goal of the MALTUS tagset. The maximal accuracies that could be obtained on the available ICSI-MR data if the number of tags in a label was limited to 1, 2, etc. are shown in Table 2. In computing the accuracy we consider here only perfect matches, but scores could be higher if partial matches count too. Two or three tags per label already allow very high accuracy, while considerable reducing the search space.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 The MALTUS DA Tagset
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Definition
</SectionTitle>
      <Paragraph position="0"> We de ned MALTUS (Multidimensional Abstract Layered Tagset for Utterances) in order to reduce the number of possible combinations by assigning exclusiveness constraints among tags, while remaining compatible with ICSI-MR (Popescu-Belis, 2003). MALTUS is more abstract than ICSI-MR, but can be re ned if needed. An utterance is either marked U (undecipherable) or it has a general tag and zero or more speci c tags. It can also bear a disruption mark. More formally (? means optional):</Paragraph>
      <Paragraph position="2"> re ned into: command, commitment, suggestion, open-option, explicit performative) AT = the utterance is related to attention management (can be re ned into: acknowledgement, rhetorical question backchannel, understanding check, follow me, tag question) PO = the utterance is related to politeness (can be re ned into sympathy, apology, downplayer, \thanks&amp;quot;, \you're welcome&amp;quot;) D = the utterance has been interrupted or abandoned null</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Conversion of ICSI-MR to MALTUS
</SectionTitle>
      <Paragraph position="0"> There are only about 500 possible MALTUS labels, but observations of the converted ICSI-MR data show again that the probability distribution is very skewed. An explicit correspondence table and conversion procedure were designed to convert ICSI-MR to MALTUS, so that the considerable ICSI-MR resource can be reused.</Paragraph>
      <Paragraph position="1"> Correspondences between MALTUS and other tagsets (Klein and Soria, 1998) were also provided (Popescu-Belis, 2003). Such \mappings&amp;quot; are imperfect for two reasons: rst, they work only in one direction, from the more speci c tagset (ICSI-MR / SWBD / DAMSL) to the more abstract one (MAL-TUS). Second, a mapping is incomplete if one does not state which tags must be mutually exclusive.</Paragraph>
      <Paragraph position="2"> For MALTUS too, the idea to use at most three tags per label in an automatic annotation program might reduce the search space without decreasing the accuracy too much. Another idea is to use only the labels that appear in the data that is, only 50 labels. An even smaller search space is provided by the 26 MALTUS labels that occur more than 10 times each.</Paragraph>
      <Paragraph position="3"> If only these are used for tagging, then only 70 occurrences (only 0.061% of the total) would be incorrectly tagged, on the ICSI-MR reference data. Occurring labels ordered alphabetically and their frequencies (when greater than 10) are listed below.</Paragraph>
      <Paragraph position="5"> Further analysis will tell whether this list should be enriched with useful labels that are absent from it. Also, a comparison of MALTUS to the SWBD set (26 labels vs. 42) should determine whether the loss in informativeness in MALTUS is compensated by the gain in search space size and in theoretical grounding.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Automatic Classification
</SectionTitle>
    <Paragraph position="0"> As discussed above, one of the desiderata for a tagset in this application domain is that the tags can be applied automatically. A requirement for annotations that can only be applied manually is clearly unrealistic except for meetings of very high importance.</Paragraph>
    <Paragraph position="1"> The ICSI-MR corpus on the other hand is concerned with producing a body of annotated data that can be used by researchers for a wide range of di erent purposes: linguists who are interested in particular forms of interaction, researchers in acoustics and so on. It is by no means a criticism of their work that some of the distinctions that they annotate or attempt to annotate cannot be reliably automated.</Paragraph>
    <Paragraph position="2"> Here we report some preliminary experiments on the automatic annotation of meeting transcripts with these tagsets. Our focus here is not so much on evaluating a classi er for this task but rather evaluating the tagsets: we are interest in the extent to which they can be predicted reliably from easily extracted features of the utterance and its context. Additionally we are interested in the multi-level nature of the tagsets and exploring the extent to which the internal structure of the tags allows other options for classiers. Therefore, our goal in these experiments is not to build a high performance classi er; rather, it is to explore the extent to which multi level tagsets can be predicted by classifying each level separately { i.e. by having a set of \orthogonal&amp;quot; classi ers { as opposed to classifying the entire structured object in a single step using a single multi-class classi er on a attened representation. Accordingly there are a number of areas in which our experimental setup di ers from that which would be appropriate when performing experiments to evaluate a classi er.</Paragraph>
    <Paragraph position="3"> Since in this paper we are not using prosodic or acoustic information, but just the manual transcriptions, there are two sources of information that can be used to classify utterances. First, the sequence of words that constitutes the utterance, and secondly the surrounding utterances and their classi cation.</Paragraph>
    <Paragraph position="4"> generally in prior research in this eld, some form of sequential inference algorithm has been used to combine the local decisions about the DA of each utterance into a classi cation of the whole utterance.</Paragraph>
    <Paragraph position="5"> The common way of doing this has been to use a hidden Markov model to model the sequence and to use a standard decoding algorithm to nd either the sequence with maximum a posteriori (MAP) likelihood or to select for each utterance the DA with MAP likelihood. In the work here, we will ignore this complexity and allow our classi er access to the gold standard classi cation of the surrounding utterances. This will make the task substantially easier, since in a real application, there will be some noise in the labels.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Feature selection
</SectionTitle>
      <Paragraph position="0"> There are two sorts of features that we shall use here { internal lexical features derived from the words in the utterance, and contextual features derived from the surrounding utterances. At our current state of knowledge we have a very good idea about what the lexically derived features should be, and how they should be computed { namely n-grams or gappy n-grams including positional information. Additionally, there are ways of computing these e ciently. However, with regard to the contextually derived features, our knowledge is much less complete. (Stolcke et al., 2000) showed that in the Switchboard corpus there was little dependence beyond the immediately adjacent utterance, but whether this also applies in this multi-party domain is unknown. Thus we nd ourselves in a rather asymmetric position with regard to these two information sources. As we are not here primarily interested in constructing a high performance classi er, but rather identifying the predictable elements of the tag, we have resolved this problem by deliberately selecting a rather limited set of lexical features, together with a limited set of contextual features. Otherwise, we feel that our experiments would be overly biased towards those elements of the tag that are predictable from the internal lexical evidence.</Paragraph>
      <Paragraph position="1"> We used as lexical features the 1000 most frequent words, together with additional features for these words occurring at the beggining or end of the utterance. This gives an upper bound of 3000 lexical features. We experimented with a variety of simple contextual features.</Paragraph>
      <Paragraph position="2"> Preceding same label (SL) the immediately preceding utterance on the same channel has a particular DA tag.</Paragraph>
      <Paragraph position="3"> Preceding label (PL) a preceding utterance on a di erent channel has a particular DA tag. We consider an utterance to be preceding if it starts before the start of the current utterance.</Paragraph>
      <Paragraph position="4"> Overlapping label (OL) an utterance on another channel with a particular DA tag overlaps the current utterance. We anticipate this being useful for identifying backchannels.</Paragraph>
      <Paragraph position="5"> Containing label (CL) an utterance on another channel with a particular DA tag contains the current channel { i.e. the start is before the start of the current utterance and the end is after the end of the current utterance.</Paragraph>
      <Paragraph position="6"> Figure 1 shows an arti cial example in a multi-party dialog with four channels. This illustrates the features that will be de ned for the classi cation of the utterance that is shaded. In this example we will have the following features SL:C1, PL:B1, PL:D1, CL:D1, OL:A1, OL:B1, OL:B2, OL:D1. We have found  features de ned for a particular utterance (shaded).</Paragraph>
      <Paragraph position="7"> There are four channels labelled A to D; each box represents an utterance, and the DA tag is represented by the characters inside each box.</Paragraph>
      <Paragraph position="8"> that the overlapping label feature set does not help the classi ers here, so we have used the remaining three contextual feature sets. Note the absence of contextual features corresponding to labels of utterances that strictly follow the target utterance. We felt that given the fact that we use the gold standard tags this would be too powerful.</Paragraph>
      <Paragraph position="9"> The data made available to us was preprocessed in a number of ways. The most signi cant change was to split utterances that had been labelled with a sequence of DA labels (joined with pipes). We separated the utterances and the labels at the appropriate points and realigned. The data was provided with individual time stamps for each word using a speech recognizer in forced recognition mode: where there were errors or mismatches we discarded the words.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Results
</SectionTitle>
      <Paragraph position="0"> We use a Maximum Entropy (ME) classi er (Manning and Klein, 2003) which allows an e cient combination of many overlapping features. We selected 5 meetings (6771 utterances after splitting) to use as our test set and 40 as our training set leaving a further ve for possible later experiments. As a simple baseline we use the classi er which just guesses the most likely class. We rst performed some experiments on the original tag sets to see how predictable they are.</Paragraph>
      <Paragraph position="1"> We started by de ning a simple six-way classi cation task which classi es disruption forms, and undecipherable forms as well as the four general tags dened above. This is an empirically very well-founded distinction: the ICSI-MR group have provided some inter-annotator agreement gures(Carletta et al., 1997) for a very similar task and report a kappa of 0.79. Our ME classi er scored 77.9% (baseline 54.0%).</Paragraph>
      <Paragraph position="2"> We also tested a few simple binary classi cations to see how predictable they are. Utterances are annotated for example with a tag J if they are a joke. As would be expected, the Joke/Non-Joke classi cation produced results not distinguishable from chance.</Paragraph>
      <Paragraph position="3"> The performance of the classi ers on separating disrupted utterances from non disrupted forms scored slightly above chance at 89.9% (against baseline of 87.0%). We suspect that more sophisticated contextual features could allow better performance here. A more relevant performance criterion for our application is the accuracy of classi cation into the four general tags. In this case we removed disrupted and undecipherable utterances, slightly reducing the size of the test set, and achieved a score of 84.9% (baseline 64.1%).</Paragraph>
      <Paragraph position="4"> With regard to the larger sets of tags, since they have some internal structure it should accordingly be possible to identify the di erent parts separately, and then combine the results. We have therefore performed some preliminary experiments with classi ers that classify each level separately. We again removed the disruption tags since with out current framework we are unable to predict them accurately. The base-line for this task is again a classi er that chooses the most likely tag (S) which gives 41.9% accuracy. Using a single classi er on this complex task gave an accuracy of 73.2%.</Paragraph>
      <Paragraph position="5"> We then constructed six classi ers as follows Primary classi er S, H, Q or B Politeness classi er PO or not PO Attention classi er AT or not AT Order classi er DO or not DO Restatement classi er RI or not RI Response classi er RP, RN, RU or no response These were trained separately in the obvious way and the results combined. This complex classi er gave an accuracy 70.5%. This mild decrease in performance is rather surprising { one would expect the performance to increase as the data sets for each distinction get larger. This can be explained by dependences between the classi cations. There are a number of ways this could be treated { for example, one could use a sequence of classi ers, where each classi er can use the output of the previous classi er as a feature in the next. It is also possible that these dependencies re ect idiosyncracies of the tagging process: tendencies of the annotators for whatever reasons to favour or avoid certain combinations of tags.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML