<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1088"> <Title>FLSA: Extending Latent Semantic Analysis with features for dialogue act classification</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Feature Latent Semantic Analysis </SectionTitle> <Paragraph position="0"> We will start by discussing LSA. The input to LSA is a Word-Document matrix W with a row for each word, and a column for each document (for us, a document is a unit, e.g. an utterance, tagged with a DA). Cell c(i,j) contains the frequency with which wordi appears in documentj.1 Clearly, this w xd matrix W will be very sparse. Next, LSA applies 1Word frequencies are normally weighted according to specific functions, but we used raw frequencies because we wanted to assess our extensions to LSA independently from any bias introduced by the specific weighting technique.</Paragraph> <Paragraph position="1"> to W Singular Value Decomposition (SVD), to decompose it into the product of three other matrices, W = T0S0DT0 , so that T0 and D0 have orthonormal columns and S0 is diagonal. SVD then provides a simple strategy for optimal approximate fit using smaller matrices. If the singular values in S0 are ordered by size, the first k largest may be kept and the remaining smaller ones set to zero. The product of the resulting matrices is a matrix ^W of rank k which is approximately equal to W; it is the matrix of rank k with the best possible least-squares-fit to W.</Paragraph> <Paragraph position="2"> The number of dimensions k retained by LSA is an empirical question. However, crucially k is much smaller than the dimension of the original space.</Paragraph> <Paragraph position="3"> The results we will report later are for the best k we experimented with.</Paragraph> <Paragraph position="4"> Figure 1 shows a hypothetical dialogue annotated with MapTask style DAs. Table 1 shows the Word-Document matrix W that LSA starts with - note that as usual stop words such as a, the, you have been eliminated. 2 Table 2 shows the approximate representation of W in a much smaller space.</Paragraph> <Paragraph position="5"> To choose the best tag for a document in the test set, we first compute its vector representation in the semantic space LSA computed, then we compare the vector representing the new document with the vector of each document in the training set. The tag of the document which has the highest similarity with our test vector is assigned to the new document - it is customary to use the cosine between the two vectors as a measure of similarity. In our case, the new document is a unit (utterance) to be tagged with a DA, and we assign to it the DA of the document in the training set to which the new document is most similar.</Paragraph> <Paragraph position="6"> Feature LSA. In general, in FLSA we add extra features to LSA by adding a new &quot;word&quot; for each value that the feature of interest can take (in some cases, e.g. when adding POS tags, we extend the matrix in a different way -- see Sec. 4). The only assumption is that there are one or more non word related features associated with each document that can take a finite number of values. In the Word-Document matrix, the word index is increased to include a new place holder for each possible value the feature may take. When creating the matrix, a count of one is placed in the rows related to the new indexes if a particular feature applies to the document under analysis. 
<Paragraph position="7"> Feature LSA. In general, in FLSA we add extra features to LSA by adding a new "word" for each value that the feature of interest can take (in some cases, e.g. when adding POS tags, we extend the matrix in a different way; see Sec. 4). The only assumption is that one or more non-word features are associated with each document, each taking a finite number of values. In the Word-Document matrix, the word index is extended to include a new placeholder for each possible value the feature may take. When creating the matrix, a count of one is placed in the rows corresponding to the new indices whenever the relevant feature value applies to the document under analysis. For instance, if we wish to include speaker identity as a new feature for the dialogue in Figure 1, the initial Word-Document matrix is modified as in Table 3 (its first 14 rows are as in Table 1).</Paragraph> <Paragraph position="8"> This process is easily extended if more than one non-word feature is desired per document, if more than one feature value applies to a single document, or if a single feature appears more than once in a document (Serafin, 2003).</Paragraph>
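<Paragraph position="9"> As a companion to the sketch above, and again as our own illustration rather than the paper's code, the following shows the FLSA matrix extension for the speaker-identity feature: one extra row per feature value is appended to the Word-Document matrix, with a count of one wherever that value applies to a document (counts can exceed one if a feature value recurs). At test time the same rows must be filled in for a new document before folding it in; the SVD and classification steps are unchanged from plain LSA.

import numpy as np

def add_feature_rows(W, docs_features, values):
    # Append one row per feature value; cell (value, j) counts how many
    # times that value applies to document j.
    F = np.zeros((len(values), W.shape[1]))
    row = {v: i for i, v in enumerate(values)}
    for j, feats in enumerate(docs_features):
        for v in feats:  # one or more feature values per document
            F[row[v], j] += 1.0
    return np.vstack([W, F])

# Speakers of Docs 1-7 in Figure 1: G, F, G, G, F, G, F.
speakers = [["G"], ["F"], ["G"], ["G"], ["F"], ["G"], ["F"]]
W = np.zeros((14, 7))  # stand-in for the 14-row word matrix of Table 1
W_flsa = add_feature_rows(W, speakers, values=["G", "F"])
print(W_flsa.shape)  # (16, 7): 14 word rows plus 2 speaker rows, matching
                     # the structure described for Table 3
</Paragraph>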
</Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Corpora </SectionTitle> <Paragraph position="0"> We report experiments on three corpora: Spanish CallHome, MapTask, and DIAG-NLP.</Paragraph> <Paragraph position="1"> The Spanish CallHome corpus (Levin et al., 1998; Ries, 1999) comprises 120 unrestricted phone calls in Spanish between family members and friends, for a total of 12,066 unique words and 44,628 DAs. The corpus is annotated at three levels: DAs, dialogue games, and dialogue activities. The DA annotation augments a basic tag such as statement along several dimensions, for example whether the statement describes a psychological state of the speaker. This results in 232 different DA tags, many with very low frequencies. In this sort of situation, tag categories are often collapsed when running experiments so as to obtain meaningful frequencies (Stolcke et al., 2000). In CallHome37, we collapsed different types of statements and backchannels, obtaining 37 different tags; CallHome37 maintains some subcategorizations, e.g. whether a question is yes/no or rhetorical. In CallHome10, we collapse these categories further: CallHome10 is reduced to 8 DAs proper (e.g., statement, question, answer) plus the two tags '%' for abandoned sentences and 'x' for noise.</Paragraph> <Paragraph position="2"> CallHome Spanish is further annotated for dialogue games and activities. Dialogue game annotation is based on the MapTask notion of a dialogue game: a set of utterances starting with an initiation and encompassing all utterances up until the purpose of the game has been fulfilled (e.g., the requested information has been transferred) or abandoned (Carletta et al., 1997). Moves are the components of games; each corresponds to one or more DAs, and each is tagged as Initiative, Response, or Feedback. Each game is also given a label, such as Info(rmation) or Direct(ive). Finally, activities pertain to the main goal of a certain discourse stretch, such as gossip or argue.</Paragraph> <Paragraph position="3"> The HCRC MapTask corpus is a collection of dialogues from a "Map Task" experiment. Two participants sit opposite one another, and each receives a map, but the two maps differ: the instruction giver (G)'s map has a route indicated, while the instruction follower (F)'s map does not include the drawing of the route. The task is for G to give directions to F so that, at the end, F is able to reproduce G's route on her own map.</Paragraph> <Paragraph position="4"> The MapTask corpus is composed of 128 dialogues, for a total of 1,835 unique words and 27,084 DAs. It has been tagged at various levels, from POS to disfluencies, from syntax to DAs. The MapTask coding scheme uses 13 DAs (called moves), which include Instruct (a request that the partner carry out an action), Explain (one of the partners states some information that was not explicitly elicited by the other), Query-yn/-w, Acknowledge, Reply-y/-n/-w, and others. The MapTask corpus is also tagged for games as defined above but, differently from CallHome, 6 DAs are identified as potential initiators of games (of course, not every initiator DA actually initiates a game). Finally, transactions provide the subdialogue structure of a dialogue; each is built of several dialogue games and corresponds to one step of the task.</Paragraph> <Paragraph position="5"> DIAG-NLP is a corpus of computer-mediated tutoring dialogues between a tutor and a student who is diagnosing a fault in a mechanical system with a tutoring system built with the DIAG authoring tool (Towne, 1997). The student's input is via menu; the tutor sits in a different room and answers via a text window. The DIAG-NLP corpus comprises 23 'dialogues', for a total of 607 unique words and 660 DAs (it is thus much smaller than the other two corpora). It has been annotated for a variety of features, including four DAs (Glass et al., 2002): problem solving, in which the tutor gives problem-solving directions; judgment, in which the tutor evaluates the student's actions or diagnosis; domain knowledge, in which the tutor imparts domain knowledge; and other, when none of the previous three applies. Other features encode domain objects and their properties, and Consult Type, the type of student query.</Paragraph> </Section> </Paper>