<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-2019">
  <Title>A corpus-based approach to topic in Danish dialog</Title>
  <Section position="5" start_page="0" end_page="109" type="metho">
    <SectionTitle>
3 Method
</SectionTitle>
    <Paragraph position="0"> The investigation proceeded in three stages: first, the topic expressions (see below) of all utterances were identified;1 second, all NPs were annotated for linguistic surface features; and third, decision trees were generated in order to reveal correlations between the topic expressions and the surface features. (Footnote 1: Utterances with a discourse-regulating purpose (e.g. yes/no answers), incomplete utterances, and utterances without an NP were excluded.)</Paragraph>
    <Section position="1" start_page="109" end_page="109" type="sub_section">
      <SectionTitle>
3.1 Identification of topic expressions
</SectionTitle>
      <Paragraph position="0"> Topics are distinguished from topic expressions following Lambrecht (1994). Topics are entities pragmatically construed as being what an utterance is about. A topic expression, on the other hand, is an NP that formally expresses the topic in the utterance.</Paragraph>
      <Paragraph position="1"> Topic expressions were identified through a two-step procedure: (1) identifying topics and (2) determining the topic expressions on the basis of the topics.</Paragraph>
      <Paragraph position="2"> First, the topic was identified strictly based on pragmatic aboutness using a modified version of the 'about test' (Lambrecht, 1994; Reinhart, 1982).</Paragraph>
      <Paragraph position="3"> The about test consists of embedding the utterance in question in an 'about-sentence', as in Lambrecht's example shown below as (1):
(1) He said about the children that they went to school.
This is a paraphrase of the sentence the children went to school, which indicates that the referent of the children is the topic, because it is appropriate (in the imagined discourse context) to embed this referent as an NP in the about matrix clause. (Again, the referent of the children is the topic, while the NP the children is the topic expression.) We adapted the about test for dialog by adding a request to 'say something about . . . ' or 'ask about . . . ' before the utterance in question. Each utterance was judged in context, and the best topic was identified as illustrated below. In example (2), the last utterance, (2-D3), was assigned the topic TIME OF LAST WEIGHING. This happened after considering which about construction gave the most coherent and natural-sounding result combined with the utterance. Example (3) shows a few about constructions that the coder might come up with; in this context, (3-iv) was chosen as the best alternative.</Paragraph>
      <Paragraph position="4">  (3) i. Say something about THE PATIENT (=you).</Paragraph>
      <Paragraph position="5"> ii. Say something about THE WEIGHING OF THE PATIENT. iii. Say something about THE LAST WEIGHING OF THE PATIENT.</Paragraph>
      <Paragraph position="6"> iv. Say something about THE TIME OF LAST WEIGHING OF THE PATIENT.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="109" end_page="109" type="metho">
    <Paragraph position="0"> Creating the about constructions involved a great deal of creativity and made them difficult to compare. Sometimes the coders chose exactly the same topic, at other times obviously different ones, but frequently it was difficult to decide. For instance, for one utterance Coder 1 chose OTHER CAUSES OF EDEMA SYMPTOM, while Coder 2 chose THE EDEMA'S CONNECTION TO OTHER THINGS. Slightly different wordings like these made it impossible to test the intersubjectivity of the topic coding.</Paragraph>
    <Paragraph position="1"> The second step consisted in actually identifying the topic expression. This was done by selecting the NP in the utterance that was the best formal representation of the topic, using three criteria:
1. The topic expression is the NP in the utterance that refers to the topic.
2. If no such NP exists, then the topic expression is the NP whose referent the topic is a property or aspect of.
3. If no NP fulfills one of these criteria, then the utterance has no topic expression.</Paragraph>
    <Paragraph position="3"> In the example from before, (2-D3), it was judged that det 'it' (emphasized) was the topic expression of the utterance, because it shared reference with the chosen topic from (3-iv).</Paragraph>
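    <Paragraph> The three criteria above amount to a small decision procedure. The following sketch is our own illustration, not the authors' implementation; the NP representation (a dict with a hypothetical refers_to field and an aspects list naming what its referent has as properties or aspects) is invented for the example.
```python
def topic_expression(nps, topic):
    """Select the topic expression of an utterance from its NPs,
    following the three criteria in section 3 (illustrative sketch)."""
    # Criterion 1: an NP that refers to the topic itself.
    for np in nps:
        if np["refers_to"] == topic:
            return np
    # Criterion 2: an NP whose referent the topic is a property or aspect of.
    for np in nps:
        if topic in np.get("aspects", []):
            return np
    # Criterion 3: no NP qualifies, so the utterance has no topic expression.
    return None

# Hypothetical rendering of (2-D3): det 'it' shares reference with the topic.
topic = "TIME OF LAST WEIGHING OF THE PATIENT"
nps = [{"form": "det 'it'", "refers_to": topic}]
print(topic_expression(nps, topic)["form"])  # → det 'it'
```
</Paragraph>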
    <Paragraph position="4"> If two NPs in an utterance had the same reference, the best topic representative was chosen. In reflexive constructions like (4), the non-reflexive NP, in this case jeg 'I', is considered the best representative. In syntactically complex utterances, the best representative of the topic was considered the one occurring in the clause most closely related to the topic. In the following example, since the topic was THE PATIENT'S HANDLING OF EATING, the topic expression had to be one of the two instances of jeg 'I'. Since the topic arguably concerns 'handling' more than 'eating', the NP in the matrix clause (emphasized) is the topic expression.</Paragraph>
  </Section>
  <Section position="8" start_page="109" end_page="111" type="metho">
    <Paragraph position="2">  A final example of several NPs referring to the same topic has to do with left-dislocation. In example (6), the preverbal object ham 'him' is immediately preceded by its antecedent min far 'my father'. Both NPs express the topic of the utterance. In Danish, resumptive pronouns in left-dislocation constructions always occur in preverbal position, and in cases where they express the topic there will thus always be two NPs directly adjacent to each other which both refer to the topic. In such cases, we consider the resumptive pronoun the topic expression, partly because it may be considered a more integrated part of the sentence (cf. Lambrecht (1994)).  The intersubjectivity of the topic expression annotation was tested in two ways. First, all the topic expression annotations of the two coders were compared. This showed that topic expressions can be annotated reasonably reliably (k = 0.70 (see table 1)). Second, to make sure that this intersubjectivity was not just a product of mutual influence between the two authors, a third, independent coder annotated a small, random sample of the data for topic expressions (50 NPs). Comparing this to the annotation of the two main coders confirmed reasonable reliability (k = 0.70).</Paragraph>
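    <Paragraph> The agreement figures above are Cohen's kappa scores. For reference, kappa can be computed from two coders' parallel label sequences as below; the ten binary judgments are invented for illustration, not taken from the corpus.
```python
from collections import Counter

def cohens_kappa(labels1, labels2):
    """Cohen's kappa: chance-corrected agreement between two coders."""
    n = len(labels1)
    # Observed agreement: proportion of items labeled identically.
    po = sum(a == b for a, b in zip(labels1, labels2)) / n
    # Expected chance agreement from each coder's label distribution.
    c1, c2 = Counter(labels1), Counter(labels2)
    pe = sum(c1[label] * c2[label] for label in c1) / (n * n)
    return (po - pe) / (1 - pe)

# Invented "is this NP a topic expression?" judgments for ten NPs:
coder1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
coder2 = [1, 0, 0, 0, 1, 0, 1, 0, 1, 1]
print(round(cohens_kappa(coder1, coder2), 2))  # → 0.6 (8/10 observed, 0.5 by chance)
```
</Paragraph>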
    <Section position="1" start_page="110" end_page="111" type="sub_section">
      <SectionTitle>
3.2 Surface features
</SectionTitle>
      <Paragraph position="0"> After annotating the topics and topic expressions, 16 grammatical, morphological, and prosodic features were annotated. First, the smaller corpus was annotated by the two main coders in collaboration, in order to establish annotation policies for unclear cases.</Paragraph>
      <Paragraph position="1"> Then the features were annotated individually by the two coders in the larger corpus.</Paragraph>
      <Paragraph position="2"> Grammatical roles. Each NP was categorized as grammatical subject (sbj), object (obj), or oblique (obl). These features can be annotated reliably (sbj: C1 (number of sbj's identified by Coder 1) = 208, C2 (sbj's identified by Coder 2) = 207, C1+2 (Coder 1 and 2 overlap) = 207, ksbj = 1.00; obj: C1 = 110, C2 = 109, C1+2 = 106, kobj = 0.97; obl: C1 = 30, C2 = 50, C1+2 = 29, kobl = 0.83).
Morphological and phonological features. NPs were annotated for pronominalisation (pro), definiteness (def), and main stress (str). (Note that the main stress distinction only applies to pronouns in Danish.) These can also be annotated reliably (pro:</Paragraph>
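      <Paragraph> For a binary feature, kappa follows directly from the three reported counts plus the total number of items. The sketch below is our reconstruction; it assumes the total of 449 NPs mentioned in section 3.3, which may not be the exact denominator the authors used, so the result only approximates the reported value.
```python
def kappa_from_counts(c1, c2, both, n):
    """Cohen's kappa for one binary feature, given each coder's positive
    count (c1, c2), the positive overlap (both), and the item total (n)."""
    # Observed agreement: shared positives plus shared negatives.
    po = (both + (n - c1 - c2 + both)) / n
    # Chance agreement from the two coders' marginal distributions.
    pe = (c1 * c2 + (n - c1) * (n - c2)) / (n * n)
    return (po - pe) / (1 - pe)

# obj counts from the text, with the assumed total n = 449:
print(round(kappa_from_counts(110, 109, 106, 449), 2))  # ≈ 0.96 under this assumption
```
The reported kobj = 0.97 is close to this figure; the small gap presumably reflects the exact item total used in the paper.
</Paragraph>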
      <Paragraph position="4"> Unmarked surface position. NPs were annotated for occurrence in pre-verbal (pre) or post-verbal (post) position relative to their subcategorizing verb. Thus, in the following example, det 'it' is +pre, but -post, because det is not subcategorized by tror 'think'.</Paragraph>
      <Paragraph position="5">  In addition to this, NPs occurring in pre-verbal position were annotated for whether they were repetitions of a left-dislocated element (ldis). Example (8) further exemplifies the three position-related features.</Paragraph>
      <Paragraph position="7"> Marked NP-fronting. This group contains NPs fronted in marked constructions such as the passive (pas), clefts (cle), Danish 'sentence intertwining' (dsi), and XVS-constructions (xvs).</Paragraph>
      <Paragraph position="8"> NPs fronted as subjects of passive utterances were annotated as +pas.</Paragraph>
      <Paragraph position="9">  A cleft construction is defined as a complex construction consisting of a copula matrix clause with a relative clause headed by the object of the matrix clause. The object of the matrix clause is also an argument or adjunct of the relative clause predicate. The clefted element det 'that', which we annotate as +cle, leaves an 'empty slot', e, in the relative clause, as shown in example (10):  Danish sentence intertwining can be defined as a special case of extraction where a non-WH constituent of a subordinate clause occurs in the first  position of the matrix clause. As in cleft constructions, an 'empty slot' is left behind in the subordinate clause. NPs in the fronted position were annotated as +dsi:</Paragraph>
      <Paragraph position="11"> The XVS construction is defined as a simple declarative sentence with anything but the subject in the preverbal position. Since only one constituent is allowed preverbally2, the subject occurs after the finite verb. In example (12), the finite verb is an auxiliary, and the canonical position of the object after the main verb is indicated with the 'empty slot' marker e. The preverbal element in XVS-constructions is annotated as +xvs.</Paragraph>
      <Paragraph position="13"> Sentence type and subordination. Each NP was annotated with respect to whether or not it appeared in an interrogative sentence (int) or a subordinate clause (sub), and finally, all NPs were coded as to whether they occurred in an epistemic matrix clause or in a clause subordinated to an epistemic matrix clause (epi). An epistemic matrix clause is defined as a matrix clause whose function is to evaluate the truth of its subordinate clause (such as "I think . . . "). The following example illustrates how we annotated both NPs in the epistemic matrix clause and NPs in its immediate subordinate clause as +epi, but not NPs in further subordinated clauses. The +epi feature requires a +/-sub feature in order to determine whether the NP in question is in the epistemic matrix clause or subordinated under it. Subordination is shown here using parentheses.</Paragraph>
      <Paragraph position="14"> (Footnote 2; beginning missing) . . . verbal position. Left-dislocated elements are not considered part of the sentence proper, and thus do not count as preverbal elements, cf. Lambrecht (1994).</Paragraph>
      <Paragraph position="15"> These features can be annotated reliably (int: C1 = 55, C2 = 55, C1+2 = 55, kint = 1.00; sub: C1 = 117, C2 = 111, C1+2 = 107, ksub = 0.93; epi: C1 = 38, C2 = 45, C1+2 = 37, kepi = 0.92).</Paragraph>
    </Section>
    <Section position="2" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
3.3 Decision trees
</SectionTitle>
      <Paragraph position="0"> In the third stage of our investigation, a decision tree (DT) generator was used to extract correlations between topic expressions and surface features. Three different data sets were used to train and test the DTs, all based on the larger dialog.</Paragraph>
      <Paragraph position="1"> Two of these data sets were derived from the complete set of NPs annotated by each main coder individually. These two data sets will be referred to below as the 'Coder 1' and 'Coder 2' data sets.</Paragraph>
      <Paragraph position="2"> The third data set was obtained by including only NPs annotated identically by both main coders in relevant features3. This data set represents a higher degree of intersubjectivity, especially in the topic expression category, but at the cost of a smaller number of NPs. 63 out of a total of 449 NPs had to be excluded because of inter-coder disagreement, 50 due to disagreement on the topic expression category.</Paragraph>
      <Paragraph position="3"> This data set will be referred to below as the 'Intersection' data set.</Paragraph>
      <Paragraph position="4"> A DT was generated for each of these three data sets, and each DT was tested using 10-fold cross validation, yielding the success rates reported below.</Paragraph>
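      <Paragraph> The paper does not name the DT software used. To make the procedure concrete, the sketch below pairs a one-level decision tree (a stump, the simplest possible DT) with plain 10-fold cross-validation in pure Python; the data is a randomly generated stand-in for the annotated NPs, and the feature count, sizes, and the stump itself are our simplifications, not the authors' setup.
```python
import random

def stump_fit(X, y):
    """Fit a one-level decision tree over binary features: choose the
    feature and leaf labels with the highest training accuracy."""
    best_j, best_rule, best_acc = 0, (0, 0), -1.0
    for j in range(len(X[0])):
        for label0 in (0, 1):          # prediction when feature j is 0
            for label1 in (0, 1):      # prediction when feature j is 1
                acc = sum((label1 if xi[j] else label0) == yi
                          for xi, yi in zip(X, y)) / len(y)
                if acc > best_acc:
                    best_j, best_rule, best_acc = j, (label0, label1), acc
    return best_j, best_rule

def stump_predict(model, xi):
    j, (label0, label1) = model
    return label1 if xi[j] else label0

def cross_validate(X, y, folds=10):
    """10-fold cross-validation: train on 9 folds, test on the held-out
    fold, and report the overall success rate."""
    idx = list(range(len(y)))
    random.Random(0).shuffle(idx)
    correct = 0
    for f in range(folds):
        held_out = set(idx[f::folds])
        train = [i for i in idx if i not in held_out]
        model = stump_fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(stump_predict(model, X[i]) == y[i] for i in held_out)
    return correct / len(y)

# Toy stand-in: 100 "NPs" with 3 binary surface features; the topic-expression
# label simply copies feature 0, so the stump should recover it exactly.
rng = random.Random(1)
X = [[rng.randint(0, 1) for _ in range(3)] for _ in range(100)]
y = [xi[0] for xi in X]
print(cross_validate(X, y))  # → 1.0 on this perfectly separable toy data
```
A real replication would grow full trees (e.g. by applying the same best-split choice recursively) over the 16 annotated features.
</Paragraph>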
    </Section>
  </Section>
</Paper>