<?xml version="1.0" standalone="yes"?> <Paper uid="P05-2010"> <Title>Using Readers to Identify Lexical Cohesive Structures in Texts</Title> <Section position="3" start_page="0" end_page="55" type="metho"> <SectionTitle> 2 From Lexical Cohesion to Anchoring </SectionTitle> <Paragraph position="0"> Cohesive ties between items in a text draw on the resources of a language to build up the text's unity (Halliday and Hasan, 1976). Lexical cohesive ties draw on the lexicon, i.e. word meanings.</Paragraph> <Paragraph position="1"> Sometimes the relation between the members of a tie is easy to identify, like near-synonymy (disease/illness), complementarity (boy/girl), whole-to-part (box/lid), but the bulk of lexical cohesive texture is created by relations that are difficult to classify (Morris and Hirst, 2004). Halliday and Hasan (1976) exemplify those with pairs like dig/garden, ill/doctor, laugh/joke, which are reminiscent of the idea of scripts (Schank and Abelson, 1977) or schemata (Rumelhart, 1984): certain things are expected in certain situations, the paradigm example being menu, tables, waiters and food in a restaurant.</Paragraph> <Paragraph position="2"> However, texts sometimes start with descriptions of situations where many possible scripts could apply. Consider a text starting with Mother died today.1 What are the generated expectations? A description of an accident that led to the death, or of a long illness? A story about what happened to the rest of the family afterwards? Or an emotional reaction of the speaker - like a sense of loneliness in the world? Or something more &quot;technical&quot; - about the funeral, or the will? Or something about the mother's last wish and its fulfillment? Many directions are easily thinkable at this point.</Paragraph> <Paragraph position="3"> We suggest that rather than generating predictions, scripts/schemata could provide a basis for abduction. Once any &quot;normal&quot; direction is actually taken up by the following text, there is a connection back to whatever makes this a normal direction, according to the reader's commonsense knowledge (possibly couched in terms of scripts or schemata). Thus, had the text developed the illness line, one would have known that it can be best explained-by/blamed-upon/abduced-to the previously mentioned lethal outcome. We say in this case that illness is anchored by died, and mark it illness ← died; we aim to elicit such anchoring relations from the readers.</Paragraph> </Section> <Section position="4" start_page="55" end_page="56" type="metho"> <SectionTitle> 3 Experimental Design </SectionTitle> <Paragraph position="0"> We chose 10 texts for the experiment: 3 news articles, 4 items of journalistic writing, and 3 fiction pieces. All news articles and one fiction story were taken in full; the others were cut at a meaningful break to stay within a 1000-word limit. All but two of the texts were originally written in English.</Paragraph> <Paragraph position="1"> Our subjects were 22 students at the Hebrew University of Jerusalem, Israel: 19 undergraduates and 3 graduates, all aged 21-29 years, studying various subjects - Engineering, Cognitive Science, Biology, History, Linguistics, Psychology, etc. Three of the participants named English their mother tongue; the rest claimed very high proficiency in English. 
People were paid for participation.</Paragraph> <Paragraph position="2"> All participants were first asked to read the guidelines, which contained an extensive example of an annotation done by us on a 4-paragraph text (a small extract is shown in table 1), and short paragraphs highlighting various issues, like the possibility of multiple anchors per item (see table 1) and of multi-word anchors (Scientific or American alone do not anchor editor, but taken together they do).</Paragraph> <Paragraph position="3"> In addition, the guidelines stressed the importance of separation between general and personal knowledge, and between general and instantial relations. For the latter case, an example was given of a story about children who went out in a boat with their father, who was an experienced sailor, with an explanation that whereas father ← children and sailor ← boat are based on general commonsense knowledge, the connection between sailor and father is not something general but is created in this particular case because the two descriptions apply to the same person; people were asked not to mark such relations.</Paragraph> <Paragraph position="4"> Afterwards, the participants performed a trial annotation on a short news story, after which meetings in small groups were held for them to bring up any questions and comments2.</Paragraph> <Paragraph position="5"> The Federal Aviation Administration underestimated the number of aircraft flying over the Pantex Weapons Plant outside Amarillo, Texas, where much of the nation's surplus plutonium is stored, according to computerized studies under way by the Energy Department.</Paragraph> <Paragraph position="6"> Table 1: Example annotation (extract): where ← {amarillo, texas, outside}. The notation x ← {c, d} means that each of c and d is an anchor for x.</Paragraph> <Paragraph position="7"> The experiment then started. For each of the 10 texts, each person was given the text to read, and a separate wordlist on which to write down annotations. The wordlist contained words from the text, in their appearance order, excluding verbatim and inflectional repetitions3. People were instructed to read the text first, and then go through the wordlist and ask themselves, for every item on the list, which previously mentioned items help the easy accommodation of this concept into the evolving story, if indeed it is easily accommodated, based on the commonsense knowledge as it is perceived by the annotator. People were encouraged to use a dictionary if they were not sure about some nuance of meaning.</Paragraph> <Paragraph position="8"> Wordlist length per text ranged from 175 to 339 items; annotation of one text took a person 70 minutes on average (each annotator was timed on two texts; every text was timed for 2-4 annotators). 3We conjectured that repetitions are usually anchored by the previous mention; this assumption is a simplification, since sometimes the same form is used in a somewhat different sense and may get anchored separately from the previous use of this form. This issue needs further experimental investigation.</Paragraph> </Section>
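As an illustration of the wordlist construction just described, the following minimal Python sketch keeps only the first occurrence of each word form. It is not the authors' tooling - the paper does not say how the lists were built - and the lemma argument is a placeholder for whichever lemmatizer one chooses in order to also collapse inflectional variants.

```python
import re

def build_wordlist(text, lemma=lambda w: w.lower()):
    """Tokens of `text` in order of first appearance, dropping verbatim
    repetitions; with a real lemmatizer, inflectional repetitions too."""
    seen = set()
    wordlist = []
    for token in re.findall(r"[A-Za-z']+", text):
        key = lemma(token)
        if key not in seen:
            seen.add(key)
            wordlist.append(token)
    return wordlist
```

With the case-folding default, only verbatim repetitions are removed; plugging in a lemmatizer would also collapse pairs like stored/stores onto a single list entry.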
<Section position="5" start_page="56" end_page="58" type="metho"> <SectionTitle> 4 Analysis of Experimental Data </SectionTitle> <Paragraph position="0"> Most of the existing research in computational linguistics that uses human annotators is within the framework of classification, where an annotator decides, for every test item, on an appropriate tag out of the pre-specified set of tags (Poesio and Vieira, 1998; Webber and Byron, 2004; Hearst, 1997; Marcus et al., 1993).</Paragraph> <Paragraph position="1"> Although our task is not that of classification, we start from a classification sub-task, and use agreement figures to guide subsequent analysis. We use the by now standard κ statistic (Di Eugenio and Glass, 2004; Carletta, 1996; Marcu et al., 1999; Webber and Byron, 2004) to quantify the degree of above-chance agreement between multiple annotators, and the α statistic for analysis of sources of unreliability (Krippendorff, 1980). The formulas for the two statistics are given in appendix A.</Paragraph> <Section position="1" start_page="56" end_page="57" type="sub_section"> <SectionTitle> 4.1 Classification Sub-Task </SectionTitle> <Paragraph position="0"> Classifying items into anchored/unanchored can be viewed as a sub-task of our experiment: before writing down anchors for a particular item, the annotator asked himself whether the concept at hand is easy to accommodate at all. Getting reliable data on this task is therefore a pre-condition for asking any questions about the anchors. Average agreement on this task over the 10 texts does not reach the κ = 0.67 area, which is the accepted threshold for deciding that annotators were working under similar enough internalized theories4 of the phenomenon; however, the figures are high enough to suggest considerable overlaps.</Paragraph> <Paragraph position="1"> Seeking more detailed insight into the degree of similarity of the annotators' ideas of the task, we follow the procedure described in (Krippendorff, 1980) to find outliers. We calculate the category-by-category co-markup matrix for all annotators5, then for all but one annotator, and by subtraction find the portion that is due to this one annotator.</Paragraph> <Paragraph position="2"> We then regard the data as two-annotator data (one vs. everybody else), and calculate agreement coefficients. We rank annotators (1 to 22) according to the degree of agreement with the rest, separately for each text, and average over the texts to obtain the conformity rank of an annotator. The lower the rank, the less compliant the annotator.</Paragraph> <Paragraph position="3"> Annotators' conformity ranks cluster into 3 groups described in table 2. The two members of group A are consistent outliers - their average rank for the 10 texts is below 2. The second group (B) is, on average, in the bottom half of the annotators with respect to agreement with the common, whereas members of group C display relatively high conformity.</Paragraph> <Paragraph position="4"> It is possible that annotators in groups A, B and C have alternative interpretations of the guidelines, but our idea of the &quot;common&quot; (and thus the conformity ranks) is dominated by the largest group, C. Within-group agreement rates shown in table 2 suggest that the two annotators in group A do indeed have an alternative understanding of the task, being much better correlated with each other than with the rest.</Paragraph>
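A simplified sketch of the one-vs-rest conformity ranking is given below. It is an approximation, not the paper's exact procedure (which subtracts per-annotator contributions from Krippendorff-style co-markup matrices): here each annotator's mean pairwise κ with the others stands in for the one-vs-rest coefficient, and the per-text 0/1 matrices and all names are our own illustrative assumptions.

```python
import numpy as np

def cohen_kappa(a, b):
    """Cohen's kappa for two binary (0/1) label vectors."""
    a, b = np.asarray(a), np.asarray(b)
    po = np.mean(a == b)                    # observed agreement
    pa, pb = a.mean(), b.mean()             # marginal rates of label 1
    pe = pa * pb + (1 - pa) * (1 - pb)      # expected chance agreement
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

def conformity_ranks(texts):
    """texts: one (n_annotators x n_items) 0/1 matrix per text.
    Returns each annotator's rank (1 = least agreement with the rest),
    averaged over the texts."""
    n = texts[0].shape[0]
    ranks = []
    for m in texts:
        score = [np.mean([cohen_kappa(m[i], m[j]) for j in range(n) if j != i])
                 for i in range(n)]
        ranks.append(np.argsort(np.argsort(score)) + 1)  # low score -> low rank
    return np.mean(ranks, axis=0)
```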
<Paragraph position="5"> The figures for the other two groups could support two scenarios: (1) each group settled on a different theory of the phenomenon, where group C is in better agreement on its version than group B is on its own; (2) people in groups B and C have basically the same theory, but members of C are more systematic in carrying it through. It is crucial for our analysis to tell those apart - in the case of multiple stable interpretations it is difficult to talk about the anchoring phenomenon; in the core-periphery case, there is hope to identify the core emerging from 20 out of 22 annotations.</Paragraph> <Paragraph position="6"> Let us call the set of majority opinions on a list of items an interpretation of the group, and let us call the average majority percentage its consistency. Thus, if all decisions of a 9-member group were almost unanimous, the consistency of the group is 8/9 = 89%, whereas if every time there was a one-vote edge to the winning decision, the consistency was 5/9 = 56%. The more consistent the interpretation given by a group, the higher its agreement coefficient. If groups B and C have different interpretations, adding a person p from group C to group B would usually not improve the consistency of the target group (B), since p is likely to represent the majority opinion of a group with a different interpretation.</Paragraph> <Paragraph position="7"> On the other hand, if the two groups settled on basically the same interpretation, the difference in ranks reflects a difference in consistency. Then moving p from C to B would usually improve the consistency in B, since, coming from a more consistent group, p's agreement with the interpretation is expected to be better than that of an average member of group B, so the addition strengthens the majority opinion in B6.</Paragraph> <Paragraph position="8"> We performed this analysis on groups A and C with respect to group B. Adding members of group A to group B improved the agreement in group B only for 1 out of the 10 texts. Thus, the relationship between the two groups seems to be that of different interpretations. Adding members of group C to group B resulted in improvement in agreement in at least 7 out of 10 texts for every added member.</Paragraph> <Paragraph position="9"> Thus, the difference between groups B and C is one of consistency, not of interpretation; we may now search for the well-agreed-upon core of this interpretation. We exclude members of group A from subsequent analysis; the remaining group of 20 annotators exhibits a somewhat higher average κ on the anchored/unanchored classification.</Paragraph> </Section>
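The consistency measure and the add-one-annotator test just described can be spelled out in a few lines. This is a sketch under the same assumptions as above (binary classifications stored as 0/1 matrices; names are ours, not the paper's).

```python
import numpy as np

def consistency(group):
    """Average majority percentage of a (n_annotators x n_items) 0/1 matrix:
    for each item, the share of annotators voting with the majority."""
    n = group.shape[0]
    ones = group.sum(axis=0)
    return float(np.mean(np.maximum(ones, n - ones) / n))

def improves_consistency(group_b, annotator):
    """Test from the text: does adding one group-C annotator's labels
    raise the consistency of group B?"""
    return consistency(np.vstack([group_b, annotator])) > consistency(group_b)
```

On the examples in the text, a 9-member group splitting 8-1 on every item scores 8/9 = 89%, and a constant 5-4 split scores 5/9 = 56%.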
<Section position="2" start_page="57" end_page="57" type="sub_section"> <SectionTitle> 4.2 Finding the Common Core </SectionTitle> <Paragraph position="0"> The next step is finding a reliably classified subset of the data. We start with the most agreed-upon items - those classified as anchored or non-anchored by all 20 people - then by 19, 18, etc., testing, for every such inclusion, that the chances of taking in instances of chance agreement are small enough. This means performing a statistical hypothesis test: with how much confidence can we reject the hypothesis that a certain agreement level7 is due to chance. A confidence level of p < 0.01 is achieved by including items marked by at least 13 out of 20 people and items unanimously left unmarked.8</Paragraph> <Paragraph position="1"> 6Experiments with synthetic data confirm this analysis: with 20 annotations split into 2 sets of sizes 9 and 11, it is possible to get the same overall agreement (α) either with 75% and 90% consistency on the same interpretation, or with 90% and 95% consistency on two interpretations with induced (i.e. non-random) overlap of just 20%.</Paragraph> <Paragraph position="2"> The next step is identifying trustworthy anchors for the reliably anchored items. We calculated the average anchor strength for every text: the number of people who wrote the same anchor for a given item, averaged over all reliably anchored items in a text. Average anchor strength ranges between 5 and 7 in different texts. Taking only strong anchors (anchors of at least the average strength), we retain about 25% of all anchors assigned to anchored items in the reliable subset. In total, there are 1261 pairs of reliably anchored items with their strong anchors, between 54 and 205 per text.</Paragraph> <Paragraph position="3"> The strength cut-off is a heuristic procedure; some of those anchors were marked by as few as 6 or 7 out of 20 people, so it is not clear whether they can be trusted as embodiments of the core of the anchoring phenomenon in the analyzed texts. Consequently, an anchor validation procedure is needed.</Paragraph> </Section> <Section position="3" start_page="57" end_page="58" type="sub_section"> <SectionTitle> 4.3 Validating the Common Core </SectionTitle> <Paragraph position="0"> We observe that although people were asked to mark all anchors for every item they thought was anchored, they actually produced only 1.86 anchors per anchored item. Thus, people were most concerned with finding an anchor, i.e. making sure that something they think is easily accommodatable is given at least one preceding item to blame for that; they were less diligent in marking up all such items.</Paragraph> <Paragraph position="1"> This is also understandable processing-wise; after a scrupulous read of the text, coming up with one or two anchors can be done from memory, only occasionally going back to the text; putting down all anchors would require systematic scanning of the previous stretch of text for every item on the list, and the latter task is hardly doable in 70 minutes.</Paragraph> <Paragraph position="2"> 7A random variable ranging between 0 and 20 says how many &quot;random&quot; people marked an item as anchored. We model &quot;random&quot; versions of annotators by taking the proportion p_i of items marked as anchored by annotator i in the whole of the dataset, and assuming that, for every word, the person was tossing a coin with P(heads) = p_i, independently for every word. 8A confidence level of p < 0.05 allows augmenting the set of reliably unanchored items with those marked by 1 or 2 people, retaining the same cutoff for anchoredness. This cut covers more than 60% of the data, and contains 1504 items, 538 of which are anchored.</Paragraph>
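Footnote 7's chance model lends itself to a direct computation: convolve each annotator's marking probability into a Poisson-binomial distribution and read the cutoff off its tail. The sketch below is ours, not the paper's code; rates would hold each annotator's overall proportion of items marked as anchored, and the significance threshold is a parameter.

```python
import numpy as np

def null_distribution(rates):
    """P(exactly k of the annotators mark an item) when annotator i marks
    independently with probability rates[i] (a Poisson-binomial law)."""
    dist = np.array([1.0])
    for p in rates:
        dist = np.convolve(dist, [1.0 - p, p])
    return dist

def anchoredness_cutoff(rates, alpha=0.01):
    """Smallest k such that P(k or more chance markings) < alpha."""
    tail = np.cumsum(null_distribution(rates)[::-1])[::-1]  # tail[k] = P(X >= k)
    return int(np.argmax(tail < alpha))
```

With 20 annotators, the cutoff this returns depends on the observed marking rates; the text quotes a cutoff of 13 out of 20 for its data.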
<Paragraph position="3"> Having in mind the difficulty of producing an exhaustive list of anchors for every item, we conducted a follow-up experiment to see whether people would accept anchors when those are presented to them, as opposed to generating them. We used 6 out of the 10 texts and 17 out of the 20 annotators for the follow-up experiment. Each person annotated 3 texts; each text received 7-9 annotations of this kind.</Paragraph> <Paragraph position="4"> For each text, the reader was presented with the same list of words as in the first part, only now each word was accompanied by a list of anchors. For each item, every anchor generated by at least one person was included; the order of the anchors had no correspondence with the number of people who had generated them. A small number of items also received a random anchor - a randomly chosen word from the preceding part of the wordlist. The task was to cross out anchors that the person did not agree with.</Paragraph> <Paragraph position="5"> Ideally, i.e. if lack of markup is merely a difference in attention but not in judgment, all non-random anchors should be accepted. To see the distance of the actual results from this scenario, we calculate the total mass of votes as the number of item-anchor pairs times the number of people, and check how many are accept votes. For all non-random pairs, 62% were accept votes; for the core annotations (pairs of reliably anchored items with strong anchors), 94% were accept votes, with texts ranging between 90% and 96%; for pairs with a random anchor, only 15% were accept votes. Thus, agreement-based analysis of the anchor generation data allowed us to identify a highly valid portion of the annotations.</Paragraph> </Section> </Section> </Paper>