<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1008"> <Title>Annotating and measuring temporal relations in texts</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 6 Results </SectionTitle>
<Paragraph position="0"> In order to see whether the measures we propose are meaningful, we looked at how they behave on a text &quot;randomly&quot; annotated in the following way: we selected pairs of events in a text at random, and for each pair we picked a random annotation relation. We then saturated the graph of constraints and compared the result with the human annotation. The resulting scores are typically very low, as shown in Table 3 for a newswire message taken as an example.</Paragraph>
<Paragraph position="1"> We then made two series of measures: one on annotation relations (disjunctions of Allen relations are thus re-expressed as disjunctions of the annotation relations that contain them), and one on the equivalent Allen relations (which arguably reflects the underlying computation more closely, at the cost of a less direct measure of the actual task). In the first case, an Allen relation answer equal to b or d or s between two events is considered as &quot;before or is_included&quot; (using the relations available to human annotators) and is compared to an annotation of the same form.</Paragraph>
<Paragraph position="2"> We then used finesse and coherence to evaluate our annotation, produced according to the method described in the previous sections. We tried it on a still limited set of 8 newswire texts (from AFP), for a total of 2300 words and 160 events (we are still annotating more texts manually to give more significance to the results); by comparison, the English corpus of (Setzer, 2001) has 6 texts, fewer than 2000 words, and also about 160 events. Each of our texts has between 10 and 40 events, and the system finds them correctly with precision and recall around 97%. We made the comparison only on the correctly recognized events, in order to keep the problems separate. This of course limits the influence of errors on coherence, but handicaps finesse, as less information is available for inference. The measures were then averaged over the number of texts. This departs from what could be considered a more standard practice, namely summing everything and dividing by the number of comparisons made. The reason is that we view comparing two graphs as comparing two temporal models of a text, not just as finding a list of targets in a set of texts. It might be easier to accept this if one remembers that the number of possible relations between n events is n(n-1)/2.</Paragraph>
<Paragraph position="3"> A text t1 with k times more events than a text t2 would thus have about k^2 times more importance in a global score, and we find this non-linear relation between the size of a text and its weight in the evaluation confusing. Therefore, both finesse and coherence are generalized as global measures of the temporal model of a text. It could then be interesting to relate temporal information to other features of a given text (size being only one factor).</Paragraph>
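To make the weighting argument concrete, here is a minimal sketch (not the authors' implementation) of the per-text averaging described above. The score_text function is a stand-in for the finesse and coherence computations defined in the previous sections, and all names are illustrative.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

# A temporal graph maps unordered event pairs to a disjunction (set) of relations.
Graph = Dict[Tuple[str, str], FrozenSet[str]]

def n_pairs(n_events: int) -> int:
    # Number of possible relations (event pairs) in a text with n events: n(n-1)/2.
    return n_events * (n_events - 1) // 2

def macro_average(
    texts: List[Tuple[Graph, Graph]],
    score_text: Callable[[Graph, Graph], Tuple[float, float]],
) -> Tuple[float, float]:
    """Average (finesse, coherence) over texts, each text counting once.

    Pooling all event pairs instead would weight a text roughly quadratically
    in its number of events (see n_pairs): a text with k times more events
    gets about k^2 times more weight in a pooled score, which is the
    non-linear effect discussed above.
    """
    scores = [score_text(system, reference) for system, reference in texts]
    n = len(scores)
    finesse = sum(f for f, _ in scores) / n
    coherence = sum(c for _, c in scores) / n
    return finesse, coherence
```

With this scheme every text contributes equally to the final figures, regardless of how many events it contains.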
<Paragraph position="4"> Results are shown in Table 4. They seem promising considering the simplifications we have made at every step of the process. Caution is necessary though, given the limited number of texts we have experimented on and the high variation we have observed between texts. At this stage we believe the exact quality of the results is not the most important point. Our main objective was to show the feasibility of a robust method for annotating temporal relations, and to provide useful tools for evaluating the task, so that each step can later be improved separately. Our focus was on the design of a sound methodology.</Paragraph>
<Paragraph position="5"> A first analysis of the results shows that the errors fall into several categories. First, a number of temporal adverbials were attached to the wrong event, or were misinterpreted; this should improve with a better parser than the one we used. Second, we have not tried to take into account the specific narrative style of newswire texts: in our set of texts, the present tense was used in many places, sometimes to refer to events in the past and sometimes to refer to events that were going to happen at the time the text was published. However, given the method we adopted, one could have expected better coherence results than finesse results. This means we have made decisions that were not cautious enough, for reasons we still have to analyze.</Paragraph>
<Paragraph position="6"> One potential reason is that the relations offered to human annotators are perhaps too vague in the wrong places: a lot of information in a text can be asserted to be &quot;strictly before&quot; something else (based on dates, for instance), while human annotators can only say that events are &quot;before or meets&quot; some other event; each time this is the case, coherence is only 0.5.</Paragraph>
<Paragraph position="7"> It is important to note that there are few points of comparison on this problem. To the best of our knowledge, only (Li et al., 2001) and (Mani and Wilson, 2000) mention having tried this kind of annotation, as a side task of their temporal expression mark-up systems. The former considers only relations between events within a sentence, and the latter did not evaluate their method.</Paragraph>
<Paragraph position="8"> Finally, it is worth remembering that human annotation itself is a difficult task, with potentially a lot of disagreement between annotators. For now, our texts have been annotated by the two authors, with an a posteriori resolution of conflicts. We therefore have no measure of inter-annotator agreement that could serve as an upper bound on the performance of the system, although we are planning to obtain one at a later stage.</Paragraph> </Section> </Paper>