<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1106">
<Title>References</Title>
<Section position="5" start_page="54" end_page="55" type="evalu">
<SectionTitle> 5 Evaluation </SectionTitle>
<Paragraph position="0"> The purpose of the selectional restrictions is to constrain the types of information that can be instantiated by each slot. Consequently, we hoped that the case frames would be more reliably instantiated than the extraction patterns, thereby producing fewer false hits. To evaluate the case frames, we used the same corpus and evaluation metrics as previous experiments with AutoSlog and AutoSlog-TS (Riloff, 1996b) so that we can draw comparisons between them. For training, we used the 1500 MUC-4 development texts to generate the extraction patterns and the semantic lexicon. AutoSlog-TS generated 44,013 extraction patterns in its first pass. After discarding the patterns that occurred only once, the remaining 11,517 patterns were applied to the corpus for the second pass and ranked for manual review. We reviewed the top 2168 patterns [5] and kept 306 extraction patterns for the final dictionary.</Paragraph>
<Paragraph position="1"> We built a semantic lexicon for nine categories associated with terrorism: BUILDING, CIVILIAN, GOV-OFFICIAL, MILITARYPEOPLE, LOCATION, TERRORIST, DATE, VEHICLE, WEAPON. We reviewed the top 500 words for each category. It takes about 30 minutes to review a category, assuming that the reviewer is familiar with the domain. Our final semantic dictionary contained 494 words. In total, the review process required approximately 6 person-hours: 1.5 hours to review the extraction patterns plus 4.5 hours to review the words for 9 semantic categories.</Paragraph>
<Paragraph position="2"> From the extraction patterns and semantic lexicon, our system generated 137 conceptual case frames.</Paragraph>
<Paragraph position="3"> One important question is how to deal with unknown words during extraction. This is especially important in the terrorism domain because many of the extracted items are proper names, which cannot be expected to be in the semantic lexicon. We allowed unknown words to fill all eligible slots and then used a precedence scheme so that each item was instantiated by only one slot. Precedence was based on the order of the roles shown in Figure 4. This is not a very satisfying solution and is one of the weaknesses of our current approach. Handling unknown words more intelligently is an important direction for future research.</Paragraph>
<Paragraph position="4"> We compared AutoSlog-TS' extraction patterns with the case frames using 100 blind texts [6] from the MUC-4 test set. The MUC-4 answer keys were used to score the output. Each extracted item was scored as either correct, mislabeled, duplicate, or spurious. An item was correct if it matched against the answer keys. An item was mislabeled if it matched against the answer keys but was extracted as the wrong type of object (e.g., if a victim was extracted as a perpetrator). An item was a duplicate if it was coreferent with an item in the answer keys. Correct items extracted more than once were scored as duplicates, as were correct but underspecified extractions such as &quot;Kennedy&quot; instead of &quot;John F. Kennedy&quot;. [7] An item was spurious if it did not appear in the answer keys. All items extracted from irrelevant texts were spurious. Finally, items in the answer keys that were not extracted were counted as missing. Correct + missing equals the total number of items in the answer keys. [8]</Paragraph>
<Paragraph position="5"> [Tables 1 and 2: for each slot, the number of correct (cor), missing (mis), mislabeled (mlb), duplicate (dup), and spurious (spu) extractions, together with Recall (R) and Precision (P).]</Paragraph>
<Paragraph position="6"> Table 1 shows the results [9] for AutoSlog-TS' extraction patterns, and Table 2 shows the results for the case frames. We computed Recall (R) as correct / (correct + missing), and Precision (P) as (correct + duplicate) / (correct + duplicate + mislabeled + spurious). The extraction patterns and case frames achieved similar recall results, although the case frames missed seven correct extractions. However, the case frames produced substantially fewer false hits, producing 82 fewer spurious extractions.</Paragraph>
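<Paragraph position="7"> The recall and precision definitions above reduce to simple arithmetic over the per-slot tallies. The following minimal Python sketch shows that computation; it is an illustration rather than code from this work, the SlotCounts type and function names are our own, and the example numbers are placeholders, not values from Tables 1 and 2.
from dataclasses import dataclass

@dataclass
class SlotCounts:
    # Per-slot tallies, following the scoring categories defined above.
    correct: int
    missing: int
    mislabeled: int
    duplicate: int
    spurious: int

def recall(c):
    # R = correct / (correct + missing)
    denom = c.correct + c.missing
    return c.correct / denom if denom else 0.0

def precision(c):
    # P = (correct + duplicate) / (correct + duplicate + mislabeled + spurious)
    num = c.correct + c.duplicate
    denom = num + c.mislabeled + c.spurious
    return num / denom if denom else 0.0

# Placeholder numbers for a single slot, used only to show the calculation.
victim = SlotCounts(correct=40, missing=20, mislabeled=3, duplicate=10, spurious=25)
print(round(recall(victim), 2), round(precision(victim), 2))
</Paragraph>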
<Paragraph position="8"> Note that perpetrators exhibited by far the lowest precision. The reason is that the perpetrator slot received the highest precedence among competing slots for unknown words. Changing the precedence scheme produces a bubble effect where many incorrect extractions shift to the primary default category. The case frames therefore have the potential for even higher precision if the unknown words are handled better. Expanding the semantic lexicon is one option, and additional work may suggest ways to choose slots for unknown words more intelligently.</Paragraph>
<Paragraph position="9"> Footnotes:
[5] We decided to review the top 2000 but continued down the list until there were no more ties.
[6] 25 relevant texts and 25 irrelevant texts from each of the TST3 and TST4 test sets.
[7] The rationale for scoring coreferent phrases as duplicates instead of spurious is that the extraction pattern or case frame was instantiated with a reference to the correct answer. In other words, the pattern (or case frame) did the right thing. Resolving coreferent phrases to produce the best answer is a problem for subsequent discourse analysis, which is not addressed by the work presented here.
[8] A caveat is that the MUC-4 answer keys contain some &quot;optional&quot; answers. We scored these as correct if they were extracted, but they were never scored as missing, which is how the &quot;optional&quot; items were scored in MUC-4. Note that the number of possible extractions can vary depending on the output of the system.
</Paragraph>
</Section>
</Paper>