<?xml version="1.0" standalone="yes"?> <Paper uid="M92-1015"> <Title>The &quot;ALL TEMPLATES&quot; results of our &quot;official&quot; runs were as follows: RECALL PRECISION</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> RECALL PRECISION </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> Evaluating the degree of improvement over the MUC-3 runs is complicated by the changes between MUC-3 and MUC-4: there were changes in the template structure, the MURDER templates were eliminated, content mapping constraints were incorporated into the scoring program, and the rules for manual remapping were much more constrained. We resumed system development specifically for MUC (with regular runs and rescorings) in mid-March, approximately two weeks before the &quot;Dry Run&quot; was due, and the modifications prior to the Dry Run primarily reflected the changes needed for the new template structure (no significant changes were made to concepts, verb models, inference rules, etc.). The difference between our final MUC-3 scores and our Dry Run scores thus roughly reflects the effect of the change in the task -- for both TST1 and TST2, a loss of about 7 points of recall. During the following 8 weeks, we made a number of system modifications which recovered much of this loss of recall and substantially improved system precision.</Paragraph> <Paragraph position="3"> During the period from mid-March, when we adapted the system for the MUC-4 templates and began scoring runs, until the evaluation at the end of May, approximately 5 to 6 person-months were devoted to development specifically addressed to MUC-4 performance. This does not count the time we spent since MUC-3 on research using the MUC-3 data, on such topics as semantic pattern acquisition, Wordnet, and grammar evaluation; most of this work was not directly used in the MUC-4 system.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> IMPROVEMENTS </SectionTitle> <Paragraph position="0"> We made a number of small improvements in upgrading our MUC-3 system for the MUC-4 evaluation: (1) We integrated the BBN stochastic part-of-speech tagger into our system. We had done this for MUC-3, but in a rather crude way, keeping only the most probable part-of-speech assigned by the tagger. This made the system run faster, but with some loss of recall. For MUC-4, we made full use of the probabilities assigned by the tagger, combining them with the other contributions to our scoring function (e.g., semantic scores, syntactic penalties) and selecting the highest-scoring analysis. This yielded a small improvement in system recall (1% on the TST1 corpus).</Paragraph> <Paragraph position="1"> (2) We incorporated a more elaborate time analysis component to handle constructs such as &quot;Three weeks later ...&quot; and &quot;Two weeks after &lt;event 1&gt;, &lt;event 2&gt; ...&quot;, in addition to the absolute times (explicit dates) and times relative to the dateline (&quot;two weeks ago&quot;) which were handled in our MUC-3 system. The system now produces a time graph relating events, and computes absolute times as the information becomes available. This produced a small benefit in recall and precision. (A schematic sketch of such a time graph appears below.)</Paragraph>
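As a rough illustration of the kind of bookkeeping such a time graph involves (a minimal sketch with invented event names and dates, not NYU's actual implementation): events are nodes, known offsets such as &quot;two weeks after&quot; are edges, and absolute dates are propagated once any event is anchored to an explicit date or to the dateline.

    # Hypothetical sketch of a time graph: events are nodes, edges carry known
    # relative offsets, and absolute dates are propagated once any anchored
    # date (e.g., the dateline) becomes available.
    from datetime import date, timedelta

    class TimeGraph:
        def __init__(self):
            self.absolute = {}   # event -> date, when known
            self.offsets = []    # (earlier_event, later_event, timedelta)

        def set_date(self, event, d):
            self.absolute[event] = d

        def add_offset(self, earlier, later, delta):
            self.offsets.append((earlier, later, delta))

        def propagate(self):
            # Repeatedly fill in dates reachable from anchored events.
            changed = True
            while changed:
                changed = False
                for earlier, later, delta in self.offsets:
                    if earlier in self.absolute and later not in self.absolute:
                        self.absolute[later] = self.absolute[earlier] + delta
                        changed = True
                    elif later in self.absolute and earlier not in self.absolute:
                        self.absolute[earlier] = self.absolute[later] - delta
                        changed = True
            return self.absolute

    # Invented example: "Two weeks after the bombing, the kidnapping ...",
    # with the bombing anchored to an (invented) dateline of 12 Apr 1989.
    g = TimeGraph()
    g.add_offset("bombing", "kidnapping", timedelta(weeks=2))
    g.set_date("bombing", date(1989, 4, 12))
    print(g.propagate())   # the kidnapping resolves to 1989-04-26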
<Paragraph position="2"> (3) In our MUC-3 system, if no parse could be obtained of the entire sentence, we identified the longest string starting at the first word which could be analyzed as a sentence. We now have the option of taking the remaining words, identifying the longest clauses and noun phrases, and processing these (in addition to the longest initial substring). We refer to this as &quot;syntactic debris&quot;. Because most sentences obtain a full-sentence parse, this option has only a small effect. On TST3, selecting &quot;syntactic debris&quot; increased recall by 1% and reduced precision by 1%.</Paragraph> <Paragraph position="3"> (4) We implemented a simple mechanism for dynamically shifting the parsing strategy. For each sentence, up to a certain point, all hypotheses are followed, in a best-first order determined by our scoring function. Once a specified number of hypotheses have been generated (15000 in the official runs), we shift to a mode where only the highest-ranking hypothesis for each non-terminal and each span of sentence words is retained. This mode may yield a sub-optimal analysis (because many constraints are non-local), but will converge to some analysis much more quickly (effectively shifting from an exponential to a polynomial-time algorithm). (5) We made several improvements to reference resolution. In particular, we refined the semantic slot/filler representation we use for people in order to improve anaphor-antecedent matching.</Paragraph> <Paragraph position="4"> (6) We have been steadily expanding our grammatical coverage.</Paragraph> <Paragraph position="5"> Except as needed for our other system changes, we made relatively few additions to the sets of concepts and lexical models developed for MUC-3.¹ We did not extend the effort at extensive corpus analysis pursued prior to MUC-3; rather, we experimented with various strategies which would lead to greater automation of this process in the future (see the sections below on &quot;Wordnet&quot; and &quot;Acquiring Selectional Constraints&quot;).</Paragraph> </Section> <Section position="5" start_page="0" end_page="125" type="metho"> <SectionTitle> DISCOURSE </SectionTitle> <Paragraph position="0"> At MUC-3, discourse analysis was frequently cited as a serious shortcoming of many of the systems. In our system, discourse analysis (beyond reference resolution) is reflected mainly in decisions about merging events to form templates. Roughly speaking, our MUC-3 system tried to merge events (barring conflicting time, location, etc.) * when they affected the same target * when they appeared in the same sentence * when an attack (including bombing, arson, etc.) was followed by an effect (death, damage, injury, etc.) For MUC-4 we tried 3 variations on our discourse analysis procedure: (1) blocking attack/effect merging across paragraph boundaries (a schematic sketch of this merging test appears after this list) (2) in addition, making use of anaphoric references to events in the merging procedure (so that &quot;Five civilians were killed in the attack.&quot; would cause the templates for the attack and the killings to be merged even if the antecedent of &quot;attack&quot; were in a prior paragraph).</Paragraph> <Paragraph position="1"> (3) identifying and attempting to merge general and specific descriptions of events (this happens quite often in newspaper-style articles, where the introductory paragraph is a summary of several distinct events which are reported separately later in the article). This linking of general and specific events was then used by reference resolution to order the search for antecedents. (This can be viewed as an attempt at a Grosz/Sidner focus stack.)
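The merging test described above, with variation 1's paragraph-boundary restriction, can be pictured with a minimal, hypothetical sketch (the event representation and field names are invented for illustration; they are not the NYU system's actual data structures): two events merge when they share a target or a sentence, or when an attack is followed by an effect, unless a time or location conflicts or, under variation 1, a paragraph boundary separates the attack from the effect.

    # Hypothetical sketch of the event-merging decision described above.
    # An "event" here is just a dict with a few illustrative fields.
    ATTACK_TYPES = {"ATTACK", "BOMBING", "ARSON"}
    EFFECT_TYPES = {"DEATH", "INJURY", "DAMAGE"}

    def conflicting(a, b, field):
        # Values conflict only if both are known and differ.
        return a.get(field) and b.get(field) and a[field] != b[field]

    def should_merge(a, b, block_across_paragraphs=True):
        # Barring conflicting time or location ...
        if conflicting(a, b, "time") or conflicting(a, b, "location"):
            return False
        # ... merge when the events affect the same target,
        if a.get("target") and a.get("target") == b.get("target"):
            return True
        # ... when they appear in the same sentence,
        if a["sentence"] == b["sentence"]:
            return True
        # ... or when an attack is followed by an effect -- but (variation 1)
        # not across a paragraph boundary.
        if a["type"] in ATTACK_TYPES and b["type"] in EFFECT_TYPES:
            if block_across_paragraphs and a["paragraph"] != b["paragraph"]:
                return False
            return True
        return False

    attack = {"type": "BOMBING", "sentence": 3, "paragraph": 2, "location": "LIMA"}
    deaths = {"type": "DEATH", "sentence": 4, "paragraph": 2}
    print(should_merge(attack, deaths))   # True: attack followed by effect in the same paragraph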
Variation 1 did slightly better than the MUC-3 base system (on TST3, it obtained 1% better recall with no loss in precision). Variations 2 and 3, although more &quot;linguistically principled&quot;, did slightly worse (variation 2 lost 2% recall, 1% precision on TST3). We therefore used variation 1 for our official run.</Paragraph> <Paragraph position="2"> ¹ The set of lexico-semantic models grew by about 25% over MUC-3; the set of concepts (except for geographical names) by about 15%. A partial failure analysis for TST3 suggested that many of the template errors could be attributed to gaps or errors in the models or concepts, and hence that further improvements in these two components were crucial to improved performance. An examination of some of the errors indicated that, while variations 2 and 3 performed adequately in and of themselves, they were sensitive to errors in prior stages of processing (in particular, shortcomings in semantic interpretation led to occasional incorrect anaphora resolution, which in turn led to excess event merging). In contrast, paragraph boundaries, while not as reliable a discourse indicator, are more reliably observed. Thus, the best component in isolation may not be the best choice for a system, because it may be too sensitive to errors made by prior components.</Paragraph> </Section> <Section position="6" start_page="125" end_page="125" type="metho"> <SectionTitle> RELATED RESEARCH </SectionTitle> <Paragraph position="0"> Much of our time since MUC-3 was devoted to research using the MUC-3/MUC-4 corpus and task. We describe here very briefly some of our work related to semantic acquisition, evaluation, and multi-lingual systems.</Paragraph> <Paragraph position="1"> WORDNET One of our central interests lies in improving the methods used for acquiring semantic knowledge for new domains. As we noted earlier, we did not invest much additional effort (beyond that for MUC-3) in manual data analysis in order to augment the conceptual hierarchy and lexico-semantic models. We instead conducted several experiments aimed at more automatic semantic acquisition.</Paragraph> <Paragraph position="2"> One of these experiments involved using Wordnet, a large hierarchy of word senses (produced by George Miller at Princeton), as a source of information to supplement our semantic classification hierarchy. We added to our hierarchy everything in Wordnet under the concepts person and building.</Paragraph> <Paragraph position="3"> We identified a number of additional events in this way. Some were correct. Some were incorrect, involving unintended senses of words. For example, the sentence El Salvador broke diplomatic relations.</Paragraph> <Paragraph position="4"> would be interpreted as an attack because &quot;relations&quot; (such as &quot;close relations&quot;, i.e., relatives) are people in Wordnet. Even more obscure is that He fought his way back.</Paragraph> <Paragraph position="5"> becomes an attack because &quot;back&quot; (as in &quot;running back&quot;, a football player) is a person. Some of the additional events were correct as events, but should not have appeared in templates, either because they were military (&quot;the enemy&quot;) or because they were anaphoric references to prior phrases (&quot;the perpetrator&quot;) and so should have been replaced by appropriate antecedents.</Paragraph>
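The extraction described above can be approximated today with NLTK's WordNet interface; the sketch below is a rough reconstruction (the original work predates NLTK and used an early Wordnet release directly), collecting every word found under the person and building concepts.

    # Rough reconstruction, using NLTK's WordNet interface, of the kind of
    # extraction described above.  Requires: nltk.download('wordnet')
    from nltk.corpus import wordnet as wn

    def words_under(synset_name):
        """Collect all lemma names found anywhere under a given synset."""
        root = wn.synset(synset_name)
        words = set()
        for s in root.closure(lambda s: s.hyponyms()):
            words.update(lemma.replace('_', ' ') for lemma in s.lemma_names())
        return words

    people = words_under('person.n.01')
    buildings = words_under('building.n.01')

    # The ambiguity problem discussed above: in the Wordnet release used at the
    # time, "relations" and "back" had person senses; whether they still appear
    # under person here depends on the WordNet version installed.
    print('relation' in people, 'back' in people)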
<Paragraph position="6"> These results suggest that Wordnet may be a good source of concepts, but that it will not be of net benefit unless manually reviewed with respect to a particular application.</Paragraph> </Section> <Section position="7" start_page="125" end_page="125" type="metho"> <SectionTitle> ACQUIRING SELECTIONAL CONSTRAINTS </SectionTitle> <Paragraph position="0"> An alternative source of semantic information is the texts themselves. NYU has conducted a number of studies aimed at gleaning selectional constraints and semantic classes from the co-occurrence patterns in the sample texts in a domain.</Paragraph> <Paragraph position="1"> In the past year, we focussed on the task of acquiring the selectional constraints needed for the MUC texts. We have tried to automate this task by parsing 1000 MUC messages (without semantic constraints) and collecting frequency information on subject-verb-object and head-modifier patterns. Where possible, we used the classification hierarchy (which we had built by hand) to generalize words in these patterns to word classes. We then used these patterns as selectional constraints in parsing new text; we found that they did slightly better than the constraints we had created by hand last year [1]. The gain was small -- not likely to affect template score -- but should be an advantage in moving to a new domain, particularly if even larger corpora are available.</Paragraph> <Paragraph position="2"> We have not yet completed the complementary task of building the word classes from this distributional information.</Paragraph>
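As a rough, hypothetical illustration of the frequency collection described above (the clause representation and the toy word-class table are invented for this sketch; they are not the NYU system's actual data structures):

    # Hypothetical sketch of collecting subject-verb-object and head-modifier
    # frequencies from parsed clauses, generalizing words to hand-built classes.
    from collections import Counter

    # Toy fragment of a hand-built classification hierarchy: word -> class.
    WORD_CLASS = {"guerrillas": "TERRORIST", "soldiers": "MILITARY",
                  "bridge": "STRUCTURE", "headquarters": "STRUCTURE"}

    def generalize(word):
        # Replace a word by its semantic class where the hierarchy provides one.
        return WORD_CLASS.get(word, word)

    def collect_patterns(parsed_clauses):
        """Count subject-verb-object triples and head-modifier pairs."""
        svo, head_mod = Counter(), Counter()
        for clause in parsed_clauses:
            triple = (generalize(clause["subject"]), clause["verb"],
                      generalize(clause.get("object")))
            svo[triple] += 1
            for head, modifier in clause.get("modifiers", []):
                head_mod[(generalize(head), generalize(modifier))] += 1
        return svo, head_mod

    clauses = [
        {"subject": "guerrillas", "verb": "attacked", "object": "bridge"},
        {"subject": "guerrillas", "verb": "attacked", "object": "headquarters",
         "modifiers": [("headquarters", "military")]},
    ]
    svo, head_mod = collect_patterns(clauses)
    print(svo.most_common(1))   # most frequent generalized subject-verb-object pattern

</Section> </Paper>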