<?xml version="1.0" standalone="yes"?> <Paper uid="M91-1020"> <Title>MARINES INTO THE PORT OF EL CALLAO. SOME 3 YEARS AGO TWO MARINES DIED FOLLOWING A SHINING PATH BOMBING OF A MARKET USED BY SOVIET MARINES. IN ANOTHER INCIDENT 3 YEARS AGO, A SHINING PATH MILITANT WAS KILLED BY SOVIET EMBASSY GUARDS INSIDE THE EMBASSY COMPOUND. THE SOURCES ALSO SAID THAT THE SHINING PATH HAS ATTACKED SOVIET INTERESTS IN PERU IN THE PAST. :END-MSG :END-PROC Figure 6: TST1-MUC3-0099 Input to CAUCUS</Title> <Section position="3" start_page="129" end_page="132" type="metho"> <SectionTitle> SYSTEM ARCHITECTURE </SectionTitle> <Paragraph position="0"> The CODEX architecture is depicted in Figure 1. This architecture separates linguistic and conceptual domain knowledge (in the Profiler and Analyzer) from the particular information requirements of an application (in the Controller) so that these knowledge bases can be used as interchangeable building blocks, reducing the incremental cost of adding new domains, languages, and applications. We estimate that at least eighty percent of our lexicon development time on MUC-3 was spent creating lexical entries that are generally applicable to any NL parsing problem. Only the proper nouns do not have general applicability.</Paragraph> <Paragraph position="1"> The Controller is an expert system that knows about document formats and the data templates to be filled.
It transforms and marks up the input stream into a canonical format for further processing, extracts data from formatted fields that do not require language analysis, sends the text to the Profiler for keyword analysis, extracts relevant information from the concept profile, sends relevant fragments to the Analyzer, extracts relevant information from the text fragment interpretation, and puts the extracted information into the required output format.</Paragraph> <Paragraph position="2"> The Profiler, based on ADS' RUBRIC information retrieval technology, takes the complete text of a document and returns a profile indicating the Profiler's confidence in the presence of each relevant concept, as well as the location of textual keyword evidence for the concept.</Paragraph> <Paragraph position="3"> The Analyzer, which was disabled for official MUC-3 testing, performs detailed analysis of text to the depth required to completely disambiguate and interpret it within the bounds of a specific domain and application. In its current MUC-3 configuration, input to the Analyzer would have been a sentence to be analyzed had CAUCUS processing been turned on. As we have not yet developed any linguistic modules for inter-sentential analysis, we did expect this omission to have an impact on our MUC-3 scores. CAUCUS is designed so that it can also take as input a set of contextual constraints and hypotheses about the content of the input text as determined by the Profiler and Controller. We plan to use this feature to test the idea that CAUCUS can be made to produce a more accurate analysis in a shorter time by using this additional input.</Paragraph> <Paragraph position="4"> Only text fragments that require in-depth analysis for the needs of the application are passed to the Analyzer, as the deeper analysis is necessarily slower than the more shallow analysis. ADS' Analyzer, called CAUCUS, is only partially implemented to date.
Currently it consists of a parser that does both syntactic and semantic analysis. The complete CAUCUS concept also has various asynchronous pragmatic processes, such as discourse focus tracking and plausibility analysis, interacting with the parser through a global chart. The syntactic and semantic knowledge bases for the parser are declarative and modular, making it possible to interchange domain-specific modules and opening up the possibility of automatic and interactive incremental knowledge acquisition. The modules work together under a best-first chart parsing strategy that optimizes speed without sacrificing recovery from unexpected input.</Paragraph> <Paragraph position="5"> CAUCUS is based on the PATR family of graph unification chart parsers. It can be configured to behave like a PATR parser, using graph unification and a top-down, left-to-right parsing strategy to find all possible parses of an input. CAUCUS uses PATR-like specifications for rules, lexical entries, and templates, so that a PATR grammar and lexicon could be implemented in CAUCUS in a straightforward manner. CAUCUS' extensions to PATR thus retain the modular, declarative representation of linguistic knowledge as feature sets, but they provide a great deal of flexibility for exploring ways of improving the parser's scalability and robustness in the face of apparent ambiguities and unexpected input.</Paragraph> <Paragraph position="6"> CAUCUS' architecture is depicted in Figure 2. As in the PATR parsers, CAUCUS has a chart of edges, which contain directed graph representations of the phrase analyses that they represent. Also like the PATR parsers, edges can be extended and/or matched to create longer edges. Unlike PATR parsers, though, instantiation of extend and match tasks is decoupled from their execution, allowing the strategy for task selection to be determined by a separate control function.
This control function manipulates the placement of tasks on the prioritized agenda by giving them numeric priorities reflecting the likelihood that executing the task next will result in reaching the correct parse in the shortest time. Also unlike PATR parsers, chart edges have numeric confidences reflecting the likelihood that the associated edge is part of the correct parse of the input string. These confidences come from (1) the distribution of grammatical constructs in a representative corpus of text, (2) the degree to which the input matches the structure generated by the combination of rules and lexical entries combined by the edge, and (3) the degree to which pragmatic processes confirm the phrase meaning as composed by the edge. As this last source of numeric confidences might imply, another significant difference between CAUCUS and the PATR parsers is the addition of pragmatics tasks to the parsing agenda. Finally, a unique feature of the CAUCUS architecture is the generalization of graph unification to allow for non-Boolean and non-string-matching composition functions. We call this generalization of Unification Grammar &quot;Generalized Composition Grammar&quot; or &quot;GCG.&quot; To date, we have implemented composition functions in CAUCUS for nodes in a class lattice and nodes representing the semantics of conjunctions, as well as the usual string match of straight graph unification. In addition to these, we plan to implement composition functions that perform reference resolution, spatial and temporal reasoning, part-whole reasoning, reasoning about measures, and other such special-purpose semantic and pragmatic functions which do not translate well to traditional feature structure unification. Some of these composition functions will certainly yield non-Boolean results on the question of whether two nodes are composable.
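A graded composition function over a class lattice might look like the following minimal Python sketch. The toy taxonomy, the names PARENT, ancestors, and compose_classes, and the scoring rule for classes that do not subsume one another are all illustrative assumptions, not CAUCUS code.

```python
# Hypothetical toy taxonomy (child -> parent); not taken from the CAUCUS knowledge base.
PARENT = {
    "embassy": "building",
    "building": "physical-object",
    "vehicle": "physical-object",
    "physical-object": "thing",
    "organization": "thing",
}


def ancestors(node):
    """Return the node followed by its ancestors, most specific first."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def compose_classes(a, b):
    """Compose two lattice nodes into (result, degree of match).

    If one class subsumes the other, the more specific class is returned
    with degree 1.0, which mirrors ordinary unification. Otherwise the
    closest shared ancestor is returned with a degree between 0.0 and 1.0;
    this graded fallback is an assumed scoring rule for illustration only.
    """
    chain_a, chain_b = ancestors(a), ancestors(b)
    if a in chain_b:
        return b, 1.0
    if b in chain_a:
        return a, 1.0
    for steps_a, candidate in enumerate(chain_a):
        if candidate in chain_b:
            steps_b = chain_b.index(candidate)
            return candidate, 1.0 / (1 + steps_a + steps_b)
    return None, 0.0
```

A Boolean unifier would simply reject the embassy/vehicle pair; here the pair composes to their shared ancestor with a reduced degree of match that a caller could fold into an edge confidence.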
In addition to the node composition functions, we also plan to implement a probabilistic feature set composition function, where the degree of match between a top-down proposed structure and its constituents composed from the bottom up is inversely proportional to the probability that constraints violated by a given input structure tend to be violated over a representative corpus of text.</Paragraph> <Paragraph position="7"> CAUCUS couples a semantic taxonomy with a declarative grammar and lexicon. The grammar/lexicon specification is similar to Lexical Functional Grammar in that compiled entries are used by the parser to build distinct syntactic constituent and functional structures for an input text segment, with the mapping between the constituent and functional structure specified by the lexical entries. In addition, CAUCUS simultaneously composes a semantic functional structure based on the semantic functional structure and selectional restrictions specified in the lexical entries and semantic taxonomy. This knowledge base structure was designed for maximum portability to new discourse domains, languages, and applications. For portability to new discourse domains, we maintain a core grammar, lexicon, and taxonomy, to which we add domain-specific language constructs and concepts. Most of the lexical entries developed for MUC-3 are generally applicable to any parsing problem and have been added to the core.</Paragraph> <Paragraph position="8"> CAUCUS' probabilistic parsing strategy is similar to that of deterministic parsers in that the parser is directed to find the most probable solution first, but it differs in two critical ways. First, probabilistic constraint relaxation is built into a function that determines the degree of match between the input and a hypothesized interpretation structure. The degree of match is then rolled into the probability of continuing along the same search path.
This enables the parser to exhibit both speed and robustness in the face of unexpected input. Second, instead of a procedurally determined, hard-wired parse strategy, ADS' parsing strategy is determined primarily by usage frequencies stored with the rules and word senses retrieved by the parser as they are needed. Thus, the parse strategy can be tuned dynamically to different situations. Currently, the ability to use usage frequency has been implemented, but non-Boolean matching has not.</Paragraph> <Paragraph position="9"> Probabilistic best-first chart parsing, made possible by our Generalized Composition Grammar and the CAUCUS architecture, is designed to improve the scalability and robustness of natural language parsing. CAUCUS is an active chart parser because phrase constituents are generated once only and made available to be matched with adjacent (active against inactive) edges that might be generated in the future. Typically, a chart parser is used to generate all possible parses in an efficient manner, so the active chart edge matching and proposing mechanisms are tightly coupled in a deterministic top-down, left-to-right control mechanism. This is a sensible approach for testing the generative and string recognition power of a grammar, but for a parser doing real text understanding tasks, the goal is to find the best fit that the non-deterministic knowledge bases can make to the input string in the shortest time possible. In CAUCUS, tasks are placed on a prioritized agenda as they are instantiated by the execution of other tasks.
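The agenda mechanism can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the Agenda class, the task labels, and the numeric priorities are hypothetical, and the standard-library heapq module stands in for whatever queue CAUCUS actually uses.

```python
import heapq
import itertools


class Agenda:
    """Prioritized agenda: the task with the highest priority runs next."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps insertion order

    def add(self, priority, task):
        # heapq is a min-heap, so the priority is negated on the way in.
        heapq.heappush(self._heap, (-priority, next(self._order), task))

    def pop(self):
        neg_priority, _, task = heapq.heappop(self._heap)
        return -neg_priority, task

    def __bool__(self):
        return bool(self._heap)


def run(agenda, execute, max_steps=1000):
    """Best-first loop: run the best task, then enqueue the tasks it spawns.

    `execute` is a caller-supplied function (an assumption of this sketch)
    that performs one task and returns zero or more (priority, task) pairs.
    """
    trace = []
    for _ in range(max_steps):
        if not agenda:
            break
        _, task = agenda.pop()
        trace.append(task)
        for priority, spawned in execute(task):
            agenda.add(priority, spawned)
    return trace
```

Because the loop only sees priorities, a different priority function changes the search order without any change to the loop itself.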
Executing a task may cause the instantiation of any number of tasks, from zero to many, and where they are placed on the queue is a function of an estimated likelihood that executing the task next will cause the parser to discover the best interpretation of the input in the shortest amount of time.</Paragraph> <Paragraph position="10"> In CAUCUS, there are currently two types of tasks: EXTEND an active or inactive edge, and MATCH an active edge with an inactive edge, either to the right or to the left. Various parameters may be set to manipulate which tasks actually get instantiated during the execution of one of these tasks, and the composition functions and the functions that calculate the priority of a task can all be substituted freely without adversely affecting the parser's operation. Currently, matching is decomposed into two stages in order to minimize the number of graph compositions that are actually executed. As the number of specialized composition functions and pragmatic functions is increased, it may become appropriate to break some of these out as additional task types.</Paragraph> <Paragraph position="11"> In summary, ADS is developing a unique approach to natural language analysis, called CODEX, that combines a probabilistic, concept-oriented keyword pattern matcher with a probabilistic, best-first, generalized graph composition, active chart parser. These techniques were designed to address problems limiting the robustness, scalability, and portability of current techniques for data extraction from text. Although we have not completed the research and development on these techniques, we have been implementing them in an incremental fashion as modifications to existing prototypes so that we can document the performance gains of each addition over the older techniques from which these new ones have grown.
Because the large MUC-3 lexicon required an upgrade to the parser that we were unable to implement in time, the system we used for MUC-3 testing had the parser disabled.</Paragraph> </Section> <Section position="4" start_page="132" end_page="134" type="metho"> <SectionTitle> FLOW OF CONTROL </SectionTitle> <Paragraph position="0"> In this section we show how CODEX minus CAUCUS produced templates for the MUC-3 test message TST1-MUC3-0099.</Paragraph> <Paragraph position="1"> When the Controller is invoked on a series of messages, it first breaks up the messages into a series of files, one message per file. Then it breaks up each of these files into a dateline file and a message body file. Each message body file is handed to the Profiler, which uses its knowledge base to locate relevant concepts in the text. The Profiler knowledge base consists of two parts. The first part is a set of RUBRIC rules, and the second part is a set of concepts that the Profiler is to report on. As can be seen in Figure 3, the rules may be invoked in a hierarchical manner. Although some of the lower level rules corresponded to slot fillers, as may be seen in Figure 4, we did not use this information to create template fills because our strategy was to have CAUCUS do this by analyzing sentences found by the Profiler. For the MUC-3 testing, we configured the Profiler to report on the sentences that contained concepts of the various incident types. The strategy was to overgenerate so that CAUCUS could make the final determination of relevance and slot fills. The Profiler saves its output to a file for future use by the Controller. Figure 5 shows the message with those words highlighted that triggered Profiler rules.</Paragraph> <Paragraph position="2"> Once the Profiler is finished with a message, the Controller loops through the incident types and generates a template for each sentence that contains that incident.
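The loop over incident types can be sketched in Python as follows. This is a minimal sketch, assuming the Profiler output is available as a mapping from incident type to the sentences in which that concept was found; the function name, the dictionary layout, and the per_sentence switch are illustrative assumptions rather than the Controller's actual interface.

```python
def generate_templates(message_id, profile, per_sentence=True):
    """Build skeleton templates from Profiler hits.

    `profile` maps an incident type to the list of sentences in which the
    Profiler found that concept (an assumed data layout). Only the
    message-id, template-id, and incident-type slots are filled.
    """
    templates = []
    for incident_type, sentences in profile.items():
        if not sentences:
            continue
        # One template per sentence, or one per incident type found.
        count = len(sentences) if per_sentence else 1
        for _ in range(count):
            templates.append({
                "message-id": message_id,
                "template-id": len(templates) + 1,
                "incident-type": incident_type,
            })
    return templates
```

The per_sentence switch captures the two policies discussed here: one template per incident type found in a message, or one template per sentence containing an incident.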
For TST1, we generated just one template per incident type concept found in a message, but we found that recall would be higher if we produced one template per sentence containing</Paragraph> <Paragraph position="4"> LIMA, 25 OCT 89 (EFE) -- [TEXT] POLICE HAVE REPORTED THAT TERRORISTS TONIGHT BOMBED</Paragraph> </Section> <Section position="5" start_page="134" end_page="134" type="metho"> <SectionTitle> THE EMBASSIES OF THE PRC AND THE SOVIET UNION. THE BOMBS CAUSED DAMAGE BUT NO INJURIES. A CAR-BOMB EXPLODED, IN FRONT OF THE PRC EMBASSY, WHICH IS IN THE LIMA RESIDENTIAL DISTRICT OF SAN ISIDRO. MEANWHILE, TWO BOMBS WERE THROWN AT A USSR EMBASSY VEHICLE THAT WAS PARKED IN FRONT OF THE EMBASSY LOCATED IN ORRANTIA DISTRICT, NEAR SAN ISIDRO. POLICE SAID THE ATTACKS WERE CARRIED OUT ALMOST SIMULTANEOUSLY AND THAT THE BOMBS BROKE WINDOWS AND DESTROYED THE TWO VEHICLES. NO ONE HAS CLAIMED RESPONSIBILITY FOR THE ATTACKS SO FAR. POLICE SOURCES, HOWEVER, HAVE SAID THE ATTACKS COULD HAVE BEEN CARRIED OUT BY THE MAOIST &quot;SHINING PATH&quot; GROUP OR THE GUEVARIST &quot;TUPAC AMARU REVOLUTIONARY MOVEMENT&quot; (MRTA) GROUP. THE SOURCES ALSO SAID THAT THE SHINING PATH HAS ATTACKED SOVIET INTERESTS IN PERU IN THE PAST. </SectionTitle> <Paragraph position="0"/> </Section> <Section position="6" start_page="134" end_page="135" type="metho"> <SectionTitle> GUARDS INSIDE THE EMBASSY COMPOUND. THE TERRORIST WAS CARRYING DYNAMITE. THE ATTACKS TODAY COME AFTER SHINING PATH ATTACKS DURING WHICH LEAST 10 BUSES WERE BURNED THROUGHOUT LIMA ON 24 OCT. </SectionTitle> <Paragraph position="0"> an incident. We also could have configured the Profiler to scope concepts at the paragraph level, but with the newswire articles, this would produce very little difference in the result.
If any of the incident types are found in a message, the Controller also sends the name of the message file and the potentially relevant sentences to CAUCUS through another file, as shown in Figure 6.</Paragraph> <Paragraph position="1"> Generating templates based on Profiler output only was a fail-safe mechanism in case of parser failure. As it turned out, this was our only output. Thus, final output for TST1-MUC3-0099 consists of five templates with only the message-id, template-id, and incident-type slots filled out. A BOMBING template is produced for each of the first three sentences in Figure 6, a MURDER template is produced for the fourth sentence in Figure 6, and an ATTACK template is produced for the last. Given these sentences as input, the parser's analysis would have indicated that only the first sentence contained relevant incidents, but it would have created two BOMBING templates, one for each of the physical targets. In the file sent to CAUCUS, the first sentence of each message is preceded by a line indicating the number and the dateline file of the new message. The parser uses this information to reset some global variables containing frames representing the dateline information, which it may need to determine the time and location of incidents. With the parser up and running as expected, then, the final output for this message would have been two bombing templates with the date, perpetrator-id, physical-target, foreign-nation, and location slots filled in.</Paragraph> <Paragraph position="2"> Since we never generated CAUCUS output for this message, we will not go into CAUCUS processing. Our simple initial strategy was to have CAUCUS parse the sentences found by RUBRIC to have concepts of MUC-3 incidents, determine relevance, and change the template output, as described above.
This strategy would have missed the effects mentioned in the second sentence of the message, the neighborhood-level locations and secondary physical target mentioned in the second paragraph, and the suspected perpetrator organization mentioned in the third paragraph. In the future we will be experimenting with a feedback loop between the Controller and CAUCUS to provide additional sentences to CAUCUS for analysis, depending on the absence of information in the CAUCUS template output. Thus, only the minimum number of sentences would be parsed. In this case, all but two small sentences would be parsed, but in other messages, this strategy should reduce the burden on the parser significantly.</Paragraph> </Section></Paper>