<?xml version="1.0" standalone="yes"?> <Paper uid="M93-1009"> <Title>The Generic Information Extraction System</Title> <Section position="4" start_page="88" end_page="88" type="metho"> <SectionTitle> FILTER </SectionTitle> <Paragraph position="0"> This module uses superficial techniques to filter out the sentences that are likely to be irrelevant, thus turning the text into a shorter text that can be processed more quickly. There are two principal methods used in this module. In any particular application, subsequent modules will be looking for patterns of words that signal relevant events. If a sentence has none of these words, then there is no reason to process it further. This module may scan the sentence looking for these keywords. The set of keywords may be developed manually, or, more rarely if ever, generated automatically from the patterns.</Paragraph> <Paragraph position="1"> Alternatively, a statistical profile may be generated automatically of the words or n-grams that characterize relevant sentences. The current sentence is evaluated by this measure and processed only if it exceeds some threshold.</Paragraph> </Section> <Section position="5" start_page="88" end_page="88" type="metho"> <SectionTitle> PREPARSER </SectionTitle> <Paragraph position="0"> More and more systems recently do not attempt to parse a sentence directly from the string of words to a full parse tree. Certain small-scale structures are very common and can be recognized with high reliability. The Preparsing module recognizes these structures, thereby simplifying the task of the Sentence Parser. Some systems recognize noun groups, that is, noun phrases up through the head noun, at this level, as well as verb groups, or verbs together with their auxiliaries.
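Noun-group recognition of the kind just described can be sketched as finite-state pattern matching over part-of-speech tags. The tag encoding and the tiny example below are illustrative only, not drawn from any MUC system:

```python
import re

# Illustrative sketch: a finite-state noun-group recognizer matching an
# optional determiner, any adjectives, and one or more nouns.
TAG_CODE = {"DT": "D", "JJ": "J", "NN": "N", "NNS": "N", "NNP": "N"}

def find_noun_groups(tagged):
    """Return noun groups from a list of (word, POS-tag) pairs by running
    a regex over a one-letter-per-tag encoding of the sentence, so the
    regex engine serves as the finite-state machine."""
    tape = "".join(TAG_CODE.get(tag, "O") for _, tag in tagged)
    return [" ".join(w for w, _ in tagged[m.start():m.end()])
            for m in re.finditer(r"D?J*N+", tape)]

sent = [("the", "DT"), ("Japanese", "JJ"), ("automaker", "NN"),
        ("announced", "VBD"), ("a", "DT"), ("joint", "JJ"),
        ("venture", "NN"), ("with", "IN"), ("Toyota", "NNP")]
print(find_noun_groups(sent))
# → ['the Japanese automaker', 'a joint venture', 'Toyota']
```

A verb-group recognizer would work the same way, with a pattern over auxiliary and verb tags instead.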
Appositives can be attached to their head nouns with high reliability, as can genitives, &quot;of&quot; prepositional phrases, and perhaps some other prepositional phrases.</Paragraph> <Paragraph position="1"> &quot;That&quot; complements are often recognized here, and NP conjunction is sometimes done as a special process at this level.</Paragraph> <Paragraph position="2"> Sometimes the information found at this level is merely encapsulated and sometimes it is discarded. Age appositives, for example, can be thrown out in many applications.</Paragraph> <Paragraph position="3"> This module generally recognizes the small-scale structures or phrases by finite-state pattern-matching, sometimes conceptualized as ad hoc heuristics. They are acquired manually.</Paragraph> </Section> <Section position="6" start_page="88" end_page="89" type="metho"> <SectionTitle> PARSER </SectionTitle> <Paragraph position="0"> This module takes a sequence of lexical items and perhaps phrases and normally tries to produce a parse tree for the entire sentence. Systems that do full-sentence parsing usually represent their rules either as a phrase structure grammar augmented with constraints on the application of the rules (Augmented Transition Networks, or ATNs), or as unification grammars in which the constraints are represented declaratively. The most frequent parsing algorithm is chart parsing. Sentences are parsed bottom-up, with top-down constraints being applied. As fragmentary parsing becomes more prevalent, the top-down constraints cannot be used as much. Similar structures that span the same string of words are merged in order to bring the processing down from exponential time to polynomial time.</Paragraph> <Paragraph position="1"> Recently more and more systems are abandoning full-sentence parsing in information extraction applications.
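The merging of same-span structures that keeps bottom-up chart parsing polynomial can be sketched with a toy CKY-style recognizer; the grammar, lexicon, and category names here are invented for illustration:

```python
from collections import defaultdict

# Toy binary (CNF) grammar: a pair of child categories maps to the
# parent categories it can form. All rules and words are illustrative.
RULES = {("NP", "VP"): {"S"}, ("DT", "NN"): {"NP"}, ("VBD", "NP"): {"VP"}}
LEXICON = {"the": {"DT"}, "a": {"DT"}, "automaker": {"NN"},
           "venture": {"NN"}, "announced": {"VBD"}}

def cky(words):
    """Bottom-up chart recognizer. chart[(i, j)] holds the *set* of
    categories spanning words[i:j]; keeping a set merges duplicate
    analyses of the same span, which is what brings the processing
    down from exponential to polynomial time."""
    n = len(words)
    chart = defaultdict(set)
    for i, w in enumerate(words):
        chart[(i, i + 1)] |= LEXICON.get(w, set())
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for b in chart[(i, k)]:
                    for c in chart[(k, j)]:
                        chart[(i, j)] |= RULES.get((b, c), set())
    return chart

chart = cky("the automaker announced a venture".split())
print(chart[(0, 5)])  # → {'S'}
```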
Some of these systems recognize only fragments because, although they are using the standard methods for full-sentence parsing, their grammar has very limited coverage. In other systems the parser applies domain-dependent, finite-state pattern-matching techniques rather than more complex processing, trying only to locate within the sentence various patterns that are of interest in the application.</Paragraph> <Paragraph position="2"> Grammars for the parsing module are either developed manually over a long period of time or borrowed from another site. There has been some work on the statistical inference of grammar rules in some areas of the grammar.</Paragraph> </Section> <Section position="7" start_page="89" end_page="89" type="metho"> <SectionTitle> FRAGMENT COMBINATION </SectionTitle> <Paragraph position="0"> For complex, real-world sentences of the sort that are found in newspapers, no parser in existence can find full parses for more than 75% or so of the sentences. Therefore, these systems need ways of combining the parse tree fragments that they obtain. This module may be applied to the parse tree fragments themselves.</Paragraph> <Paragraph position="1"> Alternatively, each fragment is translated into a logical form fragment, and this module tries to combine the logical form fragments. One method of combination is simply to take the logical form of the whole sentence to be the conjunction of the logical form fragments. A more informative technique is to attempt to fit some of the fragments into unfilled roles in other fragments.</Paragraph> <Paragraph position="2"> The methods that have been employed so far for this operation are ad hoc.
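The role-filling method of combination can be sketched as follows; the frames, sorts, and role names are invented for illustration and are not any MUC system's actual representation:

```python
# Logical-form fragments as frames; None marks an unfilled role.
# The sort constraints on each role are invented for this example.
ROLE_SORTS = {("attack", "agent"): "group", ("attack", "victim"): "person"}

def combine(fragments):
    """Fit entity fragments into unfilled, sort-compatible roles of
    event fragments; anything that cannot be fitted is kept as a bare
    conjunct, the less informative fallback described above."""
    events = [f for f in fragments if any(v is None for v in f.values())]
    leftovers = []
    for frag in fragments:
        if any(frag is ev for ev in events):
            continue
        for ev in events:
            open_roles = [r for r, v in ev.items() if v is None]
            fit = [r for r in open_roles
                   if ROLE_SORTS.get((ev["pred"], r)) == frag.get("sort")]
            if fit:
                ev[fit[0]] = frag["pred"]
                break
        else:
            leftovers.append(frag)  # conjunction fallback
    return events + leftovers

attack = {"pred": "attack", "agent": None, "victim": None}
guerrillas = {"pred": "guerrillas", "sort": "group"}
mayor = {"pred": "mayor", "sort": "person"}
print(combine([attack, guerrillas, mayor]))
# → [{'pred': 'attack', 'agent': 'guerrillas', 'victim': 'mayor'}]
```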
There is no real theory of it.</Paragraph> <Paragraph position="3"> The methods are developed manually.</Paragraph> </Section> <Section position="8" start_page="89" end_page="89" type="metho"> <SectionTitle> SEMANTIC INTERPRETATION </SectionTitle> <Paragraph position="0"> This module translates the parse tree or parse tree fragments into a semantic structure, logical form, or event frame. All of these are basically explicit representations of predicate-argument and modification relations that are implicit in the sentence. Often lexical disambiguation takes place at this level as well. Some systems have two levels of logical form, one a general, task-independent logical form intended to encode all the information that is in the sentence, and the other a more specifically task-dependent representation that often omits any information that is not relevant to the application. A process of logical-form simplification translates from one to the other.</Paragraph> <Paragraph position="1"> The method for semantic interpretation is function application or an equivalent process that matches predicates with their arguments. The rules are acquired manually.</Paragraph> <Paragraph position="2"> There are a number of variations in how the processing is spread across Modules 4-7. It may be as I have outlined here. The system may group words into phrases, and then phrases into parsed sentences, and then translate the parsed sentences into a logical form. The more traditional approach is to skip the first of these steps and go directly from the words to the parsed sentences and then to the logical forms. Recently, many systems do not attempt full-sentence parsing. They group words into phrases and translate the phrases into logical forms, and from then on it is all discourse processing.</Paragraph>
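<Paragraph position="2a"> Semantic interpretation by function application, as described above, can be sketched as follows; the two-word lexicon and the resulting predicate tuples are invented for illustration:

```python
# Each lexical entry pairs a syntactic category with a lambda term;
# interpreting an internal node applies the callable child to the other.
LEX = {
    "IBM":     ("NP", "IBM"),
    "Toyota":  ("NP", "Toyota"),
    "courted": ("TV", lambda obj: lambda subj: ("court", subj, obj)),
}

def interpret(tree):
    """Leaves are words; internal nodes are (left, right) pairs. The
    function child (whichever side is callable) is applied to the
    argument child, matching predicates with their arguments."""
    if isinstance(tree, str):
        return LEX[tree][1]
    f, a = (interpret(t) for t in tree)
    return f(a) if callable(f) else a(f)

# ("IBM", ("courted", "Toyota")) ~ the parse [S [NP IBM] [VP courted Toyota]]
print(interpret(("IBM", ("courted", "Toyota"))))
# → ('court', 'IBM', 'Toyota')
```
</Paragraph>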
In a categorial grammar framework, one goes directly from words to logical forms.</Paragraph> </Section> <Section position="9" start_page="89" end_page="90" type="metho"> <SectionTitle> LEXICAL DISAMBIGUATION </SectionTitle> <Paragraph position="0"> This &quot;module&quot;, if it is such, translates a semantic structure with general or ambiguous predicates into a semantic structure with specific, unambiguous predicates. In fact, lexical disambiguation often occurs at other levels, and sometimes entirely so. For example, the ambiguity of &quot;types&quot; in &quot;He types.&quot; and &quot;The types . . .&quot; may be resolved during syntactic processing or during part-of-speech tagging. The ambiguity of &quot;. . . rob a bank . . .&quot; or &quot;. . . form a joint venture with a bank . . .&quot; may be resolved when a domain-dependent pattern is found. The fact that such a pattern occurs resolves the ambiguity.</Paragraph> <Paragraph position="1"> More generally, lexical disambiguation usually happens by constraining the interpretation by the context in which the ambiguous word occurs, perhaps together with the a priori probabilities of each of the word senses.</Paragraph> <Paragraph position="2"> These rules are in many cases developed manually, although this is the area where statistical methods have perhaps contributed the most to computational linguistics, especially in part-of-speech tagging.</Paragraph> </Section> <Section position="10" start_page="90" end_page="90" type="metho"> <SectionTitle> COREFERENCE RESOLUTION </SectionTitle> <Paragraph position="0"> This module turns a tree-like semantic structure, in which there may be separate nodes for a single entity, into a network-like structure in which these nodes are merged. This module resolves coreference for basic entities such as pronouns, definite noun phrases, and &quot;one&quot; anaphora. It also resolves the reference for more complex entities like events.
That is, an event that is partially described in the text may be identified with an event that was found previously; or it may be a consequence of a previously found event, as a death is of an attack; or it may fill a role in a previous event, as an activity in a joint venture.</Paragraph> <Paragraph position="1"> Three principal criteria are used in determining whether two entities can be merged. First, semantic consistency, usually as specified by a sort hierarchy. Thus, &quot;the Japanese automaker&quot; can be merged with &quot;Toyota Motor Corp.&quot; For pronouns, semantic consistency consists of agreement on number and gender, and perhaps on whatever properties can be determined from the pronoun's context; for example, in &quot;its sales&quot;, &quot;it&quot; probably refers to a company.</Paragraph> <Paragraph position="2"> Second, and more generally, there are various measures of compatibility between entities; for example, the merging of two events may be conditioned on the extent of overlap between their sets of known arguments, as well as on the compatibility of their types.</Paragraph> <Paragraph position="3"> The third criterion is nearness, as determined by some metric. For example, we may want to merge two events only if they occur within n sentences of each other (unless they are in The Financial Times). The metric of nearness may be something other than simply the number of words or sentences between the items in the text. For example, in resolving pronouns, we should favor the Subject over the Object in the previous sentence; this is simply measuring nearness along a different path.</Paragraph> <Paragraph position="4"> These rules have to be developed manually (and by &quot;manually&quot; I mean &quot;cerebrally&quot;).
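A minimal sketch of the semantic-consistency and nearness criteria follows; the sort hierarchy, mention sorts, and distance threshold are invented for illustration, and the compatibility-of-arguments measure is omitted:

```python
# Toy sort hierarchy mapping each sort to its parent; invented here
# purely to illustrate the consistency check described above.
ISA = {"automaker": "company", "company": "organization",
       "organization": "entity", "person": "entity"}

def subsumes(general, specific):
    """True if `specific` equals `general` or lies below it in ISA."""
    while specific is not None:
        if specific == general:
            return True
        specific = ISA.get(specific)
    return False

def can_merge(a, b, distance, max_distance=3):
    """Merge two mentions only if one sort subsumes the other (semantic
    consistency) and they occur near enough (the nearness criterion)."""
    consistent = subsumes(a["sort"], b["sort"]) or subsumes(b["sort"], a["sort"])
    return consistent and distance <= max_distance

toyota = {"text": "Toyota Motor Corp.", "sort": "automaker"}
anaphor = {"text": "the Japanese automaker", "sort": "company"}
print(can_merge(anaphor, toyota, distance=2))  # → True
```

A weighted nearness metric (e.g., favoring the Subject of the previous sentence) would replace the raw sentence count passed as `distance`.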
The sort hierarchy used in consistency checking is usually developed manually, although it would be interesting to know if researchers have begun to use WordNet or other thesauri for sort hierarchy development, or have attempted to use statistical means to infer a sort hierarchy.</Paragraph> <Paragraph position="5"> The term &quot;discourse processing&quot; as used by MUC sites almost always means simply coreference resolution of application-relevant entities and events. There have been no serious attempts to recognize or use the structure of the text, beyond simple segmenting on the basis of superficial discourse particles for use in nearness metrics in coreference resolution.</Paragraph> </Section> <Section position="11" start_page="90" end_page="90" type="metho"> <SectionTitle> TEMPLATE GENERATION </SectionTitle> <Paragraph position="0"> This module takes the semantic structures generated by the natural language processing modules and produces the templates in the official form required by the rules of the evaluation. Events that do not pass the threshold of interest defined in the rules are tossed out. Labels are printed, commas are removed from company names, percentages are rounded off, product-service codes are pulled out of a hat, and so on. And on and on.</Paragraph> <Paragraph position="1"> There are no automatic methods for developing the rules in this module. The only method available is long, hard work.</Paragraph> </Section> <Section position="12" start_page="90" end_page="91" type="metho"> <SectionTitle> A FINAL WORD </SectionTitle> <Paragraph position="0"> In this overview of the generic information extraction system, I have described what seemed to be the principal methods used in the MUC-4 systems. The reader may find that the MUC-5 systems exhibit interesting innovations over and above what I have described.</Paragraph> </Section> </Paper>