File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/91/m91-1023_metho.xml
Size: 12,114 bytes
Last Modified: 2025-10-06 14:12:43
<?xml version="1.0" standalone="yes"?> <Paper uid="M91-1023"> <Title>Text of message TOKENIZER SYNTACTIC SEMANTIC ANALYZER ANALYZER Symbols Parse Trees</Title> <Section position="2" start_page="0" end_page="152" type="metho"> <SectionTitle> SYSTEM ARCHITECTURE </SectionTitle> <Paragraph position="0"> The TIA used for MUC-3 was developed from the AD-TIA (Alternate Domain TIA). This system, shown in Figure 1, sequentially performs tokenization, syntactic analysis, semantic analysis, and output translation. Each component is discussed below.</Paragraph> <Paragraph position="1"> Tokenization. The tokenizer finds strings of text delimited by spaces, carriage returns, and punctuation marks. It also attempts to classify each string as a known word or as a member of a special token class, such as a number or a latitude (e.g. 1234N). The output from the tokenizer is a list containing Lisp symbols, which represent known words, and dotted pairs, each of which holds a special token class and the text that represents it, e.g. (:number . 12).</Paragraph> <Paragraph position="2"> If a token is not recognized as either a known word or a member of a special token class, spelling correction is attempted. The spelling corrector is a simple one that looks for transposed, elided, or extra letters. If spelling correction fails, the token is classified as belonging to the special token class :unknown. For example, in the MUC-3 corpus, the string &quot;Orrantia&quot; is tokenized as (:unknown . ORRANTIA).</Paragraph> <Paragraph position="3"> Syntactic Analysis. TIA uses a syntactic analysis stage to preprocess tokenized text before attempting semantic analysis. Syntactic analysis finds phrases that may be treated as though they were single words, such as noun phrases (NPs), and defines synonyms. 
As used by the TIA, the syntactic analyzer does not operate at the sentence level of free text, only at the phrase level. If more than one phrase is possible at a given point in the text, ambiguities are resolved according to the following scheme: 1.) Left to right: the first phrase of overlapping phrases encountered is chosen.</Paragraph> <Paragraph position="4"> 2.) Length: if two phrases start at the same point, the longer of the two is chosen.</Paragraph> <Paragraph position="5"> 3.) Syntactic Priority: A syntactic priority may be assigned to a phrase when it is declared. If two phrases of the same length start at the same point in the text, the one with the higher declared priority is chosen.</Paragraph> <Paragraph position="6"> The syntactic analyzer retains a parse tree for every phrase that it finds. These parse trees allow the semantic analyzer (see below) not only to know that, for example, a date was found, but also to extract the month from that date.</Paragraph> <Paragraph position="7"> In the MUC-3 domain, the parse trees tended to be rather shallow and to consume only a token or two each. For example, the sentences &quot;Police have reported that terrorists tonight bombed the embassies of the PRC and the [...] The syntactic analyzer has the ability to add production rules automatically at run time. For example, the following (simplified) production rules are defined in the system:</Paragraph> <Paragraph position="9"> These rewrite rules allow &quot;Orrantia district, near San Isidro&quot; to be recognized as &lt;location&gt; and cause the following rule to be added to the grammar.</Paragraph> <Paragraph position="10"> &lt;region&gt; orrantia In this example, the question mark indicates an optional item, and the @ sign indicates a place where a new rule might be added. As an example of how phrases are defined in the TIA, &lt;location&gt; above is defined as follows: [...] University of Massachusetts by Wendy Lehnert[1]) is the major component of the system. 
The input to the semantic analyzer is a list of syntactic units identified by the syntactic analyzer. The output of the semantic analyzer is a list of frame-like structured concepts, each consisting of a &quot;head&quot; and an unspecified number of (slot, value) pairs. The semantic analyzer predicts zero or more concepts for each parse tree found by the syntactic analyzer. Some of the slots in each concept are initialized from information found in the parse tree. For example, &lt;location&gt; predicts location-p, which has slots named input-text-s and type-s.</Paragraph> <Paragraph position="11"> In the above example, the first location-p predicted has its input-text-s slot filled with &quot;THE PRC&quot; (directly from the parse tree) and its type-s slot filled with 'COUNTRY (by table lookup). Slot initialization information is stored with syntax information, as shown in the sample definition of &lt;location&gt; above.</Paragraph> <Paragraph position="12"> Other slots are filled by &quot;expecting&quot; information in other frames. For example, &lt;bombing&gt; predicts bombing-p, with slots that include agent-s and physical-target-s. Agent-s is expected to be filled by an actor-p found previously in the same sentence, and physical-target-s is expected to be filled by a theme-p found later in the same sentence. Passive voice is recognized by the syntactic analyzer, and would have predicted passive-bombing-p, with different expectations about the structure of the text. Knowledge of expectations is stored in prediction prototypes. There is a prediction prototype for each possible type of structured concept.</Paragraph> <Paragraph position="13"> Concepts are instantiated when &quot;enough&quot; slots are filled. What counts as &quot;enough&quot; is specific to the individual concept type. Only instantiated concepts can fill slots. 
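The prediction and instantiation behavior described so far can be sketched in Python. This is only an illustration under stated assumptions, not the TIA's Lisp code: the names Prototype, Prediction, offer, and enough are hypothetical, &quot;enough&quot; is modeled as a simple count of filled slots, and the requirement that a filler be found earlier or later in the sentence is left out.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Prototype:
    """Hypothetical prediction prototype for one concept type."""
    head: str
    expects: dict  # slot name -> head of the concept expected to fill it
    enough: int    # assumed rule: number of filled slots needed to instantiate


@dataclass(eq=False)
class Prediction:
    """A predicted concept; it becomes usable once enough slots are filled."""
    proto: Prototype
    slots: dict = field(default_factory=dict)

    @property
    def instantiated(self) -> bool:
        return len(self.slots) >= self.proto.enough

    def offer(self, slot: str, filler: "Prediction") -> bool:
        """Try to fill `slot` with `filler`; return True on success."""
        if not filler.instantiated:
            return False  # only instantiated concepts can fill slots
        if self.proto.expects.get(slot) != filler.proto.head:
            return False  # the filler is not of the expected concept type
        self.slots[slot] = filler
        return True


# Slots such as input-text-s are initialized directly from the parse tree.
actor = Prediction(Prototype("actor-p", {}, 1), {"input-text-s": "TERRORISTS"})
theme = Prediction(Prototype("theme-p", {}, 1))  # nothing filled yet
bombing = Prediction(
    Prototype("bombing-p",
              {"agent-s": "actor-p", "physical-target-s": "theme-p"}, 2))

took_actor = bombing.offer("agent-s", actor)                 # accepted
took_empty_theme = bombing.offer("physical-target-s", theme)  # rejected
theme.slots["input-text-s"] = "THE EMBASSIES"                # theme instantiates
took_theme = bombing.offer("physical-target-s", theme)       # accepted now
```

In this sketch the bombing-p prediction stays uninstantiated until both expected slots are filled, which mirrors the rule that a concept must itself be instantiated before it can fill a slot elsewhere.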
Disambiguation is handled by making a prediction for each sense of a phrase and instantiating only the prediction for the correct sense.</Paragraph> <Paragraph position="14"> The instantiation of a concept can cause actions to occur by allowing calls to Lisp. Reference resolution is one example of the type of action that might occur after instantiating a concept.</Paragraph> <Paragraph position="15"> The concepts are hierarchical. For example, terrorist-p, which is-a actor-p, can fill the agent-s slot in the bombing-p.</Paragraph> <Paragraph position="16"> A prediction prototype maintains a list of slots that need to be filled for that type of structured concept, along with information on how to fill them. An example is shown below. This particular concept is predicted by the syntactic constituent &lt;embassy&gt;.</Paragraph> <Paragraph position="17"> The structured concepts in *CONCEPT-MEM* correspond roughly to nouns, those in *EVENT-MEM* to verbs. *SYNTAX-MEM* is a catch-all for miscellaneous structured concepts. *STORY-LINE* is a subset of *EVENT-MEM* in which concepts that refer to the same event are resolved into the same concept. *DRAMATIS PERSONAE* is a similar subset of *CONCEPT-MEM*. The output of the semantic analyzer is the list of concepts found in *STORY-LINE*.</Paragraph> <Paragraph position="18"> Output translation. The output translator transforms the internal representations of the concepts extracted from the message into the proper database update template form. This component of the system is tailored to meet the requirements of the particular domain and database under consideration. In the MUC domain, the output translator applies defaults, standardizes fillers for set list slots, and performs cross referencing. This module is also responsible for suppressing templates that should not be generated. 
For example, it should determine that a military-versus-military attack should not generate a template.</Paragraph> <Paragraph position="19"> The output translator is only partially implemented. It does not incorporate many of the heuristics about generating and combining templates, and incorrectly applies those heuristics that it does incorporate. Additionally, the output translator is dependent on the order in which its input is received. It may generate an entirely different set of templates if the list of concepts produced by the semantic analyzer is reversed. This was not intended to be a feature.</Paragraph> </Section> <Section position="3" start_page="152" end_page="152" type="metho"> <SectionTitle> REFERENCE RESOLUTION </SectionTitle> <Paragraph position="0"> Reference resolution takes place in two modules: the semantic analyzer and the output translator. In the semantic analyzer, newly instantiated concepts (of specific types) are compared with other similar concepts. If no contradictions are found between a pair of concepts, the two are merged into a single concept. &quot;No contradictions&quot; is defined to mean that 1) both concepts are the same type, or the type of one is a direct ancestor of the other in the concept hierarchy, and 2) every slot of each matches that of the other, or is empty. Certain slots, such as input-text-s, are excluded from the matching requirement, since two concepts may refer to the same entity but be represented differently in the text. To &quot;match&quot; usually means that the slot fillers are EQUAL, but not always. For example, the persons-name-s slot may allow partial matches to succeed; &quot;Smith&quot; matches &quot;John Smith.&quot; The output translator is responsible for combining templates. For example, two attacks at the same time and place should be output as one template. This module is also responsible for deleting multiple templates for a single event. 
For example, a bombing at a given time and place is an attack, and an attack that results in a person being killed is a murder. In each case only one template should be produced. Unfortunately, this module is only partially implemented, so many spurious templates are produced.</Paragraph> </Section> <Section position="4" start_page="152" end_page="153" type="metho"> <SectionTitle> CONJUNCTION PROCESSING </SectionTitle> <Paragraph position="0"> Conjunction processing occurs in the semantic analyzer module. The conjunction in the phrase &quot;embassies of the PRC and the Soviet Union&quot; is handled as follows: &lt;embassy&gt; predicts government-building-p, which, when instantiated, fills the physical-target-s slot of bombing-p. The concept government-building-p has a slot country-s, which can be filled by a location-p concept that has a type-s of 'country and an obj-of-prep-s of 'of. The syntactic unit &lt;preposition&gt; predicts preposition-p, which, upon being instantiated, inserts itself into the obj-of-prep-s slot of the subsequent concept. The first &lt;location&gt;, i.e. &quot;The PRC&quot;, predicts a location-p, which meets the constraints needed to fill the country-s slot of the government-building-p. When that slot is filled, the location-p creates a record to indicate that it has filled a slot in the government-building-p. Next, &lt;conjunction&gt; predicts conjunction-p, which, when instantiated, makes a note to try to join the previous concept with the next concept at a later time (the end of the sentence). The second &lt;location&gt; then predicts a new location-p, and the &lt;period&gt; predicts a number of concepts whose only function is to cause other concepts to be instantiated in the proper order. One of these tells the conjunction-p that it is time to try to join the previously noted concepts.</Paragraph> <Paragraph position="1"> The conjunction joining mechanism verifies that the two locations are the same type of thing. 
Upon verification, the locations are conjoined by copying the concepts in which the first location filled one or more slots, and replacing those fillers with the second location. In this instance, the government-building-p is copied with &quot;the Soviet Union&quot; filling the country-s slot. Since the government-building-p concept filled a slot in the bombing-p concept, that bombing-p concept is copied with the new government-building-p filling the physical-target-s slot of the copy.</Paragraph> </Section> </Paper>
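The copy-and-substitute step described under CONJUNCTION PROCESSING can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the TIA's Lisp implementation: the Concept class, its fills record list, and the functions fill, copy_up, and conjoin are hypothetical names, and the check that both locations are the same type of thing is reduced to comparing their type-s fillers.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Concept:
    """Frame-like concept: a head plus named slots (hypothetical layout)."""
    head: str
    slots: dict = field(default_factory=dict)
    # Records of every (owner, slot name) pair this concept has filled.
    fills: list = field(default_factory=list)


def fill(owner: Concept, slot: str, filler: Concept) -> None:
    """Fill a slot and let the filler record where it was used."""
    owner.slots[slot] = filler
    filler.fills.append((owner, slot))


def copy_up(original: Concept, replacement: Concept) -> list:
    """Copy each concept that `original` fills, substituting `replacement`,
    then repeat for whatever those concepts fill in turn."""
    copies = []
    for owner, slot in list(original.fills):
        clone = Concept(owner.head, dict(owner.slots))
        fill(clone, slot, replacement)
        copies.append(clone)
        copies.extend(copy_up(owner, clone))
    return copies


def conjoin(first: Concept, second: Concept) -> list:
    """Join two conjuncts if they are the same type of thing (assumed test)."""
    if first.slots.get("type-s") != second.slots.get("type-s"):
        return []
    return copy_up(first, second)


# "embassies of the PRC and the Soviet Union"
prc = Concept("location-p", {"input-text-s": "THE PRC", "type-s": "COUNTRY"})
ussr = Concept("location-p",
               {"input-text-s": "THE SOVIET UNION", "type-s": "COUNTRY"})
building = Concept("government-building-p")
bombing = Concept("bombing-p")
fill(building, "country-s", prc)
fill(bombing, "physical-target-s", building)

copies = conjoin(prc, ussr)
print([c.head for c in copies])  # ['government-building-p', 'bombing-p']
```

The original concepts are left untouched, so after the join there are two government-building-p concepts and two bombing-p concepts, one pair for each country. The records kept by each filler are what let the copy propagate upward without searching every concept in memory.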