<?xml version="1.0" standalone="yes"?> <Paper uid="M93-1025"> <Title>USC: DESCRIPTION OF THE SNAP SYSTEM USED FOR MUC-5</Title> <Section position="3" start_page="0" end_page="305" type="metho"> <SectionTitle> SYSTEM ARCHITECTURE </SectionTitle> <Paragraph position="0"> The SNAP system architecture is shown in Figure 1. It consists of a preprocessor, phrasal parser, memory-based parser, knowledge base, and template generator. The preprocessor provides lexical look-up and semantic tagging. The phrasal parser performs grouping of words, and the memory-based parser generates semantic meaning representation by using stored patterns in the knowledge base. The knowledge base consists of a domain concept hierarchy including location hierarchy, a set of phrasal patterns for TIE-UP, and semantically relaxed versions of some phrasal patterns. The phrasal patterns are automatically extracted from training texts by the PALKA lexical acquisition system. The template generator has a set of rules to fill slots from the phrasal parser results and the memory-based parser results. To detect a TIE-UP relationship, the template generator first checks the memory-based parser output for the regular patterns. If there is no memory-based parser output, it tries to apply rules to the phrasal parser output, and if no rules are applicable, it checks the memory-based parser output for the relaxed patterns. Figure 2 shows the flow of control including the back-up routes.</Paragraph> </Section> <Section position="4" start_page="305" end_page="305" type="metho"> <SectionTitle> PREPROCESSOR </SectionTitle> <Paragraph position="0"> Description of Operation Our preprocessor is not significantly different from the &quot;generic preprocessor&quot;. It locates sentence boundaries and inserts sentence boundary markers for the phrasal parser. In addition, sentences are joined together on one line so that pattern matching can take place for concatenating noun phrases together.
Certain noun phrases are grouped together to reduce the workload on the phrasal parser. These noun phrases come from four different sources: our permuted list of company names; locations from the gazetteer; a list of special words related to facilities, titles, and positions, specifically chosen for filling template slots and/or recognizing the possibility of filling slots; and noun phrases from WordNet. (Company names are a special case that will be discussed below.) As in our system for MUC-4 last year, our preprocessor continues to group semi-auxiliary verbs like USED_TO and prepositional phrases like SUCH_AS and AS_TO together. Contractions such as CAN'T and DON'T are expanded into CAN NOT and DO NOT. Numbers in the text are normalized by removing commas and/or dollar signs. Abbreviations immediately following numbers without intervening spaces are separated from the number and expanded; for example, 2.3BN is converted into 2.3 BILLION. Three such abbreviations following numbers deserve a small note. When s follows a number, it is always a plural, as in 1980s or IBM MODEL 55s; this s is simply removed, as no important information required in our system is lost. The abbreviation m is ambiguous, and could mean either million or meters; because it almost always means million in this domain, it is always changed to million. Finally, the obvious abbreviation 4WD is converted to</Paragraph> </Section> <Section position="5" start_page="305" end_page="305" type="metho"> <SectionTitle> FOUR WHEEL DRIVE. Punctuation markers are separated from the words they are adjacent to, and then each </SectionTitle> <Paragraph position="0"> word is looked up in our dictionary. For each word, every possible part of speech is passed on to the phrasal parser segment of our system. In the dictionary look-up function, numbers are a special case.
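The token-level normalization described above can be sketched in a few lines. This is only an illustration and not the SNAP preprocessor itself: the function name normalize_token, the two lookup tables, and the regular expression are assumptions, and only the cases named in the text (contractions, commas and dollar signs, BN, the plural s, m, and 4WD) are covered.

```python
import re

# Whole-token rewrites taken from the examples in the text.
WHOLE_TOKEN = {
    "CAN'T": "CAN NOT",
    "DON'T": "DO NOT",
    "4WD": "FOUR WHEEL DRIVE",
}

# Hypothetical suffix table; BN and M are the only expansions the text names.
SUFFIX_EXPANSION = {"BN": "BILLION", "M": "MILLION"}

# Optional dollar sign, digits with optional commas and decimals, optional letters.
NUMBER_WITH_SUFFIX = re.compile(r"^\$?(\d[\d,]*(?:\.\d+)?)([A-Za-z]*)$")


def normalize_token(token):
    """Return the normalized form of one whitespace-delimited token."""
    upper = token.upper()
    if upper in WHOLE_TOKEN:
        return WHOLE_TOKEN[upper]
    match = NUMBER_WITH_SUFFIX.match(token)
    if match is None:
        return token                       # not a number: leave untouched
    number = match.group(1).replace(",", "")
    suffix = match.group(2).upper()
    if suffix == "" or suffix == "S":      # bare number, or plural as in 1980s
        return number
    if suffix in SUFFIX_EXPANSION:
        return number + " " + SUFFIX_EXPANSION[suffix]
    return token                           # unknown suffix: leave untouched
```

Under these assumptions, normalize_token("2.3BN") gives "2.3 BILLION" and normalize_token("1980s") gives "1980". Always reading m as MILLION mirrors the domain-specific choice described above and would be wrong in a domain where meters are common.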
In order that numbers which are not spelled out do not have to be added to the dictionary, the dictionary look-up program tests words to find out if they are cardinal, ordinal, or real numbers, and then reports their part of speech as such. No spelling correction is done in the preprocessor; misspelled words are reported as unknown to the phrasal parser, and they will usually be interpreted as a corporation name if some other word in the noun</Paragraph> </Section> <Section position="6" start_page="305" end_page="307" type="metho"> <SectionTitle> * 'SUN </SectionTitle> <Paragraph position="0"> phrase such as CORP. suggests this, or a human name otherwise. Morphological processing is performed, but not at run time; a set of programs is applied to the input files used in construction of the dictionary to generate all possible inflected forms of words.</Paragraph> <Paragraph position="1"> As mentioned above, corporation names are recognized and grouped together in the preprocessor. In order to match as many company names as possible, a program was written to generate permutations of the company names in our list of such names. Abbreviations such as INC. are permuted into both INC and INCORPORATED, and abbreviations such as &quot;S.A.&quot; are permuted into their various forms such as &quot;S.A&quot; and &quot;S. A.&quot;. This increases the size of our list of corporations several-fold. In addition, each of these abbreviations is added to our preprocessor dictionary, and tagged as a corporation name in case only the abbreviation is recognized in some input text.</Paragraph> <Paragraph position="2"> In addition to the part of speech, root form, and other related information such as tense or number which was in our preprocessor dictionary for passing along to the phrasal parser, many of the words in our preprocessor dictionary were tagged according to the requirements of our rule-based inferencing and template filling module.
These tags included MUC5-COMPANY, approximately 12 different types of facilities (MUC5-FACILITY1 for broadcasting stations and studios, MUC5-FACILITY2 for plants, mills, and development facilities, etc.), ten or so different position tags (MUC5-POSITION1 for CEO, chief executive officer, etc., MUC5-POSITION2 for chairman, chairmen, chairman of the board, etc.), and several different human name and title tags (MUC5-NAME-TITLE for Mr., Mrs., etc., MUC5-HUMAN-NAME-FAMILY, MUC5-HUMAN-NAME-GIVEN, MUC5-HUMAN-NAME-MALE, and MUC5-HUMAN-NAME-FEMALE). These tags alerted later processing stages that the words so tagged might be important for template filling. Tags were not the only method used for this. The knowledge-base that we built from WordNet, for example, provided a very easy way to recognize words that were part of a date, or words which indicated time. We also built a knowledge-base of location names and possible types (cities, countries, etc.) from the available gazetteer file, and included the tag MUC5-LOCATION on all words in our preprocessor dictionary which might have been location names in the text. For several pronouns and prepositions such as I and TO, which are also city names in Indonesia, our phrasal parser always ignored this tag. Even though the gazetteer was very large, there are place names missing such as KAOHSIUNG in the walk-through message.</Paragraph> <Paragraph position="3"> This year, WordNet was used to substantially increase the size of our dictionary. However, through an unfortunate oversight, the morphological programs were not run on the verbs from WordNet in order to generate all of their inflected forms. This caused the word ENTRUSTING to be missing from our dictionary, and not recognized, when it should have been there. Even with WordNet, though, our dictionary doesn't contain all inflected forms of all words.
WordNet contains the noun CAPITAL, but not the verb form CAPITALIZE, and so none of the inflected forms of that particular verb were included in our dictionary, and we failed to recognize that CAPITALIZED was either a past tense or past participle verb form. Since our system doesn't do any morphological processing at run time, this is an area where significant improvement could be made with relatively little effort; if unknown words were run through a process that guessed at their part of speech with decent accuracy, then our phrasal parser would do a better job of segmenting each sentence into its constituent phrases. This would have made a difference this year, as there were several times where a verb was unknown, treated incorrectly as a noun by our phrasal parser, and lumped into a noun phrase as part of a corporation name.</Paragraph> <Section position="1" start_page="307" end_page="307" type="sub_section"> <SectionTitle> Example of Operation </SectionTitle> <Paragraph position="0"> After all stages of preprocessing except the last (dictionary lookup), the first sentence is all on one line (which can't be seen here), and appears as follows: &lt;/SO&gt; begin_text &lt;TXT&gt; BRIDGESTONE SPORTS CO_ SAID FRIDAY IT HAS SET</Paragraph> </Section> </Section> <Section position="7" start_page="307" end_page="307" type="metho"> <SectionTitle> UP A JOINT_VENTURE IN TAIWAN WITH A LOCAL CONCERN AND A JAPANESE TRADING_HOUSE TO PRODUCE GOLF_CLUBS TO BE SHIPPED TO JAPAN . </SectionTitle> <Paragraph position="0"> Notice that the period on CO. was converted to an underscore character, the words JOINT_VENTURE, TRADING_HOUSE, and GOLF_CLUBS were grouped together, and punctuation was split off from the words it was attached to.
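The effects visible in this example (abbreviation periods turned into underscores, multiword terms joined, punctuation split off) can be imitated with a short sketch. It is an illustration under stated assumptions, not our preprocessor: the raw input sentence in the usage note, the abbreviation list, and the three-entry multiword list are all hypothetical, and the real system draws its noun phrases from much larger lists.

```python
import re

# Hypothetical multiword list; these three pairs occur in the example sentence.
MULTIWORDS = {("JOINT", "VENTURE"), ("TRADING", "HOUSE"), ("GOLF", "CLUBS")}


def split_punctuation(text):
    """Mark abbreviation periods with '_' and split off other punctuation."""
    # Assumed list of company abbreviations whose period is part of the word.
    text = re.sub(r"\b(CO|CORP|INC|LTD)\.", r"\1_", text)
    # Split punctuation only before whitespace or end of text, so that
    # decimal points and thousands separators inside numbers survive.
    text = re.sub(r"([.,;:!?])(?=\s|$)", r" \1", text)
    return text.split()


def group_multiwords(tokens):
    """Join adjacent token pairs that appear in the multiword list."""
    grouped = []
    i = 0
    while i != len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MULTIWORDS:
            grouped.append("_".join(pair))
            i += 2
        else:
            grouped.append(tokens[i])
            i += 1
    return grouped
```

Applied to an assumed raw sentence beginning "BRIDGESTONE SPORTS CO. SAID FRIDAY IT HAS SET UP A JOINT VENTURE", the two functions in sequence yield tokens beginning BRIDGESTONE, SPORTS, CO_, SAID and containing JOINT_VENTURE as a single token.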
After the final stage, the data sent to the phrasal parser portion of our system appears as</Paragraph> </Section> <Section position="8" start_page="307" end_page="309" type="metho"> <SectionTitle> KNOWLEDGE BASE FOR PARSING </SectionTitle> <Paragraph position="0"> The knowledge base, consisting of concept nodes and links between them, is distributed over the processor array. The parsing is performed within the knowledge base by propagating markers through the network.</Paragraph> <Paragraph position="1"> The knowledge base contains a concept hierarchy, basic concept sequences, and domain specific patterns for TIE-UPs.</Paragraph> <Paragraph position="2"> Concept Hierarchy and Basic Concept Sequences Several different sources were used in constructing the knowledge-base used in memory-based parsing for this year's system. WordNet was used in the construction of the noun-based ontologically structured knowledge-base, and also in the construction of the verb-based basic concept sequence knowledge-base. The ontological and taxonomical classifications built into WordNet are carried over into our knowledge-base. The English gazetteer was used in building the location knowledge-base. As in WordNet, the inherent structure built into the gazetteer is carried over into the location knowledge-base.</Paragraph> <Paragraph position="3"> In all of these knowledge-bases, a word may have several different semantic meanings. Our system provided for the representation of these different meanings, but not yet for the resolution of the semantic lexical ambiguity. In the future, new knowledge in the form of appropriate links between the noun-based ontologically structured knowledge-base and the verb-based concept sequence knowledge-base, and semantic priming links between appropriate semantic meaning nodes, will aid in the automatic semantic lexical ambiguity resolution of some of the nouns and verbs.
This will occur during parsing, with no extra effort needed by the parser; the knowledge built into the knowledge-bases will guide the markers during marker passing to select the appropriate senses of some ambiguous words. For those words which cannot be disambiguated in this fashion, common-sense and pragmatic inferencing rules can be applied after parsing. Location names, too, can be semantically ambiguous. An extra algorithm of marker passing along a limited number and type of links in the location knowledge-base will be added to the system in the future in order to resolve the ambiguities in location names.</Paragraph> <Paragraph position="4"> In each knowledge-base, a common structure emerges. Nodes are basically of two types. Some nodes represent the actual word itself, while other nodes represent a possible meaning of a single word or multiple words. Using the terminology of WordNet, we call these &quot;meaning&quot; nodes &quot;synset nodes&quot;, and we call the nodes representing the words themselves &quot;word nodes&quot;. Except in the case of antonyms, the semantic relationships such as hypernymy, hyponymy, meronymy and holonymy exist as pointers between the synset nodes, because they are relationships between the meanings, and not between the words themselves. For antonyms, however, the antonym relationship exists between individual words and not the meanings of those words. In the future, the hypernym links will be used like ISA links in propagating markers upwards from the nouns, through the restrictions which shall be added to the knowledge-base for semantic ambiguity resolution, to the verb-based concept sequences.
Likewise, the meronym and holonym links can be used both for semantic priming, and also for &quot;understanding&quot; sentences where meronymic or holonymic metaphor is used.</Paragraph> <Section position="1" start_page="309" end_page="309" type="sub_section"> <SectionTitle> Domain Specific Patterns for TIE-UPs </SectionTitle> <Paragraph position="0"> The most important task for the Joint Venture domain is to detect a TIE-UP relationship between several ENTITIES. Since the TIE-UP relationship is expressed in many different ways, a large number of rules are necessary to detect the TIE-UPs after the syntactic and semantic processing. For example, a TIE-UP relationship between company A and company B can be expressed by using the verb set, form, or establish as &quot;A set up joint venture with B&quot;, or using the verb agree or sign as &quot;A agree with B to start joint venture&quot;. A verb-oriented semantic representation is not sufficient; one more step is needed to generate the final representation.</Paragraph> <Paragraph position="1"> This year, we developed a domain specific semantic pattern representation for efficient information extraction. A pattern is represented as a pair of a meaning frame defining the necessary information to be extracted, and a phrasal pattern describing the surface syntactic ordering. This representation is called an FP-structure. The FP-structures are used by the parser to detect TIE-UP relationships by pattern matching. Figure 3 shows an example of the FP-structure for TIE-UP relationship. The frame has slots of ENTITY0 (joint venture company), ENTITY1 and ENTITY2, and each slot has semantic constraint MUC5-COMPANY. In the text representation of the FP-structure, the elements in parentheses are concepts, and other elements are all lexical elements.
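The pairing of a meaning frame with a phrasal pattern can be pictured as a small data structure. The sketch below is an assumption-laden illustration rather than the representation used in SNAP: the class names Concept and FPStructure, the match function, and the example pattern (modelled on "A set up joint venture with B", not copied from Figure 3) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A pattern element that must be filled by a phrase with a given tag."""
    slot: str          # frame slot that receives the matched phrase
    constraint: str    # semantic tag the phrase must carry


@dataclass
class FPStructure:
    frame: dict        # slot name mapped to its semantic constraint
    pattern: tuple     # Concept elements mixed with literal lexical elements


# Hypothetical FP-structure for one way of expressing a TIE-UP.
TIE_UP_EXAMPLE = FPStructure(
    frame={
        "ENTITY0": "MUC5-COMPANY",
        "ENTITY1": "MUC5-COMPANY",
        "ENTITY2": "MUC5-COMPANY",
    },
    pattern=(
        Concept("ENTITY1", "MUC5-COMPANY"),
        "SET_UP",
        "JOINT_VENTURE",
        "WITH",
        Concept("ENTITY2", "MUC5-COMPANY"),
    ),
)


def match(fp, segments):
    """Match (text, tag) segments against the pattern; return slot bindings or None."""
    if len(segments) != len(fp.pattern):
        return None
    bindings = {}
    for element, (text, tag) in zip(fp.pattern, segments):
        if isinstance(element, Concept):
            if tag != element.constraint:
                return None            # semantic constraint not satisfied
            bindings[element.slot] = text
        elif text != element:
            return None                # lexical element must match literally
    return bindings
```

In this toy version a concept element accepts only a segment carrying exactly the required tag; in the actual knowledge base any lexical entry mapped under the concept in the hierarchy would qualify. ENTITY0 stays unbound here because this surface pattern does not name the joint venture company.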
When the parser performs pattern matching, all the lexical entries that are mapped under a concept specified in the FP-structure are matched to their corresponding element.</Paragraph> <Paragraph position="2"> By using the FP-structure, the direct mapping from a surface phrase pattern of an input sentence to the target representation of a TIE-UP relation between companies is possible.</Paragraph> <Paragraph position="3"> Currently, more than 300 patterns for the TIE-UP relationship have been automatically generated by using our lexical acquisition system, and those patterns were successfully used by the parser to detect TIE-UPs.</Paragraph> <Paragraph position="4"> The lexical acquisition system, and the performance improvement according to the FP-structures, are presented in the later sections.</Paragraph> </Section> </Section> <Section position="9" start_page="309" end_page="311" type="metho"> <SectionTitle> PHRASAL PARSER </SectionTitle> <Paragraph position="0"> From a sequence of dictionary definitions of entries in the modified input sentence, the phrasal parser produces a sequence of phrasal segments consisting of relevant entries and assigns features to noun groups which will be used by the template generator.</Paragraph> <Paragraph position="1"> The list of phrasal segments is as follows: noun group, verb group, adverb group, date time group, and several important word classes such as conjunction, preposition, relative pronoun, and punctuation, and the word &quot;that&quot; and possessive marker ('s). In case of syntactic ambiguity which still exists after ambiguity resolution based on local syntactic and semantic information, the phrasal parser returns several phrasal segments where the input text can be parsed into different phrases.
Figure 4 shows a sequence of phrasal segments generated from the first sentence S1 of the walkthrough message.</Paragraph> <Paragraph position="2"> We describe below how these phrasal segments are produced from the preprocessor outputs using a noun group example. Noun group refers to the noun phrase up to the head noun. Qualifiers of a head noun such as prepositional phrases and relative clauses are processed separately. Noun groups have an intricate syntactic structure. However, the ordering of constituents in the noun group is relatively fixed as shown below: pre-determiner + article + quantifier + demonstrative pronoun + possessive pronoun + numerical modifier + ordinal + cardinal + adverb* + (adjective* | past-participle | present-participle) + noun* Most of the choices are realized by including or omitting possible constituents. An asterisk indicates that a single slot can be filled by a sequence of constituents of the same type. The ordering constraint is used to capture the possible syntactic patterns of noun groups by a straightforward mapping of slot and filler analysis as shown below: the result --> article + common-noun; a densely populated area --> article + past-participle + common-noun; its 1985 takeover --> possessive-pronoun + ordinal + common-noun. 1. Place a P marker on the CSR and its first CSE of each concept sequence.</Paragraph> <Paragraph position="3"> 2. Get an input phrase from the phrasal parser.</Paragraph> <Paragraph position="4"> 3. Generate a concept instance from the input phrase.</Paragraph> <Paragraph position="5"> 4. Propagate A markers from the concept instance to the corresponding CSEs through a chain of instance, hypernym, and reverse-semantic links.</Paragraph> <Paragraph position="6"> 5. 
If a P-marked CSE receives an A marker, accept the CSE.</Paragraph> <Paragraph position="7"> Then pass the P marker to the next CSE through the next link, or propagate an A marker to the CSR through the reverse-last link.</Paragraph> <Paragraph position="8"> 6. If a P-marked CSR receives an A marker, accept the concept sequence and then propagate Cancel markers from the CSR through the Inhibition link to cancel the subsumed concept sequences which are also accepted.</Paragraph> <Paragraph position="9"> 7. If all input phrases are processed, generate CSIs of the accepted concept sequences as the interpretation of the input sentence. Else, go to 2.</Paragraph> <Paragraph position="10"> Figure 5 (a): Basic marker-passing steps. Although this noun group syntactic structure is still a simplification compared to the full range of English noun group constructs, it covers a wide variety of noun groups. In addition, whenever an exception occurs, a new noun group pattern covering the exception is added to the set of existing noun group patterns. In this way, we have successfully developed an extensive set of noun group patterns by using the MUC-5 test corpus.</Paragraph> <Section position="1" start_page="311" end_page="311" type="sub_section"> <SectionTitle> Noun Group Feature Assignment </SectionTitle> <Paragraph position="0"> The noun group features assigned by the phrasal parser are as follows:</Paragraph> </Section> </Section> <Section position="10" start_page="311" end_page="313" type="metho"> <SectionTitle> MEMORY-BASED PARSER </SectionTitle> <Paragraph position="0"> The memory-based parser matches the sequence of phrasal segments with concept sequences stored in the knowledge base via a series of parallel marker-passing commands. When a concept sequence for a tie-up relation is indexed by a set of input phrasal patterns, it generates a pseudo-template.
From this output, the template generator produces a template on tie-up relations.</Paragraph> <Paragraph position="1"> The memory-based parsing algorithm is based on repeated applications of top-down predictions and bottom-up activations by propagating markers in parallel throughout the semantic network knowledge base. Figure 5-(a) shows the basic marker-passing steps used to match an input sentence with relevant concept sequences. Prediction-markers (P markers) are used to predict some concept sequence elements as potential hypotheses of the input phrase. Activation-markers (A markers) are used to propagate activations from the input phrase to the corresponding concept sequence elements. As shown in Figure 5-(b), among the P-marked concept sequence elements, those which receive A markers are accepted as valid hypotheses. After a P-marked concept sequence element is accepted, the P marker is passed to the next concept sequence element, where it waits to be activated by the next input phrase. When all concept sequence elements of a concept sequence are accepted, a concept sequence instance is generated for meaning representation.</Paragraph> <Section position="1" start_page="312" end_page="313" type="sub_section"> <SectionTitle> Parsing Example </SectionTitle> <Paragraph position="0"> We describe here how the memory-based parser produces a pseudo-template from an example sentence S1 of the walkthrough message. At the start of parsing, all concept sequences are potential hypotheses for an incoming sentence. Therefore, in step 1, all concept sequences are initially predicted by placing P markers on their CSRs and the first CSEs. For example, in the example knowledge base shown in Figure 6, P markers are placed on TIE-UP-186, 186-1, TIE-UP-66 and 66-1.
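Before the example continues, the prediction and activation cycle can be approximated by a short sequential sketch. This is a loose simplification under several assumptions, not the parallel SNAP algorithm: markers are simulated with plain Python data, the hypernym table and concept names are hypothetical, a P marker simply waits when a phrase does not activate its CSE, and the Cancel markers that remove subsumed concept sequences are left out.

```python
# Hypothetical hypernym links: each concept points to its parent concept.
HYPERNYMS = {
    "BRIDGESTONE_SPORTS_CO": "MUC5-COMPANY",
    "LOCAL_CONCERN": "MUC5-COMPANY",
}


def activated_concepts(concept):
    """Collect every concept an A marker reaches by following hypernym links upward."""
    reached = {concept}
    while concept in HYPERNYMS:
        concept = HYPERNYMS[concept]
        reached.add(concept)
    return reached


def parse(concept_sequences, phrases):
    """Return the names of concept sequences whose elements were all accepted.

    concept_sequences maps a sequence name to its ordered list of CSE concepts;
    phrases is the ordered list of concepts produced from the input phrases.
    """
    # Predict the first CSE of every concept sequence (P markers).
    p_marker = {name: 0 for name in concept_sequences}
    accepted = []
    for phrase in phrases:
        # Bottom-up activation from the phrase (A markers).
        a_marked = activated_concepts(phrase)
        for name, elements in concept_sequences.items():
            position = p_marker[name]
            if position == len(elements):
                continue                      # sequence already accepted
            if elements[position] in a_marked:
                # A P-marked CSE received an A marker: accept it and
                # move the P marker to the next CSE.
                p_marker[name] = position + 1
                if p_marker[name] == len(elements):
                    accepted.append(name)
    return accepted
```

With two hypothetical sequences, one for "company SET_UP JOINT_VENTURE WITH company" and one for "company AGREE WITH company", an input built from the first surface form leaves only the first sequence accepted; the second is predicted but never completed.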
In step 4, from the first input phrase, [BRIDGESTONE SPORTS CO..], a concept instance MUC5-COMPANY#1 is generated and connected to the corresponding semantic concept MUC5-COMPANY via instance link, as indicated in Figure 6. These links are used as paths for propagating bottom-up activations from the concept instance MUC5-COMPANY#1 to the corresponding concept sequence elements in step 5. Even though the CSEs 186-1, 186-8, 186-10, 66-3 and 66-5 in Figure 6 received A markers, only the P-marked CSE 186-1 is accepted, and therefore, MUC5-COMPANY#1 is bound to 186-1 via cse-instance link, as shown in Figure 7. A markers at other CSE nodes are disregarded, because they are not predicted at this stage. As a result, the P marker at 186-1 is passed to the next CSE 186-2.</Paragraph> <Paragraph position="1"> However, other P markers are not affected by this change. The P-marked nodes (in this case, 186-2, 66-1, This procedure is repeated until all input phrasal segments are processed. Even though many nodes are predicted in the early processing stage, as more input words are processed, only a few nodes receive activations and are accepted. Eventually only relevant hypotheses survive as the interpretation of the input sentence. In this way, a parsing problem is converted into a phrasal pattern matching problem. The memory-based parser acts as a filter, in which each new input word reduces the number of possible meanings.</Paragraph> </Section> </Section> <Section position="11" start_page="313" end_page="315" type="metho"> <SectionTitle> TEMPLATE GENERATION MODULE </SectionTitle> <Paragraph position="0"> Overview The template generation module is a rule-based system, which consists of 4 levels as shown in Figure 8. The first part generates templates based on the output of the memory-based parser with strict constraints for the verb cases.
The second part generates templates based on the output of the phrasal parser, when the memory-based parsing failed to generate a correct interpretation. The third part is almost the same as the second part, but it requires less strict string patterns to apply the rules. The fourth part is also based on the memory-based parsing output but with less strict verb case constraints. When applying the rules, later parts are not applied when an earlier part succeeds in generating templates.</Paragraph> <Paragraph position="1"> Template Generation Based on Memory-Based Parser When memory-based parsing is successful, generated concept sequence instances are marked. In the template generation process, some markers are propagated from this marker, and the information is extracted by collecting markers. The memory-based parser not only gives the syntactic information but also gives the semantic interpretation. Thus, the output contains the template generation rules by itself, and the templates generated have a high precision rate. All concept sequences are obtained by an automatic knowledge acquisition tool from the TIPSTER texts and keys. There are about 300 concept sequences for template generation in the knowledge base.</Paragraph> <Paragraph position="2"> One major problem is caused by the advantage above, because concept sequences require the satisfaction of strict semantic constraints. Because many company and human names are not known to the system and there is no inference routine to guess company or human names, the memory-based parser fails on many sentences that match the existing concept sequences but don't satisfy the semantic constraints. To overcome this problem, the semantic constraints of some part of the concept sequences are relaxed to nothing. These concept sequences with relaxed semantic constraints are chosen very carefully from the concept sequences with semantic constraints, in order to guarantee correct template generation.
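The back-up ordering of the four parts described in the overview amounts to a simple cascade, which the following sketch makes explicit. It is only an illustration: the function generate_templates and its four callable arguments are hypothetical stand-ins for the four parts, each assumed to take a sentence and return a (possibly empty) list of templates.

```python
def generate_templates(strict_memory, strict_phrasal, relaxed_phrasal,
                       relaxed_memory, sentence):
    """Try the four parts in order and stop at the first one that succeeds."""
    levels = [
        ("memory-based parser, strict constraints", strict_memory),
        ("phrasal parser, strict string patterns", strict_phrasal),
        ("phrasal parser, relaxed string patterns", relaxed_phrasal),
        ("memory-based parser, relaxed constraints", relaxed_memory),
    ]
    for name, level in levels:
        templates = level(sentence)
        if templates:
            # A later part is never applied once an earlier part succeeds.
            return name, templates
    return None, []
```

Placing the relaxed memory-based patterns last reflects the ordering given in the overview: they are consulted only when neither the strict concept sequences nor the string-pattern rules produce anything.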
About 100 concept sequences were used for the final run. Both of the modules, with or without semantic constraints, only give the template information about tie-up relationship, entities and entity relationship. Other template information such as activity, industry and so on is filled out based on phrasal parser output.</Paragraph> <Paragraph position="3"> Template Generation Based on Phrasal Parser In this part, information is extracted by matching string patterns. Important noun groups are classified using a special data structure, which specifies if a noun phrase has any (or multiple) of company name, special common nouns (e.g. venture, company or firm), human name, nationality, city name, product name, facility and so on. The template generation module uses the actual string input and the data structure, and extracts the required information when there is a given pattern which matches the text input.</Paragraph> <Paragraph position="4"> Though the phrasal parser does not give the semantic interpretation, the success rate of the phrasal parser is much higher than that of the memory-based parser, and it is used as a back-up input for the template generation module. Because it doesn't have any semantic information, the given strict string patterns impose a kind of semantic constraint in template generation. For the purposes of template generation based on the phrasal parser, domain-specific rules were generated. These rules form the skeleton of the template generation module and serve as the driving force behind the template generation process. All template generation based on the phrasal parser is due to both the successful application of a specific rule, and the implementation of the actions associated with the specific rule.
Another use of the rules was as an aid in building concept sequences for the knowledge base to be utilized by the memory-based parser.</Paragraph> <Paragraph position="5"> Rule generation was performed through the use of training sentences obtained from the TIPSTER texts and also from human intuition. Training sentences were analyzed for important concepts, words, punctuation, symbols, etc. (e.g. company name, 'with', '$', ','). Sequential patterns of these targeted attributes were then formed. This process was repeated for all template fill categories. Around 750 rules (patterns) are implemented to extract template information about tie-up relationship, entity, entity relationship, activity, industry, product/service, etc.</Paragraph> <Paragraph position="6"> Application of the rules is dependent upon the identification of concepts found to be important for the joint-venture domain. The identification of important concepts for use in rule application by the template generation module is performed by the preprocessor and phrasal parser. The preprocessor identifies important domain-specific concepts through the use of a dictionary containing lists of words, some of which are very long (e.g. company names, locations). Several files made available to MUC5 participants were utilized in forming these lists. The phrasal parser performs syntactic analysis in order to disambiguate results produced by the preprocessor.</Paragraph> <Paragraph position="7"> Results obtained by the application of rules by the template generation module based on the phrasal parser augment the results obtained from the memory-based parser. Although the memory-based parser produces precise results, it requires a highly formalized structure for its knowledge base.
Due to size limitations and time limitations regarding construction of the knowledge base, a knowledge base of sufficient size and complexity was not available to the memory-based parser, therefore making it more prone to failure. The precision of results obtained through rule application was not as high as the precision of results obtained by utilizing the memory-based parser. However, since the structural requirements for rules in this system are not as highly formalized, and since size limitations are not as severe, the likelihood of failure is substantially reduced.</Paragraph> <Section position="1" start_page="315" end_page="315" type="sub_section"> <SectionTitle> Definite Reference Resolution </SectionTitle> <Paragraph position="0"> The activation-based reference resolution scheme which has been developed as a part of the SNAP project [1] could not be used, because it relies on the memory-based parsing output. Therefore, a very simple reference resolution scheme is used with phrasal parser output, only for the joint venture and the entities. When a definite reference to a joint venture (e.g. &quot;the joint venture&quot;) is found, the most recent joint venture is chosen as a referent. The same scheme is applied to definite references to entities, except the cases using company names (e.g. aliases). A small module has been written to handle acronyms and company aliases.</Paragraph> </Section> </Section> <Section position="12" start_page="315" end_page="315" type="metho"> <SectionTitle> LEXICAL ACQUISITION SYSTEM </SectionTitle> <Paragraph position="0"> To provide the domain specific semantic patterns (FP-structures) presented in the previous section, an automatic lexical acquisition system called PALKA has been developed [4]. The major goal of this system is to facilitate the construction of a large knowledge base of semantic patterns.
PALKA acquires semantic patterns from a set of domain specific sample texts.</Paragraph> <Paragraph position="1"> The information extraction task on a narrow domain is quite different from a general semantic interpretation task. First, there are only a small number of event categories to which each text should be mapped, and only a small number of pre-defined types of information to be extracted. Various expressions need to be mapped into one event category. Second, the information to be extracted can be found anywhere in the sentence, not only in the subject or the object of the sentence, but also in the prepositional phrases or in the modifier. An efficient representation should map various expressions to one of the desired categories, and detect the information carrying words or phrases from anywhere in the sentence. Based on these observations, the FP-structure has been developed as a suitable representation.</Paragraph> <Paragraph position="2"> Acquisition Procedure First, a meaning frame including several keywords is defined as a target of the acquisition. The TIE-UP meaning frame contains ENTITY0, ENTITY1, and ENTITY2 slots, each of which has its semantic constraint as MUC5-COMPANY. ENTITY0 corresponds to the joint venture company (child company). The word &quot;joint venture&quot; has been selected as a keyword. Relevant sentences are extracted from training texts by using the keyword. Second, an extracted sentence is segmented and decomposed into several simple phrasal patterns. The original text consists of complex sentences which contain relative clauses, nominal clauses, conjunctive clauses, etc. Since semantic patterns are acquired from simple clauses, it is necessary to convert a complex sentence to a set of simple clauses. The phrasal parser groups words based on each word's syntactic category and ordering rules for noun-groups and verb-groups.
After grouping is performed, the phrasal parser first simplifies the sentence by eliminating several unnecessary elements such as determiners, adverbs, quotations, brackets and so on. Then it converts the simplified sentence into several simple clauses by using conversion rules. The conversion rules include separation of relative clauses, nominal clauses and conjunctive clauses. Third, by using the semantic tags provided by the preprocessor, links between the frame slots and the phrasal pattern elements are established. All the noun groups representing a company name are mapped as an ENTITY. Separating ENTITY0 from others is done by a human after generating all the FP-structures. Finally, PALKA constructs an FP-structure based on the mapping information. The basic strategy for constructing an FP-structure is to include the mapped elements and the main verb, and discard the unmapped elements.</Paragraph> <Paragraph position="3"> Figure 9 shows several examples of the training sentences and the FP-structure generated from them for TIE-UP. As discussed in the performance analysis section, the FP-structures generated by PALKA greatly affect the overall scores. Also, by automating the procedure of building patterns, PALKA reduces the time for constructing the knowledge base of semantic patterns, and provides scalability and portability to the knowledge based information extraction.</Paragraph> </Section> </Paper>