<?xml version="1.0" standalone="yes"?> <Paper uid="M93-1023"> <Title>LOCAL CONCERN AND A JAPANESE TRADING HOUSE TO PRODUCE GOLF CLUBS TO BE SHIPPED TO @@_Japan *END*) (*START* THE JOINT VENTURE $COMMA$ BRIDGESTONE SPORTS TAIWAN CO= $COMMA$ CAPITALIZE D AT $$20000000 TWD $COMMA$ WILL START PRODUCTION **DURING 0190 WITH PRODUCTION OF ON AND METAL WOODCLUBS A MONTH *END*) (*START* THE MONTHLY OUTPUT WILL BE LATER RAISED T O &850000 UNITS $COMMA$ BRIDGESTON SPORTS OFFICIALS SAID *END*) (*START* THE NEW COMPAN Y $COMMA$ BASED IN KAOHSIUNG $COMMA$ SOUTHERN TAIWAN $COMMA$ IS OWNED %%75 BY BRIDGESTONE SPORTS $COMMA$ %%15 BY UNION PRECISION CASTING CO= OF C Taiwan AND THE REMAINDER BY 1ACA</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> UMASS/HUGHES : DESCRIPTION OF THE CIRCUS SYSTE M USED FOR MUC-5 1 </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="277" type="metho"> <SectionTitle> INTRODUCTIO N </SectionTitle> <Paragraph position="0"> The primary goal of our effort is the development of robust and portable language processin g capabilities for information extraction applications. The system under evaluation here is based on language processing components that have demonstrated strong performance capabilities in previous evaluation s [Lehnert et al . 1992a] . Having demonstrated the general viability of these techniques, we are no w concentrating on the practicality of our technology by creating trainable system components to replac e hand-coded data and manually-engineered software.</Paragraph> <Paragraph position="1"> Our general strategy is to automate the construction of domain-specific dictionaries and other language related resources so that information extraction can be customized for specific applications with a minimal amount of human assistance. We employ a hybrid system architecture that combines selective concept extraction [Lehnert 1991] technologies developed at UMass with trainable classifier technologies develope d at Hughes [Dolan et al . 1991]. Our MUC-5 system incorporates seven trainable language components to handle (1) lexical recognition and part-of-speech tagging, (2) knowledge of semantic/syntactic interactions , (3) semantic feature tagging, (4) noun phrase analysis, (5) limited coreference resolution, (6) domain objec t recognition, and (7) relational link recognition . Our trainable components have been developed so domai n experts who have no background in natural language or machine learning can train individual syste m components in the space of a few hours.</Paragraph> <Paragraph position="2"> Many critical aspects of a complete information extraction are not appropriate for customization o r trainable knowledge acquisition . For example, our system uses low-level text specialists designed t o recognize dates, locations, revenue objects, and other common constructions that involve knowledge o f conventional language. Resources of this type are portable across domains (although not all domains require all specialists) and should be developed as shamble language resources. The UMass/Hughes focus has been on other aspects of information extraction that can benefit from corpus-based knowledge acquisition . For example, in any given information extraction application, some sentences are more important than others , and within a single sentence some phrases are more important than others. 
When a dictionary is customized for a specific application, vocabulary coverage can be sensitive to the fact that a lot of words contribute little or no information to the final extraction task: full dictionary coverage is not needed for information extraction applications.</Paragraph> <Paragraph position="3"> In this paper we will overview our hybrid architecture and trainable system components. We will look at examples taken from our official test runs, discuss the test results obtained in our official and optional test runs, and identify promising opportunities for additional research.</Paragraph> </Section> <Section position="3" start_page="277" end_page="280" type="metho"> <SectionTitle> TRAINABLE LANGUAGE PROCESSING </SectionTitle> <Paragraph position="0"> Our MUC-5 system relies on two major tools that support automated dictionary construction: (1) OTB, a trainable part-of-speech tagger, and (2) AutoSlog, a dictionary construction tool that operates in conjunction with the CIRCUS sentence analyzer. We trained OTB for EJV on a subset of EJV texts and then again for EME using only EME texts. OTB is notable for the high hit rates it obtains on the basis of relatively little training. We found that OTB attained overall hit rates of 97% after training on only 1009 sentences for EJV. OTB crossed the 97% threshold in EME after only 621 training sentences. Incremental OTB training requires human interaction with a point-and-click interface. Our EJV training was completed after 16 hours with the interface; our EME training required 10 hours.</Paragraph> <Paragraph position="1"> AutoSlog is a dictionary construction tool that analyzes source texts in conjunction with associated key templates (or text annotations) in order to propose concept node (CN) definitions for CIRCUS [Riloff & Lehnert 1993; Riloff 1993]. A special interface is then used for a manual review of the AutoSlog definitions in order to separate the good ones from the bad ones. Of 3167 AutoSlog CN definitions proposed in response to 1100 EJV key templates, 944 (30%) were retained after manual inspection. For EME, AutoSlog proposed 2952 CN definitions in response to 1000 key templates and 2275 (77%) of these were retained after manual inspection. After generalizing the original definitions with active/passive transformations, verb tense generalizations, and singular/plural generalizations, our final EJV dictionary contained 3017 CN definitions and our final EME dictionary contained 4220 CN definitions. It took 20 hours to manually inspect and filter the full EJV dictionary; the full EME dictionary was completed in 17 hours. The CIRCUS dictionary used in our official run was based exclusively on AutoSlog CN definitions.</Paragraph> <Paragraph position="2"> No hand-coded or manually altered definitions were added to the CN dictionary.</Paragraph> <Paragraph position="3"> When CIRCUS processes a sentence it can invoke a semantic feature tagger (MayTag) that dynamically assigns features to nouns and noun modifiers. MayTag uses a feature taxonomy based on the semantics of our target templates, and it dynamically assigns context-sensitive tags using a corpus-driven case-based reasoning algorithm [Cardie 93]. MayTag operates as an optional enhancement to CIRCUS sentence analysis. We ran CIRCUS with MayTag for EJV, but did not use it for EME (we'll return to a discussion of this and other domain differences later). MayTag was trained on 174 EJV sentences containing 5591 words (3060 open class words and 2531 closed class words). Our tests indicate that MayTag achieves a 74% hit rate on general semantic features (covering 14 possible tags) and a 75% hit rate on specific semantic features (covering 42 additional tags). Interactive training for MayTag took 14 hours using a text editor.</Paragraph>
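<Paragraph> The case-based tagging algorithm behind MayTag is described in [Cardie 93] and is not reproduced here. The fragment below is only a minimal sketch of the general idea, assuming a simple nearest-case vote over hand-built context features; the ContextCase structure, the feature names, and the tag names are all hypothetical, not the actual MayTag implementation.

# Minimal sketch of case-based semantic feature tagging (not the actual MayTag code).
# A "case" pairs a word's local context features with the human-assigned semantic tag.
from collections import Counter
from dataclasses import dataclass

@dataclass
class ContextCase:
    features: dict      # e.g. {"word": "venture", "pos": "noun", "prev": "joint"}
    semantic_tag: str   # e.g. "joint-venture" (hypothetical tag name)

def similarity(a, b):
    # Count matching feature values; a real system would weight features.
    return sum(1 for key, value in a.items() if b.get(key) == value)

def tag_word(features, case_base, k=5):
    # Retrieve the k most similar training cases and vote on the tag.
    nearest = sorted(case_base, key=lambda c: similarity(features, c.features), reverse=True)[:k]
    votes = Counter(c.semantic_tag for c in nearest)
    return votes.most_common(1)[0][0]

# Tiny usage example with invented cases:
case_base = [
    ContextCase({"word": "venture", "pos": "noun", "prev": "joint"}, "joint-venture"),
    ContextCase({"word": "clubs", "pos": "noun", "prev": "golf"}, "product-service"),
    ContextCase({"word": "co.", "pos": "noun", "prev": "sports"}, "company-name"),
]
print(tag_word({"word": "venture", "pos": "noun", "prev": "joint"}, case_base, k=1))
</Paragraph>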
<Paragraph position="4"> An important aspect of the MUC-5 task concerns information extraction at the level of noun phrases.</Paragraph> <Paragraph position="5"> Important set fill information is often found in modifiers, such as adjectives and prepositional phrases. Part-of-speech tags help us identify basic noun phrase components, but higher-level processes are needed to determine if a prepositional phrase should be attached, how a conjunction should be scoped, or if a comma should be crossed. Noun phrase recognition is a non-trivial problem at this higher level. To address the more complicated aspects of noun phrase recognition, we use a trainable classifier that attempts to find the best termination point for a relevant noun phrase. This component was trained exclusively on the EJV corpus and then used without alteration for both EJV and EME. Experiments indicate that the noun phrase classifier terminates EJV noun phrases perfectly 87% of the time; 7% of its noun phrases pick up spurious text (they are extended too far), and 6% are truncated (they are not extended, or not extended far enough). Similar hit rates are found with EME test data: 86% for exact NP recognition, with 6% picking up spurious text and 8% being truncated. The noun phrase classifier was trained on 1350 EJV noun phrases examined in context. It took 14 hours to manually mark these 1350 instances using a text editor.</Paragraph> <Paragraph position="6"> Before we can go from CIRCUS output to template instantiations, we create intermediate structures called memory tokens. Memory tokens incorporate coreference decisions and structure relevant information to facilitate template generation. Memory tokens record source strings from the original input text, OTB tags, MayTag features, and pointers to the concept nodes that extracted individual noun phrases.</Paragraph> <Paragraph position="7"> Discourse analysis contributes to critical decisions associated with memory tokens. Here we find the greatest challenges to trainable language systems. Thus far, we have implemented one trainable component that contributes to coreference resolution in limited contexts. We isolate compound noun phrases that are syntactically consistent with appositive constructions and pass these NP pairs on to a coreference classifier.</Paragraph> <Paragraph position="8"> Since adjacent NPs may be separated by a comma if they occur in a list or at a clause boundary, it is easy to confuse legitimate appositives with pairings of unrelated (but adjacent) NPs. Appositive recognition is therefore treated as a binary classification problem that can be handled with corpus-driven training. For our official MUC-5 runs we trained a classifier to handle appositive recognition using EJV development texts and then used the resulting classifier for both EJV and EME. Our best test results with this classifier showed an 87% hit rate on EJV appositives. It took 10 hours to manually classify 2276 training instances for the appositive classifier using a training interface.</Paragraph>
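<Paragraph> Because appositive recognition is cast as a binary classification problem over adjacent, comma-separated NP pairs, a generic classifier can be trained from manually marked instances. The sketch below is only an illustration of that framing, using a handful of invented surface features and an off-the-shelf decision tree from scikit-learn; it is not the actual UMass/Hughes classifier or its feature set.

# Sketch: appositive recognition as binary classification over adjacent NP pairs.
# Feature choices here are hypothetical; the actual classifier and its features differ.
from sklearn.tree import DecisionTreeClassifier

def encode_pair(np1, np2):
    # A few illustrative boolean features about the two noun phrases.
    return [
        int("co." in np2.lower() or "corp." in np2.lower()),  # NP2 looks like a company name
        int(np2.lower().startswith(("a ", "an ", "the "))),   # NP2 starts with a determiner
        int(len(np2.split()) > 6),                            # NP2 contains many words
        int(np1.lower().startswith("the ")),                  # NP1 is a definite description
    ]

# Invented training pairs: 1 = genuine appositive, 0 = unrelated adjacent NPs.
pairs = [
    ("the joint venture", "Bridgestone Sports Taiwan Co.", 1),
    ("Taga Co.", "a company active in trading with Taiwan", 1),
    ("the new company", "southern Taiwan", 0),
]
X = [encode_pair(np1, np2) for np1, np2, _ in pairs]
y = [label for _, _, label in pairs]

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([encode_pair("the venture", "Union Precision Casting Co.")]))
</Paragraph>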
<Paragraph position="9"> Our final tool, TTG, is responsible for the creation of template generators that map CIRCUS output into final template instantiations. TTG template generators are responsible for the recognition and creation of domain objects as well as the insertion of relational links between domain objects. TTG is corpus-driven and requires no human intervention during training. Application-specific access methods (pathing functions) must be hand-coded for a new domain, but these can be added to TTG in a few days by a knowledgeable technician working with adequate domain documentation. Once these adjustments are in place, TTG uses memory tokens and key templates to train classifiers for template generation. No further human intervention is required to create the template generators, although additional testing, tuning, and adjustments are needed for optimal performance.</Paragraph> <Paragraph position="10"> Our hybrid architecture demonstrates how machine learning capabilities can be utilized to acquire many different kinds of knowledge from a corpus. These same acquisition techniques also make it easy to exploit the resulting knowledge without additional knowledge engineering or sophisticated reasoning. The knowledge we can readily acquire from a corpus of representative texts is limited with respect to reusability, but it is nevertheless cost-effective in a system development scenario predicated on customized software. The trainable components used for both EJV and EME were completed after 101 hours of interactive work by a human-in-the-loop. Moreover, most of our training interfaces can be effectively operated by domain experts: programming knowledge or familiarity with computational linguistics is generally not required.2 Near the end of this paper we will report the results of a system development experiment that supports this claim.</Paragraph> <Paragraph position="11"> There will always be a need for some amount of manual programming during the system development cycle for a new information extraction application. Even so, significant amounts of system development that used to rely on experienced programmers have been shifted over to trainable language components. The ability to automate knowledge acquisition on the basis of key templates represents a significant redistribution of labor away from skilled knowledge engineers, who need access to domain knowledge, directly to the domain experts themselves. By putting domain experts into the role of the human-in-the-loop we can reduce dependence on software technicians. When significant amounts of system development work are being handled by automated knowledge acquisition and expert-assisted knowledge acquisition, it will become increasingly cost-effective to customize and maintain a variety of information extraction applications. We have only just begun to explore the range of possibilities associated with trainable language processing systems.</Paragraph> <Paragraph position="12"> The hybrid architecture underlying our official MUC-5 systems was less than six months old at the time of the evaluation, and most of the trainable language components that we utilized were less than a year old.
Less than 24 person/months were expended for both of the EJV and EME systems, although this estimate is confounded by the fact that trainable components and their associated interfaces were being designed, implemented, and tested by the same people responsible for our MUC-5 system development. The creation of a trainable system component represents a one-time system development investment that can be applied to subsequent systems at much less overhead.</Paragraph> <Paragraph position="13"> Figure 1 outlines the basic flow-of-control through the major components of the UMass/Hughes MUC-5 system. Note that most of the trainable components depend only on the texts from the development corpus.</Paragraph> <Paragraph position="14"> The concept node dictionary and the trainable template generator also rely on answer keys during training. In the case of the concept node dictionary, we have been able to drive our dictionary construction process on the basis of annotated texts created by using a point-and-click text marking interface. So the substantial overhead associated with creating a large collection of key templates is not needed to support automated dictionary construction. However, we do not see how to support trainable template generation without a set of key templates, so this one trainable component requires a significant investment with respect to labor.</Paragraph> <Paragraph position="15"> 2 Some technical background is needed to train OTB. Knowledge of our part-of-speech tags is needed for that interface.</Paragraph> </Section> <Section position="4" start_page="280" end_page="280" type="metho"> <SectionTitle> THE OPTIONAL TEST RUNS </SectionTitle> <Paragraph position="0"> We ran optional tests to see what sort of recall/precision trade-offs were available from the system.</Paragraph> <Paragraph position="1"> Since the template generator is a set of classifiers, and each classifier outputs a certainty associated with a hypothesized template fragment, we have many parameters that can be manipulated. Raising the threshold on the certainty for a hypothesis will, in most cases, increase precision and reduce recall. In the experiments reported here, we have varied the parameters over broad classes of discrimination trees. There are three important classes of decision tree: (1) trees that filter the creation of objects based on string fills, (2) trees that filter the creation of objects based on set fills, and (3) trees that hypothesize relations among objects. An example of the first class is the tree that filters the CIRCUS output for entity names in the EJV domain. An example of the second class is the tree that filters possible lithography objects based on evidence of the type of lithography process. The trees that hypothesize TIE_UP_RELATIONSHIPs and ME_CAPABILITYs are examples of the third class.</Paragraph> <Paragraph position="2"> For these experiments we have varied the certainty thresholds for all trees of a given class. Figure 4 shows the trade-off achieved for EME.</Paragraph> <Paragraph position="3"> This trade-off curve was achieved by varying, in concert, the thresholds on all three classes of discrimination tree from 0.0 to 0.9.</Paragraph>
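<Paragraph> The fragment below is a schematic rendering of that sweep: template-fragment hypotheses carrying classifier certainties are filtered at increasing thresholds and scored against an answer key to trace out a recall/precision curve. The hypothesis and scoring details are invented stand-ins, not the TTG classifiers or the MUC-5 scorer.

# Schematic sketch of a certainty-threshold sweep used to trace a recall/precision curve.
# "hypotheses" stands in for decision-tree output: (template_fragment, certainty) pairs.

def sweep_thresholds(hypotheses, answer_key, step=0.1):
    curve = []
    threshold = 0.0
    while threshold <= 0.9 + 1e-9:
        kept = [frag for frag, certainty in hypotheses if certainty >= threshold]
        correct = sum(1 for frag in kept if frag in answer_key)
        recall = correct / len(answer_key) if answer_key else 0.0
        precision = correct / len(kept) if kept else 0.0
        curve.append((threshold, recall, precision))
        threshold += step
    return curve

# Toy usage with invented certainties:
hyps = [("tie_up_1", 0.8), ("tie_up_2", 0.35), ("entity_3", 0.15)]
key = {"tie_up_1", "entity_3"}
for t, r, p in sweep_thresholds(hyps, key):
    print(f"threshold={t:.1f}  recall={r:.2f}  precision={p:.2f}")
</Paragraph>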
<Paragraph position="4"> Figure 5 shows the trade-off curve achieved in EJV. The difference between the two curves highlights differences between the two domains and between the system configurations used for the two domains. The EME curve shows a much more dramatic trade-off. The EJV curve shows that only modest varying of recall and precision is achievable. Part of this is a reflection of the two domains. In EJV, most relationships were found via two noun phrases that shared a common CN trigger. This method proved to be effective at detecting relationships. Therefore the only real difference in the trade-off comes from varying the thresholds for the string-fill and set-fill trees, which generate the objects that are then composed into relationships. In EME, there are not nearly as many shared triggers, so the template generator must attempt intelligent guesses for relations. The probabilistic guesses made in EME are much more amenable to threshold manipulation than the more structured information used in EJV.</Paragraph> <Paragraph position="5"> Also, in EJV the system ran with a slot masseur that embodied some domain knowledge. In EJV, TTG was configured to hypothesize objects only if the slot masseur had found a reasonable slot-fill or set-fill. This use of domain knowledge further limited the efficacy of changing certainty thresholds.</Paragraph> </Section> <Section position="5" start_page="280" end_page="286" type="metho"> <SectionTitle> TRAINABLE INFORMATION EXTRACTION IN ACTION </SectionTitle> <Paragraph position="0"> Before CIRCUS can tackle an input sentence, we have to pass the source text through a preprocessor that locates sentence boundaries and reworks the source text into a list structure. The preprocessor replaces punctuation marks with special symbols and applies text processing specialists to pick up dates, locations, and other objects of interest to the target domain. We use the same preprocessing specialists for both EJV and EME: many specialists will apply to multiple domains. A subset of the Gazetteer was used to support the location specialist, but no other MRDs are used by the preprocessing specialists. We do not have a specialist that attempts to recognize company names.</Paragraph> <Paragraph position="1"> OTB tagged 97.1% of the words in EJV 0592 correctly. One error associated with "... A COMPANY ACTIVE IN TRADING WITH TAIWAN ..." led to a truncated noun phrase when "active" was tagged as a head noun instead of a nominative predicate.</Paragraph> <Paragraph position="2"> With part-of-speech tags in place, CIRCUS can begin selective concept extraction. On the first sentence from EJV 0592, CIRCUS triggers 18 CN definitions triggered by the words "said" (3 CNs), "set" (3 CNs), "venture" (9 CNs), "produce" (1 CN), and "shipped" (2 CNs). These CNs extract a number of key noun phrases, and assign semantic features to these noun phrases based on soft constraints in the CN definition. Some of these features were recognized to be inconsistent with the slot fill and others were deemed acceptable. Notice that different CNs picked up "BRIDGESTONE SPORTS CO." with incompatible semantic features (it was associated with both a joint venture and a joint venture parent feature).</Paragraph> <Paragraph position="3"> As we can see from this sentence, CN feature types are not always reliable, and CIRCUS does not always recognize the violation of a soft feature constraint.</Paragraph>
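<Paragraph> Concept node definitions and their soft constraints are described in [Lehnert 1991; Riloff & Lehnert 1993]. As a rough illustration only, the sketch below shows the general shape of the idea: a CN is anchored on a trigger word, extracts a phrase from a particular syntactic position, and attaches semantic features that are treated as violable (soft) constraints. The field names are assumptions, not the actual CIRCUS data structures.

# Schematic sketch of a concept node (CN) definition with soft feature constraints.
# Field names are hypothetical; the actual CIRCUS representation differs.
from dataclasses import dataclass

@dataclass
class ConceptNode:
    trigger: str        # lexical item that activates the CN, e.g. "venture"
    pattern: str        # syntactic position of the extracted phrase, e.g. "subject"
    slot: str           # what the extracted phrase is taken to be, e.g. "joint-venture-company"
    soft_features: set  # semantic features expected of the filler (violable)

def instantiate(cn, phrase, phrase_features):
    # Soft constraints flag a mismatch but do not block the extraction.
    violated = cn.soft_features - phrase_features
    return {"slot": cn.slot, "filler": phrase, "violations": violated}

cn = ConceptNode("venture", "subject", "joint-venture-company", {"company-name"})
print(instantiate(cn, "BRIDGESTONE SPORTS CO.", {"company-name"}))
print(instantiate(cn, "GOLF CLUBS", {"product-service"}))  # extraction kept, violation noted
</Paragraph>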
<Paragraph position="4"> An independent set of semantic features is obtained from MayTag. In the first sentence of EJV 0592, MayTag missed marking only "golf clubs" as a product/service. In addition to extracting some noun phrases and assigning semantic features to those noun phrases, we also call the noun phrase classifier to see if any of the simple NPs picked up by the CN definitions should be extended to longer NPs. For this sentence, the noun phrase classifier extended only one NP: it decided that "A JOINT VENTURE" should be extended to pick up "A JOINT VENTURE IN TAIWAN WITH A LOCAL CONCERN AND A JAPANESE TRADING HOUSE". The second prepositional phrase should not have been included - this is an NP expansion that was overextended.</Paragraph> <Paragraph position="5"> Each noun phrase extracted by a CIRCUS concept node will eventually be preserved in a memory token that records the CN features, MayTag features, any NP extensions, and other information associated with CN definitions. But before we look at the memory tokens, let's briefly review the other NPs that are extracted from the remainder of the text. For each preprocessed sentence produced in response to EJV 0592, we will put the noun phrases extracted by CIRCUS into boldface and use underlines to indicate how the noun phrase classifier extends some of these NPs.</Paragraph> <Paragraph position="6"> As far as our CN dictionary coverage is concerned, we were able to identify all of the relevant noun phrases needed with the exception of "A LOCAL CONCERN AND A JAPANESE TRADING HOUSE", which should have been picked up by a JV parent CN. In fact, our AutoSlog dictionary had two such definitions in place for exactly this type of construction, but neither definition was able to complete its instantiation because of a previously unknown problem with time stamps inside CIRCUS. This was a processing failure - not a dictionary failure.</Paragraph> <Paragraph position="7"> Trainable noun phrase analysis processes 13 of the 17 NP instances marked above correctly. Three of the NPs were expanded too far, and one was expanded but not quite far enough due to a tagging error by OTB ("a company active ..."). An inspection of the 13 correct instances reveals that 7 of these would have been correctly terminated by simple heuristics based on part-of-speech tags. It is important to note that the trainable NP analyzer had to deduce these more "obvious" heuristics in the same way that it deduces decisions for more complicated cases. It is encouraging to see that straightforward heuristics can be acquired automatically by trainable classifiers. When our analyzer makes a mistake, it generally happens with the more complicated noun phrases (which is where hand-coded heuristics tend to break down as well). After the noun phrase classifier has attempted to find the best termination points for the relevant NPs, we then call the coreference classifier to consider pairs of adjacent NPs separated by a comma.
In this text we find three such appositive candidates (the second of which contains an extended NP that was not properly terminated): (1) THE JOINT VENTURE, BRIDGESTONE SPORTS TAIWAN CO.; (2) TAGA CO., A COMPANY ACTIVE ...; (3) THE NEW COMPANY, BASED IN KAOHSIUNG, SOUTHERN TAIWAN.</Paragraph> <Paragraph position="8"> In the third case, the location specialist failed to recognize either Kaohsiung or Southern Taiwan as names of locations. On the other hand, the fragment "based in Kaohsiung" was recognized as a location description and was therefore reformatted as "THE NEW COMPANY (%BASED-IN% KAOHSIUNG), SOUTHERN TAIWAN", which set up the entire construct as an appositive candidate. The coreference classifier then went on to accept each of these three instances as valid appositive constructions. This was the right decision in the first two cases, but wrong in the third. If full location recognition had been working, this last instance would never have been handed to the coreference classifier in the first place.</Paragraph> <Paragraph position="9"> The coreference classifier tells us when adjacent noun phrases should be merged into a single memory token. We also invoke some hand-coded heuristics for coreference decisions that can be handled on the basis of lexical features alone. These heuristics determine that Bridgestone Sports Co. is coreferent with Bridgestone Sports, and that "THE JOINT VENTURE, BRIDGESTONE SPORTS TAIWAN CO." is coreferent with "A JOINT VENTURE IN TAIWAN ...". Our lexical coreference heuristics are nevertheless very conservative, so they fail to merge our four product/service instances in spite of the fact that "clubs" appears in three of these string fills. In effect, we pass the resulting memory token output to TTG (the memory token listing, which includes items such as (11) "%%15" and (12) "%%75", is not reproduced here). We failed to extract "the remainder by ..." for the third ownership object because our percentage specialist was not watching for verbal referents in a percentage context - this could be fixed with an adjustment to the specialist.</Paragraph> <Paragraph position="10"> When TTG receives memory tokens as input, the object existence classifiers try to filter out spurious information picked up by overzealous CN definitions. Unfortunately, in the case of 0592, TTG filtered out two good memory tokens: #1 (describing the parent Taga Co.) and #5 (describing the joint venture). It was particularly damaging to throw away #5 because that memory token contained the correct company name (Bridgestone Sports Taiwan Co.). Of the 3 remaining memory tokens describing companies, TTG correctly identified the two parent companies on the basis of semantic features, but then it was forced to pick up #4 as the child company. Our pathing function was smart enough to know that "THE NEW COMPANY" was probably not a good company name, but that left us with "SOUTHERN TAIWAN" for the company name.
So a failure that started with location recognition led to a mistake in trainable appositive recognition, which then combined with a failure in lexical coreference recognition and a filtering error by TTG to give us a joint venture named "SOUTHERN TAIWAN" instead of "BRIDGESTONE SPORTS TAIWAN". Overly aggressive filtering by TTG also resulted in the loss of our 4 product/service memory tokens.</Paragraph> <Paragraph position="11"> Our CN instantiations do not explicitly represent relational information, but CNs that share a common trigger word can be counted on to link two CN instantiations in some kind of a relationship. Trigger families can reliably tell us when two entities are related, but they can't tell us what that relationship is. We relied on TTG to deduce specific relationships on the basis of its training. In cases like "75% BY BRIDGESTONE SPORTS", TTG had no trouble linking extracted percentage objects with companies. But our trainable link recognition ran into more difficulties when trigger families contained multiple companies. Among the features that TTG had available for discrimination were closed class features, such as memory token types, semantic features, and CN patterns, and open class features (i.e. trigger words). However, although there exist heuristics for discriminating relationships based on particular words, the combination of the algorithm used (ID3) and the amount of data (600 stories) failed to induce these heuristics. There may be other algorithms, however, that could use the same or less data, together with external knowledge, to derive such heuristics from the training data.</Paragraph> <Paragraph position="12"> The processing for EME proceeds very similarly to EJV, with the exceptions that MayTag is not used in our EME configuration, and that in the EME system we used our standard CN mechanism and an additional keyword CN (KCN) mechanism. The KCN mechanism was used to recognize specific types of processing, equipment, and devices that have one or only a few possible manifestations. Below we see the OTB tags for the first sentence, all of which are correct. In fact, for EME text 2789568, OTB had a 100% hit rate. The memory token structure below illustrates the processing of the text prior to TTG. Two NPs are identified as the same entity, "Nikon" and "Nikon Corp." The two NPs are merged into one memory token based on name merging heuristics. The second NP demonstrates how multiple recognition mechanisms can add robustness to the processing. "Nikon Corp." is picked up both by a CN triggered off of "plans to market" and by two KCNs, one that looks for "Corp." and another that looks for the lead NP in the story. Unfortunately, our system did not get any lithography objects for this story. On our list of things to get to if time permitted was creating a lithography object for an otherwise orphaned stepper. We would have gotten only one lithography object since we merged all mentions of "stepper" into one memory token. We created a synthetic version of the system that inserted a lithography memory token corresponding to each stepper. One was discarded by TTG and another was created because there were two different equipment objects attached to the remaining lithography object. The features that TTG used to hypothesize a new ME_CAPABILITY are illustrative of one of the weaknesses of this particular method.
TTG used the following features to decide not to generate an ME_CAPABILITY developer:</Paragraph> <Paragraph position="13">
FEATURE | RELATION CERTAINTY AFTER FEATURE
The process is not X-RAY | 0.23
The entity is not triggered off "developed" | 0.14
The process is not CVD | 0.03
The process is not LITHOGRAPHY of UKN type | 0.04
The process is not ETCHING | 0.06
The entity is not triggered off "from" | 0.12
</Paragraph> <Paragraph position="14"> All of the features are negative, and the absence of each feature reduces the certainty that the relation holds, because each feature's presence, broadly speaking, is positive evidence of a relation. Therefore, the node of the decision tree that is found is a grouping of cases that have no particular positive evidence to support the relation, but also no negative evidence. With the relation threshold set at 0.3, this yields a negative identification of a relation. However, there are strong indications of a relation here. For example, the trigger "plans to market" is good evidence of a relation; however, the nature of decision tree algorithms (recursively splitting the training data) causes us to lose that feature (in favor of other, better features). The following set of features shows what TTG used to generate an ME_CAPABILITY distributor:</Paragraph> <Paragraph position="15">
FEATURE | RELATION CERTAINTY AFTER FEATURE
The process is not packaging | 0.47
The entity is not in a PP | 0.58
A CN marked the entity as an entity | 0.55
The process is not layering type sputtering | 0.40
</Paragraph> <Paragraph position="16"> Again, we do not see here the features that we would expect, given the text. A human generating rules would say that "plans to market" is a good indication of an ME_CAPABILITY distributor.</Paragraph> </Section> <Section position="11" start_page="286" end_page="288" type="metho"> <SectionTitle> DICTIONARY CONSTRUCTION BY DOMAIN EXPERTS </SectionTitle> <Paragraph position="0"> Sites participating in the recent message understanding conferences have increasingly focused their research on developing methods for automated knowledge acquisition and tools for human-assisted knowledge engineering. However, it is important to remember that the ultimate users of these tools will be domain experts, not natural language processing researchers. Domain experts have extensive knowledge about the task and the domain, but will have little or no background in linguistics or text processing. Tools that assume familiarity with computational linguistics will be of limited use in practical development scenarios.</Paragraph> <Paragraph position="1"> To investigate practical dictionary construction, we conducted an experiment with government analysts. We wanted to demonstrate that domain experts with no background in text processing could successfully use the AutoSlog dictionary construction tool [Riloff and Lehnert 1993]. We compared the dictionaries constructed by the government analysts with a dictionary constructed by a UMass researcher. The results of the experiment suggest that domain experts can successfully use AutoSlog with only minimal training and achieve performance levels comparable to NLP researchers.</Paragraph> <Paragraph position="2"> AutoSlog is a system that automatically constructs a dictionary for information extraction tasks.
Given a training corpus, AutoSlog proposes domain-specific concept node definitions that CIRCUS [Lehnert 1991] uses to extract information from text. However, many of the definitions proposed by AutoSlog should not be retained in the permanent dictionary because they are useless or too risky. We therefore rely on a human-in-the-loop to manually skim the definitions proposed by AutoSlog and separate the good ones from the bad ones.</Paragraph> <Paragraph position="3"> Two government analysts agreed to be the subjects of our experiment. Both analysts had generated templates for the joint ventures domain, so they were experts with the EJV domain and the template-filling task. Neither analyst had any background in linguistics or text processing, and neither had any previous experience with our system. Before they began using the AutoSlog interface, we gave them a 1.5 hour tutorial to explain how AutoSlog works and how to use the interface. The tutorial included some examples to highlight important issues and general decision-making advice. Finally, we gave each analyst a set of 1575 concept node definitions to review. These included definitions to extract 8 types of information: jv-entities, facilities, person names, product/service descriptions, ownership percentages, total revenue amounts, revenue rate amounts, and ownership capitalization amounts.</Paragraph> <Paragraph position="4"> We did not give the analysts all of the concept node definitions proposed by AutoSlog for the EJV domain. AutoSlog actually proposed 3167 concept node definitions, but the analysts were only available for two days and we did not expect them to be able to review 3167 definitions in this limited time frame. So we created an "abridged" version of the dictionary by eliminating jv-entity and product/service patterns that appeared only infrequently in the corpus.3 The resulting "abridged" dictionary contained 1575 concept node definitions.</Paragraph> <Paragraph position="5"> We compared the analysts' dictionaries with the dictionary generated by UMass for the final Tipster evaluation. However, the official UMass dictionary was based on the complete set of 3167 definitions originally proposed by AutoSlog as well as definitions that were spawned by AutoSlog's optional generalization modules. We did not use the generalization modules in this experiment, due to time constraints. To create a comparable UMass dictionary, we removed all of the "generalized" definitions from the UMass dictionary as well as the definitions that were not among the 1575 given to the analysts. The resulting UMass dictionary was a much smaller subset of the official UMass dictionary.</Paragraph> <Paragraph position="6"> Analyst A took approximately 12.0 hours and Analyst B took approximately 10.6 hours to filter their respective dictionaries. Figure 6 shows the number of definitions that each analyst kept, separated by types. For comparison's sake, we also show the breakdown for the smaller UMass dictionary.</Paragraph> <Paragraph position="7"> 3 AutoSlog may propose the same definition more than once if the same pattern appears multiple times in the corpus. We removed all jv-entity definitions that were proposed < 2 times and all product/service definitions that were proposed < 3 times.
We eliminated jv-entity and product/service definitions only because the sheer number of these definitions overwhelmed the other types.</Paragraph> <Paragraph position="8"> We compared the dictionaries constructed by the analysts with the UMass dictionary in the following manner. We took the official UMass/Hughes system, removed the official UMass dictionary, and replaced it with a new dictionary (the smaller UMass dictionary or one of the analysts' dictionaries). One complication is that the UMass/Hughes system includes two modules, TTG and MayTag, that use the concept node dictionary during training. In a clean experimental design, we should ideally retrain these components for each new dictionary. We did retrain the template generator (TTG), but we did not retrain MayTag. We expect that this should not have a significant impact on the relative performances of the dictionaries, but we are not certain of its exact impact. Finally, we scored each new version of the UMass/Hughes system on the Tips3 test set. Figure 7 shows the results for each dictionary.</Paragraph> <Paragraph position="9"> The F-measures (P&R) were extremely close across all 3 dictionaries. In fact, both analysts' dictionaries achieved slightly higher F-measures than the UMass dictionary. The error rates (ERR) for all three dictionaries were identical. But we do see some variation in the recall and precision scores. We also see variations when we score the three parts of Tips3 separately (see Figure 8).</Paragraph> <Paragraph position="10"> In general, the analysts' dictionaries achieved slightly higher recall but lower precision than the UMass dictionary. We hypothesize that this is because the UMass researcher was not very familiar with the corpus and was therefore somewhat conservative about keeping definitions. The analysts were much more familiar with the corpus and were probably more willing to keep definitions for patterns that they had seen before. There is usually a trade-off involved in making these decisions: a liberal strategy will often result in higher recall but lower precision, whereas a conservative strategy may result in lower recall but higher precision. It is interesting to note that even though there was great variation across the individual dictionaries (see Figure 6), the resulting scores were very similar. This may be because some definitions can contribute a disproportionate amount of performance if they are frequently triggered by a given test set. If the three dictionaries were in agreement on that subset of the dictionary that is most heavily used, those definitions could dominate overall system performance. Some dictionary definitions are more important than others. To summarize, this experiment suggests that domain experts can successfully use AutoSlog to build domain-specific dictionaries for information extraction. With only 1.5 hours of training, two domain experts constructed dictionaries that achieved performance comparable to a dictionary constructed by a UMass researcher.
Although this was only a small experiment, the results lend credibility to the claim that domain experts can build effective dictionaries for information extraction.</Paragraph> </Section> <Section position="12" start_page="288" end_page="290" type="metho"> <SectionTitle> WHAT WORKS AND WHAT NEEDS WORK </SectionTitle> <Paragraph position="0"> When we look at individual texts and work up a walk-through analysis of what is and is not working, we find that many of our trainable language components are working very well. The dictionary coverage provided by AutoSlog appears to be quite adequate. OTB is operating reliably enough for subsequent sentence analysis. When we run into difficulties with our trainable components, we often find that many of these difficulties stem from a mismatch of training data with test data. For example, when we trained the coreference interface for appositive recognition, we eliminated from the training data all candidate pairs involving locations because the location specialist should be identifying locations for us. If the coreference classifier were operating in an ideal environment, it would never encounter unrecognized locations.</Paragraph> <Paragraph position="1"> Unfortunately, as we saw with EJV 0592, the location specialist does not trap all the locations, and this led to a bad coreference decision. In an earlier version of the coreference classifier we had trained it on imperfect data containing unrecognized locations, but as the location specialist improved, we felt that the training for the coreference classifier was falling increasingly out of sync with the rest of the system, so we updated it by eliminating all the location instances. Then when the coreference classifier was confronted with an unrecognized location, it failed to classify it correctly. When upstream system components are continually evolving (as they were during our MUC-5 development cycle), it is difficult to synchronize downstream dependencies in training data. A better system development cycle would stabilize upstream components before training downstream components in order to maintain the best possible synchronization across trainable components.</Paragraph> <Paragraph position="2"> TTG was able to add some value to the output of CIRCUS and subsequent discourse processing. In module tests, TTG typically added 6-12% accuracy in identifying domain objects and relationships. That added value is measured against picking the most likely class (yes or no) for a particular domain object (e.g. JV-ENTITY or ME-LITHOGRAPHY) or relationship (e.g. JV-TIE-UP or ME-MICROELECTRONICS-CAPABILITY). However, TTG fell far below our expectations for correctly filtering and connecting the parser's output. We find two reasons for this shortfall. First, some small deficit can be attributed to the system development cycle, since TTG sits at the end of the cycle of training and testing various modules.</Paragraph> <Paragraph position="3"> The second, and by far the dominant, effect comes from the combination of the training algorithm (ID3) and the amount of data. As mentioned previously, there are two types of features used by TTG: (1) closed class features (e.g. token type, semantic features, and CN patterns) and (2) open class features (i.e. CN trigger words). Using open class features can be difficult, because most algorithms cannot detect reliable discriminating features if there are too many features--reliable features cannot be separated from noise.</Paragraph> <Paragraph position="4"> Using trigger words in conjunction with relations between memory tokens results in 3,000-5,000 binary features.</Paragraph>
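<Paragraph> As a rough sketch of why the feature space grows so large, the fragment below encodes a candidate relation between two memory tokens as a short closed-class vector plus one binary indicator per trigger word in the training vocabulary; with 3,000-5,000 trigger words, the open-class block dominates. The specific feature names and vocabulary are invented for illustration and are not the actual TTG feature set.

# Sketch: encoding a candidate relation between memory tokens as binary features.
# Closed-class features are few; one indicator per possible trigger word makes the vector large.

CLOSED_CLASS = ["token_is_entity", "token_is_process", "entity_in_pp", "cn_marked_entity"]

def encode_candidate(closed_values, triggers, trigger_vocab):
    closed = [int(closed_values.get(name, False)) for name in CLOSED_CLASS]
    # One binary feature per trigger word seen anywhere in the training corpus.
    open_class = [int(word in triggers) for word in trigger_vocab]
    return closed + open_class

# Toy usage with an invented 6-word trigger vocabulary:
vocab = ["said", "set", "venture", "produce", "market", "developed"]
vec = encode_candidate(
    {"token_is_entity": True, "cn_marked_entity": True},
    {"venture", "said"},
    vocab,
)
print(vec)  # 4 closed-class bits followed by 6 trigger-word bits
</Paragraph>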
<Paragraph position="5"> With no noise suppression added to the algorithm and given a large number of features, ID3 will create very deep decision trees that classify stories in the training set based on noise.</Paragraph> <Paragraph position="6"> We ran two sets of decision trees in deciding how to configure our system for the final test run: MIN-TREES, using only closed class features and no noise suppression, and MAX-TREES, using closed class and open class features and a noise suppression rule. The noise suppression was a termination condition on the recursion of the ID3 algorithm: recursion was terminated when all features resulted in creating a node that classified examples from fewer than 10 different source texts. Using closed class features rarely resulted in a terminal node that classified examples from fewer than 10 stories. In all tests the MAX-trees performed better. However, as a result of the noise suppression, no decision tree contained very many discriminations on a trigger. The performance of the MAX-trees indicated that individual words are good discriminators; however, their scarcity in the decision trees indicates that we are not using the appropriate algorithm. We believe that data-lean algorithms (such as explanation-based learning) in concert with shared knowledge bases might be effective.</Paragraph>
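<Paragraph> The noise suppression rule is simply a termination condition on ID3's recursion. The sketch below shows where such a check sits in a generic ID3-style tree builder that tracks the source text of each training example; the surrounding implementation details are illustrative assumptions, not the actual TTG training code.

# Sketch: an ID3-style tree builder with the noise-suppression stopping rule described above.
# Each training example carries the id of the source text it came from; recursion stops
# when a node would cover examples from fewer than 10 different texts.
import math
from collections import Counter

MIN_SOURCE_TEXTS = 10

def entropy(examples):
    counts = Counter(e["label"] for e in examples)
    total = len(examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def best_feature(examples, features):
    # Standard ID3 choice: the feature whose split minimizes weighted entropy.
    def split_entropy(f):
        values = {e["features"].get(f) for e in examples}
        return sum(
            len(sub) / len(examples) * entropy(sub)
            for v in values
            for sub in [[e for e in examples if e["features"].get(f) == v]]
        )
    return min(features, key=split_entropy)

def build_tree(examples, features):
    labels = Counter(e["label"] for e in examples)
    majority = labels.most_common(1)[0][0]
    distinct_sources = len({e["source_text"] for e in examples})
    # Noise suppression: do not split nodes supported by too few source texts.
    if not features or len(labels) == 1 or distinct_sources < MIN_SOURCE_TEXTS:
        return {"leaf": True, "label": majority}
    f = best_feature(examples, features)
    branches = {
        v: build_tree([e for e in examples if e["features"].get(f) == v],
                      [g for g in features if g != f])
        for v in {e["features"].get(f) for e in examples}
    }
    return {"leaf": False, "feature": f, "branches": branches, "default": majority}

# Toy usage: only 6 distinct source texts, so the rule forces an immediate leaf.
toy = [{"features": {"trigger_said": i % 2}, "label": i % 2, "source_text": f"doc{i}"} for i in range(6)]
print(build_tree(toy, ["trigger_said"]))
</Paragraph>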
<Paragraph position="7"> In attributing performance to various components, we measured 25 random texts in EME. At the memory token stage we found that CIRCUS had extracted string-fills and set-fills with a recall/precision of 68/54. However, our scored output for those slots was 32/45 (measured only on the slots we attempted).</Paragraph> <Paragraph position="8"> Even when the thresholds for TTG were lowered to 0.0, so that all output came through, the recall was not anywhere near 68. Therefore it would appear that the difficult part of the template task is not finding good things to put in the template, but figuring out how to split and merge objects. We do not (yet) have a trainable component that handles splitting and merging decisions in general.</Paragraph> <Paragraph position="9"> The EJV and EME systems that we tested in our official evaluation were in many ways incomplete systems. Although our upstream components were operating reasonably well, additional feedback cycles were badly needed for other components operating downstream. In particular, trainable coreference and trainable template generation did not receive the time and attention they deserve. We are generally encouraged by the success of our trainable components for part-of-speech tagging, dictionary generation, noun phrase analysis, semantic feature tagging, and coreference based on appositive recognition. But we encountered substantial difficulties with general coreference prior to template generation. This appears to be the greatest challenge remaining for trainable components supporting information extraction. We know from our earlier work in the domain of terrorism that coreference resolution can be reasonably well-managed on the basis of hand-coded heuristics [Lehnert et al. 1992b]. But this type of solution does not port across domains and therefore represents a significant system development bottleneck. True portability will only be achieved with trainable coreference capabilities.</Paragraph> <Paragraph position="10"> We believe that trainable discourse analysis was the major stumbling block standing between our MUC-5 system and the performance levels attained by systems incorporating hand-coded discourse analysis.</Paragraph> <Paragraph position="11"> We remain optimistic that state-of-the-art performance will be obtained by corpus-driven machine learning techniques, but it is clear that more research is needed to meet this very important challenge. To facilitate research in this area by other sites, UMass will make concept extraction training data (CIRCUS output) for the full EJV and EME corpora available to research laboratories with internet access. When paired with the MUC-5 key templates available from the Linguistic Data Consortium, this data will allow a wide range of researchers who may not be experts in natural language to tackle the challenge of trainable coreference and template generation as problems in machine learning. We believe it is important for the NLP community to encourage and support the involvement of a wider research community in our quest for practical information extraction technologies.</Paragraph> </Section> </Paper>