<?xml version="1.0" standalone="yes"?>
<Paper uid="M93-1019">
  <Title>SRI: Description of the JV-FASTUS System Used for MUC-5</Title>
  <Section position="3" start_page="222" end_page="222" type="metho">
    <SectionTitle>
SYSTEM ARCHITECTURE
</SectionTitle>
    <Paragraph position="0"> The basic architecture of the EJV-FASTUS system is illustrated in Figure 1. The text is input to the cascade of transducers as a stream of ASCII characters. This amounts to a decision to treat all text as unformatted, which is not unreasonable for the English joint venture texts: they contain very little relevant formatted data such as tables, and when such data do occur, their format is idiosyncratic. The first transducer is the TOKENIZER, which produces symbolic and numeric tokens as output. These symbolic tokens are given to the PREPROCESSOR, which recognizes multiword lexical items and some company and personal names, and produces lexical items as output. The PHRASE PARSER then breaks the input stream into Noun Groups (the part of the noun phrase consisting of determiner, prenominal modifiers, and head noun), Verb Groups (auxiliaries and intervening adverbs, with the main verb), and Particles (single lexical items, including conjunctions, prepositions, subordinating conjunctions, and relative pronouns). The PHRASE PARSER also identifies the head of each constituent, which, with some minor exceptions, is the only component of the constituent that influences subsequent processing. The PHRASE COMBINER takes the phrases output by the PHRASE PARSER and combines them into larger phrases of the same type. For example, adjacent noun groups may be merged into appositives, certain prepositional phrases are attached to their noun groups, and conjunctions of both verb groups and noun groups are combined. The combined phrases are input to the</Paragraph>
  </Section>
  <Section position="4" start_page="222" end_page="224" type="metho">
    <SectionTitle>
DOMAIN PATTERN RECOGNIZER, which nondeterministically matches the sentence against patterns that
</SectionTitle>
    <Paragraph position="0"> are relevant to the information to be extracted. The by-product of the match is a set of partially instantiated raw templates that are merged by the MERGER. Finally, a POSTPROCESSOR puts the raw templates into final form for printing.</Paragraph>
    <Paragraph position="1"> The EJV Walkthrough Example
We were dismayed to see that our system did not produce a template in response to the walkthrough text. Closer inspection, however, revealed that the system had in fact produced a reasonable analysis for this text, but the analysis was discarded by the POSTPROCESSOR because it failed a basic consistency check: the joint venture company must be distinct from all of its parent entities. Experience has shown that a violation of this condition usually arises from a failure in the merging process, and that the score is usually improved by discarding what is likely to be a spurious template. Unfortunately, in this case the strategy resulted in discarding basically correct information. We turned off the filter and reran the example, producing the output listed in Appendix I.</Paragraph>
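    <Paragraph>The consistency check described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the TieUp class, its field names, and the enabled switch are hypothetical stand-ins for the raw templates and the filter that was turned off for the walkthrough.
<![CDATA[
```python
from dataclasses import dataclass, field


@dataclass
class TieUp:
    """Hypothetical raw tie-up template: parent entities plus a joint venture company."""
    parents: list = field(default_factory=list)
    joint_venture: str = ""  # empty string: no joint venture company was found


def passes_consistency_check(tie_up):
    """True if the joint venture company is distinct from every parent entity."""
    if not tie_up.joint_venture:
        return True  # nothing to compare against
    return tie_up.joint_venture not in tie_up.parents


def filter_templates(templates, enabled=True):
    """Discard templates that fail the check; with enabled=False keep everything."""
    if not enabled:
        return list(templates)
    return [t for t in templates if passes_consistency_check(t)]
```
]]>
With the walkthrough analysis, in which BRIDGESTONE ends up as both a parent and the joint venture company, the filter drops the template; disabling the filter keeps it.</Paragraph>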
    <Paragraph position="2"> The TOKENIZER
The TOKENIZER is a simple transducer that accepts ASCII characters as input and produces a stream of tokens as output. The tokenizer performs the following functions:  In case of ambiguity, the ambiguity is resolved in favor of the longest token that can be formed starting at the current position in the input stream.</Paragraph>
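    <Paragraph>The longest-token preference can be illustrated with a small sketch. The token classes and regular expressions below are assumptions made for the example; the tokenizer's real inventory of functions is not given in this text.
<![CDATA[
```python
import re

# Hypothetical token classes, all tried at every position (not the paper's inventory).
TOKEN_PATTERNS = [
    ("ABBREV", re.compile(r"(?:[A-Za-z]\.){2,}")),
    ("NUMBER", re.compile(r"\d+(?:[.,]\d+)*")),
    ("WORD", re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")),
    ("PUNCT", re.compile(r"\S")),  # fallback: any single non-space character
]


def tokenize(text):
    """Return (kind, string) pairs, always taking the longest token at each position."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_kind, best_end = None, pos
        for kind, pattern in TOKEN_PATTERNS:
            match = pattern.match(text, pos)
            # Strictly longer wins, so earlier patterns win ties.
            if match and match.end() > best_end:
                best_kind, best_end = kind, match.end()
        tokens.append((best_kind, text[pos:best_end]))
        pos = best_end
    return tokens
```
]]>
At the start of a string such as "U.S. sales", both the word pattern (matching "U") and the abbreviation pattern (matching "U.S.") apply, and the longer abbreviation is chosen.</Paragraph>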
    <Paragraph position="3"> The walkthrough text does not present any unusual difficulties for the TOKENIZER.
The PREPROCESSOR
The PREPROCESSOR accepts the tokens produced by the TOKENIZER as input and produces lexical items as output. A lexical item is defined as a token or sequence of tokens that has an entry in the system's lexicon. During this phase, multiwords are recognized. Proper names of individuals, locations, and corporations are considered lexical items, and the PREPROCESSOR makes the first attempt to recognize them.</Paragraph>
    <Paragraph position="4"> Case is very important for distinguishing proper nouns from common nouns in English. In texts with both upper- and lowercase characters, capitalization provides very useful information about which words can or cannot be parts of names; this information is not available in uppercase-only texts. Therefore, the PREPROCESSOR uses separate transducers for recognizing personal and corporate names in mixed-case and in uppercase-only texts.</Paragraph>
    <Paragraph position="5"> There are three basic transducers for corporate names. One transducer, which operates on both uppercase-only and mixed-case texts, recognizes company names that do not appear with a standard suffix like &quot;Inc,&quot; or &quot;GmbH.&quot; A recognizer for mixed-case-text corporate names basically accepts all capitalized words preceding a suffix like &quot;Inc,&quot; with some heuristics to avoid including capitalized words at the beginning of a sentence that are not part of the name. Uppercase-only texts present more of a problem, because the simple expedient of accepting any noun group preceding the corporate suffix leads to overgeneration of company names, particularly in cases of lexical ambiguity of the words involved. For example, a sentence like &quot;ALBION IRON</Paragraph>
  </Section>
  <Section position="5" start_page="224" end_page="224" type="metho">
    <SectionTitle>
&amp; METAL SAW AN INCREASE IN PROFITS THIS YEAR&quot; would probably result in &quot;ALBION IRON &amp;
METAL SAW&quot; as the name of the company, because &quot;saw&quot; can be a noun as well as a verb. To prevent this
</SectionTitle>
    <Paragraph position="0"> kind of overgeneration of company names, we restrict the words that can combine to form company names to members of a list of product words that are likely to occur in names. &quot;Iron&quot; and &quot;metal&quot; occur on this list, while &quot;saw&quot; does not.</Paragraph>
    <Paragraph position="1"> This heuristic for recognizing company names in uppercase-only texts caused the most serious problem we encountered in the walkthrough example. The first sentence of this example is</Paragraph>
  </Section>
  <Section position="6" start_page="224" end_page="228" type="metho">
    <SectionTitle>
BRIDGESTONE SPORTS CO. SAID FRIDAY IT HAS SET UP A JOINT VENTURE IN
TAIWAN WITH A LOCAL CONCERN AND A JAPANESE TRADING HOUSE TO PRODUCE GOLF
CLUBS TO BE SHIPPED TO JAPAN.
</SectionTitle>
    <Paragraph position="0"> It turns out that &quot;BRIDGESTONE&quot; is known in the lexicon to be the name of a company; however, &quot;SPORTS&quot; was not on the list of product words. Therefore, the system recognized &quot;BRIDGESTONE&quot; as a company name and as the subject of the sentence, and ignored &quot;SPORTS CO.&quot; as an appositive.</Paragraph>
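    <Paragraph>The product-word restriction, and the way it truncates the Bridgestone name, can be sketched as below. This is one reading of the heuristic under stated assumptions: the word lists, the rule that a name must start with a word known to the lexicon, and the function name are all hypothetical.
<![CDATA[
```python
KNOWN_COMPANY_WORDS = {"ALBION", "BRIDGESTONE"}  # assumed lexicon entries
PRODUCT_WORDS = {"IRON", "METAL", "STEEL"}  # assumed list; "SAW" and "SPORTS" are absent
CONNECTIVES = {"&"}
CORPORATE_SUFFIXES = {"CO.", "INC.", "CORP.", "GMBH"}


def company_name_at(tokens, start, product_words=PRODUCT_WORDS):
    """Return the company name beginning at tokens[start], or None if there is none."""
    if tokens[start] not in KNOWN_COMPANY_WORDS:
        return None
    end = start + 1
    # Extend the name only over product words (and connectives between them).
    while end < len(tokens) and (tokens[end] in product_words or tokens[end] in CONNECTIVES):
        end += 1
    # A name may not end in a dangling connective.
    while tokens[end - 1] in CONNECTIVES:
        end -= 1
    # Absorb a corporate suffix if it directly follows the name.
    if end < len(tokens) and tokens[end] in CORPORATE_SUFFIXES:
        end += 1
    return " ".join(tokens[start:end])
```
]]>
Under these assumptions the ambiguous word SAW is kept out of the Albion name, but the same restriction stops the Bridgestone name after its first word because SPORTS is not a product word; adding SPORTS to the list recovers the full name.</Paragraph>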
    <Paragraph position="1"> When a company name is recognized, it is entered into the lexicon for the duration of the text, together with any possible aliases that can be predetermined. The lexicon is restored to its initial state at the end of a text, so any mistakes or perverse company names will have no effect on subsequent processing. For example, if an article mentions &quot;Next, Inc.,&quot; it is important to recognize &quot;Next&quot; as a company name for the duration of the text, but keeping that entry could obviously cause havoc with other texts.</Paragraph>
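    <Paragraph>One way to picture the per-text lexicon is a snapshot that is restored when processing of a text ends. The sketch below assumes a lexicon reduced to a set of company names; the Lexicon class and its method names are hypothetical.
<![CDATA[
```python
from contextlib import contextmanager


class Lexicon:
    """Hypothetical lexicon holding only the set of known company names."""

    def __init__(self, company_names):
        self.company_names = set(company_names)

    def add_company(self, name, aliases=()):
        """Enter a newly recognized company name together with its aliases."""
        self.company_names.add(name)
        self.company_names.update(aliases)

    @contextmanager
    def for_one_text(self):
        """Allow additions while one text is processed, then restore the initial state."""
        snapshot = set(self.company_names)
        try:
            yield self
        finally:
            self.company_names = snapshot
```
]]>
Inside the with-block for one article, "Next" can be treated as a company name; once the block exits, the entry is gone and later texts are unaffected.</Paragraph>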
    <Paragraph position="2"> In summary, the preprocessor performs the following functions:  * Groups of words comprising multiword lexical items are collected together.</Paragraph>
    <Paragraph position="3"> * Company names that are in the system's lexicon, or composed from lexical entries by systematic rules, are identified.</Paragraph>
    <Paragraph position="4"> * Names of people are identified and grouped by title.</Paragraph>
    <Paragraph position="5"> * Groups of words that might, or might not, be companies are flagged as possible company names.
In case of ambiguity, the longest phrase beginning at the current point in the input string is selected.
The PHRASE PARSER
The next phase accepts the lexical items combined by the preprocessor as input and produces a sequence of phrases as output. The head of each phrase is identified, and if the head of the phrase corresponds to an object in the domain for which a template object is defined, then an object of the appropriate type is associated with the phrase. For example, if the noun group is &quot;the Japanese company,&quot; this noun group is associated with an ENTITY object whose NATIONALITY slot is Japan.</Paragraph>
    <Paragraph position="6">  The phrase parser constructs phrases that can be reliably described as a regular language. Attachment ambiguities are preserved for later phases, where they will either be ignored as irrelevant or combined on the basis of domain-specific patterns when the combination can be done reliably.</Paragraph>
    <Paragraph position="7"> The basic grammar of English used in this phase is a superset of that used in the MUC-4 FASTUS system. The main differences involve more detailed processing of numbers consisting of mixed numeric and symbolic parts (e.g., 3 million), currency phrases (e.g., DM 2500), and the recognition of bank names. Possible companies are treated as proper nouns and can be combined to form noun groups in the same way as other proper nouns referring to locations, companies, or people.</Paragraph>
    <Paragraph position="8"> Lexical ambiguity can lead to multiple analyses at the end of the parsing phase. In general, longer phrases are preferred to shorter ones. In mixed-case texts, nominals with proper noun heads are preferred to other analyses if they are capitalized. In uppercase-only texts, company names are preferred to other analyses because of the central role that companies play in the joint venture domain. However, in uppercase-only texts, common nouns and verbs are preferred to location names when ambiguity arises, because of the relatively large number of locations in the gazetteer that overlap with ordinary English words. The PHRASE PARSER analyzes the first sentence of the walkthrough example as follows:</Paragraph>
    <Paragraph position="10"> At this point the system has entity objects representing a company named &quot;BRIDGESTONE,&quot; a &quot;JOINT VENTURE,&quot; and a &quot;LOCAL CONCERN.&quot; The local concern has a location of &quot;TAIWAN&quot; because that was the most recently mentioned location. The system did not realize that a noun group with the head &quot;HOUSE&quot; could refer to a company, so no entity is created for &quot;JAPANESE TRADING HOUSE.&quot;
The PHRASE COMBINER
The PHRASE COMBINER attempts to simplify the job of the final domain pattern recognizer by combining phrases from the initial parse into larger phrases whenever this is feasible. This combination takes place in a hierarchy of stages, so that various combination operations can be prioritized. For example, the attachment of certain prepositional phrases is performed before conjunction combination, so conjunction can apply to noun groups with prepositional phrases attached.</Paragraph>
    <Paragraph position="11">  Each level of the phrase combination phase has two subphases: a defeat subphase and a pattern matching subphase. If a pattern in the defeat subphase matches the input, then that string is prevented from matching any pattern in the matching subphase. For example, &quot;for&quot; and &quot;of&quot; prepositions almost always attach to their closest noun group. These attachments are routinely made except in two cases: (1) a verb explicitly subcategorizes for a &quot;for&quot; or &quot;of&quot; complement, or (2) the subject and object of the preposition form an important domain pattern that is recognized during the next phase (e.g., &quot;the production of golf clubs&quot;). In these cases, defeat patterns are written to match the input and prevent the PP-attachment rule from operating.</Paragraph>
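    <Paragraph>The interaction of the defeat subphase with the attachment rule can be sketched as follows. The Phrase tuple, the two small tables, and the verbs listed in them are illustrative assumptions; the system itself uses finite-state patterns rather than Python functions.
<![CDATA[
```python
from typing import NamedTuple


class Phrase(NamedTuple):
    kind: str  # "NG" (noun group), "VG" (verb group), or "P" (particle)
    head: str
    text: str


# Assumed entries: verbs that subcategorize for an "of"/"for" complement,
# and noun-head/preposition pairs reserved for a later domain pattern.
VERBS_TAKING = {("accuse", "of"), ("pay", "for")}
DOMAIN_PAIRS = {("production", "of")}


def defeated(phrases, i):
    """Defeat subphase: True if attachment to the noun group at index i is blocked."""
    noun, prep = phrases[i], phrases[i + 1]
    if i > 0 and phrases[i - 1].kind == "VG" and (phrases[i - 1].head, prep.head) in VERBS_TAKING:
        return True
    return (noun.head, prep.head) in DOMAIN_PAIRS


def attach_of_for(phrases):
    """Matching subphase: merge NG + of/for + NG into one noun group unless defeated."""
    out = []
    i = 0
    while i < len(phrases):
        window = phrases[i:i + 3]
        if ([p.kind for p in window] == ["NG", "P", "NG"]
                and window[1].head in ("of", "for")
                and not defeated(phrases, i)):
            out.append(Phrase("NG", window[0].head, " ".join(p.text for p in window)))
            i += 3
        else:
            out.append(phrases[i])
            i += 1
    return out
```
]]>
In this sketch "the board of directors" is merged into a single noun group, while "the production of golf clubs" is left in pieces for the domain pattern recognizer.</Paragraph>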
    <Paragraph position="12"> The PHRASE COMBINER performs the following tasks:
* Adjacent location noun groups are merged when the result of the merger is consistent with the information in the gazetteer (e.g., Palo Alto, California).</Paragraph>
    <Paragraph position="13"> * Noun groups with company words as heads are combined with appositives, genitives, and prepositions to provide further information about the entity (e.g., Foobarco, the California company, Japan's Kobe Steel, or Aerospatiale of France). If any of the company names in the input are only possible companies (like Foobarco), matching one of these patterns will cause the possible company to be recognized as a company for the duration of the text.</Paragraph>
    <Paragraph position="14"> * Appositives and prepositions that associate people with titles and companies are combined, and their semantics processed (e.g., John Smith, president and CEO of Foobarco). Any possible company that matches the pattern is promoted to an actual company.</Paragraph>
    <Paragraph position="15"> * Conjunctions of company names are combined (e.g., IBM, General Motors, and Foobarco). Again, possible companies can be promoted if they match the pattern.</Paragraph>
    <Paragraph position="16"> * Certain patterns that can reliably be used to promote possible companies to actual companies are recognized, even though they do not directly contribute any information to template slots (e.g., the board of directors of Foobarco).</Paragraph>
    <Paragraph position="17"> * Conjoined verb groups are recognized, as are certain phrases that can be treated by subsequent analysis as complex verb groups (e.g., manufacture and market, planning to set up, announced a plan to form).
* Finally, &quot;of&quot; and &quot;for&quot; prepositions are attached to their adjacent noun groups, and conjoined noun groups are combined, unless this is overridden by defeat patterns.</Paragraph>
    <Paragraph position="18"> As an example of the operation of the PHRASE COMBINER, consider the system's processing of the second sentence of the walkthrough text:</Paragraph>
    <Paragraph position="20"> above with the word &quot;SPORTS,&quot; the system did not correctly recognize the joint venture company name, and the combiner formed the appositive &quot;JOINT VENTURE BRIDGESTONE&quot; and assigned BRIDGESTONE the role of the joint venture company. Of course, BRIDGESTONE was already identified as one of the parent entities, so this was the source of the mistake that led to discarding the entire analysis of this text.</Paragraph>
    <Paragraph position="21"> In addition, because the previous sentence said that the joint venture was &quot;in Taiwan,&quot; we identified Taiwan as the location of Bridgestone.</Paragraph>
    <Paragraph position="22"> The DOMAIN PATTERN RECOGNIZER
The DOMAIN PATTERN RECOGNIZER does the most critical work of the system by recognizing phrases that establish the most important relationships to be extracted. The DOMAIN PATTERN RECOGNIZER takes the output of the PHRASE COMBINER as input, and produces raw templates as output. The PATTERN RECOGNIZER of the MUC-4 system had only one subphase, but it was recognized that, because of the limited development time available, we could not possibly account for all the possible ways joint venture relationships could be expressed. Therefore, it was decided to implement the JV-FASTUS PATTERN RECOGNIZER as a multiphase process. The outputs of the earlier phases would be kept by the system only as long as they were consistent with outputs found in the later phases. Thus, the earlier phases of the PATTERN RECOGNIZER could be used to implement extremely general, loose patterns that serve as defaults, to be defeated by the output of more precise, specific patterns at higher levels. Inspection of the corpus revealed that there are three basic, general patterns that indicate joint venture relationships with surprisingly high reliability. They are  (ignoring all other words) and a single company name follows the words &quot;joint venture.&quot; The parent entities are the first set of companies, and the joint venture entity is the singular one. Typical instances of this pattern are &quot;The Toyota - General Motors joint venture, NUMMI ...&quot; and &quot;IBM and Intel formed a joint venture called Foobarco.&quot; The second pattern is the &quot;passive&quot; variant of the first (although verb groups and their properties are completely ignored), which matches sentences like &quot;Foobarco is a joint venture formed by IBM and Intel.&quot; Finally, the third pattern matches sentences that do not meet the number constraints of the above patterns.
In that case, all of the entities are parents. An example is &quot;IBM formed a joint venture with Intel to produce mainframes in Timbuktu.&quot; It is, of course, easy to think of counterexamples to the above patterns. The patterns help recall much more than they hurt precision, however, because they are only defaults that can be defeated by more precise information. We were initially skeptical that the inclusion of such vague patterns would actually enhance system performance. However, a test showed that they improved the system's F-metric by approximately 6 points.</Paragraph>
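    <Paragraph>The three default patterns can be read as a decision on how many company names occur before and after the words "joint venture". The sketch below encodes that reading; the exact number constraints (two or more on one side, exactly one on the other) are an assumption, since the statement of the first pattern is incomplete in this text, and the function name is hypothetical.
<![CDATA[
```python
def default_joint_venture_roles(before, after):
    """Assign default roles from the company names found before and after 'joint venture'.

    All other words in the sentence are ignored, as in the patterns described above.
    """
    if len(before) >= 2 and len(after) == 1:
        # Pattern 1: several parents, then one joint venture company.
        return {"parents": list(before), "joint_venture": after[0]}
    if len(before) == 1 and len(after) >= 2:
        # Pattern 2 ("passive" variant): one joint venture company, then its parents.
        return {"parents": list(after), "joint_venture": before[0]}
    # Pattern 3: number constraints not met, so every entity is a parent.
    return {"parents": list(before) + list(after), "joint_venture": None}
```
]]>
Because these assignments are only defaults, a later and more specific pattern that disagrees with them would take precedence.</Paragraph>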
    <Paragraph position="23"> We eventually settled on three levels for the PATTERN RECOGNIZER. The first level consisted of the above patterns; the second level consisted of a very general pattern for recognizing ownership percentages with active verbs (which, like the above patterns, never actually examined the verbs or their properties); and the third level included a similar pattern for passive ownership percentages (which is more constrained because of the frequent use of the preposition &quot;by&quot;) together with more obviously motivated patterns for joint ventures and products.</Paragraph>
    <Paragraph position="24"> In the first sentence of the walkthrough example, the system recognized the pattern &quot;BRIDGESTONE ... SAID IT HAS SET UP A JOINT VENTURE ... WITH A LOCAL CONCERN.&quot; This pattern led to a tie-up relationship with Bridgestone and a company as parent entities. The adverbial &quot;IN TAIWAN&quot; was recognized nondeterministically by a different pattern that caused &quot;TAIWAN&quot; to be recorded as a default location for the joint venture, and to provide a referent for &quot;local&quot; in &quot;LOCAL CONCERN.&quot; As mentioned previously, the system did not realize that &quot;JAPANESE TRADING HOUSE&quot; was a company.</Paragraph>
    <Paragraph position="25"> The MERGER
The MERGER operates at the end of each sentence in two steps: first, all the raw templates found in a single sentence are merged to the extent possible, and then the remaining templates are merged with any templates from previous sentences.</Paragraph>
    <Paragraph position="26"> There are two types of merge operations: full merges on templates of like types, and default merges on templates of different types. Full merges are like unification operations, merging each slot of the two templates recursively, each time determining the best alignment for elements of a slot when the slots can contain multiple fills.</Paragraph>
    <Paragraph position="27"> Default merges involve templates of different types. If it is possible for the template of one type to fill a slot, or merge with the slot contents of one of the slots in the other template, and certain other conditions are satisfied, then the merger is accomplished by filling in the appropriate slot. Default merging allows the combination of information from disparate parts of the text into a single tie-up schema. Default merges are allowed as long as the parts occur reasonably near each other in the text. We have found the best results when allowing default merges over a distance of two sentences.</Paragraph>
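    <Paragraph>The two merge operations can be sketched with templates represented as dictionaries. This is a simplified illustration under stated assumptions: slot names are hypothetical, the alignment of multiple fills within a slot is left out, and the "certain other conditions" on default merges are reduced to the two-sentence distance limit.
<![CDATA[
```python
MAX_DEFAULT_MERGE_DISTANCE = 2  # sentences, per the distance reported above


def full_merge(a, b):
    """Unification-style merge of two same-type templates; returns None on a conflict."""
    merged = dict(a)
    for slot, value in b.items():
        if slot not in merged or merged[slot] is None:
            merged[slot] = value
        elif value is None or merged[slot] == value:
            continue
        elif isinstance(merged[slot], dict) and isinstance(value, dict):
            sub = full_merge(merged[slot], value)  # merge nested objects recursively
            if sub is None:
                return None
            merged[slot] = sub
        else:
            return None  # two different fills for a single-valued slot
    return merged


def default_merge(tie_up, tie_up_sentence, activity, activity_sentence):
    """Attach an activity to a tie-up if the two were found near enough in the text."""
    if abs(activity_sentence - tie_up_sentence) > MAX_DEFAULT_MERGE_DISTANCE:
        return None
    merged = dict(tie_up)
    merged["activities"] = list(tie_up.get("activities", [])) + [activity]
    return merged
```
]]>
A full merge fails as soon as two templates disagree on a slot, whereas a default merge only fills a slot of a different-type template and is limited by distance.</Paragraph>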
    <Paragraph position="28"> In the walkthrough example, as previously mentioned, the entity for Bridgestone was merged with the joint venture company because of the appositive, and because the company name was incorrectly recognized. Then, the pattern recognizer recognizes the sequences &quot;BRIDGESTONE ... CAPITALIZED AT 20</Paragraph>
  </Section>
  <Section position="7" start_page="228" end_page="229" type="metho">
    <SectionTitle>
MILLION NEW TAIWAN DOLLARS&quot;, leading to an instantiation of a tie-up relationship with the joint venture
</SectionTitle>
    <Paragraph position="0"> company BRIDGESTONE, and an OWNERSHIP object giving the capitalization, and &quot;PRODUCTION OF 2000 IRON AND &quot;METAL WOOD&quot; CLUBS&quot; as an activity and industry with appropriate industry-type and product/service slot fills. The tie-up relationship combines in a full merge with the tie-up relationship from the previous sentence, and since nearness constraints are satisfied, the activity object is attached to the tie-up relationship at this time.</Paragraph>
    <Paragraph position="1">  The next sentence partially matches the passive-ownership pattern; however, full recognition of the pattern was blocked by the failure to correctly recognize the company name &quot;UNION PRECISION CASTING CO.&quot; and the erroneous attachment of &quot;AND THE REMAINDER&quot; to the previous noun group as a conjunction. The result was a tie-up relationship with an ownership template attributing 75% ownership to Bridgestone, which merged in a full merge with the previously found tie-up relationship and ownership. This example illustrates the crucial importance of recognizing company names in this domain. If the company names had been correctly recognized here, the system's output would have been nearly perfect. As a direct result of the name recognition failure, compounded errors led to a much less satisfactory result.</Paragraph>
    <Paragraph position="2"> Finally, spurious activity and industry templates are produced from the next sentence, which recognizes</Paragraph>
  </Section>
  <Section position="8" start_page="229" end_page="229" type="metho">
    <SectionTitle>
&quot;PRODUCTION OF GOLF CLUB PARTS&quot; and attaches it to the tie-up relationship in a default merge,
</SectionTitle>
    <Paragraph position="0"> because nearness constraints are satisfied.</Paragraph>
    <Paragraph position="1"> The POSTPROCESSOR
The output of the PATTERN RECOGNIZER is raw templates. These templates match the structure of the officially specified templates rather closely, but they contain enough differences to require normalization of the output before printing so that they will meet the specifications of the task. This task falls to the POSTPROCESSOR. The POSTPROCESSOR is a rather complicated and task-specific piece of code that performs several, mostly uninteresting functions. The following tasks are assigned to the POSTPROCESSOR:
* ENTITY RELATIONSHIP objects are generated for entities involved in joint ventures. (Subordinate ENTITY RELATIONSHIPs are generated as a result of patterns recognized when the text is processed.)
* Ordered pair slots are constructed where required. (The system treats ordered-pair fills as full objects, as they were in the original TIPSTER specifications, because this makes the merging algorithm simpler.)
* String fills are extracted from the original text, rather than printed in the normalized, uppercase form used by JV-FASTUS.</Paragraph>
    <Paragraph position="2"> * Company names are extracted from the original text and normalized to ensure compliance with the specifications.</Paragraph>
    <Paragraph position="3"> * Locations are disambiguated and normalized using information from the gazetteer.
* SIC codes for product-service strings are generated. Associating these codes with strings is really black magic; the keys are very inconsistent, and in some cases clearly wrong. We fill them in for those cases where we feel we can guess the right answer at least 50 percent of the time.</Paragraph>
    <Paragraph position="4"> * Dates are normalized and printed according to specifications.</Paragraph>
  </Section>
  <Section position="9" start_page="229" end_page="230" type="metho">
    <SectionTitle>
JAPANESE JV-FASTUS
</SectionTitle>
    <Paragraph position="0"> The JJV-FASTUS architecture is largely the same as that of EJV-FASTUS, except that a public-domain morphological analyzer, Kyoto University's JUMAN, replaces both the TOKENIZER and the LEXICON of the English system. Although the remaining flow of control, starting from the PREPROCESSOR and ending with the POSTPROCESSOR, is the same, the operations performed at each phase do not always coincide.</Paragraph>
    <Paragraph position="1">  The JJV Walkthrough Example
In the initial run, the system recognized two tie-up relationships, but not exactly correctly. The second one lacked two of the three companies involved. No industries or activities were recognized. After the few minor changes described below, all relevant companies were correctly recognized, and the relevant activities and industries started showing up. The output is listed in Appendix II.</Paragraph>
  </Section>
  <Section position="10" start_page="230" end_page="230" type="metho">
    <SectionTitle>
JUMAN
</SectionTitle>
    <Paragraph position="0"> JUMAN is both the TOKENIZER and LEXICON in JJV-FASTUS. We customized JUMAN's approximately 16,000-word lexicon in two ways: (1) eliminating words that contain numerical information (e.g., touka `the 10th day (of the month)') so that numerical information can be independently computed, and (2) dividing proper names into three categories, syamei `company name', chimei `location name', and jinmei `person name'. This lexicon then became the JJV-FASTUS lexicon.</Paragraph>
    <Paragraph position="1"> An entire text is first input into JUMAN as a character stream. JUMAN outputs a sequence of morphemes with morphological categories. We let the preference heuristics internal to JUMAN choose the single best segmentation. We estimate the accuracy of this process to be about 95%. Higher accuracy may be gained by obtaining &quot;all possible&quot; segmentations from JUMAN and having FASTUS choose among them. One of JUMAN's categories is MITEIGIGO `undefined word' for anything not found in the lexicon. This turned out to be a useful category. Since the basic vocabulary was already covered, these unknown words were mostly names or parts of names. It also made the system robust: there was no need to keep adding words to the lexicon in order to cover unrestricted texts. The &quot;jumanized&quot; text, a sequence of morphemes with their categories, is then passed on to the subsequent phases one sentence at a time.</Paragraph>
    <Paragraph position="2"> The PREPROCESSOR
The PREPROCESSOR's role is relatively small in the Japanese system. Its main job is to assign numerical information to all incarnations of numerical characters, both Chinese and Roman, including interpreting a big fat circle &quot;O&quot; to be &quot;zero.&quot; Most morphemes are simply passed on to the next phase.
The PHRASE PARSER
The PHRASE PARSER takes a sequence of morphemes with JUMAN categories as input, and outputs a sequence of small phrases in three major categories -- Noun Group, Verb Group, and Particle, each of which is further subcategorized into useful categories such as COMPANY, PERSON, and COUNTRY. The head of each phrase is recognized, and domain objects such as ENTITY, LOCATION, and FACILITY are created in association with the referring phrases. The following is the PHRASE PARSER analysis of the first half of the second sentence in the walkthrough example. Each line lists the category, phrase string, span of word count, and the phrase head.</Paragraph>
    <Paragraph position="4"> The PHRASE PARSER attempts the initial recognition of company names. Since there is no upper-/lowercase distinction in Japanese, name recognition is often a guess, sensitive to the local linguistic context.</Paragraph>
    <Paragraph position="5"> This is perhaps analogous to the uppercase-only texts in English. The plausible company names recognized at this phase are basically ANYTHING followed by sya `company', guruupu `group', or one or more</Paragraph>
  </Section>
  <Section position="11" start_page="230" end_page="232" type="metho">
    <SectionTitle>
INDUSTRY TYPE NOUNs.
</SectionTitle>
    <Paragraph position="0"> Each proposed company name is checked to see if it could be (1) an alias of a previously recognized company name, (2) the full name of an existing shorter name, or (3) a brand new name. After this decision, ENTITY objects for the given text are updated accordingly. In the above example, three companies are recognized, the first of which is a sequence of four common nouns, the last three of which are INDUSTRY TYPE NOUNs. The other two are company names in the lexicon.</Paragraph>
    <Paragraph position="1"> A plausible alias is determined as follows. Given two names X and Y, X is an alias of Y if (1) X is two or more characters long and is a proper initial substring of Y; (2) X ends with a company ending, and without the company ending is a proper initial substring of Y (this substring can be only one character long); (3) X is three or more characters long and is a noninitial substring of Y that contains a dot (typically for a foreign name); or (4) X ends with a company ending, and without the company ending matches criterion (3). (The Japanese example given for each criterion is illegible in the source.)</Paragraph>
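    <Paragraph>The four alias criteria can be sketched as a single predicate. Several details are assumptions: the company ending is taken to be the character for sya, the dot is taken to be the katakana middle dot, criterion (3) is read as requiring the dot in Y, and the names in the usage note are illustrative rather than taken from the walkthrough text.
<![CDATA[
```python
COMPANY_ENDINGS = ("社",)  # assumption: sya 'company' as the company ending
DOT = "・"  # assumption: middle dot used in transliterated foreign names


def _proper_initial(x, y, min_len):
    """x is at least min_len characters long and a proper initial substring of y."""
    return min_len <= len(x) < len(y) and y.startswith(x)


def _noninitial_dotted(x, y):
    """x has three or more characters and occurs in y (which has a dot) past position 0."""
    return len(x) >= 3 and DOT in y and x in y[1:]


def _strip_ending(x):
    """Return x without its company ending, or None if it has no such ending."""
    for ending in COMPANY_ENDINGS:
        if x.endswith(ending) and len(x) > len(ending):
            return x[: -len(ending)]
    return None


def is_plausible_alias(x, y):
    """Apply criteria (1) to (4) in order."""
    if _proper_initial(x, y, 2):  # (1)
        return True
    stem = _strip_ending(x)
    if stem is not None and _proper_initial(stem, y, 1):  # (2)
        return True
    if _noninitial_dotted(x, y):  # (3)
        return True
    return stem is not None and _noninitial_dotted(stem, y)  # (4)
```
]]>
For example, under these assumptions 日本 would count as an alias of 日本電気 by criterion (1), and モーターズ社 as an alias of ゼネラル・モーターズ by criterion (4).</Paragraph>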
    <Paragraph position="2"> In the walkthrough example, we see two spurious aliases for one ENTITY (the Japanese strings are illegible in the source). These appear to be the names of possible companies recognized while parsing this long name, and they were picked up by the longest name as possible aliases.</Paragraph>
    <Paragraph position="3"> The PHRASE COMBINER
The PHRASE COMBINER forms longer noun phrases. One of the most complex graphs is dedicated to the parenthetical information about the headquarters location, personnel, and so forth that typically comes immediately after the first mention of a company (the Japanese example is illegible in the source). For instance, the first five phrases in the above output are combined into one longer phrase with category COMPANY.</Paragraph>
    <Paragraph position="4"> Whenever a pattern calls for a phrase of category COMPANY in this and subsequent phases, one of category UNKNOWN, NAME, or NOUN could also make the transition. When the latter reaches a final state, it is `promoted' to a COMPANY. This is analogous to the treatment of `possible' versus `actual' companies in EJV-FASTUS.</Paragraph>
    <Paragraph position="5">  The DOMAIN PATTERN RECOGNIZER
Once COMPANY phrases are clearly recognized, the key patterns reporting tie-up relationships can be captured at a higher level, such as: (1) COMPANY1 wa/ga COMPANY2 to TIE-UP-VERB ..., (2) COMPANY1 to TIE-UP-VERB COMPANY2 wa/ga ... Phrases of UNKNOWN, NAME, and NOUN subcategories of Noun Group are also considered as potential COMPANYs in these patterns. The current system actually overgenerates company entities.</Paragraph>
    <Paragraph position="6"> The ambiguities in tie-up relationships in the second sentence of the walkthrough example are not recognized in JJV-FASTUS. It simply places all three companies as partners in one tie-up. Patterns for activities and industries were harder to define for a number of reasons, including:
* There is a wide variety of sentence patterns, often lacking explicit subject noun phrases.
* Indeterminacy -- since verbs come at the end of the sentence, relevance cannot be known early in the sentence.</Paragraph>
    <Paragraph position="7"> * Relations to particular companies and tie-up relationships are often unclear in the given sentence or in the previous sentence.</Paragraph>
    <Paragraph position="8"> In the initial run of the walkthrough example, no industries were recognized, because the key verbs, uridashita `started selling' and hatsubai suru `start selling', were not part of the known INDUSTRY VERBs. Simply by adding them to the list, two SALES industries (not FINANCE as in the keys) and associated activities were recognized.</Paragraph>
    <Paragraph position="9"> The MERGER and POSTPROCESSOR operations in JJV-FASTUS were basically the same as in the English system. Throughout the cascade of transducers, JJV-FASTUS used the same preference heuristics as EJV-FASTUS, namely, the preference for the longest string.</Paragraph>
    <Paragraph position="10"> In summary, despite the extreme dissimilarity starting from character sets all the way up to where to put the negation in the sentence, the English and Japanese JV-FASTUS systems share the same basic architecture, design philosophy, preference heuristics, three major phrase categories (Noun Group, Verb Group, and Particle), and merging strategies. They were developed in an equally short time, and achieved an equally competitive performance level.</Paragraph>
  </Section>
</Paper>