<?xml version="1.0" standalone="yes"?> <Paper uid="M93-1013"> <Title>SAID FRIDAY IT WILL SET UP A JOINT VENTURE IN TAIWAN WITH A LOCAL CONCERN AND A JAPANESE TRADING HOUSE TO PRODUCE GOLF CLUBS</Title> <Section position="3" start_page="0" end_page="137" type="metho"> <SectionTitle> ARCHITECTURE </SectionTitle> <Paragraph position="0"> The system's underlying architecture, shown in Figure 1 overleaf, follows a by-now familiar task breakdown.</Paragraph> <Paragraph position="1"> Processing occurs in two main phases: preprocessing (in the elementizer module), which covers processes ranging from document destructuring to simple parsing, and natural language analysis per se, including application-specific output generation. As with several other MUC systems, our use of an explicit pre-processing phase is directly borrowed from work by Jacobs et al. (1991). One way Alembic differs from other MUC systems, however, is in exploiting SGML (Goldfarb 1990) as the interchange lingua franca between processing phases. The intention is to allow system modules whose invocation occurs early in the analysis of a document to record processing results directly in the document through SGML markup. This information then becomes available to subsequent modules as meta-data. Further, the elementizer itself is composed of a sequence of processing phases that also exchange partial results through SGML.</Paragraph> <Paragraph position="2"> As a result of this SGML-based architecture, the system's overall flow of control is governed from an object-oriented document manager built on top of Goldfarb's public domain SGML parser, ARC-SGML. For MUC-5, the pre-processing elementizer thus takes a source message file and normalizes it in various ways, yielding an intermediate SGML document. The document manager then builds an internal document object by parsing the resulting SGML. The actual content analysis of the document is performed by invoking the natural language analysis modules on the internal document object, and the results of these analyses are stored as attributes of the document. Another way in which Alembic differs from a prototypical MUC-class system is in its emphasis on text enrichment. That is, the output of the system is actually a new version of the source document enriched by semantic markup. The markup consists of a template-based index and simple semantic annotations identifying people, organizations, places, etc. For the MUC task, Alembic also provides selective output that consists solely of filled templates.</Paragraph> <Paragraph position="3"> 1 alembic 1: an alchemical apparatus used for distillation 2: something that refines or transmutes as if by distillation.</Paragraph> </Section> <Section position="4" start_page="137" end_page="141" type="metho"> <SectionTitle> INDIVIDUAL PROCESSING MODULES </SectionTitle> <Paragraph position="0"> As shown in Figure 1, Alembic comprises six main processing modules. These modules perform all but one of the functions described by Hobbs (1993) for his generic information extraction system--we omit the text filtering phase. The correspondence from Hobbs' inventory of processing modules to Alembic is summarized in Table 1.</Paragraph> <Paragraph position="1"> Elementizer
The elementizer (so called because it inserts SGML elements into the text) is built as a cascade of finite-state LEX-style tokenizers. The elementizer performs Hobbs' zoning, pre-processing, and pre-parsing functions.
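By way of illustration, the cascade idea can be sketched as a series of small finite-state rewriters that record their results in-line as SGML, which later passes then consume. This is a hypothetical Python sketch, not Alembic's LEX-generated code; the pass names and regular expressions are invented for the example:

import re

def tag_numerals(text):
    # Normalize numerals such as "50,000" into <NUM V="50000"> elements.
    return re.sub(r"\d{1,3}(?:,\d{3})+|\d+",
                  lambda m: '<NUM V="%s">%s</NUM>'
                            % (m.group(0).replace(",", ""), m.group(0)),
                  text)

def tag_sentence_boundaries(text):
    # Crude zoning: open an <S> element after sentence-final punctuation.
    return re.sub(r"([.!?])\s+(?=[A-Z<])", r"\1 <S>", text)

PASSES = [tag_numerals, tag_sentence_boundaries]

def elementize(text):
    for tokenizer in PASSES:  # partial results flow between passes as SGML
        text = tokenizer(text)
    return text

print(elementize("OUTPUT WILL BE RAISED TO 50,000 UNITS. OFFICIALS SAID."))
# OUTPUT WILL BE RAISED TO <NUM V="50000">50,000</NUM> UNITS. <S>OFFICIALS SAID.

Each pass sees, and may exploit, the markup inserted by its predecessors, which is the same discipline the full system applies between whole modules.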
Our zoning is more sophisticated than that in most like systems because we interpret the markup of MUC-5 messages as bona fide SGML. Of Hobbs' several pre-processing tasks, the elementizer performs two: detecting sentence and paragraph boundaries, and tokenizing certain parts of the lexicon. The parts in question include some gargantuan sub-lexica that we do not want to have to load into our main processor, e.g., lists of personal names and the inventory of geography terms from the TIPSTER gazetteer. Also tokenized are contractions and other punctuation-laden lexemes, and assorted closed classes such as corporate designators and currency names. Finally, the elementizer performs some rudimentary parsing of dates and other constructs whose idiosyncrasies are much easier to handle in a finite-state framework than in the strictly binary-branching framework of our main parser.</Paragraph> <Paragraph position="2"> The following elementized sample reproduces parts of the second paragraph of the walkthrough message. Note, for example, normalization tags such as <NUM> (for numerals) and tokenization tags, including <PT> (for punctuation), <GEO> (gazetteer entries), and <LEX> (miscellaneous closed classes).</Paragraph> <Paragraph position="3"> <P> THE JOINT VENTURE <PT>,</PT> BRIDGESTONE SPORTS <GEO co>TAIWAN</GEO> <LEX>CO.</LEX> <PT>,</PT> ... A MONTH<PT>.</PT> <S>THE MONTHLY OUTPUT <NAME>WILL</NAME> BE LATER RAISED TO <NUM V=&quot;50000&quot;>50,000</NUM> UNITS<PT>,</PT> BRIDGESTON SPORTS OFFICIALS SAID<PT>.</PT> The elementizer expects its output to be normalized with respect to an SGML document type definition at some later point, in our case as the elementized source file is read into the document manager. This allows us to exploit a number of SGML shortcuts, such as implicit tags, thus simplifying the elementization task significantly. In the preceding sample, for example, note how unambiguous </P>, <S> and </S> tags are not introduced. They are reconstructed from context by the document manager when the elementized message is read into Alembic. More esoterically, perhaps, we also exploit short forms of attribute names so as to minimize the size of elementized messages. The <GEO co> tag, for example, is actually short for (roughly) <GEO continent=NIL country=T province=NIL city=NIL airport=NIL port=NIL island=NIL island_group=NIL ></Paragraph>
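To make the short-form expansion concrete, here is a hypothetical sketch of this canonicalization step. The abbreviation table is invented for the example; only the attribute names come from the text above:

GEO_DEFAULTS = dict.fromkeys(
    ["continent", "country", "province", "city",
     "airport", "port", "island", "island_group"])  # all default to NIL/None

ABBREV = {"co": "country", "pr": "province", "ci": "city"}  # illustrative

def expand_short_form(short_attrs):
    # "<GEO co>" stands for country=T with every other attribute at NIL.
    attrs = dict(GEO_DEFAULTS)
    for a in short_attrs:
        attrs[ABBREV.get(a, a)] = True
    return attrs

print(expand_short_form(["co"]))
# {'continent': None, 'country': True, 'province': None, ..., 'island_group': None}

In the real system, this normalization is performed by the SGML machinery itself, as described next.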
<Section position="1" start_page="138" end_page="138" type="sub_section"> <SectionTitle> Document Manager </SectionTitle> <Paragraph position="0"> The document manager provides an object-oriented framework for working with SGML documents. The manager is entirely CLOS-based, and SGML elements are thus made available as instances of CLOS objects. A sentence element (corresponding to the string bracketed by matching <S> and </S> tags) is mapped into an instance of the S class in CLOS, and any S-specific code (e.g., the linguistic parser) is thus made applicable to the element.</Paragraph> <Paragraph position="1"> As mentioned, the document manager is built around a public domain SGML parser/tokenizer written by Goldfarb, the originator of SGML. The parser takes care of canonicalizing a document as it is read in by, e.g., inserting any tags left implicit by the elementizer, or by filling in the default values of attributes.</Paragraph> <Paragraph position="2"> The parser consists of C language routines made available through the Consortium for Lexical Research. On the Lisp side, the document manager makes available a simple text-processing protocol that is easily specialized for contentful elements, such as <S> and so forth. We exploit this protocol to structure the entire flow of control of the main body of Alembic, from invoking the scanner all the way through producing template files.</Paragraph> </Section> <Section position="2" start_page="138" end_page="139" type="sub_section"> <SectionTitle> Scanner </SectionTitle> <Paragraph position="0"> The scanner performs the remainder of Hobbs' pre-processing phase. That is, it builds a chart over which the parser will operate, and populates the chart with categories corresponding to either (1) ordinary lexical items in the input stream, or (2) SGML elements identified by the elementizer (geographic terms, dates, and so forth). The latter are handled by specializing the document manager's text-processing protocol. <GEO> elements, for example, instantiate a prototypical geographical NP category into the chart (our categories are structured attribute-value vectors with variables, i.e., standard unification-style DAGs (Shieber 1986; Pareschi and Steedman 1987)): [[syn :N] [bar-level 2] [scm [[head geo-region]]]] Non-SGML items are handled through a three-level lexicon. Our primary lexicon contains lexical items with which fine-grained information is associated. This may include complex syntactic relationships, as with sentential complement verbs, or meaningful semantics, as with domain-specific lexicon entries. The primary lexicon was populated in part by adapting knowledge bases provided to us by Richard Tong of ADS. If a lexical item is not found in the primary lexicon, we obtain its part of speech from a secondary lexicon based on the OAL (Oxford Advanced Learner) word list. This occurs most often for open-class tokens out of the immediate realm of the domain. Finally, if an item does not even appear in the OAL, we assume it to be an instance of a special unknown-noun category, UNK-N. This category is allowed to appear anywhere that is permitted for common nouns. In addition, UNK-Ns can participate in many syntactic positions that would otherwise be reserved to proper nouns, e.g., as part of personal names and as pre-nominal modifiers to the left of the adjective position.</Paragraph> </Section>
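This three-level lookup can be pictured roughly as follows; a hypothetical Python sketch with invented entries (Alembic's lexica are Lisp-resident and far larger):

PRIMARY = {  # fine-grained syntax and semantics, including domain entries
    "said": {"pos": "V", "category": "V/S/N", "sem": "declare"},
}
OAL = {  # Oxford Advanced Learner word list: part of speech only
    "distillation": "N",
    "alchemical": "ADJ",
}
UNK_N = {"pos": "N", "category": "UNK-N"}  # unknown-noun fallback

def lookup(token):
    if token in PRIMARY:
        return PRIMARY[token]
    if token in OAL:
        return {"pos": OAL[token]}
    # UNK-N is licensed wherever common nouns are, plus many
    # proper-noun positions (personal names, pre-nominal modifiers).
    return dict(UNK_N)

print(lookup("said"), lookup("distillation"), lookup("Kaohsiung"))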
<Section position="3" start_page="139" end_page="140" type="sub_section"> <SectionTitle> Linguistic Parser </SectionTitle> <Paragraph position="0"> The original parser in Alembic was strictly based on combinatory categorial grammars (Steedman 1985, 1987).</Paragraph> <Paragraph position="1"> We originally chose CCGs because their processing is inherently bottom-up and because they are strongly lexically governed. One lesson learned in our MUC-4 experience, however, is that strongly categorial frameworks are inadequate for processing free text, chiefly because they provide only inconvenient treatments of modifier attachment.</Paragraph> <Paragraph position="2"> As a result, one change we brought to our parser in the last year has been to recast it as a hybrid of the categorial and phrase structure frameworks. Specifically, we rely on structured CCG-style categories to handle argument structure, and on binary-branching phrase structure rules to handle modifier attachment and other general phrase structure phenomena. A transitive verb, for example, is treated as the V/N category, i.e., as a function that takes a noun phrase argument on the right (the N term) and produces a verb phrase (the V term). The noun phrase is attached to the verb through the combinatorial rule of function application, i.e., α/β β ⇒ α. Strictly speaking, this is a combinatorial schema, in which α and β in this case match V and N respectively. As in standard CCGs, a single combinatorial function application schema is used for attaching all syntactic arguments to a functor category. Complex argument attachment phenomena are thus encoded through complex syntactic categories, e.g., V/N/N (for ditransitive verbs such as &quot;give&quot;) or V/S/N (for &quot;tell&quot; and similar sentential complement verbs).</Paragraph> <Paragraph position="3"> In contrast, modifier attachment is handled by providing rules specific to the particular modification phenomenon, as in standard phrase structure grammar. The temporal modifiers of a verb, for example, are handled through the following rule (not a schema).</Paragraph> <Paragraph position="5"> There are a number of advantages to this approach with respect to linguistic modeling. Beyond handling modifiers, Wittenburg (1987) used related strategies to project gapped constituents. From a computational perspective, however, the most important advantage is that it makes the grammar entirely unary- or binary-branching. This is so because the syntactic phenomena that must traditionally be treated through greater-than-binary rules are typically complex argument structures--in our framework, these phenomena are handled through &quot;stacked&quot; functor categories, such as V/N/N.2 By relying on the binary-branching property of the grammar, we can thus simplify many aspects of our parser; we don't need to construct dotted rules, for example.</Paragraph> <Paragraph position="6"> Although there are many further interesting and unconventional aspects of Alembic's parser, they are only of secondary interest to an information extraction audience. For the present purpose, we will thus only allude to some of the parser's other relevant characteristics.</Paragraph> <Paragraph position="7"> * The parser organizes rules according to specificity, so that only the most specific variant of a phrase structure rule is ever used in combining two categories.</Paragraph> <Paragraph position="8"> * We rely on a number of simple attachment disambiguation heuristics to reduce the number of parses generated for a given sentence. These heuristics are simple and have so far proved themselves workable in practice. In particular, we prefer argument attachments to modifier attachments, and among multiple modifier attachments, we prefer those that attach deepest in the parse tree.</Paragraph> <Paragraph position="9"> * Although the parser often produces spanning parses of even long sentences, we only require it to produce semi-parses. We use a greedy algorithm to segment the chart into maximal chunks and use these when a spanning parse is unavailable.</Paragraph> <Paragraph position="10"> 2 There are admittedly non-binary phenomena besides complex argument structure, for example the grammar of dates. We handle these cases through the finite-state pre-parsing that is performed by the Alembic elementizer.</Paragraph>
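To recap the categorial half in code, here is a minimal, hypothetical sketch of the single function-application schema, with categories as plain strings rather than Alembic's unification DAGs. Stacked functors such as V/N/N shed one argument per application, which is what keeps the grammar unary- or binary-branching:

def apply_forward(functor, argument):
    # Schema: alpha/beta  beta  =>  alpha
    head, slash, arg = functor.rpartition("/")
    if slash and arg == argument:
        return head
    return None  # this pair does not combine by the schema

print(apply_forward("V/N", "N"))    # -> 'V'    transitive verb plus object
print(apply_forward("V/N/N", "N"))  # -> 'V/N'  ditransitive after one object
print(apply_forward("V/S/N", "N"))  # -> 'V/S'  sentential-complement verb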
<Paragraph position="11"> Semantic interpretations
The parser produces semantic interpretations in concert with syntactic parses. The output of the parser is effectively a semantic interpretation of the predicate-argument structure of the sentence. We use semantic structures derived from those in the King Kong NL interface (Bayer and Vilain 1991); these are in turn related to the &quot;interpretation level&quot; of Alshawi and van Eijck (1989). These are semantic objects that have four primary components:
* Head: that is, a semantic predicate corresponding to the lexical head of a phrase.</Paragraph> <Paragraph position="12"> * Args: a list of semantic interpretations corresponding to the syntactic arguments of the lexical head.
* Proxy: a semantic individual that instantiates the lexical head.</Paragraph> <Paragraph position="13"> * Mods: a set of interpretations corresponding to the syntactic modifiers of the phrase.</Paragraph> <Paragraph position="14"> For example, the following is the top-level interpretation of &quot;The company last March set up a joint venture.&quot; In this case, the head slot of the interpretation corresponds to the main verb of the sentence, and the args slot holds the arguments of the verb, e.g., the subject and direct object. The proxy slot denotes a semantic individual which serves the role of an event instance in a partially Davidsonian scheme, as in (Hobbs 1985) or (Bayer and Vilain 1991). That is, modifiers of the lexical head (from the mods slot) are made to predicate the proxy individual, as in the example with the date modifier &quot;last March,&quot; which predicates a time-of relation of the proxy individual c-e-1. As with other frameworks with similar interpretation levels, it is worth observing that these representations are ambiguous with respect to quantifier scope. Although quantifier information is saved during parsing (as a fifth quant slot on interpretations), it is ignored except insofar as it is needed for determining the definiteness of referential expressions. As noted below, this representational ambiguity is of interest because we perform inference and data extraction directly upon propositionalized versions of these interpretations, with no need to scope quantifiers.

Inference
The inference module is a completely new addition to Alembic. It is built on top of a propositional database that stores propositionalized versions of the semantics of sentences. The database itself is coupled to a tractable inference mechanism (Vilain 1991, 1993) and to an equality reasoner based on congruence closure (McAllester 1989). With respect to data extraction, the inference module performs two chief functions. First, it provides a uniform framework that seamlessly integrates general-purpose semantic processing and the bulk of data extraction. Second, it serves as a common substrate upon which Alembic performs various kinds of discourse processing, ranging from general definite reference resolution to domain-specific event or entity merging. As we note below, the inferential nature of the process allows these two functions to be performed through simple rules, thus avoiding the monolithic and unwieldy pattern-matching expressions that are typically used for semantic processing and event merging. This material is best illustrated through examples; see below in our discussion of the walkthrough message.</Paragraph> </Section>
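The interpretation structure and its propositionalization can be sketched as follows; hypothetical Python, with invented proxy names (c1, bv1, t1, e1):

from dataclasses import dataclass, field

@dataclass
class Interp:
    head: str                                 # semantic predicate of the head
    proxy: str                                # individual instantiating it
    args: list = field(default_factory=list)  # interpretations of arguments
    mods: list = field(default_factory=list)  # interpretations of modifiers

def propositionalize(interp, facts=None):
    # Assert (head proxy arg-proxies...) and let each modifier predicate
    # the proxy individual; no quantifier scoping is ever required.
    facts = [] if facts is None else facts
    facts.append((interp.head, interp.proxy,
                  tuple(a.proxy for a in interp.args)))
    for a in interp.args:
        propositionalize(a, facts)
    for m in interp.mods:
        facts.append((m.head, m.proxy, (interp.proxy,)))
    return facts

# "The company last March set up a joint venture."
company = Interp("company", "c1")
venture = Interp("business-venture", "bv1")
march   = Interp("time-of", "t1")
setup   = Interp("create-entity", "e1", args=[company, venture], mods=[march])
for fact in propositionalize(setup):
    print(fact)  # ('create-entity', 'e1', ('c1', 'bv1')), ('company', 'c1', ()), ...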
<Section position="4" start_page="140" end_page="141" type="sub_section"> <SectionTitle> Template Generation </SectionTitle> <Paragraph position="0"> As mentioned, we rely on the inference mechanism to perform the bulk of the work of data extraction. The template generator, to a large extent, simply reads the template structure out of the contents of the propositional database. The primary complexity lies in maintaining backpointers from facts in the propositional database to the text strings required to produce string fills. Since these fills are typically NPs, this is simply achieved by mapping semantic individuals to their originating lexical heads, and mapping these in turn to their maximal projections.</Paragraph> <Paragraph position="1"> In addition, the template generator must handle the innumerable details concerning fill rules for geography, names, and aliases. These minutiae are handled through a combination of inference rules and ad hoc code.</Paragraph> </Section> </Section> <Section position="5" start_page="141" end_page="143" type="metho"> <SectionTitle> SEMANTIC PROCESSING IN THE WALKTHROUGH MESSAGE </SectionTitle> <Paragraph position="0"> From a data extraction perspective, most of the early stages of processing that Alembic performs on this message are only of partial interest. Although Alembic has its own distinctive ways of pre-processing, scanning, and parsing the document, the heart of the actual extraction process takes place during inference. As this is also one of the distinguishing characteristics of Alembic, our description of the walkthrough message thus focuses largely on the inferential process.</Paragraph> <Paragraph position="1"> Mapping parses to propositions
Inference begins when the semantic interpretations produced by the parser are added to the propositional database. For example, for the first sentence in the message, the parser produces the following schematic syntactic attachments (two fully spanning parses are produced for this sentence).</Paragraph> <Paragraph position="2"> The actual semantic interpretations are nested structures that generally mirror this attachment scheme: the top level of the interpretation, for example, is provided below. Note that in the particular case of verbs with sentential complements (e.g., &quot;Bridgestone Sports Co. said&quot;), the propositional complement (e.g., &quot;it will set up&quot;) is actually pulled out to the top level of the interpretation, and the matrix sentence is demoted to a modifier role.</Paragraph> <Paragraph position="4"> Note again that this process of propositionalization does not require making any choice of quantifier scoping.

General semantic inference
Once a sentence has been propositionalized and added to the database, the first set of inferences performed are typically of a general semantic nature. In the case of the first walkthrough sentence, the first rules to derive new facts are those that identify the participants of the events. A very general such rule pulls participants out of the conjoined co-agentive phrase &quot;with a local concern and a Japanese trading house.&quot; To begin with, this co-agentive phrase causes the propositions below to be added in addition to those shown above.</Paragraph> <Paragraph position="5"> Together, these two rules identify the existence of two of the joint venture participants (but not their names).
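Rule application itself amounts to forward chaining over the database until no new facts appear. By way of illustration, here is a hypothetical Python sketch, hard-coding the create-entity rule quoted just below:

def participant_rule(facts):
    # (create-entity (x y)) & (business-venture y) -> (participant (x y))
    new = set()
    for fact in facts:
        if fact[0] == "create-entity":
            _, x, y = fact
            if ("business-venture", y) in facts:
                new.add(("participant", x, y))
    return new

def forward_chain(facts, rules):
    facts = set(facts)
    while True:
        derived = set()
        for rule in rules:
            derived |= rule(facts)
        if derived <= facts:  # fixed point: nothing new was derived
            return facts
        facts |= derived

db = {("create-entity", "pt1", "bv1"), ("business-venture", "bv1")}
print(forward_chain(db, [participant_rule]))  # adds ('participant', 'pt1', 'bv1')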
The third participant is identified (modulo reference resolution) by a lexical rule about creating business ventures: (create-entity (x y)) & (business-venture y) -> (participant (x y)).</Paragraph> <Paragraph position="6"> This rule, along with the immediately preceding one, yields:
(participant (bv1 pt1)) i.e., &quot;it&quot; is a participant in the business venture
(jv-entity (bv1 pt1)) i.e., &quot;it&quot; is a venture entity of the venture
These rules typify the kind of inference that goes on in Alembic, wherein domain-specific lexical inferences build directly on top of general domain-independent inferences. This provides an appealing degree of declarative modularity, and allows significant portions of the semantic infrastructure to be reused across multiple domains.

Discourse processing
Perhaps the most appealing aspect of Alembic's propositional database is that it provides a very convenient substrate for discourse processing. Indeed, Alembic exploits the equality mechanism built into the database to cause facts to become propagated and shared as a result of coreference determinations. In the version of Alembic used in MUC-5, these co-reference determinations can happen in three different ways.</Paragraph> <Paragraph position="7"> Definite reference
The definite reference mechanism incorporated into the MUC-5 version of the system is a drastic simplification of the probabilistic mechanism we used in MUC-4 (Aberdeen et al. 1992). We restricted the mechanism to consider only designated classes in our sort hierarchy, namely joint ventures and companies. Once a definite reference to a designated class has been identified, it is simply resolved to the most recent instance of that class. This strategy is error-prone, and we intend to replace it shortly with a probabilistic reference mechanism more in line with the one we used in MUC-4 (but trained on more reliable data!). What is most interesting about our approach, however, is not the particular reference algorithm we ended up implementing, but the way it exploits the equality mechanism. To see this, consider sentence 2 of the walkthrough message, where we encounter the definite reference &quot;The joint venture, Bridgestone Sports Taiwan Co.&quot; This phrase engenders the following propositions in the database: The definite reference is resolved to the phrase &quot;a joint venture&quot; from sentence 1, which corresponds to the individual bv1. The individuals bv1 and bv2 are then equated, and the following is automatically inferred: (has-name (bv1 c2)). Other facts known about bv2 are automatically propagated to bv1 as well.</Paragraph>
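This propagation can be pictured with a minimal union-find sketch (hypothetical Python; Alembic's actual equality reasoner uses congruence closure after McAllester, which also handles equalities inside compound terms):

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def equate(x, y):
    parent[find(x)] = find(y)

facts = {("joint-venture", "bv1"), ("has-name", "bv2", "c2")}

def holds(*query):
    # A fact holds if it matches modulo the equalities recorded so far.
    canon = (query[0],) + tuple(find(a) for a in query[1:])
    return any((f[0],) + tuple(find(a) for a in f[1:]) == canon for f in facts)

equate("bv1", "bv2")                   # the definite reference is resolved
print(holds("has-name", "bv1", "c2"))  # True: bv2's facts now hold of bv1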
<Paragraph position="8"> Event and entity merging
Event and entity merging approaches to discourse processing have gained significant visibility recently in light of the success of systems with very fragmentary parsing, e.g., CIRCUS (Lehnert et al. 1992) or FASTUS (Appelt et al. 1993). Although we have only had limited experience with this class of techniques, we believe that entity and event merging can be elegantly handled through Alembic's propositional database. For instance, consider sentence 4 of the walkthrough message, which refers to &quot;The new company, based in Kaohsiung.&quot; The MUC-5 version of our system contained the following entity-merging rule, which can be glossed as saying that any new company mentioned in the context of a joint venture is in fact that joint venture (or perhaps its joint venture company): (company x) & (new-thing x) & (jv-in-context (x y)) -> (= (x y)).</Paragraph> <Paragraph position="9"> As a result of applying this rule, Alembic propagates facts known about the company in sentence 4 to the joint venture entity from sentence 1, e.g., the fact that it is based in Kaohsiung, and so forth. What is most appealing about merging events in this way is that it avoids much of the monolithic pattern-matching that is typically required for event merging, especially for copying information from the source events into their combined merger.

Stereotypical pronominal reference
Finally, Alembic exploits the inference mechanism to encode some patterns of pronominal reference that are so stereotypical as to be effectively frozen. One such pattern appears in the first sentence: &quot;Bridgestone Sports Co. said Friday it will set up a joint venture.&quot; This pattern is handled by the following simple rule: (declare (x y)) & (event (z y)) & (pronominal-thing z) -> (= (x z)). This rule matches sentences headed by propositional statement verbs (through the declare term). When the embedded clause (matched through the event term, event being a generalization in our domain model that matches any relation, including the create-entity event in this example) has a pronominal subject (the pronominal-thing term), the rule equates the subject arguments of the two clauses. As with definite reference, the equality mechanism then propagates facts known about the matrix subject to the embedded subject--in this case the fact that it is a company, etc.</Paragraph> </Section> <Section position="6" start_page="143" end_page="144" type="metho"> <SectionTitle> SYNOPSIS OF EVALUATION MEASURES </SectionTitle> <Paragraph position="0"> We believe the development effort we undertook in this past year has paid off to a large extent, though not as much as we had originally hoped. We achieved an overall ERR score of 87, corresponding to a P&R score of 22 (P&R being the F-measure, i.e., the harmonic mean of precision and recall).</Paragraph> <Paragraph position="1"> Although this is on the low end of scores reported for MUC-5, it represents a significant improvement over our MUC-4 performance (P&R of 9 on TST3), especially given the increased difficulty of the MUC-5 task.</Paragraph> <Paragraph position="2"> More important perhaps than these single performance measures is our overall performance profile. Alembic is a clean system: we have among the highest precision scores to be presented here (P=58). In contrast, our performance was hampered by our relatively lower recall (R=14). This high-precision profile is also reflected in our comparatively better system-by-system ranking in the richness-normalized-error measures (minerr=.89, maxerr=.92) than in the primary error-per-response-fill measure. This is due to the fact that Alembic avoids generating spurious fills, a characteristic that is favored by the richness-normalized error measures.</Paragraph> <Paragraph position="3"> Turning to our recall score, our major problem in obtaining a high recall level was simply an inadequate domain model, especially an insufficient complement of extraction rules for locating joint venture entities. For example, we failed to exploit such reliable indicators of these entities as ownership or partnership events.
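By way of illustration only, a rule of the kind that was missing (hypothetical in its particulars, and not part of the MUC-5 system) might be glossed as &quot;a company party to an ownership relation with a venture is a participant in that venture&quot;: (ownership (x y)) & (company x) & (business-venture y) -> (participant (y x)).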
Since entities contribute overwhelmingly to the overall complement of fills, we were correspondingly penalized. In addition, we experienced in the final run a number of hitherto unobserved bugs in the entity fill normalization code; these errors contributed to lowering both our recall and precision scores.</Paragraph> <Paragraph position="4"> Beyond the official final runs, we conducted several ablation experiments in an effort to gain additional insight into the nature of our performance at MUC-5. In particular, we created versions of Alembic in which we selectively disabled (1) the secondary lexicon (OAL), or (2) definite reference resolution. We then compared the performance of these derivative systems to the baseline performance of the full system on the 86-message section of the dry run test set. These results are summarized in Table 2, and as can be readily seen, there are no substantial differences between the experimental conditions. We were somewhat surprised to note a slight performance improvement when we disabled the secondary OAL lexicon, which attests to a certain degree to the robustness of our fallback strategy of mapping unknown lexical items to the UNK-N category.</Paragraph> <Paragraph position="5"> One should refrain, however, from drawing any sweeping conclusions from these experiments. Indeed, the baseline recall level of Alembic on this task is still fairly low, so the effects of these experimental manipulations are probably masked by the lack of sufficient baseline recall. For example, if Alembic fails to identify a company as a joint venture participant, it does not matter whether the system later succeeds through reference resolution at determining the company's nationality or location.</Paragraph> <Paragraph position="6"> Finally, we performed an additional experiment in which we hand-tagged all geographical entities, corporations, and personal names, rather than leaving this task to the elementizer and parser. The intention here was to provide an oracle for these phrase types, so as to help us determine how much of our recall error might have been due to inadequacies in the parser or elementizer, especially with respect to parsing the names of joint venture participants. Once again, we found no significant difference between the baseline system and the oracle-driven system. This substantiates our anecdotal impressions that an inadequate extraction model is the main cause of our recall errors.</Paragraph> </Section> </Paper>