SUSSEX UNIVERSITY:
DESCRIPTION OF THE SUSSEX SYSTEM USED FOR MUC-5
Robert Gaizauskas, Lynne Cahill & Roger Evans
Cognitive and Computing Sciences,
University of Sussex,
Brighton, UK
robertg@cogs.susx.ac.uk
lynneca@cogs.susx.ac.uk
rogere@cogs.susx.ac.uk
INTRODUCTION
This paper describes the system used for the University of Sussex team's participation in the MUC-5 message
understanding trials. What is described below is the result of 12 person-months of intensive effort over six months
to adapt a pre-existing system, designed with very different objectives and application in mind, to the MUC-5
English Joint Ventures task. This task, starting from cold, is colossal: the overhead of understanding the task,
the training data, the scoring and the background resources, of developing a suitable harness for the system, not
to mention sorting out contractual arrangements, leaves little time for even basic porting; actual development
tailored to the task was a very remote prospect. So, despite the quirks and failings exposed by the discussion
below of the 'walkthrough' example, we are pleased with our system's performance, and believe the effort to have
been a worthwhile part of our ongoing research.
HISTORY
The system used for MUC-5 is a modified version of a system developed at Sussex as part of the TIC ('Traffic
Information Collator') and subsequently POETIC ('Portable Extendable Traffic Information Collator') projects, in
association with Racal Research Ltd., the Automobile Association, and National Transcommunications Ltd., and
part-funded by the UK Science and Engineering Research Council and Department of Trade and Industry.
POETIC is a prototype software system which monitors police reports of traffic incidents and automatically
generates 'traffic bulletins' for motorists. POETIC's inputs are police incident reports entered as text into a police
logging computer: database entries containing free format text fields which describe the incident in telegrammatic,
jargon-laden English, as well as some information in fixed field format. The system identifies reports about traffic
incidents, and uses them to build up a picture of the key features of an incident. From this information it makes
judgements about the effect, seriousness, duration etc. of each incident, formulates suitable advisory messages,
and coordinates delivery of the messages to the affected motorists. Further details of POETIC can be found in
[1], [11], [14], [5].
The initial stages of the POETIC system constitute a message understanding system very similar in function
to the systems participating in the MUC evaluations: mapping from narrow-domain free text input to a semantic
representation of key pieces of information. Furthermore the 'Portable' in POETIC refers to domain portability:
the system is structured to make porting between domains relatively straightforward, albeit only in the narrow
sense of different police sublanguage domains all within the same overall domain of traffic reports. But although
POETIC is a fairly mature system (more than 16 person-years of development effort over the past 8 years, of
which we estimate two thirds has been on the message understanding component), it has to date remained rather
isolated in its own field of traffic management systems, in which there is little similar work. In particular, the
message understanding component has been evaluated primarily only relative to the requirements of the rest of
the POETIC system, rather than in an independent context.
Thus our objectives in participating in MUC-5 were threefold: firstly, to see how readily POETIC's message
understanding component (designed with a particular type of domain in mind) could be adapted to a radically
different topic domain; secondly, to get a more objective view of how well the system performs, and how it relates
to other approaches; and finally, to take a further step towards the larger goal of developing a generic message
understanding system, portable to a much wider range of (still narrow) domains.
SYSTEM DESCRIPTION
Key Design Features
The key strengths and novelties of our approach to message understanding are :
• the declarative representation of all its main knowledge bases (lexicon, grammar, domain model)
• support for extendability and domain-portability, through identification and localisation of domain-specific
information
• the 'interesting corner' island parsing technique, which allows grammar coverage to focus on domain-relevant
phenomena
• the coupling of robust but overfragmented partial analysis of input sentences with a semantic modeller
which utilises domain knowledge to construct a coherent interpretation of extended texts
• the application and integration in a practical information extraction system of theoretically well-founded
techniques: bidirectional chart parsing, unification-based grammar, compositional semantics, inheritance-based
representation languages.
Architecture
The architecture of the Sussex system used for MUC is shown in Figure 1.
The Pre-processor The Pre-processor takes an input document(1) and massages it into a form more easily processed
by the rest of the system. Its key function is to break the document up into 'fixed field' sentences (such as document
number, date and source), and free text sentences. To do this, it interprets SGML tags where present, looks for
sentence boundaries in conventional ways, and tokenises the resulting text, identifying numbers and punctuation
symbols and separating them from alphabetic text etc. It also attempts to locate quoted speech and discards it:
examination of the training corpora revealed that very rarely was there useful information in such text which
was not also elsewhere, and far more often it proved quite troublesome to process. The result is a sequence of
distinct sentences, either fixed field or free text.
Fixed field sentences bypass lexical and syntactic processing: they are converted directly into fake semantic
parses and passed to the Discourse Interpreter. Free text sentences are passed to the lexical analyser and then
the parser.
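The pre-processing steps just described can be sketched as follows. This is a minimal illustration only, not the Sussex implementation: the tag names (DOCNO, DD, SO, TXT), the output labels, and the naive full-stop sentence splitter are all assumptions made for the example.

```python
import re

# Hypothetical SGML tag names mapped to hypothetical fixed-field labels.
FIXED_TAGS = {"DOCNO": "DOCNO", "DD": "DATE", "SO": "SOURCE"}

def preprocess(document):
    """Split a document into fixed-field and free-text 'sentences'."""
    sentences = []
    # Fixed fields: pull out the contents of the known tags.
    for tag, label in FIXED_TAGS.items():
        m = re.search(rf"<{tag}>(.*?)</{tag}>", document, re.S)
        if m:
            sentences.append(("fixed", label, m.group(1).strip()))
    # Free text: take the body and discard quoted speech.
    body = re.search(r"<TXT>(.*?)</TXT>", document, re.S)
    text = body.group(1) if body else ""
    text = re.sub(r'"[^"]*"', " ", text)
    # Naive sentence splitting on terminal punctuation (an assumption).
    for raw in re.split(r"(?<=[.!?])\s+", text.strip()):
        # Tokenise: numbers, alphabetic words, single punctuation marks.
        tokens = re.findall(r"\d+(?:\.\d+)?|[A-Za-z]+|[^\sA-Za-z\d]", raw)
        if tokens:
            sentences.append(("free", "TEXT", tokens))
    return sentences
```

A splitter this simple would break on abbreviations such as "CO."; the paper says only that sentence boundaries are found "in conventional ways", so the real system presumably does better.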
The Lexical Analyser The Lexical Analyser takes pre-processed sentences and performs a sequence of tests,
looking for lexical phrases and lexical entries for individual words. It is primarily here that background resources
(company names, locations, personal names etc.) are employed. Broadly speaking, this process splits into
three phases: looking for lexical phrases with reliable indicators, looking in the main lexicon for known words,
and then looking for lexical phrases with less reliable indicators. The aim here, and indeed throughout the
lexical and syntactic analysis, is to minimise the number of hypotheses so that fairly conventional parsing with
a fairly conventional grammar remains computationally feasible. To this end the search is ordered so that more
probable interpretations of a word are sought first, and if found will often preclude the hypothesis of less probable
alternatives.
(1) Or, in test mode, a corpus, which it splits up into separate documents.
[Figure 1 (diagram not reproduced). Stages shown, in order:
Pre-processor: extract document name, date and source; break text into sentences; tokenise words, numbers and punctuation; discard reported speech.
Lexical Analyser: reliable indicators (numbers, titled names, common locations, fixed phrases, corporate designators); lookup in main lexicon; less reliable indicators (simple corporate names, phrasal corporate names, bare personal names, gazetteer locations).
Parser, phase 1: activate 'interesting' lexical hypotheses and extend the chart bottom-up and bidirectionally, using the grammar.
Parser, phase 2: activate common words next to 'blocked' active edges and extend the chart, using the ANLT lexicon; extract semantic parses and select the 'best' (least fragmented and most informative).
Discourse Interpreter: expand presuppositions of the current sentence; resolve intrasentential co-references; resolve co-references with the current model, using domain knowledge; after the final sentence, complete the model by adding missing key values from inference or by default.
Post-processor: produce template data structure; write template.]
Figure 1: System Architecture
The following reliable phrasal indicators are detected:
Numbers: some of which may have alternative interpretations as years, or be followed by an ordinal marker (st,
nd, rd etc.).
Titled names: where the titles come from a fixed set, and special-purpose name detection routines locate the
entire name phrase. Titles can be prefixed or postfixed and most become incorporated in the resulting
lexical phrase. A few, however, such as president and chairman, have their own entries in the main lexicon,
and so trigger a name phrase but remain independent of it so that the appropriate semantic composition
takes place during parsing.
Common locations: gleaned from the training corpus. The supplied gazetteer is far too unwieldy to use directly
and the timescale did not permit much experimentation with it. Instead we compiled a list of the most
frequent locations (cities, countries, regions; just over 300 in all) actually occurring in the training corpus
and this is the data used here. Other locations may be picked up in the third phase of lexical processing
(see below).
Fixed phrases: occurring frequently in the domain, such as joint venture, trading house, stock exchange, mostly
derived from a digram analysis of the training corpus. Also, common English fixed phrases such as as well
as.
Corporate designators: whose presence triggers special-purpose company name detection code. Company names
identified in this way are remembered so that they will be recognised later in the document even without a
corporate designator(2).
The main domain-specific lexicon is written in the inheritance-based lexical representation language DATR
[10], [9]. This contains the root forms of significant domain words (e.g. company, own) and the most common
English words (e.g. the, and) found in the training corpus: about 600 entries in all, although the level of semantic
detail is as yet far from uniform. Before looking up a word in this lexicon, a simple morphological analysis is
performed: a set of rules is used to detect standard suffixes, resulting in a root + suffix analysis for each word.
The root indexes a lexical entry (or entries, for ambiguous forms) while the suffix information is used to fill in
feature details. Further details of this mechanism can be found in [4], [3].
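The root + suffix lookup can be illustrated with a small sketch. The suffix rules, feature names and three-word lexicon below are invented for the example; the actual rules and the DATR lexicon are described in [4], [3] and are not reproduced here.

```python
# Hypothetical suffix rules: (suffix, replacement, features contributed).
SUFFIX_RULES = [
    ("ies", "y", {"num": "pl"}),
    ("ing", "", {"vform": "prp"}),
    ("ed", "", {"vform": "past"}),
    ("s", "", {"num": "pl"}),
]

# Toy lexicon keyed on root forms (stand-in for the DATR lexicon).
LEXICON = {"company": {"cat": "n"}, "own": {"cat": "v"}, "the": {"cat": "det"}}

def analyse(word):
    """Return every (root, features) analysis licensed by the lexicon."""
    word = word.lower()
    analyses = []
    if word in LEXICON:
        analyses.append((word, dict(LEXICON[word])))
    for suffix, replacement, feats in SUFFIX_RULES:
        if word.endswith(suffix):
            root = word[: -len(suffix)] + replacement
            if root in LEXICON:
                # The root indexes the entry; the suffix fills in features.
                analyses.append((root, {**LEXICON[root], **feats}))
    return analyses
```

A word whose candidate roots are all absent from the lexicon simply receives no analysis, which is consistent with the deliberately incomplete lexical coverage described in this section.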
The following less reliable phrasal indicators are only accepted cautiously:
Simple corporate names: from the supplied resources. One-word company names are sought after the main
lexicon lookup. Thus a normal domain interpretation of a word is preferred over any possible company
name interpretation. However the presence of a following corporate designator will generally force a company
interpretation when it is encountered.
Phrasal corporate names: from the supplied resources. A phrasal name is split into an essential (contiguous)
core, plus peripheral optional parts. Phrases are indexed on the first word of the core, and if the whole core
is located, then optional parts will also be incorporated into the lexical phrase if present. In principle, this
might be a very reliable match but both the automatic identification of appropriate cores in the corpus, and
the heuristics for matching optional words are still rather rudimentary.
Bare personal names: using the names resources and a certain amount of guesswork about unknown words.
Gazetteer locations: from the supplied gazetteer. Using this directly leads to many surprising, and generally
spurious hits, and much irrelevant ambiguity even in sensible cases, so it is employed very much as a last
resort at present.
(2) This is in fact the only 'global state' maintained by the lexical analyser or the parser: all other lexical and syntactic processing
looks only at the current sentence.
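The core-plus-optional-parts matching of phrasal corporate names might look roughly like the sketch below. The record layout (`before`, `core`, `after`, `value`) is hypothetical, and the treatment of optional words is a guess; the paper itself describes those heuristics only as "rather rudimentary".

```python
def match_phrasal_name(tokens, i, entry):
    """Try to match a phrasal company name whose core starts at tokens[i].

    `entry` is a hypothetical record with keys "before", "core", "after"
    (lists of words) and "value". Returns (start, end, value) or None.
    """
    core = entry["core"]
    if tokens[i:i + len(core)] != core:
        return None                      # the whole core must be present
    start, end = i, i + len(core)
    # Absorb optional words immediately to the left, nearest first.
    for word in reversed(entry["before"]):
        if start > 0 and tokens[start - 1] == word:
            start -= 1
    # Absorb optional words immediately to the right, in order.
    for word in entry["after"]:
        if end < len(tokens) and tokens[end] == word:
            end += 1
    return (start, end, entry["value"])
```

With an entry whose core is BRIDGESTONE and whose only optional part is CORPORATION, the input BRIDGESTONE SPORTS CO yields a one-word match, while BRIDGESTONE CORPORATION yields a two-word match.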
Note that the main lexicon contains only about 600 words, and since many words in the input will neither be
in this lexicon nor be part of recognisable lexical phrases, many words will be given no lexical analysis whatsoever
by the Lexical Analyser. Avoiding complete initial lexical analysis is one of the features of our approach; limited
further lexical analysis is done during parsing, as described in the next section.
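The three-phase ordering of lexical lookup can be sketched as below. This is an illustrative simplification: here an earlier hypothesis always blocks a later one over the same words, whereas the text says only that it "will often" do so, and the matcher interface is an assumption.

```python
def lexical_hypotheses(tokens, reliable, lexicon, unreliable):
    """Assign at most one hypothesis per token span, most reliable first.

    `reliable` and `unreliable` are lists of matcher functions taking
    (tokens, i) and returning (length, label) or None; `lexicon` is a dict
    from lower-case words to labels. All three are hypothetical interfaces.
    """
    covered = [False] * len(tokens)
    hyps = []

    def claim(i, length, label):
        # Accept a span only if no earlier phase already covers part of it.
        if not any(covered[i:i + length]):
            hyps.append((i, i + length, label))
            covered[i:i + length] = [True] * length

    def run(matchers):
        for i in range(len(tokens)):
            for match in matchers:
                hit = match(tokens, i)
                if hit:
                    claim(i, *hit)

    run(reliable)                                  # phase 1: reliable phrases
    for i, tok in enumerate(tokens):               # phase 2: main lexicon
        if tok.lower() in lexicon:
            claim(i, 1, lexicon[tok.lower()])
    run(unreliable)                                # phase 3: cautious phrases
    return sorted(hyps)
```

Words claimed by no phase are left without any analysis at this stage, as the preceding paragraph notes.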
The Parser The Parser takes the initial lexical hypotheses for a single sentence produced by the lexical analyser,
and attempts to build larger syntactic structures (ideally, but rarely, a complete parse). From these it builds
semantic representations. The lexical entries returned (for single words and larger phrases) have three basic
components: a backbone category, some feature/value specifications, and a semantic expression. The grammar
rules similarly consist of backbone categories and feature specifications, plus semantic composition rules describing
how the semantics of the mother category is constructed from that of the daughters.
The backbone categories are parsed using a chart parser in a conventional fashion: the fact that they are
atomic allows greater efficiency in rule indexing and primary category matching. Once backbone categories have
been matched, the corresponding feature sets are unified together. The most important role of unification is to
enforce domain-specific type restrictions, which are central to the parsing process. The parser has a
sophisticated semantic type system, which uses term unification to implement type-compatibility (see [13] for
further details).
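One common way to obtain type-compatibility from term unification is to encode each type as its path from the root of the hierarchy, with an open tail, so that two types unify exactly when one path is a prefix of the other. The sketch below illustrates that idea only; whether the Sussex parser uses this particular encoding is an assumption (see [13] for the real mechanism), and the hierarchy fragment is invented.

```python
# Hypothetical fragment of a type hierarchy, child -> parent.
PARENT = {
    "object": "entity", "event": "entity",
    "ecentity": "object", "location": "object",
    "company": "ecentity", "other_ecentity": "ecentity",
    "tie_up": "event",
}

def path(t):
    """Path from the root to type t, e.g. ('entity', 'object', 'location')."""
    out = [t]
    while out[-1] in PARENT:
        out.append(PARENT[out[-1]])
    return tuple(reversed(out))

def unify_types(a, b):
    """Return the more specific of two compatible types, else None.

    Treating each path as a term with an open tail, unification succeeds
    exactly when one path is a prefix of the other.
    """
    pa, pb = path(a), path(b)
    shorter, longer = (pa, pb) if len(pa) <= len(pb) else (pb, pa)
    return longer[-1] if longer[:len(shorter)] == shorter else None
```

Under this encoding a generic `ecentity` unifies with `company`, but `company` and `other_ecentity` do not unify with each other.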
As well as the restrictive effects of the type system in the grammar, the other main aspect which distinguishes
the parser from standard chart parsers is the use of 'interesting-corner' parsing (see [13]). A grammar rule is only
triggered if its indexing category is currently 'interesting'. What constitutes an interesting category is controlled
dynamically by the parser and grammar and is quite domain-specific. Initially, just a few classes of lexical categories
are deemed interesting (such as company names and joint venture phrases), and so only parsing from such words
or phrases is ever initiated. This strategy means that the parser has to be bidirectional, but it does restrict it to
analysing phrases which have some interesting content. The 'interestingness' can spread to features which are
explicitly sought by a rule which has been triggered.
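The triggering condition can be illustrated with a much-simplified sketch. It is not a full active-edge chart parser and omits feature unification and the dynamic spreading of interestingness; the toy grammar, category names and the fixed `INTERESTING` set are assumptions. Each rule names one daughter as its indexing category, and is tried only from an interesting edge of that category, extending leftwards and rightwards from it.

```python
# Hypothetical toy grammar: (mother, daughters, index of indexing daughter).
RULES = [
    ("np", ("det", "jv"), 1),            # e.g. A + JOINT VENTURE
    ("np", ("np", "prep", "loc"), 0),    # e.g. NP + IN + TAIWAN
]
INTERESTING = {"jv", "company", "loc", "np"}

def extend(edges, pos, cats, direction):
    """Positions reachable from pos by matching cats against adjacent edges."""
    frontier = {pos}
    for cat in cats:
        frontier = {
            (s if direction < 0 else e)
            for (s, e, c) in edges
            if c == cat and (e if direction < 0 else s) in frontier
        }
    return frontier

def parse(edges):
    """edges: set of (start, end, category). Adds every edge derivable by
    triggering rules only from 'interesting' indexing daughters."""
    edges = set(edges)
    agenda = [e for e in edges if e[2] in INTERESTING]
    while agenda:
        start, end, cat = agenda.pop()
        for mother, daughters, k in RULES:
            if daughters[k] != cat:
                continue
            lefts = extend(edges, start, daughters[:k][::-1], -1)
            rights = extend(edges, end, daughters[k + 1:], +1)
            for s in lefts:
                for e in rights:
                    new = (s, e, mother)
                    if new not in edges:
                        edges.add(new)
                        if mother in INTERESTING:
                            agenda.append(new)
    return edges
```

Stretches of input containing no interesting edge are never touched, which is the point of the technique.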
The parser also includes heuristics for handling common words not in the main domain lexicon. Once the
chart has been extended as far as it can be on the basis of the initial hypotheses (which included only domain
specific words and very common English words), the words which are 'blocking' active chart edges are examined.
These words are sought in the 'ANLT' lexicon, a general lexicon of English developed under the 'Alvey Natural
Language Tools' Project [2], which contains about 7000 root forms, each associated with a syntactic category.
If the 'blocking' words are in the ANLT lexicon and are of the right category to extend the active edge, then they
are added at this point, and parsing proceeds again. In this way, the presence of the odd unknown word within
interesting text does not cause a problem, but the parser is not weighed down with attempting to process all the
common words in the sentence.
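The second-phase lookup of 'blocking' words can be sketched as follows. The representation of an active edge as a (position, needed category) pair and the dictionary standing in for the ANLT lexicon are assumptions made for the example.

```python
def bridge_blocked_edges(tokens, blocked, general_lexicon):
    """Propose extra lexical hypotheses for words blocking active edges.

    `blocked` is a list of (position, needed_category) pairs, where
    `position` indexes the word an active edge could not get past.
    `general_lexicon` maps lower-case words to sets of categories.
    Returns the (position, category) hypotheses to add to the chart.
    """
    added = []
    for pos, needed in blocked:
        if 0 <= pos < len(tokens):
            cats = general_lexicon.get(tokens[pos].lower(), ())
            # Only the category the edge actually needs is hypothesised.
            if needed in cats:
                added.append((pos, needed))
    return added
```

Because the lookup is driven by the needs of existing active edges, common words elsewhere in the sentence are never analysed at all.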
The parser constructs semantic representations (expressions of lambda calculus over predicate logic) as a
post-process on completed syntactic parse trees. The grammar rules specify compositionally how the semantics
of complex linguistic categories are constructed out of the semantics of component categories. At the highest
level, in a parse tree of a complete message, all lambda variables disappear and the parser produces an expression
of predicate calculus as its interpretation of the 'meaning' of the message.
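Compositional construction of this kind can be imitated with Python lambdas. The lexical entries and the way the determiner closes off the remaining variable are invented for illustration; only the general shape of the output (a list of predicates over entity constants) follows the fragments shown later in the walkthrough.

```python
from itertools import count

_ids = count(1)

def fresh():
    """Return a new entity constant: e1, e2, ..."""
    return f"e{next(_ids)}"

# Lexical semantics: functions from an entity variable to predicates.
joint_venture = lambda x: [("tie_up", x)]
taiwan = lambda y: [("location", y), ("name", y, "TAIWAN")]

# 'in' takes the semantics of its object, then of the phrase it modifies.
in_ = lambda loc: lambda head: lambda x: (
    lambda y: head(x) + loc(y) + [("at", x, y)]
)(fresh())

# Determiner: binds the remaining lambda variable to a fresh constant.
a = lambda noun: noun(fresh())

# "A JOINT VENTURE IN TAIWAN": no lambda variables remain in the result.
semantics = a(in_(taiwan)(joint_venture))
```

Once every variable has been bound, `semantics` is a flat list of ground predicates, analogous to the closed predicate calculus expression produced at the top of a complete parse.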
The Discourse Interpreter The Discourse Interpreter integrates semantic information from each successive
parsed sentence into the current semantic model. This component draws on a number of previous approaches,
such as auxiliary role fillers [7], scripts [15], interpretation as abduction [12] and world models [8], [6].
The first semantic task is to add any facts presupposed by the new information, to give us a more explicit
picture of the intended meaning. So for example, a 'joint venture' event presupposes at least two (distinct)
joint venture participants and an activity in which the participants will engage, but this information may not be
explicitly present in the input.
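Presupposition expansion can be sketched as a recursive walk over a table of presupposed role fillers. The table below is hypothetical, although its contents follow the joint venture example above and the walkthrough later in the paper; the identifiers `p1`, `p2`, ... and the model layout are also assumptions.

```python
# Hypothetical table: type -> list of (role, filler type, how many).
PRESUPPOSES = {
    "tie_up": [("tie_up_ecentity", "ecentity", 2),
               ("tie_up_activity", "activity", 1)],
    "activity": [("activity_site", "activity_site", 1),
                 ("activity_industry", "industry", 1)],
}

def expand(model, entity, etype, counter):
    """Add presupposed role fillers for `entity`, recursively.

    `model` maps ids to {"type": ..., "props": [...]}; `counter` is a
    one-element list used to mint fresh ids.
    """
    for role, ftype, n in PRESUPPOSES.get(etype, []):
        fillers = []
        for _ in range(n):
            counter[0] += 1
            new = f"p{counter[0]}"
            model[new] = {"type": ftype, "props": []}
            model[entity]["props"].append((role, entity, new))
            fillers.append(new)
            expand(model, new, ftype, counter)
        # Several fillers of one role are presupposed to be distinct.
        for i, x in enumerate(fillers):
            for y in fillers[i + 1:]:
                model[entity]["props"].append(("distinct", x, y))
    return model
```

The posited objects carry only a type and their linking properties, so later co-reference resolution is free to identify them with entities actually mentioned in the text.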
[Figure 2 (diagram not reproduced): SEMANTIC ONTOLOGY, a type hierarchy rooted at entity(X); annotations on the
diagram note that it includes all SIC categories and all facility types.]
Figure 2: Fragment of EJV Ontology used in MUC-5
The second semantic task is co-reference resolution: the same entity may be referenced throughout the text
in different ways, perhaps even implicitly (as in ellipsis), and this identity of reference needs to be established.
To do this a data structure is employed which locates all entity types in a hierarchical type system or ontology, a
fragment of which is shown in Figure 2.
Each node in the ontology has properties associated with it, either defined explicitly or inherited from above.
Such properties are themselves nodes in the ontology. Co-reference resolution is constrained to respect the type
system (inconsistent 'types' may not co-refer) and the property specifications (some properties are unique-valued,
others are not). Apart from this it is unconstrained, except that we attempt resolution within a parse before
attempting resolution between a parse and the existing model. This is a legacy of the traffic domain, where
multiple entities of the same type were rare, but as we shall see in the walkthrough, this means that inappropriate
coreferences do sometimes occur.
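The two constraints on co-reference (type consistency and unique-valued properties) can be sketched as a merge test. The ontology fragment, the choice of `name` and `nationality` as unique-valued, and the instance layout are assumptions made for illustration.

```python
# Hypothetical ontology fragment (child -> parent) and property declarations.
PARENT = {"company": "ecentity", "other_ecentity": "ecentity",
          "ecentity": "entity", "location": "entity"}
UNIQUE = {"name", "nationality"}        # unique-valued properties

def ancestors(t):
    """Type t followed by all its supertypes."""
    chain = [t]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def merge(a, b):
    """Return the merged instance if a and b may co-refer, else None.

    Instances are {"type": str, "props": {property: set of values}}.
    """
    if a["type"] in ancestors(b["type"]):
        mtype = b["type"]                # b is at least as specific
    elif b["type"] in ancestors(a["type"]):
        mtype = a["type"]
    else:
        return None                      # inconsistent types
    props = {k: set(v) for k, v in a["props"].items()}
    for key, values in b["props"].items():
        merged = props.get(key, set()) | set(values)
        if key in UNIQUE and len(merged) > 1:
            return None                  # clash on a unique-valued property
        props[key] = merged
    return {"type": mtype, "props": props}
```

Since nothing else constrains the merge, an instance with no distinguishing properties will unify with anything of a compatible type, which is how the over-eager unifications discussed in the walkthrough arise.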
This component also has a more subtle role in the overall architecture of the system. In order to parse robustly,
the grammar has rather patchy, oversimplified coverage in 'uninteresting' areas. This means that the resultant
analyses tend to be overfragmented. Part of the role of the co-reference resolver is to pull such fragments
back together at the semantic level. This technique is an important factor in the overall effectiveness of our
approach, allowing clearer, more conventional expression of the grammatical knowledge encoded, not encumbered
by complicated support for robustness issues.
The final stage of semantic analysis, after all the sentences have been processed, is to provide values for a
number of key domain features. The system has a notion of certain features which must be present in any
model, and if values for them have not been provided explicitly from the text, it will attempt to provide them
itself. This is partly achieved through inference from values which have been provided, so that, for example,
the description of an industry can be inferred from the description of a facility: mining happens at mines,
manufacturing at plants etc. When all else fails, default values are provided where appropriate. Many of these
defaults are derived from statistical analysis of the training corpus and templates. For example, in the WSJ, MEAD
and PROMT development template sets, 'ecentity-type'(3) is COMPANY in 92% of cases, and tie-up-status is EXISTING
in 79% of cases.
(3) Throughout this paper we use the term 'ecentity' (economic entity) to denote the MUC-5 template ENTITY object, the term 'entity' being already in use as the top
of our own ontology.
[Figure 2, legible node labels from the diagram: person(X), govt(X), company(X), object(X), service(X), own(X,Y,Z), immutable(X), relational.prop(X),
agriculture(X), mining(X), construction(X), manufacturing(X), communications(X), site(X), factory(X), farm(X), build(X,Y,Z), operate(X,Y,Z),
produce(X,Y,Z), facility_name(X,Y), facility_location(X,Y), tie_up_status(X,Y), tie_up_ecentity(X,Y), tie_up_jv_comp(X,Y), ecentity_alias(X,Y),
number_of(X), time_of(X). The property subtrees are annotated as including all unique-valued and all non-unique-valued template slots.]
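Model completion by inference and then by default can be sketched as below. The two defaults are the ones quoted in the text; the inference table, the slot names and the flattening of everything into a single dictionary are simplifications assumed for the example.

```python
# Hypothetical inference table: facility type -> industry type.
FACILITY_TO_INDUSTRY = {"mine": "mining", "plant": "manufacturing"}

# Defaults quoted in the text, from the development template sets.
DEFAULTS = {"ecentity_type": "COMPANY", "tie_up_status": "EXISTING"}

def complete(slots):
    """Fill missing key slots by inference first, then by default."""
    filled = dict(slots)
    facility = filled.get("facility")
    if "industry" not in filled and facility in FACILITY_TO_INDUSTRY:
        filled["industry"] = FACILITY_TO_INDUSTRY[facility]
    for key, value in DEFAULTS.items():
        # Values already extracted from the text are never overwritten.
        filled.setdefault(key, value)
    return filled
```

Values found in the text always take priority, inferred values come next, and defaults are used only as a last resort.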
The Post-processor The Post-processor simply takes the model built by the Discourse Interpreter, turns it into
a template-shaped data structure and then writes it out in an appropriate format. In doing this, it traverses the
template structure top down, so that any entities in the model which are not linked to the top are simply ignored.
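The effect of the top-down traversal can be illustrated with a simple reachability walk. The model layout (a dictionary from object ids to slot/value lists) is an assumption made for the example.

```python
def reachable(model, root):
    """Return the ids reachable from `root` via object-valued slots.

    `model` maps an id to a dict of slot -> list of values; a value that
    is itself a key of `model` is treated as a link to another object.
    """
    seen, stack = [], [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.append(node)
        for values in model[node].values():
            # Reverse so that linked objects are visited in slot order.
            stack.extend(v for v in reversed(values) if v in model)
    return seen
```

Only the objects returned would be written out, so a 'floating' entity that no slot points to never appears in the template.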
THE WALKTHROUGH EXAMPLE
In this section we discuss the system's performance on the standard 'walkthrough' example, making reference
to the more complete picture provided in the appendix. This example gave the pre-processor very little trouble,
having no difficult sentence breaks or reported speech. So we begin by listing the input sentences after
pre-processing:
1. DATE '241189'
2. SOURCE 'Jiji Press Ltd.'
3. NEWCONTEXT 1
4. BRIDGESTONE SPORTS CO . SAID FRIDAY IT HAS SET UP A JOINT VENTURE IN TAIWAN WITH A
LOCAL CONCERN AND A JAPANESE TRADING HOUSE TO PRODUCE GOLF CLUBS TO BE SHIPPED TO
JAPAN
5. THE JOINT VENTURE , BRIDGESTONE SPORTS TAIWAN CO . , CAPITALIZED AT 20 MILLION
NEW TAIWAN DOLLARS , WILL START PRODUCTION IN JANUARY 1990 WITH PRODUCTION OF 20000
IRON AND " METAL WOOD " CLUBS A MONTH
6. THE MONTHLY OUTPUT WILL BE LATER RAISED TO 50000 UNITS , BRIDGESTON SPORTS OFFICIALS
SAID
7. THE NEW COMPANY , BASED IN KAOHSIUNG , SOUTHERN TAIWAN , IS OWNED 75 PCT BY
BRIDGESTONE SPORTS , 15 PCT BY UNION PRECISION CASTING CO . OF TAIWAN AND
THE REMAINDER BY TAGA CO . , A COMPANY ACTIVE IN TRADING WITH TAIWAN , THE
OFFICIALS SAID
8. BRIDGESTONE SPORTS HAS SO FAR BEEN ENTRUSTING PRODUCTION OF GOLF CLUB
PARTS WITH UNION PRECISION CASTING AND OTHER TAIWAN COMPANIES
9. WITH THE ESTABLISHMENT OF THE TAIWAN UNIT , THE JAPANESE SPORTS GOODS MAKER PLANS TO
INCREASE PRODUCTION OF LUXURY CLUBS IN JAPAN
10. CASE ALLUPPER
Of these, sentences 1, 2, 3 and 10 are all 'fixed field' sentences and so not parsed syntactically. Sentence 3 perhaps
deserves a comment: it indicates to the Discourse Interpreter that we are beginning a new JV text (within a
single document). This is of significance when handling documents which summarise several separate stories,
using -- to introduce each new one (such as document 0142 of the final run). In such cases, each -- generates
a NEWCONTEXT sentence.
Sentence 4 actually makes things start happening. Lexical analysis identifies BRIDGESTONE as the company
BRIDGESTONE CORPORATION I JAPAN in the supplied resources database (phrasal company matching with
BRIDGESTONE as core and an optional CORPORATION, which is not present in this case), and CO as a company
(because all corporate designator abbreviations get mapped to their full forms, which is useful in some contexts, but clearly
not all!). SPORT is ignored as a common word, but had the example been mixed-case, and SPORT capitalised, a
single company name spanning these three words would have resulted. JOINT VENTURE and TRADING HOUSE are
both identified as fixed phrases, and most other words are given reasonable analyses. Sadly, GOLF CLUBS does not
appear in our still incomplete list of products and services. If we changed the input text to talk about 'gloves', which
our system does know about, then the line PRODUCT/SERVICE: (51 "glove") would appear in the template.
The parser centers its activity on interesting lexical items (here, the companies, locations and the joint venture),
and produces five fragments on this sentence as follows:
1. BRIDGESTONE
[company(e1), dbminfo(e1,[[value,'BRIDGESTONE CORPORATION I JAPAN'],
[orig_text,'BRIDGESTONE']])]
2. CO
[company(e2), numberof(e2, _, 1)]
3. A JOINT VENTURE IN TAIWAN
[tie_up(e3), numberof(e4, _, 1), location(e4), at(e3, e4),
dbminfo(e4,[[value,'Taiwan (COUNTRY)'],[type,'COUNTRY'],
[orig_text,'TAIWAN']])]
4. A LOCAL CONCERN AND A JAPANESE TRADING HOUSE
[ecentity(e5), numberof(e5, _, 1), other_ecentity(e6),
numberof(e6, _, 1), nationality(e6, japan)]
5. JAPAN
[location(e7), dbminfo(e7,[[value,'Japan (COUNTRY)'],
[type,'COUNTRY'],[orig_text,'JAPAN']])]
Each of these fragments is a list of logical predicates about entities e1, e2 etc. The predicate 'dbminfo' wraps
up information obtained from the supplied resources (interfaced via the DBM database library package). Notice
in fragment 4 that the common word LOCAL has not blocked the successful detection of the full noun phrase,
despite not being present in the main lexicon.
On receiving these parses, the Discourse Interpreter first attaches an instance node to the ontology for each
entity referred to in the parses (e1, e2 etc.). These instance nodes are attached immediately below the most
specific object or event type node by which the entity is identified in the parses. Each predicate in the parses
which is a property (determined by reference to the ontology; see Figure 2) is then added to the property list
of each of the entity instances which occur as its arguments. Once this mapping of representations has been
achieved, we end up with an instance model containing four ecentity objects, a tie_up object and a couple of
location objects. The next step is to expand presuppositions, and in this case only the tie_up entity has any
presuppositions: that there are two tie_up_ecentities and a tie_up_activity. This latter has its own presuppositions
that it is at an activity_site and associated with an industry. So new objects are created for all these also, and we
are ready to start co-reference resolution, the main task.
Since this is the first sentence of any significance, only intra-sentential co-reference resolution occurs. The
company BRIDGESTONE unifies with CO (hence 'correcting' the missed common word SPORT to some extent) since
the latter is completely generic and will unify with any company. However, for the same reason it also unifies with
LOCAL CONCERN, incorrectly. It also unifies with one of the posited ecentities in the tie_up. It does not unify with
TRADING HOUSE since the latter is not a company (it has the incompatible type 'other_ecentity'(4)), nor with the
second tie_up ecentity, which is constrained to be distinct from the first, leaving these two free to unify together.
The resulting model looks like this:
z2 <-- company(_)
Props: [numberof(z2, _, 1), distinct(z2, z5), ecentity_type(z2, COMPANY),
dbminfo(z2,[[value,'BRIDGESTONE CORPORATION I JAPAN'],
[orig_text,'BRIDGESTONE']]),
ecentity_core_name(z2,'BRIDGESTONE CORPORATION I JAPAN'),
tie_up_ecentity(z3, z2), ecentity_name(z2, 'BRIDGESTONE')]
z3 <-- tie_up(_)
Props: [tie_up_ecentity(z3,z5), tie_up_activity(z3,z7), numberof(z3,_,1),
tie_up_ecentity(z3,z2), at(z3,z4)]
z4 <-- location(_)
Props: [dbminfo(z4,[[value,'Taiwan (COUNTRY)'],
[type,COUNTRY],[orig_text,'TAIWAN']]), at(z3, z4)]
z5 <-- other_ecentity(_)
Props: [numberof(z5, _, 1), nationality(z5, japan), distinct(z2, z5),
tie_up_ecentity(z3,z5), ecentity_nationality(z5,'japan (COUNTRY)')]
z6 <-- location(_)
Props: [dbminfo(z6,
[[value,'Japan (COUNTRY)'],
[type,COUNTRY],
[orig_text,'JAPAN']])]
z7 <-- activity(_)
Props: [activity_activity_site(z7,z9),
activity_industry(z7,z8),
tie_up_activity(z3,z7)]
z8 <-- industry(_)
Props: [industry_product_service(z8,z10),
activity_industry(z7,z8)]
z9 <-- activity_site(_)
Props: [activity_activity_site(z7,z9)]
z10 <-- product_service(_)
Props: [industry_product_service(z8,z10)]
(4) This is a bug in the ontology: TRADING HOUSE should have been classified as a subtype of 'company', rather than as an
'other_ecentity'.
z3 is the tie_up involving z2 (BRIDGESTONE/CO/LOCAL CONCERN) and z5 (TRADING HOUSE), z6 is a floating
location and z7 to z10 are as yet unknown parts of the tie_up activity.
So how well have we done on this sentence? We have found a company called BRIDGESTONE, although
probably not quite the right one (in mixed case we would have got it, however). We have correctly identified it
and the trading house as partaking in a joint venture, but got a bit confused about the local concern. We haven't
managed to get anything about the activity, primarily because we don't know anything about golf clubs.
Turning now a little more briefly to sentence 5, again the lexical analyser falls foul of SPORTS, this time
returning BRIDGESTONE as before and a company called TAIWAN CO (and once again, in mixed case it would have
got it right). The presence of TAIWAN blocks the parsing of 20 MILLION NEW TAIWAN DOLLARS; a common
word here would have been successfully bridged, as would NEW. And of course, if we cannot get GOLF CLUBS,
it is not surprising that we do not get IRON AND " METAL WOOD " CLUBS either. So the parse ends up rather
piecemeal.
Nevertheless the discourse interpreter takes the joint venture, hypothesises participating entities and so forth,
and starts resolving co-references. Within the parse for the sentence alone, there are two companies (BRIDGESTONE
and the incorrectly identified TAIWAN CO) and the tie_up is looking for two companies, so they get unified (the
parser having missed the apposition which might have forced at least BRIDGESTONE to be the tie_up_activity).
This time there is also co-reference resolution with the existing model; in fact, the entire tie_up complex gets
unified. This results in a tie_up with now three participating companies: BRIDGESTONE, TAIWAN CO and JAPANESE
TRADING HOUSE. (Notice that the number of tie_up_ecentities is not constrained: it is only a presupposition that
there are two.)
Sentence 6 has almost no effect: nothing triggers company name recognition for BRIDGESTON, and not even
mixed case input would help us here. We do have a spelling corrector which we experimented with in the POETIC
project, and which would have corrected this type of error, but its computational expense was found to be out of
proportion to its usefulness. In the present domain it was decided that this was even more the case, since spelling
errors in this type of input text are actually extremely rare.
In sentence 7, the system picks out some vestige of most of the companies (THE NEW COMPANY, CASTING
CO, BRIDGESTONE, TAGA CO) and, as a bonus, A COMPANY, and yet again would have done much better in mixed
case. However it did not appreciate the significance of NEW, since its entry in the lexicon does not currently
have any semantic content, and so got very little of the structure right. THE NEW COMPANY got unified with
BRIDGESTONE, and the net effect was simply the addition of two floating companies TAGA and A to the model.
The company associated with CASTING doesn't make it out of the parser; clearly it only occurred in a less
favoured parse. The parser chooses between parses initially on grounds of the coverage of each parse, and then
subsequently on the basis of a "confidence" measure which has not been tuned to the MUC-5 task. This is in
theory a measure of the confidence of each parse calculated as a function of the semantic predicates used and
the number of entities to which they refer. Since this has not been perfected, it is possible for parses which are
actually 'better' to be discarded in favour of one which scores higher.
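The two-stage choice between parses can be sketched as a lexicographic comparison: coverage first, then confidence. The paper does not give the confidence formula, so the one below (predicates per distinct argument) is purely a stand-in, and the parse representation is likewise an assumption.

```python
def select_parse(parses, sentence_length):
    """Pick the parse with the greatest coverage, then highest confidence.

    Each parse is a list of fragments; each fragment is a dict with
    'span' (start, end) and 'predicates' (a list of tuples whose first
    element is the predicate name).
    """
    def coverage(parse):
        covered = set()
        for frag in parse:
            covered.update(range(*frag["span"]))
        return len(covered) / sentence_length

    def confidence(parse):
        # Stand-in measure only: predicates per distinct argument.
        preds = [p for frag in parse for p in frag["predicates"]]
        args = {arg for p in preds for arg in p[1:]}
        return len(preds) / max(len(args), 1)

    return max(parses, key=lambda p: (coverage(p), confidence(p)))
```

Because confidence is consulted only to break ties in coverage, a fragmentary but wide parse always beats a narrower one, whatever its content; an untuned confidence measure can then discard the 'better' of two equally wide parses, as happens with CASTING here.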
The processing of sentence 8 is also superficial, with only one noteworthy feature: CASTING is identified as a
company, despite being a common word without any company cues, because it has been seen before.
Sentence 9 has the distinction of having no effect whatsoever on the model, since all the information extracted
was already present.
Once all the sentences have been processed, the model is completed, by filling in key values by inference or
from defaults. In this case, an ecentity_relationship and the industry_type are added, resulting in the final model
shown in the appendix. From this model, the following template is produced:
<TEMPLATE-0592-1> :=
    DOC NR: 0592
    DOC DATE: 241189
    DOCUMENT SOURCE: "Jiji Press Ltd."
    CONTENT: <TIE_UP_RELATIONSHIP-0592-1>
<TIE_UP_RELATIONSHIP-0592-1> :=
    TIE-UP STATUS: EXISTING
    ENTITY: <ENTITY-0592-1>
            <ENTITY-0592-3>
            <ENTITY-0592-2>
    ACTIVITY: <ACTIVITY-0592-1>
<ENTITY-0592-1> :=
    NATIONALITY: Japan (COUNTRY)
    TYPE: COMPANY
    ENTITY RELATIONSHIP: <ENTITY_RELATIONSHIP-0592-1>
<ENTITY-0592-2> :=
    NAME: BRIDGESTONE
    TYPE: COMPANY
    ENTITY RELATIONSHIP: <ENTITY_RELATIONSHIP-0592-1>
<ENTITY-0592-3> :=
    NAME: TAIWAN CO
    TYPE: COMPANY
    ENTITY RELATIONSHIP: <ENTITY_RELATIONSHIP-0592-1>
<INDUSTRY-0592-1> :=
    INDUSTRY-TYPE: PRODUCTION
<ENTITY_RELATIONSHIP-0592-1> :=
    ENTITY1: <ENTITY-0592-2>
             <ENTITY-0592-3>
             <ENTITY-0592-1>
    REL OF ENTITY2 TO ENTITY1: PARTNER
    STATUS: CURRENT
<ACTIVITY-0592-1> :=
    INDUSTRY: <INDUSTRY-0592-1>
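The completion step that precedes template generation, in which key values are filled in by inference or from defaults, might look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the dictionary layout, the `DEFAULTS` table, and the rule that several entities in one tie-up are partners are not taken from the system itself; only the resulting values (PRODUCTION, PARTNER, CURRENT) echo the template above.

```python
import copy

# Hypothetical default table; the value mirrors the template above.
DEFAULTS = {"industry_type": "PRODUCTION"}


def complete_model(model):
    """Return a copy of the model with inferred and default values filled in."""
    completed = copy.deepcopy(model)
    entities = completed.get("entities", [])
    # Inference (assumed rule): several entities in one tie-up are partners.
    if len(entities) >= 2 and "entity_relationship" not in completed:
        completed["entity_relationship"] = {
            "members": list(entities),
            "relation": "PARTNER",
            "status": "CURRENT",
        }
    # Defaults apply only where earlier processing left a key value empty.
    for key, value in DEFAULTS.items():
        completed.setdefault(key, value)
    return completed


model = {"entities": ["ENTITY-1", "BRIDGESTONE", "TAIWAN CO"]}
final = complete_model(model)
print(final["industry_type"], final["entity_relationship"]["relation"])
```

Working on a copy keeps the model built during sentence processing separate from the values that were only inferred or defaulted at the end.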
P&R F-Measure     32.54
Table 1: Full Template Scores
OBSERVATIONS
The walkthrough example does not show our system off at its best, for a number of reasons. The first is
the lack of mixed case. The system's preference for mixed case is more than just naive coding: the only way to
deal with caseless data is to generate more lexical hypotheses, a situation which we are keen to avoid. In any
case, while we would have got more accurate company names with mixed case, it is not clear that many of the
structural problems would go away. To fix those, one needs to look at the parser or, more particularly, the grammar.
In POETIC, the grammar is remarkably stable across police sublanguages, presumably because most of the
sublanguage variation is lexical rather than grammatical. However, this grammar is less appropriate for the Joint
Ventures domain, where the language is much closer to standard English. Hence a fair amount of development
work is required to achieve a proper port and, of course, only a fragment of this work has been done so far. Many
of our problem areas are the standard ones (apposition, coordination, etc.), but it would be interesting to see
whether our architecture offers more pragmatic solutions. If the parser can be persuaded to pass on just a little
more grammatical information, it might be possible for the discourse interpreter to make the right associations
more robustly. This is already happening to some extent: prepositions often indicate relationships between noun
phrases that are difficult to capture syntactically. In the grammar, such links are reported as being simply an
abstract relation between entities, and it is up to the discourse interpreter, with its far richer context, to decide
what more specific relationship is appropriate.
A further weakness that the walkthrough example shows up well is our inability to deal with many objects of
the same type, notably companies. In the traffic domain, it was rarely important to distinguish entities beyond
their type, and so the co-reference resolution strategies used are rather eager and arbitrary. Finer control, and
perhaps even backtracking, is clearly needed here.
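The difference between eager, type-only co-reference and a finer-grained strategy can be caricatured in a few lines of Python. This is a deliberately simplified sketch, not the system's resolution code: the `Entity` record and both resolver functions are hypothetical, and the "finer" variant only checks names, with no backtracking.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    etype: str
    names: set = field(default_factory=set)


def resolve_eager(model, etype, name=None):
    """Merge a mention with the first entity of the same type, whatever its name."""
    for entity in model:
        if entity.etype == etype:
            if name:
                entity.names.add(name)
            return entity
    entity = Entity(etype, {name} if name else set())
    model.append(entity)
    return entity


def resolve_by_name(model, etype, name=None):
    """Merge only if the mention is unnamed or its name is already known."""
    for entity in model:
        if entity.etype == etype and (name is None or name in entity.names):
            return entity
    entity = Entity(etype, {name} if name else set())
    model.append(entity)
    return entity


def build(resolver, mentions):
    model = []
    for etype, name in mentions:
        resolver(model, etype, name)
    return model


mentions = [("COMPANY", "BRIDGESTONE"), ("COMPANY", "TAGA"), ("COMPANY", "CASTING")]
eager = build(resolve_eager, mentions)
strict = build(resolve_by_name, mentions)
print(len(eager), len(strict))  # 1 3
```

Type-only matching collapses three distinct companies into one entity, which is harmless when entities rarely need to be told apart and damaging in a domain built around several companies per story.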
Finally, a few words on evaluation. Table 1 gives the results for the final evaluation run, scoring against the
full templates. Table 2 gives the results when scored against the reduced template that was used in the dry run
test two months before the final run. These figures are noticeably better and arguably a better assessment of the
system, because the reduced template corresponds more closely to the system's abilities. The reason for this is
that the full template included a number of objects and slots not present in the reduced template, which our
system never attempted to fill: we simply did not have time to develop any code to support them. These, of
course, count as errors in the full template score, but are ignored in the reduced template score.
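To make the effect of unattempted slots concrete, the sketch below computes the four error-based measures from slot-level tallies, assuming the usual MUC-style definitions (correct, partial, incorrect, missing and spurious fills, with a partial match counted as half an error). The tallies are invented for illustration and are not the evaluation figures.

```python
def error_measures(cor, par, inc, mis, spu):
    """Return (ERR, UND, OVG, SUB) as percentages.

    Assumes at least one attempted fill, so no denominator is zero.
    """
    possible = cor + par + inc + mis  # fills present in the answer key
    actual = cor + par + inc + spu    # fills produced by the system
    wrong = inc + par / 2             # a partial match counts as half an error
    err = 100 * (wrong + mis + spu) / (cor + par + inc + mis + spu)
    und = 100 * mis / possible
    ovg = 100 * spu / actual
    sub = 100 * wrong / (cor + par + inc)
    return err, und, ovg, sub


# Hypothetical tallies: the second call adds 25 key slots that were never attempted.
reduced = error_measures(cor=30, par=4, inc=10, mis=20, spu=6)
full = error_measures(cor=30, par=4, inc=10, mis=20 + 25, spu=6)
for name, scores in (("reduced", reduced), ("full", full)):
    print(name, " ".join(f"{value:.1f}" for value in scores))
```

Under these definitions the extra missing fills raise ERR and UND, while OVG and SUB stay the same, since they depend only on what the system actually produced.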
Footnote: It is interesting to note here that the walkthrough example alone gave very similar figures, and so is fairly average for our system.
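ERROR-BASED METRIC
                  ERR  UND  OVG  SUB
ALL OBJECTS:       78   59   25   38
RICHNESS-NORMALIZED ERROR
                  Min-err  Max-err
                  0.882    0.9062
RECALL AND PRECISION
                  REC  PRE
                   25   46
Table 1 (continued): Full Template Scores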
ERROR-BASED METRIC
                  ERR  UND  OVG  SUB
ALL OBJECTS:       74   51   23   38
RICHNESS-NORMALIZED ERROR
                  Min-err  Max-err
                  0.5311   0.5457
RECALL AND PRECISION
                  REC  PRE
                   30   48
P&R F-Measure     37.14
Table 2: Reduced Template Scores
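As a rough check on the tables, the P&R F-measure can be recomputed from the rounded recall and precision figures, assuming the standard formula F = (β² + 1)PR / (β²P + R) with β = 1. The function below is an illustrative sketch, not the official scoring software.

```python
def f_measure(recall, precision, beta=1.0):
    """P&R F-measure, with recall and precision given as percentages."""
    if recall == 0 and precision == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


full = f_measure(recall=25, precision=46)     # Table 1 figures
reduced = f_measure(recall=30, precision=48)  # Table 2 figures
print(round(full, 2), round(reduced, 2))  # 32.39 36.92
```

The recomputed values fall slightly below the reported 32.54 and 37.14, presumably because the reported scores were computed from unrounded recall and precision.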
ACKNOWLEDGEMENTS
The research described in this paper would not have been possible but for the generous support of Integral
Solutions Ltd., Racal Research Ltd., Sussex University (COGS) and the Advanced Research Projects Agency,
Software and Intelligent Systems Technology Office (contract no. N66001-90-D-0192, subcontract 19-940067-31
to SAIC). Evans is supported by an SERC Advanced Fellowship.
Special thanks also to Jeremy Crowe in Edinburgh, who provided many useful insights about the training
corpora, and the code for removing reported speech.
The system has been developed entirely within the POPLOG programming environment, combining Pop11,
Prolog, Lisp and C. Our thanks to the POPLOG development and support team at Sussex and ISL for providing
such a productive environment to work with.
POPLOG is a trademark of the University of Sussex. Copyright for the Bridgestone Sports article is held by
Jiji Press Ltd. (used with permission).
References
[1] D. Allport. The TIC: parsing interesting text . In Proceedings of the Second ACL Conference on Applie d
Natural Language Processing, pages 211-218, 1988 .
[2] E .J . Briscoe, C. Grover, B .K Boguraev, and J . Carroll . A formalism and environment for the development o f
a large grammar of English . IJCAI-87, 2:703-708, 1987 .
[3] L. J. Cahill. Some reflections on the conversion of the TIC lexicon to DATR . In Default Inheritance Within
Unification-Based Approaches to the Lexicon. Cambridge University Press, Cambridge, 1992.
[4] L .J . Cahill and R. Evans. An application of DATR : the TIC lexicon . In Proceedings of the 9th Europea n
Conference on Artificial Intelligence, pages 120-125, Stockholm, 1990 .
[5] L .J . Cahill, R. Gaizauskas, and R . Evans. POETIC : a fully-implemented NL system for understanding traffi c
reports . In Fully Implemented Natural Language Understanding Systems: Proceedings of the Trento Work -
shop, March 30. 1992 (IWBS Report No. 236), pages 86-99, IBM Institute for Knowledge Based Systems ,
Heidelberg, 1992 .
[6] L . Carlson and S. Nirenburg . Practical world modeling for NLP applications . In Proceedings of the Third Con-
ference on Applied Natural Language Processing, pages 235-236 . Association for Computational Linguistics ,
1992 .
332
[7] E. Charniak and D. McDermott . Introduction to Artificial Intelligence . Addison-Wesley, Reading, Mass .,
1985.
[8] K . Dahlgren . Naive Semantics for Natural Language Understanding . Kluwer, Boston, 1988 .
[9] R. Evans and G . Gazdar. Inference in DATR . Proceedings of the Fourth Conference of the European Chapte r
of the Association for Computational Linguistics, 1989 .
[10] R. Evans and G. Gazdar. The semantics of DATR. In A. Cohn, editor, Proceedings of the Seventh Conferenc e
of the Society for the Study of Artificial Intelligence and Simulation of Behaviour, pages 79-87 . Pitman,
London, 1989 .
[11] R. Evans and A.F. Hartley. The traffic information collator . Expert Systems: The International Journal of
Knowledge Engineering, 7(4):209-214, 1990 .
[12] J.R . Hobbs, M. Stickel, P. Martin, and D. Edwards. Interpretation as abduction . In Proceedings of the 26th
Conference of the Association for Computational Linguistics, Buffalo, N .Y., 1988 .
[13] C. Mellish, D . Allport, A.F . Hartley, R . Evans, L . J . Cahill. R. Gaizauskas, and J . Walker . The TIC message
analyser. Cognitive Science Research Paper 225, University of Sussex. 1992 . Submitted for publication .
[14] C. S. Mellish, R . Evans, and J . Walker . The TIC project final report . Cognitive Science Research Paper 208,
Cognitive and Computing Sciences, University of Sussex, 1991 .
[15] R.C. Schank and R .P. Abelson. Scripts, Plans. Goals and Understanding. Lawrence Erlbaum, Hillsdale, N .J . .
1977.
333
APPENDIX
[Figure: final model for the walkthrough example; content not recoverable from the extracted text.]
