<?xml version="1.0" standalone="yes"?> <Paper uid="M98-1005"> <Title>UNIVERSITY OF DURHAM: DESCRIPTION OF THE LOLITA SYSTEM AS USED IN MUC-7</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> ARCHITECTURE Overview </SectionTitle> <Paragraph position="0"> LOLITA is designed as a core system supplemented with a set of applications, the former supplying basic NL facilities to the latter. Figure 1 shows the MUC-relevant parts. The most important part of the core is the large knowledge base, which is called the 'Semantic Network', SemNet or net, for short. It is heavily used in most stages of analysis, and the results of analysis are added to it, as a disambiguated logical representation of the input. The analysis stages are fairly standard, and are arranged in a pipeline. Each is implemented in a rule-based way. The system does not currently use any form of stochastic or adaptive techniques in the main system.</Paragraph> <Paragraph position="1"> The applications can then read the results of the analysis from the SemNet, and generally interrogate the contents of the SemNet. Some central 'support' facilities are provided to aid application writing, such as the general template mechanism and the NL generator - which translates pieces of the SemNet into English. More detail on the architecture of LOLITA can be found elsewhere [2].</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> The Semantic Network </SectionTitle> <Paragraph position="0"> The SemNet is a 100,000+ node, directed hyper-graph. Each node has a set of links, plus a set of 'control variables' (or controls). Some nodes have an associated 'name': this is usually a single word which characterises the meaning of the node. Each link has an arc and a set of targets. Targets are other nodes, and the arc too is just a node. Nodes correspond to concepts of entities or events. Links correspond to relationships between nodes. Since an arc is also a node, the concepts of the different kinds of relationship possible between nodes can be represented in the same formalism as more concrete concepts. In this system, the 'meaning' of any particular node is given by its connections, its relative position in the net.</Paragraph> <Paragraph position="1"> Controls indicate basic information about a node, such as its type (e.g., event, entity, relation), its family (e.g., human, inanimate, food, organisation), its lexical type (e.g., noun, preposition, adverb) - as appropriate. An important control is a node's rank: this encodes quantification information. Concepts of general sets have a Universal rank, specifically named objects have a Named Individual rank, and general individuals an Individual rank. There are several other less important ranks, used for things like encoding script-like information or existential quantification. Controls could be represented using links, but for efficiency reasons a more compact form is used.</Paragraph> <Paragraph position="2"> There are approximately 60 different arcs. The arcs subj_, action_, and object_ are used to represent the basic roles of an event. Events can have other arcs, such as those indicating temporal information, the status of the information (e.g., known fact, hypothesis, etc), or arcs that indicate the source of the information. 
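To make the formalism concrete, the following is a minimal Haskell sketch of how nodes, links and controls of the kind just described might be declared. The type and field names are illustrative assumptions rather than LOLITA's actual code (the real SemNet data structure is implemented in C, as noted later); the example node corresponds loosely to the 'retire' event of Figure 2.

    -- Minimal illustrative sketch of the SemNet formalism (not LOLITA's actual code).
    -- A node carries an optional name, a set of links and a set of control variables;
    -- each link pairs an arc (itself just a node) with a set of target nodes.
    type NodeId = Int

    data Rank = Universal | NamedIndividual | Individual
      deriving (Show, Eq)

    data Control
      = NodeType String      -- e.g. "event", "entity", "relation"
      | Family   String      -- e.g. "human", "inanimate", "organisation"
      | LexType  String      -- e.g. "noun", "preposition", "adverb"
      | NodeRank Rank
      deriving (Show, Eq)

    data Link = Link
      { arc     :: NodeId    -- the arc is itself a node, e.g. subj_, action_, object_
      , targets :: [NodeId]
      } deriving Show

    data Node = Node
      { nodeId   :: NodeId
      , nodeName :: Maybe String   -- optional single word characterising the node
      , links    :: [Link]
      , controls :: [Control]
      } deriving Show

    -- A hypothetical event node for a sentence about a chairman retiring:
    -- its subj_ and action_ roles each point to other nodes in the net.
    retireEvent :: Node
    retireEvent = Node
      { nodeId   = 1001
      , nodeName = Nothing
      , links    = [ Link subj_   [chairman1]
                   , Link action_ [retire] ]
      , controls = [NodeType "event", NodeRank Individual]
      }
      where
        subj_ = 1; action_ = 2           -- arc nodes (ids arbitrary in this sketch)
        chairman1 = 2001; retire = 2002  -- concept nodes for chairman1[I] and 'retire'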
Most arcs also have inverses: e.g., the subject_ arc has the inverse subject_of_, which allows determination of the events in which a particular concept played the subject role.</Paragraph> <Paragraph position="3"> Figure 2 is given here as an example of SemNet structure, and its meaning is discussed in the section on the semantic network. The full structure is not shown, for reasons of space.</Paragraph> <Paragraph position="4"> Concepts are connected with arcs such as specialisation_ (and its inverse, generalisation_), or instance_ (inverse universal_). Specialisation links a set to one possible subset; for example, in Figure 2, chairman[U] represents the set of all possible chairmen, and old_chairman[U] the set of all possible old chairmen. Between the former and the latter is a specialisation_ link, indicating that old chairmen are a subset of chairmen. Conversely, the latter is linked to the former with a generalisation_ link, representing a superset. Using the specialisation_ link, hierarchies of concepts are specified. The instance_ arc connects a concept to an instance of that concept: e.g., a particular chairman chairman1[I] would be linked to chairman[U] by an instance_ link. Other links between concepts include synonym_ and antonym_.</Paragraph> <Paragraph position="5"> The SemNet is used to hold several kinds of information: * Concept hierarchies: built with arcs such as generalisation_, concept hierarchies encode knowledge like &quot;man is_a mammal is_a vertebrate&quot; etc. They prevent duplication of information by allowing information to be inherited within the hierarchy.</Paragraph> <Paragraph position="6"> * Lexical information: actual words are represented in the net, and their properties are stored there, rather than in a separate lexicon. The lexical-level nodes are indexed via a simple dictionary: i.e., a mapping from root words to all the senses of that word. Note that the lexical forms are distinct from the concepts: they are linked by a concept_ arc. Concepts are linked to lexical forms by a link named after the language of interest. For example, dog[U] has a link english to the noun form of 'dog', and a link italian to the Italian word 'cane'.</Paragraph> <Paragraph position="7"> * Prototypical events: these define restrictions on events by providing 'templates' for events, e.g., by imposing selectional restrictions on the roles in an event. &quot;Human owners own things&quot; says that only humans can take the subject role in 'ownership' events.</Paragraph> <Paragraph position="8"> * General events: other kinds of information. For example, the content of a MUC article would come in this class, when analysed.</Paragraph> <Paragraph position="9"> The bulk of the net (70%) comes from WordNet, a database containing lexical and semantic information about word forms in English [3]. More details about the formalism used in the net can be found in [4].</Paragraph> <Paragraph position="10"> Referring back to the Original Text Before MUC-6, LOLITA did not have a method of referring back to its input: the previous orientation was to move from language-dependent surface forms to a language-independent logical representation. Therefore, information about the surface form was discarded. Since the ability to refer back to the original text has many uses outside of the MUC tasks, a more general mechanism was designed and added to the core. It allows fine-grained connection of the analysis results to the sections of the document giving rise to those results. 
The system allocates new SemNet nodes to components of the document (words, phrases, sentences, ...), which act as references into the document. This is called the 'Textref' system.</Paragraph> <Paragraph position="11"> Textrefs allow the document structure to be fully represented in the net, and represented uniformly with the other information in the system. At the word level, a Textref signifies a specific occurrence of a word at a certain position in the input, and is distinct from the nodes representing the lexical or semantic forms of its root form. It is an instance_ of the universal concept of all occurrences of that word. Concept nodes and Textref nodes are linked by an event with the internal action words_used. Two examples may be seen in Figure 2: single words are attached to the 'key' words of the sentence (only 'retire' is shown), and all of the Textrefs in the sentence are attached to the node representing the whole event.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Text Pre-processing </SectionTitle> <Paragraph position="0"> Core analysis of textual input starts from a LOLITA-specific SGML representation of the input (called an SGML tree). Individual applications must convert from their own formats (e.g., plain text, MUC articles, LaTeX, HTML, ...) into this internal format. The MUC converter is just a simple SGML parser. The preprocessor then adds additional structure to the internal SGML tree where necessary. In particular the following structures are handled in the order given: reported speech, paragraphs, sentences and words. Markers for reported speech are distributed over all sentences inside the quotes. Lastly, each word is allocated a Textref.</Paragraph> <Paragraph position="1"> Morphology Morphology is applied to an SGML tree whose leaves are individual word tokens, and whose nodes represent the structure of the document. A few transformations are done on this structure to unpack contractions (e.g., &quot;I'll&quot; expanded to &quot;I will&quot;), expand monetary and numeric expressions (e.g., &quot;$10 million&quot; to &quot;10 million dollars&quot;), and to transform certain surface-level idiomatic phrases (e.g., &quot;in charge of&quot;). Some splitting of hyphenated words is also done. Then, the basic morphology function is mapped on to all leaves (with additional treatment provided for sentence initial words).</Paragraph> <Paragraph position="2"> Lookups in the dictionary are done with the root forms suggested by affix stripping. If successful, a word is linked to lexical and semantic nodes, allowing access to lexical and semantic information during the rest of morphology, parsing, and semantics. Affix stripping loses information such as number and case, so this information is represented using a Feature system. Features are used in parsing (described below). Other Features include word class (Noun, Verb, ...) and some semantic-based Features. Finally, possible syntactic categories for a word are determined from the lexical (and sometimes semantic) node information. Thus, each leaf is mapped to a set of alternatives, varying in category and Features, which represent all possible interpretations of that leaf.</Paragraph> <Paragraph position="3"> Parsing The parsing mechanism utilised in MUC-6 consisted of five stages: 1. A pre-parser which identifies and provides structure for monetary expressions.</Paragraph> <Paragraph position="4"> 2. Parsing of whole sentences using the Tomita algorithm [5]. 
The result of this stage is a &quot;parse forest&quot;, a directed acyclic graph which indicates all possible parses. Due to the complexity of the grammar, this forest is frequently very large, implying many possible parses.</Paragraph> <Paragraph position="5"> 3. Decoding of the parse forest. The forest is selectively explored from the topmost node, using heuristics such as Feature consistency and hand-assigned likelihoods of certain grammatical constructions. Feature errors and unlikely pieces of grammar involve a cost: the aim of the search is to extract the set of lowest-cost trees. 4. Selection of best parse tree: subsequent analysis operates on a single tree. The lowest-cost set is ordered on the basis of several heuristics on the form of the tree, for example preferring a deeper tree.</Paragraph> <Paragraph position="6"> 5. Normalisation: syntax-based, meaning-preserving transformations are applied to the trees to reduce the number of cases required in semantics. A prime example of this is passive to active, i.e., &quot;I was bitten by a dog&quot; changed to &quot;A dog bit me&quot;. Another class involves transformations such as &quot;You are surprised&quot; to &quot;*SOMETHING* surprised you&quot;, which makes explicit the object doing the surprising.</Paragraph> <Paragraph position="7"> Although this mechanism remains at the core of the parsing for MUC-7, a number of additional strategies have since been included. These are described below in the section detailing the main changes to the system since MUC-6.</Paragraph> <Paragraph position="8"> An example parse is given in Figure 3. Note that 'will' and 'as' are missing. As so-called function words, they don't carry much inherent semantic meaning, so the tense information of 'will' is transferred to the Features of the main verb, and the copula function of 'as' is transformed into a syntactic construct. This simplifies the semantic rules.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Analysis of Meaning </SectionTitle> <Paragraph position="0"> This section describes how the parse tree is converted to a disambiguated piece of SemNet. There are two stages to this: 'semantic' and 'pragmatic'. The semantic analysis is generally compositional in nature: the meaning of a tree is built from the meanings of its subtrees. A mechanism goes through the parse tree in depth-first, post-order traversal, applying semantic rules mainly on the basis of the syntactic phrase type of the current tree node. If the meaning of a particular subtree is unambiguous in role, the Textrefs for the text in that subtree are connected to that meaning. Since the meanings can be nodes which already have Textrefs connected, particular nodes can collect Textrefs for all occurrences of their mention. This Textref handling is completely invisible to the semantic rules.</Paragraph> <Paragraph position="1"> A state value, the 'context', is passed around during traversal: this holds possible referents in order of occurrence, and is used to resolve anaphoric expressions. Use of this context prevents the semantics from being purely compositional.</Paragraph> <Paragraph position="2"> The 'meaning' of most leaves is the semantic node associated with the word at the morphology stage. The node is passed to the leaf's parent in the form of a 'role' structure, which indicates the role the node may play in the semantics of the parent. Often this is unknown, but in cases like verbs, it can be determined as the act. 
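A rough Haskell sketch of this compositional, post-order construction of meanings is given below. All type and function names are hypothetical and the rules are caricatures; the point is only to illustrate meanings being built bottom-up, with a list of alternative nodes standing in for unresolved ambiguity.

    -- Illustrative sketch only: each subtree yields a 'Role' describing how its
    -- meaning may function in the parent; several alternatives mean the subtree
    -- is still semantically ambiguous.
    type NodeId = Int

    data PhraseType = NP | VP | S deriving (Show, Eq)

    data ParseTree = Leaf String [NodeId]          -- a word and its candidate sense nodes
                   | Phrase PhraseType [ParseTree] -- a syntactic phrase over subtrees

    data RoleKind = Act | Subj | Obj | UnknownRole deriving (Show, Eq)

    data Role = Role
      { roleKind     :: RoleKind
      , alternatives :: [NodeId]   -- more than one entry = unresolved ambiguity
      } deriving Show

    -- Post-order traversal: subtrees are interpreted first, then a rule keyed
    -- mainly on the phrase type combines their roles into the parent's meaning.
    semantics :: ParseTree -> Role
    semantics (Leaf _ senses)     = Role UnknownRole senses
    semantics (Phrase ptype kids) = combine ptype (map semantics kids)

    -- A toy stand-in for the semantic rules.
    combine :: PhraseType -> [Role] -> Role
    combine VP (verb : _) = Role Act (alternatives verb)  -- verbs supply the act
    combine _  (r : _)    = r                             -- placeholder behaviour
    combine _  []         = Role UnknownRole []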
The actual role structure allows for representation of semantic ambiguity.</Paragraph> <Paragraph position="3"> The main task of the pragmatic stage is disambiguation and type checking. Lexical ambiguities and anaphora are resolved using a series of preference heuristics which are first applied to disambiguate the action of the event. Once the action is known, any knowledge available from the prototype event associated with that action can be used to rule out pragmatically implausible readings, as well as to aid disambiguation of the remaining elements of the event (in the spirit of [6]).</Paragraph> <Paragraph position="4"> The contents of the current context together with the topic of the text (the latter is given to the system in advance) influence the choice of word senses: preference is given to meanings that are semantically closer to the meanings present in the context or the topic, where semantic closeness is computed on the basis of the distance between nodes in the network. Other factors may cause one concept to be preferred over others, such as the amount of knowledge the system has about a given concept, or the concept's frequency of use.</Paragraph> <Paragraph position="5"> Once an event is disambiguated, the system attempts to establish plausible connections between it and the previously processed discourse.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Reference Resolution </SectionTitle> <Paragraph position="0"> As the discourse is processed, the referents found in it are stored in the 'Context' buffer. Each time an anaphoric expression is identified in the incoming discourse, the system looks for a possible referent for this expression in the Context buffer (obeying matching rules dictated by the type of anaphoric expression). If the system finds no match, it introduces a new entity into the Context. If the system finds just one match, it unifies the two and adds the newly unified item into the Context. If the system finds more than one match, it builds a special structure to represent the ambiguity and passes it on to the system of preference heuristics to decide between the possibilities.</Paragraph> <Paragraph position="1"> The heuristics are loosely based on ideas from centering theory [7], psycholinguistic findings, and common sense. They assess the salience of the candidates based on grammatical and semantic features, as well as their position in the sentence, recency of mention and relatedness to the topic of the text. As in the whole of the LOLITA system, the algorithm relies heavily on correct parsing and semantic analysis.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Template Support </SectionTitle> <Paragraph position="0"> The processes involved in producing templates can be generalised; hence the core contains a mechanism to help write templates at an abstract level. This mechanism handles search through the net, use of inference rules to derive implicit facts, and general output formatting.</Paragraph> <Paragraph position="1"> A template contains a predefined set of slots with associated fill-in rules that direct the search for appropriate information in the net. The slot fill-in rules are predicates that check node controls, or use the inference functions available in the core. For more details see [1].</Paragraph> <Paragraph position="2"> LOLITA is written mostly in Haskell, a non-strict functional programming language [8]. 
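As a flavour of what such slot fill-in rules could look like in Haskell, here is a simplified, hypothetical sketch; the helper functions standing in for the core's control checks and net search are assumptions for illustration, not the system's actual API.

    -- Hypothetical sketch of template support: a slot pairs a name with a
    -- predicate over SemNet nodes, and filling a template searches the net
    -- for nodes satisfying each slot's predicate.
    type NodeId = Int

    data SemNet = SemNet
      { familyOf :: NodeId -> Maybe String   -- stand-in for checking node controls
      , allNodes :: [NodeId]                 -- stand-in for search through the net
      }

    data Slot = Slot
      { slotName :: String
      , slotRule :: SemNet -> NodeId -> Bool -- fill-in rule: a predicate on a node
      }

    data Template = Template
      { templateName :: String
      , slots        :: [Slot]
      }

    -- One slot rule: accept nodes whose family control marks them as organisations.
    organisationSlot :: Slot
    organisationSlot = Slot "ORGANISATION" rule
      where rule net n = familyOf net n == Just "organisation"

    -- Filling a template: for each slot, collect the nodes that satisfy its rule.
    fillTemplate :: SemNet -> Template -> [(String, [NodeId])]
    fillTemplate net tmpl =
      [ (slotName s, filter (slotRule s net) (allNodes net)) | s <- slots tmpl ]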
Two resource-critical sections are written in C - the parsing algorithm and the SemNet data structure and its access functions. Haskell has some similarity to LISP, such as building programs by writing functions, a garbage-collected heap, lists as a basic type, and full higher-order use of functions. However, it provides excellent support for modern Software Engineering, such as modularity, constrained polymorphism, and a strong but flexible type system. It also enforces referential transparency and allows coding in a 'lazy' style, which means code is not executed unless needed. Thus, whilst our system has the external appearance of a pipeline architecture, the evaluation of individual pieces of code need not occur in that strict order.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> IMPROVEMENTS CARRIED OUT SINCE MUC-6 </SectionTitle> <Paragraph position="0"> The system that entered the 6th Message Understanding Conference suffered from three major problems. First, there was room for improvement in parsing. Second, the named entity recognition rate was fairly low, as compared with other systems. Third, the system contained a series of trivial errors in the code. Altogether, these three major shortcomings resulted in a considerable drop in performance.</Paragraph> <Paragraph position="1"> In the general approach adopted by the LOLITA project, every core component plays an important role in the final result. Consequently, if any of the components is unsatisfactory, overall performance is affected. This is especially pronounced if an early stage of analysis (e.g., parsing) is incorrect.</Paragraph> <Paragraph position="2"> Many of the problems encountered during MUC-6 have been addressed and several improvements to the system have been carried out since that time. The most substantial of these improvements are discussed in the remainder of this section.</Paragraph> <Paragraph position="3"> Changes to the Grammar and Parsing Components Parsing can sometimes fail on very large forests: decoding these requires a lot of resources (time, memory). Rather than cause a crash due to overrunning limits, the parse is abandoned. This is implemented by fixing a time limit on the process - resource usage being proportional to time: the expiry of the time limit is referred to as a 'timeout'. It is also possible for parses to fail if the sentence can't be analysed with the main grammar. In the system used for MUC-6, if the parse failed, analysis was discontinued on that sentence. This meant that no semantic result was produced and hence no information was available on NEs, etc., in the sentence. MUC-6 texts which contained sentences that timed out would therefore receive poor scores. For MUC-7, a number of improvements to the parsing mechanism have been adopted, including a recovery strategy for sentences that failed to parse.</Paragraph> <Paragraph position="4"> LOLITA's grammar has been improved and expanded to allow better parsing of the materials used in MUC-6. Furthermore, a new method for handling headlines in the articles has been added. As well as being given a special grammar, the headlines are now analysed at the end of the article, using as context the initial sentences of the main body of the text.</Paragraph> <Paragraph position="5"> The parsing mechanism itself has been improved. Island parsing, whereby easily recognisable noun phrases are 'locked' into units before being passed on to the parser, has been introduced. 
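The idea behind island parsing can be illustrated with a small, purely hypothetical sketch: easily recognised noun-phrase 'islands' (here crudely approximated as runs of capitalised words) are locked into single units before the sentence reaches the main parser. The function and token names below are assumptions for illustration, not LOLITA's code.

    -- Illustrative sketch of island parsing: obvious noun phrases are grouped
    -- into single tokens so the main parser sees fewer ambiguous units.
    import Data.Char (isUpper)

    data Token = Word String        -- an ordinary word, parsed as usual
               | Island [String]    -- a pre-recognised noun phrase, treated as one unit
               deriving Show

    isCapitalised :: String -> Bool
    isCapitalised (c:_) = isUpper c
    isCapitalised []    = False

    -- Lock maximal runs of capitalised words into islands.
    lockIslands :: [String] -> [Token]
    lockIslands [] = []
    lockIslands ws@(w:_)
      | isCapitalised w = let (island, rest) = span isCapitalised ws
                          in Island island : lockIslands rest
      | otherwise       = let (plain, rest) = break isCapitalised ws
                          in map Word plain ++ lockIslands rest

    -- lockIslands (words "the TWA Flight 800 crashed near East Moriches")
    --   = [Word "the", Island ["TWA","Flight"], Word "800", Word "crashed",
    --      Word "near", Island ["East","Moriches"]]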
Island parsing has improved the parsing success rate substantially.</Paragraph> <Paragraph position="6"> Moreover, two extra passes have been added should parsing fail: a second pass using Brill's tagger [9] and a third pass using a reduced grammar. These are aimed at recovering constituents of complex sentences if a full parse isn't possible.</Paragraph> <Paragraph position="7"> Finally, in cases where all three parsing passes fail, a way of recovering most named entities, all the pronouns, possessive determiners and some noun phrases (particularly those related to the topic of the text, if the latter can be provided in advance) has been devised.</Paragraph> <Paragraph position="8"> Changes to the NE Recognition Components The components responsible for named entity recognition have been revised and many new rules have been added. A major change has been introduced to LOLITA's morphology module, which allows the system to reuse names of entities previously recognised in the preceding text, rather than treat the entities in each sentence of the incoming text as new. (Previously, the morphology module had no access to the results of the semantic and pragmatic analysis of the preceding text.) A change in the treatment of unknown proper names that appear without clear designators (e.g., Corp, Ltd, Mrs.) has been introduced. In the MUC-6 system, a decision as to what type of entity an unknown name stood for was made early and usually resulted in the conclusion that it must stand for an organisation. The new treatment, on the other hand, introduces the concept of human_or_organisation, which allows the decision to be delayed until some disambiguating information becomes available at the pragmatics stage. For example, given the following first sentence of an article: Shortly after Fossett's launching Monday his competitors sent him telegrams of congratulation. The system cannot decide what sort of entity Fossett is on the basis of the name itself. However, the use of the pronoun his, as well as the absence of any other possible referents, provides the disambiguating clues.</Paragraph> <Paragraph position="9"> Changes to the Semantic and Pragmatic Components At the semantic level, several rules that had previously been missing but were needed to handle expressions common in MUC-6 articles have been added. New rules were also needed due to the introduction of new constructions in the grammar.</Paragraph> <Paragraph position="10"> In the pragmatics component, the preference heuristics system has been substantially revised and expanded. In the MUC-6 system, the heuristics acted as filters and so rejected any non-preferred candidates. This sometimes resulted in rejecting a candidate which didn't match one of the heuristics that was applied at an early stage. The same candidate could have been favoured by several later heuristics, but this had been ignored.</Paragraph> <Paragraph position="11"> Currently, the preference heuristics assign penalty points to non-preferred items and, at the end of their application, the candidate with the fewest penalties is chosen as the referent.</Paragraph> <Paragraph position="12"> Increase in Basic Data Since the time of MUC-6, a lot of data concerning organisation names, corporate designators, personal names and place names has been added to LOLITA's knowledge base (SemNet). 
The additions include names of major US institutions and organisations (e.g., government departments), names of newspapers, names of major geographical locations in the USA, US state abbreviations, and names of countries and nationalities of the world. Also, about 8000 new forenames have been added and all the existing forenames have been checked to ensure that they are marked correctly for gender.</Paragraph> <Paragraph position="13"> Text Output Errors Corrected Minor coding errors in the 'Textref' module of LOLITA resulted in the system occasionally inserting spurious space characters in some places while deleting them in others. This adversely affected the final result of the MUC-6 evaluation, because the scoring software is sensitive to any misalignments between the answer keys and the responses. Many of these kinds of errors were corrected before participation in MUC-7.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> MUC-7 SPECIFIC CHANGES </SectionTitle> <Paragraph position="0"> The work carried out during the preparations for MUC-7 was concentrated in four main areas. These are discussed in this section.</Paragraph> <Paragraph position="1"> Addition of MUC-7 Specific Data Over three hundred airline names, as well as some well-known airport names, have been added to the SemNet. Additionally, airline- and aircraft-specific artifacts, such as types of aircraft and the most common aircraft models (including some military ones), have been added. The area of the SemNet with knowledge relevant to the aircrash scenario has been checked and adjusted as necessary.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Grammar Expansion </SectionTitle> <Paragraph position="0"> Grammar rules have been added to deal with constructions common in the training texts, such as references to aircraft and flights (Boeing 747, or Paris-bound Boeing 747, the TWA flight 800, etc).</Paragraph> <Paragraph position="1"> The MUC-7 corpus contains sentences which are, on average, much longer than the ones encountered in most of our previous tests. Sentences of around 40 words are not uncommon. In view of this, the island parsing and the third-pass parsing (of fragments of sentences, using a reduced grammar) proved particularly important; hence a reasonable amount of work was needed on the rules for the reduced grammar and the failure recovery mechanism.</Paragraph> <Paragraph position="2"> Revision of Pragmatic Disambiguation Rules In order to deal with reported speech, commonly found in the training articles, it was necessary to improve the rules and heuristics in LOLITA which deal with first-person pronouns (i.e., 'I' and 'we'). However, the new rules that were introduced using the examples from the MUC-7 training corpus were not designed specifically for those examples. In line with the normal strategy of the LOLITA project, our intention was to make them as general as possible.</Paragraph> <Paragraph position="3"> It was also found that some of our existing rules for noun phrase matching were not working well with the MUC-7 corpus. The existing rules produced much better results for the MUC-6 corpus, whose topic area generally involved only companies and people. The rules needed tightening, especially when dealing with references to locations and aircraft-related artifacts.</Paragraph> <Paragraph position="4"> A certain number of rules that we introduced were very MUC-7 specific and conflicted with LOLITA's basic analysis. For example, in a sentence such as: 
For example, in a sentence such as: The military version of the Boeing 737 that crashed in Croatia Wednesday was not equipped...</Paragraph> <Paragraph position="5"> Boeing was to be marked as ORGANIZATION, while in LOLITA's analysis, it is an artifact.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Rules to Handle 'Non-Natural' Language </SectionTitle> <Paragraph position="0"> Special treatment had to be devised for PREAMBLE and SLUG fields of the articles at the morphology and parsing levels. Some of these fields contained strings which seemed more like a code (particular to the New York Additional morphology and grammar rules had to be written specifically to handle these. The system processes them at the end of the analysis of the main body of the text in the hope that the text can provide a useful context in which to deal with them.</Paragraph> </Section> </Section> class="xml-element"></Paper>