<?xml version="1.0" standalone="yes"?> <Paper uid="M98-1008"> <Title>Description of the American University in Cairo's System Used for MUC-7</Title> <Section position="2" start_page="0" end_page="10" type="metho"> <SectionTitle> SYSTEM DESCRIPTION Architecture Overview </SectionTitle>
<Paragraph position="0"> The MUC7-Plink system was composed of ten modules which were run in succession on each text. In order, they were: the tokenizer, the sentence splitter, the tagger, the Gazetteer, the morphological analyzer, the Lexicon, the Plink parser, the name matcher, the discourse interpreter and the template writer. Eight of these modules were used in the Sheffield MUC-7 entry. The only ones that were substantially different were the Lexicon and the Parser. I will briefly summarize the others and give a more expansive description of the Lexicon and the Plink parser.
tokenizer
The tokenizer reads the input stream and segments it into small chunks that are roughly equivalent to words. It is an executable file compiled from a C program; the C program is generated from a Lex [10] input file. The token is the most commonly used unit of data for processing in GATE, and the tokenizer guarantees a somewhat uniform representation. These tokens are added to the GATE database; a separate database is maintained for each document. Additionally, the tokenizer adds section annotations to mark areas of the text. Each GATE annotation has a start and end byte which define a span. The span specifies the offset in the document to which the annotation applies. So, the token associated with "Ford" might have the span of bytes 152 to 156, while a section might have the span 0 to 562. Additional annotation-specific information may be added to each annotation.
sentence splitter
The sentence splitter is a Perl script which notes the sentence boundaries. These boundaries are added as annotations to the GATE database; the annotation includes the offset of the sentence in the document (the span) and all of the tokens which are constituents of the sentence.
tagger
The Brill tagger [3] is a part-of-speech tagger that has been extensively trained on Wall Street Journal text. It annotates tokens with their part of speech. Since an annotation already exists for each token, the information is simply added to each token annotation, thus consolidating information.</Paragraph>
<Paragraph position="1"> These parts of speech are not entirely compatible with the results of the Gazetteer or the Lexicon. These conflicts are resolved before parsing begins.</Paragraph>
<Paragraph position="2"> gazetteer
The majority of nominal semantics in the system comes from the Gazetteer. It is a Lex [10] based system of 44 lists. Each list represents a different semantic category. The lists include companies, airlines, aircraft manufacturers, cities, provinces, titles, first names, bodies of water and aircraft names, among many other things. There are about 200,000 bytes of text in the lists, making roughly 10,000 entries. The system is relatively easy to modify: adding new elements to a list is simple, and adding a new list is also simple.</Paragraph>
<Paragraph position="3"> In addition to the lists of proper names, some are lists of key words that signal certain semantic categories. For instance, there is a list of organization signal words such as University, Hospital and Laboratory. These words alone are not sufficient to mark an organization, but if they occur next to an unknown proper noun they suggest that that proper noun is an organization. This adjacency, and thus categorization, is noted in the parser.</Paragraph>
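<Paragraph> The signal-word heuristic can be pictured with a small sketch. This is not the system's Lex-generated gazetteer; the list contents, function names and the adjacency test below are assumptions for illustration, and in the real system the categorization is made by the parser rather than by a standalone routine.
# Hypothetical sketch of gazetteer-style lookup with organization signal words.
# The real system compiles 44 word lists into a Lex scanner; names here are invented.

ORG_SIGNAL_WORDS = {"university", "hospital", "laboratory"}   # signal-word list (excerpt)
FIRST_NAMES = {"robert", "mary"}                               # first-name list (excerpt)

def classify(tokens):
    """Return a coarse semantic category for each token in a sentence."""
    categories = []
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if low in FIRST_NAMES:
            categories.append("person")
        elif low in ORG_SIGNAL_WORDS:
            categories.append("org-signal")
        elif tok[0].isupper():
            # Unknown proper noun: promote to organization if an organization
            # signal word is adjacent, otherwise leave it unknown.
            neighbours = tokens[max(0, i - 1):i + 2]
            if any(n.lower() in ORG_SIGNAL_WORDS for n in neighbours):
                categories.append("organization")
            else:
                categories.append("unknown-proper")
        else:
            categories.append("other")
    return categories

print(classify(["Cairo", "University", "announced", "the", "results"]))
# ['organization', 'org-signal', 'other', 'other', 'other']
</Paragraph>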
<Paragraph position="4"> The Gazetteer was largely the one used in the Sheffield system. However, near the end of development, I had to freeze the lists while they were being slightly modified at Sheffield. In general this did not matter, but a large number of launch-event-specific changes were not completely incorporated. The largest problem here was that spacecraft were not incorporated. Since rockets were needed for the scenario template task, virtually no scenario templates were generated.</Paragraph>
<Paragraph position="5"> GATE is not a strictly linear system. Module A must be run before Module B when B needs information from A. However, if neither is dependent on the other, they can be run independently. (Theoretically, independent modules could be run in parallel, but the current GATE system does not implement this feature.)</Paragraph>
<Paragraph position="6"> Since the Gazetteer does not depend on morphological analysis, it could be run before or after the tagger.</Paragraph>
<Paragraph position="7"> morphological analyzer
The morphological analyzer takes all nouns and verbs and returns the root form and the suffix. The root form is often used as a semantic primitive, so the semantics for "report" is the same as the semantics for "reports" or "reporting". The analysis is done by some regular-expression rules and a list of several thousand irregular exceptions derived from the exception list used in WordNet [16].</Paragraph>
<Paragraph position="8"> lexicon
The version of Plink used for MUC-4 and MUC-5 had a hand-crafted lexicon. Each lexical entry was a complex feature structure, and was rather difficult to construct. Words that were not specifically in the lexicon were assumed to be proper nouns of no particular semantic category. It would be more effective if an on-line lexicon could be used to reduce the workload, because such a lexicon would both ease the transition to a new domain and reduce the time needed to maintain Plink's own lexicon.</Paragraph>
<Paragraph position="9"> Longman's Dictionary of Contemporary English (LDOCE) [9] has electronic versions. One of these versions was selected and added to GATE. The desired word (root form) was passed to LDOCE, and it returned the definitions of the word that it found. Each of these definitions was added as its own token to the GATE database, with spans that corresponded to the token. Initially, every definition of the word was left in as an annotation, and Plink was allowed to choose between the definitions. Unfortunately, on medium-sized documents the large number of lexical entries tended to slow my machine down due to memory limitations. This meant that some pruning had to be done before addition to the GATE database.</Paragraph>
<Paragraph position="10"> The Plink grammar that I developed roughly follows the HPSG [14] formalism. This requires rather sophisticated lexical entries. The addition of LDOCE has enabled me to begin to develop a more complex lexical system. Eventually, these definitions will include semantic and complex syntactic features, which should enable more effective parsing and more useful semantic results that can be passed along to discourse analysis.</Paragraph>
<Paragraph position="11"> The version of LDOCE that I used has semantics and selectional restrictions, but they seem to be inconsistently entered. Thus the information gathered from LDOCE is currently not very useful.</Paragraph>
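<Paragraph> To make the lookup-and-prune step concrete, here is a minimal sketch of the idea, assuming a toy dictionary and a simple pruning criterion; the interface, the dictionary content and the cap on senses are invented, not LDOCE's.
# Illustrative sketch of root-form dictionary lookup with pruning before the
# definitions are stored as annotations. All names and data here are invented.

TOY_DICTIONARY = {
    "report": [
        {"pos": "n", "sense": "a written or spoken account of an event"},
        {"pos": "v", "sense": "to give information about something"},
    ],
}

def lookup_definitions(root_form, expected_pos=None, max_senses=2):
    """Return dictionary senses for a root form, pruned before storage."""
    senses = TOY_DICTIONARY.get(root_form, [])
    if expected_pos is not None:
        # Keep only senses compatible with the tagger's part of speech.
        senses = [s for s in senses if s["pos"] == expected_pos]
    # Cap the number of senses so the annotation database stays small.
    return senses[:max_senses]

def annotate_token(span, root_form, expected_pos):
    """Build one annotation per surviving definition, sharing the token's span."""
    return [{"type": "lexicon", "span": span, **sense}
            for sense in lookup_definitions(root_form, expected_pos)]

print(annotate_token((152, 158), "report", "v"))
</Paragraph>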
<Paragraph position="13"> plink
The PLINK parser was designed for the fifth Message Understanding Competition (ARPA-93). PLINK does full-fledged parsing, creating exactly one syntactic-semantic representation of a given sentence. Additionally, PLINK parses in linear time, thus speeding parsing. PLINK is closely related to the Marcus parser [13], using a stack of constituents. Plink uses a heuristic rule-selection mechanism based on the contents of the stack to select which grammar rule to apply at each step. These heuristics have access to elements of the partially completed parse and select the rules based on a preference mechanism. The preference mechanism is based on a small number of rankings (currently 6), so the system can select several rules and rank them. PLINK uses a standard unification-based grammar, or UBG [15], and is derived from the LINK parser [11]. The use of a UBG enables PLINK to encode grammar rules that have both syntactic and semantic components. Since the parser has access to syntax and semantics, it can take advantage of both types of knowledge to make parsing decisions. This allows parsing to proceed in one pass and eliminates a great deal of ambiguity. PLINK also includes an inheritance hierarchy of semantic components. A more thorough discussion of PLINK and the MUC-5 system can be found in [7].</Paragraph>
<Paragraph position="14"> The grammar that was used was hand-crafted. Though it does not adhere to any specific linguistic theory, it is similar to the HPSG grammar of Pollard and Sag [14]. The grammar rules are quite standard, except that in many cases they are made more amenable to one-pass parsing; for instance, left recursion is avoided. These rules still recognize the same language, but some grammatical manipulation improves one-pass parsing. Rules to handle agrammatical phenomena were derived with HPSG in mind, though of course they differ from standard HPSG rules. The parsing model is based around a stack and selection rules. The stack is a standard parsing stack: constituents are added to the stack, and when appropriate a grammar rule is applied to the stack, modifying its top elements. I tried to keep the stack small, and in earlier experiments the stack never exceeded a size of seven constituents when it was parsing grammatical phenomena.</Paragraph>
<Paragraph position="15"> At any given time a number of actions could take place: a new element could be pushed onto the stack, or one of a number of grammar rules could be applied. Selection rules were used to choose the next action. Like the grammar rules themselves, the selection rules are UBG rules. The selection rules inspect the stack and give a preference weighting to each of the valid options. For example, consider a stack whose most recently added element is a period (the punctuation mark) and where "..." represents other elements lower on the stack. All of the selection rules are unified with the stack and (for the sake of example) two selection rules match.</Paragraph>
<Paragraph position="16"> [Examples 1 and 2, showing the example stack and the two matching selection rules, abbrev-eats-period and NP-from-det-noun, are not reproduced here.]</Paragraph>
<Paragraph position="18"> Which of the two rules is actually selected? Grammar rules are selected based on a preference ranking. In the current system the ranking is best, good, fair, last, spec-agram and gen-agram.</Paragraph>
<Paragraph position="19"> The best rule is applied first. When the stack is as it is in Example 1, the abbrev-eats-period rule is applied first. If it succeeds, a new round of rule selection begins. If it fails, then rules from the next level, in this case NP-from-det-noun, are applied. This continues until all rules fail. If multiple grammar rules are selected with the same preference ranking, they are ordered randomly.</Paragraph>
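<Paragraph> A minimal sketch of this preference-ranked selection loop, assuming the six rankings named above and including the push fallback described in the following paragraph. The rule representation, helper names and toy rules are invented for illustration; real PLINK selection rules are unification-based rules that encode grammar rules by name.
# Sketch of one step of PLINK-style rule selection: try matching rules from the
# best ranking downwards, break ties randomly, and push a new constituent if
# nothing applies. Data structures here are simplified stand-ins, not PLINK's.
import random

RANKING = ["best", "good", "fair", "last", "spec-agram", "gen-agram"]

def select_and_apply(stack, selection_rules, grammar_rules, words):
    """One parsing step: apply the best-ranked applicable rule, else push."""
    matches = [r for r in selection_rules if r["matches"](stack)]
    for level in RANKING:
        candidates = [r for r in matches if r["rank"] == level]
        random.shuffle(candidates)              # ties at a level are ordered randomly
        for rule in candidates:
            if rule["action"] == "push":
                if words:                       # gen-agram push: consume the next word
                    stack.append(words.pop(0))
                    return True
                continue
            if grammar_rules[rule["action"]](stack):   # apply the named grammar rule
                return True
    if words:                                   # no rule succeeded: push a constituent
        stack.append(words.pop(0))
        return True
    return False

# Toy demonstration: an abbreviation followed by a period triggers the best rule.
def abbrev_eats_period(stack):
    if len(stack) >= 2 and stack[-1] == "." and stack[-2].endswith("."):
        stack[-2:] = [stack[-2]]                # fold the period into the abbreviation
        return True
    return False

rules = {"abbrev-eats-period": abbrev_eats_period}
selection = [{"matches": lambda s: bool(s) and s[-1] == ".",
              "rank": "best", "action": "abbrev-eats-period"},
             {"matches": lambda s: True, "rank": "gen-agram", "action": "push"}]

stack, words = ["Corp.", "."], ["provided"]
select_and_apply(stack, selection, rules, words)
print(stack)   # ['Corp.']
</Paragraph>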
<Paragraph position="20"> If no rule succeeds, a new constituent is pushed onto the stack. This could be implemented by the selection rule (Selection-Rule 1 (gen-agram push)). This rule always succeeds, and the keyword push is used to push a constituent onto the stack. Other selection rules may take advantage of the push mechanism when more lexical information is needed to make a parsing decision.</Paragraph>
<Paragraph position="21"> This parsing mechanism allows no backtracking. Consequently, the parse is assured to occur in linear time. There is evidence that humans backtrack when parsing [4], [5]; in this sense PLINK is not a full-fledged model of human parsing.</Paragraph>
<Paragraph position="22"> In Example 2, I actually specified the names NP-from-det-noun and abbrev-eats-period. These are the actual names of the grammar rules; that is, the selection rules encode the grammar rules by name. The name of a grammar rule is specified in the grammar rule's (pref name) feature. The grammar rule for NP-from-det-noun looks like Example 5.</Paragraph>
<Paragraph position="23"> The MUC-7 domain is an open-ended domain of newspaper articles. These articles often have grammatical and spelling errors. Furthermore, the lexical mechanisms are not always correct; for example, occasionally words are mis-tagged. Consequently, the domain is ideal for robust parsing techniques. The simple technique that PLINK uses for robustness is low-ranked rules: high-priority rules handle grammatical and specific phenomena; medium-priority rules handle grammatical and general phenomena; low-priority rules handle agrammatical phenomena.</Paragraph>
<Paragraph position="24"> A working version of the Plink parser existed by the time of the dry run. The parser was in GATE and was receiving input from earlier modules via the GATE database. However, the grammar was designed to recognize general noun phrases, so some modifications had to be made to generate the appropriate semantic category. For instance, the parser might encounter "Robert R. Smith". This would be correctly recognized as an NP, but it would not state that it was a person. For the purposes of all of the MUC tasks, this information was needed. Consequently, new grammar rules had to be added. Since Robert is in the Gazetteer, the semantic type of "Robert" would be person, and an NP formed from it would also be person. However, the types of "R." and "Smith" would be unknown. Thus a grammar rule, Example 6 (not reproduced here), was needed.</Paragraph>
<Paragraph position="26"> Example 6 of course conflicted with an already existing grammar rule which took the exact same constituents but took the semantics from the second noun. A higher-ranking parsing heuristic was made for the ng-from-NGperson-N grammar rule, and it was always selected first. It only succeeded when the semantics were correct, so non-person NPs were unaffected.</Paragraph>
<Paragraph position="28"> A total of 11 grammar rules and 13 selectional rules were added for the MUC task. All of these were developed during the training phase and were thus specialized for the aircraft accident domain. It would be valid to say that this was the only work done on MUC7-Plink for MUC-7. These rules were written in a few hours over several afternoons. One of the advantages of the Plink approach is the simple integration of domain-specific grammar rules.</Paragraph>
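<Paragraph> To make the "Robert R. Smith" example above concrete, here is a sketch of the effect of the ng-from-NGperson-N rule and its higher-ranked heuristic. The dictionary-based feature structures are a simplification invented for illustration; the real rules are unification-based grammar rules.
# Simplified sketch of two competing NP rules over (noun-group, noun) constituents.
# The real rules are UBG rules; this dictionary representation is illustrative only.

def ng_from_NGperson_N(left, right):
    """Higher-ranked rule: fires only when the left constituent is a person,
    and propagates the person type to the combined NP."""
    if left.get("sem") != "person":
        return None
    return {"cat": "NP", "sem": "person", "text": left["text"] + " " + right["text"]}

def ng_from_NG_N(left, right):
    """Default rule: same constituents, semantics taken from the second noun."""
    return {"cat": "NP", "sem": right.get("sem", "unknown"),
            "text": left["text"] + " " + right["text"]}

def combine(left, right):
    # The person rule is tried first because its selection heuristic ranks higher;
    # if it does not apply, the default rule is used, so non-person NPs are unaffected.
    return ng_from_NGperson_N(left, right) or ng_from_NG_N(left, right)

robert = {"cat": "NG", "sem": "person", "text": "Robert R."}
smith = {"cat": "N", "sem": "unknown", "text": "Smith"}
print(combine(robert, smith))
# {'cat': 'NP', 'sem': 'person', 'text': 'Robert R. Smith'}
</Paragraph>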
<Paragraph position="29"> The main modifications from the MUC-5 system were a new grammar for a new tag set, and the introduction of lazy unification to speed heuristic rule selection. The new grammar was needed since the tag set had changed: the MUC-5 tag set was specific to our hand-crafted lexicon, whereas the system now uses a combination of the tags used by the Brill tagger, the Gazetteer, and LDOCE. This has been combined with a hierarchy of syntactic classes to enable more general rules to be written. For example, instead of one syntactic class for the comma and one for each of the other punctuation marks, I have combined these into a single class, symbol, where each symbol has a head feature which is the symbol itself. A general rule can be written to look at the lexical class `sym', or a specific rule can be written to look at the lexical class `sym' which has the head feature dollar for the dollar sign.</Paragraph>
<Paragraph position="30"> Lazy unification is now used during rule selection. In the MUC-5 system full unification was used, and this led to large structures being built unnecessarily. A future improvement would introduce lazy unification into grammar rule application as well; there is evidence that this would further improve parsing performance [12].</Paragraph>
<Paragraph position="31"> Finally, a great deal of modification was needed to produce the correct input for the XI discourse interpreter. Fortunately, this was mostly a matter of post-processing. Plink standardly produces a list of verb frames, while XI wants a list of quasi-logical predicates. It is relatively simple to change the frames into predicates. However, the XI system that was used needed a certain set of predicates, and a large amount of work was needed to ensure that the correct predicates were being produced. This is where the majority of the work for MUC7-Plink happened. What was produced was a list of entities and relations between entities; the entities could be based on nouns or on verbs.</Paragraph>
<Paragraph position="32"> name matcher
This is a C++ program used as part of the coreference mechanism. If a name, or part of a name, occurs more than once in the list of entities, the entities are combined into one entity. This is a useful preprocessing step for the Discourse Interpreter.</Paragraph>
<Paragraph position="33"> discourse interpreter
The discourse interpreter was developed using the XI knowledge representation language [17]. The input to the interpreter was a series of entities and relations between entities. The interpreter had rules which built new relations and reclassified the entities. One particularly important set of entities and relations was the MUC-7 specific Elements, Relations and Scenarios.</Paragraph>
<Paragraph position="34"> The only work done for MUC7-Plink was to produce the appropriate input for the discourse interpreter. Unfortunately, this work was incomplete, particularly for the final test domain. This led to very low recall measures in all three tasks.</Paragraph>
<Paragraph position="35"> An additional problem was that the coreference mechanism, which was largely implemented in the discourse interpreter, assumed that entities had a particular property. However, this property had to be added by the Plink parser and was not always produced. This led to a reduction in precision, particularly in the Template Element task, because entities that coreferred in reality were not associated by discourse interpretation.</Paragraph>
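<Paragraph> As an illustration of the post-processing described above, converting Plink's verb frames into the entity and relation predicates that XI expects, here is a minimal sketch. The frame layout, role names and entity-numbering scheme are assumptions made for the example; the walkthrough below shows the style of predicates the real system produces.
# Illustrative sketch of converting a verb frame into quasi-logical predicates.
# The frame format and entity numbering are invented for this example.
from itertools import count

entity_ids = count(251)          # e251, e252, ... (arbitrary starting point)

def frame_to_predicates(frame):
    """Turn one verb frame into a flat list of predicate strings."""
    event = f"e{next(entity_ids)}"
    predicates = [f"{frame['verb']}({event})"]         # the event entity itself
    for role, noun in frame["args"].items():
        arg = f"e{next(entity_ids)}"
        predicates.append(f"{noun['type']}({arg})")     # entity based on a noun
        predicates.append(f"name({arg}, '{noun['name']}')")
        predicates.append(f"{role}({event}, {arg})")    # relation to the event
    return predicates

frame = {"verb": "launch",
         "args": {"lsubj": {"type": "organization", "name": "Intelsat"},
                  "lobj": {"type": "artifact", "name": "satellite"}}}
for p in frame_to_predicates(frame):
    print(p)
</Paragraph>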
<Paragraph position="36"> template writer
The template writer is a Prolog program that simply scans through the discourse model. It looks for certain types of entities and relations, formats their information in an appropriate manner, and generates the templates which are the results of the system.</Paragraph>
<Paragraph position="37"> General Architecture for Text Engineering
This whole system was developed within the General Architecture for Text Engineering, or GATE [6]. Text processing modules are added to GATE, and these modules can be combined into a system. Once modules are added, they can be combined in different ways to form new systems.</Paragraph>
<Paragraph position="38"> GATE provides a Tipster-compatible database mechanism. The database store is organized around documents, and each document has its own set of annotations. Modules take input from the database, process the input, and generate output which is then usually placed back into the database. The simplest way to add a new module to GATE is by writing a wrapper that interacts directly with the database. The wrapper gets annotations from the database and writes them to a file; the code for the module is then called with the file as input. It then produces an output file which is read by the wrapper and put into the database. Some modules, such as the name matcher, do not communicate this way. However, integrating a module in this fashion is not very difficult, and it allows the module to run without GATE if an input file exists.</Paragraph>
<Paragraph position="39"> GATE currently has about 40 modules with complete wrappers. The effort of adding a new module varies, but it can be done in well under an hour for simple systems, and in two days for complex systems such as the ANLT parser. Since processing can be independent of GATE, the source language of the new module is irrelevant; MUC7-Plink has modules written in C, C derived from Lex, C++, Lisp, Perl and Prolog.</Paragraph>
<Paragraph position="40"> MUC7-Plink generated scores for the Scenario Template task, the Template Relations task and the Template Element task. The scores were lower than expected, but not much lower. No development was done on the Launch Event domain. A small amount of work could have raised the P&R scores to 20 for ST, 40 for TR, and 60 for TE; these are roughly the scores on the tasks in the Aircraft Accident domain on texts that were run blindly. Of course, a reasonable amount of work on the system could have raised the scores much higher.</Paragraph>
<Section position="1" start_page="8" end_page="10" type="sub_section"> <SectionTitle> Development Time </SectionTitle>
<Paragraph position="0"> The only way that MUC7-Plink excelled in the MUC-7 competition was development time. No time was spent on the Launch Event domain, and very little time was spent on the Aircraft Accident domain. The effort breaks down roughly as follows:
• 48 hours on integration into the GATE/LaSIE Discourse Interpreter
• 80 hours spent adding Plink and LDOCE to GATE
• 90 hours running the final test
There was no time spent on development in the final test domain. The TE and TR scores are reasonable because some time was spent in development on the similar training domain of Aircraft Accidents. The 48 hours were spent on modifying the output of the Plink parser to fit the XI discourse interpreter that was used; this could reasonably be considered part of the MUC-7 effort. Roughly 80 hours were spent adding Plink and LDOCE to GATE, in the summer of 1996. The integration process has been improved since then, and adding two similar modules would now probably take under 40 hours of effort.</Paragraph>
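<Paragraph> Most of that integration time goes into the wrapper pattern described under General Architecture for Text Engineering above. A minimal sketch of such a wrapper follows; the database object, file format and module command are hypothetical stand-ins, not GATE's actual API.
# Hypothetical sketch of a GATE-style wrapper: read annotations from the
# database, write them to a file, run the external module, read its output
# back, and store new annotations. Names and file formats are invented.
import subprocess
import tempfile

def run_wrapped_module(db, doc_id, command):
    """Round-trip one document through an external module via temporary files."""
    with tempfile.NamedTemporaryFile("w", suffix=".in", delete=False) as f_in:
        for ann in db.annotations(doc_id, ann_type="token"):
            # One annotation per line: id, type, start byte, end byte, text.
            f_in.write(f"{ann['id']} token {ann['start']} {ann['end']} {ann['text']}\n")
        in_path = f_in.name
    out_path = in_path + ".out"

    # The external module reads the input file and writes its results to out_path;
    # because it only sees files, it can also be run without GATE.
    subprocess.run(command + [in_path, out_path], check=True)

    with open(out_path) as f_out:
        for line in f_out:
            ann_id, ann_type, start, end, *rest = line.split()
            db.add_annotation(doc_id, ann_type, int(start), int(end), " ".join(rest))
</Paragraph>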
<Paragraph position="1"> The majority of the time was spent on running the final tests. I was running on a PC-586 at 90 MHz with 16 MB of RAM, which led to very slow processing. The longest article took over 6 hours to process. An average article took 30 minutes to process up to the discourse interpreter: 10 minutes was spent on parsing, 5 minutes on lexical lookup, and roughly 10 minutes of the remaining time on interfacing with the GATE database. This is clearly a weakness of the GATE model and needs to be improved.</Paragraph>
<Paragraph position="2"> Discourse analysis was taking much too long, and there would have been no way to run all of the texts on my PC. Fortunately, the GATE approach of reading from the database, writing to a file and then calling the module was very helpful; it enabled me to write input files for the discourse interpreter, ftp them to a Sun workstation and run them there. Roughly half of the texts were run this way, including almost all of the texts over 4000 bytes.</Paragraph>
<Paragraph position="3"> The major problem with this long running time was that it left no time for development on the Launch Event domain. An overnight run of the texts would have enabled development of the MUC7-Plink system to achieve much higher results. Still, it is quite remarkable that one can enter MUC-7 with a system almost solely run and developed on a low-end PC.</Paragraph>
<Paragraph position="4"> Walkthrough
I will concentrate on the sentence "The China Great Wall Industry Corp. provided the Long March 3B rocket for today's failed launch of a satellite built by Loral Corp. of New York for Intelsat."
The tokenizer reads in the text and adds annotations like:
206 token 1118 1121
207 token 1122 1127
for the words The and China. 206 refers to the annotation number, and 1118 and 1121 give the span of the token in the text.</Paragraph>
<Paragraph position="5"> The sentence splitter divides the document into sentences, including the above sentence as the annotation:
1139 sentence 1118 1275 constituents: 206 207 ....</Paragraph>
<Paragraph position="6"> This annotation says it is a sentence that goes from 1118 to 1275 and has the tokens 206, 207, etc.</Paragraph>
<Paragraph position="7"> The tagger modifies the token annotations by adding part-of-speech information (the example annotation, for the word March, is not reproduced here). As noted, this information is not currently very useful, but slots are left open for a more effective lexical retrieval mechanism.</Paragraph>
<Paragraph position="8"> The Plink parser is then run on the sentence and generates a syntactic structure for the sentence, which we will ignore, and a semantic structure for the sentence. The annotation is:
7867 semantics 1118 1275 (qlf: [fail(e251), lobj(e251,e252), launch(e252) ....])
The quasi-logical forms that are of interest are:
organization(e256), name(e256, offset(1244, 1255)), city(e257), name(e257, 'new york'), apposed(e256,e257), of(e256,e257)
This says that Loral Corp. is an organization which has an of relation with the city New York. In this particular text, the name matcher finds no matches.</Paragraph>
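<Paragraph> Although the name matcher finds nothing to merge in this text, a minimal sketch of the kind of merging it performs on other texts is given below. The matching criterion and entity representation are simplified assumptions for illustration; the real module is the C++ program described earlier.
# Simplified sketch of name matching: entities whose names share a full word
# (e.g. "Loral Corp." and "Loral") are merged into a single entity.

def merge_by_name(entities):
    """Merge entities whose names overlap in at least one word; return survivors."""
    merged = []
    for ent in entities:
        words = set(ent["name"].lower().split())
        for kept in merged:
            if words & set(kept["name"].lower().split()):
                kept["mentions"].extend(ent["mentions"])   # fold ent into kept
                break
        else:
            merged.append({"name": ent["name"], "mentions": list(ent["mentions"])})
    return merged

entities = [{"name": "Loral Corp.", "mentions": ["e256"]},
            {"name": "Loral", "mentions": ["e301"]},
            {"name": "Intelsat", "mentions": ["e260"]}]
print(merge_by_name(entities))
# [{'name': 'Loral Corp.', 'mentions': ['e256', 'e301']},
#  {'name': 'Intelsat', 'mentions': ['e260']}]
</Paragraph>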
<Paragraph position="9"> The discourse interpreter finds an of relation between an organization and a location. The interpreter has a rule that adds a location_of predicate if this relation holds, so a new predicate location_of(e256,e257) is added. The discourse interpreter in turn writes information back to the database. An example is:
8132 xi instance 1118 1275 (class: e7 <{city()}) (props: location_of(e6,e7), country(e7, 'United States'), of(e6,e7)...)
The template writer reads these xi instance annotations and prints the appropriate template elements and relations for, in this case, Loral Corp. and New York.</Paragraph> </Section> </Section> <Section position="3" start_page="10" end_page="11" type="metho"> <SectionTitle> OBSERVATIONS </SectionTitle>
<Paragraph position="0"> GATE made MUC7-Plink possible. Without GATE it would have been impossible for me to develop a system capable of participating in MUC in under a few weeks of work. GATE does have some weaknesses: adding a new module to GATE, while simple, is not transparent, and accessing the database is quite slow. However, it has been a very useful development environment. Plink has also proven to be quite useful. It was quite easy to add new rules for a new domain to Plink, and the end result of parsing is easily translated into the quasi-logical form needed by the discourse interpreter. This comes from it being a full parser which generates one interpretation, and generates a full semantic interpretation along with a syntactic one.</Paragraph>
<Paragraph position="1"> MUC7-Plink can be most usefully seen as an example of how to build a system that can very easily be moved to a new domain. Assuming a working system, for say the MUC-6 Succession Event task, three main modules need to be modified: the Gazetteer, the Parser and the Discourse Interpreter. Using the modules in MUC7-Plink, only domain-specific data needs to be changed; the actual programs remain constant.</Paragraph>
<Paragraph position="2"> The Gazetteer needed several lists changed. The parser needed several grammar rules, and for Plink selection rules, added to account for the lists. Switching to a new domain would again call for new lists and new grammar rules. However, this data is all based around noun phrases. The NE task requires the system to classify several kinds of Named Entities. If there were a more difficult task, an Entity task, which required all Entities to be classified, the system would be more domain independent. It would still be useful to add new lists and grammar rules to switch domains, but the introductory work would have been done. Furthermore, without adding new lists or grammar rules, some output could be generated.</Paragraph>
<Paragraph position="3"> For example, in switching MUC7-Plink from Aircraft Accidents to Launch Events, the grammar and the Gazetteer provided no space for rockets. Therefore, rockets could never have arrived as specific semantic output (except when specifically mentioned as a rocket). This is why MUC7-Plink performed so badly on the ST task. It performed better on the TE and TR tasks because large parts of those tasks (Organizations, Products and People) were accounted for by the grammar and the Gazetteer. If the original system had considered rocket entities, the scores would have been much higher.</Paragraph>
<Paragraph position="4"> There was no Discourse Interpretation work done as part of MUC7-Plink.
I simply took advantage of the work done at Sheffield. Clearly, in switching to a new domain, some discourse work would need to be done. However, the amount of work done at Sheffield on the discourse model was also small. To a large degree this work could be considered looking for specific phenomena in the text, specifically those phenomena required by the ST and TR tasks. Perhaps the new SUMMAC tests will provide better insight into a general discourse interpretation mechanism which can easily be culled for specific information, but it seems likely that a more sophisticated, all-purpose Scenario task would be needed.</Paragraph> </Section> </Paper>