<?xml version="1.0" standalone="yes"?>
<Paper uid="M98-1014">
  <Title>FACILE: DESCRIPTION OF THE NE SYSTEM USED FOR MUC-7</Title>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
SYSTEM ARCHITECTURE
</SectionTitle>
    <Paragraph position="0"> The FACILE preprocessor accepts input to the system, normalizes the text, recognizes special formatting, tokenizers, tags, looks up single and multi-word tokens in a database, and carries out proper name recognition and classification.</Paragraph>
    <Paragraph position="1"> The output forms the input to other FACILE modules (the Shallow Analyzer and the Deep Analyzer, neither of which wasusedintheMUC-7NEtask).</Paragraph>
    <Paragraph position="2"> Basic Preprocessor and Database Lookup  The same component was subjected to a highly experimental trial in the TE task, in which the results obtained reflect the NE analyser's ability unaided to extract no more than the name and category of some of the entities mentioned in the text. The preprocessor utilizes the InXight LinguistiX tools as a third-party component. Through a functional interface, the preprocessor is able to utilize these tools for tagging and morphological analysis. The proven finite-state technology that the tools employ ensures the necessary speed, reliability, coverage and portability. Modules developed within the project carry out text zoning, tokenisation, database lookup and named entity rule application.</Paragraph>
    <Paragraph position="3"> FACILE treats tokens as feature vectors. The follow-up modules derive all information about a token exclusively from its corresponding feature vector.</Paragraph>
    <Paragraph position="4"> The feature vector stores the following information about a token: where it begins and ends as character offsets, what separates it from its predecessor (white space, hyphen etc.), what text zone it comes from, its orthographic pattern (capitalised, all capitalised, mixed, lower case etc.), the token and its normalised forms, its syntax (category and features), semantic class (as obtained either from the database or morphological analyser), morphological analyses, partitioned into those consistent with the tagger's choice and others (for possible use by other modules). (1) is an example, in LISP notation.</Paragraph>
    <Paragraph position="5"> (1) (1192 1196 10 T C &amp;quot;Mrs.&amp;quot; &amp;quot;mrs.&amp;quot; (PROP TITLE) (^PER_CIV_F) ((&amp;quot;Mrs.&amp;quot; &amp;quot;Title&amp;quot; &amp;quot;Abbr&amp;quot;)) NIL) In this example, the separator is octal 10, the text comes from the main body, the token is capitalised, literally &amp;quot;Mrs.&amp;quot;, normalised to &amp;quot;mrs.&amp;quot;, syntactically a PROP and TITLE, and semantically according to the database a prefix (^)fora female civilian person. On the second line are the results of the morphological analysis.</Paragraph>
    <Paragraph position="6"> The preprocessor fills out the feature vector from various sources. The first six fields shown above are obtained from the text using the text zoner and tokenisation modules. The normalised form field comes either from the morphological analysis (see below) or from algorithmic procedures for the handling of numeric tokens. The syntax field comes from the morphosyntactic tagger, and the morphological analysis from the morphological analyser. The latter typically offers several alternative analyses and the full list of results is partitioned into those that are consistent with the tagger's decision and those that are not. The semantics field comes from lookup in a database which has information on words and phrases belonging to the categories of named entities themselves as well as to categories that occur as prefixes or suffixes of names. Where there exist database entries for multi-word tokens, the single words are replaced by a single token vector for the compound.</Paragraph>
    <Paragraph position="7"> The structure of this table differs from the data structure used in chart parsing in that there is only one initial edge per token. Alternative analyses are packed into the SEM and other-morph fields.</Paragraph>
  </Section>
  <Section position="6" start_page="1" end_page="1" type="metho">
    <SectionTitle>
RULE-BASED NAMED ENTITY RECOGNITION
</SectionTitle>
    <Paragraph position="0"> As mentioned earlier, we had found the approach to named entity recognition described in Coates-Stevens (1992) interesting, but that system was coded in Prolog, which was not one of the agreed languages for implementation in the FACILE project.</Paragraph>
    <Paragraph position="1"> Our first thoughts were to use a pattern-matching language like FLEX or PERL, starting from the tagger output as word/tag pairs. However, first attempts to use PERL led to unreadable patterns of many full lines in extent because of the number of features to take account of. The need to handle coreferences and scores, and the slowness of PERL pointed to a more complex interpreter, and so we came to specify a more congenial notation. Because in any one pattern constituent we would need to refer to only one or two of the properties, we chose to use a attribute-value notation for readability, but not one as powerful as we might want to use for a full syntactic and semantic analysis.</Paragraph>
    <Paragraph position="2"> The rule notation Two versions of the rule language and its interpreter have been defined, the current version having been proposed and implemented following an evaluation of our performance in the dry run. This description refers only to the current version.</Paragraph>
    <Paragraph position="3"> Rules have the general form A #29 BnC=D.Ais a set of attribute operator value expressions, where the values are atomic expressions, disjunctions (using the operator j) or negated atomic expressions or disjunctions, i.e. a one-level attribute-value matrix (AVM). In an individual attribute operator value expression, the left hand side may be any of ten specified columns in the input token vector, or an additional attribute if the matching token has been found by rule. B;C and D are sequences of such one-level AVMs (B and D possibly empty, since they constitute the left and right context of C). Each AVM is optionally followed by an iteration specification, *,+,?, or fhintegeri,hintegerig.AVMsmaybe grouped by parentheses to allow for the iteration of a sequence of constituents, although in the current implementation, recursive grouping by parentheses is not permitted.</Paragraph>
    <Paragraph position="4"> The left-hand side of a rule may also have a score in the range -1 :::+1, which defaults to 1. # is the comment character.</Paragraph>
    <Paragraph position="5"> The comparison operators include =, != (not equal),&lt;, &lt;=, &gt; and &gt;=. If the value in the chart edge is a disjunction (i.e. list) the = operator is satisfied if any of the members of the list is identical to the value in the expression (or any one of the values if that is a disjunction). The negation operator != is satisfied if there is a null intersection between the values in the edge and those in the expression. Substring comparisons are permitted by the inclusion of wildcards in the value expression.</Paragraph>
    <Paragraph position="6"> Any variable (a symbol whose print-name begins with &amp;quot; &amp;quot;) which occurs on the right-hand side of an expression is unified with all other occurrences of the same variable in the rule. This can be used to transfer specific information to the left-hand side as well as to enforce constraints.</Paragraph>
    <Paragraph position="7"> An example of a rule is (2), which is satisfied by a token whose normalised form (i.e. modulo capitalization) is &amp;quot;university,&amp;quot; the literal &amp;quot;of&amp;quot; and a location or city name. A string matching this description is assigned the syntactic tag PN and the semantic tag ORG.</Paragraph>
    <Paragraph position="8"> (2) [syn=NP, sem=ORG] (0.9) =&gt;</Paragraph>
    <Paragraph position="10"> Example (3) shows how the right context may be suggestive of the tag to assign, with a relatively low certainty factor to take account of this. The target pattern is a single upper-case token which the tagger guesses to be a PN, and not a PN that has been found by applying a rule. In this example, the right context is expressed literally instead of using SEM or SYN values. If the whole pattern is matched, the arbitrary additional attributeorgn receives the value of the token in the first constituent through unification of the instances of the variable O.</Paragraph>
    <Paragraph position="12"> The purpose of the variable in rule (3) may not be immediately apparent. Variables were first introduced because of the need to treat coreferences between instances of the same name. Whilst an extended form of a name may be used on its first mention in a text, it is typically not used again in the same text. The referent of the phrase &amp;quot;Foreign Secretary Robin Cook&amp;quot; will normally be mentioned subsequently as &amp;quot;Mr Cook,&amp;quot; for example. &amp;quot;Mr Robin&amp;quot; would not be possible (except in the households of aristocrats or in an old family firm). The variable allows the repeatable part of the name to be identified for matching with any subsequent mentions, as made clear by Rule (4).</Paragraph>
    <Paragraph position="14"> Rule (5) illustrates the coreference operator &gt;&gt;, which stipulates that there be an antecedent constituent matching the AVM following the operator, which can satisfy the variable bindings established in the rest of the rule. In this case, both surname and title must be identical with their antecedents.</Paragraph>
    <Paragraph position="16"> When a coreference of this type is found, in addition to the explicit assignment of the syn and sem fields and the title and surname attributes, an ante field of the newly found constituent is assigned the unique identifier of the matching antecedent. By using variables in this way, we could in principle deal with coreference relationships that are made explicit syntactically, such as that between a name and a description in apposition. However, we have only explored this possibility to a limited extent in our experimental application of the system to the TE task.</Paragraph>
    <Paragraph position="17"> Comparison with other pattern-matching languages Standard pattern-matching languages like PERL, FLEX, SNOBOL etc., are designed to process surface patterns in text.</Paragraph>
    <Paragraph position="18"> Where tagged text is to be pattern-matched, it is possible to pair tags with words as in the/AT cat/NN sat/VBD etc. and to define patterns over these sequences. However, as we have pointed out, more than just the literal token and its syntactic tag are relevant to the NE recognition problem. To make all of these properties into facets of a token structure makes the statement of the rules in &amp;quot;raw&amp;quot; PERL hopelessly long-windedand error prone. One important effort at producing a higher level language that PERL is &amp;quot;Mother of PERL&amp;quot; (MOP) - see Doran et al (1997). It enables the pattern writer to focus on a single or a few attributes at a time, but does this by the use of several separate layers of rules. Our language by contrast, allows conditions to refer to attributes arising from multiple levels of analysis in a single ruleset. However, in our view, a more significant advantage of our own rule language is its readability and accessibility to the rule-writer. We have in comparison far fewer symbolic operators and instead use a variant attribute-value matrix notation to make the individual rules more self-documenting.</Paragraph>
    <Paragraph position="19"> The rule interpreter The current implementation of the rule interpreter is adapted from a left-corner chart parser, although we have considered compiling to finite state machinery if it proved too inefficient.</Paragraph>
    <Paragraph position="20"> Althoughthe basic rule-invocationstrategy is bottom-up, partial parsing in this way could lead to runaway recursion with some rules with a single constituent (except for left and right context). This can arise when a rule has the pattern indicated in (6), or wherever the rhs is underspecified enough to accept a constituent with sem=CAT.</Paragraph>
    <Paragraph position="22"> For this reason, the rule interpreter is depth-limited.</Paragraph>
    <Paragraph position="23"> With the right-hand side of rules including context as well as the phrase to be matched, rule-invocation is more complex than in the left-corner algorithm, since the scanner can have moved on by the time a needed inactive edge is added.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
Data Structures
</SectionTitle>
      <Paragraph position="0"> The working data structures of the interpreter include an active chart, comprising sets of active and inactive edges and a vertex index. In addition, there is a property-value table which stores the values of any attributes not in the standard  chart columns. This table is a simple hash table, and can be illustrated as follows: Table 1 shows a simplified chart with initial edges 1-3 and added (inactive) edge 4 produced by application of Rule 4. The values of the title and surname fields are stored in the property-value table Table 2. This is indexed both by edge number and property, avoiding the need to search for antecedents.</Paragraph>
      <Paragraph position="1"> The main algorithm The active chart mechanism has been extended to deal with iterable and optional constituents. An iterable or optional constituent that can match the current token gives rise to two new edges, one an active one in which there can be more iterations, and another in which the cursor advances to the next constituent or concludes the rule. To make this process more efficient, active edges do not contain copies of the right hand side of the rule, but merely pointers to rules, a state vector referencing the current constituent group and constituent within the group, and bindings for any variables in the rule.</Paragraph>
      <Paragraph position="2"> Advanced rule-invocation strategy A working set of NE recognition rules may easily be over one hundred in number. If every rule within a rule set were to be tested against every edge of a document then the document would both take far longer to process than if some form of selection algorithm takes place. This is the role of the advanced rule invocation strategy - computing, for each edge of the document, which rules will definitely not fire and which rules have a chance of firing.</Paragraph>
      <Paragraph position="3"> The algorithm works by assigning properties to both rules and document edges when the rules and document are read in. Consider rule (7) which cannot be completed if there are no cardinal numbers in the edges it will subsume. Similarly, Rule (8) cannot be completed if there are no 'interesting' properties in the semantic field of the next few edges.</Paragraph>
      <Paragraph position="5"> When a rule is read in, it is assigned a 'requirement' value which indicates whether, in the next 3 edges, the rule will require (a) A cardinal number (b) A capitalised or an all capitals token, or (c) An 'interesting' semantic property - i.e.</Paragraph>
      <Paragraph position="6"> ORG, LOC^,PERCIV, DATEUNIT but not NULL, SGML, PUNCT etc.</Paragraph>
      <Paragraph position="7"> When the edges of a document are read in, they, similary, are assigned a 'property' value according to the properties of the edge. If an edge contains a capitalised word and is tagged as an organisation (sem=ORG) then then 'property' value will indicate this. Both 'requirement' and 'property' values are stored in binary arrays.</Paragraph>
      <Paragraph position="8"> When the main loop is initiated, the properties of the current edge and the following two edges are added together.</Paragraph>
      <Paragraph position="9"> Before any rule is fired, this 'property' value is checked against the requirements of each rule to make sure that the rule has at least a chance of completing.</Paragraph>
      <Paragraph position="10"> In the tests we have done, this preselection of which rules fire reduces the total run time by approximately one half, with no loss of accuracy.</Paragraph>
      <Paragraph position="11"> The preference mechanism As noted above, rules have a default certainty of 1, or an assigned certainty in the range -1 to 1. If rules give competing descriptions for the same span of the text, the sem value with the highest score is preferred. Where several rules come to the same conclusion about the sem value, evidence combination comes into play. We combine such scores using Shortliffe and Buchanan's (1975) Certainty Theory formula. C bf represents the initial certainty of a proposition or that based on accumulated evidence so far. C X is the certainty value for the same proposition, attributable on the basis of anewruleX, not previously considered. C cf represents the cumulative certainty after assimilating C</Paragraph>
      <Paragraph position="13"> shows how the formula applies in combining two positive certainties. For reasons of space, we omit the other cases here.</Paragraph>
      <Paragraph position="14">  After evaluation of certainties for each given text span, a further preference is applied which prefers longer spans to shorter in cases of overlap.</Paragraph>
      <Paragraph position="15"> The resulting analysis is a single semantic and property description of each identified name expression in the text. For our integrated system's categorization and template filling components, these are interspersed with the other expressions in the text with their tags and morphological analyses. For the purposes of MUC evaluation, reports are generated in the appropriate format.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="1" end_page="1" type="metho">
    <SectionTitle>
WALKTHROUGH
NE Task
</SectionTitle>
    <Paragraph position="0"> The walkthrough document is representative of our overall performance in the formal run.</Paragraph>
    <Paragraph position="1"> There are 11 missing entities. &amp;quot;MURDOCH&amp;quot; is missing three times although we capture it later in the main text. The three missing occurences are all the beginning of the article. We didn't have this surname in the database, so the only way to have identified this as a person name would be coreference with the instances in the body text.</Paragraph>
    <Paragraph position="2"> Unfortunately the current implementation finds only backward references. That is adequate in the main text since normally a new name is &amp;quot;explained&amp;quot; by a descriptive phrase the first time it is mentioned. However it is possible to have mentions of the name in the title and the preamble before any explanation is given. The way we intended to cope with this problem (followingthe example of some of the MUC-6 systems) was to process SLUG and PREAMBLE after the main text. However in the version of the NE analyzer that we used for the final run this approach was not properly implemented.</Paragraph>
    <Paragraph position="3"> We miss the four occurrences of &amp;quot;Grupo Televisa&amp;quot; and &amp;quot;Globo&amp;quot;. This can be explained by the fact that we don't have in our DB the suffix &amp;quot;SA&amp;quot; as a company designator.</Paragraph>
    <Paragraph position="4"> We also have some heuristic rules which can pick up partial descriptions in apposition to a name, but here again the crucial clue-words and phrases 'broadcaster&amp;quot;, &amp;quot;publisher&amp;quot; and &amp;quot;media conglomerate&amp;quot; were not in our database. There are certainly sufficient clues of this nature in sentence (10).</Paragraph>
    <Paragraph position="5">  (10) &amp;quot;Grupo Televisa SA, the Mexican broadcaster and publisher, and the giant Brazilian media conglomerate Globo&amp;quot; (11) &amp;quot;Llennel Evangelista, a spokesman for Intelsat&amp;quot; (12) PERSON, a spokesman for ORGANIZATION  We miss another person name which could have been easily captured. In sentence (11), it is possible to identify the pattern (12). This would require two separate rules. While we have the rule that captures the organization (13) we simply forgot to insert the corresponding rule to capture the person.</Paragraph>
    <Paragraph position="6"> (13) [syn = PROP, sem = ORG, zone = _X] (0.8) =&gt; [token = &amp;quot;a&amp;quot;, zone = _X] , [token = &amp;quot;spoke*&amp;quot;] ,</Paragraph>
    <Paragraph position="8"> We miss the location &amp;quot;Xichang&amp;quot; which again could have been identified in &amp;quot;Xichang launch site&amp;quot; if domain-specific clue words (&amp;quot;launch site&amp;quot;) had been inserted in the DB.</Paragraph>
    <Paragraph position="9"> As for the other two missing entries &amp;quot;Hughes Electronics&amp;quot; and &amp;quot;within six months&amp;quot; the first is explained by the missing clue-word &amp;quot;electronics&amp;quot; and the second by the fact that we did not consider &amp;quot;within&amp;quot; as an identifier in rules for Time expressions.</Paragraph>
    <Paragraph position="10"> While we get all the occurrences of &amp;quot;New York Times News Service&amp;quot; we miss all the occurrences of &amp;quot;N.Y. Times News Service&amp;quot;. This is easily explained by an error in the rule that identifies it (see example 14). Possible values (among others) for the &amp;quot;orth&amp;quot; flag are C (first capital letter) and A (all capital letters). In the specific case of &amp;quot;N.Y.&amp;quot; it assumes the value O (other, because it contains letters and dots). Simply altering the value of the feature to &amp;quot;C--A--O&amp;quot; would solve the problem. This error also causes the spurious occurrence of &amp;quot;N.Y.&amp;quot; as a single location.</Paragraph>
    <Paragraph position="11"> (14) [Syn = PROP, sem = ORG, zone = _X] (0.9) =&gt;</Paragraph>
    <Paragraph position="13"> Two occurrences of &amp;quot;March&amp;quot; in &amp;quot;Long March&amp;quot; (referring to a chinese rocket) appear incorrectly tagged as a date.</Paragraph>
    <Paragraph position="14"> While it is corret that they should be initiallytagged as a date, we had inserted in the system rules that capture &amp;quot;artifacts&amp;quot; like this one thus superseeding the initial semantic value. In the specific case that did not work, probably for some yet undetected error.</Paragraph>
    <Paragraph position="15"> In &amp;quot;2 p.m. EST&amp;quot; we identify correctly only &amp;quot;2 p.m.&amp;quot; (the suffix EST had not been inserted in the DB). There is an instance of CNN that we (correctly?) tag as an organization but it is not tagged as such in the keys (an annotator error ?).</Paragraph>
    <Paragraph position="16"> &amp;quot;Hughes&amp;quot; is twice tagged as a LOCATION istead of an ORGANIZATION. We haven't yet tracked this one down. In &amp;quot;Time Warner&amp;quot; we managed to identify only &amp;quot;Time&amp;quot; [difficult to say how we could have done it without having the whole expression in the DB] and in &amp;quot;LaRae Marsik&amp;quot; we identified only &amp;quot;Marsik&amp;quot; (probably becayse we don't have in the DB &amp;quot;LaRae&amp;quot; as a first name).</Paragraph>
    <Paragraph position="17"> The following two organizations had been tagged as persons: &amp;quot;Home Box Office&amp;quot; &amp;quot;Turner Broadcasting System&amp;quot; They appear in the sentence &amp;quot;Time Warner's Home Box Office and Turner Broadcasting System were among the companies that had leased space&amp;quot; and are picked up by a low-score rule meant to capture conjunctions of people names. Finally there are two occurrences of &amp;quot;Tele-Communications&amp;quot;, tagged as persons and this is caused again by a low-score rule that considers &amp;quot;said Tele-Communications&amp;quot; in the sentence (15) as evidence for classifying it as a person, as in ' &amp;quot;Right!&amp;quot; said Fred.' (15) &amp;quot;Ms. Marsik said Tele-Communications and its partners&amp;quot; TE task We will not discuss in detail the results of the TE task for the walkthrough article because as noted in footnote 1, the scores indicate little more than the contribution made by NE analysis, i.e. finding strings and their categories. In respect of locations, the value of recall for the slot COUNTRY is particularly low because we relied on quite a small table in lieu of a full-scale &amp;quot;Geographical DB&amp;quot; of Towns and Regions in order to find the country to which they belong. Furthermore, for the entities of type COUNTRY the LOCALE should have been used as the value for COUNTRY slot. That would have significantly increased the recall for this slot (but only a couple of percentage points overall).</Paragraph>
    <Paragraph position="18"> As for entities, our Recall for descriptors is extremely low because we didn't have the Information Extraction Module available, with its extensive linguisticcoverage. The Entity Names and Categories are also affected by this problem, although we still managed to capture 35 and 15 per cent of them respectively, using assignments to properties via variables. We attempted to adapt the approach taken in the NE task, for instance having a more refined set of semantic tags (PER CIV/PER MIL rather than simply PER) in our rules. However we could not perform this &amp;quot;porting&amp;quot; entirely because of lack of time so in many case our responses were as generic as in the NE task. For instance we have &amp;quot;China Great Wall Industry Corp.&amp;quot; classified simply as an ORGANIZATION and not as and ORG CO.</Paragraph>
  </Section>
  <Section position="8" start_page="1" end_page="1" type="metho">
    <SectionTitle>
ANALYSIS AND CONCLUSIONS
</SectionTitle>
    <Paragraph position="0"> Our best efforts at the NE task on the training data achieved 92% recall and 93% precision just prior to the formal run, and with the same data and rules.</Paragraph>
    <Paragraph position="1"> Given that in the dry run, we had equalled our previous best performance with the training data, we were disappointed with a fall-off of 6 percentage points in precision and 14 points in recall.</Paragraph>
    <Paragraph position="2"> In general the category where we perfom worst is &amp;quot;Organizations&amp;quot; and this can be partly explained by too many domain-dependent rules, and an inadequate database of company designators and clue words.</Paragraph>
    <Paragraph position="3"> Judging from the results in the Walkthrough text, almost all of our errors and omissions are easily traceable to either a lack of entries in the database or a lack of rules or conditions in rules. With the software still under development, we probably dedicated no more than a person-month to resource development and testing, and will be able to make considerable further improvements in the coming months. We feel on the whole that the approach is vindicated. We also look forward to being able to use the FACILE IE component to carry our proper tests on all the MUC-7 data in the near future.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML