<?xml version="1.0" standalone="yes"?> <Paper uid="M98-1021"> <Title>DATE TRAILER PP SS S P TEXTDOCID ... SLUGSTORYID PREAMBLENWORDS DOC</Title> <Section position="3" start_page="0" end_page="7" type="metho"> <SectionTitle> LTG TOOLS IN MUC </SectionTitle> <Paragraph position="0"> Amongst the tools used in our muc system is an existing ltg tokeniser, called lttok.Tokenisers take an input stream and divide it up into #5Cwords&quot; or tokens, according to some agreed de#0Cnition of what a token is. This is not just a matter of #0Cnding white spaces between characters|for example, #5CTony Blair Jr&quot; could be treated as a single token.</Paragraph> <Paragraph position="1"> lttok is a tokeniser which looks at the characters in the input stream and bundles them into tokens.</Paragraph> <Paragraph position="2"> The input to lttok can be sgml-marked up text, and lttok can be directed to only process characters within certain sgml elements. One muc-speci#0Cc adjustment to the tokenisation rules was to treat a hyphenated expression as separate units rather than a single unit, since some of the ne expressions required this, e.g. #3CTIMEX TYPE=&quot;DATE&quot;#3Efirst-quarter#3C#2FTIMEX#3E-charge.</Paragraph> <Paragraph position="3"> Here is an example of the use of lttok.</Paragraph> <Paragraph position="4"> cat text |muc2xml |lttok -q &quot;.*#2FP&quot; -mark W standard.gr The #0Crst call in this pipeline is to muc2xml, a programme which takes the muc text and maps it into valid xml. lttok then uses a resource grammar,standard.gr,to tokenise all the text in the P elements. It marks the tokens using the sgml element W. The output from this pipeline would look as follows:</Paragraph> <Paragraph position="6"> As the example shows, the tokeniser does not attempt to resolve whether a period is a full stop or part of an abbreviation. Depending on the choice of resource #0Cle for lttok, a period will either always be attached to the preceding word #28as in this example#29 or it will always be split o#0B.</Paragraph> <Paragraph position="7"> This creates an ambiguity where a sentence-#0Cnal period is also part of an abbreviation, as in the #0Crst sentence of our example. To resolve this ambiguitywe use a special program, ltstop, which applies a maximum entropy model pre-trained on a corpus #5B8#5D. To use ltstop the user must specify whether periods in the input are attached to or split o#0B fromthe preceding words; in our case, they were attached to the words, and ltstop is used with the option -split. With this option, ltstop will split the period from regular words and create an end-of-sentence token #3CW C=&quot;.&quot;#3E.#3C#2FW#3E; or it will leave the period with the word if it is an abbreviation; or, in the case of sentence-#0Cnal abbreviations, it will leave the period with the abbreviation and in addition create a virtual full stop #3CW C=&quot;.&quot;#3E#3C#2FW#3E Like the other ltg tools ltstop can be targeted at particular sgml elements. In our example, wewant to target it at #3CW#3E elements within #3CP#3E elements|the output of lttok. 
<Paragraph position="6">
As the example shows, the tokeniser does not attempt to resolve whether a period is a full stop or part of an abbreviation. Depending on the choice of resource file for lttok, a period will either always be attached to the preceding word (as in this example) or it will always be split off.
</Paragraph>
<Paragraph position="7">
This creates an ambiguity where a sentence-final period is also part of an abbreviation, as in the first sentence of our example. To resolve this ambiguity we use a special program, ltstop, which applies a maximum entropy model pre-trained on a corpus [8]. To use ltstop the user must specify whether periods in the input are attached to or split off from the preceding words; in our case they were attached to the words, and ltstop is used with the option -split. With this option, ltstop will split the period from regular words and create an end-of-sentence token <W C=".">.</W>; or it will leave the period with the word if it is an abbreviation; or, in the case of sentence-final abbreviations, it will leave the period with the abbreviation and in addition create a virtual full stop <W C="."></W>. Like the other LTG tools, ltstop can be targeted at particular SGML elements. In our example, we want to target it at <W> elements within <P> elements, the output of lttok. It can be used with different maximum entropy models, trained on different types of corpora.
</Paragraph>
<Paragraph position="8">
For our example, the full pipeline looks as follows:
cat text | muc2xml | lttok -q ".*/P" -mark W standard.gr | ltstop -q ".*/P/W" -split fs_model.me
This will generate the following output:
</Paragraph>
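<Paragraph>
As a rough illustration of the -split behaviour just described (this is not the ltstop implementation: a small hand-made abbreviation list and a capitalisation test stand in for the pre-trained maximum entropy model, and the sentence is again invented), the following sketch splits ordinary word-final periods off, keeps the period on abbreviations, and adds a virtual full stop after a sentence-final abbreviation.

    # Illustrative sketch of the -split behaviour described above.
    ABBREVIATIONS = {"Ltd.", "Inc.", "Mr.", "Dr.", "Sen."}   # toy stand-in list

    def split_stops(tokens):
        out = []
        for i, tok in enumerate(tokens):
            if not tok.endswith("."):
                out.append("<W>%s</W>" % tok)
            elif tok in ABBREVIATIONS:
                nxt = tokens[i + 1] if i + 1 < len(tokens) else None
                out.append("<W>%s</W>" % tok)            # abbreviation keeps its period
                if nxt is None or nxt[:1].isupper():     # looks sentence-final:
                    out.append('<W C="."></W>')          # add a virtual full stop
            else:
                out.append("<W>%s</W>" % tok[:-1])       # ordinary word: split the period off
                out.append('<W C=".">.</W>')
        return " ".join(out)

    if __name__ == "__main__":
        print(split_stops("The contract went to Example Widgets Ltd. It was signed in May.".split()))
        # ... <W>Ltd.</W> <W C="."></W> <W>It</W> ... <W>May</W> <W C=".">.</W>
</Paragraph>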
<Paragraph position="10">
Note how ltstop has "added" a final stop to the first sentence, making explicit that the period after "Ltd" has two distinct functions.
</Paragraph>
<Paragraph position="11">
Another standard LTG tool we used in our MUC system was our part-of-speech tagger lt pos [7]. lt pos is SGML-aware: it reads a stream of SGML elements specified by the query and applies a Hidden Markov Modeling technique, with estimates drawn from a trigram maximum entropy model, to assign the most likely part-of-speech tags. An important feature of the tagger is an advanced module for handling unknown words [6], which proved to be crucial for name spotting.
</Paragraph>
<Paragraph position="12">
Some MUC-specific extensions were added at this point in the processing chain: for capitalised words, we added information as to whether the word exists in lowercase in the lexicon (marked as L=l) or whether it exists in lowercase elsewhere in the same document (marked as L=d). We also developed a model which assigns certain "semantic" tags which are particularly useful for MUC processing. For example, words ending in -yst and -ist (analyst, geologist) as well as words occurring in a special list of words (spokesman, director) are recognised as professions and marked as such (S=PROF). Adjectives ending in -an or -ese whose root form occurs in a list of locations (American/America, Japanese/Japan) are marked as locative adjectives (S=LOC JJ).
</Paragraph>
<Paragraph position="13">
The output of this part-of-speech tagging could look as follows:
</Paragraph>
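<Paragraph>
As a rough illustration of the extra attributes described above (the word lists, the suffix tests and the exact attribute spelling LOC_JJ are assumptions made for the sketch only, and the part-of-speech tag is simply passed in rather than assigned by lt pos):

    # Sketch of the MUC-specific attribute decoration described above.
    PROFESSIONS = {"spokesman", "director"}      # toy stand-in lists
    LOCATIONS = {"america", "japan"}
    LEXICON_LOWER = {"bank", "widgets"}          # words the lexicon knows in lowercase

    def locative_adjective(word):
        w = word.lower()
        return ((w.endswith("ese") and w[:-3] in LOCATIONS) or   # Japanese -> Japan
                (w.endswith("an") and w[:-1] in LOCATIONS))      # American -> America

    def decorate(word, pos, doc_lowercase_words):
        attrs = ['C="%s"' % pos]
        if word[:1].isupper():
            if word.lower() in LEXICON_LOWER:
                attrs.append('L="l"')            # exists in lowercase in the lexicon
            elif word.lower() in doc_lowercase_words:
                attrs.append('L="d"')            # seen in lowercase elsewhere in this document
        if word.lower() in PROFESSIONS or word.lower().endswith(("yst", "ist")):
            attrs.append('S="PROF"')
        if locative_adjective(word):
            attrs.append('S="LOC_JJ"')
        return "<W %s>%s</W>" % (" ".join(attrs), word)

    if __name__ == "__main__":
        doc_lower = {"group"}                    # lowercase forms seen in this document
        for w, p in [("Japanese", "JJ"), ("analyst", "NN"), ("Group", "NNP")]:
            print(decorate(w, p, doc_lower))
        # <W C="JJ" S="LOC_JJ">Japanese</W>
        # <W C="NN" S="PROF">analyst</W>
        # <W C="NNP" L="d">Group</W>
</Paragraph>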
<Paragraph position="15">
We also used a number of other SGML tools, such as sgdelmarkup, which strips unwanted markup from a document, and sgsed and sgtr, SGML-aware versions of the Unix tools sed and tr.
</Paragraph>
<Paragraph position="16">
But the core tool in our MUC system is fsgmatch. fsgmatch is an SGML transducer: it takes certain types of SGML elements and wraps them into larger SGML elements. In addition, it is also possible to use fsgmatch for character-level tokenisation, but in this paper we will only describe its functionality at the SGML level.
</Paragraph>
<Paragraph position="17">
fsgmatch can be called with different resource grammars; e.g. one can develop a grammar for recognising names of organisations. Like the other LTG tools, it is also possible to use fsgmatch in a very targeted way, telling it only to process SGML elements within certain other SGML elements, and to use a specific resource grammar for that purpose.
</Paragraph>
<Paragraph position="18">
Piping the previous text through fsgmatch with a resource grammar for company names would result in the following:
</Paragraph>
<Paragraph position="20">
The combined functionality of lttok and fsgmatch gives system designers many degrees of freedom.
</Paragraph>
<Paragraph position="21">
Suppose you want to map character strings like "25th" or "3rd" into SGML entities. You can do this at the character level, using lttok, specifying that strings that match [0-9]+[ -]?((st)|(nd)|(rd)|(th)) should be wrapped into the SGML structure <W C=ORD>. Or you can do it at the SGML level: if your tokeniser had marked up numbers like "25" as <W C=NUM>, then you can write a rule for fsgmatch saying that a <W C=NUM> followed by a <W> element whose character data consist of th, nd, rd or st can be wrapped into a <W C=ORD> element.
</Paragraph>
<Paragraph position="22">
A transduction rule in fsgmatch can access and utilize any information stated in the element attributes, check sub-elements of an element, do lexicon lookup for the character data of an element, etc. For instance, a transduction rule can say: "if there are one or more W elements (i.e. words) with attribute C (i.e. part-of-speech tag) set to NNP (proper noun), followed by a W element with character data "Ltd.", then wrap this sequence into an ENAMEX element with attribute TYPE set to ORGANIZATION".
</Paragraph>
<Paragraph position="23">
Transduction rules can check left and right contexts, and they can access sub-elements of complex elements; for example, a rule can check whether the last W element under an NG element (i.e. the head noun of a noun group) is of a particular type, and then include the whole noun group in a higher-level construction. Element contents can be looked up in a lexicon. The lexicon lookup supports multi-word entries, and multiple rule matches are always resolved to the longest one.
</Paragraph>
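<Paragraph>
The paper does not show fsgmatch's actual rule syntax, so the following sketch only mimics the kind of element-level transduction just described, using the NNP-plus-"Ltd." rule as an example; tokens are modelled here as (tag, text) pairs rather than real SGML elements.

    # Sketch of an element-level transduction of the kind described above:
    # one or more NNP-tagged <W> elements followed by <W>Ltd.</W> are
    # wrapped into an ENAMEX element of type ORGANIZATION.
    def wrap_organisations(words):
        out, i = [], 0
        while i < len(words):
            j = i
            while j < len(words) and words[j][0] == "NNP" and words[j][1] != "Ltd.":
                j += 1
            if j > i and j < len(words) and words[j][1] == "Ltd.":
                inner = " ".join('<W C="%s">%s</W>' % w for w in words[i:j + 1])
                out.append('<ENAMEX TYPE="ORGANIZATION">%s</ENAMEX>' % inner)
                i = j + 1
            else:
                out.append('<W C="%s">%s</W>' % words[i])
                i += 1
        return " ".join(out)

    if __name__ == "__main__":
        tagged = [("DT", "The"), ("NN", "contract"), ("VBD", "went"), ("TO", "to"),
                  ("NNP", "Example"), ("NNP", "Widgets"), ("NNP", "Ltd.")]
        print(wrap_organisations(tagged))
</Paragraph>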
<Paragraph position="24">
TIMEX, NUMEX, ENAMEX
In our MUC system, TIMEX and NUMEX expressions are handled differently from ENAMEX expressions. The reason for this is that temporal and numeric expressions in English newspapers have a fairly structured appearance which can be captured by means of grammar rules. We developed grammars for the temporal and numeric expressions we needed to capture, and also compiled lists of temporal entities and currencies.
</Paragraph>
<Paragraph position="25">
The SGML transducer fsgmatch used these resources to wrap the appropriate strings with TIMEX and NUMEX tags.
</Paragraph>
<Paragraph position="26">
ENAMEX expressions are more complex, and more context-dependent. Lists of organisations and place names, and grammars of person names, are useful resources, but need to be handled with care: context will determine whether Arthur Andersen is used as the name of a person or a company, whether Washington is a location or a person, or whether Granada is the name of a company or a location. At the same time, once Granada has been used as the name of a company, the author of a newspaper article will not suddenly start using it to indicate a location without giving contextual clues that such a shift in denotation has taken place. Because of this, we strongly believe that identification of supportive context is more important for the identification of names of places, organisations and people than are lists or grammars. We do use such lists, but alter them dynamically: if anywhere in the text we have found sufficient context to decide that Granada is used as the name of an organisation, it is added to our list of organisations for the further processing of that text. When we start processing a new text, we no longer make any assumptions about whether Granada is an organisation or a place, until we find supportive context for one or the other.
</Paragraph>
<Paragraph position="27">
To identify ENAMEX elements we combine symbolic transduction of SGML elements with probabilistic partial matching in 5 phases:
1. sure-fire rules
2. partial match 1
3. relaxed rules
4. partial match 2
5. title assignment
We describe each in turn.
</Paragraph>
<Paragraph position="28">
ENAMEX: 1. Sure-fire Rules
The sure-fire transduction rules used in the ENAMEX task are very context-oriented and they fire only when a possible candidate expression is surrounded by a suggestive context. For example, "Gerard Klauer" looks like a person name, but in the context "Gerard Klauer analyst" it is the name of an organisation (as in "General Motors analyst"). Sure-fire rules rely on known corporate designators (Ltd., Inc., etc.), titles (Mr., Dr., Sen.), and definite contexts such as those in Figure 2. At this stage our MUC system treats information from the lists as likely rather than definite and always checks whether the context is either suggestive or non-contradictive. For example, a likely company name with a conjunction is left untagged at this stage if the company is not listed in a list of known companies: in a sentence like "this was good news for China International Trust and Investment Corp", it is not clear at this stage whether the text deals with one or two companies, and no markup is applied. Similarly, the system postpones the markup of unknown organizations whose name starts with a sentence-initial common word, as in "Suspended Ceiling Contractors Ltd denied the charge". Since the sentence-initial word has a capital letter, it could be an adjective modifying the company "Ceiling Contractors Ltd", or it could be part of the company name, "Suspended Ceiling Contractors Ltd". Names of possible locations found in our gazetteer of place names are marked as location only if they appear with a context that is suggestive of location. "Washington", for example, can just as easily be a surname or the name of an organization. Only in a suggestive context, like "in the Washington area", will it be marked up as location.
</Paragraph>
<Paragraph position="29">
ENAMEX: 2. Partial Match 1
After the sure-fire symbolic transduction the system performs a probabilistic partial match of the entities identified in the document. This is implemented as an interaction between two tools.
[Figure 2 legend, partially recovered: ... words; DD is a digit; PROF is a profession (director, manager, analyst, etc.); REL is a relative (sister, nephew, etc.); JJ* is a sequence of zero or more adjectives; LOC is a known location; PERSON-NAME is a valid person name recognized by a name grammar.]
</Paragraph>
<Paragraph position="30">
The first tool collects all named entities already identified in the document. It then generates all possible partial orders of the composing words, preserving their order, and marks them if found elsewhere in the text. For instance, if at the first stage the expression "Lockheed Martin Production" was tagged as an organization because it occurred in a context suggestive of organisations, then at the partial matching stage all instances of "Lockheed Martin Production", "Lockheed Martin", "Lockheed Production", "Martin Production", "Lockheed" and "Martin" will be marked as possible organizations. This markup, however, is not definite, since some of these words (such as "Martin") could refer to a different entity. This annotated stream goes to a second tool, a pre-trained maximum entropy model. It takes into account contextual information for named entities, such as their position in the sentence, whether these words exist in lowercase and whether they were used in lowercase in the document, etc. These features are passed to the model as attributes of the partially matched words. If the model provides a positive answer for a partial match, the match is wrapped into a corresponding ENAMEX element. Figure 3 gives an example of this.
</Paragraph>
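<Paragraph>
The partial orders mentioned above are simply the non-empty subsequences of the name's words with word order preserved; a minimal sketch follows (the check that a variant actually occurs elsewhere in the text, and the subsequent maximum entropy filtering, are left out).

    from itertools import combinations

    # Generate the "partial orders" of a multi-word name described above:
    # every non-empty subsequence of its words, preserving word order.
    def partial_orders(name):
        words = name.split()
        return {" ".join(words[i] for i in idx)
                for r in range(1, len(words) + 1)
                for idx in combinations(range(len(words)), r)}

    if __name__ == "__main__":
        for variant in sorted(partial_orders("Lockheed Martin Production")):
            print(variant)
        # Lockheed, Martin, Production, Lockheed Martin, Lockheed Production,
        # Martin Production, Lockheed Martin Production
</Paragraph>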
<Paragraph position="31">
ENAMEX: 3. Relaxed Rules
Once this has been done, the system again applies the symbolic transduction rules. But this time the rules have much more relaxed contextual constraints and extensively use the information from already existing markup and from lexicons. For instance, the system will mark word sequences which look like person names. For this it uses a grammar of names: if the first capitalised word occurs in a list of first names and the following word(s) are unknown capitalised words, then this string can be tagged as a PERSON. Here we are no longer concerned that a person name can refer to a company. If the name grammar had applied earlier in the process, it might erroneously have tagged "Philip Morris" as a PERSON instead of an ORGANISATION. However, at this point in the chain of ENAMEX processing, that is not a problem anymore: "Philip Morris" will by now already have been identified as an ORGANISATION by the sure-fire rules or during partial matching. If the author of the article had also been referring to the person "Philip Morris", s/he would have used explicit context to make this clear, and our MUC system would have detected this. If there had been no supportive context so far for "Philip Morris" as an organisation or a person, then the name grammar at this stage will tag it as a likely person, and check whether there is supportive context for that hypothesis.
</Paragraph>
<Paragraph position="32">
At this stage the system will also attempt to resolve the "and" conjunction problem noted above with "this was good news for China International Trust and Investment Corp". The system checks whether possible parts of the conjunction were used in the text on their own and thus are names of different organizations; if not, the system has no reason to assume that more than one company is being talked about.
</Paragraph>
<Paragraph position="33">
In a similar vein, the system resolves the attachment of sentence-initial capitalised modifiers, the problem alluded to above with the "Suspended Ceiling Contractors Ltd" example: if the modifier was seen with the organization name elsewhere in the text, with a capital letter and not at the start of a sentence, then the system has good evidence that the modifier is part of the company name; if the modifier does not occur anywhere else in the text with the company name, it is assumed not to be part of it.
</Paragraph>
<Paragraph position="34">
At this stage known organizations and locations from the lists available to the system are marked in the text, again without checking the context in which they occur.
</Paragraph>
<Paragraph position="35">
ENAMEX: 4. Partial Match 2
At this point, the system has exhausted its resources (name grammar, list of locations, etc.). The system then performs another partial match to annotate names like "White" when "James White" had already been recognised as a person, and to annotate company names like "Hughes" when "Hughes Communications Ltd." had already been identified as an organisation. As in Partial Match 1, this process of partial matching is again followed by a probabilistic assignment supported by the maximum entropy model.
</Paragraph>
<Paragraph position="36">
ENAMEX: 5. Title Assignment
Because titles of news wires are in capital letters, they provide little guidance for the recognition of names. In the final stage of ENAMEX processing, entities in the title are marked up, by matching or partially matching them against the entities found in the text, and checking against a maximum entropy model trained on document titles. For example, in "murdoch satellite explodes on take-off", "Murdoch" will be tagged as a person because it partially matches "Rupert Murdoch" elsewhere in the text. As one would expect, the sure-fire rules give very high precision (around 96-98%) but very low recall; in other words, they do not find many ENAMEX entities, but the ones they find are correct. Note that the sure-fire rules do not use list information much; the high precision is achieved mainly through the detection of supportive context for what are in essence unknown names of people, places and organisations. Recall goes up dramatically during Partial Match 1, when the knowledge obtained during the first step (e.g. that this is a text about Washington the person rather than Washington the location) is propagated further through the text, context permitting. Subsequent phases of processing gradually add more and more ENAMEX entities (recall increases to around 90%), but on occasion introduce errors (resulting in a slight drop in precision). Our final score for ORGANISATION, PERSON and LOCATION is given in the bottom line of Figure 4.
</Paragraph>
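<Paragraph>
A minimal sketch of the title-assignment matching described above (the maximum entropy model trained on document titles is omitted, and the entity table is a toy example):

    # Sketch of the title-assignment step: words of an all-capitals title
    # are matched against entities already found in the body of the document.
    def tag_title(title_words, body_entities):
        index = {}
        for name, etype in body_entities.items():
            index[name.upper()] = etype
            for part in name.split():            # allow partial (single-word) matches
                index.setdefault(part.upper(), etype)
        return [(w, index.get(w.upper(), "-")) for w in title_words]

    if __name__ == "__main__":
        body = {"Rupert Murdoch": "PERSON"}
        for word, tag in tag_title("MURDOCH SATELLITE EXPLODES ON TAKE-OFF".split(), body):
            print(word, tag)
        # MURDOCH PERSON, SATELLITE -, EXPLODES -, ON -, TAKE-OFF -
</Paragraph>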
</Section>
<Section position="4" start_page="7" end_page="8" type="metho">
<SectionTitle> WALKTHROUGH EXAMPLES </SectionTitle>
<Paragraph position="0">
<ENAMEX TYPE="PERSON">MURDOCH</ENAMEX> SATELLITE FOR LATIN PROGRAMMING EXPLODES ON TAKEOFF
The system correctly tags "Murdoch" as a PERSON, despite the fact that the title is all capitalised and there is little supportive context. The reason for this is that elsewhere in the text there are sentences like "dealing a potential blow to Rupert Murdoch's ambitions", and the system correctly analysed "Rupert Murdoch" as a PERSON on the basis of its grammar of names (see ENAMEX: 3. Relaxed Rules).
</Paragraph>
<Paragraph position="1">
During Partial Match 2, the partial orders of this name are generated, and any occurrences of "Rupert" and "Murdoch" are tagged as PERSONs (e.g. in the string "Murdoch-led venture"), context permitting.
</Paragraph>
<Paragraph position="2">
During the Title Assignment phase, "Murdoch" in the title is then also tagged as PERSON, since there is no context to suggest otherwise.
</Paragraph>
<Paragraph position="3">
<ENAMEX TYPE="PERSON">Llennel Evangelista</ENAMEX>, a spokesman for <ENAMEX TYPE="ORGANIZATION">Intelsat</ENAMEX>, a global satellite consortium ...
</Paragraph>
<Paragraph position="4">
"Llennel Evangelista" is correctly tagged as PERSON. Our grammar of names would not have been able to detect this, since it did not have "Llennel" as a possible Christian name; this again illustrates that it is dangerous to rely too much on resources like lists of Christian names, since these will never be complete. However, our MUC system detected that "Llennel Evangelista" is a person at a much earlier stage, because of the sure-fire rule that in clauses like "Xxxx, a JJ* PROFESSION for/of/in ORG", the string of unknown, capitalized words Xxxx refers to a PERSON. Using partial matching, "Evangelista" in "Evangelista said..." was also tagged as PERSON.
</Paragraph>
<Paragraph position="5">
"Intelsat" was correctly tagged as an ORGANISATION because of the context in which it appears: "Xxxx, a JJ* consortium/company/...". During Partial Matching, other occurrences of "Intelsat" are marked as ORGANISATION, e.g. in "Intelsat satellite".
</Paragraph>
<Paragraph position="6">
"Grupo Televisa SA" is recognised through the rule that "Xxxx SA/NV/Ltd..." are names of organisations. Through partial matching, "Grupo Televisa" without the "SA" is also recognised as an ORGANIZATION.
</Paragraph>
<Paragraph position="7">
"Globo" is recognised as an ORGANIZATION because elsewhere in the text there is reasonable evidence that "Globo" is the name of an organisation. In addition, there is a conjunction rule which prefers conjunctions of like entities.
</Paragraph>
<Paragraph position="8">
This conjunction rule also worked for the string "in U7ited States and Russia": "Russia" is in the list of locations and in a context supportive of locations; because of the typo, "U7ited States" was not in the list of locations. But because of the conjunction rule, it is correctly tagged as a LOCATION nevertheless.
</Paragraph>
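<Paragraph>
A minimal sketch of this conjunction rule (the location list is a toy stand-in, and the real system also checks the surrounding context before committing):

    # Sketch of the conjunction rule described above, which prefers
    # conjunctions of like entities: if one conjunct is a known LOCATION
    # and the other is an unknown capitalised expression, both conjuncts
    # are tagged as LOCATIONs.
    KNOWN_LOCATIONS = {"Russia", "France"}

    def tag_conjuncts(first, second):
        if (second in KNOWN_LOCATIONS and first[:1].isupper()) or \
           (first in KNOWN_LOCATIONS and second[:1].isupper()):
            return ("LOCATION", "LOCATION")
        return (None, None)

    if __name__ == "__main__":
        # "U7ited States" (with the typo) is not on the list, but it is
        # still tagged LOCATION because it is conjoined with "Russia".
        print(tag_conjuncts("U7ited States", "Russia"))
</Paragraph>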
</Section>
<Section position="5" start_page="8" end_page="8" type="metho">
<SectionTitle> EVALUATION </SectionTitle>
<Paragraph position="0">
Our system achieved a combined Precision and Recall score of 93.39. This was the highest score of the participating named entity recognition systems. Here is a breakdown of our scores:
In what follows, we will discuss our system's performance in each of the Named Entity categories. In general, our system performed very well in all categories. But the reason our system outperformed other systems was its performance in the category ORGANIZATION, where it scored significantly better than the next best system: 91 precision and 95 recall, whereas the next best system scored 87 precision and 89 recall. We attribute this to the fact that our system does not rely much on pre-established lists, but instead builds document-specific lists on the fly, looking for sure-fire contexts to make decisions about names of organisations, and to its use of partial orders of multi-word entities. This pays off particularly in the case of organisations, which are often multi-word expressions containing many common words.
</Paragraph>
</Section>
<Section position="6" start_page="8" end_page="9" type="metho">
<SectionTitle> ORGANIZATION </SectionTitle>
<Paragraph position="0">
One type of error occurred when a company such as "Granada Group Plc" was referred to just as "Granada", and this word is also a known location. The location information tended to override the tags resulting from partial matching, resulting in the wrong tag. The reason for this is that these metonymic relations do not always hold: if a text refers to an organisation called the "Pittsburgh Pirates", and it then refers to "Pittsburgh", it is more likely that "Pittsburgh" is a reference to a location rather than another reference to that organisation. In the same vein, the system treats a reference to "Granada" as a location, even after reference has been made to the organisation "Granada Group Plc", in the absence of clear contextual clues to the contrary.
</Paragraph>
<Paragraph position="1">
A second type of error resulted from wrongly resolving conjunctions in company names, as in <ORG>Smith and Ivanoff Inc.</ORG>. As explained above, the system's strategy was to assume that the conjunction referred to a single organisation, unless its constituent parts occurred in its list of known companies or occurred on their own elsewhere in the text. In some cases, the absence of such information led to mistaggings, which are penalised quite heavily: you lose once in recall (since the system did not recognise the name of the company) and twice in precision (since the system produced two spurious names).
</Paragraph>
<Paragraph position="2">
Many spurious taggings in ORGANIZATION were caused by the fact that artefacts like newspapers or TV channels have very similar contexts to ORGANIZATIONs, resulting in mistaggings. For instance, in "editor of the Pacific Report", the string "Pacific Report" was wrongly tagged as an ORGANISATION because of the otherwise very productive rule which says that Xxxx in "PROF of/at/with Xxxx" should be tagged as an ORGANIZATION.
</Paragraph>
<Paragraph position="3">
The misses consisted mostly of short expressions mentioned just once in the text and without a suggestive context. As a result, the system did not have enough information to tag these terms correctly. Also, there were about 40 mentions of the Ariane 4 and 5 rockets, and according to the answer keys "Ariane" should have been tagged as an organisation in each case, accounting for 40 of the 152 misses.
</Paragraph>
</Section>
<Section position="7" start_page="9" end_page="9" type="metho">
<SectionTitle> PERSON </SectionTitle>
<Paragraph position="0">
The PERSON category did not present too many difficulties to our system. The system handled well a few difficult cases where an expression "sounded" like a person name but in fact was not, e.g. "Gerard Klauer" in "a Gerard Klauer analyst", the example discussed above.
</Paragraph>
<Paragraph position="1">
One article was responsible for quite a few errors: in an article about Timothy Leary's death, "Timothy Leary" was twice and "Zachary Leary" seven times recognised as a PERSON, but 11 other mentions of "Leary" were wrongly tagged as ORGANIZATION. The reason for this was the phrase "...family members with Leary when he died". The system applied the rule PROFs of/for/with Xxxx+ ==> ORGANIZATION. The word "members" was listed in the lexicon as a profession, and this caused "Leary" to be wrongly tagged as ORGANIZATION. This accounts for 11 of the 24 incorrectly tagged PERSONs.
</Paragraph>
<Paragraph position="2">
Most of the 17 missing person names were one-word expressions mentioned just once in the text, and the system did not have enough information to perform a classification.
</Paragraph>
</Section>
<Section position="8" start_page="9" end_page="9" type="metho">
<SectionTitle> LOCATION </SectionTitle>
<Paragraph position="0">
LOCATION was the most disappointing category for us. Just one word, "Columbia", which was tagged as a location but was in fact the name of a space shuttle, was responsible for 38 of the 73 spurious assignments. The problem arose from sentences like "Columbia is to blast off from NASA's Kennedy Space Center...", where we erroneously tagged "Columbia" as a location. Interestingly, we correctly did not tag "Columbia" in the string "space shuttle Columbia"; this was correctly recognised by the system as an artefact. In the Named Entity Recognition Task one does not have to mark up artefacts, but it is useful to recognise them nevertheless: using the partial matching rule, the system now also knew that "Columbia" was the likely name of an artefact and should not be marked up.
</Paragraph>
<Paragraph position="1">
Unfortunately, the text also contained the expression "a satellite 13 miles from Columbia". This context is strongly suggestive of LOCATION. That, and the fact that "Columbia" occurs in the list of place names, overruled the evidence that it referred to an artefact.
</Paragraph>
<Paragraph position="2">
Out of the 55 misses, 30 were due to not assigning LOCATION tags to various heavenly bodies.
</Paragraph>
</Section>
<Section position="9" start_page="9" end_page="10" type="metho">
<SectionTitle> TIMEX </SectionTitle>
<Paragraph position="0">
In the TIMEX category we have relatively low recall. Our failure to mark up expressions was sometimes due to underspecification in the guidelines and the training data; with the corrected answer keys our recall for times went up from 79 to 85. Apart from this, we also failed to recognise expressions like "the second day of the shuttle's 10-day mission", "the fiscal year starting Oct. 1", etc., which need to be marked as TIMEX expressions in their entirety. And we did not group expressions like "from August 1993 to July 1995" into one group but tagged them as two temporal expressions (which gives three errors).
</Paragraph>
</Section>
<Section position="10" start_page="10" end_page="10" type="metho">
<SectionTitle> NUMEX </SectionTitle>
<Paragraph position="0">
In the NUMEX category most of our errors came from the fact that we preferred simple constructions over more complex groupings. For instance, we did not tag "between $300 million and $700 million" as a single NUMEX expression, but instead tagged it as:
between <NUMEX TYPE="MONEY">$300 million</NUMEX> and <NUMEX TYPE="MONEY">$700 million</NUMEX>
</Paragraph>
</Section>
</Paper>