<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1055"> <Title>Using Machine Learning to Maintain Rule-based Named-Entity Recognition and Classification Systems Georgios Petasis +, Frantz Vichot SS, Francis Wolinski SS</Title> <Section position="5" start_page="0" end_page="1" type="metho"> <SectionTitle> 3 Rule-based NERC Systems </SectionTitle> <Paragraph position="0"> A typical NERC system consists of a lexicon and a grammar. The lexicon is a set of NEs that are known beforehand and have been classified into semantic classes. The grammar is used to recognize and classify NEs that are not in the lexicon and to decide upon the final classes of NEs in ambiguous cases.</Paragraph> <Paragraph position="1"> Manual construction of NERC systems is a complicated and time-consuming process, even for experts. The meaning of a single sentence may vary considerably according to the category to which a NE is assigned. For example, the sentence &quot;Express group intends to sell Le Point for 700 MF&quot; indicates the sale of a newspaper company, if &quot;Le Point&quot; is classified as an organisation. In contrast, the following sentence, which is grammatically identical to the previous one, &quot;Compagnie des Signaux intends to sell TVM430 for 700 MF&quot;, gives only the price of an industrial product.</Paragraph> <Paragraph position="2"> In order for a NERC system to be able to recognise and categorise NEs correctly, both the lexicon and the grammar have to be validated on large corpora, testing their efficiency and their robustness. However, this process does not ensure that the performance of the developed system will remain steady over time. 
In almost all thematic domains, the introduction of new NEs or a change in the meaning of existing ones can increase the error rate of the system.</Paragraph> <Paragraph position="3"> Our approach tries to identify such cases, facilitating the maintenance of the NERC system.</Paragraph> <Paragraph position="4"> The following subsections briefly describe the Greek and French rule-based NERC systems that have been used in our experiments.</Paragraph> <Section position="1" start_page="0" end_page="1" type="sub_section"> <SectionTitle> 3.1 The Greek NERC System </SectionTitle> <Paragraph position="0"> The Greek NERC system (Farmakiotou et al., 2000) used for the purposes of this experiment forms part of a larger Greek information extraction system, being developed in the context of the R&amp;D project MITOS.</Paragraph> <Paragraph position="1"> The NERC component of this system mainly consists of three processing stages: linguistic pre-processing, NE identification and NE classification. The linguistic pre-processing stage involves some basic tasks: tokenisation, sentence splitting, part-of-speech tagging and stemming. Once the text has been annotated with part-of-speech tags, a stemmer is used. The aim of the stemmer is to reduce the size of the lexicon as well as the size and complexity of the NERC grammar.</Paragraph> <Paragraph position="2"> The NE identification stage involves the detection of NE boundaries, i.e., the start and the end of all the possible spans of tokens that are likely to belong to a NE. Identification consists of three sub-stages: initial delimitation, separation and exclusion. Initial delimitation involves the application of general patterns. These patterns are combinations of a limited number of words, selected types of tokens (e.g. tokens consisting of capital characters), special symbols and punctuation marks. 
At the separation sub-stage, possible NEs that are likely to contain more than one NE, or a NE attached to a non-NE, are detected and attachment problems are resolved. Finally, at the exclusion sub-stage two types of criteria are used for exclusion from the possible NE list: the context of the phrase and being part of an exclusion list. Suggestive context for exclusion consists of common names that refer to products, services or artifacts. The exclusion list includes capitalized abbreviations of common nouns, financial terms, capitalized person titles, which are not ambiguous, and nouns commonly found in names of products, artifacts and services.</Paragraph> <Paragraph position="3"> Once the possible NEs have been identified, the classification stage begins. Classification involves three sub-stages: application of classification rules, gazetteer-based classification, and partial matching of classified named-entities with unclassified ones. Classification rules take into account both internal and external evidence (McDonald, 1996), i.e., the words and symbols that comprise the possible name and the context in which it occurs. Gazetteer-based classification involves the look-up of pre-stored lists of known proper names (gazetteers). The gazetteers contain stemmed forms and have been compiled from Web sites and an annotated training corpus. The size of the gazetteers is rather small (3,059 names). At the partial matching sub-stage, classified names are matched against unclassified ones, aiming at the recognition of the truncated or variable forms of names. (Footnote: MITOS project, http://www.iit.demokritos.gr/skel/mitos)</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.2 The French NERC System </SectionTitle> <Paragraph position="0"> The French NERC system has been implemented with the use of a rule-based inference engine (Wolinski et al., 1995). 
It is based on a large knowledge base (lexicon) including 8,000 proper names that share 10,000 forms and consist of 11,000 words. It has been used continuously since 1995 in several real-time document filtering applications (Wolinski et al., 2000).</Paragraph> <Paragraph position="1"> The uses of the NERC system in these applications are the following: 1. Segmentation of NEs, in order to improve the performance of the syntactic analyser, particularly in the case of long proper names which contain grammatical markers (e.g. prepositions, conjunctions, commas, full stops).</Paragraph> <Paragraph position="2"> 2. Recognition of known NEs in order to supply precise information to a document filtering module.</Paragraph> <Paragraph position="3"> 3. Classification of NEs in order to feed a document filtering module with information dealing with the very nature of the NEs quoted in the documents.</Paragraph> <Paragraph position="4"> The NERC system tries to classify each NE in one of four different categories: association (non-commercial organisation), person, location or company.</Paragraph> <Paragraph position="5"> For the classification of known entities, a crucial problem appears when several NEs share a single form. To deal with these cases, two sets of rules have been implemented: 1. Local context: For instance, &quot;Saint-Louis&quot; may be interpreted in one of the following ways: a city in Missouri, a French group in the food production industry, a small industry &quot;les Cristalleries de Saint Louis&quot;, a small town in France, a hospital in Paris, etc. Exploration of the local context using the proper name may enable, in certain cases, a choice to be made between these various interpretations. If the text speaks of &quot;St-Louis (Missouri)&quot;, only the first interpretation should be adopted. 
In order to do this, the knowledge base should contain the information that &quot;Saint-Louis&quot; is in Missouri, and a rule should exist to interpret the affixing of a parenthesis. 2. Global context: Abbreviated NEs and acronyms are much more frequent sources of ambiguity and are almost always common to several NEs. In general, such ambiguous forms of NEs do not occur on their own in news but almost always together with non-ambiguous forms that enable the ambiguity to be removed. For instance, if the NEs &quot;Saint-Louis&quot; and &quot;Hopital Saint-Louis&quot; appear in a single news item, the interpretation corresponding to the hospital is more likely to be the one that should be adopted. For unknown entities, three sets of rules have been implemented: 1. Prototypes: Many NEs are constructed according to some prototypes. These can be categorised using pattern matching rules. Mr Andre Blavier, Kyocera Corp, Conde-sur-Huisne, Honda Motor, IBM-Asia, Bernard Tapie Finance, Siam Nissan Automobile Co Ltd are good examples of such prototypes.</Paragraph> <Paragraph position="6"> 2. Local context: Many single-word unknown NEs (some known NEs as well) may also be categorised using the local context. For instance, short phrases such as &quot;Peskine, director of the group&quot;, &quot;the shareholders of Fibaly&quot; or &quot;the mayor of Gisenyi&quot; are used as categorisation rules.</Paragraph> <Paragraph position="7"> 3. Global context: After the first appearance of a NE in full, its head (e.g. family name, main company) is often used alone in the text instead of the full name. The company Kyocera Corp, for example, may be designated by the single word Kyocera in the remainder of the text. 
For each such unknown word starting with a capital letter, a special rule examines whether it appears inside another NE in the text.</Paragraph> </Section> </Section> <Section position="6" start_page="1" end_page="1" type="metho"> <SectionTitle> 4 Controlling a Rule-based System Using Machine Learning </SectionTitle> <Paragraph position="0"> Machine learning has been used successfully to control a rule-based system that performs a different task, namely document filtering (Wolinski et al., 2000). The learning method used in that case was a neural network (Stricker et al., 2001).</Paragraph> <Paragraph position="1"> In our present study, we control the rule-based NERC systems that have been presented in Section 3, with NERC systems constructed by the C4.5 algorithm. Our method comprises two stages: the training stage, during which C4.5 constructs a new system using data generated by the rule-based system, and the deployment stage, in which the results of the two systems are compared on new data and their disagreements are used as signals for change in the rule-based system. This section describes the basic principles of our control method.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.1 Control method: training stage </SectionTitle> <Paragraph position="0"> The training stage of our method consists of the following processing steps (Figure 1): 1. Running the rule-based NERC system on a large training corpus (containing several thousands of NEs in our case). The aim of this process is to recognise and classify the NEs in the corpus. The end product is a set of NEs, associated with their class.</Paragraph> <Paragraph position="1"> 2. Constructing a separate NERC system by applying C4.5 on the data generated by the rule-based system. In this process, the classified NEs are used as training data by C4.5, in order to construct the second NERC system (trained NERC). 
For each classified NE a training example (vector) is created, containing information about the part of speech and gazetteer tags of the first and the last two words of the NE, as well as the two words preceding and the two following the NE. It is important to note that, unlike other uses of supervised machine learning methods, this approach does not require manual tagging of training data.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.2 Control method: deployment stage </SectionTitle> <Paragraph position="0"> In the deployment stage, the two NERC systems are compared on a new corpus to identify disagreements. Despite the fact that the second method is trained on data generated by the first, the different nature of the NERC system generated by C4.5, i.e., a decision tree, leads to interesting disagreements between the two methods.</Paragraph> <Paragraph position="1"> The deployment stage consists of the following processing steps (Figure 2): 1. Running the rule-based NERC system on a new corpus. It should be stressed here that the documents in this corpus differ in some characteristic way from those in the training corpus. In our experiments the difference is chronological, i.e., the new corpus consists of recent news articles. The reason for adopting this approach is that we are interested in the maintenance of a rule-based system through time. An alternative approach might be for the new corpus to be from a slightly different thematic domain. In that case, the goal of the process would be the customisation of the rule-based system to a new domain.</Paragraph> <Paragraph position="2"> 2. Running the trained NERC system on the same corpus.</Paragraph> <Paragraph position="3"> 3. Comparing the results provided by both systems to identify cases of disagreement. 
The result is a set of data where the two systems disagree: in our case, disagreements concern the different categories assigned by the NERC systems to NEs (see Section 5 for detailed results). These cases are then provided to the language engineer, who needs to evaluate them and decide on changes for the rule-based system.</Paragraph> </Section> <Section position="3" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 5 Results </SectionTitle> <Paragraph position="0"> In order to evaluate the proposed method, two different experiments were conducted, one for each language. The exact experimental settings as well as the evaluation results are presented in the following sections.</Paragraph> </Section> </Section> <Section position="7" start_page="1" end_page="2" type="metho"> <SectionTitle> 5.1 Results for the Greek System </SectionTitle> <Paragraph position="0"> For the experiment regarding the Greek language, we used three NE classes: organisations, persons and locations. For the purposes of the experiment, two corpora of financial news were used.</Paragraph> <Paragraph position="1"> The first corpus, which was used for training, consisted of 5,000 news articles from the years 1996 and 1997, containing 10,010 instances of NEs (1,885 persons, 1,781 locations, 6,344 organisations). (Footnote: the corpora were provided by the Greek publishing company Kapa-TEL.)</Paragraph> <Paragraph position="2"> The second corpus, which was used for evaluation, consisted of 5,779 news articles from the years 1999 and 2000 and contained 11,786 instances of NEs (1,137 persons, 810 locations, 9,839 organisations). A good way to give an overview of the cases of disagreement of the two systems is through a contingency matrix, as shown in Table 1. 
The rows of this table correspond to the classification of the rule-based system, while the columns correspond to the classification of the system constructed by C4.5.</Paragraph> <Paragraph position="3"> As we can see from Table 1, in 95% of the cases the two systems are in agreement. This means that, in order to update the rule-based NERC system, we have to examine only the 5% of the cases where the two systems disagree. Examining these cases gave us important insight regarding problems of the rule-based NERC system. Some examples are presented in the following sections.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 5.2 Recognition problems </SectionTitle> <Paragraph position="0"> The examination of cases in disagreement revealed some interesting problems regarding NE recognition. These problems concern NEs that the rule-based system identified only partially and, as a result, classified incorrectly.</Paragraph> <Paragraph position="1"> For example, in the stage of initial delimitation, the general patterns fail to identify NEs that contain numbers in their names, like the organisation &quot;Athena 2004&quot; (Athens 2004) representing the organising committee of the 2004 Olympics. In addition, during the separation phase some of the rules have not taken into account some inflexional endings, causing failures in separating some NEs. For example, in the phrase &quot;ouph . Politis uou G . Phlorides &quot; (the under-secretary of Culture G . Phlorides ) the recogniser failed to separate the person name from its title, due to the last accented character of the word &quot;Politi s uou &quot;.</Paragraph> <Paragraph position="2"> Finally, we were able to locate several stop-words and update our exclusion list. 
For instance, the phrase &quot;gra uuon ISDN&quot; (ISDN lines) was recognised as an organisation (as the word &quot;gra uuon &quot; is a frequent constituent of the names of airline or shipping companies), but in reality the text was referring to ISDN telephone lines.</Paragraph> <Paragraph position="3"> Apart from the problems identified in the recognition phase, the examination of the cases of disagreement revealed various problems regarding mainly the classification grammar. In fact, some of our classification rules were found to be too general, leading to wrong classifications.</Paragraph> <Paragraph position="4"> For example, according to one of the rules, a sequence of two words, starting with capital letters, constitutes a person name if it is preceded by a definite article and the endings of these two words belong to a specific set that usually denotes person names. This rule caused the classification of various non-NEs as persons, including &quot;tou Olu upiakou Khoriou &quot; (the Olympic Village).</Paragraph> <Paragraph position="5"> Another example of an overly general rule is a rule that classifies a sequence of abbreviations or nouns starting with a capital letter as an organisation, if this sequence is preceded by a comma that in turn is preceded by a NE already classified as an organisation. This rule caused the classification of a few person names as organisations, such as &quot;o dioiketes tes Ethnikes Trape zas , Th .Karatzas &quot; (the director of National Bank, Th .Karatzas ).</Paragraph> </Section> </Section> </Paper>
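<!--
Editorial sketch, not part of the original paper. The deployment-stage comparison of Section 4.2 and the contingency matrix of Section 5.1 might be illustrated roughly as follows. This is a minimal sketch under stated assumptions: the function name compare_systems, the class labels and the toy entities are all hypothetical, and the paper does not say how the authors actually implemented the comparison.

```python
from collections import Counter

# Hypothetical class inventory, mirroring the three Greek NE classes.
CLASSES = ("person", "location", "organisation")


def compare_systems(entities, rule_labels, trained_labels):
    """Compare the classes assigned to the same NEs by two systems.

    Returns a contingency table keyed by (rule_class, trained_class),
    the agreement rate, and the list of disagreements to be reviewed.
    """
    if not (len(entities) == len(rule_labels) == len(trained_labels)):
        raise ValueError("inputs must have the same length")
    # Rows: class from the rule-based system; columns: class from the trained system.
    contingency = Counter(zip(rule_labels, trained_labels))
    agreements = sum(n for (r, t), n in contingency.items() if r == t)
    rate = agreements / len(entities) if entities else 0.0
    disagreements = [
        (e, r, t)
        for e, r, t in zip(entities, rule_labels, trained_labels)
        if r != t
    ]
    return contingency, rate, disagreements


# Hypothetical toy data, not taken from the corpora used in the paper.
entities = ["Kyocera Corp", "Gisenyi", "Peskine", "Olympic Village"]
rule_labels = ["organisation", "location", "person", "person"]
trained_labels = ["organisation", "location", "person", "location"]

contingency, rate, disagreements = compare_systems(
    entities, rule_labels, trained_labels
)
for row in CLASSES:
    print(row, [contingency[(row, col)] for col in CLASSES])
print("agreement rate:", rate)
print("cases for the language engineer:", disagreements)
```

Only the off-diagonal cells of the table, i.e. the disagreements, would be handed to the language engineer, which is the point of the method: the bulk of the cases, where both systems agree, need not be inspected.
-->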