<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1055"> <Title>Using Machine Learning to Maintain Rule-based Named-Entity Recognition and Classification Systems Georgios Petasis, Frantz Vichot, Francis Wolinski</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Machine learning has recently been proposed as a promising solution to a major problem in language engineering: the construction of lexical resources. Most real-world language engineering systems make use of a variety of lexical resources, in particular grammars and lexicons. The use of general-purpose resources is ineffective, since most applications use a specialised vocabulary that is not supported by general-purpose lexicons and grammars. For this reason, significant effort is currently being put into the construction of generic tools that can quickly adapt to a particular thematic domain.</Paragraph> <Paragraph position="1"> The adaptation of these tools mainly involves the adaptation of domain-specific semantic lexical resources.</Paragraph> <Paragraph position="2"> Named-entity recognition and classification (NERC) is the identification of proper names in text and their classification as different types of named entity (NE), e.g. persons, organisations, locations, etc. This is an important subtask in most language engineering applications, in particular information retrieval and extraction. The lexical resources that are typically included in a NERC system are a lexicon, in the form of gazetteer lists, and a grammar, responsible for recognising the entities that are either not in the lexicon or appear in more than one gazetteer list. The manual adaptation of those two resources to a particular domain is time-consuming and in some cases impossible, due to the lack of experts.
The exploitation of learning techniques to support this adaptation task has attracted the attention of researchers in language engineering.</Paragraph> <Paragraph position="3"> However, the adaptation of lexical resources to a specific domain at a certain point in time is not sufficient on its own. The performance of a NERC system degrades over time (Vichot et al., 1999; Wolinski et al., 2000) due to the introduction of new NEs or changes in the meaning of existing ones. We therefore need to find ways to facilitate the maintenance of rule-based NERC systems. This paper presents such a method, exploiting machine learning in an innovative way.</Paragraph> <Paragraph position="4"> Our method controls rule-based NERC systems with NERC systems constructed by a machine learning algorithm. The method comprises two stages: the training stage, during which a supervised machine learning algorithm constructs a new system using data generated by the rule-based system, and the deployment stage, in which the results of the two systems are compared on new data and their disagreements are used as signals for change in the rule-based system. Note that, unlike most applications of supervised machine learning, the training data for the new system are not produced manually.</Paragraph> <Paragraph position="5"> In order to illustrate the generality of this approach, we have tested it with two different NERC systems, one for Greek and one for French. The results are very encouraging and show that machine learning techniques can be used for the maintenance of rule-based systems.</Paragraph> <Paragraph position="6"> Section 2 presents existing work on the domain adaptation of NERC systems using machine learning (ML) techniques. Section 3 presents the two rule-based NERC systems for Greek and French. Section 4 explains our method and Section 5 describes the two experiments and presents the evaluation results.
Finally, Section 6 concludes and presents our future plans.</Paragraph> </Section> </Paper>
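<!--
Illustrative sketch, not taken from the paper. The two-stage method outlined in the introduction (train a learner on the output of the rule-based system, then treat disagreements on new data as signals for change) might look roughly like the Python below. Everything named here is an assumption made for illustration: the toy gazetteer, the two context features, and the majority-vote learner, which merely stands in for whatever supervised algorithm is actually used.

```python
from collections import Counter, defaultdict

# Hypothetical gazetteer standing in for a full rule-based NERC system.
GAZETTEER = {"Athens": "LOCATION", "Paris": "LOCATION", "Petasis": "PERSON"}


def rule_based_tag(tokens):
    # Toy rule-based tagger: gazetteer lookup, "O" for every other token.
    return [GAZETTEER.get(tok, "O") for tok in tokens]


def features(tokens, i):
    # Context features only (previous token, capitalisation), so the
    # learner can generalise to names that the gazetteer does not list.
    prev_tok = tokens[i - 1].lower() if i > 0 else "<s>"
    return (prev_tok, tokens[i][:1].isupper())


def train(sentences):
    # Training stage: the labels come from the rule-based system itself,
    # not from manual annotation.
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, label in enumerate(rule_based_tag(tokens)):
            counts[features(tokens, i)][label] += 1
    # Keep the majority label for each feature tuple.
    return {feat: c.most_common(1)[0][0] for feat, c in counts.items()}


def learned_tag(model, tokens):
    # Unseen feature tuples default to "O".
    return [model.get(features(tokens, i), "O") for i in range(len(tokens))]


def disagreements(model, tokens):
    # Deployment stage: report tokens on which the two systems differ.
    rule_labels = rule_based_tag(tokens)
    ml_labels = learned_tag(model, tokens)
    return [(tok, r, m)
            for tok, r, m in zip(tokens, rule_labels, ml_labels) if r != m]


training = [
    ["He", "lives", "in", "Athens"],
    ["She", "works", "in", "Paris"],
    ["Mr", "Petasis", "spoke"],
]
model = train(training)
flagged = disagreements(model, ["They", "met", "in", "Lyon"])
print(flagged)  # [('Lyon', 'O', 'LOCATION')]
```

In this toy run the learned system tags "Lyon" as a location because of its context, while the gazetteer does not know it; a disagreement of this kind is what a maintainer would inspect when deciding whether the rule-based resources need updating.
-->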