File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/w03-0432_intro.xml
Size: 1,440 bytes
Last Modified: 2025-10-06 14:01:56
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0432"> <Title>Named Entity Recognition Using a Character-based Probabilistic Approach</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Language independent NER requires the development of a metalinguistic model that is sufficiently broad to accommodate all languages, yet can be trained to exploit the specific features of the target language. Our aim in this paper is to investigate the combination of a character-level model, orthographic tries, with a sentence-level hidden Markov model. The local model uses affix information from a word and its surrounds to classify each word independently, and relies on the sentence-level model to determine a correct state sequence.</Paragraph> <Paragraph position="1"> Capitalisation is an often-used discriminator for NER, but can be misleading in sentence-initial or all-caps text.</Paragraph> <Paragraph position="2"> We choose to use a model that makes no assumptions about the capitalisation scheme, or indeed the character set, of the target language. We solve the problem of misleading case in a novel way by removing the effects of sentence-initial or all-caps capitalisation. This results in a simpler language model and easier recognition of named entities while remaining strongly language independent.</Paragraph> </Section> </Paper>