<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0432">
<Title>Named Entity Recognition Using a Character-based Probabilistic Approach</Title>
<Section position="3" start_page="0" end_page="0" type="metho">
<SectionTitle> 2 Probabilistic Classification using Orthographic Tries </SectionTitle>
<Paragraph position="0"> Tries are an efficient data structure for capturing statistical differences between strings in different categories.</Paragraph>
<Paragraph position="1"> In an orthographic trie, a path from the root through n nodes represents a string a1a2...an. The n-th node in the path stores the occurrences (frequency) of the string a1a2...an in each word category. These frequencies can be used to calculate probability estimates P(c | a1a2...an) for each category c. Tries have previously been used in both supervised (Patrick et al., 2002) and unsupervised (Cucerzan and Yarowsky, 1999) named entity recognition.</Paragraph>
<Paragraph position="2"> Each node in an orthographic trie stores the cumulative frequency information for each category in which a given string of characters occurs. A heterogeneous node represents a string that occurs in more than one category, while a homogeneous node represents a string that occurs in only one category. If a string a1a2...an occurs in only one category, all longer strings a1a2...an...an+k are also of the same category. This redundancy can be exploited when constructing a trie. We build minimum-depth (MD) tries, which have the condition that all internal nodes are heterogeneous and all leaves are homogeneous. MD-tries are only as large as is necessary to capture the differences between categories, and can be built efficiently to large depths. MD-tries have been shown to give better performance than a standard trie with the same number of nodes (Whitelaw and Patrick, 2002).</Paragraph>
<Paragraph position="3"> Given a string a1a2...an and a category c, an orthographic trie yields a set of relative probabilities P(c | a1), P(c | a1a2), ..., P(c | a1a2...an). The probability that a string indicates a particular class is estimated along the whole trie path, which helps to smooth scores for rare strings. The contribution of each level in the trie is governed by a linear weighting function, so that the final estimate is a weighted sum of the form P(c | a1a2...an) = lambda_1 P(c | a1) + lambda_2 P(c | a1a2) + ... + lambda_n P(c | a1a2...an), where lambda_i is the weight assigned to trie level i.</Paragraph>
<Paragraph position="4"> Tries are highly language independent. They make no assumptions about the character set, or about the relative importance of different parts of a word or its context. Tries use a progressive back-off and smoothing model that is well suited to the classification of previously unseen words.</Paragraph>
<Paragraph position="5"> While each trie looks only at a single context, multiple tries can be used together to capture both word-internal and external contextual evidence of class membership.</Paragraph>
</Section>
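The following is a minimal Python sketch of the kind of orthographic trie described above: each node keeps per-category frequency counts, relative probabilities are read off along the matched path, and a linear weighting over trie levels produces the smoothed estimate. It is illustrative only; the class name, the uniform default weights, and the omission of MD-trie pruning are assumptions, not details taken from the paper.

```python
from collections import defaultdict


class OrthographicTrie:
    """Character trie storing per-category frequency counts (illustrative sketch;
    MD-trie pruning of homogeneous subtrees is omitted)."""

    def __init__(self):
        self.counts = defaultdict(int)   # category -> frequency of this prefix
        self.children = {}               # next character -> child node

    def add(self, string, category):
        """Record one occurrence of `string` in `category` along the path."""
        node = self
        for ch in string:
            node = node.children.setdefault(ch, OrthographicTrie())
            node.counts[category] += 1

    def path_probabilities(self, string, category):
        """Relative probabilities P(c | a1), P(c | a1a2), ... down the matched path,
        backing off to the longest matchable prefix for unseen strings."""
        probs, node = [], self
        for ch in string:
            if ch not in node.children:
                break
            node = node.children[ch]
            total = sum(node.counts.values())
            probs.append(node.counts[category] / total if total else 0.0)
        return probs

    def estimate(self, string, category, weights=None):
        """Linear combination over trie levels; uniform weights unless supplied."""
        probs = self.path_probabilities(string, category)
        if not probs:
            return 0.0
        if weights is None:
            weights = [1.0 / len(probs)] * len(probs)
        return sum(w * p for w, p in zip(weights, probs))
```

Training amounts to calling add(word, category) for every labelled token, after which estimate(word, category) returns the path-smoothed score discussed above.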
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Restoring Case Information </SectionTitle>
<Paragraph position="0"> In European languages, named entities are often distinguished through their use of capitalisation. However, capitalisation commonly plays another role, that of marking the first word in a sentence. In addition, some sentences such as newspaper headlines are written in all-capitals for emphasis. In these environments, the case information that has traditionally been so useful to NER systems is lost.</Paragraph>
<Paragraph position="1"> Previous work in NER has been aware of this problem of dealing with words without accurate case information, and various workarounds have been exploited. Most commonly, feature-based classifiers use a set of capitalisation features and a sentence-initial feature (Bikel et al., 1997). Chieu and Ng used global information such as the occurrence of the same word with other capitalisation in the same document (Chieu and Ng, 2002a), and have also used a mixed-case classifier to teach a "weaker" classifier that did not use case information at all (Chieu and Ng, 2002b).</Paragraph>
<Paragraph position="2"> We propose a different solution to the problem of caseless words. Rather than noting their lack of case and treating them separately, we propose to restore the correct capitalisation as a preprocessing step, allowing all words to be treated in the same manner. If this process of case restoration is sufficiently accurate, capitalisation should be more correctly associated with entities, resulting in better recognition performance.</Paragraph>
<Paragraph position="3"> Restoring case information is not equivalent to distinguishing common nouns from proper nouns. This is particularly evident in German, where all types of nouns are written with an initial capital letter. The purpose of case restoration is simply to reveal the underlying capitalisation model of the language, allowing machine learners to learn more accurately from orthography.</Paragraph>
<Paragraph position="4"> We propose two methods, each of which requires a corpus with accurate case information. Such a corpus is easily obtained; any unannotated corpus can be used once sentence-initial words and all-caps sentences have been excluded. For both languages, the training corpus consisted of the raw data, training data, and test data combined. The first method for case restoration is to replace a caseless word with its most frequent form. Word capitalisation frequencies can easily be computed for corpora of any size. The major weakness of this technique is that each word is classified individually, without regard for its context. For instance, "new" will always be written in lowercase, even when it is part of a valid capitalised phrase such as "New York".</Paragraph>
<Paragraph position="5"> The second method uses an MD-trie which, if allowed to extend over word boundaries, can effectively capture the cases where a word has multiple possible forms. Since an MD-trie is only built as deep as is required to capture differences between categories, most paths will still be quite shallow. As in other word categorisation tasks, tries can robustly deal with unseen words by performing classification on the longest matchable prefix.</Paragraph>
<Paragraph position="6"> To test these recapitalisation methods, the raw, training, and development sets were used as the training set. From the second test set, only words with known case information were used for testing, resulting in corpora of 30,484 and 39,639 words for English and German respectively. Each word was classified as either lowercase ("new"), initial-caps ("New"), all-caps ("U.S."), or inner-caps ("ex-English"). On this test set, the word-frequency method and the trie-based method achieved accuracies of 93.9% and 95.7% respectively for English, and 95.4% and 96.3% for German. Table 1 shows the trie performance for English in more detail.</Paragraph>
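As a concrete illustration of the first (word-frequency) recapitalisation method, the sketch below builds a most-frequent-form table while skipping sentence-initial words and all-caps sentences, then restores case word by word. The function names and the tokenised-sentence representation are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter, defaultdict


def train_case_model(sentences):
    """Count surface forms per lowercased word, ignoring positions where case
    information is unreliable (sentence-initial words, all-caps sentences)."""
    forms = defaultdict(Counter)
    for tokens in sentences:
        alpha = [t for t in tokens if t.isalpha()]
        if alpha and all(t.isupper() for t in alpha):
            continue                      # skip headline-style all-caps sentences
        for token in tokens[1:]:          # skip the sentence-initial word
            forms[token.lower()][token] += 1
    return forms


def restore_case(tokens, forms):
    """Replace each caseless token with its most frequently observed form."""
    restored = []
    for token in tokens:
        seen = forms.get(token.lower())
        restored.append(seen.most_common(1)[0][0] if seen else token)
    return restored
```

On a large corpus this restores "new" to lowercase even inside "New York", which is exactly the context-blindness noted above and the gap the trie-based method closes by extending over word boundaries.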
<Paragraph position="7"> In practice, it is usually possible to train on the same corpus as is being recapitalised. This will give more accurate information for those words which appear in both known-case and unknown-case positions, and should yield higher accuracy. This process of restoring case information is language independent and requires only an unannotated corpus in the target language. It is a pre-processing step that can be ignored for languages where case information is either not present or not lost.</Paragraph>
</Section>
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 4 Classification Process </SectionTitle>
<Paragraph position="0"> The training data was converted to use the IOB2 phrase model (Tjong Kim Sang and Veenstra, 1999). This phrase model was found to be more appropriate to the nature of NE phrases in both languages, in that the first word in a phrase may behave differently from subsequent words.</Paragraph>
<Paragraph position="1"> MD-tries were trained on the prefix and suffix of the current word, and on the left and right surrounding contexts. Each trie Tx produces an independent probability estimate, P_Tx(c | context). These probabilities are combined to produce a single estimate P(c | context).</Paragraph>
<Paragraph position="2"> These combined probabilities are then used directly as observation probabilities in a hidden Markov model (HMM) framework. An HMM uses probability matrices P, A, and B for the initial state, the state transitions, and the symbol emissions respectively (Manning and Schütze, 1999). We derive P and A from the training set. Rather than explicitly defining B, trie-based probability estimates are used directly within the standard Viterbi algorithm, which exploits dynamic programming to efficiently search the entire space of state assignments. Illegal assignments, such as an I-PER without a preceding B-PER, cannot arise due to the restrictions of the transition matrix.</Paragraph>
<Paragraph position="3"> The datasets for both languages contained extra information, including chunk and part-of-speech information, as well as lemmas for the German data. While these are rich sources of data, and may help especially in the recognition phase, our aim was to investigate the feasibility of a purely orthographic approach, and as such no extra information was used.</Paragraph>
</Section>
</Paper>
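To make the decoding step of Section 4 concrete, the sketch below shows one way the trie-based estimates could serve as observation scores inside a standard Viterbi decoder. The product-and-renormalise combination of the per-trie estimates is an assumption (the paper's exact combination formula is not reproduced here), and the array layout and function names are illustrative rather than the authors' implementation.

```python
import numpy as np


def combine_trie_estimates(per_trie_probs):
    """Merge independent per-trie estimates P_Tx(c | context) for one token.
    Shown here as a normalised product (an assumption, not the paper's formula).
    per_trie_probs: list of length-S arrays, one per trie, over the S states."""
    combined = np.prod(np.array(per_trie_probs), axis=0)
    total = combined.sum()
    return combined / total if total > 0 else np.full(len(combined), 1.0 / len(combined))


def viterbi(obs_probs, init_probs, trans_probs):
    """Standard Viterbi decoding with externally supplied observation scores.

    obs_probs:   T x S array; obs_probs[t, s] is the combined trie estimate for
                 state s at position t (used in place of an explicit B matrix)
    init_probs:  length-S initial state distribution (P in the text)
    trans_probs: S x S transition matrix (A); zero entries rule out illegal
                 sequences such as I-PER without a preceding B-PER
    """
    T, S = obs_probs.shape
    delta = np.zeros((T, S))
    backptr = np.zeros((T, S), dtype=int)
    delta[0] = init_probs * obs_probs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * trans_probs   # previous state x current state
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * obs_probs[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return list(reversed(path))
```

Because illegal transitions carry zero probability in trans_probs, any path containing, for example, O followed by I-PER scores zero and is never selected, mirroring the constraint described in the text.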