File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/00/w00-0904_intro.xml
Size: 5,476 bytes
Last Modified: 2025-10-06 14:00:58
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0904"> <Title>Comparison between Tagged Corpora for the Named Entity Task</Title> <Section position="4" start_page="20" end_page="20" type="intro"> <SectionTitle> 2 Models </SectionTitle> <Paragraph position="0"> Recent studies into the use of supervised learning-based models for the NE task in the molecular-biology domain have shown that models based on hidden Markov models (HMMs) (Collier et al., 2000) and decision trees (Nobata et al., 1999) are not only adaptable to this highly technical domain, but are also much more generalizable to new classes of words than systems based on traditional hand-built heuristic rules such as (Fukuda et al., 1998). We now describe the two models used in our experiments, based on the C4.5 decision tree package (Quinlan, 1993) and HMMs (Rabiner and Juang, 1986).</Paragraph> <Section position="1" start_page="20" end_page="20" type="sub_section"> <SectionTitle> 2.1 Decision tree named entity recogniser: NE-DT </SectionTitle> <Paragraph position="0"> A decision tree is a type of classifier which has &quot;leaf nodes&quot; indicating classes and &quot;decision nodes&quot; that specify some test to be carried out, with one branch or subtree for each possible outcome of the test. A decision tree can be used to classify an object by starting at the root of the tree and moving through it until a leaf is encountered. When we can define suitable features for the decision tree, the system can achieve good performance with only a small amount of training data.</Paragraph> <Paragraph position="1"> The system we used is based on one that was originally created for Japanese documents (Sekine et al., 1998). It has two phases, one for creating the decision tree from training data and the other for generating the class-tagged text based on the decision tree. When generating decision trees, tri-grams of words are used, and each word is represented as a quadruple of features. The following features are used to generate conditions in the decision tree (a minimal illustrative sketch follows this subsection). Part-of-speech information: There are 45 part-of-speech categories, whose definitions are based on the Penn Treebank's categories. We use a tagger based on Adwait Ratnaparkhi's method (Ratnaparkhi, 1996).</Paragraph> <Paragraph position="2"> Character type information: Orthographic information such as upper case, lower case, capitalization, numerical expressions and symbols is considered. These character features are the same as those used by NEHMM, described in the next section and shown in Table 1. Word lists specific to the domain: Word lists are made from the training corpus.</Paragraph> <Paragraph position="3"> Only the 200 highest frequency words are used.</Paragraph> </Section>
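As a rough sketch of how such a feature-based decision-tree recogniser can be assembled, the fragment below builds a per-word feature quadruple (surface word restricted to a domain word list, part-of-speech tag, character type, and list membership) over a tri-gram window and trains a generic decision tree on a toy sentence. scikit-learn's DecisionTreeClassifier stands in for C4.5, the part-of-speech tags are supplied by hand rather than by Ratnaparkhi's tagger, and the domain word list, feature names and training example are illustrative assumptions, not the NE-DT implementation.

# Minimal sketch of a decision-tree NE recogniser in the spirit of NE-DT.
# scikit-learn's DecisionTreeClassifier stands in for C4.5; the word list,
# feature names and the tiny training sample are illustrative only.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

DOMAIN_WORDS = {"protein", "gene", "kinase"}   # stand-in for the top-200 list

def char_type(w):
    """Coarse orthographic class of a token."""
    if w.isdigit():      return "digits"
    if w.isupper():      return "allcaps"
    if w[0].isupper():   return "initcap"
    if w.islower():      return "lower"
    return "symbol"

def word_features(words, pos_tags, i):
    """Quadruple of features for one token; POS tags are assumed given here."""
    w = words[i]
    return {
        "word": w.lower() if w.lower() in DOMAIN_WORDS else "<OTHER>",
        "pos": pos_tags[i],
        "ctype": char_type(w),
        "in_list": w.lower() in DOMAIN_WORDS,
    }

def trigram_features(words, pos_tags, i):
    """Features of the previous, current and next word (the tri-gram window)."""
    feats = {}
    for off, name in ((-1, "prev"), (0, "cur"), (+1, "next")):
        j = i + off
        if 0 <= j < len(words):
            for k, v in word_features(words, pos_tags, j).items():
                feats[f"{name}_{k}"] = v
    return feats

# A tiny hand-labelled sentence: tokens, POS tags and IOB-style NE classes.
words = ["The", "IL-2", "gene", "binds", "to", "it"]
pos   = ["DT",  "NN",   "NN",   "VBZ",  "TO", "PRP"]
tags  = ["O",   "B-DNA","I-DNA","O",    "O",  "O"]

X = [trigram_features(words, pos, i) for i in range(len(words))]
vec = DictVectorizer(sparse=False)
clf = DecisionTreeClassifier().fit(vec.fit_transform(X), tags)
print(clf.predict(vec.transform([trigram_features(words, pos, 2)])))

With real tagger output and the full 200-word frequency list, the same pipeline extends directly to the feature set described above.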
<Section position="2" start_page="20" end_page="20" type="sub_section"> <SectionTitle> 2.2 Hidden Markov model named entity recogniser: NEHMM </SectionTitle> <Paragraph position="0"> HMMs are a widely used class of learning algorithms and can be considered to be stochastic finite state machines. In the following model, summarized here from the full description given in (Collier et al., 2000), we consider words to be ordered pairs consisting of a surface word, W, and a word feature, F, given as < W, F >. The word features themselves are discussed below. As is common practice, we need to calculate the probability of the first word's name class differently from that of every other word in the sequence, since we have no initial name class to make a transition from. Accordingly we use one equation (Equation 1) to calculate the initial name-class probability and a second (Equation 2) for all other words and their name classes, where f(·) is calculated with maximum-likelihood estimates from counts on training data. In our current system we set the constants λ_i and α_i by hand and let Σ α_i = 1.0, Σ λ_i = 1.0, α_0 ≥ α_1 ≥ α_2, and λ_0 ≥ λ_1 ≥ ... ≥ λ_5. The current name class NC_t is conditioned on the current word and feature, the previous name class, NC_t-1, and the previous word and feature.</Paragraph> <Paragraph position="5"> Equations 1 and 2 implement a linear-interpolating HMM that incorporates a number of sub-models designed to reduce the effects of data sparseness.</Paragraph> <Paragraph position="6"> Once the state transition probabilities have been calculated according to Equations 1 and 2, the Viterbi algorithm (Viterbi, 1967) is used to search the state space of possible name-class assignments in linear time to find the highest probability path, i.e. to maximise Pr(W, NC). The final stage of our algorithm, applied after name-class tagging is complete, is a clean-up module called Unity. This creates a frequency list of words and name classes and then re-tags the text using the most frequently used name class assigned by the HMM. We have generally found that this improves F-score performance by between 2 and 4%, both for re-tagging spuriously tagged words and for finding untagged words in unknown contexts that had been correctly tagged elsewhere in the text.</Paragraph> <Paragraph position="7"> Table 1 shows the character features that we used in both NEHMM and NE-DT. Our intuition is that such features will help the model to find similarities between known words that were found in the training set and unknown words, and so overcome the unknown word problem.</Paragraph> </Section> </Section> </Paper>
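The interpolation equations referred to above as Equations 1 and 2 did not survive extraction. The LaTeX block below is a hedged reconstruction of their general shape, assuming standard back-off terms and inferring only from the surrounding description (three α-weighted terms for the first word; λ-weighted terms conditioning NC_t on the current word and feature, the previous name class and the previous word and feature); the exact sub-models in Collier et al. (2000) may differ.

% Hedged reconstruction of the general form of Equations 1 and 2; the exact
% back-off (sub-model) terms used by Collier et al. (2000) may differ.
\begin{align}
\Pr(NC_1 \mid \langle W_1,F_1\rangle) ={}&
    \alpha_0\, f(NC_1 \mid \langle W_1,F_1\rangle)
  + \alpha_1\, f(NC_1 \mid F_1)
  + \alpha_2\, f(NC_1) \tag{1}\\
\Pr(NC_t \mid \langle W_t,F_t\rangle, NC_{t-1}, \langle W_{t-1},F_{t-1}\rangle) ={}&
    \lambda_0\, f(NC_t \mid \langle W_t,F_t\rangle, NC_{t-1}, \langle W_{t-1},F_{t-1}\rangle) \nonumber\\
  &+ \lambda_1\, f(NC_t \mid \langle W_t,F_t\rangle, NC_{t-1})
   + \lambda_2\, f(NC_t \mid F_t, NC_{t-1}) \nonumber\\
  &+ \lambda_3\, f(NC_t \mid NC_{t-1})
   + \lambda_4\, f(NC_t \mid \langle W_t,F_t\rangle)
   + \lambda_5\, f(NC_t) \tag{2}
\end{align}

The weights within each equation sum to one, so each probability remains a proper distribution over name classes.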
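The Unity clean-up pass also lends itself to a very short sketch. The function below builds the frequency list of word/name-class pairs from the HMM output and re-tags every word with its most frequent class; function and variable names are illustrative assumptions, not taken from the NEHMM code.

# Minimal sketch of a Unity-style clean-up pass: re-tag every word with the
# name class it most frequently received from the HMM elsewhere in the text.
# Names are illustrative, not taken from the NEHMM implementation.
from collections import Counter, defaultdict

def unity_retag(tagged_tokens):
    """tagged_tokens: list of (word, name_class) pairs produced by the HMM."""
    counts = defaultdict(Counter)
    for word, name_class in tagged_tokens:
        counts[word.lower()][name_class] += 1
    # Re-tag each occurrence with that word's most frequent name class.
    return [(word, counts[word.lower()].most_common(1)[0][0])
            for word, _ in tagged_tokens]

hmm_output = [("IL-2", "PROTEIN"), ("binds", "O"), ("IL-2", "PROTEIN"),
              ("to", "O"), ("IL-2", "O")]
print(unity_retag(hmm_output))
# The lone spurious "O" tag on the last "IL-2" is overturned by the majority class.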