<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0612">
  <Title>Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence</Title>
  <Section position="4" start_page="90" end_page="92" type="metho">
    <SectionTitle>
2 The Basic Model
</SectionTitle>
    <Paragraph position="0"> Before describing the algorithm, we will present a brief overview of some of its goals: * a language independent core model * the ability to exploit basic language-specific features null * the ability to learn from small named entity lists (on the order of 100 total training names) * the capability to handle both large and small texts * good class-scalability properties (the possibility of defining as many named entity types as desired, so that for different languages or different purposes the user can choose different classes of words to be recognized) * the capability to store the information learned from each instance for further use Three important concepts are used in our model: 2.1 'rrie structures are used for both morphological and contextual information Tries provide an effective, efficient and flexible data structure for storing both contextual and morphological patterns and statistics. First, they are very compact representations. Second, they support a natural hierarchical smoothing procedure for distributional class statistics. We consider character-based tries, in which each node contains a probability distribution (when working with tokenized text, two distributions are considered in each node, one for tokens and one for types). The distribution stored at each node contain the probability of each name class given the history ending at that node. Each distribution also has two standard classes, named &amp;quot;questionable&amp;quot; (unassigned probability mass in terms of entity classes, to be motivated below) and &amp;quot;non-entity'. To simplify the notations, we will refer to a start and end point bounded portion of text being analyzed (in order to determine if it represents a named entity or not) as a token.</Paragraph>
    <Paragraph position="1"> Two tries are used for context (left and right) and two for internal morphological patterns of tokens. Figure i shows an example of a morphological prefix trie, which stores the characters of tokens from</Paragraph>
    <Paragraph position="3"> Anda are a nice couple&amp;quot; left to right from given starting points (with optional word boundaries indicated by &amp;quot;#&amp;quot;).</Paragraph>
    <Paragraph position="4"> Suffix tries (typically more informative) have equivalent structure but reversed direction. The left and right context tries have the same structure as well, but the list of links refers now to the tokens which have the particular context represented by the path from the root to the current node. For right context, the letters are introduced in the trie in normal order, for left context they are considered in the reversed order (in our example, &amp;quot;Anda&amp;quot; has as left context &amp;quot;dna#xela#&amp;quot;). Similarly, nodes of the context tries contain links to the tokens that occurred in the particular contexts defined by the paths. Two bipartite graph structures are created in this way by these links.</Paragraph>
    <Paragraph position="5"> For reasons that will be explained later, raw counts are kept for the distributions.</Paragraph>
    <Paragraph position="6"> The probability of a token/context as being in or indicating a class is computed along the whole path from the root to the terminal node of the token/context. In this way, effective smoothing is realized for rare tokens or contexts.</Paragraph>
    <Paragraph position="7"> Considering a token/context formed from characters lll2...ln, (i.e. the path in the trie is root - ll -</Paragraph>
    <Paragraph position="9"> It is reasonable to expect that smaller lambdas should correspond to smaller indices, or even that A1 _&lt; A2 _&lt; ... _&lt; An. In order to keep the number of parameters low, we used the following model:</Paragraph>
    <Paragraph position="11"> where a, ~ E (0, 1), ~ having a small value The symbol F is used instead of P since we have raw distributions (frequencies) and a normalization step is needed to compute the final probability distribution. null A simpler model can use just one parameter (setting g = an), but this has limited flexibility in optimizing the hierarchical inheritance - the probability of a class given the first letter is often not very informative for some languages (such as English or Romanian) or, by contrast, may be extremely important for others (e.g. Japanese).</Paragraph>
    <Section position="1" start_page="91" end_page="91" type="sub_section">
      <SectionTitle>
2.2 EM-style bootstrapping
</SectionTitle>
      <Paragraph position="0"> The basic concept of this bootstrapping procedure is to iteratively leverage relatively independent sources of information. Beginning with some seed names for each class, the algorithm learns contextual patterns that are indicative for those classes and then iteratively learns new class members and word-internal morphological clues. Through this cycle, probability distributions for class given token, prefix/suffix or context are incrementally refined. More details are given when describing stage 2 of the algorithm.</Paragraph>
    </Section>
    <Section position="2" start_page="91" end_page="92" type="sub_section">
      <SectionTitle>
2.3 Unassigned probability mass as opposed to the classical maximum entropy principle
</SectionTitle>
      <Paragraph position="0"> opposed to the classical maximum entropy principle When faced with a highly skewed observed class distribution for which there is little confidence due to small sample size, a typical response to this uncertainty in statistical machine learning systems is to backoff or smooth to the more general class distribution, which is typically more uniform. Unfortunately, this representation is difficult to distinguish from a conditional distribution based on a very large sample (and hence estimated with confidence) that  just happens to have a similar fairly uniform true distribution. One would like a representation that does not obscure this distinction, and represents the uncertainty of the distribution separately.</Paragraph>
      <Paragraph position="1"> We resolve this lproblem while retaining a single probability distribution over classes by adding a separate &amp;quot;questi0nable&amp;quot; (or unassigned) cell that reflects the uncertainty of the distribution. Probability mass continues to be distributed among the remaining class cells proportional to the observed distribution in the :data, but with a total sum (&lt; 1) that reflects the confidence in the distribution and is equal to 1 - P(q'uestionable).</Paragraph>
      <Paragraph position="2"> This approach has the advantage of explicitly representing the uncertainty in a given class distribution, facilitating the further development of an interactive system, while retaining a single probability distribution that simplifies trie architecture and model combinatiofi. Incremental learning essentially becomes the process of gradually shifting probability mass from questionable/uncertain to one of the primary categories.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="92" end_page="94" type="metho">
    <SectionTitle>
3 The Algorithm
</SectionTitle>
    <Paragraph position="0"> The algorithm can! be divided into five stages, which are summarized below.</Paragraph>
    <Paragraph position="1"> Stage 0: build the initial training list of class representatives Stage 1: read the text and build the left and right morphological and context tries Stage 2: introduce the training information in the tries and re-estimate the distributions by bootstrapping null Stage 3: identify and classify the named entities in the text using competing classifiers Stage 4: update the entity and context training space, using the new extracted information Stage O: This stage is performed once for each langnage/task and cbnsists of defining the classes and filling in the initial class seed data with examples provided by the user. The list of class training names should be as unambiguous as possible and (ideally) also relatively common. It is also necessary to have a relatively large unannotated text for bootstrapping the contextual models and classifying new named entities. Examples Of such training seeds and text for Romanian language are given in Tables 1 and 21 . For the primary experiments reported in this paper, we have studied a relatively difficult 3-way named entity partition between:First (given) names, Last (family) names and Place 'names. The first two tend to be relatively hard to distinguish in most languages. A 1The text refers %0 the mayor of a small town of Alba county, who was so drunk while officiating at a wedding that he shook the bride's hand and kissed the groom.</Paragraph>
    <Paragraph position="2"> simpler person/place-based distinction more comparable to the MUC-6 EMAMEX task is evaluated in  intimpl~ri de-a dreptul penibile, relatate in &amp;quot;Evenimentul zilei&amp;quot;. Practic, primul gospodar al celei mai bogate comune in aur din &lt;place&gt; Muntii Apuseni &lt;/place&gt; este mai tot timpul beat-crit~, drept pentru care, la oficierea unei c~s~torii, a s~rutat mina mirelui, a strins mina miresei si a intocmit certificat de deces in locul celui de c~s~torie. Recent, &lt;fname&gt; Andrei  There are two ways to start this stage, either by tokenizing the text or considering it in raw form. When tokenization is used, each token is inserted in the two morphological tries: one that keeps the letters of the tokens in the normal (prefix) order, another that keeps the letter in the reverse (suffix) order. For each letter on the path, the raw distributions are changed by adding the a priori probability</Paragraph>
    <Paragraph position="4"> Romanian. The paths shown are for Iulian, a &amp;quot;first name&amp;quot; entity, contained in the training word list; Ster/an a &amp;quot;last name&amp;quot;, not in the training data; and a partial path for the tokens ending in -escu.</Paragraph>
    <Paragraph position="5"> of the token belonging to each class (language dependent information may be used here). For example, in the case of Indo-European languages, if the token starts with an upper-case letter, we add 1 full count (all probability mass) to the &amp;quot;questionable&amp;quot; sum, as this entity is initially fully ambiguous. If the token starts with lower-case (and hence is an unlikely name) in this case we add the bulk of the probability mass 6 (e.g.6 t&gt; 0.9) to &amp;quot;non-entity&amp;quot; and the remainder (1-5) to &amp;quot;questionable&amp;quot; (otherwise unassigned). Other language-specific orthographic clues could potentially affect this initial probability mass assignment.</Paragraph>
    <Paragraph position="6"> When no tokenization is applied, we have to consider possible starting and ending points. Therefore, the strings (which, for simplicity, we will refer as well as tokens) introduced in the prefix morphological trie and the ones introduced in the suffix trie may differ.</Paragraph>
    <Paragraph position="7"> The left context of each token is introduced, letters in reverse order, in the left context trie, with pointers to the token in the morphlogical prefix trie; the right context of each token is introduced, in normal order, in the right context trie, keeping pointers to the token in the suffix trie. The distributions along the paths are modified according to the a pr/ori distribution of the targeted token.</Paragraph>
    <Paragraph position="8">  This stage is the core bootstrapping phase of the algorithm. In essence, as contextual models become better estimated, they identify additional named entities with increasing confidence, allowing reestimation and improvement of the internal morphological models. The additional training data that this yields allows the contextual models to be augmented and reestimated, and the cycle continues until convergence. One approach to this bootstrapping process is to use a standard continuous EM (Expectation-Maximization) family of algorithms (Baum, 1972; Dempster et al., 1977). The proposed approach outlined below is a discrete variant that is much less computationally intensive, and has the advantage of distinguishing between unknown probability distributions and those which are simply evenly distributed. The approach is conservative in that it only utilizes the class estimations for newly classified data in the retraining process if the class probability passes a confidence threshold, as defined below. The concept of confidence threshold can be captured through the following definitions of dominant and semi-dominant.</Paragraph>
    <Paragraph position="9"> Let us consider a discrete finite probability distribution P = (Pl, ...,Pn). We say that P has a dominant if there is an i in {1...n} such that pi &gt; 0.5, or</Paragraph>
    <Paragraph position="11"> We say that P has an a-semi-dominant with respect to an event k, where a &gt; 1, if it does not have k as dominant and there exist i in {1...n} such that</Paragraph>
    <Paragraph position="13"> A few commentsi about these definitions are necessary: it can be .easily observed that not every distribution has a dominant, even though it has a maximum value. The second definition, of a-semidominant, makes sense if we consider a particular event k that is not relevant (or the result cannot be measured). By rembving this event and normalizing the rest of the values, we obtain a new distribution (of size n-l) having i an a-dominant.</Paragraph>
    <Paragraph position="14"> The core of stage 2 is the bootstrapping procedure.</Paragraph>
    <Paragraph position="15"> The known names (either from the original training list or otherwise learned data) are inserted sequentially into the morphological tries, modifying the probability distributions of the nodes on the paths accordingly (the data structure is illustrated in Figures 1 and 2) . If the new distribution in one of the nodes on the path of a known token gains a dominant (for example &amp;quot;placer') then the effect of this change is propagated by reestimating other node distributions given this change. Each distribution on the context paths in which that token occurred in the text is modified, by subtracting from the &amp;quot;questionable&amp;quot; mass a quantity proportional to the number of times the respective token was found in that context and adding it to the dominant-position (e.g. &amp;quot;place&amp;quot;) mass. For the newly obtained distributions that gained a dominant :(in our example &amp;quot;place&amp;quot;) in the context trie, the bootstrapping procedure is called for all tokens that Occurred in that context, and so on, recursively. Here it is very important that we consider raw distributions and not normalize them.</Paragraph>
    <Paragraph position="16"> For example, if word &amp;quot;Mariana&amp;quot; occurs x times with the right context &amp;quot;merge&amp;quot; (meaning &amp;quot;goes&amp;quot;) and the distribution for &amp;quot;rhariana#&amp;quot; has now been identified with the dominant &amp;quot;first name&amp;quot;, then x units from the &amp;quot;questionable&amp;quot; mass can be moved to &amp;quot;first name&amp;quot; mass along the path of &amp;quot;merge#&amp;quot; in the right context trie. If semi-dominants are used instead of dominants then we have to account for the fact that the semi-dominants may change over time, so the probability mass must be moved either from &amp;quot;questionable&amp;quot; position Or previous semi-dominant position, if a semi-dominant state has been reached before. null It may be easily observed that stage 2 has a sequential characteristic, because the updating is done after reading each name incrementally. When using dominants the Order does not affect the process, because of the face that once a dominant state is reached, it cannot change to another dominant state in the future (probability mass is moved only from &amp;quot;questionable&amp;quot;). In the case of semi-dominants, the data ordering in the training file does influence the learning procedure.</Paragraph>
    <Paragraph position="17"> The more conservative strategy of using dominants rather then semi-dominants has, on the other hand, the disadvantage of cancelling or postponing the utilisation of many words. For example, if both &amp;quot;questionable&amp;quot; and &amp;quot;first name&amp;quot; have 49% of the mass then subsequent reestimation iterations are not initiated for this data, even though the alternative name classes are very unlikely.</Paragraph>
    <Paragraph position="18"> Considering those advantages and disadvantages, we used the less conservative semi-dominant approach as the default model.</Paragraph>
    <Paragraph position="19"> Stage 3: In this stage the text is re-analysed sequentially, and for each token (given a start-end point pair) a decision is made. Here the bipartite structure of the two pairs of tries has a central role: during stage 2, the left context and prefix tries interact with each other and so do the right context and suffix tries, but there's no interference between the two pairs during the bootstrapping stage. Therefore, for each instance of a token in the text, four classifiers are available, a different one given by each trie. The decision with regard to the presence of an entity and its classification is made by combining them. Comparative trials indicate that higher performance is achieved by initially having the classifters vote. Results indicate that the most accurate classifications are obtained from the two independently bootstrapped morphological tries (they incorporate the morphological information about the token to be classified, and, during the bootstrapping, they also incorporate information from all the contexts in which the token occurred). If the two agree (they have semi-dominants and they are the same) then the corresponding class is returned. Otherwise, agreement is tested between other paired independent classifiers (in order of empirically measured reliability). If no agreement is found, then a simple linear combination of all four is considered for the decision. This approach yields 6% higher F-measure than the simple interpolation of classifiers for the default parameters.</Paragraph>
    <Paragraph position="20"> Stage ~ : The newly classified tokens and contexts are saved for future use as potential seed data in subsequent named-entity classification on new texts.</Paragraph>
  </Section>
  <Section position="6" start_page="94" end_page="96" type="metho">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> The basic measures for evaluation of this work are precision and recall. Precision (P) represents the percentage of the entities that the system recognized  which are actually correct. Recall (R) represents the percentage of the correct named entities in the text that the system identified. Both measures are incorporated in the F-measure, F = 2PR/(P + R).</Paragraph>
    <Paragraph position="1"> It would be inappropriate to compare the results of a language independent system with the ones designed for only one language. As Day and Palmer (1997) observed, &amp;quot;the fact that existing systems perform extremely well on mixed-case English newswire corpora is certainly related to the years of research and organized evaluations on this specific task in this language. It is not clear what resources are required to adapt systems to new languages.&amp;quot; It is important to mention that the F-measure for the human performance on this task is about 96%, (Sundheim 1995). Our experiments on Romanian text were consistent with this figure.</Paragraph>
    <Section position="1" start_page="95" end_page="95" type="sub_section">
      <SectionTitle>
4.1 Baseline measures
</SectionTitle>
      <Paragraph position="0"> In order to obtain a baseline performance for this method we considered the performance of a system that tags only the examples found in one of the the original training wordlists. We consider this to be a plausible lower bound measure if the training words have not been selected from the test text. Day and Palmer (1997) showed that a baseline F-measure score for the ENAMEX task varies from 21.2% for English to 73.2% for Chinese. It is important to mention that, when they computed these figures, they trained their language independent system on large annotated corpora (e.g. the Wall Street Journal for English).</Paragraph>
      <Paragraph position="1"> The fact that the precision obtained by the base-line approach is not 100% indicates that the seed training names for each class are not completely unambiguous, and that a certain degree of ambiguity is generally unavoidable (in this case, mainly because of the interference between first names and last names).</Paragraph>
      <Paragraph position="2"> Another significant performance measure is forced classification accuracy, where the entities have been previously identified in the text and the only task is selecting their name class. To obtain baseline performance for this measure, we considered a System that uses the original training word labels if there is an exact match, with all other entities labeled with a default &amp;quot;last name&amp;quot; tag, the most common class in all languages studied. The baseline accuracy was measured at 61.18% for Romanian. System accuracies range from 77.12% to 91.76% on this same data.</Paragraph>
    </Section>
    <Section position="2" start_page="95" end_page="95" type="sub_section">
      <SectionTitle>
4.2 Evaluation of basic estimation methods
</SectionTitle>
      <Paragraph position="0"> The results shown in Table 3 were obtained for a Romanian text having 12320 words, from which 438 were entities, using a training seed set of 300 names (115 first names, 125 last names, and 60 city/country names).</Paragraph>
      <Paragraph position="1"> The baseline measures and default system (a) are as described above.</Paragraph>
      <Paragraph position="2"> In configuration (b), the based parameters of the system have been optimized for Romanian, using greedy search on an independent development test (devtest) set, yielding a slight increase in F-measure. Configuration (c) used the default parameters, but the more conservative &amp;quot;dominant&amp;quot; criterion was utilized, clearly favoring precision at the expense of recall. null Configuration (d), which is relevant for the ENAMEX task, represents the performance of the system when classes &amp;quot;first name&amp;quot; and &amp;quot;last name&amp;quot; are combined into &amp;quot;person&amp;quot; (whenever two or more such entities are adjacent, we consider the whole group as a &amp;quot;person&amp;quot; entity).</Paragraph>
      <Paragraph position="3"> Configuration (e) shows contrastive performance when using standard continuous EM smoothing on the same data and data structures.</Paragraph>
    </Section>
    <Section position="3" start_page="95" end_page="96" type="sub_section">
      <SectionTitle>
4.3 Evaluation by language and knowledge source
</SectionTitle>
      <Paragraph position="0"> source Table 4 shows system performance for 5 fairly diverse languages: Romanian, English, Greek, Turkish and Hindi. The initial 4 rows provide some basic details on the training data available for each language. Note that when annotators were generating the lists of 150-300 seed words, they had access to a development test from which to extract samples, but they were not constrained to this text and could add additional ones from memory. Furthermore, it was quite unpredictable how many contexts would actually be found for a given word in the development texts, as some appeared several times and many did not appear at all. Thus the total number of contextual matches for the seed words was quite variable, from 113-249, and difficult to control. It is also the case that not all additional contexts bring comparable new benefit, as many secondary instances of the same word in a given related document collection tend to have similar or identical surrounding contexts to the first instance (e.g. &amp;quot;Mayor of XXX&amp;quot; or &amp;quot;XXX said&amp;quot;), so in general it is quite difficult to control the actual training information content just by the number of raw seed word types that are annotated. null For each of these languages, 5 levels of information sources are evaluated. The baseline case is as previously described for Table 3. The context-only case restricts system training to the two (left and right) contextual tries, ignoring the prefix/suffix morphological information. The morphology only case, in contrast, restricts the system to only the two (prefix and suffix) morphological models. These can be estimated from the 3 training wordlists (150-300 words total), but without an independent source of information (e.g. context) via which bootstrapping can iterate, there is no available path by which these  models can learn the behaviour of previously unseen affixes and conquer new territory. Thus the model is entirely static On just the initial training data.</Paragraph>
      <Paragraph position="1"> For the same reasOns, the context only model is also static. In this case there is a possible bootstrapping path using alternating left and right context to expand coverage to new contexts, but this tends to be not robust and wa s not pursued. Interestingly, recall for morphology only is typically much higher than in the context only case. The reason for this is that the morphology models are full hierarchically smoothed character tries rather than word token tries, and hence have much '~ denser initial statistics for small training data sets~ proving greater partial matching potential for previously unseen words.</Paragraph>
      <Paragraph position="2"> In an effort to I test the contribution of the full iterative boostrapping, the &amp;quot;context and morphology only&amp;quot; results ', are based on the combination of all 4 tries, but w:ithout any bootstrapping. Thus they are trained ekclusively on the 150-300 training examples. Performance for the combined sources by language and knowledge source is in all cases greater than for the morphology or context source used alone. Furthermore, the full iterative bootstrapping clearly yields substantial improvement over the static models, almost exclusively in the form of increased recall (and its corresponding boost the the F-measure).</Paragraph>
      <Paragraph position="3"> Cross-language analysis yields further insight.</Paragraph>
      <Paragraph position="4"> First, recall is much higher for the 4 languages in which case is explicitly marked and is a clue for named entity identification (Romanian, English, Greek and Turkish) than for a language like Hindi, where there are no case distinctions and hence any word could potentially be a named entity. A language such as German would be roughly in the middle, where lower-case words have low probability as named entities, but capitalized words are highly ambiguous between common and proper nouns. Because approximately 96% of words in the Hindi text are not named entities, without additional orthographic clues the prior probability for &amp;quot;non-entity&amp;quot; is so strong that the morphological or contextual evi- null dence in favor of one of the named entity classes must be very compelling to overcome this bias. With only 50 training words per context this is difficult, and in the face of such strong odds against any of the named entity classes the conservative nature of the learning algorithm only braves an entity label (correctly) for 38% more words than the baseline model. In contrast, its performance on entity classification rather than identification, measured by forced choice accuracy in labelling the given entities, is comparable to all the other languages, with 79% accuracy relative to the 62% baseline. 2</Paragraph>
    </Section>
    <Section position="4" start_page="96" end_page="96" type="sub_section">
      <SectionTitle>
4.4 Evaluation at different training set sizes
</SectionTitle>
      <Paragraph position="0"> Figure 3 demonstrates that the performance of the algorithm is highly sensitive to the size of the training data. Based on Romanian, the first graph shows that as the size of the raw text for bootstrapping increases, F-measure performance increases roughly logrithmically, due almost exclusively to increases in precision. (Approximately the same number of unique entities are being identified, but due to the increased number of examples of each, their classification is more accurate). This is avery encouraging trend, as the web and other online sources provides virtually unlimited raw text in most major languages, and substantial on-line text for virtually all languages. So extrapolating far beyond the 10K word level is relatively low cost and very feasible.</Paragraph>
      <Paragraph position="1"> The second graph shows that F-measure performance also increases roughly logrithmically with the total length of the seed wordlists in the range 40300. This increase is due entirely to improved recall, which doubles over this small range. This trend sug2Note again that this baseline is more competitive than typical, as it not only assigns the majority tag (&amp;quot;last name&amp;quot;), but when there is an exact match with the training wordlist (e.g. &amp;quot;deepak&amp;quot;), a common occurrence given repeated high-frequency names in the Hindi data, the training classification is used as the baseline answer gests that there is considerable benefit to be gained by additional human annotation, or seed wordlist acquisition from existing online lexicons. However, relative to case of raw text acquisition, such additional annotations tend to be much costlier, and there is a clear cost-benefit tradeoff to further investment in annotation.</Paragraph>
      <Paragraph position="2"> In summary, however, these evaluation results are satisfying in that they (a) show clear and consistent trends across several diverse languages, (b) show clear trends for improvement as training resources grow, and (c) show that comparable (and robust) classification results can be achieved on this diversity of languages.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="96" end_page="98" type="metho">
    <SectionTitle>
5 Future work
</SectionTitle>
    <Paragraph position="0"> For future work, natural next steps include incorporating a language independent word segmentation phase like the one proposed by Amitay, Richmond and Smith (1997), to improve the performance on large texts. Different statistics can be pre-computed for different languages and language families and stored in external files. For example, the a priori probability of a named entity given the set of characteristics of its representation in the text, such as position, capitalization, and relative position of other entities (e.g.: first name followed by last name). A further step is the implementation of a supervised active learning system based on the present algorithm, in which the most relevant words for future disambiguation is presented to the user to be classified and the feedback used for bootstrapping. The selection of candidate examples for tagging would be based on both the unassigned probability mass and the frequency of occurrence. Active learning strategies (Lewis and Gale, 1994) are a natural path for efficiently selecting contexts for human annotation.</Paragraph>
  </Section>
class="xml-element"></Paper>