<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-2007">
  <Title>Language Independent NER using a Unified Model of Internal and Contextual Evidence</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. Contextual Information
</SectionTitle>
    <Paragraph position="0"> An entity's left and right context provides an essentially independent evidence source for model bootstrapping. This information is also important for entities that do not have a previously seen word structure, are of foreign origin, or polysemous. Rather than using word bigrams or trigrams, the system handles the context in the same way it handles the entities, allowing for variable-length contexts. The advantages of this unified approach are presented in the next paragraph.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. A Unified Structure for both Internal and
Contextual Information
</SectionTitle>
    <Paragraph position="0"> Character-based tries provide an effective, efficient and flexible data structure for storing both contextual and morphological patterns and statistics.</Paragraph>
    <Paragraph position="1"> ... organizada por la Concejalia de Cultura , tienen un ...</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
PREFIX RIGHT CONTEXTLEFT CONTEXT
SUFFIX
</SectionTitle>
    <Paragraph position="0"> the way the information is introduced in the four tries (arrows indicate the direction letters are considered) They are very compact representations and support a natural hierarchical smoothing procedure for distributional class statistics. In our implementation, each terminal or branching node contains a probability distribution which encodes the conditional probability of entity classes given the sistring corresponding to the path from the root to that node.</Paragraph>
    <Paragraph position="1"> Each such distribution also has two standard classes, named &amp;quot;questionable&amp;quot; (unassigned probability mass in terms of entity classes, to be motivated below) and &amp;quot;non-entity&amp;quot; (common words).</Paragraph>
    <Paragraph position="2"> Two tries (denoted PT and ST) are used for internal representation of the entity candidates in prefix, respectively suffix form, respectively. Other two tries are used for left (LCT) and right (RCT) context. Right contexts are introduced in RCT by considering their component letters from left to right, left contexts are introduced in LCT using the reversed order of letters, from right to left (Figure 1).</Paragraph>
    <Paragraph position="3"> In this way, the system handles variable length contexts and it attempts to match in each instance the longest known context (as longer contexts are more reliable than short contexts, and also the longer context statistics incorporate the shorter context statistics through smoothing along the paths in the tries). The tries are linked together into two bipartite structures, PT with LCT, and ST with RCT, by attaching to each node a list of links to the entity candidates or contexts with, respectively in which the sistring corresponding to that node has been seen in the text (Figure 2).</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5. Unassigned Probability Mass
</SectionTitle>
    <Paragraph position="0"> When faced with a highly skewed observed class distribution for which there is little confidence due to small sample size, a typical response is to back-off or smooth to the more general class distribution. Unfortunately, this representation makes problematic the distinction between a back-off conditional distribution and one based on a large sample (and hence estimated with confidence). We address this problem by explicitly representing the uncertainty as a class, called &amp;quot;questionable&amp;quot;. Probability mass continues to be distributed among the primary entity classes proportional to the observed distribution in the data, but with a total sum that reflects  Right Context Trie for the entity candidate Austria and some of its right contexts as observed in the corpus (&lt; , Holanda &gt;, &lt; , hizo &gt;, &lt; a Chirac &gt;) the confidence in the distribution and is equal to a0a2a1a4a3a6a5a8a7a10a9a12a11a14a13a16a15a18a17a20a19a14a21a23a22a25a24 . Incremental learning essentially becomes the process of gradually shifting probability mass from questionable to one of the primary classes.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6. Smoothing
</SectionTitle>
    <Paragraph position="0"> The probability of an entity candidate or context as being or indicating a certain type of entity is computed along the path from the root to the node in the trie structure described above. In this way, effective smoothing can be realized for rare entities or contexts. A smoothing formula taking advantage of the distributional representation of uncertainty is presented below.</Paragraph>
    <Paragraph position="1"> For a sistring a26a16a27a25a26a29a28a18a30a31a30a31a30a32a26 a5 (i.e. the path in the trie is</Paragraph>
    <Paragraph position="3"> a5 ) the general smoothing model for the conditional class probabilities is given by the recursive formula:</Paragraph>
    <Paragraph position="5"> where a61 is a normalization factor and a52 a62 a63 a64a49a65 a0a67a66 a65a6a68a70a69 a0 are model parameters. 7. One Sense per Discourse Clearly, in many cases, the context for only one instance of an entity and the word-internal information is not enough to make a classification decision. But, as noted by Katz (1996), a newly introduced entity will be repeated, &amp;quot;if not for breaking the monotonous effect of pronoun use, then for</Paragraph>
    <Paragraph position="7"> with the diameter representing the confidence of the classification of that instance using word-internal and local contextual information.</Paragraph>
    <Paragraph position="8"> emphasis and clarity&amp;quot;. We use this property in conjunction with the one sense per discourse tendency noted by Gale et al. (1992). The later paradigm is not directly usable when analyzing a large corpus in which there are no document boundaries, like the one provided for Spanish. Therefore, a segmentation process needs to be employed, so that all the instances of a name in a segment have a high probability of belonging to the same class. Our approach is to consider a 'soft' segmentation, which is word-dependent and does not compute topic/document boundaries but regions for which the contextual information for all instances of a word can be used jointly when making a decision. This is viewed as an alternative to the classical topic segmentation approach and can be used in conjunction with a language-independent segmentation system (Figure 3) like the one presented by Richmond et al. (1997).</Paragraph>
    <Paragraph position="9"> After estimating the class probability distributions for all instances of entity candidates in the corpus, a re-estimation step is employed. The probability of an entity class a79a39a80 given an entity candidate a81 at position a82a20a83a85a84a23a86 is re-computed using the formula:  are the positions of all instances of a81 in the corpus, a84a53a115a60a116 is the positional similarity, encoding the physical distance and topic (if topic or document boundary information exists), conf is the classification confidence of each instance (inverse proportional to the the  an automaton with two final states is to consider a chunking algorithm that identifies entity candidates and classify each of the chunks as Person, Location, Organization, Miscellaneous, or Non-entity. We use this second alternative, but in a 'soft' form; i.e. each word can be included in multiple competing chunks (entity candidates). This approach is suitable for all languages including Chinese, where no word separators are used (the entity candidates are determined by specifying starting and ending character positions). Another advantage of this method is that single and multiple-word entities can be handled in the same way.</Paragraph>
    <Paragraph position="10"> The boundaries of entity candidates are determined by a few simple rules incorporated into three discriminators: is_B_candidate tests if a word can represent the beginning of an entity, is_I_candidate tests if a word can be the end of an entity, and is_E_candidate tests if a word can be an internal part of an entity. These discriminators use simple heuristics based on capitalization, position in sentence, length of the word, usage of the word in the set of seed entities, and co-occurrence with uncapitalized instances of the same word. A string is considered an entity candidate if it has the structure shown in Figure 4.</Paragraph>
    <Paragraph position="11"> An extension of the system also makes use of Part-of-Speech (POS) tags. We used the provided POS annotation in Dutch (Daelemans et al., 1996) and a minimally supervised tagger (Yarowsky and Cucerzan, 2002) for Spanish to restrict the space of words accepted by the discriminators (e.g.</Paragraph>
    <Paragraph position="12"> is_B_candidate rejects prepositions, conjunctions, pronouns, adverbs, and those determiners that are the first word in the sentence).</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
9. Algorithm Structure
</SectionTitle>
    <Paragraph position="0"> The core algorithm can be divided into eight stages, which are summarized in Figure 5. The bootstrapping stage (5) uses the initial or current entity assignments to estimate the class conditional distributions for both entities and contexts along their trie paths, and then re-estimates the distributions of the contexts/entity-candidates to which they are linked, recursively, until all accessible nodes are reached, as presented in Cucerzan and Yarowsky (1999).</Paragraph>
  </Section>
class="xml-element"></Paper>