<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1658">
  <Title>Entity Annotation based on Inverse Index Operations</Title>
  <Section position="4" start_page="492" end_page="495" type="intro">
    <SectionTitle>
2 Overview
</SectionTitle>
    <Paragraph position="0"> Figure 1 shows the process for entity annotation presented in the paper. A given document collection D is tokenized and segmented into sentences.</Paragraph>
    <Paragraph position="1"> The tokens are stored in an inverse index I. The inverse index I has an ordered list U of the unique tokens u1, u2, ..uW that occur in the collection, where W is the number of tokens in I. Additionally, for each unique token ui, I has a postings list L(ui) =&lt; l1,l2,...lcnt(ui) &gt; of locations in D at which ui occurs. cnt(ui) is the length of L(ui). Each entry lk, in the postings list L(ui), has three fields: (1) a sentence identifier, lk.sid, (2) the begin position of the particular occurrence of ui, lk.first and (3) the end position of the same occurrence of ui, lk.last.</Paragraph>
    <Paragraph position="2"> We require the input grammar to be the same as that used for named entity annotations in GATE (Cunningham et al., 2002). The GATE architecture for text engineering uses the Java Annotations Pattern Engine (JAPE) (Cunningham, 1999) for its information extraction task. JAPE is a pattern matching language. We support two classes of properties for tokens that are required by grammars such as JAPE: (1) orthographic properties such as an uppercase character followed by lower case characters, and (2) gazetteer (dictionary) containment properties of tokens and token sequences such as 'location' and 'person name'. The set of tokens along with entity types specified by either of these two properties are referred to as Basic Entities. The instances of basic entities specified by orthographic properties must be single tokens.</Paragraph>
    <Paragraph position="3"> However, instances of basic entities specified using gazetteer containment properties can be token sequences.</Paragraph>
    <Paragraph position="4"> The module (1) of our system shown in Figure 1, identifies postings lists for each basic entity type. These postings lists are entered as index entries in I for the corresponding types. For example, if the input rules require tokens/token sequences that satisfy Capsword or Location Dictionary properties, a postings list is created for each of these basic types. Constructing the postings list for a basic entity type with some orthographic property is a fairly straightforward task; the postings lists of tokens satisfying the orthographic properties are merged (while retaining the sorted order of each postings list). The mechanism for generating the postings list of basic entities with gazetteer properties will be developed in the following sections. A rule for NE annotation may require a token to satisfy multiple properties such as Location Dictionary as well as Capsword. The posting list for tokens that satisfy multiple properties are determined by performing an operation parallelint(L,Lprime) over the posting lists of the corresponding basic entities. The parallelint(L,Lprime) operation returns a posting list such that each entry in the returned list occurs in both L as well as Lprime. The module (2) of our system shown in Figure 1 identifies instances of each annotation type, by performing index-based operations on the postings lists of basic entity types and other tokens.</Paragraph>
    <Section position="1" start_page="492" end_page="493" type="sub_section">
      <SectionTitle>
3 Annotation using Cascading Regular
Expressions
</SectionTitle>
      <Paragraph position="0"> Regular expressions over basic entities have been extensively used for NE annotations. The Common Pattern Specification Language (CSPL)1 specifies a standard for describing Annotators that can be implemented by a series of cascading regular expression matches.</Paragraph>
      <Paragraph position="1"> Consider a regular expression R over an alphabet S of basic entities, and a token sequence</Paragraph>
      <Paragraph position="3"> at determining all matches of regular expression R in the token sequence T. Additionally, NE annotations do not span multiple sentences. We will therefore assume that the length of any annotated token sequence is bounded by [?], where [?] can be the maximum sentence length in the document collection of interest. In practice, [?] can be even smaller.</Paragraph>
    </Section>
    <Section position="2" start_page="493" end_page="493" type="sub_section">
      <SectionTitle>
3.1 Computing Annotations using a DFA
</SectionTitle>
      <Paragraph position="0"> Given a regular expression R, we can convert it into a deterministic finite automate (DFA) DR. A DFA is a finite state machine, where for each pair of state and input symbol, there is one and only one transition to a next state. DR starts processing of an input sequence from a start state sR, and for each input symbol, it makes a transition to a state given by a transition function PhR. Whenever DR lands in an accept state, the symbol sequence till that point is accepted by DR. For simplicity of the document and index algorithms, we will ignore document and sentence boundaries in the following analysis.</Paragraph>
      <Paragraph position="1"> Let @ti,i+[?],1 [?] i [?] W [?][?] be a subsequence of T of length [?]. On a given input @ti,i+[?], DR will determine all token sequences originating at ti that are accepted by the regular expression grammar specified through DR. Figure 2 outlines the algorithm findAnnotations that locates all token sequences in T that are accepted by DR.</Paragraph>
      <Paragraph position="2"> Let DR have {S1,...,SN} states. We assume that the states have been topologically ordered so that S1 is the start state. Let a be the time taken to consume a single token and advance the DFA to the next state (this is typically implemented as a table or hash look-up). The time taken by the alfindAnnotations(T,DR) null Let T = {t1,...,tW} for i = 1 to W [?][?] do let @ti,i+[?] be a subsequence of length [?] starting from ti in T use DR to annotate @ti,i+[?] end for  rences of R in a token sequence T.</Paragraph>
      <Paragraph position="3"> gorithm findAnnotations can be obtained by summing up the number of times each state is visited as the input tokens are consumed. Clearly, the state S1 is visited W times, W being the total number of symbols in the token sequence T. Let cnt(Si) give the total number of times the state Si has been visited. The complexity of this method is:</Paragraph>
      <Paragraph position="5"/>
    </Section>
    <Section position="3" start_page="493" end_page="494" type="sub_section">
      <SectionTitle>
3.2 Computing Regular Expression Matches
using Index
</SectionTitle>
      <Paragraph position="0"> In this section, we present a new approach for finding all matches of a regular expression R in a token sequence T, based on the inverse index I of T.</Paragraph>
      <Paragraph position="1"> The structure of the inverse index was presented in Section 2. We define two operations on postings lists which find use in our annotation algorithm.</Paragraph>
      <Paragraph position="2">  1. merge(L,Lprime): Returns a postings list such that each entry in the returned list occurs either in L or Lprime or both. This operation takes O(|L|+|Lprime|) time.</Paragraph>
      <Paragraph position="3"> 2. consint(L,Lprime): Returns a postings list such that each entry in the returned list points to a token sequence which consists of two consecutive  subsequences @sa and @sb within the same sentence, such that, L has an entry for @sa and Lprime has an entry for @sb. There are several methods for computing this depending on the relative size of L and Lprime. If they are roughly equal in size, a simple linear pass through L and Lprime, analogous to a merge, can be performed. If there is a significant difference in sizes, a more efficient modified binary search algorithm can be implemented. The details are shown in Figure 3. The consint(L,Lprime) Let M elements of L be l1 ***lM Let N elements of L' be lprime1 ***lN if M &lt; N then set j = 1 for i = 1 to M do set k = 1, keep doubling k until lprimej.first [?] li.last &lt; lprimej+k.first binary search the Lprime in the interval j***k to determine the value of p such that  lprimep.first [?] li.last &lt; lprimep+1.first if lprimep.first = li.last a match exists, copy to output set j = p+ 1 end for else Same as above except l and lprime are reversed end if  for consint complexity of this algorithm is determined by the size qi of the interval required to satisfy lprimej.first [?] li.last &lt; lprimej+qi.first (assuming |L |&lt; |Lprime|). It will take an average of log2(qi) operations to determine the size of interval and log2(qi) operations to perform the binary search, giving a total of 2log2(qi). Let q1***qM be the sequence of intervals. Since the intervals will be at most two times larger than the actual interval between the nearest matches in Lprime to L, we can see that |Lprime |[?] summationtextMi=1 qi [?] 2 [?] |Lprime|. Hence the worst case will be reached when qi = 2|Lprime|/|L |with a time complexity given by 2|L|(log2(|Lprime|/|L|) + 1), assuming |L |&lt; |Lprime|.</Paragraph>
      <Paragraph position="4"> To support annotation of a token sequence that matches a regular expression only in the context of some regular expression match on its left and/or right, we implement simple extensions to the consint(L1,L2) operator. Details of the extensions are left out from this paper owing to space constraints.</Paragraph>
    </Section>
    <Section position="4" start_page="494" end_page="494" type="sub_section">
      <SectionTitle>
3.3 Implementing a DFA using the Inverse
Index
</SectionTitle>
      <Paragraph position="0"> In this section, we present a method that takes a DFA DR and an inverse index I of a token sequence T, to compute a postings list of subsequences of length at most [?], that match the regular expression R.</Paragraph>
      <Paragraph position="1"> Let the set S = {S1,...,SN} denote the set of states in DR, and let the states be topologically ordered with S1 as the start state. We associate an object lists,k with each state s [?] S and [?]1 [?] k [?] [?]. The object lists,k is a posting list of all token sequences of length exactly k that end in state s. The lists,k is initialized to be empty for all states and lengths. We iteratively compute lists,k for all the states using the algorithm given in Figure 4. The function dest(Si) returns a set of states, such that for each s [?] dest(Si), there is an arc from state Si to state s. The function label(Si,Sj) returns the token associated with the edge (Si,Sj).</Paragraph>
      <Paragraph position="2"> for k = 1 to [?] do for i = 1 to N do for s [?] dest(Si) do if i == 1 then</Paragraph>
      <Paragraph position="4"> all token sequences in T that match R.</Paragraph>
      <Paragraph position="5"> At the end of the algorithm, all token sequences corresponding to postings lists lists,i,s [?] S,1 [?] i [?] [?] are sequences that are matched by the regular expression R.</Paragraph>
    </Section>
    <Section position="5" start_page="494" end_page="495" type="sub_section">
      <SectionTitle>
3.4 Complexity Analysis for the Index-based
Approach
</SectionTitle>
      <Paragraph position="0"> The complexity analysis of the algorithm given in Figure 4 is based on the observation that,summationtext</Paragraph>
      <Paragraph position="2"> listSi,k contains an entry for all sequences that visit the state Si and are of length exactly k. Summing the length of these lists for a particular state Si across all the values of k will yield the total number of sequences of length at most [?] that visit the state Si.</Paragraph>
      <Paragraph position="3"> For the algorithm in Figure 3, the time taken by  one consint operation is given by 2b(|listSi,k |[?] (log(rijk) + 1)) where b is a constant that varies with the lower level implementation. rijk = |L(label(Si,Sj))| |listSi,k |is the ratio of the postings list size of the label associated with the arc from Si to Sj to the list size of Si at step k. Note that rijk [?] 1. Let prev(Si) be the list of predecessor states to Si. The time taken by all the merge operations for a state Si at step k is given by g(log(|prev(Si)|)|listSi,k|) Assuming all the merges are performed simultaneously, g(log(|prev(Si)|) is the time taken to create each entry in the final merged list, where g is a constant that varies with the lower level implementation. Note this scales as the log of the number of lists that are being merged.</Paragraph>
      <Paragraph position="4"> The total time taken by the algorithm given in</Paragraph>
      <Paragraph position="6"> (2) Note that in deriving Equation 2, we have ignored the cost of merging list(Sa,k) for k = 1***[?] for the accept states.</Paragraph>
    </Section>
    <Section position="6" start_page="495" end_page="495" type="sub_section">
      <SectionTitle>
3.5 Comparison of Complexities
</SectionTitle>
      <Paragraph position="0"> To simplify further analysis, we can replace cnt(Si) with fcnt(Si) where fcnt(Si) = cnt(Si)/W. If we assume that the token distribution statistics of the document collection remain constant as the number of documents increases, we can also assume that fcnt(Si) is invariant to W. Since rijk is given by a ratio of list sizes, we can also consider it to be invariant to W. We now assume a [?] b [?] g since these are implementation specific times for similar low level compute operations. With this assumptions from Equations 1 and 2, the ratio CD/CI can be approximated by:</Paragraph>
      <Paragraph position="2"> The overall ratio of CD to CI is invariant to W and depends on two key factors fcnt(Si) andsummationtext s[?]dest(Si) log( -ris). If fcnt(Si) lessmuch 1, the ratio will be large and the index-based approach will be much faster. However, if either fcnt(Si) starts approaching 1 or summationtexts[?]dest(Si) log( -ris) starts getting very large (caused by a large fan out from Si), the direct match using the DFA may be more efficient.</Paragraph>
      <Paragraph position="3"> Intuitively, this makes sense since the main benefit of the index is to eliminate unnecessary hash lookups for tokens do not match the arcs of the DFA. As fcnt(Si) approaches 1, this assumption breaks down and hence the inherent efficiency of the direct DFA approach, where only a single hash lookup is required per state regardless of the number of destination states, becomes the dominant factor.</Paragraph>
    </Section>
    <Section position="7" start_page="495" end_page="495" type="sub_section">
      <SectionTitle>
3.6 Comparison of Complexities for Simple
Dictionary DFA
</SectionTitle>
      <Paragraph position="0"> To illustrate the potential gains from the index-based annotation, consider a simple DFA DR with two states S1 and S2. Let the set of unique tokens A be {a,b,c***z}. Let E be the dictionary {a,e,i,o,u}. Let DR have five arcs from S1 to S2 one for each element in E. The DFA DR is a simple acceptor for the dictionary E, and if run over a token sequence T drawn from A, it will match any single token that is in E. For this simple case fcnt(S2) is just the fraction of tokens that occur in E and hence by definition fcnt(S2) [?] 1. Substituting into 3 we get</Paragraph>
      <Paragraph position="2"> As long as fcnt(S2) &lt; 0.27, this ratio will always be greater than 1.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML