<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1658"> <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics Entity Annotation based on Inverse Index Operations</Title> <Section position="5" start_page="495" end_page="497" type="metho"> <SectionTitle> 4 Inverse Index-based Annotation using </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="495" end_page="496" type="sub_section"> <SectionTitle> Regular Expressions </SectionTitle> <Paragraph position="0"> A DFA corresponding to a given regular expression can be used for annotation, using the inverse index approach described in Section 3.3. However, the NFA-to-DFA conversion step may result in a DFA with a very large number of states. We develop an alternative algorithm that translates the original regular expression directly into an ordered AND/OR graph. Associated with each node in the graph is a regular expression and a postings list that points to all the matches for the node's regular expression in the document collection. There are two node types: AND nodes, where the output list is computed as the consint of the postings lists of the node's two children, and OR nodes, where the output list is formed by merging the postings lists of all the children. Additionally, each node has two binary properties: isOpt and selfLoop. The first property is set if the regular expression being matched is of the form 'R?', where '?' denotes that the regular expression R is optional. The second property is set if the regular expression is of the form 'R+', where '+' is the Kleene operator denoting one or more occurrences.</Paragraph> <Paragraph position="1"> For the case of 'R*', both properties are set.</Paragraph> <Paragraph position="2"> The AND/OR graph is built recursively by scanning the regular expression from left to right and identifying every sub-regular expression for which a sub-graph can be built. 
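As an illustration of the two node operations, a minimal sketch in Java (the paper's implementation language) is given below. The Span type and method signatures are our own and not the authors' code; postings entries are modeled as token spans within a single document.

```java
import java.util.*;

// Sketch (names hypothetical): a postings entry is an inclusive token span
// [begin, end]. "merge" realizes an OR node; "consint" (consecutive
// intersection) realizes an AND node.
public class PostingsOps {
    public record Span(int begin, int end) {}

    // OR node: union of the children's postings lists, kept sorted by position.
    public static List<Span> merge(List<Span> a, List<Span> b) {
        List<Span> out = new ArrayList<>(a);
        out.addAll(b);
        out.sort(Comparator.comparingInt(Span::begin).thenComparingInt(Span::end));
        return out;
    }

    // AND node: keep a span from `a` immediately followed by a span from `b`,
    // and emit the concatenated span.
    public static List<Span> consint(List<Span> a, List<Span> b) {
        List<Span> out = new ArrayList<>();
        for (Span x : a)
            for (Span y : b)
                if (y.begin() == x.end() + 1)   // y starts right after x ends
                    out.add(new Span(x.begin(), y.end()));
        return out;
    }
}
```

For example, an AND node for the expression 'ab' would consint the postings of 'a' and 'b', while an OR node for 'a|b' would merge them.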
We use capital letters R, X to denote regular expressions and small letters a, b, c, etc., to denote terminal symbols in the symbol set Σ. Figure 5 details the algorithm used to build the AND/OR graph. Effectively, the AND/OR graph decomposes the computation of the postings list for R into an ordered set of merge and consint operations, such that the output L(v) for a node v becomes the input to its parents. The graph specifies the ordering, and by evaluating all the nodes in dependency order, the root node ends up with a postings list that corresponds to the desired regular expression.</Paragraph> <Paragraph position="3"> [Figure 5: pseudocode for constructing the AND/OR graph, beginning 'if R is empty then ...'; the full listing is not reproduced here.]</Paragraph> <Paragraph position="5"/> </Section> <Section position="2" start_page="496" end_page="497" type="sub_section"> <SectionTitle> 4.1 Handling '?' and Kleene Operators </SectionTitle> <Paragraph position="0"> The isOpt and selfLoop properties of a node are set if the corresponding regular expression is of the form R?, R+ or R*. To handle the R? case we associate a new property isOpt with the output list L(v) from node v, such that L(v).isOpt = 1 if v.isOpt = 1. We also define two operations, consint_ε (Figure 7) and merge_ε, which account for the isOpt property of their argument lists. For consint_ε, the generated list has its isOpt set to 1 if and only if both the argument lists have their isOpt property set to 1. The merge_ε operation is the same as merge, except that the resultant list has isOpt set to 1 if any of its argument lists has isOpt set to 1. The worst-case time taken by consint_ε is bounded by one consint and two merge operations.</Paragraph> <Paragraph position="1"> To handle the R+ case, we define a new operator consint(L,+) which returns a postings list L′, such that each entry in the returned list points to a token sequence consisting of k ∈ [1,∞) consecutive subsequences s_1, s_2, ..., s_k, each s_i, 1 ≤ i ≤ k, being an entry in L. 
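One way this R+ closure could be computed is sketched below (Java; the Span type and plusClosure name are illustrative, not the paper's code): every run of entries of L in which each entry starts right after the previous one ends is expanded into the concatenated spans of its consecutive sub-runs.

```java
import java.util.*;

// Illustrative sketch (not the authors' implementation) of consint(L,+):
// from a position-sorted postings list L, emit one span for every sequence
// of one or more adjacent entries.
public class PlusClosure {
    public record Span(int begin, int end) {}

    public static List<Span> plusClosure(List<Span> l) {
        List<Span> out = new ArrayList<>();
        for (int i = 0; i < l.size(); i++) {
            int begin = l.get(i).begin();
            int end = l.get(i).end();
            out.add(new Span(begin, end));   // k = 1: the entry itself
            // Extend while the next entry is adjacent (k = 2, 3, ...).
            for (int j = i + 1; j < l.size() && l.get(j).begin() == end + 1; j++) {
                end = l.get(j).end();
                out.add(new Span(begin, end));
            }
        }
        return out;
    }
}
```

The work done is proportional to the number of spans emitted, i.e. linear in the size of the output list, consistent with the complexity claim in the text.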
A simple linear pass through L is sufficient to obtain consint(L,+). The time complexity of this operation is linear in the size of L′. The isOpt property of the result list L′ is set to the same value as that of its argument list L.</Paragraph> <Paragraph position="2"> Figure 6 shows an example regular expression and its corresponding AND/OR graph; AND nodes are shown as circles, whereas OR nodes are shown as square boxes. Nodes having the isOpt and selfLoop properties are labeled with +, * or ?.</Paragraph> <Paragraph position="3"> Any AND/OR graph thus constructed is acyclic.</Paragraph> <Paragraph position="4"> The edges in the graph represent dependencies between computing nodes. The main regular expression is at the root node of the graph. The leaf nodes correspond to symbols in Σ. Figure 8 outlines the algorithm for computing the postings list of a regular expression by operating bottom-up on the AND/OR graph.</Paragraph> <Paragraph position="6"> [Figure 8: pseudocode that, for each node v in the reverse topological sorting of G_R, computes L(v) from its children (for an AND node v, from its children v1 and v2 via consint); the full listing is not reproduced here.]</Paragraph> <Paragraph position="8"/> </Section> </Section> <Section position="6" start_page="497" end_page="498" type="metho"> <SectionTitle> 5 Experiments and Results </SectionTitle> <Paragraph position="0"> In this section, we present an empirical comparison of the performance of the index-based annotation technique (Section 4) against annotation based on the 'document paradigm' using GATE. The experiments were performed on two data sets, viz., (i) the Enron email data set2 and (ii) a combination of the Reuters-21578 data set3 and the 20 Newsgroups data set4. After cleaning, the former data set was 2.3 GB and the latter 93 MB in size. Our code is entirely in Java. The experiments were performed on a dual 3.2 GHz Xeon server with 4 GB RAM. The code for creation of the index was custom-built in Java. 
Prior to indexing, sentence segmentation and tokenization of each data set were performed using in-house Java tools. GATE's JAPE grammars perform finite state transduction over annotations based on regular expressions. The JAPE grammar requires information from two main resources: (i) a tokenizer and (ii) a gazetteer.</Paragraph> <Paragraph position="1"> (1) Tokenizer: The tokenizer splits the text into very simple tokens such as numbers, punctuation and words of different types. For example, one might distinguish between words in uppercase and lowercase, and between certain types of punctuation. Although the tokenizer is capable of much deeper analysis than this, the aim is to limit its work to maximise efficiency, and enable greater flexibility by placing the burden on the grammar rules, which are more adaptable. A rule has a left hand side (LHS) and a right hand side (RHS). The LHS is a regular expression which has to be matched on the input; the RHS describes the annotations to be added to the Annotation Set. The LHS is separated from the RHS by '>'. The following four operators can be used on the LHS: '|', '?', '*' and '+'. The RHS uses ';' as a separator between statements that set the values of the different attributes. The following tokenizer rule identifies each character sequence that begins with a letter in upper case and is followed by zero or more letters in lower case: &quot;UPPERCASELETTER&quot; &quot;LOWERCASELETTER&quot;* > Token; orth=upperInitial; kind=word; Each such character sequence will be annotated as type &quot;Token&quot;. The attribute &quot;orth&quot; (orthography) has the value &quot;upperInitial&quot;; the attribute &quot;kind&quot; has the value &quot;word&quot;.</Paragraph> <Paragraph position="2"> (2) Gazetteer: The gazetteer lists used are plain text files, with one entry per line. Each list represents a set of names, such as names of cities, organizations, days of the week, etc. 
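The tokenizer rule quoted above can be mimicked in plain Java (a sketch only; the class name and the 'other' fallback are ours, not GATE's): a token is classified as orth=upperInitial when it consists of one upper-case letter followed by zero or more lower-case letters.

```java
import java.util.regex.*;

// Sketch of the quoted tokenizer rule: full-match of one upper-case letter
// followed by zero or more lower-case letters yields orth=upperInitial.
public class UpperInitialRule {
    private static final Pattern UPPER_INITIAL = Pattern.compile("\\p{Lu}\\p{Ll}*");

    public static String orth(String token) {
        return UPPER_INITIAL.matcher(token).matches() ? "upperInitial" : "other";
    }
}
```

Thus "Sydney" is upperInitial, while "GATE" is not, since its trailing letters are not lower case.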
An index file is used to access these lists; for each list, a major type is specified and, optionally, a minor type. These lists are compiled into finite state machines. Any text tokens that are matched by these machines will be annotated with features specifying the major and minor types. JAPE grammar rules then specify the types to be identified in particular circumstances.</Paragraph> <Paragraph position="3"> The JAPE Rule: Each JAPE rule has two parts, separated by &quot;->&quot;. The LHS consists of an annotation pattern to be matched; the RHS describes the annotation to be assigned. (1) Left hand side: A basic rule is described in terms of the annotations already assigned by the tokenizer and gazetteer. The annotation pattern may contain regular expression operators (e.g. *, ?, +). There are three main ways in which the pattern can be specified: 1. value: specify a string of text, e.g.</Paragraph> <Paragraph position="4"> {Token.string == &quot;of&quot;} 2. attribute: specify the attributes (and values) of a token (or any other annotation), e.g.</Paragraph> <Paragraph position="5"> {Token.kind == number} 3. annotation: specify an annotation type from the gazetteer, e.g. {Lookup.minorType == month} (2) Right hand side: The RHS consists of details of the annotations and optional features to be created. Annotations matched on the LHS of a rule may be referred to on the RHS by means of labels that are attached to pattern elements. Finally, attributes and their corresponding values are added to the annotation. An example of a complete rule says: 'match sequences of numbers followed by a unit; create a Name annotation across the span of the numbers, and a rule attribute with value NumbersAndUnit'. Use of context: Context can be dealt with in the grammar rules in the following way. The pattern to be annotated is always enclosed by a set of round brackets. If preceding context is to be included in the rule, this is placed before this set of brackets. 
This context is described in exactly the same way as the pattern to be matched. If context following the pattern needs to be included, it is placed after the label given to the annotation. Context is used where a pattern should only be recognised if it occurs in a certain situation, but the context itself does not form part of the pattern to be annotated. For example, the following rule for 'email-id's (assuming an appropriate regular expression for &quot;EMAIL-ADD&quot;) would mean that an email address would only be recognised if it occurred inside angled brackets (which would not themselves form part of the entity):</Paragraph> </Section> </Paper>