<?xml version="1.0" standalone="yes"?> <Paper uid="I05-2028"> <Title>Modelling of a Gazetteer Look-up Component</Title> <Section position="4" start_page="161" end_page="163" type="metho"> <SectionTitle> 3 Modeling of a gazetteer </SectionTitle> <Paragraph position="0"> Raw gazetteers are usually represented by a text file, where each line represents a single entry and has the following format: keyword (attribute:value)+. For each reading of an ambiguous keyword, a separate line is introduced, e.g., for the word Washington the following entries are introduced:</Paragraph> <Paragraph position="2"> tion can be accompanied by the number of different routes to any final state outgoing from the same state as the current transition, whose labels are lexicographically lower than the current one. Consequently, computing I(w) for w would consist solely of summing over the integers associated with traversed transitions, whereas memory requirements rise to 30% (5; 2). We differentiate between open-class and closed-class attributes depending on their range of values, e.g., full-name is an open-class attribute, whereas gender is a closed-class attribute. As can be seen in the last reading for Washington, an attribute may be assigned a list of values.</Paragraph> <Section position="1" start_page="161" end_page="162" type="sub_section"> <SectionTitle> 3.1 Standard Approach </SectionTitle> <Paragraph position="0"> The standard approach to implementing dictionaries presented in (5; 2) can be straightforwardly adapted to model the architecture of a gazetteer.</Paragraph> <Paragraph position="1"> The main idea is to encode the keywords and all attribute values in a single numbered MADFSA.</Paragraph> <Paragraph position="2"> In order to distinguish between keywords and different attribute values, we extend the indexing automaton so that it has n+1 initial states, where n is the number of attributes. 
The right language of the first initial state corresponds to the set of keywords, whereas the right language of the i-th initial state for i , 1 corresponds to the range of values appropriate for the i-th attribute. The subautomaton starting in each initial state implements a different perfect hashing function. Hence, the aforesaid automaton constitutes a word-to-index and index-to-word engine for keywords and attribute values. Once we know the index of a given keyword, we can access the indices of all associated attribute values in a row of an auxiliary table. Consequently, these indices can be used to extract the proper values from the indexing automaton.</Paragraph> <Paragraph position="3"> In the case of multiple readings, an intermediate array for mapping the keyword indices to the absolute position of the block containing all readings is indispensable. The overall architecture is sketched in figure 2. Because multiple initial states are introduced, log2(card(i)) bits are sufficient for representing the indices for values of attribute i, where card(i) is the size of the corresponding value set.</Paragraph> <Paragraph position="4"> It is not necessarily convenient to store the proper values of all attributes in the numbered automaton, e.g., numerical or alphanumerical data could be stored directly in the attribute-value matrix or elsewhere (cf. figure 2) if the range of the values is bounded and integer representation is more compact than anything else. Fortunately, the vast majority (but definitely not all) of attribute values in gazetteers deployed in NLP happen to be natural language expressions. Therefore, we can expect the major part of the entries and attribute values to share suffixes, which leads to a better compression rate. The main bottleneck of the presented approach is a potentially high redundancy of the information stored in the attribute-value matrix. 
However, this problem can be partially alleviated via automatic detection of column dependency, which might expose sources of information redundancy. Recurring patterns consisting of raw fragments could be indexed and represented only once.</Paragraph> </Section> <Section position="2" start_page="162" end_page="163" type="sub_section"> <SectionTitle> 3.2 Pure Finite-State Representation </SectionTitle> <Paragraph position="0"> One of the common techniques for squeezing automata in the context of implementing dictionaries is an appropriate coding of the input data. Converting a list of strings into a MADFSA usually results in a good compression rate since many words share prefixes and suffixes, which leads to transition sharing. If strings are associated with additional annotations representing certain categories, e.g., part-of-speech, inflection or stem information in a morphological lexicon, then an adequate encoding is necessary in order to keep the corresponding automaton small. A simple solution is to reorder categories from the most specific to the most general ones, so that stem information would precede inflection and part-of-speech tag. Alternatively, we could precompute all possible annotation sequences and replace them with some index. However, the major part of a string that encodes the keyword and its tags might be unique and could potentially blow up the corresponding automaton enormously. Consider again the entry for the morphological lexicon consisting of an inflected word form and its tags, e.g., striking:strike:v:a:p (v - verb, a - present, p - participle). Obviously, the sequence striking:strike is unique. By exploiting the word-specific information that the inflected form and its base form share, one can introduce patterns (6) describing how the lexeme can be reconstructed from the inflected word form, e.g., 3+e - delete three terminal characters and append an e (striking → 
strik + e), which would result in better suffix sharing, i.e., the suffix 3+e:v:a:p is more frequently shared than strike:v:a:p.</Paragraph> <Paragraph position="1"> The main idea behind transforming a gazetteer into a single automaton is to split each gazetteer entry into a disjunction of subentries, each representing some partial information. For each open-class attribute-value pair present in the entry a single subentry is created, whereas closed-class attribute-value pairs are merged into a single subentry and rearranged in order to fulfill the "most specific first, most general last" criterion. In our example, the entry for the word Washington (city) yields the following subentries: where NAM maps attribute names to single unique characters not appearing elsewhere in the original gazetteer and VAL denotes a mapping which converts the values of the closed-class attributes into single characters which represent these values. The string #1, where # is again a unique symbol, denotes the reading index of the entry (first reading). In the case of list-valued open-class attributes we can simply add an appropriate subentry for each element in the list. Gazetteer resources converted in this manner are subsequently compiled into a MADFSA. In order to obtain a better compression rate, we utilized formation patterns for a subset of attribute values appearing in the gazetteer entries. These patterns resemble the ones for encoding morphological information, but they partially rely on other information. For instance, attribute values are frequently just the capitalized form of the corresponding keywords, as can be seen in our example. Such a pattern can be represented by a single character. Further, keywords and attribute values often share prefixes or suffixes, e.g., Washington vs. 
Washington D.C.</Paragraph> <Paragraph position="2"> Next, there are clearly several patterns for forming acronyms from the full form, e.g., US can be derived from United States by concatenating all capitals in the full name. Nevertheless, some of the attribute values cannot be replaced by patterns. Applying formation patterns to our sample entry would result in: where PAT maps pattern names to unique characters. Some space savings may be obtained by reversing the attribute values not covered by any pattern, since prefix compression might turn out to be superior to suffix compression.</Paragraph> <Paragraph position="3"> The outlined method of representing a gazetteer is an elegant solution and exhibits three major assets: (a) no external storage for attribute values is needed, (b) the automaton involved is not numbered, which means lower space requirements and reduced searching time in comparison to the approach in 3.1, and (c) as a consequence of the encoding strategy, there is only one single final state in the automaton.3 On the other hand, the information stored in the gazetteers and the fashion in which the automaton is built intuitively do not allow for obtaining the same compression rates as in the case of the automaton in 3.1. For instance, many entries are multiword expressions, which increase the size of the automaton by introducing numerous sequential paths. In order to alleviate this problem, we applied transition jamming.</Paragraph> <Paragraph position="4"> 3 The states having outgoing transitions labeled with the unique symbols in the range of NAM are implicit final states. 
The right languages of these states represent attribute-value pairs attached to the gazetteer entries.</Paragraph> </Section> <Section position="3" start_page="163" end_page="163" type="sub_section"> <SectionTitle> 3.3 Transition Jamming </SectionTitle> <Paragraph position="0"> Transition jamming is an equivalence operation on automata in which transitions on sequential paths are transformed into a single transition labeled with the label of the whole path (3). Intermediate states on the path are removed. The jammed automaton still accepts the same language. We have applied transition jamming in a somewhat different way. Let π be a sequential path in the automaton and a = a0 ... ak be the label of π. We remove all transitions of π and introduce a new transition from f(π) to l(π) labeled with a0, i.e., δ(f(π), a0) = l(π), and store the remaining character sequence a1 ... ak in a list of sequential path labels. Once all such labels are collected, we introduce a new initial state in the automaton and, consecutively starting from this state, we add all these labels to the minimized automaton while maintaining its property of being minimal (4). The subautomaton starting from the new initial state implements a perfect hashing function. Finally, the new 'jammed' transitions are associated with the corresponding indices in order to reconstruct the full label on demand. There are several ways of selecting sequential paths for jamming. Maximum-length sequential paths constitute the first choice. Jamming paths of bounded length might yield better or at least different results. For instance, a sequential path whose label is a long fragment of a multiword expression could be decomposed into subpaths that either do not include whitespaces or consist solely of whitespaces. 
In turn, we could jam only the subpaths of the first type.</Paragraph> <Paragraph position="1"> Storing sequential path labels in a new branch of the automaton obviously leads to the introduction of new sequential paths. Therefore, we have investigated the impact of repetitive transition jamming on the size of the automaton. In each phase of repetitive jamming, a new initial state is introduced from which the labels of the jammed paths identified in this phase are stored.</Paragraph> </Section> </Section> </Paper>