<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2028">
  <Title>Modelling of a Gazetteer Look-up Component</Title>
  <Section position="5" start_page="163" end_page="165" type="evalu">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="163" end_page="164" type="sub_section">
      <SectionTitle>
4.1 Data
</SectionTitle>
      <Paragraph position="0"> We selected the following gazetteers for evaluation: (a) UK-Postal - city names in the UK associated with county and postal code, (b) LT-World - a gazetteer of players and events in the language technology community, (c) PL-NE - a gazetteer of MUC-type Polish named entities, (d) Mixed - a combination of (b) and (c), (e) GeoNames - an excerpt of the huge gazetteer of geographic names information covering geopolitical areas, including name variants, administrative divisions, different codes, etc. Table 1 gives an overview of our test data.</Paragraph>
    </Section>
    <Section position="2" start_page="164" end_page="165" type="sub_section">
      <SectionTitle>
4.2 Evaluation
</SectionTitle>
      <Paragraph position="0"> Several experiments with different set-ups were conducted. First, we compared the standard approach with the pure-FSA approach. Next, we repeated the experiments with single transition jamming integrated. The results are given in Table 2. The numbers in the columns concerning transition jamming correspond to jamming of maximum-length sequential paths and, in brackets, jamming of whitespace-free paths.</Paragraph>
      <Paragraph position="1"> The increase in physical storage in the case of numbered automata has been reported to be in the range of 30-40% (state numbering) and 60-70% (transition numbering) (1). Note at this point that automata are usually stored as a sequence of transitions, where states are represented only implicitly (7). Considering additionally the space requirement of the auxiliary table that the standard approach uses to store the indices of open-class attribute values, it turns out that this requirement oscillates around m · n · log_256(n) bytes, where m is the number of open-class attributes and n is the number of entries in the gazetteer. Summing up these observations and looking at Table 2, we conclude, without stating the absolute size of the physical storage required, that the pure-FSA approach is superior when applied to our test gazetteers. However, some results, in particular for GeoNames, where j-j is about three times as big as in the automaton of the standard approach, indicate some pitfalls.</Paragraph>
      <Paragraph position="2"> This is mainly due to the fact that some open-class attributes in GeoNames are alphanumeric strings which do not compress well with the rest. Secondly, some investigation reveals the need for additional formation patterns, which could work better with this particular gazetteer. Finally, the GeoNames gazetteer exhibits a highly multilingual character, i.e., the size of the alphabet is larger.</Paragraph>
      <Paragraph position="3"> As expected, transition jamming works better with the pure-FSA approach, i.e., it reduces the size of j-j by a factor of 1.35 to 1.9, whereas with the standard approach the gain is less significant.</Paragraph>
      <Paragraph position="4"> Transition jamming constrained to whitespace-free paths yielded better compression rates, in particular for gazetteers without numerical data (see Table 2). Obviously, transition jamming is penalized by the introduction of state numbering in some part of the automaton and by the indexing of certain edges, but the overall size of the automaton is still smaller than that of the original one. In the case of the LT-World gazetteer, there were circa 20 000 sequential paths in the automaton. Consequently, we removed circa 134 000 transitions.</Paragraph>
      <Paragraph position="5"> Next, we studied the profitability of repetitive transition jamming. Figure 3 presents two diagrams which depict how this operation impacts the size of the automaton for the LT-World gazetteer (in Figure 3, Pure-FSA-B stands for repetitive jamming on whitespace-free paths). As can be observed, repetitive jamming of more than two stages does not significantly improve the compression rate. Interestingly, we can observe in the left diagram that for both approaches the repetitive jamming of maximum-length sequential paths leads (after stage 3) to a greater reduction of |Q| than jamming of whitespace-free paths. The corresponding numbers for the other gazetteers with respect to repetitive jamming were of a similar nature. Reversing the labels of sequential paths and reversing open-class attribute values not covered by any formation pattern results in an insignificant difference (1-2%) in the size of the automata.</Paragraph>
    </Section>
  </Section>
</Paper>