<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-3005">
  <Title>Indexing Methods for Efficient Parsing</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Preliminaries
</SectionTitle>
    <Paragraph position="0"> The indexing method proposed here can be applied to any chart-based parser. We chose for illustration the EFD parser implemented in Prolog (an extensive presentation of EFD can be found in (Penn, 1999c)). EFD is a bottomup, right-to-left parser, that needs no active edges. It uses a chart to store the passive edges. Edges are added to the chart as the result of closing (completing) grammar rules.</Paragraph>
    <Paragraph position="1"> The chart contains n a0 1 entries (n is the number of words in the input sentence), each entry i holdingedges that have their right margin at position i.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 TFS Encoding
</SectionTitle>
      <Paragraph position="0"> To ensure that unification is carried through internal Prolog unification, we encoded descriptions as Prolog terms for parsing TFSGs. From the existing methods that efficiently encode TFS into Prolog terms ((Mellish, 1988), (Gerdemann, 1995), (Penn, 1999a)), we used embedded Prolog lists to represent feature structures. As shown in (Penn, 1999a), if the feature graph is N-colourable, the least number of argument positions in a flat encoding is N.</Paragraph>
      <Paragraph position="1"> Types were encoded using the attributed variables from SICSTus (SICS, 2001).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Chart Parsing with Indexing
</SectionTitle>
    <Paragraph position="0"> In order to close a rule, all the rules' daughters should be found in the chart as edges. Looking for a matching edge for a daughter is accomplished by attempting unifications with edges stored in the chart, resulting in many failed unifications.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 General Indexing Strategy
</SectionTitle>
      <Paragraph position="0"> The purpose of indexing is to reduce the amount of failed unifications when searching for an edge in the chart. This is accomplished by indexing the access to the chart. Each edge (edge's category or description) in the chart has an associated index key, that uniquely identifies sets of categories that can match with that edge's category. When closing a rule, the chart parsing algorithm looks up in the chart for edges matching a specific daughter. Instead of visiting all edges in the chart, the daughter's index key will select a restricted number of edges for traversal, thus reducing the number of unnecessary unification attempts.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Index Building
</SectionTitle>
      <Paragraph position="0"> The passive edges added to the chart represent rules' mothers. Each time a rule is closed, its mother is added to the chart according to the indexing scheme. The indexing scheme selects the hash entries where the mother 1 is inserted. For each mother M , the indexing scheme is a list containing the index keys of daughters that are possible candidates to a successful unification with M . The indexing scheme is re-built only when the grammar changes, thus sparing important compiling time.</Paragraph>
      <Paragraph position="1"> In our experiments, the index is represented as a hash2, where the hash function applied to a daughter is equivalent to the daughter's index key. Each entry in the chart has a hash associated with it. When passive edges are added to the chart, they are inserted into one or several hash entries. For an edge representing a mother M , the list of hash entries where it will be added is given by the indexing scheme for M .</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Using the Index
</SectionTitle>
      <Paragraph position="0"> Each category (daughter) is associated with a unique index key. During parsing, a specific daughter is searched for in the chart by visiting only the list of edges that have the appropriate key, thus reducing the time needed for traversing the chart. The index keys can be computed off-line (when daughters are indexed by their position, see Section 7) or during parsing (as in Sections 4, 6).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Indexing for CFG Chart Parsing
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Indexing Method
</SectionTitle>
      <Paragraph position="0"> The first indexing method presented in this paper is aimed at improving the parsing times for CFGs. The index key for each daughter is daughter's category itself. In order to find the edges that match a specific daughter, the search take place only in the hash entry associated with that daughter's category. This increases to 100% the ratio of successful unifications (Table 1 illustrates the significance of this gain by presenting the successful unification rate for non-indexing parser).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Experiments for CFG indexing
</SectionTitle>
      <Paragraph position="0"> Several experiments were carried to determine the actual run-times of the EFD and indexed EFD parsers for CFGs.</Paragraph>
      <Paragraph position="1">  parse trees, by constructing a rule from each sub-tree of every parse tree, and removing the duplicates.</Paragraph>
      <Paragraph position="2"> For all experiments we chose a test set of 5 sentences (with lengths of 15, 14, 15, 13, and 18 words) such that each grammar will parse successfully all sentences and each word has only one lexical use in all 5 parses. The number of rules varied from 124 to 3196.</Paragraph>
      <Paragraph position="3"> Figure 1 shows that even for a smaller number of rules, the indexed parser outperforms the non-indexed version. As the number of rules increases, the need for indexing becomes more stringent. Although unification costs are small for atomic CFGs, using an indexing method is well justified.</Paragraph>
      <Paragraph position="4">  The performance measurements for all CFG experiments (as well as for TFSG experiments presented later) were carried on a Sun Workstation with an UltraSparc v.9 processor at 440 MHz and with 1024 MB of memory. The parser was implemented in SICStus 3.8.6 for Solaris 8.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Typed-Feature Structure Indexing
</SectionTitle>
    <Paragraph position="0"> Compared to CFG parsing, for TFSGs the amount of attempted unifications is much smaller (usually UBGs have fewer rules than CFGs), but the unification itself is very costly. Again, indexing could be the key to efficient parsing by reducing the number of unifications while retrieving categories from the chart.</Paragraph>
    <Paragraph position="1"> The major difference between indexing for CFGs and for TFSGs lies in the nature of the categories used: CFGs are mostly associated with the use of atomic categories, while TFSGs employs complex-structure categories (typed-feature structures). This difference makes indexing more difficult for typed-feature structure parsers, since the extraction of an index key from each category is not a trivial process anymore. The following sections describe the solution chosen for indexing typed-feature structure parsers.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Statistical and Non-Statistical Indexing
</SectionTitle>
      <Paragraph position="0"> In our quest for improving the parsing times for TFSGs, we took two different approaches to indexing. The first approach uses statistical measurements carried on a corpus of training sentences to determine the most appropriate indexingscheme. The second approach relies on a priori analysis of the grammar rules, and no training is required. null</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Experimental Resources
</SectionTitle>
      <Paragraph position="0"> For both statistical and non-statistical indexing schemes, a simplified version of the MERGE grammar was used.</Paragraph>
      <Paragraph position="1"> MERGE is the adaptation for TRALE (Meurers and Penn, 2002) of the English Resource Grammar (CSLI, 2002).</Paragraph>
      <Paragraph position="2"> The simplified version has 13 rules with 2 daughters each and 4 unary rules, and 136 lexical entries. The type hierarchy contains 1157 types, with 144 features introduced.</Paragraph>
      <Paragraph position="3"> The features are encoded as Prolog terms (lists of length 13) according to their feature-graph colouring.</Paragraph>
      <Paragraph position="4"> For performance measurements, we used a test set containing 40 sentences of lengths from 2 to 9 words 3 (5 sentences for each length). For training the statistical indexing scheme we use an additional corpus of 60 sentences.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Statistical Indexing for TFS
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Path Indexing
</SectionTitle>
      <Paragraph position="0"> Our statistical approach to indexing has its roots in the automaton-based indexing from (Penn and Popescu, 1997), used in generation, but adapted to indexed edge retrieval. The solution we chose is similar to the quick-check vector presented in (Malouf et al., 2000). When parsing sentences in the training corpus, the parser is modified in order to record, for each unification between two feature structures that failed, the feature path that caused the unification failure. The path causing most of the unification failures across all training corpus will be refered to as the indexing path. The type value at the end of the indexing path is used as an index key.</Paragraph>
      <Paragraph position="1">  The indexing scheme used for adding edges to the chart during parsing is a slightlymodified version of the general scheme presented in Section 3.2. Each edge is associated with an index key. For our statistical indexing, we used the type at the end of an edge's indexing path as the index key for that edge.</Paragraph>
      <Paragraph position="2"> 3The coverage of our version of the MERGE grammar is quite limited, therefore the test sentences are rather short (which is, however, a common characteristic of TFSGs compared to CFGs).</Paragraph>
      <Paragraph position="3"> An edge describing a rule's mother M is added to the indexed chart at all positions indicated by the keys in the list La0 M a1 . Since types are used as index keys, this list is defined as La0 M a1a3a2a5a4 ta6 t a7 kM</Paragraph>
      <Paragraph position="5"> , where kM is the index key for M , a10 is the unique most general type, and a7 is the type unification.</Paragraph>
      <Paragraph position="6">  The retrieval of edges from the indexed chart is accomplished as described in Section 3.3. The index key for each daughter is the type value at the end of the indexing path. In case the indexed path is not specified for a given daughter, the type a10 is used for the key. Hence, searching for a matching edge in the entry described by a10 is identical to using a non-indexed chart parsing.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Path Indexing with Quick Check
</SectionTitle>
      <Paragraph position="0"> The path indexing scheme presented above makes use of a single feature path that causes most of the failed unifications over a corpus of sentences. Since each of the paths causing unification failures represents relatively small percentages of the total failures (the first two paths account for only 18.6% and 17.2%, respectively), we decided to use the first two paths in a mixed approach: the type at the end of the first path was still used as an index key, while the traversal of edges in a hash entry was accompanied by a quick-check along the second path.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.3 Performance
</SectionTitle>
      <Paragraph position="0"> Four parsers were tested: the non-indexed EFD parser, the path-indexed parser (using one path), the non-indexed EFD parser using quick-check, and the combination of path indexing and quick-checking. The results are presented in Table 2.</Paragraph>
      <Paragraph position="1">  using the converted MERGE grammar.</Paragraph>
      <Paragraph position="2"> Although the number of unifications dropped almost to 18% for the combination of path indexing and quickcheck, the difference in parsing times is not as significant. This is due to the costs of maintaining the index: simple path indexing is constantly slower than quick-check. Path indexing combined with quick-check outperforms quick-check for sentences longer than 7 words.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Non-Statistical Indexing for TFS
</SectionTitle>
    <Paragraph position="0"> Statistical indexing and quick-check have a major disadvantage if they are used during grammar development cycles. If the grammar suffers important changes, or the sentences to be parsed are not similar to those from training, the training phase has to be re-run. Hence, an indexing scheme that does not need training is needed.</Paragraph>
    <Paragraph position="1"> The indexing scheme presented in this section reduces the number of hash entries used, thus reducing the cost of manipulating the index. The index key for each daughter is represented by its position (rule number and daughter positionin the rule), therefore the time spent in computing the index key during parsing is practically eliminated.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.1 Index Building
</SectionTitle>
      <Paragraph position="0"> The structure of the index is determined at compile-time (or can be constructed off-line and saved for further uses if parsing is done with the same grammar). The first step is to create the list containing the descriptions of all rules' mothers in the grammar.</Paragraph>
      <Paragraph position="1"> Then, for each mother description, a list La0 Mothera1a13a2</Paragraph>
      <Paragraph position="3"> ia14 D ja1a15a6 daughters that can match Mothera8 is created, where each element of the list L represents the rule number Ri and daughter position D j (inside rule Ri) of a category that can match with Mother.</Paragraph>
      <Paragraph position="4"> For CFGs, the list La0 Mothera1 would contain only the daughters that are guaranteed to match with a specific Mother (thus creating a &amp;quot;perfect&amp;quot; index). For UBGs, it is not possible to determine the exact list of matches, since the content of a daughter can change during parsing. However, it is possible to rule out before parsing the daughters that are incompatible (with respect to unification) with a certain Mother, hence the list La0 Mothera1 has a length between that of a &amp;quot;perfect&amp;quot; indexing scheme and that of using no index at all. Indeed, for the 17 mothers in the MERGE grammar, the number of matching daughters statically determined before parsing ranges from 30 (the total number of daughters in the grammar) to 2. This compromise pays off by its simplicity, reflected in the time spent managing the index.</Paragraph>
      <Paragraph position="5"> During run-time, each time an edge (representing a rule's mother) is added to the chart, its category Cat is inserted into the corresponding hash entries associated with the positions a0 Ri  D ja1 from the list La0 Cata1 . The entry associated to the key a0 Ri  D ja1 will contain only categories that can possibly unify with the daughter at position a0 Ri  D ja1 in the grammar. Compared to the path indexing scheme (Section 6.1) where the number of entries could reach 1157 (total number of types), in this case the number is limited to 30 (total number of daughters).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.2 Using the Index
</SectionTitle>
      <Paragraph position="0"> Using a positional index key for each daughter presents the advantage of not needing an indexing (hash) function during parsing. When a rule is extended during parsing, each daughter is looked up in the chart for a matching edge. The position of the daughter a0 Ri  D ja1 acts as the index key, and matching edges are searched only in the list indicated by the key a0 Ri  Although the statistical and non-statistical indexing techniques can be merged in several ways into a single method, the cost of maintaining a complicated indexing scheme overshadows the benefits. An experiment that combined all indexing techniques presented in this paper produced parsing times almost four times longer than the slowest non-statistical indexing. However, as shown in the followingparagraphs, the statistical information about paths causing unification failures can be used to improve the efficiency of indexing.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
8.1 Encoding Re-ordering
</SectionTitle>
      <Paragraph position="0"> The unification of feature structures is accomplished by means of Prolog term unifications, as described in Section 2.1. This means that the unification of features encoded on the first position in their lists will take place before the unification of features that are encoded at the end of the lists.</Paragraph>
      <Paragraph position="1"> During the training phase presented in Section 6, we observed that features causing most of the unification failures are not placed at the beginning of the list in their encodings. Therefore, we re-arranged the placement of encoded features according to their probabilityto cause unification failures.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>