<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1069">
  <Title>Effective Structural Inference for Large XML Documents</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Overview of Solution
</SectionTitle>
    <Paragraph position="0"> The major grammatical inference methods fall into a few general categories. One of these categories includes a family of algorithms known as state merging methods. A state merging method typically begins by constructing what is known as a Pre x Tree Automaton (PTA) from the positive examples of the language to be inferred. If we let the set of positive examples be R+, then the pre x tree for R+ (PTA(R+)) may be constructed as follows. We begin with an automaton that simply accepts the rst string in R+. Then we iterate over the rest of the strings in R+ and for each one follow transitions in the automaton for as many symbols as possible.</Paragraph>
    <Paragraph position="1"> When a symbol is found that does not match a valid transition, the PTA is augmented with a new path to accept the rest of the string. In particular, a Probabilistic Finite State Automaton (PFSA) is merely an automaton with probabilities associated with each transition and nal state. This is important both for some of the inference methods and for evaluating the quality of solutions.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Evaluating a Solution
</SectionTitle>
      <Paragraph position="0"> To measure the quality of the inferred DTD, we use the concept of Minimum Message Length (MML) (George &amp; Wallace 1984). In particular, we adapt the formula developed for PFSA (Raman 1997) as below:</Paragraph>
      <Paragraph position="2"> where: N is the number of states in the PFSA V is the cardinality of the alphabet plus one tj is the number of times the jth state is visited mj is the number of arcs from the jth state (plus one for nal states) m0j is the number of arcs from the jth state (no change for nal states) nij is the frequency of the ith arc from the jth state M is the sum of all mj values and M0 is the sum of all m0j values</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Implementation
</SectionTitle>
      <Paragraph position="0"> One of the goals of this work was to both produce new inference methods and make comprehensive comparisons with existing techniques. This entailed a signi cant amount of implementation that consists of several modules to perform stages of the inference process. The most important stage involves PTA generalisation using the inference methods. These methods fall into two broad categories, which are labelled generalisation and optimisation in the implementation. The rst of these consists of the Merge methods with its pseudo-code shown in  algorithm 1.</Paragraph>
      <Paragraph position="1"> Algorithm 1 GeneralisePTA Input: A PTA A to be generalised, a merge crite null rion criterion Output: The generalised form of A Method:  1. repeat 2. for all pairs (s1;s2) of states in A do 3. if criterion(s1;s2) then 4. A:merge(s1;s2) 5. criterion:forcedMerges(A) 6. if criterion:determinise() then 7. determinise(A) 8. end if 9. end if 10. end for 11. until no more merges are possible 12. if not criterion:determinise() then 13. determinise(A) 14. end if 15. return A  Here the merge criterion determines the actual behaviour of the inference procedure. For each method belonging to the merge family, a merge criterion class is derived from a base interface. The merge criterion is allowed to make forced merges after an initial merge is decided (see line 5), which may be necessary to t the semantics of a method, or may be more e cient. Also, merge criteria may decide if the PFSA is determinised after every merge (lines 6{8) or only at the end of the inference process (lines 12{ 14). Determinisation itself is performed by merging of states, as opposed to the traditional algorithms. The alternative inference methods all apply optimisation techniques to try and minimise the MML of the PTA. These algorithms vary signi cantly in implementation, ruling out the possibility of building them around the same core algorithm.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Inference Algorithms
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Reference Methods
</SectionTitle>
      <Paragraph position="0"> Several previously applied methods were implemented to evaluate their relative performance. Two such algorithms are those applied by Ahonen in (Ahonen 1996), the rst known paper to address DTD generation using tradition grammatical inference methods. These methods are theoretically appealing, as they guarantee to infer languages falling within certain language classes. These classes are termed k-contextual and (k, h)-contextual, so named as they assume the structure to be inferred to have limited context. It is not clear whether this assumption is valid in practice, however. Another method applied to DTD generation ((Young-Lai 1996)), is derived from more recent work in grammatical inference. The base algorithm is known as Alergia, introduced in (Carrasco &amp; Oncina 1994b). The criterion for state equivalence in Alergia is based upon observed frequencies of transitions and nalities of a pair of states. As with the methods of Ahonen, Alergia has strong theoretical appeal. Again, though, we are interested in practical performance. Further to the methods described above, we devised two basic optimisation strategies against which to benchmark the results of our main algorithm. The rst of these, termed the Greedy method, is a straight-forward steepest-descent algorithm which employs incremental MML calculation to optimise a PTA.</Paragraph>
      <Paragraph position="1"> Along with this a weighted stochastic hill-climbing method was implemented, which also used incremental MML calculation. These two methods illustrate the need for more sophisticated optimisation algorithms. null</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 The sk-strings Method
</SectionTitle>
      <Paragraph position="0"> The sk-strings method actually consists of a family of algorithms, described in (Raman &amp; Patrick 1997) and in more detail in (Raman 1997), of which ve were implemented. The basis of these algorithms is an extension upon the k-tails heuristic of Biermann and Feldman (Biermann &amp; Feldman 1972), which in turn is a relaxed variant of the Nerode equivalence relation. Under the Nerode relation, a pair of states are equivalent if they are indistinguishable in the strings that may be accepted following them.</Paragraph>
      <Paragraph position="1"> The k-tails criterion relaxes this to only considering these strings (tails) up to a given length (k.) The sk-strings method is extended using stochastic automata, and considers only the top s percent of the most probable k-strings. The k-strings di er from k-tails in that they are not required to end at a nal state, unless they have length less than k. The probability of a k-string is calculated by taking the product of the probabilities of the transitions exercised by that k-string.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Ant Colony Optimisation
</SectionTitle>
      <Paragraph position="0"> The Ant Colony Optimisation (ACO) meta-heuristic is a relatively new optimisation technique. First described in (Dorigo et al. 1991), the method uses a positive feedback technique for searching in a similar manner to actual ants. In biological experiments it was revealed that insect ants cooperated in nding shortest paths by leaving pheromone trails as they walked. An ant traveling back and forth along a short path increases the pheromone level on that path more rapidly than an ant using a longer path, thus in uencing more ants to take the shorter route.</Paragraph>
      <Paragraph position="1"> The e ect is then self-reinforcing until eventually all ants will choose the shorter path. ACO algorithms mimic this technique using arti cial ants to search out good solutions to optimisation problems.</Paragraph>
      <Paragraph position="2"> The ACO heuristic operates over several iterations to allow the positive feedback of the pheromones to take e ect. In each iteration, the arti cial ants navigate the search space using only a simple heuristic, but as they move they leave pheromones on the trail they follow. In some variants, including the one used in this work, the pheromone placement is delayed until the end of an iteration, when all ants have completed a full walk. At this point the amount of pheromone assigned to each ant is weighted with respect to the quality of the solution it found. Thus moves involved in higher quality solutions are more likely to be chosen by ants in future iterations. Although ants acting by themselves are only capable of nding relatively poor solutions, working in cooperation they may approach higher quality ones. After a certain number of iterations without improvement to the best solution found the algorithm terminates.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 The Proposed Hybrid Method
</SectionTitle>
      <Paragraph position="0"> The sk-ANT Heuristic: The motivation for this new heuristic was to create a method that would be successful for a variety of input data sizes by combining the best features of both the sk-strings and ACO techniques. One consideration was to rst run the sk-strings algorithms, and then use the results to seed the ACO optimisation. However, this approach su ers from several problems. Firstly, it is not practical to attempt all possible combinations of both algorithms. Thus we would be required to choose a limited number of models resulting from the sk-strings technique to seed the second stage of the process. The simplest way to achieve this would be to choose the best models, up to a reasonable number. These models will not necessarily lead to the best results, though, as they may have already been over-generalised. More importantly, by letting the sk-strings methods run to completion we would lose many of the advantageous aspects of the ACO method. Most notably, its willingness to explore a greater breadth of the search space would be missed.</Paragraph>
      <Paragraph position="1"> The new method thus incorporates both the sk-strings and ACO heuristics at each step of the inference process. It is most easily described as a modi ed version of the ACO technique with the ants guided by an sk-strings criterion. The guiding is made progressively weaker as the model becomes smaller, to allow the advantages of the ACO method for smaller models to take e ect. The key of this new method is a new algorithm for the ant move selection as shown in algorithm 2. In particular in line 4, a merge must pass the sk-strings criterion to be considered. The outer while loop on line 2 and if statement on line 11 combine to progressively weaken the sk-strings criterion when it has become too strict.</Paragraph>
      <Paragraph position="2"> Eventually the criterion will be weak enough to let all merges pass, and the algorithm will behave identically to the original version.</Paragraph>
      <Paragraph position="3"> Algorithm 2 sk-antMoveSelector Input: A set of all state pairs merges, an ant heuristic heuristic, an ant weighting function weighting, a pheromone table pheremones and an sk-strings criterion skCriterion.</Paragraph>
      <Paragraph position="4"> Output: A state pair representing the chosen  merge.</Paragraph>
      <Paragraph position="5"> Method: 1. choices [ ] 2. while choices:size() = 0 do 3. for merge in merges do 4. if skCriterion(merge) then 5. h heuristic(merge) 6. p pheremones[merge] 7. value weighting(h;p) 8. choices:add((value;merge)) 9. end if 10. end for 11. if choices:size() = 0 then 12. skCriterion:weaken() 13. end if 14. end while 15. return stochasticChoice(choices)</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="14" type="metho">
    <SectionTitle>
5 Experimental Results
</SectionTitle>
    <Paragraph position="0"> To ful ll our goal of comprehensive testing, we applied three sets of tests. The rst two of these used generated data, which allowed systematic experimentation on widely varied input. The other test set consisted of a few models chosen from real data, to illustrate the nature of the models inferred by the sk-ANT method. In each test the algorithms were run with a range of input parameters, and the results</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Inference of Large Models
</SectionTitle>
      <Paragraph position="0"> The larger data set also consisted of 100 sample les generated from random PFSA. In this case the PFSA had a maximum of 20 states with an alphabet cardinality of 8. For each of these a total of 25 strings were generated giving an average PTA size of 143.38 states. On average the inferred models were larger than for the small data set, though they ranged from an average of 1.23 to 142.74 states for di ering algorithms.</Paragraph>
      <Paragraph position="1"> Figure 1 shows the success rates of each algorithm in inferring models with the lowest MML values. For each algorithm, two results have been shown. The rst is the frequency of inferring the best model overall, by choosing the best of the algorithm's attempts. The second is the frequency of obtaining the best average performance, derived by averaging all of the algorithm's attempts before ranking. The best over-all performance is most important, whilst the best average indicates stability across di erent input parameters. In particular, sk-ALL denotes the implementation of sk-strings with ve variants from its family (refer to (Raman 1997) for details of di erent variants). The results show the poor performance of both the Stochastic and ACO methods, and the clear dominance of the sk-ANT heuristic. The poor performance of the ACO and Stochastic methods is likely due to the large search space. The heuristic guidance used in the sk-ANT method clearly overcomes this di culty, producing the best results.</Paragraph>
      <Paragraph position="2"> Other poor algorithms are desirable due to their simplicity and e ciency, but are lacking in quality of solutions found.</Paragraph>
      <Paragraph position="3"> The deviations from the best model were calculated for each algorithm, as presented in table 1. The numbers were derived from the di erence between the MML values of the best model inferred by a given algorithm as compared with the best model overall. The hybrid sk-ANT technique again  proved to be the best in terms of average deviation at 2.81%. Thus the newer method is preferable if only one choice is allowed. On the other hand, the sk-ALL heuristic was better in terms of worst-case performance, though the deviation is still rather high. Again, some applications may need to employ a combination of di erence methods to achieve better worst-case results.</Paragraph>
      <Paragraph position="4"> The experiments have revealed many interesting points. Firstly, the k-contextual, (k, h)-contextual and Alergia methods previously applied to this problem have been shown to perform poorly. Although the papers describing their use extend the algorithms and employ re ning operations, it appears that other methods are a more appropriate starting point. We have also seen that the simple Greedy and Stochastic methods cannot match the performance of more complicated techniques. This highlights the di culties inherent in the search space of grammatical inference. The sk-strings method developed in (Raman 1997) has proven to be much more e ective, provided combined results from all of the heuristics are used. A major contribution to its success may lie in its use of statistical data in the inference process. The ACO algorithm's failure on large testing data led to a new method which we have named sk-ANT.</Paragraph>
      <Paragraph position="5"> This new hybrid algorithm proved to be the most e ective by a considerable margin, and is thus the algorithm of choice. Where worst case guarantees are required, we recommend using combined results of both the sk-ANT and sk-ALL methods.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="14" type="sub_section">
      <SectionTitle>
5.2 Real Data Set: Webster's p Element
</SectionTitle>
      <Paragraph position="0"> The real data set we used was an extract of the digital form of the 1913 Webster Unabridged Dictionary. The dictionary format is almost compatible with SGML, and after some preprocessing we were able to extract the structural information in XML format. Only a small subset of the entire dictionary was used, due to its overwhelming size. We selected the paragraph element 'p', which is used to group together the information pertaining to each dictionary word, for the experiment. This particular element was chosen as it exhibits a rather complex and variable structure. This is in contrast to the structure</Paragraph>
      <Paragraph position="2"/>
      <Paragraph position="4"/>
      <Paragraph position="6"> of the other models chosen from real data sets, with the intent being to stretch the capabilities of the sk-ANT method.</Paragraph>
      <Paragraph position="7"> The inferred model is shown in gure 2. Observe the dominant path through states 0, 2, 9, 4 and 1. It is most important that this path is preserved, with the rest of the structure accounting for exceptions and noise in the data. Such irregularity is a fact of life when dealing with semi-structured data such as XML.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="14" end_page="14" type="metho">
    <SectionTitle>
6 Tractability Considerations
</SectionTitle>
    <Paragraph position="0"> The goals of grammatical inference such as the structural inference presented in this paper require not only that our methods be e ective, but also that they are useful in practice. For this reason, we investigate the complexity of our new algorithm of choice, namely sk-ANT. Coverage of the tractability of the other algorithms may be found in the original papers in which they are presented ((Ahonen 1996), (Carrasco &amp; Oncina 1994b) and (Raman 1997).) In our testing, we found that the sk-ANT method was the most expensive in terms of running time. This did not provide an obstacle for the experimental data, but may become important when working with very large PTA (several thousand states) or under tight time constraints. Fortunately, in practice, a model for typical large XML documents would be normally fewer than one hundred states.</Paragraph>
    <Paragraph position="1"> By analysing the sk-ANT algorithm, we found that the most important factor is the number of merges considered at each iteration of the inference process. At every step, each one of the possible state pairings is considered for merging. At a stage when  the PFSA being inferred has r states, there are (r2) such merges. This may be expanded to give: r(r 1)  merges considered. As the algorithm proceeds from the full number of states, say n, until there is just one state, the total number of merges considered is:</Paragraph>
    <Paragraph position="3"> These merges are considered by all ants, introducing an extra constant factor. This factor does not in uence the asymptotic complexity, however, which is clearly O(n3). Although not unmanageable, there are applications for which this complexity may be too great.</Paragraph>
    <Section position="1" start_page="14" end_page="14" type="sub_section">
      <SectionTitle>
6.1 Measured Scalability
</SectionTitle>
      <Paragraph position="0"> To complement the theoretical analysis, we also performed a simple empirical test to measure the performance of the algorithm in practice. The test was performed on PTA with sizes ranging from 10 to 150 states, with several of each size. Keeping in line with the experimental results shown earlier, we used the same range of parameters and averaged the results for each PTA size.</Paragraph>
      <Paragraph position="1"> Figure 3 shows a graph of the obtained timing values. The points give the actual values for each PTA size, with a bezier approximation shown using a solid line. For comparison, we have also included a cubic function, shown with the dotted line. Clearly the analytical result corresponds well to the empirical measurement, with the cubic function proving to be a reasonable approximation of the points. Note that the individual CPU times shown are not of particular interest, as the primary goal of the implementation was not e ciency. Pro ling has shown that there is scope for improvement, though of course the asymptotic growth will remain the dominant factor.</Paragraph>
    </Section>
    <Section position="2" start_page="14" end_page="14" type="sub_section">
      <SectionTitle>
6.2 Discussion
</SectionTitle>
      <Paragraph position="0"> The analytical and empirical analysis of the sk-ANT algorithm have both shown it to have asymptotic complexity of O(n3). In practice, this is quite acceptable for a wide range of applications. Where it is not acceptable, there are several alternatives. A simple one would be to employ one of the more efcient algorithms until the PFSA inferred reaches a small enough size to seed the sk-ANT method. A more complicated method may use a modi ed sk-ANT heuristic that employs approximations, and perhaps limits the size of the neighbourhood examined by each individual ant. Such a method may be able to reduce the complexity to O(n2), though it has not been thoroughly investigated. Cases where large amounts of data are involved can often be broken down into several sub-problems to make them manageable. For instance, in a database where documents are drawn from many sources, it may be appropriate to treat each data source separately. Indeed, this is a requirement for many applications.</Paragraph>
      <Paragraph position="1"> In other cases, it is feasible to take a sample of the data to use for inference, as the sk-ANT algorithm has been shown to perform well on sparse examples.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML