<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1069"> <Title>Effective Structural Inference for Large XML Documents</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Previous Work </SectionTitle> <Paragraph position="0"> The inference of structure in XML information is a relatively new area of research. However, there are several closely related topics that have been studied for a longer period. Many of these topics fall into the general field of Inductive Inference, more specifically the sub-field of Grammatical Inference. This sub-field is concerned with the theory and methods for learning grammars from example data. For further details concerning the field of Grammatical Inference the reader is referred to the surveys of Pitt (Pitt 1989) and Sakakibara (Sakakibara 1997). In addition, there has also been prior research into automatic recognition of document structure. Earlier attempts by (Chen 1991), (Fankhauser & Yu 1994) and (Shafer 1995) in similar problem spaces all use solutions based on heuristic methods. In each case, the generalisation step involves searching for similar patterns in the data and combining the corresponding structural information. Although these techniques may work well in some cases, their applicability is restricted by a lack of generality. The approaches of (Ahonen 1996) and (Young-Lai 1996) are more powerful in concept. In both of these works, methods derived from theoretical grammatical inference are applied to the problem of inferring DTD content models. The first known application of such theory to this problem, in (Ahonen 1996), makes use of a characterising method to infer a subset of the regular language class. The alternative solution in (Young-Lai 1996) makes use of an adapted stochastic method. In both cases, the results are post-processed to produce more desirable content models. Unfortunately, neither paper investigates or compares other methods.
It is partly for this reason that these methods have been included in this study for comparison. The more recent paper of Garofalakis et al. (Garofalakis et al. 2000) is very similar to this work in terms of motivation and the application of information theory. However, their inference algorithms are based upon direct generalisation and factoring of regular expressions, with information theoretic principles used to choose a final result from a pool of candidates. In contrast, we propose a hybrid method which employs various principles throughout the inference process, with the aim of producing a more general method.</Paragraph> </Section> </Paper>