<?xml version="1.0" standalone="yes"?> <Paper uid="P05-3031"> <Title>Reformatting Web Documents via Header Trees</Title> <Section position="5" start_page="121" end_page="122" type="metho"> <SectionTitle> 3 Header Extraction Algorithm </SectionTitle> <Paragraph position="0"> In this section, we describe our algorithm, which receives a list of blocks and returns a list of depths.</Paragraph> <Section position="1" start_page="121" end_page="121" type="sub_section"> <SectionTitle> 3.1 Basic Concepts </SectionTitle> <Paragraph position="0"> The algorithm proceeds in two steps: separator categorization and block clustering. The first step estimates local block relations (i.e., relations between neighboring blocks) via probabilistic models for the characters and tags that appear around separators. The second step supplements the first by extracting the undetermined relations between blocks, focusing on global features, i.e., regularities in HTML tag sequences. We employed a clustering framework to implement a flexible regularity detection system that is robust to noise.</Paragraph> </Section> <Section position="2" start_page="121" end_page="122" type="sub_section"> <SectionTitle> 3.2 STEP 1: Separator Categorization </SectionTitle> <Paragraph position="0"> The algorithm classifies each block relation into one of three classes: NON-BOUNDARY, RELATING, or UNRELATING. Blocks that sandwich a RELATING separator are regarded as consisting of a header and the block it modifies. Figure 4 shows an example of separator categorization for the list of blocks in Figure 1. The left block of a RELATING separator must be at a smaller depth than the right block. Figure 2 shows an example: in this tree, NAME is at a smaller depth than John. 
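As an illustration (not the authors' implementation), the depth constraints implied by these separator classes can be sketched in Python. Treating any class other than RELATING and NON-BOUNDARY as a reset to the top level is a simplifying assumption of this sketch; the paper's clustering step refines such cases.

```python
# Hypothetical sketch: turn a sequence of separator classes into a depth list.
# RELATING: the left block is a header of the right block (right goes deeper).
# NON-BOUNDARY: both blocks share the same depth.
# Any other class (e.g. UNRELATING) is treated here, for simplicity,
# as returning to the top level.

def assign_depths(blocks, separator_classes):
    # blocks has n elements; separator_classes has n - 1 elements,
    # one for each separator between neighboring blocks.
    depths = [0]
    for cls in separator_classes:
        if cls == "RELATING":
            depths.append(depths[-1] + 1)
        elif cls == "NON-BOUNDARY":
            depths.append(depths[-1])
        else:  # UNRELATING (simplified)
            depths.append(0)
    return depths

# The "NAME: John Smith" example: ":" is RELATING, the space between
# John and Smith is NON-BOUNDARY, so NAME is shallower than both.
print(assign_depths(["NAME", "John", "Smith"],
                    ["RELATING", "NON-BOUNDARY"]))  # → [0, 1, 1]
```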
On the other hand, the left and right blocks of a NON-BOUNDARY separator must be at the same depth in the tree representation, for example, John and Smith in Figure 2.</Paragraph> <Paragraph position="1"> We use a probabilistic model that assumes the locality of relations among separators and blocks. In this model, each separator s and the strings around it, l and r, are modeled by means of the hidden variable c, which indicates the class into which s is categorized. We use the character zerogram, unigram, or bigram (chosen according to the number of appearances) for l and r to avoid data sparseness problems. For example, let us consider the following part of the example document: NAME: John Smith.</Paragraph> <Paragraph position="2"> In this case, : is the separator, ME is the left string, and Jo is the right string.</Paragraph> <Paragraph position="3"> Assuming the locality of separator appearances, the model for all separators in a given document set is defined as P(l, s, r) = Π_i P(l_i, s_i, r_i),</Paragraph> <Paragraph position="5"> where l is a vector of left strings, s is a vector of separators, and r is a vector of right strings.</Paragraph> <Paragraph position="6"> The joint probability of obtaining each l, s, and r is P(l, s, r) = Σ_c P(s) P(c|s) P(l|c) P(r|c),</Paragraph> <Paragraph position="8"> assuming that l and r depend only on c, the class of the relation between the blocks around s.</Paragraph> <Paragraph position="9"> This generalization is performed by a heuristic algorithm. The main idea is to use a bigram if its number of appearances is over a threshold, and unigrams or zerograms otherwise. If the frequency of (l, r) is over a threshold, P(l, r|c) is used instead of P(l|c)P(r|c).</Paragraph> <Paragraph position="10"> If the frequency of s is under a threshold, s is replaced by its longest prefix whose frequency is over the threshold. 
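The frequency-based backoff just described can be sketched as follows. The threshold value, counts, and all function names are illustrative assumptions; the paper does not publish its exact values or code.

```python
# Hypothetical sketch of the backoff heuristic: use the (l, r) context
# jointly if the pair is frequent enough, otherwise back off to
# independent unigrams, with "" standing in for a zerogram. Rare
# separators back off to their longest sufficiently frequent prefix.
from collections import Counter

THRESHOLD = 5  # assumed cutoff; not specified in the paper

def context_key(left, right, pair_counts, char_counts):
    if pair_counts[(left, right)] >= THRESHOLD:
        return (left, right)   # frequent pair: model P(l, r | c) jointly
    key_l = left if char_counts[left] >= THRESHOLD else ""
    key_r = right if char_counts[right] >= THRESHOLD else ""
    return (key_l, key_r)      # unigrams; "" marks a zerogram backoff

def backoff_separator(sep, sep_counts):
    # Replace a rare separator by its longest prefix that is frequent enough.
    while sep:
        if sep_counts[sep] >= THRESHOLD:
            return sep
        sep = sep[:-1]
    return sep
```

For instance, with the "NAME: John Smith" fragment, a frequent ("ME", "Jo") pair would be kept whole, while a rare right string would degrade to a unigram or zerogram.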
Based on this model, the class of each separator is determined as ĉ = argmax_c P(c|l, s, r).</Paragraph> <Paragraph position="12"> The hidden parameters P(c|s), P(l|c), and P(r|c) are estimated by the EM algorithm (Dempster et al., 1977). Starting from arbitrary initial parameters, the EM algorithm iterates E-steps and M-steps to increase the (log-)likelihood Σ_i log P(l_i, s_i, r_i).</Paragraph> <Paragraph position="14"> To characterize each class of separators, we use a set of typical symbols and HTML tags, called representatives, for each class. This constraint helps impose structure on the parameter space.</Paragraph> </Section> <Section position="3" start_page="122" end_page="122" type="sub_section"> <SectionTitle> 3.3 STEP 2: Block Clustering </SectionTitle> <Paragraph position="0"> The purpose of block clustering is to take advantage of regularity in visual representations. For example, we can observe regularity between NAME and AGE in Figure 1 because both are sandwiched by the character * and preceded by a blank line. This visual representation is produced by a corresponding pattern in the HTML source. Our idea is to define similarities between (the contexts of) blocks based on the similarities between their surrounding separators. Each separator is represented by a vector consisting of the symbols and HTML tags included in it, and similarities between separators are calculated as cosine values. The algorithm proceeds in a bottom-up manner, examining a given block list from tail to head, finding the block that is most similar to the current block, and collecting them into the same cluster. 
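A minimal sketch of this clustering step, under the assumption of bag-of-symbol separator vectors and an illustrative similarity cutoff (the paper does not publish its exact procedure or parameters):

```python
# Hypothetical sketch of STEP 2: represent each block's preceding
# separator as a bag of characters/tags, compare separators by cosine
# similarity, and scan from tail to head, merging each block into the
# cluster of its most similar predecessor. Cutoff value is assumed.
import math
from collections import Counter

def cosine(a, b):
    # a, b: Counters over the symbols/tags occurring in two separators
    common = set(a).intersection(b)
    dot = sum(a[t] * b[t] for t in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a * norm_b else 0.0

def cluster_blocks(separators, min_sim=0.5):
    # separators[i] is the separator text preceding block i
    vecs = [Counter(s) for s in separators]
    cluster = list(range(len(vecs)))
    for i in range(len(vecs) - 1, 0, -1):   # tail to head
        sims = [(cosine(vecs[i], vecs[j]), j) for j in range(i)]
        best_sim, best_j = max(sims)
        if best_sim >= min_sim:
            cluster[i] = cluster[best_j]
    return cluster
```

With two "*…" header separators and one ":" separator, the sketch groups the two starred blocks (e.g., the NAME and AGE headers of Figure 1) into one cluster.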
After that, all blocks in the same cluster are assigned the same depth.</Paragraph> </Section> </Section> <Section position="6" start_page="122" end_page="123" type="metho"> <SectionTitle> 4 Preliminary Experiments </SectionTitle> <Paragraph position="0"> We used training data consisting of 1,418 web documents of moderate file size (from 1,000 to 10,000 bytes) that did not contain &quot;src&quot; or &quot;script&quot; tags (src tags indicate the inclusion of image files, Java code, etc.). The documents were collected by retrieving all user pages on one server of a Japanese ISP. The former criterion is based on the observation that documents that are too small or too large are hard to use for measuring the performance of algorithms, and the latter follows from the fact that our system currently has no module to handle image files as blocks.</Paragraph> <Paragraph position="1"> We randomly selected 20 documents as test documents. Each test document was bracketed by hand to evaluate machine-made bracketings.</Paragraph> <Paragraph position="2"> The performance of web-page structuring algorithms can be evaluated via the nested-list form of the tree by bracketed recall and bracketed precision (Goodman, 1996). Recall is the rate at which brackets given by hand are also given by the machine, and precision is the rate at which brackets given by the machine are also given by hand. F-measure, the harmonic mean of recall and precision, is used as a combined measure. Recall and precision were evaluated for each test document and averaged across all test documents; these averaged values are called macro-average recall, precision, and f-measure (Yang, 1999). We implemented our algorithm and the following three baselines.</Paragraph> <Paragraph position="3"> NO-CL does not perform block clustering. NO-EM does not perform the EM parameter estimation; every boundary except the representatives is categorized as &quot;UNRELATING&quot;. 
PREV performs neither the EM learning nor the block clustering; every boundary except the representatives is categorized as &quot;NON-BOUNDARY&quot;. It uses the heuristic that &quot;every block depends on its previous block.&quot; Table 1 shows the results. We observed that using both the EM learning and the block clustering resulted in the best performance. NO-EM performed best among the three baselines, which suggests that relying only on HTML tag information is not such a bad strategy when the EM training is unavailable, for example because of an insufficient number of training examples.</Paragraph> <Paragraph position="4"> Results on documents that were rich in HTML tags and had highly coherent layouts were better than results on the others, such as documents with poor separators (e.g., only one space character or one line feed). Some of the current results on documents with such poor visual cues seem difficult to use in practical systems, which indicates that our system still leaves room for improvement.</Paragraph> <Paragraph position="5"> This strategy is based on the fact that it maximized performance in a preliminary investigation.</Paragraph> </Section> </Paper>
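For concreteness, the bracketed recall, precision, and macro-averaged f-measure described in Section 4 can be computed as in this hypothetical sketch; representing brackets as (start, end) block spans is an assumption of the sketch, not a detail from the paper.

```python
# Hypothetical sketch of the evaluation: compare the set of brackets
# (spans over the block list) in the hand-made and machine-made trees,
# then macro-average the per-document scores.

def bracket_scores(gold, predicted):
    # gold, predicted: sets of (start, end) block spans
    hit = len(gold.intersection(predicted))
    recall = hit / len(gold) if gold else 0.0
    precision = hit / len(predicted) if predicted else 0.0
    denom = recall + precision
    f = 2 * recall * precision / denom if denom else 0.0
    return recall, precision, f

def macro_average(per_doc_scores):
    # per_doc_scores: list of (recall, precision, f) tuples, one per document
    n = len(per_doc_scores)
    return tuple(sum(s[i] for s in per_doc_scores) / n for i in range(3))
```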