<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-3031">
  <Title>Reformatting Web Documents via Header Trees</Title>
  <Section position="4" start_page="121" end_page="121" type="intro">
    <SectionTitle>
2 Definitions
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="121" end_page="121" type="sub_section">
      <SectionTitle>
2.1 Definition of Terms
</SectionTitle>
      <Paragraph position="0"> Our system decomposes an HTML document into a list of blocks. A block is defined as a part of a web document that is delimited by separators. A separator is a sequence of HTML tags and symbols.</Paragraph>
      <Paragraph position="1"> Symbols are defined as characters in texts that are neither numbers nor letters. Figure 1 shows an example of the conversion of an HTML document to a list of blocks.</Paragraph>
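      <Paragraph position="2"><![CDATA[ The definitions above can be illustrated with a minimal sketch. It is not the system's actual implementation: the function name split_into_blocks and the regular expressions are hypothetical, and the sketch assumes that a tag is anything between angle brackets, that whitespace is not a symbol, and that whitespace adjacent to tags or symbols belongs to the separator.

```python
import re

# Assumption: an HTML tag is anything between angle brackets.
TAG = r"<[^>]*>"
# Assumption: a symbol is any character that is not a letter, digit, or
# whitespace. The underscore is listed separately because \w includes it.
SYMBOL = r"[^\w\s]|_"
# A separator is a run of tags and symbols, with optional whitespace around them.
SEPARATOR = re.compile(rf"\s*(?:(?:{TAG}|{SYMBOL})\s*)+")


def split_into_blocks(html):
    """Split an HTML string into the text blocks found between separators."""
    parts = SEPARATOR.split(html)
    # Drop the empty strings that appear when the input starts or ends
    # with a separator.
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    page = "<h1>About Me</h1><p>Name: Taro<br>Age: 20</p>"
    for block in split_into_blocks(page):
        print(block)
```

Under these assumptions, spaces inside running text do not split a block, while a colon or a tag does. Character entities, scripts, and comments are not handled in this sketch.]]></Paragraph>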
      <Paragraph position="3"/>
    </Section>
    <Section position="2" start_page="121" end_page="121" type="sub_section">
      <SectionTitle>
Document
</SectionTitle>
      <Paragraph position="0"> A header is defined as a block that modifies subsequent blocks. In other words, a header is a block that can serve as a tag annotating the blocks that follow it. Some examples of headers are Titles (e.g., &quot;About Me&quot;), Headlines (e.g., &quot;Here is my profile:&quot;), Attributes (e.g., &quot;Name&quot;, &quot;Age&quot;, etc.), and Dates.</Paragraph>
    </Section>
    <Section position="3" start_page="121" end_page="121" type="sub_section">
      <SectionTitle>
2.2 Definition of the Task
</SectionTitle>
      <Paragraph position="0"> The system produces header trees for given web documents. A header tree can be seen as an indented list of blocks in which the indent level of each node equals the depth of the node, as shown in Figure 2. The main part of our task is therefore to assign a depth to each block in a given web document; heuristic rules are then employed to construct a header tree from the resulting list of depths. Thus, the input to the system is a list of blocks and the output is a list of depths. In the next section, we discuss the task of assigning a depth to each block.</Paragraph>
      <Paragraph position="1"> The system also produces a nested-list representation of header trees for the purpose of evaluation. In the nested-list representation, each node that has children is represented by a list whose first element represents the parent and whose remaining elements represent the children. Figure 3 shows the nested-list representation of the tree in Figure 2.</Paragraph>
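      <Paragraph position="2"><![CDATA[ The relation between a list of depths, the indented list, and the nested-list representation can be sketched as follows. This is only an illustration under stated assumptions: the heuristic rules used by the system are not given in this section, so the sketch substitutes a plain stack-based construction that attaches each block to the nearest preceding block with a smaller depth. All function names are hypothetical, and returning a list of top-level trees (rather than a single rooted tree) is likewise an assumption.

```python
def indent_blocks(blocks, depths, indent="  "):
    """Render blocks as an indented list: indent level equals depth."""
    return "\n".join(indent * d + b for b, d in zip(blocks, depths))


def build_tree(blocks, depths):
    """Attach each block to the nearest preceding block of smaller depth.

    Every node is temporarily stored as [block, child, child, ...].
    """
    root = [None]            # artificial holder for the top-level nodes
    stack = [(-1, root)]     # (depth, node) for each currently open ancestor
    for block, depth in zip(blocks, depths):
        # Close nodes that cannot be ancestors of the current block.
        while stack[-1][0] >= depth:
            stack.pop()
        node = [block]
        stack[-1][1].append(node)
        stack.append((depth, node))
    return root


def collapse(node):
    """Turn childless nodes into plain strings; keep [parent, children...]."""
    head, children = node[0], node[1:]
    if not children:
        return head
    return [head] + [collapse(child) for child in children]


def to_nested_lists(blocks, depths):
    """Return the nested-list form of each top-level tree."""
    root = build_tree(blocks, depths)
    return [collapse(child) for child in root[1:]]


if __name__ == "__main__":
    blocks = ["About Me", "Name", "Taro", "Age", "20"]
    depths = [0, 1, 2, 1, 2]
    print(indent_blocks(blocks, depths))
    print(to_nested_lists(blocks, depths))
```

In this sketch a node with children becomes a list whose first element is the parent, matching the description above, while a leaf stays a plain string.]]></Paragraph>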
    </Section>
  </Section>
</Paper>