<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-2001">
  <Title>Ronan.Reilly@may.ie</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 System overview
</SectionTitle>
    <Paragraph position="0"> The system has two phases. The first phase forms a map of valid XML marked-up documents using the self-organizing map (SOM) algorithm; a valid XML document is one that is well-formed and has been validated against a DTD. The second phase automatically marks up new (unmarked) documents according to the markup of existing documents. Once a document is marked up, the system's behaviour is modified to improve accuracy. The two phases are currently implemented independently but will be combined to form an integrated hybrid system. This paper focuses on phase 2 of the system.</Paragraph>
    <Paragraph position="1"> Phase 2 of the system is implemented as an independent automatic XML markup system, which is shown in Figure 1. It comprises two main modules: a rule extraction module and a markup module. The rule extraction module extracts rules using an inductive learning approach (Mitchell, 1997). First, during a preliminary phase, training examples are collected from the valid XML marked-up documents.</Paragraph>
    <Paragraph position="2"> These documents should be from a specific domain, and their markup should be valid and conform to the rules of a single Document Type Definition (DTD). An XML document consists of a strictly nested hierarchy of elements with a root element. Only elements having text nodes are considered as markup elements for our system. The markup of elements nested within other elements can be accomplished by using the DTD. Each training instance corresponds to an element containing a text node from the collection of marked-up documents. The text enclosed between the start and end tags of each occurrence of an element is encoded as a fixed-width feature vector. We have used 31 features in our experiments. An inductive learning algorithm processes these encoded instances to learn classifiers for elements having specific tag names; in our system, the C5 program is used for this purpose. These classifiers segment the text of an unmarked document into the different elements of the resulting XML marked-up document and are later used to mark up the segments of text as XML elements.</Paragraph>
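    <Paragraph> As a minimal sketch of the encoding step described above, the following Python fragment collects one training instance per element that has a text node and encodes its text as a fixed-width vector. The five features, the function names and the sample document are hypothetical illustrations only; the 31 features used in our experiments are not listed in this section, and the learning step with C5 is not shown. <![CDATA[
```python
import xml.etree.ElementTree as ET


def encode_text(text):
    """Encode a text segment as a fixed-width vector (5 hypothetical features)."""
    words = text.split()
    return [
        len(words),                                     # number of words
        len(text),                                      # number of characters
        sum(ch.isdigit() for ch in text),               # number of digits
        sum(w[0].isupper() for w in words),             # capitalised words
        int(text.rstrip().endswith((".", "!", "?"))),   # ends like a sentence
    ]


def training_instances(xml_source):
    """Return (feature_vector, tag_name) for each element with non-empty text."""
    root = ET.fromstring(xml_source)
    instances = []
    for elem in root.iter():                # document order, root included
        text = (elem.text or "").strip()
        if text:                            # skip elements without a text node
            instances.append((encode_text(text), elem.tag))
    return instances


# Hypothetical marked-up document used only for illustration.
SAMPLE = "<letter><date>12 May 2003</date><body>Dear Sir. Thank you.</body></letter>"

if __name__ == "__main__":
    for vector, tag in training_instances(SAMPLE):
        print(tag, vector)
```
]]> Each pair of vector and tag name could then be passed to any inductive learner; the tag name plays the role of the class label.</Paragraph>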
    <Paragraph position="3"> The second module deals with the creation of XML markup. The unmarked document to be used for this process should be from the same domain as, and have a similar structure to, the documents that were used for learning the rules. To accomplish the markup, the unmarked document is segmented into pieces of text using a variety of heuristics. These heuristics are derived from the set of training examples. Using the DTD to which the training document set conforms, together with the text segments stored for each element, the hierarchical structure of the document is encoded and the marked-up document is produced.</Paragraph>
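    <Paragraph> The final step of the markup module can be sketched as follows, assuming the text has already been segmented and each segment has been assigned a tag name. The dictionary CONTENT_MODEL is a hypothetical, much simplified stand-in for a DTD (each parent maps to an ordered list of children), and the function name and sample segments are likewise assumptions made for illustration. <![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a DTD: parent element -> ordered child elements.
CONTENT_MODEL = {
    "letter": ["header", "body"],
    "header": ["date", "sender"],
}


def build_markup(root_tag, labelled_segments, content_model):
    """Nest labelled text segments under root_tag following content_model."""
    # Store the text segments found for each element name.
    by_tag = {}
    for tag, text in labelled_segments:
        by_tag.setdefault(tag, []).append(text)

    def build(tag):
        children = content_model.get(tag)
        if children is None:
            # Leaf element: one element per stored text segment.
            elems = []
            for text in by_tag.get(tag, []):
                elem = ET.Element(tag)
                elem.text = text
                elems.append(elem)
            return elems
        # Container element: recurse over its children in declared order.
        parent = ET.Element(tag)
        for child in children:
            parent.extend(build(child))
        return [parent]

    return build(root_tag)[0]


# Hypothetical output of the segmentation and classification steps.
SEGMENTS = [("date", "12 May 2003"), ("sender", "A. Smith"), ("body", "Dear Sir.")]

if __name__ == "__main__":
    tree = build_markup("letter", SEGMENTS, CONTENT_MODEL)
    print(ET.tostring(tree, encoding="unicode"))
```
]]> A real DTD content model also allows optional, repeated and alternative children, which this sketch does not attempt to handle.</Paragraph>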
    <Paragraph position="4"> The markup produced by the system can be validated against a DTD. However, to check the accuracy of the markup, we have to examine it manually and compare it with the original source (if available), because XML processors can only validate the syntax of the markup and not its semantics.</Paragraph>
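    <Paragraph> The distinction between syntactic validity and semantic accuracy can be illustrated with a small sketch. Both sample documents below parse without error and have the same element structure, yet the second has its text assigned to the wrong elements; only a comparison with the original source reveals this. The accuracy measure, the function names and the sample documents are assumptions made for illustration and do not replace the manual examination described above. <![CDATA[
```python
import xml.etree.ElementTree as ET


def tagged_text(xml_source):
    """Return (tag, text) pairs for elements with non-empty text, in document order."""
    root = ET.fromstring(xml_source)   # raises ParseError if not well-formed
    return [(e.tag, e.text.strip()) for e in root.iter()
            if e.text and e.text.strip()]


def markup_accuracy(produced, reference):
    """Fraction of reference (tag, text) pairs reproduced at the same position."""
    got = tagged_text(produced)
    want = tagged_text(reference)
    if not want:
        return 0.0
    hits = sum(1 for g, w in zip(got, want) if g == w)
    return hits / len(want)


# Hypothetical original source and system output; same structure, swapped text.
REFERENCE = "<letter><date>12 May 2003</date><sender>A. Smith</sender></letter>"
PRODUCED = "<letter><date>A. Smith</date><sender>12 May 2003</sender></letter>"

if __name__ == "__main__":
    print(markup_accuracy(PRODUCED, REFERENCE))
```
]]> A parser accepts both documents, whereas the position-wise comparison reports that none of the text segments in the second document carries the tag it has in the original source.</Paragraph>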
  </Section>
</Paper>