<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1707">
  <Title>Towards a web-based centre on Swedish Language Technology</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Data harvesting
</SectionTitle>
    <Paragraph position="0"> We will now describe what the documents to be harvested should look like, and how our harvester will work. Understanding this part of the paper requires some knowledge of HTML and XML.

2.1 Document structure

In order for our harvesting engine to retrieve and store documents properly, they need to conform to some standards. Contributions may be in HTML or XML. In the case of HTML, the contributing site will need to annotate existing documents with tags that tell us what is what. We have chosen to use the META and SPAN tags, since they may be used as containers for generic meta-information. Use of the META tag is uncontroversial, since it is commonly used for providing data to search engines. The SPAN tag is more commonly used for style information, but may be used to tailor HTML to one's own needs and tastes, according to W3C (1999). Some may still see the use of this tag as controversial, since we use it for semantic, not style, information. We argue that this is not an issue, because the tags will not affect the website in question (presuming the information was already there), whether users implement a stylesheet or not. Alternatives might have been HTML comments or scripting languages. The drawbacks of these methods are that they may force data providers to use invalid HTML, and that they make it harder to make use of existing information.</Paragraph>
    <Paragraph position="1"> As for XML, documents will have to conform to a DTD under development by us in cooperation with NorDokNet. We are also in contact with LT-World, in order to resolve compatibility issues with their information centre before they appear. Since websites using XML are not yet a common sight, documents will usually have to be written from scratch. No web browsers currently support XML satisfactorily, so the documents are not yet of much use outside of Slate.</Paragraph>
    <Paragraph position="2"> Therefore, this method is aimed towards early adopters who see the future benefits of XML and wish to participate in that development.</Paragraph>
    <Paragraph position="3"> 2.2 Examples

An HTML file that we should retrieve and file under people should contain this META tag somewhere in its head:

&lt;meta http-equiv="slate" content="people"&gt;

The document body may contain constructs like this:</Paragraph>
    <Paragraph position="4"> I am a &lt;span class="title"&gt;Ph.D. student&lt;/span&gt; in computational linguistics at &lt;span class="affiliation"&gt;the Department of Computer and Information Science at Svedala University&lt;/span&gt;.</Paragraph>
    <Paragraph position="5"> Note that this is a real-life example (somewhat anonymized and edited). The Ph.D. student's existing web page was annotated with SPAN tags attributed to classes in the Slate database/XML application. If the Ph.D. student changes affiliation, he will probably want to update his own web page. The updated information will then be automatically retrieved by SlateBot.</Paragraph>
    <Paragraph position="6"> XML files are more rigid, since they follow a DTD, and their structure should not be surprising: being XML documents, they contain a declaration and a few containers. The XML-compliant reader may note that our structure does not adhere to any standards such as RDF or OLAC. This will be attended to in later versions (see Discussion, below).</Paragraph>
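The paper's original XML example did not survive extraction. As a rough sketch only, a contributed file might look like the following; the element names (slate, person, title, affiliation) and the DTD filename are hypothetical, since the actual DTD was still under development at the time of writing:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE slate SYSTEM "slate.dtd">
<!-- Hypothetical Slate contribution filed under "people" -->
<slate category="people">
  <person>
    <name>A. Student</name>
    <title>Ph.D. student</title>
    <affiliation>Department of Computer and Information Science,
      Svedala University</affiliation>
  </person>
</slate>
```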
    <Paragraph position="7"> 2.3 Harvesting engine

The core parts of our data harvesting engine, SlateBot, have been implemented, and alpha tests have commenced using a very small database and XML application listing people.</Paragraph>
    <Paragraph position="8"> SlateBot is an application being developed in Perl, in order to accommodate entry modes a) and b) above. The application consists of five main parts, making use of corresponding Perl modules: a head parser, an HTML parser, a link extractor, an XML parser, and a database interface.

For input, it takes a list of links to participating sites. The list is maintained by our webmaster. For each entry in the list, SlateBot makes an HTTP GET request, just like a regular web browser. It then checks whether the document in question is written in HTML or XML, and runs either the HTML parsers and the link extractor, or the XML parser. The information returned from the parsers is then translated into XML and entered into the database by the database interface.</Paragraph>
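The control flow just described can be sketched as follows. This is a minimal illustration in Python, not SlateBot's actual Perl code; the fetch, parser, and storage functions are hypothetical stand-ins injected as parameters:

```python
# Sketch of the harvesting loop described above: fetch each listed page,
# decide whether it is XML or HTML, and dispatch to the matching parser.

def looks_like_xml(document: str) -> bool:
    """Crude format check: XML contributions start with an XML declaration."""
    return document.lstrip().startswith("<?xml")

def harvest(link_list, fetch, parse_html, parse_xml, store):
    """fetch/parse_*/store are injected so the sketch stays self-contained."""
    for url in link_list:
        document = fetch(url)            # one HTTP GET per listed site
        if looks_like_xml(document):
            records = parse_xml(document)
        else:
            records = parse_html(document)
        for record in records:           # hand results to the database interface
            store(record)
```

In a real harvester the fetch step would be an HTTP client call and the store step the database interface; keeping them as parameters makes the dispatch logic easy to test in isolation.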
    <Paragraph position="9"> The reason for having three different HTML parsers (including the link extractor) is, of course, that they do different things. The head parser only parses information in the HTML-HEAD part of the document, in order to check whether the document belongs in our database at all, and if so, in what main category. The HTML parser looks for SPAN tags, and returns those that correspond to our categories. Finally, the link extractor searches the document for links to other documents. This is provided for future development, in case we need our harvester to follow links in documents.</Paragraph>
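The body-parsing step — collecting SPAN tags whose class matches a known category — might look like this. Again a Python sketch rather than the actual Perl module; the category names are taken from the paper's own example:

```python
# Sketch of the HTML body parser: harvest the text of SPAN tags whose
# class attribute matches a category known to the Slate database.
from html.parser import HTMLParser

CATEGORIES = {"title", "affiliation"}  # classes from the paper's example

class SpanExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._current = None   # category of the SPAN we are currently inside
        self.fields = {}       # category -> harvested text

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            cls = dict(attrs).get("class")
            if cls in CATEGORIES:
                self._current = cls
                self.fields[cls] = ""

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data
```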
    <Paragraph position="10"> The XML parser is simpler, because we can assume that the XML file conforms to our DTD. We simply parse the file to find out which of our categories are implemented in the document, and send those to the database interface.</Paragraph>
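The XML path can be sketched just as briefly: because the file is assumed to follow the DTD, we only look for the elements corresponding to our categories. The element names here are hypothetical, matching the earlier sketch rather than the real Slate DTD:

```python
# Sketch of the XML parser: find which of our categories are implemented
# in a DTD-conforming document and return their text content.
import xml.etree.ElementTree as ET

CATEGORIES = ("title", "affiliation")  # hypothetical category elements

def extract_categories(xml_text):
    """Return a dict of the category fields implemented in the document."""
    root = ET.fromstring(xml_text)
    found = {}
    for category in CATEGORIES:
        element = root.find(f".//{category}")
        if element is not None:            # category implemented in this file
            found[category] = (element.text or "").strip()
    return found
```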
    <Paragraph position="11"> 2.4 Tests

We have made some preliminary tests of the engine. Ph.D. students from GSLT and a few others were asked to provide information on themselves in any of three of the four entry modes above. Web forms were left out of the first tests, because we needed to ascertain that we could delete the database without consequences.</Paragraph>
    <Paragraph position="12"> The Ph.D. students at GSLT come from a wide range of backgrounds, so it was expected that some would be more interested in XML than others. Three of the entries we received were links to HTML files, and three were links to XML files. One reply was an e-mail request for entry. Naturally, the information base is too small to draw any real conclusions concerning user preferences, but it does seem that we are headed in the right direction in providing the different modes.</Paragraph>
    <Paragraph position="13"> The tests have helped us iron out a few of the unforeseen bugs that come with all software development, and we are now ready to enlarge the database model.</Paragraph>
  </Section>
</Paper>