<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0608"> <Title>An Extensible Framework for Efficient Document Management Using RDF and OWL</Title> <Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 2. Document Management Architecture </SectionTitle> <Paragraph position="0"> Without differentiating at the level of content, layout, or format, we treat documents as information resources. These resources can be distributed across document repositories called e-Doc servers. Figure 2.1 shows a simplified distributed document management system in which three document repositories coexist with the Annotea-enabled annotation framework [5]. These servers implement procedural mechanisms for querying and retrieving documents. In addition, documents can be annotated; the annotations reside on an independent server, the annotation server, which also serves as a document server. In principle, annotations can themselves be viewed as information resources, which are described in RDF.</Paragraph> <Paragraph position="1"> [Figure 2.1: Server 1, Server 2, Server 3; server architecture] The e-Doc server consists of several functional layers that communicate with one another and together serve the purpose of document management. Although these layers are distinct in their data flow and in how they process information, each provides functionality that the e-Doc server exploits. Figure 2.2 shows these layers. At the foundation level, every document on the e-Doc server is assumed to adhere to a single syntax, XML, which is the topmost layer in the architecture. The second layer contains the access points, which are broadly categorized along dimensions such as metadata, conceptual/ontology system, and terminology. Section 4 describes the access points in detail.
The e-Doc server is assumed to be flexible enough to handle all possible ontology formats and standards, whether the content is a native XML document, text, or picture/video data coming from streaming applications. This forms the third layer of the e-Doc server. The bottom layer represents annotations [6], which adhere to the RDF [4] syntax. This layer is an integral part of the e-Doc server because it adds annotation capability and RDF-describable semantics to documents that are actively retrieved or already stored on the server [7]. RDF also makes it possible to use annotations as access points for the documents.</Paragraph> <Paragraph position="2"> As Figure 2.2 shows, a user interacts with the server through a client interface by launching queries. By providing a variety of access points, the architecture lets the user retrieve documents using different levels of description. In the following sections, we describe each of these access layers in more detail.</Paragraph> <Paragraph position="3"> 3. Annotations: Specified as RDF Model Annotations form the most abstract layer within the e-Doc architecture. They can be broadly defined as comments, notes, explanations, or other external remarks that can be attached to a document or to a sub-portion of a document. Because annotations are external, a document can be annotated as a whole or in part without editing its content or structure. Conceptually, annotations can be considered metadata, as they give additional information about an existing piece of data.
Annotations can have many distinguishing properties, which can be broadly classified as: * Physical location: An annotation can be stored locally or on one or more annotation servers. * Scope: An annotation can be associated with a document as a whole or with a sub-portion of a document.</Paragraph> <Paragraph position="4"> * Annotation type: Annotations can have various functional types, such as &quot;Comment&quot;, &quot;Remark&quot;, and &quot;Query&quot;.</Paragraph> <Paragraph position="5"> Because of this abstract nature and the multiplicity of functional types, a formal treatment of annotations is often unwieldy. A semantically driven structural representation for annotations is therefore desirable; we describe one below.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Annotation Semantics </SectionTitle> <Paragraph position="0"> Annotations are stored on one or more annotation servers. These servers support the exchange protocols specified by Annotea [5]. Essentially, the annotation server can be regarded as a general-purpose RDF store with additional mechanisms for optimized queries and access. This RDF store is built on top of a general SQL store and is accessible through an Apache HTTP server (see Figure 3.1). All communication between a client and an annotation server uses standard HTTP methods such as GET and POST. Annotations have metadata associated with them, which is modeled according to an RDF schema and encodes information such as the date of creation of the annotation, the name of the author, the annotation type (e.g.
comment, query, correction), the URI [8] of the annotated document, an XPointer [9] that specifies which part of the document was annotated, and the URI of the body of the annotation, which is assumed to be an XHTML [10] document (Figure 3.2).</Paragraph> <Paragraph position="1"> XPointers are used to point to the annotated portions within the document, while XLinks [11] are used to set up a link between the document and its annotation.</Paragraph> </Section> <Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Annotation Operations </SectionTitle> <Paragraph position="0"> The user selects the text to be annotated and provides the annotation along with other details such as the author name, date of creation, type of annotation, and URI of the annotated document. Annotations are published using the standard HTTP POST method.</Paragraph> <Paragraph position="1"> To do this, the client generates an RDF description of the annotation that includes the metadata and the body, and sends it to the server. The annotation server receives the data and assigns a URI to the annotation, i.e. the body, while the metadata is identified by the URI of the document. For annotation retrieval, the client queries the annotation server via the HTTP GET method, requesting the annotation metadata by means of the document's URI.</Paragraph> <Paragraph position="2"> The annotation server replies with an RDF-specified list of the annotation metadata. For each list of annotations that the client receives, it parses the metadata of each annotation, resolves the XPointer of the annotation and, if successful, highlights the annotated text.
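The retrieval steps described so far can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the class AnnotationMeta, its field names, the helper functions, the query parameter name "annotates", and the example.org URLs are all hypothetical and are not taken from the e-Doc server or from the Annotea specification. The sketch assumes the RDF reply has already been parsed into plain records.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass
class AnnotationMeta:
    """Metadata fields listed in the text; the field names are illustrative."""
    author: str
    created: str          # date of creation, as an ISO 8601 string
    annotation_type: str  # e.g. "comment", "query", "correction"
    annotates: str        # URI of the annotated document
    context: str          # XPointer into the annotated document
    body: str             # URI of the XHTML annotation body


def retrieval_url(server, document_uri):
    """Build the GET URL that requests the annotation metadata of a document.

    The parameter name "annotates" is a placeholder, not a documented API.
    """
    return server + "?" + urlencode({"annotates": document_uri})


def bodies_to_fetch(annotations, document_uri):
    """Return the body URIs of the annotations that target the document."""
    return [a.body for a in annotations if a.annotates == document_uri]
```

A client following the text would issue one GET on the URL returned by retrieval_url, resolve the context of each record, and issue a further GET on a body URI only when the user selects the corresponding highlighted text.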
If the user clicks on the highlighted text, the browser uses the HTTP GET method to fetch the body of the annotation from the URI specified in the metadata.</Paragraph> <Paragraph position="3"> The annotation server implements the following broad categories of annotation functions: * Annotate a document as a whole.</Paragraph> <Paragraph position="4"> * Annotate a portion of a document.</Paragraph> <Paragraph position="5"> * Query all the annotations for a particular document.</Paragraph> <Paragraph position="6"> * Query annotations by type or by any other metadata property, with these properties serving as query parameters across all annotated documents.</Paragraph> <Paragraph position="7"> 4. Model Based Access: Using OWL As described in the previous section, the RDF layer provides an enhanced mechanism for querying and accessing a document. However, full-fledged document management also requires reasoning-based abstract semantics, such as OWL (Web Ontology Language), over a cluster of documents. OWL provides formal mechanisms for describing the ontology of documents. With it, the architecture can provide flexible access points as well as the logical inference mechanisms that metadata queries require.</Paragraph> <Paragraph position="8"> Access points give the user flexible and intuitive access mechanisms. Figure 4.1 depicts a very basic characterization of the access points. As the figure illustrates, a specific access point is needed to direct a query toward a desired result set. Within the Proteus framework, the e-doc architecture provides a model-driven specification of access points, such as a metadata-based, ontological, or terminological model.
The model-driven approach is significant because every access point is associated with an abstract information structure, which makes queries transparent: they remain independent of the actual implementation and data formats (e.g. an XML DTD).</Paragraph> <Paragraph position="9"> Although these models are independent, they are flexible enough to interact with one another. For example, the results of queries on one model can act as references for another model. The references may be transformed into document excerpts by requests made synchronously at the query stage or asynchronously when the user wants to visualize the information.</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Terminological Access: </SectionTitle> <Paragraph position="0"> Terminology can be defined as the description of the specialized vocabulary of an application domain. Because it contains a nomenclature of technical terms, it can provide a conceptual view of the domain. Terminology can be either monolingual or multilingual. Monolingual terminology specifies a one-to-one relation between a term and a concept, between a term and its equivalents, or between a term and the related documents, while multilingual terminology specifies relations between a term and certain target terms or between a term and certain target documents.</Paragraph> <Paragraph position="1"> The following is a simplified Proteus terminology example: [terminology example missing from source]. Figure 4.2 describes a simplified terminology model. The terminological section contains entries such as identifier, subject field, definition, and explanations, whereas the other sections, such as the language and term sections, contain details regarding the language used and the term status, respectively.
This can also be seen in the sample Proteus terminology described above.</Paragraph> <Paragraph position="2"> Terminological access is useful when the user knows a specific term and needs to search within the related domain for documents of interest. For example, an operator at a firm might want to retrieve all the maintenance documents related to the term &quot;Pump&quot;. With the terminological access point, the operator needs only the term to launch the query and retrieve the desired documents. This scenario is depicted in Figure 4.3. Terminological access plays a dual role. On the one hand, it acts as a data source that supports finding monolingual or multilingual equivalents or linguistic descriptions. On the other hand, it provides access to online documents. As a data source, it can also support manual indexing and perform semi-automatic or automatic indexing: * Graphic files (drawings, pictures, video, scanned texts, etc.): manual indexing * Text files: semi-automatic indexing; suggestion of descriptors to be confirmed by a human expert * Data, e.g. from monitoring: automatic indexing with metadata.</Paragraph> <Paragraph position="3"> The terminological model serves as a gateway to the ontology-based conceptual model of the domain (Figure 4.4). A technical term used as a query parameter relates to a set of relevant concepts, which can in turn be used to retrieve the desired set of documents.</Paragraph> </Section> <Section position="4" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Meta-Data Access </SectionTitle> <Paragraph position="0"> Metadata can be loosely defined as &quot;data about data&quot;.</Paragraph> <Paragraph position="1"> Specifically, metadata encodes attributive information about the data, in our case documents, which can be used to access the data.
Within this platform, the meta-model can be seen as a meta-tree of nodes in which every node refers to a precise set of information descriptors. For example, Dublin Core descriptors such as title, author, date, and publisher can be represented as nodes in the description trees.</Paragraph> </Section> <Section position="5" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Meta-model Description </SectionTitle> <Paragraph position="0"> This meta-model is discussed with the Dublin Core [12] model in mind. The meta-model consists of three basic components: a Resource, an Element, and its Value.</Paragraph> <Paragraph position="1"> * Resource - the object being described. * Element - a characteristic or property of the Resource.</Paragraph> <Paragraph position="2"> * Value - the literal value corresponding to the Element.</Paragraph> <Paragraph position="4"> Figure 4.5 shows a simplified view of the Dublin Core reference model, in which the element qualifiers are additional attributes that further specify the relationship of the element to the resource. The value qualifiers, in turn, are additional attributes that further specify the relationship of the value to the element.</Paragraph> <Paragraph position="6"> Access to documents by means of metadata is both important and practical: the user can directly retrieve a well-defined piece of information, provided that a small number of &quot;facts&quot; about it are known, e.g. the author's name, the date, the reference number, or the date of a previous maintenance. This corresponds to a typical situation within the Proteus framework (see Figure 4.6).
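As a rough illustration of the Resource, Element, and Value components and of retrieval from a few known facts, consider the following minimal Python sketch. The names Statement and find_resources, and the document identifiers used with them, are hypothetical and do not come from the Proteus implementation; qualifiers are left out, and a resource is assumed to carry at most one value per element.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One Resource / Element / Value triple of the meta-model."""
    resource: str  # identifier of the document being described
    element: str   # descriptor name, e.g. "creator" or "date"
    value: str     # literal value of the element


def find_resources(statements, **facts):
    """Return the resources whose metadata matches every known fact."""
    by_resource = {}
    for s in statements:
        # Group the element/value pairs of each resource into one record.
        by_resource.setdefault(s.resource, {})[s.element] = s.value
    return sorted(
        resource
        for resource, elements in by_resource.items()
        if all(elements.get(k) == v for k, v in facts.items())
    )
```

For instance, calling find_resources(statements, creator="Dupont", date="2004-01-20") keeps only the documents whose creator and date both match, which mirrors the case of a user who knows two facts about the document being sought.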
Put another way, metadata access can be seen as an advanced index functionality that updates itself and grows automatically as the amount of stored information grows.</Paragraph> <Paragraph position="7"> For example, when documents are sorted by date or type, the date, time, source, or author information can always be collected automatically. For a new maintenance document, however, advanced metadata can be collected by asking a human to enter it into the system.</Paragraph> </Section> <Section position="6" start_page="2" end_page="2" type="sub_section"> <SectionTitle> E-doc Operator Dublin Core </SectionTitle> <Paragraph position="0"> Title: name of the client; Creator: name of the method agent; Subject: Equipment ID; Description: type of equipment; Publisher: CEF CIGMA division; Contributor: division out of CIGMA; Date: date of the draw-up; Type: procedure, FMECA, ...</Paragraph> <Paragraph position="1"> Format: .doc, .ppt, .xls (not useful); Identifier: name of the site (location); Source: former version; Language: French, English, German; Relation: related FMECA, procedure, video, pictures, ...</Paragraph> <Paragraph position="2"> The metadata model can be seen as an enhanced search mechanism. Launching a sequence of access points, i.e. terminology followed by metadata, can help refine the search along an attribute dimension.</Paragraph> </Section> <Section position="7" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Ontology Access </SectionTitle> <Paragraph position="0"> An ontology is a hierarchy of concepts or, put another way, a platform for describing the concepts used within a specific domain, maintenance in our case. Its independence from any specific model or format makes it interoperable. For example, an ontology can be represented as a UML [13] class diagram, and the same ontology can also be represented as an XML schema.
As already discussed, ontology is complementary to terminology in that it attributes concepts to terms.</Paragraph> <Paragraph position="1"> Conceptually, it serves as an abstract structure that interested parties can populate, and it can thus serve as a very important access point. An abridged example of an abstract Proteus OWL [14] ontology can be seen in Figure 4.7. Ontology model description As required by the Proteus project, the ontology model has a three-tiered structure. The three layers are General Concepts (General Maintenance Ontology), Application Profiles, and Industrial Contexts. These layers are built with interoperability with external applications in mind. As Figure 4.8 shows, the general concept layer has the highest interoperability, as it contains basic-level concepts such as Actors, Documents, Location, and Equipment. The second layer (Application Profiles) consists of concepts that are specific to a certain application, for instance those pertinent to a train manufacturing company or an aviation company. All layers inherit concepts, though not necessarily all of them, from the first layer (General Concepts), which is the parent layer of all other layers. The third layer (Industrial Contexts) contains concepts very specific to an industry, for instance car manufacturing companies such as Ford or GM.</Paragraph> <Paragraph position="2"> Instances can be derived only from the last layer, i.e. the Industrial Contexts layer.</Paragraph> <Paragraph position="3"> The model is open to external sources, i.e. ontologies from external sources can be merged into each layer; one example is SUMO [15], which is a higher-level ontology.
It contains very general concepts, which can be used directly within our ontology.</Paragraph> <Paragraph position="4"> OWL-DL is used to specify the ontological model because it provides the following advantages: * Basic support for describing classification hierarchies and simple constraint features, e.g. a migration path for thesauri and other taxonomies; E.g. Racer [16].</Paragraph> <Paragraph position="5"> In a way, ontology access complements terminology access: the terminology structure describes the global concept behind a thematic domain but does not deliver a functional description of the domain. Ontology access provides exactly this functional description (as is usually needed in the maintenance domain). A concept remains global when it refers to a generic class of entities and becomes specific when it describes a particular entity type. Beyond its normal functionality, this access point can be very important when combined with retrieval by natural language and by visual elements (hierarchically structured sets of pictures). In a way, the ontology can be seen as an empty structure with user-defined class relationships, which can be filled first with visual elements (photos, drawings, schemes) and then with the terms that refer to them.</Paragraph> <Paragraph position="6"> For example, Figure 4.9 depicts a visual search of documents via the ontology. To avoid complexity, only recommended terms are used to name the objects represented by the visual elements. Other terms can be left aside, pointing to plain concepts (without visual concepts). The index of the metadata tool could be virtually integrated into the index administered by the terminology tool. This enables a two-step search: the user begins with a word and then finds the item actually sought not by selecting a more specific term from the terminology tool but by looking for a picture of that item in the functional concept.
This index could also be virtually integrated into the index of the functional concept. The user could then situate the search results provided by the metadata tool within the functional structure of the maintained equipment (instead of getting only the designation, ID number, description, and metadata).</Paragraph> </Section> </Section> <Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 5. Data Category Specification </SectionTitle> <Paragraph position="0"> The various models (terminology, annotations, etc.) and functionalities (access primitives to an e-doc server) have to be defined in such a way that a given piece of information (e.g. author, subject field, term) means the same thing from one place to another. Such a semantic definition of data categories (in the terminology of ISO committee TC 37) complements an ontology such as the one we define in the Proteus system, since it is intended to be a general-purpose layer of descriptors that may be used in environments other than that of a specific project. We therefore adopted a methodology similar to that of the ISO TC 37 committee and deployed a data category registry of all descriptors used in the project, as reference semantic units described in accordance with ISO standard 11179 (metadata registries). Such a registry plays a double role: * It provides a unique entry point (a formal public identifier) for any model that refers to it; * It gives a precise description of the data category by means of a definition and associated documentation (examples, application notes, etc.).</Paragraph> </Section> </Paper>