<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1701">
  <Title>RDF(S)/XML LINGUISTIC ANNOTATION OF SEMANTIC WEB PAGES</Title>
  <Section position="5" start_page="12" end_page="12" type="metho">
    <SectionTitle>
1. TEXT ANNOTATION IN CORPUS
LINGUISTICS
</SectionTitle>
    <Paragraph position="0"> The idea of text annotation was originally developed in Corpus Linguistics. Traditionally, linguists have defined a corpus as &quot;a body of naturally occurring (authentic) language data which can be used as a basis for linguistic research&quot; (Leech, 1997). According to McEnery &amp; Wilson (2001), Corpus Linguistics was first applied to research on language acquisition, to the teaching of a second language and to the elaboration of descriptive grammars, among other tasks. With the arrival of computers, the number of potential studies to which corpora could be applied increased exponentially. Nowadays, the term corpus is applied to &quot;a body of language material which exists in electronic form, and which may be processed by computer for various purposes such as linguistic research and language engineering&quot; (Leech, 1997). An annotated corpus &quot;may be considered to be a repository of linguistic information [...] made explicit through concrete annotation&quot; (McEnery &amp; Wilson, 2001).</Paragraph>
    <Paragraph position="1"> The benefit of such an annotation is clear: it makes retrieving and analysing information about what is contained in the corpus quicker and easier.</Paragraph>
    <Paragraph position="2"> Leech (1997) lists the different (possible) levels of linguistic annotation. As Leech himself states, no corpus so far includes all of them; a corpus typically covers only two or, at most, three. Some of these levels were only in their first stage of conception when his paper was written. A smaller but more realistic list of annotation levels is included in EAGLES (1996a), namely: lemma, morpho-syntactic, syntactic, semantic and discourse annotation. Standard recommendations on morpho-syntactic and syntactic annotation of corpora can be found in (EAGLES, 1996a) and (EAGLES, 1996b). A complementary list of general criteria that should be considered when elaborating an annotation scheme can be found in one of the results of the EAGLES project work, the Corpus Encoding Standard (CES, 2000); these criteria are being taken into account in the elaboration of our model (Aguado de Cea, 2002). With respect to the earlier and well-known standardization initiative, TEI, all the works mentioned are TEI-compliant. Thus, for the sake of brevity, we will focus on semantic annotation henceforth.</Paragraph>
    <Paragraph position="3"> As asserted in McEnery &amp; Wilson (2001), two broad types of semantic annotation may be identified: A. The marking of semantic relationships between items in the text (for example, the agents or patients of particular actions). This type of annotation has scarcely begun to be applied.</Paragraph>
    <Paragraph position="4"> B. The marking of semantic features of words in a text, essentially the annotation of word senses in one form or another. This trend has a considerably longer history, but there is no universal agreement in semantics about which features of words should be annotated.</Paragraph>
    <Paragraph position="5"> Although some preliminary recommendations on lexical semantic encoding have already been posited (EAGLES, 1999), no EAGLES semantic corpus annotation standard has yet been published; nevertheless, for the second type of semantic annotation described above, a set of reference criteria has been proposed by Schmidt and mentioned in Wilson &amp; Thomas (1997) for choosing or devising a corpus semantic field annotation system. [Footnote on the lack of agreement about which features of words should be annotated: See, for example, the controversies within the SENSEVAL initiative meetings (Kilgarriff, 1998), (Kilgarriff &amp; Rosenzweig, 2000).]</Paragraph>
    <Paragraph position="6"> These criteria can be summarized as follows: 1. It should make sense in linguistic or psycholinguistic terms.</Paragraph>
    <Paragraph position="7"> 2. It should be able to account exhaustively for the vocabulary in the corpus, not just for a part of it.</Paragraph>
    <Paragraph position="8"> 3. It should be sufficiently flexible.</Paragraph>
    <Paragraph position="9"> 4. It should operate at an appropriate level of granularity (or delicacy of detail).</Paragraph>
    <Paragraph position="10"> 5. It should, where appropriate, possess a hierarchical structure.</Paragraph>
    <Paragraph position="11"> 6. It should conform to a standard, if one exists.</Paragraph>
  </Section>
  <Section position="6" start_page="12" end_page="12" type="metho">
    <SectionTitle>
2. ONTOLOGIES AND SEMANTIC WEB
ANNOTATIONS
</SectionTitle>
    <Paragraph position="1"> AI researchers have found in ontologies (Gruber, 1993), (Guarino et al., 1995), (Studer et al., 1998) the ideal knowledge model to formally describe web resources and their vocabulary and, hence, to make explicit in some way the underlying meaning of the concepts included in web pages. With Ontological Semantics (Nirenburg &amp; Raskin, 2001) as a supporting theory, the annotation of these web resources with ontological information should allow intelligent access to them, should ease searching and browsing within them and should make it possible to exploit new web inference approaches over them. The influential WordNet and EuroWordNet (Fellbaum, 2001) ontologies should be mentioned as valuable resources for this purpose. Many systems and projects have been developed towards this aim so far: SHOE (Luke et al., 2000) proposes HTML page semantic annotation with a Horn clause-based language also called SHOE; the (KA)2 initiative (Benjamins et al., 1999) seeks to annotate HTML documents with ontological information, taking Knowledge Acquisition Community ontologies as a basis; PlanetOnto (Motta et al., 1999) aims at automatically annotating the HTML news pages of an organisation by means of the information obtained from an event-ontology based knowledge base; finally, within the Semantic Community Web Portals project (Staab et al., 2000), an ontology-based architecture for editing and maintaining web portals more easily is being developed. In addition, a number of semantic annotation tools have been developed: COHSE (COHSE, 2002), MnM (Vargas-Vera et al., 2001), OntoMat-Annotizer (OntoMat, 2002), SHOE Knowledge Annotator (SHOE, 2002) and AeroDAML (AeroDAML, 2002).</Paragraph>
  </Section>
  <Section position="7" start_page="12" end_page="12" type="metho">
    <SectionTitle>
3. INTEGRATION OF PARADIGMS: AN
EXAMPLE
</SectionTitle>
    <Paragraph position="0"> The model presented here, OntoTag, is being developed within ContentWeb, a Ministry-funded project which aims at creating an ontology-based platform that enables users to query e-commerce applications in natural language, automatically retrieving information from web documents annotated with ontological and linguistic information. In addition, a prototype in the entertainment domain will be developed.</Paragraph>
    <Paragraph position="1"> ContentWeb objectives can be found in (Aguado de Cea, 2002).</Paragraph>
    <Paragraph position="2"> As part of the elaboration of OntoTag, a first exploration phase has been carried out. A short example from this first phase is presented next. It has been implemented in RDF(S), but an XML version was also developed, and the possibility of using any other language has not been ruled out a priori. In the annotation example given below, two different morpho-syntactic tools were applied: Conexor (Conexor, 2002) and MBT (MBT, 2002). Some other tools are being evaluated for further use, and the XML and RDF(S) annotation tools and wrappers are currently being designed.</Paragraph>
    <Paragraph position="3">  A semantic field (sometimes also called a conceptual field, a semantic domain or a lexical domain) is a theoretical construct which groups together words that are related by virtue of their being connected - at some level of generality - with the same mental concept (Wilson &amp; Thomas, 1997).</Paragraph>
    <Paragraph position="4">  For a more detailed explanation, see (Aguado de Cea, 2002).  Once again the SENSEVAL initiatives must be mentioned: they reveal the demand for semantic standardization in the field of word sense disambiguation (Kilgarriff, 1998), (Kilgarriff &amp; Rosenzweig, 2000).</Paragraph>
    <Paragraph position="5">  [Footnote on Ontological Semantics; the beginning is missing in the source] [...] meaning in natural language and an approach to natural language processing (NLP) which uses a constructed world model - the ontology - as the central resource for extracting and representing meaning of natural language texts, reasoning about knowledge derived from texts as well as generating natural language texts based on representations of their meaning.</Paragraph>
  </Section>
  <Section position="8" start_page="12" end_page="12" type="metho">
    <SectionTitle>
3.1. RDF(S) EXAMPLE
DESCRIPTION
</SectionTitle>
    <Paragraph position="0"> Figure 1, Figure 2 and Figure 3 show the annotation, at the first three levels, of the following Spanish sentence: &quot;Tras cinco anos de espera y despues de muchas habladurias, llega a nuestras pantallas la pelicula mas esperada de los ultimos tiempos.&quot;  At the morpho-syntactic level (Figure 1), every word or lexical token is given a different Uniform Resource Identifier (URI henceforth) and three possible categorisations are included, according to the three different tagsets and systems we want to evaluate. Each tagset has been assigned a different class in the [...]</Paragraph>
    <Paragraph position="2"> To save space, only the annotation of the article &quot;la&quot; has been included in the figure.</Paragraph>
    <Paragraph position="3"> At the syntactic level (Figure 2), every syntactic relationship between morpho-syntactic items is given a new URI, so that it can be referenced in higher-level relationships or by other levels of the annotation model (e.g. &lt;synAnnot:Chunk rdf:ID=&quot;1_510&quot;&gt;).</Paragraph>
    <Paragraph position="4"> Again to save space, only the annotation of the phrase &quot;la pelicula mas esperada de los ultimos tiempos&quot; has been included in the figure.</Paragraph>
    <Paragraph position="5"> At the semantic level (see Figure 3), some components of the lower-level annotations are annotated with semantic references to the concepts, attributes and relationships defined by our (domain) ontology, implemented in DAML+OIL.</Paragraph>
    <Paragraph position="6"> &lt;contentWeb:FilmReview&gt; &lt;contentWeb:text&gt;Tras cinco anos de espera y despues de muchas habladurias, llega a nuestras pantallas la pelicula mas esperada de los ultimos tiempos.&lt;/contentWeb:text&gt;  in our project. The pragmatic counterpart of et been tackled at this phase of and, thus, this level is not included in ple.</Paragraph>
  </Section>
  <Section position="9" start_page="12" end_page="12" type="metho">
    <SectionTitle>
THE XML DATA MODEL
</SectionTitle>
    <Paragraph position="0"> [Text partly lost in the source; gaps are marked with [...]] [...] XML data model, every token from the [...] with a &lt;Word&gt; tag and a RDF [...] the attribute rdf:ID. Immediately [...] &lt;surface_form&gt; tag will be [...] it appeared in text; then come the morpho-syntactic, [...] semantic annotations for this token. [...] to our morpho-syntactic [...] (namespace pos) includes the [...], applied also in SonIsa, the tagger [...]), MBT (a web-based [...]) and the [...] information is annotated by means of an attribute lemma, associated to the tag of the namespace pos.</Paragraph>
    <Paragraph position="1"> The syntactic counterpart of our XML data model (namespace syn) contains, in a TEI-conformant manner, only the syntactic information given by FDG at the moment (more tags may be added as the model is refined). This syntactic information covers EAGLES syntactic layers (c) and (d): showing dependency relations and indicating functional labels. Thus, the attributes defined at this level are: dependent_on, which shows the token on which the present one depends (via its rdf:ID); dependency, which describes the kind of dependency between the two; and surface_syn_tag, which denotes the surface syntactic function of the token in a Constraint Grammar approach. We are now studying the best way to cover EAGLES syntactic layers (a) and (b) - bracketing and labelling of segments - from a Constraint Grammar perspective, which is not developed in the EAGLES syntactic guidelines mentioned above.</Paragraph>
    <Paragraph position="2"> The semantic counterpart of our XML data model (namespace sem) is ontology-based and defined by means of the tags given in the DAML+OIL implementation of our domain ontologies. [Footnote fragment, truncated in the source: p-lg/9406023]</Paragraph>
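    <Paragraph> As a rough sketch of what one token might look like in this data model, the following Python code builds a single &lt;Word&gt; element with the standard library. Several details are assumptions made for illustration: the namespace URIs for pos and syn, the child element names tag and relation, the token identifiers and all tag values are hypothetical. Only &lt;Word&gt;, rdf:ID, &lt;surface_form&gt;, the attribute lemma and the attributes dependent_on, dependency and surface_syn_tag are named in the text.

```python
import xml.etree.ElementTree as ET

# The RDF namespace is the standard one; the other two URIs are hypothetical,
# since the text gives only the prefixes pos and syn.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
POS = "http://example.org/pos#"
SYN = "http://example.org/syn#"


def make_word(word_id, surface, lemma, pos_tag, head_id, dependency, syn_tag):
    """Build one token: identifier, surface form, then the
    morpho-syntactic and syntactic annotations, in that order."""
    word = ET.Element("Word", {f"{{{RDF}}}ID": word_id})
    ET.SubElement(word, "surface_form").text = surface
    # Morpho-syntactic level: the lemma is an attribute of the pos tag.
    ET.SubElement(word, f"{{{POS}}}tag", {"lemma": lemma}).text = pos_tag
    # Syntactic level: the head is referenced through its rdf:ID.
    ET.SubElement(
        word,
        f"{{{SYN}}}relation",
        {
            "dependent_on": head_id,
            "dependency": dependency,
            "surface_syn_tag": syn_tag,
        },
    )
    return word


# Illustrative values: the article "la" depending on the noun "pelicula".
la = make_word("1_12", "la", "el", "DET", "1_13", "det", "DN")
relation = la.find(f"{{{SYN}}}relation")
print(ET.tostring(la, encoding="unicode"))
```

    Referring to the head token by identifier, rather than nesting one element inside another, keeps every token a flat element, so a dependency can point to any token regardless of word order.</Paragraph>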
  </Section>
  <Section position="10" start_page="12" end_page="12" type="metho">
    <SectionTitle>
4. ADVANTAGES OF THE INTEGRATED
MODEL
</SectionTitle>
    <Paragraph position="0"> As the example in the previous section suggests, AI and Corpus Linguistics, far from being irreconcilable, can be combined into an integrated annotation model. This joint annotation scheme would be very useful and valuable in the development of the Semantic Web and would benefit from the results of both disciplines in many ways, not restricted to the semantic level, which is analysed below. A separate subsection is dedicated to multi-functionality.</Paragraph>
  </Section>
  <Section position="11" start_page="12" end_page="12" type="metho">
    <SectionTitle>
4.1. AT THE SEMANTIC LEVEL
</SectionTitle>
    <Paragraph position="0"> Let us now see the benefits at the semantic level of a hybrid annotation model, first from a linguistic point of view and, then, from an ontological point of view.</Paragraph>
    <Paragraph position="1"> 4.1.1. Regarding ontologies from a linguistic point of view. Taking a closer look at sections 1 and 2, and comparing the proposals from both Corpus Linguistics and AI, we find that the use of ontologies as a basis for a semantic annotation scheme fits well and meets the criteria posited by Schmidt. Clearly, the mostly hierarchical structure of an ontology fulfils criterion (5) by itself and, as a side effect, criteria (2) and (4), since the former is related to the capacity of an ontology to grow horizontally (in breadth) and the latter to its capacity to grow vertically (in depth or in specification). Hence, the end user can decide the level of specificity needed. Criterion (3) is also satisfied by an ontology-based semantic annotation scheme, since we can always specialise the concepts in the ontology according to specific periods, languages, registers and textbases.</Paragraph>
    <Paragraph position="2"> Ontologies are, by definition, consensual and, thus, are closer to becoming a standard than many other models and formalisms or, as criterion (6) requires, they at least lay down a framework of properties and axioms (principles) and major categories that can be modified to some extent to fit individual needs. Concerning criterion (1), many groups developing ontologies are characterized by a strong interdisciplinary approach that combines Computer Science, Linguistics and (sometimes) Philosophy; thus, an ontology-based approach should also make sense in linguistic terms.</Paragraph>
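    <Paragraph> The relation between the hierarchy and criteria (2), (4) and (5) can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not part of the model described in this paper: the concept names and the helper functions are hypothetical. It shows a hierarchy growing in breadth and in depth, and an annotation being generalised to the level of specificity a user asks for.

```python
from typing import Dict, List, Optional

# Toy concept hierarchy stored as child -> parent; all names are illustrative.
PARENT: Dict[str, Optional[str]] = {
    "Entity": None,
    "Work": "Entity",
    "Film": "Work",
    "ScienceFictionFilm": "Film",
}


def add_concept(name: str, parent: str) -> None:
    """Grow the hierarchy: a new sibling adds breadth, a new child adds depth."""
    if parent not in PARENT:
        raise KeyError(parent)
    PARENT[name] = parent


def path_from_root(concept: str) -> List[str]:
    """Return the chain of concepts from the root down to the given concept."""
    path: List[str] = []
    current: Optional[str] = concept
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return list(reversed(path))


def at_granularity(concept: str, max_depth: int) -> str:
    """Generalise a concept to at most max_depth levels below the root."""
    path = path_from_root(concept)
    return path[min(max_depth, len(path) - 1)]


add_concept("Book", "Work")         # breadth: a sibling of Film
add_concept("Documentary", "Film")  # depth: a specialisation of Film

coarse = at_granularity("ScienceFictionFilm", 2)
fine = at_granularity("ScienceFictionFilm", 3)
```

    In this sketch the same annotation can be read as Film or as ScienceFictionFilm, depending on the depth requested, without re-annotating the text.</Paragraph>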
    <Paragraph position="3"> 4.1.2. Regarding linguistic annotations from an ontological point of view. The main obstacle to AI researchers adopting a linguistically motivated annotation model would lie in the statement in section 1 that &quot;there is no universal agreement in semantics about which features of words should be annotated&quot;, or in the other statement, in Schmidt's criterion 1 in the same section, that &quot;still an exhaustive set of categories is to be determined&quot;.</Paragraph>
    <Paragraph position="4"> But ontology researchers are trying to fill this gap with initiatives such as the UNSPSC (UNSPSC, 2002) or RosettaNet (RosettaNet, 2002) in specific domains (e.g. e-commerce). In any case, linguistic annotations at the semantic level are more ambitious and potentially wider than strictly ontology-based ones. Establishing a link between semantic annotation, discourse annotation and text construction following the RST approach, which has already been applied in text generation (Mann &amp; Thompson, 1988), seems a fairly promising linguistic enhancement.</Paragraph>
    <Paragraph position="5"> So far, we have seen how ontologies can fit in the semantic annotation of texts; let us see in the next subsections how linguistic annotations at all their levels can improve the potential of Semantic Web Pages.</Paragraph>
    <Section position="1" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
4.2. MULTI-FUNCTIONALITY
</SectionTitle>
      <Paragraph position="0"> The need for (shallow) parsing in semantic processing is noted in Vargas-Vera et al. (2001) and also in Kietz et al. (2000): most information extraction systems (as well as other NLP applications) use some form of shallow parsing to recognise syntactic constructs or, in other words, to syntactically identify some fragments of the sentences. A chunker called Marmot is included in the annotation process presented in the former. [Footnote on shallow parsing: Without generating a complete parse tree for each sentence. Such partial parsing has the advantages of greater speed and robustness.] [Footnote on chunker: A chunker is a natural language (pre)processing tool that separates and segments sentences into their subconstituents, i.e. noun, verb and prepositional phrases, etc.]</Paragraph>
      <Paragraph position="1"> Even though the need for lower levels of linguistic analysis mentioned so far applies to information extraction systems, it is not restricted to this kind of NLP application. Since the proposed annotation model adds overt linguistic information to any kind of document, it can be used for a wide range of purposes that require linguistic or semantic analysis or processing (e.g. machine-aided translation, information retrieval, etc.).</Paragraph>
    </Section>
  </Section>
</Paper>