<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0604"> <Title>The semantics of markup: Mapping legacy markup schemas to a common semantics</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Goals of this paper </SectionTitle> <Paragraph position="0"> In this paper we extend Simons' proof of concept for the use of metaschemas in the following ways.</Paragraph> <Paragraph position="1"> SIL is extended to include the ability to map the content of designated elements and attributes in source documents to the semantic schema, not just the markup itself.</Paragraph> <Paragraph position="2"> We devise metaschemas for lexicons that use distinct XML markup schemas: one of the lexicons that Simons (2003) originally used, for Sikaiana (Solomon Islands) with about 3000 entries; a Hopi (Arizona) dictionary with about 30,000 entries, for which Kenneth Hill's original encoding using a proprietary and no longer supported database program was converted to XML by Lewis and Gonzalez; and a Potawatomi (Great Lakes region, US and Canada) lexicon being created by Laura Buszard-Welcher using the EMELD FIELD tool. The Prolog query engine is replaced by SeRQL, an SQL-like query language for Sesame, an RDF database program (Broekstra, Kampman and van Harmelen 2002; User Guide for Sesame 2004). It is our intention to couple Sesame with an inference engine that reads OWL documents, such as Racer (Haarslev and Moller 2001).</Paragraph> <Paragraph position="3"> In carrying out the migration of such language resources to the Semantic Web, we are guided by the principle of preserving the original analyses as much as possible. 
At the same time, since the migrated resources are to be rendered mutually interoperable and transparent to the tools that are designed to work over them, the migration process has the potential to greatly increase the precision of the original analyses, to reveal inconsistencies in them, and ultimately to result in enriched resources. For example, the comparison of two descriptions of the same language that has been made possible by migration could reveal errors in one or the other. Similarly, a single resource could be checked for consistency with accumulated linguistic knowledge represented in an ontology. The migration process thus provides two sources of new knowledge. First is the knowledge brought in from the document interpretation process itself, i.e. by the linguist, not necessarily the one who performed the original analysis. Second, when the migrated documents are added to the knowledge base, new inferences can be automatically generated based on the general knowledge of linguistics captured in the ontology. The type of new knowledge generated is, however, constrained, for example, by the type of search to be done over the resulting knowledge base (see section 6).</Paragraph> <Paragraph position="4"> However, the migration process can also skew or misinterpret the intentions underlying the original documentation. To minimize this risk, the migration tools should be as non-intrusive as possible. Even so, some steps are necessary to add structure where structure is lacking in the original XML documentation and to interpret the meaning of the original elements where their meanings are undefined or unclear. 
For the ontology the implication is that theory-laden concepts either should be avoided or less encumbered alternatives should be made available.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 GOLD </SectionTitle> <Paragraph position="0"> An important guiding principle used in the construction of GOLD is to distinguish between those concepts that represent the content of linguistic data and those that pertain to the structuring of those data (cf. Ide and Romary 2003 who also distinguish between data content and data structure).</Paragraph> <Paragraph position="1"> A particular entry in a lexicon, for example, is a data structure used to organize lexical data in a particular fashion. Entries usually contain actual data instances, e.g., the Hopi word nahalayvi'yma or its phonological properties. The process of data migration is made much easier if a separation between data and data structure is upheld in the semantic schema.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Data content </SectionTitle> <Paragraph position="0"> Linguistic data content includes linguistic expressions, the physical manifestations of language, also known as 'morphs', or simply 'forms', which may be written, spoken or signed. In GOLD, written linguistic expressions are represented as ORTHOGRAPHICEXPRESSION with the subclasses ORTHOGRAPHICPART, ORTHOGRAPHICWORD, and ORTHOGRAPHICSENTENCE. These are defined as special types of strings. In order to analyze linguistic data further, abstract counterparts of linguistic expressions are proposed called LINGUISTICUNIT.</Paragraph> <Paragraph position="1"> The abstract units are the main objects of interest in formal linguistics. In some theories, the various subclasses of LINGUISTICUNIT correspond to 'morphemes', 'constituents', or 'constructions'. No assumptions are made about whether these have any mental significance, e.g. 
whether they are underlying forms. The class hierarchy for LINGUISTICUNIT is presented in Farrar, Lewis and Langendoen (2002), and can be viewed in GOLD using Protege 2.0 [protege.stanford.edu].</Paragraph> <Paragraph position="2"> The LINGUISTICUNIT hierarchy is organized according to how its components are realized as forms, and not according to their formal linguistic features, which are theory-specific. So, for example, LEXICALUNIT is simply a formal unit that can appear in isolation in its realized form, and not necessarily something that can be a constituent of larger syntactic constructions. The methodology leaves open the question of whether, for example, a SUBLEXICALUNIT can also be a phrasal constituent, as appears to be the case with CLITIC. Yet another alternative would be to organize LINGUISTICUNIT according to semantic features, e.g., a SUBLEXICALUNIT would be something which usually represents a grammaticized notion. But, since this varies from language to language, a different taxonomy would be needed for every type of language encountered. To sum up, adhering to strictly formal features necessitates theory-specific taxonomies, while adhering to semantic features leads to language-specific taxonomies. Instead, a neutral approach is taken in which LINGUISTICUNIT is organized according to how instances are realized as linguistic expressions.</Paragraph> <Paragraph position="3"> ORTHOGRAPHICEXPRESSION is related to LINGUISTICUNIT by the predicate REALIZES. The particular sort of LINGUISTICUNIT is further defined according to what kinds of attributes it can take.</Paragraph> <Paragraph position="4"> So, a MORPHOSYNTACTICUNIT has attributes of the sort MORPHOSYNTACTICATTRIBUTE. Instances of particular attributes are PASTTENSE, SINGULARNUMBER, and PROGRESSIVEASPECT. 
The class of attributes pertaining to linguistic units parallels other kinds of non-linguistic attributes such as SHAPEATTRIBUTE and PHYSICALSTATE.</Paragraph> <Paragraph position="5"> There are several varieties of attributes which linguists find useful for language description, including phonological and semantic features. Semantic attributes contrast with morphosyntactic attributes in that the former correspond to the notional characteristics of linguistic form that have some manifestation in the grammar.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Data structures </SectionTitle> <Paragraph position="0"> A linguistic data structure is defined as an abstract information container which provides a way to package elements of linguistic data. The two main types of data structures contained in GOLD at the moment are LEXICALITEM and FEATURESTRUCTURE. Our characterization of LEXICALITEM extends that of Bell and Bird (2000). At a minimum, a LEXICALITEM should contain an instance of LEXICALUNIT or of SUBLEXICALUNIT. Special relations are given in GOLD which pertain only to data structures, e.g., HASLEXICALUNIT relates a LEXICALITEM to a LEXICALUNIT. Instances of LEXICALITEM typically include glosses either in the same language in the case of a monolingual lexicon, or in some other language in the case of a bilingual lexicon. Glosses are simply instances of ORTHOGRAPHICEXPRESSION related to the entry via the relation GLOSS. Entries relate to one another via relations such as SYNONYMOF and ANTONYMOF.</Paragraph> <Paragraph position="1"> If a LEXICALITEM contains extensive morphological information, we may represent this in the form of a FEATURESTRUCTURE. The FEATURESTRUCTURE class is part of a more extensive set of data structures known as a FEATURESYSTEM (Langendoen and Simons, 1995; Maxwell, Simons and Hayashi, 2002). 
A FEATURESPECIFICATION is a data structure that contains a subclass and an instance of MORPHOSYNTACTICATTRIBUTE (i.e. an ordered pair), for example, [TENSE: PASTTENSE].</Paragraph> <Paragraph position="2"> The implementation of the FEATURESYSTEM construct allows for recursive FEATURESPECIFICATIONs in which, for example, a subclass of MORPHOSYNTACTICATTRIBUTE is paired with an instance of FEATURESTRUCTURE.</Paragraph> <Paragraph position="3"> One criticism that could be raised against the inclusion of data structures in a semantic resource such as GOLD is that they are superfluous. Why not simply leave it up to the source markup to describe the elements of data structure, e.g., in the form of an XML Schema? This is certainly a reasonable criticism, since excluding data structures from GOLD would make the ontological modelling process much simpler. However, they are included because we envision that subsequent applications will need to be able to reason not only about the data itself, but also about how it is structured. For example, it might be necessary to compare elements of a LEXICALITEM to those of a FEATURESTRUCTURE. This is actually an essential step in achieving the vision of the Semantic Web, namely, constraining the source data in such a way as to preserve structure where structure is defined and to enrich structure where structure is left unspecified.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Semantic Interpretation Language </SectionTitle> <Paragraph position="0"> The Semantic Interpretation Language (SIL) was originally created to define the meaning of the elements and attributes declared in an XML markup schema, as well as the relationships between them. An SIL metaschema is an XML document that formally maps the elements and attributes of an XML encoded resource to concepts in an OWL ontology or an RDF Schema. 
Furthermore, the metaschema formally interprets the original markup structure by declaring what the dominance and linking relations in the XML document structure represent. For example, consider the extract from the Hopi lexicon shown in The dominance relation between the elements &lt;MSI&gt; (for 'morphosyntactic information') and &lt;POS&gt; (for 'part of speech') in the original XML is implicitly something like 'has'. This can be made more explicit by mapping it to HASMORPHOSYNTACTICPROPERTY, a relation that is formally defined in the ontology by specifying its signature, i.e. what kinds of arguments it can take. Thus, a better-defined, more exact relationship between elements of markup is achieved.</Paragraph> <Paragraph position="1"> SIL has been extended to formalize the resolution of content in addition to markup. For example, the semantics of the gram vt in the XML structure &lt;POS&gt;vt&lt;/POS&gt; can be specified via a mapping to the ontology as an instance of VERB-TRANSITIVE, in addition to defining the semantics of the POS element itself.</Paragraph> <Paragraph position="2"> An SIL metaschema, as described in detail in Simons (2004), is an XML document built from metaschema directives, which are essentially processing instructions expressed as XML elements. Directives like resource, property, literal and translate generate elements of the resulting semantic interpretation. Part of the SIL DTD is shown in Figure 2.</Paragraph> <Paragraph position="3"> The interpret directive performs the primary mapping function from markup elements of the input resource to the enriched output, as demonstrated in Figure 3. The tag &lt;form&gt; is interpreted as a LINGUISTICFORM, specifically as an name=&quot;type&quot;&gt;, embedded within &lt;POS&gt;, is interpreted as referencing a morphosyntactic property, the value of which is content interpretable by the terminology set identified by the reference Hopi/Hopi_pos_mapping.xml. 
A terminology set contains a simple mapping between terms used in the source document and the names of the equivalent concepts in the ontology. SIL can handle both one-to-one terminology mappings (e.g., mapping from the tag vt to the concept VERB-TRANSITIVE) and one-to-many mappings (e.g. mapping from 1sg to a property bundle of SIL is designed to allow interoperability between resources by mapping the different structures and content of markup in the source documents onto the same set of ontological concepts. This is demonstrated by comparing the transformed output for Hopi shown in Figure 4 with the transformed output for Sikaiana in Figure 5. Note that the inputs are different but the outputs are the same.</Paragraph> <Paragraph position="4"> SIL guarantees interoperability only when comparable semantic resources are employed in the mapping. If an entire group relies on a common semantic schema, e.g. GOLD, a 'community of practice' is formed. This in turn facilitates intelligent search across converted resources.</Paragraph> <Paragraph position="5"> Currently, writing an SIL metaschema is done entirely by hand. We are, however, developing two tools to automate the process. The first tool will allow the user to define the relationship between the terminology used within a resource and relevant GOLD concepts. The second tool will define the structural mapping relationship between the resource and a given metastructure. The first tool, named Alchemy, presents the user with a drag-and-drop interface in which the user defines the terms used within her resource by associating them with one or more GOLD concepts. 
The relationship between any given term and relevant GOLD concepts can be complex, with one-to-one or one-to-many relationships being allowed, and the relationships themselves can be of any of a number of types: SameAs, KindOf, etc.</Paragraph> <Paragraph position="6"> We are in the process of building this tool, embedded within a systems developer toolkit accompanying GOLD.</Paragraph> <Paragraph position="7"> The second, as yet unnamed, tool is still in the early design stages. This tool will allow the user to first define the type of resource she is converting (lexicon, interlinear text, grammar, etc.), and will then lead her through a series of questions that define the structure by associating it with a meta-type definition for the particular resource type. The tool will require a precise and well-defined 'semantics of linguistic structure', a conceptual space of linguistic structural types that will be included in GOLD, but is still in the process of being defined. The final output of this tool, in association with an Alchemy-defined terminology set, will be an SIL metaschema.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Querying Resources </SectionTitle> <Paragraph position="0"> In this section, we discuss the general issue of searching over linguistic descriptions on the Web, and the current state of our effort to do so using SeRQL (see section 3 item 4) over the RDF repositories for Sikaiana, Hopi and Potawatomi generated by the metaschemas from their XML-encoded lexicons.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 Dimensions of search over linguistic descriptions </SectionTitle> <Paragraph position="0"> As mentioned in section 1 above, one of the most compelling reasons to migrate XML documentation to a semantically interoperable format is to enable intelligent search. 
For the linguistics community, we envision several parameters of search over semantically interoperable linguistic documentation. Search may be performed according to: level of analysis (phonetic, morphosyntactic, discourse); typological dimension (including language type); intent of search (for exploring some particular language, or for language comparison); and kind of results desired (which data structure to return). Search also varies according to degree of difficulty, that is, whether search requires the assistance of an inferencing engine or not. Direct search is defined as search over explicitly represented data, i.e. instance data in the knowledge space. This includes the simple string matching of conventional search engines. But since the search will be carried out using the enriched RDF framework, direct search is not limited to string matching in the original XML. An example of direct search is to find all data that includes a reference to instances of some grammatical category (e.g., PASTTENSE). Boolean searching with direct search is also possible, e.g., searching for cases of portmanteau morphemes, expressed in our framework as two or more MORPHOSYNTACTICATTRIBUTES associated with some LINGUISTICUNIT.</Paragraph> <Paragraph position="1"> Indirect search goes beyond direct search by making use of inferences based on the structuring of the concepts in an ontology. For example, the concept of PLURALNUMBER means 'two or more', the concept of DUALNUMBER means 'exactly two', and the concept of MULTALNUMBER means 'three or more'. 
A direct search for PLURALNUMBER will miss those instances represented as DUALNUMBER and MULTALNUMBER, whereas an indirect search will find them.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Some SeRQL queries </SectionTitle> <Paragraph position="0"> In Figure 6, we give the SeRQL query (omitting using namespace) for the orthographic forms for all the lexical items specified as having the GOLD concept PROGRESSIVEASPECT in the three lexicons. This query returned 1135 results, all from</Paragraph> </Section> </Section> </Paper>
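<!--
The contrast drawn in section 6.1 between direct and indirect search can be illustrated with a minimal Python sketch. It is not the SeRQL or Racer machinery described in the paper: the SUBCLASS_OF table, the INSTANCES list and the unit identifiers are hypothetical, and treating DUALNUMBER and MULTALNUMBER as kinds of PLURALNUMBER is an assumption made here so that the inference has something to work with.

```python
# Assumed toy hierarchy (child concept mapped to parent concept).
# The paper only says that an indirect search for PLURALNUMBER also finds
# DUALNUMBER and MULTALNUMBER; modeling that as subclassing is an assumption.
SUBCLASS_OF = {
    "DUALNUMBER": "PLURALNUMBER",
    "MULTALNUMBER": "PLURALNUMBER",
}

# Hypothetical instance data: (linguistic unit id, attribute concept).
INSTANCES = [
    ("unit1", "PLURALNUMBER"),
    ("unit2", "DUALNUMBER"),
    ("unit3", "MULTALNUMBER"),
    ("unit4", "SINGULARNUMBER"),
]


def is_subsumed_by(concept, target):
    """Return True if concept is target or a descendant of target."""
    while concept is not None:
        if concept == target:
            return True
        concept = SUBCLASS_OF.get(concept)  # climb one level, None at the top
    return False


def direct_search(target):
    """Match only units explicitly tagged with the target concept."""
    return [unit for unit, concept in INSTANCES if concept == target]


def indirect_search(target):
    """Also match units whose concept is subsumed by the target concept."""
    return [unit for unit, concept in INSTANCES if is_subsumed_by(concept, target)]


if __name__ == "__main__":
    print(direct_search("PLURALNUMBER"))
    print(indirect_search("PLURALNUMBER"))
```

In this sketch the direct search returns only the unit tagged PLURALNUMBER itself, while the indirect search also returns the units tagged DUALNUMBER and MULTALNUMBER. A real deployment would presumably delegate the subsumption step to an OWL reasoner rather than a hand-written table.
-->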