<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1036"> <Title>XML and Multilingual Document Authoring: Convergent Trends</Title> <Section position="3" start_page="0" end_page="244" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The typical approach to XML authoring views an XML document as a mixture of tree-like structure, expressed through balanced labelled parentheses (the tags), and of surface, expressed through free text interspersed between the tags (PCDATA). A Document Type Definition (DTD) is roughly similar to a context-free grammar [Footnote 1: But see (Prescod, 1998) for an interesting discussion of the differences.] with exactly one predefined terminal. It defines a set of well-formed structures, that is, a language over trees, where each nonterminal node can dominate either the empty string, or a sequence of occurrences of nonterminal nodes and of the terminal node pcdata. The terminal pcdata has a special status: it can in turn dominate any character string (subject to certain restrictions on the characters allowed). Authoring is typically seen as a top-down interactive process of step-wise refinement of the root nonterminal (corresponding to the whole document) where the author iteratively chooses a rule for expanding
a nonterminal already present in the tree, [Footnote 2] and where in addition the author can choose an arbitrary sequence of characters (roughly) for expanding the pcdata node.</Paragraph> <Paragraph position="1"> One can observe the following trends in the XML world: A move towards more typing of the surface: Schemas (W3C, 1999a), which are an influential proposal for the replacement of DTDs, provide for types such as float, boolean, uri, etc., instead of the single type pcdata; A move, already constitutive of the main purpose of XML as opposed to HTML for instance, towards clearer separation between content and form, where the original XML document is responsible for content, and powerful styling mechanisms (e.g. XSLT (W3C, 1999b)) are available for rendering the document to the end-user.</Paragraph> <Paragraph position="2"> We advocate an approach in which these two moves are radicalized in the following ways: Strongly typed, surface-free XML documents. The whole content of the document is a tree where each node is labelled and typed. For internal nodes, the type is just the usual nonterminal name (or category), and the label is a name for the expansion chosen for this nonterminal, that is, an identifier of which rule was chosen to expand this nonterminal. For leaves, the type is a semantically specific category such as Integer, Animal, etc., and the label is a specific concept of this type, such as three or dog. [Footnote 3] Styling responsible for producing the text itself.</Paragraph> <Paragraph position="3"> The styling mechanism is not only responsible for rendering the layout of the text (typography, order and presentation of the elements), but also for producing the text itself from the document content.</Paragraph> <Paragraph position="4"> What are the motivations behind this proposal? Authoring choices carry language-independent meaning. First, let us note that the expansion choices
made during the authoring of an XML document generally carry language-independent meaning.</Paragraph> <Paragraph position="5"> [Footnote 3, fragment] ...ical type (e, t): there is no restriction on the denotational status of leaves. [Figure 1 caption, fragment] ...purposes, we have assumed that there are in turn three semantic varieties of cautions. The rule identifier on the left can be seen as a semantic label for each expansion choice (in practice, the rule identifiers are given mnemonic names directly related to their meaning).</Paragraph> <Paragraph position="6"> For instance, the DTD for an aircraft maintenance manual might be legally required to distinguish between risk instructions of two kinds: caution (risk related to material damages) and warning (risk to the operator). Or a DTD describing a personal list of contacts might provide a choice of gender (male, female), title (dr, prof, default), country (ger, fra, ...), etc. Each such authoring choice, which formally consists in selecting among different rules for expanding the same nonterminal (see Figure 1), corresponds to a semantic decision which is independent of the language chosen for expressing the document. A given DTD has an associated expressive space of tree structures which fall under its explicit control, and the author is situating herself in this space through top-down expansion choices. There is then a tension between, on the one hand, these explicitly controlled choices, which should be rendered differently in different languages (thus ger as Germany, Allemagne, Deutschland, ..., and Warning by a paragraph starting with Warning! ...; Attention, Danger! ...; Achtung, Lebensgefahr! ...), and on the other hand the uncontrolled inclusion in the XML document of free PCDATA strings, which are written in a specific language.</Paragraph> <Paragraph position="7"> Surface-free XML documents. 
We propose to completely remove these surface strings from the XML document, and replace them with explicit meaning labels. [Footnote 4: There are authoring situations in which it may be necessary for the user to introduce new semantic labels corresponding to expressive needs not foreseen by the creator of the original DTD. To handle such situations, it is useful to view the DTDs as open-ended objects to which new semantic labels and types can be added at authoring time.] The tree structure of the document then becomes the sole repository of content, and can be viewed as a kind of interlingua for describing a point in the expressive space of the DTD (a strongly domain-dependent space); it is then the responsibility of the language-specific rendering mechanisms to &quot;display&quot; such content in each individual language where the document is needed.</Paragraph> <Paragraph position="8"> XML and Multilingual Document Authoring. In this conception, XML authoring has a strong connection to the enterprise of Multilingual Document Authoring, in which the author is guided in the specification of the document content, and where the system is responsible for generating from this content textual output in several languages simultaneously (see (Power and Scott, 1998; Hartley and Paris, 1997; Coch, 1996)).</Paragraph> <Paragraph position="9"> Now there are some obvious problems with this view, due to the current limitations of XML tools.</Paragraph> <Paragraph position="10"> Limitations of XML for multilingual document authoring. The first, possibly most serious, limitation originates in the fact that a standard DTD is severely restricted in the semantic dependencies it can express between two subtrees in the document structure. 
Thus, if in the description of a contact, a city of residence is included, one may want to constrain such information depending on the country of residence; or, in the aircraft maintenance manual example, one might want to automatically include some warning in case a dangerous chemical is mentioned somewhere else in the document.</Paragraph> <Paragraph position="11"> Because DTDs are essentially of context-free expressive power, the only communication between a subtree and its environment has to be mediated through the name of the nonterminal rooting this subtree (for instance the nonterminal Country), which presents a bottleneck to information flow.</Paragraph> <Paragraph position="12"> The second limitation comes from the fact that the current styling tools for rendering an XML document, such as CSS (Cascading Style Sheets), which is a strictly layout-oriented language, or XSLT (XSL transformation language), which is a more generic tool for transforming an XML document into another one (such as a display-oriented HTML file), are poorly adapted to linguistic processing. In particular, it seems difficult in such formalisms to express such basic grammatical facts as number or gender agreement. But such problems become central as soon as semantic elements corresponding to textual units below the sentence level have to be combined and rendered linguistically.</Paragraph> <Paragraph position="13"> We will present two related proposals for overcoming these limitations. The first, the Grammatical Framework (GF) (Ranta, 2000), originates in constructive type theory (Martin-Löf, 1984; Ranta, 1994) and in mathematical proof editors (Magnusson and Nordström, 1994). The second, Interaction Grammars (IG), is a specialization of Definite Clause Grammars strongly inspired by GF. 
The two approaches present certain formal differences that will not be examined in detail in this paper, but they share a number of important assumptions: * The semantic representations are strongly typed trees, and rich dependencies between subtrees can be specified; * The abstract tree is independent of the different textual realization languages; * The surface realization in each language is obtained by a semantics-driven compositional process; that is, the surface realizations are constructed by a bottom-up recursive process which associates surface realizations to abstract tree nodes by recursively combining the realizations of daughter nodes to obtain the realization of the mother node;</Paragraph> <Paragraph position="14"> * The grammars are reversible, that is, can be used both for generation and for parsing; * The authoring process is an interactive process of repeatedly asking the author to further specify nodes in the abstract tree of which only the type is known at the point of interaction (type refinement).</Paragraph> <Paragraph position="15"> This process is mediated through text in the language of the author, showing the types to be refined as specially highlighted textual units.</Paragraph> </Section> </Paper>