<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1308"> <Title>A Multilingual Natural-Language Interface to Regular Expressions</Title> <Section position="4" start_page="81" end_page="82" type="metho"> <SectionTitle> 4 Abstract syntax and operations on it </SectionTitle> <Paragraph position="0"> The central role is played by an abstract syntax. It is a system of syntax trees whose relation to the concrete syntax (that is, the notation visible to the user of XFST or of natural language) is linearization: the flattening of tree structure into the linear structure of a string. The inverse operation of linearization is parsing.</Paragraph> <Paragraph position="1"> In the theory of programming languages, it is customary to think of compiling as an operation that is applied not to the visible code as such but to the result of parsing it. Generally speaking, it is the syntax tree and not the string that is interpreted in the denotational semantics of the language and compiled (in our case, into a transducer or an automaton).</Paragraph> <Paragraph position="2"> The structure of the grammar of XFST notation is shown in Figure 1. The system of syntax trees is designed in such a way that the trees contain all the information needed for linearization, interpretation, and compilation. Mathematically, each of these operations is a function from one system into another. As they usually suppress some of the information present in syntax trees, their inverses are not functions but search procedures. 
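As a minimal sketch of this distinction, assuming a toy fragment of the abstract syntax, the following Haskell code treats linearization as an ordinary function and parsing as a function that returns a list of trees. The constructor names are those of the example trees given in this section; the type names, the wrapper type Tree, and the restriction to inputs made of one symbol followed by stars are assumptions made for illustration, not the interface's actual definitions.

```haskell
module Main where

-- Constructor names follow the example trees in the text; the type names,
-- the wrapper Tree, and the tiny grammar covered here are assumptions.
newtype RSE = RSEonesymb String deriving (Eq, Show)

data RLE = RLEsymbol RSE | RLEkleenestar RLE deriving (Eq, Show)

data RRE = RREsymbol RSE | RREkleenestar RRE deriving (Eq, Show)

-- A parse result is either a language expression or a relation expression.
data Tree = Lang RLE | Rel RRE deriving (Eq, Show)

-- Linearization is an ordinary function: every tree has exactly one string.
linearize :: Tree -> String
linearize (Lang e) = linL e
linearize (Rel e) = linR e

linL :: RLE -> String
linL (RLEsymbol (RSEonesymb s)) = s
linL (RLEkleenestar e) = linL e ++ "*"

linR :: RRE -> String
linR (RREsymbol (RSEonesymb s)) = s
linR (RREkleenestar e) = linR e ++ "*"

-- Parsing is a search procedure: it returns the list of all trees whose
-- linearization is the input. Only inputs of the form "one character
-- followed by zero or more stars" are recognized in this sketch.
parse :: String -> [Tree]
parse [] = []
parse (c : stars)
  | c == '*' = []
  | all (== '*') stars =
      [ Lang (foldr (const RLEkleenestar) (RLEsymbol sym) stars)
      , Rel (foldr (const RREkleenestar) (RREsymbol sym) stars)
      ]
  | otherwise = []
  where
    sym = RSEonesymb [c]

main :: IO ()
main = do
  print (parse "a*")                  -- two trees, one per category
  print (map linearize (parse "a*"))  -- both linearize back to the input
  print (parse "*a")                  -- unrecognized input: empty list
```

Because linearization forgets whether a tree is a language or a relation expression, the list returned for a* has two elements, while an input outside the toy grammar yields the empty list.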
In the diagram of Figure 1, as well as in all diagrams to be displayed later, functions are represented by straight arrows and search procedures by bent arrows.</Paragraph> <Paragraph position="3"> To give an example, the XFST syntax tree RLEkleenestar (RLEsymbol (RSEonesymb &quot;a&quot;)) gets the following values under the main operations: linearization: the concrete XFST expression a*; interpretation: the regular set [&quot;&quot;, &quot;a&quot;, &quot;aa&quot;, ...] (a lazy list in Haskell); compilation: the automaton ([sb &quot;a&quot;], [0], [0], 0, [(0, sb &quot;a&quot;, 0)]) (written in the notation of Haskell).</Paragraph> <Paragraph position="4"> Search procedures can be formalized as functions yielding lists as values, following a technique explained in [15]. For instance, the result of parsing is a list of syntax trees, which can be empty or have several distinct elements. Parsing the concrete regular expression a* gives, in addition to the syntax tree just mentioned, the tree RREkleenestar (RREsymbol (RSEonesymb &quot;a&quot;)), which represents a relation expression and is compiled to a transducer. Parsing the ill-formed expression A .x. [B .x. C] gives an empty list.</Paragraph> <Paragraph position="5"> The core of the natural-language interface consists of two abstract syntaxes, one for the XFST notation and another for natural language. The communication between the formalism and natural language takes place between the abstract syntaxes, so that all relevant information is preserved. However, natural language is richer than XFST in the sense that it may contain many different expressions for one and the same automaton or transducer (and we intend our fragment of natural language to be much richer in the future than it is now). 
Thus the operation of phrasing, which takes an XFST expression into a natural-language expression, is a search procedure, while the interpretation of natural language in XFST is a function.</Paragraph> <Paragraph position="6"> Figure 2 shows the communication between XFST and natural languages. The user of the interface, who does not care about why it works, will only see the concrete notations at the top and at the bottom, and the translations between them, which are both search procedures, since at least one component of each is a search procedure.</Paragraph> </Section> <Section position="5" start_page="82" end_page="88" type="metho"> <SectionTitle> 5 Syntactic categories </SectionTitle> <Paragraph position="0"> The systematic ambiguity of XFST notation is resolved by introducing two distinct categories of regular expressions: RLE of regular language expressions and RRE of regular relation expressions.</Paragraph> <Paragraph position="1"> In addition, there are some categories not directly visible to the user: RME of regular match constraint expressions, RCE of regular context expressions, ROE of regular operation expressions, and RSE of regular symbol expressions. Expressions of these categories occur as parts of language and relation expressions. For instance, match constraint expressions include the arrows -> and @-> denoting the &quot;all matches&quot; constraint and the &quot;left to right longest match&quot; constraint, respectively. What is important in recognizing these categories is that they can be given a denotational semantics that works compositionally as a part of the semantics of larger expressions. 
Thus, for instance, match constraints can be interpreted as functions of pairs of lists of integers encoding segment lengths.</Paragraph> <Paragraph position="2"> One level higher up than language and relation expressions, we have the category RDE of regular definition expressions.</Paragraph> <Paragraph position="3"> This category includes expressions of the form define A B ; used in XFST scripts. There are two syntactic structures of this category: definitions of regular languages and definitions of regular relations. Thus, as will be shown later, the categories RLE and RRE are open to the introduction of new expressions by the user of the interface. In order to define a compositional interpretation of natural language in the XFST formalism, there must be at least one syntactic category of natural language for each category of XFST. We will have, in particular, the categories</Paragraph> <Paragraph position="5"> It would be possible to have more categories than these so that, for instance, regular languages could also be expressed by adjectives and not only by common nouns. But we shall here confine ourselves to this minimal system.</Paragraph> <Paragraph position="6"> 6 Translations between XFST and natural languages Rather than showing the syntax trees, their interpretations, and their linearizations in formal detail, we shall just list the XFST operators and some ways of expressing them in English and in French. The list is shown in Table 1. The table presents the expressions grouped into the categories RLE, RRE, RME, RCE, and RDE. The category RCE is included in RLE, and the category ROE in RRE.</Paragraph> <Paragraph position="7"> The table shows just one natural-language structure for each XFST form of expression. A few more are already implemented in the interface, and anyone who plays with it will almost immediately suggest some new ones. 
But there are some requirements that any new syntactic structure must fulfill. If some of the expressions included in the table look more complicated than one would expect, this is usually explained by one of the following three principles: Expressions belonging to the same category must have the same syntactic behaviour.</Paragraph> <Paragraph position="8"> (This rules out having, say, adjectives in the same category as common nouns.) The constructions must be arbitrarily iterable. (Just as the operators of XFST are. The result is often hard to read but it should always be grammatical.) No expression should be ambiguous. (This is not necessary for an interface, but it makes it simpler. The language gets more complicated, though, because special words are needed to function as parentheses.) In Section 9, we will explain an extension of the XFST script language that makes it possible to introduce new ways of expressing regular expression operators.</Paragraph> <Paragraph position="9"> The presentation of natural language expressions in the table is schematic and does not make explicit the way in which various morphological features, such as those imposed by agreement, are controlled. There is a detailed discussion of this topic in [13]. All morphological features are introduced in linearization, and so is the order of words: they do not belong to the abstract syntax.</Paragraph> <Paragraph position="10"> All of the structure captured by the abstract syntax is common to the different languages. Notice that there is more in common than just the semantic content, since the same content can be expressed in different ways. The tiny fragment presented here does not yet give a very good illustration of this phenomenon. But a little example can already be given. 
The regular language a | b | c is expressed, according to the table, by the English and French common nouns string equal either to 'a' or to 'b' or to 'c', chaîne égale soit à 'a' soit à 'b' soit à 'c'.</Paragraph> <Paragraph position="11"> But the grammar also includes the more concise structure usable for a union all of whose members are single symbols: symbole autre qu'un A; chaîne commençant par un A et continuant par un B ... suivi d'un C; string equal either to an A or to a B ... or to a C; optional sequence of A's; nonempty sequence of A's; string containing an A; other A than a B; chaîne égale soit à un A soit à un B ... soit à un C; séquence optionnelle de A's; séquence non vide de A's; chaîne contenant un A; autre A qu'un B; string equal both to an A and to a B; string other than an A; sequence of n A's</Paragraph> <Section position="1" start_page="85" end_page="88" type="sub_section"> <SectionTitle> optional A </SectionTitle> <Paragraph position="0"> string resulting from an A by inserting B's; string containing an A only G; accept A as such; change c into d in the beginning; R then Q ... then P; not only R but also Q; repeatedly R as long as applicable; repeatedly R as long as applicable but at least once; repeatedly R n times; optionally R; replace an A by a B; first R and, in what results, Q, ... and, in what results, P; replace every A by a B, x->; mark the beginning of every A by an L and the end by an R, x->; choosing all possible matches; choosing the longest matches from left to right; if it is preceded by an L; if it is followed by an R; An A is a B.</Paragraph> <Paragraph position="1"> To R is to Q.</Paragraph> <Paragraph position="2"> chaîne égale et à un A et à un B; chaîne autre qu'un A; séquence de n A's; A optionnel; chaîne résultant d'un A par l'insertion de B's; chaîne ne contenant de A que G; accepter A tel quel; changer c pour d au début; R, ensuite Q ... ensuite P; non seulement R mais aussi 
Q; faisant répétition, R aussi longtemps qu'applicable; faisant répétition, R aussi longtemps qu'applicable mais au moins une fois; faisant répétition, R n fois; optionnellement, R; remplacer un A par un B; d'abord R et, dans ce qui en résulte, Q ... et, dans ce qui en résulte, P; remplacer tout A par un B, x->; marquez le commencement de tout A par un L et la fin par un R, x->; choisissant toutes les apparitions possibles; choisissant les apparitions les plus longues de gauche à droite; s'il est précédé par un L; s'il est suivi d'un R; Un A est un B.</Paragraph> <Paragraph position="3"> R, c'est Q.</Paragraph> <Paragraph position="4"> symbol from the list 'a', 'b', 'c', symbole de la liste 'a', 'b', 'c'. The distinction between these two grammatical structures is similar in English and French, but it is not reflected by any distinction on the semantic level, that is, in the XFST formalism. 7 Translating regular expressions Given the theoretical framework of Figure 2 (left), it is possible to build several functionalities of translating between XFST, English, and French and of editing XFST scripts and English and French text files. 
For instance, if we have an interface implementing translation from XFST to natural language, we can type in the string [a | b]+ and get the following output (actually as LaTeX code, here typeset): English expressions for language: nonempty sequence of symbols from the list 'a', 'b'. French expressions for language: séquence non vide de symboles de la liste 'a', 'b'. English expressions for relation: accept a nonempty sequence of symbols from the list 'a', 'b' as such; repeatedly accept a symbol from the list 'a', 'b' as such, as long as applicable but at least once; repeatedly not only accept 'a' as such but also accept 'b' as such, as long as applicable but at least once. French expressions for relation: accepter une séquence non vide de symboles de la liste 'a', 'b' telle quelle; faisant répétition, accepter un symbole de la liste 'a', 'b' tel quel aussi longtemps qu'applicable mais au moins une fois; faisant répétition, non seulement accepter 'a' tel quel mais aussi accepter 'b' tel quel aussi longtemps qu'applicable mais au moins une fois. Because the input expression is ambiguous between a language and a relation, it can be expressed both as a common noun and as an instruction. The instruction, in turn, is based either on the identity relation of [a | b]+ (the first sentence), on the Kleene closure of the identity relation of [a | b] (the second sentence), or on the disjunction of the identity relations of a and b (the third sentence). For the practical purpose of documentation, it seems that it is the user who knows best which alternative to choose. Translation from English or French to XFST is unambiguous because our English and French fragments distinguish between language and relation expressions. The parser allows quite a lot of errors in natural language input: since morphology and orthography have been precisely defined in linearization, it is easier, as well as more useful, to make the parser tolerant. 
Thus the French input accepter un sequence optionnel de chaines vide tel quels yields both the result 0* and the corrected translation back to French, accepter une séquence optionnelle de chaînes vides telle quelle. 8 Editing scripts and text files While many grammatical constructions of formal and natural languages can be arbitrarily iterated, even without ambiguities, the results can get hard to read. Thus the definition of a simplified Finnish hyphenation program reads in the XFST notation which is already hard to read, but probably easier than the English version produced by our interface (the French version is no better): To hyphenate is to mark the end of every string that begins with an optional sequence of symbols from the list 'd', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'r', 's', 't', 'v' and continues by a nonempty sequence of symbols from the list 'a', 'e', 'i', 'o', 'u', 'y' followed by an optional sequence of symbols from the list 'd', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'r', 's', 't', 'v' by '-' if it is followed by a string that begins with an optional sequence of symbols from the list 'd', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'r', 's', 't', 'v' and continues by a nonempty sequence of symbols from the list 'a', 'e', 'i', 'o', 'u', 'y', choosing the longest matches from left to right.</Paragraph> <Paragraph position="5"> Both the formal code and the corresponding English and French texts are easier to understand if organized in sequences of shorter definitions: define vowel a | e | i | o | u | y ; define consonant d | g | h | j | k | l | m | n | p | r | s | t | v ; define syllable consonant* vowel+ consonant* ; define hyphenate syllable @-> ... %- || _ consonant vowel ; A vowel is a symbol from the list 'a', 'e', 'i', 'o', 'u', 'y'. A consonant is a symbol from the list 'd', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'r', 's', 't', 'v'. 
A syllable is a string that begins with an optional sequence of consonants and continues by a nonempty sequence of vowels followed by an optional sequence of consonants. To hyphenate is to mark the end of every syllable by '-' if it is followed by a string that begins with a consonant and continues by a vowel, choosing the longest matches from left to right.</Paragraph> <Paragraph position="6"> Une voyelle est un symbole de la liste 'a', 'e', 'i', 'o', 'u', 'y'. Une consonne est un symbole de la liste 'd', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'r', 's', 't', 'v'.</Paragraph> <Paragraph position="7"> Un syllabe est une chaîne qui commence par une séquence optionnelle de consonnes et continue par une séquence non vide de voyelles suivie d'une séquence optionnelle de consonnes.</Paragraph> <Paragraph position="8"> Marquer les syllabes, c'est marquer la fin de tout syllabe par '-' s'il est suivi d'une chaîne qui commence par une consonne et continue par une voyelle, choisissant les apparitions les plus longues de gauche à droite.</Paragraph> <Paragraph position="9"> (The example is from [6], structured in a slightly different way.) When editing a natural language text in parallel with an XFST script, some lexical information has to be given: the word or words used for each newly defined concept, the plural form (if irregular), and the gender (in French). The meaning of each lexical entry is given by the definition itself. Thus the words that are introduced are used in very precise technical meanings.</Paragraph> </Section> </Section> <Section position="6" start_page="88" end_page="89" type="metho"> <SectionTitle> 9 Function definitions </SectionTitle> <Paragraph position="0"> Standard XFST scripts have a format for defining macros for constant regular expressions, but we can get much more structure by defining functions. For function definitions, we use the format define F(.X, ..., Y.) 
C, where C is an already defined regular expression possibly containing the variable symbols X, ..., Y (these symbols of course cannot be used as names of these letters in C, but we need not reserve a special class of variable symbols). A file containing function definitions can be translated into a file without them by replacing all applications of functions by their definientia. Those functions that are not used in definitions of constants are then simply ignored. Function definitions can be used for introducing new operators. For example, the task of shallow parsing uses the same kind of markup operation over and over again: a segment of a string is put between parentheses and the closing parenthesis is marked by a category label. Using a function definition, we can write define labelWith(.C,c.) C @-> %( ... %) c ; define markNP labelWith(.NP,%+np.) ; define markVP labelWith(.VP,%+vp.) ; define markS labelWith(.S,%+s.) ; where NP, VP, and S are some previously defined sets of noun phrases, verb phrases, and sentences, respectively. Now, the natural-language structure corresponding to functions is an expression with complements, and it is easy to include user-defined information on the complements in the grammar. This information includes the prepositions (possibly none) required by each argument place, as well as whether the complement takes the plural or the singular form. 
The English text corresponding to the above piece of script looks as follows: To label C's with a c is to mark the beginning of every C by '(' and the end by a string that begins with ')' and continues by a c, choosing the longest matches from left to right.</Paragraph> <Paragraph position="1"> To mark noun phrases is to label noun phrases with '+np'.</Paragraph> <Paragraph position="2"> To mark verb phrases is to label verb phrases with '+vp'.</Paragraph> <Paragraph position="3"> To mark sentences is to label sentences with '+s'.</Paragraph> <Paragraph position="4"> Without the initial function definition, we would need three sentences of the same length and complexity as the function definition.</Paragraph> <Paragraph position="5"> 10 The importance of structured writing Organizing a program into a sequence of definitions is an example of structured programming, whose benefits are well known among programmers, but which is even more important if programs are systematically translated into texts. The impact of structuring on readability is, so to speak, magnified when formal code is translated into the less perspicuous syntax of natural language. In natural language, syntactically complex expressions must be avoided by careful planning of the text.</Paragraph> <Paragraph position="6"> An obvious question arises whether it is possible to take a messy text and make it more readable by some automatic structuring software. The answer suggested by the analogy between text production and programming is negative: as there is no algorithm that turns messy programs into structured ones, there is no algorithm that turns messy texts into readable ones. But programmers and writers can be encouraged to think in a structured way.</Paragraph> <Paragraph position="7"> Readability is not so much a function of the language that is chosen but of the way in which the chosen language is used. 
There may be programming languages in which it is impossible to produce readable code, but unreadable code can be produced in any language, be it formal or natural. Thus any natural-language interface should be judged, not by the worst expressions it includes (because every language includes bad expressions), nor by its coverage of natural language (which is surely limited), but by its ability to provide clear and natural ways of expression whenever properly used.</Paragraph> </Section> </Paper>