<?xml version="1.0" standalone="yes"?>
<Paper uid="C92-3157">
  <Title>From Detection/Correction to Computer Aided Writing</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. Computer Aided Writing (CAW)
</SectionTitle>
    <Paragraph position="0"> A computer system for a writer is basically a personal computer which runs a text processor. The power increase of personal computers has been followed by a growth of the services provided to the user. Some of these services aim to increase the writer's productivity but most of them aim at obtaining a better quality of the produced documents. We will distinguish here between two categories of services: presentation services and production services. The first ones concern the way the paper form of the text looks: justification, formatting, multi-column...</Paragraph>
    <Paragraph position="1"> They are very powerful in modern systems, especially if you add to your text processor a graphic processor and a page maker, but they have little to do with linguistics and so we will not discuss them here.</Paragraph>
    <Paragraph position="2"> The second ones concern the text itself, in its content and in its form. The best known and most mature service in this category is the spelling checker, which can be found in every modern text processor. Recently, other services have emerged: * on-line lexicons with synonym and antonym links; * idea managers which help the user to build the plan of his document; * syntactic checkers in the spirit of the IBM system CRITIQUE \[6\].</Paragraph>
    <Paragraph position="3">  In most cases, these new services are add-ons to an existing text processor and CAW systems are stacks of tools, lacking the coherence of an integrated approach.</Paragraph>
    <Paragraph position="4"> Our idea is that CAW must be thought of as a goal in itself and our aim is to build an environment for the production, maintenance, edition and communication of texts. Such a system will be based on a coherent set of software tools reflecting the state of the art in string manipulation and linguistic treatment. At first glance, the system should include classic and well-known tools such as those cited above and more sophisticated tools like: * morphological analysis and generation, which can for example be used for lemmatization of words or groups of words. The idea here is to use these lemmatized groups as keys to access external knowledge bases or document bases \[9\].</Paragraph>
    <Paragraph position="5"> * syntactico-semantic analysis and generation to allow operations like: changing the tense of a paragraph, changing the modality of a sentence, help in detecting ambiguous phrases and in disambiguation by proposing paraphrases. There is also the possibility of  generating a definition of a word on the basis of its formal description in the lexicon.</Paragraph>
    <Paragraph position="6"> ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 - 1014 - PROC. OF COLING-92, NANTES, AUG. 23-28, 1992. * lexical and syntactic checkers, which must also be able to propose corrections, by the use of all the linguistic knowledge included in the system.</Paragraph>
    <Paragraph position="7"> * structural manipulations of the text in the spirit of idea managers but also some verifications on the structure by the use of a grammar of the text, which depends on the type of document created. For example, a software documentation will include a user manual and a reference manual; the user manual will include an installation chapter, a tutorial introduction chapter...</Paragraph>
    <Paragraph position="8"> * interface with the outside world: that includes of course the production of a paper form of the text but also, at least as important as the former, the production of the text in some standardized form (for example the form recommended by the TEI \[8\]) which can travel on networks and be legible by most software. This form can also be used to store the text in databases or to pass it on to other software. A very interesting type of software could be an automatic translator, so that a text could be created in one language and published in one or more other languages.</Paragraph>
    <Paragraph position="9"> Such a system is a long term objective and we will see in the next section an architecture which makes possible a short term full implementation, while being open for future extensions.</Paragraph>
    <Paragraph position="10"> 3. Architecture of a CAW environment Figure 1 describes the architecture of the CAW system under development in our team. (Figure 1: Architecture of a CAW environment.) Its characteristics are the use of a minimal number of data structures and a distributed architecture. We will here quickly describe the role of each module, leaving for the next two sections the discussion about data structures and architectural choices.</Paragraph>
    <Paragraph position="11"> The proposed system is primarily built for French but every module has been designed to be as general as possible, and is completely configurable, so that it can be used for other languages.</Paragraph>
    <Paragraph position="12"> Each module is viewed as a server which is able to provide some service. Following our work on detection and correction of errors, many modules are dedicated to this sort of task.</Paragraph>
    <Paragraph position="13"> Given an incorrect word, the similarity key module is able to produce a list of correct words which are possible corrections of the incorrect one. It is well-suited for typographic errors. The phonetic graphic transducer plays the same role by using the phonetic invariant of words. It is well-suited for spelling errors.</Paragraph>
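The similarity key correction just described can be sketched as follows. The paper does not specify the key function, so this sketch assumes a SPEEDCOP-style skeleton key (first letter, then the word's remaining unique consonants, then its unique vowels); all function names are illustrative.

```python
# Sketch of a similarity-key corrector. The key function is an assumption
# (SPEEDCOP-style skeleton key), not the one used in the paper's system.
def skeleton_key(word):
    word = word.lower()
    consonants = [c for c in word if c.isalpha() and c not in "aeiou"]
    vowels = [c for c in word if c in "aeiou"]
    seen, key = set(), []
    # first letter, then unique consonants, then unique vowels, in order
    for c in [word[0]] + consonants + vowels:
        if c not in seen:
            seen.add(c)
            key.append(c)
    return "".join(key)

def build_key_table(dictionary):
    # key-form correspondence table, built once from the dictionary
    table = {}
    for w in dictionary:
        table.setdefault(skeleton_key(w), []).append(w)
    return table

def correct(word, table):
    # all dictionary words sharing the misspelled word's key are candidates
    return table.get(skeleton_key(word), [])
```

Simple transposition errors often preserve the key, which is why the technique suits typographic errors: `correct("fiels", table)` returns both `"flies"` and `"files"` when those words are in the dictionary.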
    <Paragraph position="14"> The morphological module can also be used for lexical correction \[3\] but its main purpose is to produce an input for the syntactico-semantic parser, which is in charge of building a decorated structure of the sentences of the text. The parser we use is a dependency-tree transducer designed as a robust parser \[4, 5\]. The syntactic checker is in charge of verifying agreement rules in sentences \[7\].</Paragraph>
    <Paragraph position="15"> The multi-purpose lexicon contains all lexical information and furnishes access tools (see next section).</Paragraph>
    <Paragraph position="16"> The text processor provides string manipulations while the edition communication module gives a paper or communicable form of the text. The structure manager is in charge of global manipulations on the surface structure of the text (chapter, sections,...) and of the much more difficult task of verifying the internal coherence (there is an introduction, a development, a conclusion,...). Every module can read or write in the text lattice (see §4.1); for example, the corrections produced by lexical correctors can be added as multiple interpretations of a word.</Paragraph>
    <Paragraph position="17"> Finally, the control and user interface module ensures the synchronisation and communication between modules and the transmission of user commands.</Paragraph>
    <Paragraph position="18"> The correctors, the syntactic checker, the morphological parser and generator, and the syntactico-semantic parser are all operational on micro-computers. At the moment, the lexicon is a roots and endings dictionary (35,000 entries, generating 250,000 forms) with only morphological information on words, but its extension is under development. 4. Data Structures 4.1. Blackboard A main characteristic of our system is the use of an internal representation of the text in the form of a multi-dimensional lattice (inspired by \[2\]) which plays the role of a blackboard for all the modules. (Figure 2: Example of a lattice.)</Paragraph>
    <Paragraph position="19"> Each node of the lattice bears information on a piece of text, and we propose that they all have the same structure: each node bears a tree (sometimes limited to the root) and each node of the tree bears a typed feature structure (a ψ-term, see §4.2). We can imagine that the lattice is initiated by the flow of characters which come from the text processor, so that the word &amp;quot;Time&amp;quot; becomes a node bearing the corresponding ψ-term. For performance reasons, it seems more reasonable to initiate the lattice with the lexical units resulting from the morphological parsing of the text. With the sequence of characters &amp;quot;Time flies...&amp;quot;, we will obtain the bottom four nodes of the figure 2 lattice.</Paragraph>
    <Paragraph position="20"> We can see two dimensions of the lattice on this example: a sequential dimension (&amp;quot;time&amp;quot; is the first word and is followed by the second word &amp;quot;flies&amp;quot;), and an ambiguity dimension (both words have two possible interpretations).</Paragraph>
    <Paragraph position="21"> A third dimension appears when the syntactic parser starts its work. It produces new lattice nodes which bear dependency trees. With the lattice above, the syntactic parser will add the two top nodes (figure 2).</Paragraph>
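The three dimensions of the lattice can be sketched as a simple data structure. Node fields and labels here are our own illustrative choices, not the paper's notation (the real nodes bear trees of ψ-terms).

```python
# Minimal sketch of the multi-dimensional text lattice used as a blackboard.
from dataclasses import dataclass, field

@dataclass
class Node:
    start: int      # sequential dimension: index of first covered word
    end: int        # index just past the last covered word
    label: str      # e.g. a lexical reading or a dependency-tree summary
    children: list = field(default_factory=list)   # links added by the parser

class Lattice:
    def __init__(self):
        self.nodes = []

    def add(self, node):
        self.nodes.append(node)
        return node

    def readings(self, start, end):
        # ambiguity dimension: all nodes covering the same span
        return [n for n in self.nodes if n.start == start and n.end == end]

lat = Lattice()
# bottom nodes: two morphological readings for each word of "Time flies"
t_n = lat.add(Node(0, 1, "time/noun"))
t_v = lat.add(Node(0, 1, "time/verb"))
f_n = lat.add(Node(1, 2, "flies/noun"))
f_v = lat.add(Node(1, 2, "flies/verb"))
# third dimension: the syntactic parser adds top nodes bearing dependency trees
lat.add(Node(0, 2, "flies(subject: time)", children=[t_n, f_v]))
lat.add(Node(0, 2, "time(object: flies)", children=[t_v, f_n]))
```

Every module reads and writes the same structure, so a lexical corrector can add alternative readings of a word simply by adding nodes over that word's span.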
    <Paragraph position="22"> 4.2. Lexicon We think it is very important, for the coherence of the knowledge embedded in the system, that all lexical information be contained in a unique dictionary. Multiple access and adapted software tools will extract and present the information to the user in different forms, for example the natural language form of a formal entry may be computed by the syntactic generator.</Paragraph>
    <Paragraph position="23"> To represent knowledge associated with words, we have chosen typed feature structures called ψ-terms \[1\]. With these structures, basic concepts are ordered in a hierarchy which can be extended to whole structures. Thus we can determine if a ψ-term is less than another, and the unification of two ψ-terms is the biggest ψ-term which is less than both unified ones. In other words, the unification of two terms is the most general term which synthesizes the properties of both unified ones. This characteristic is very interesting for the implementation of paradigms: a paradigm is the representative of a class of words and contains the information which describes the behaviour of a word. We distinguish three types of paradigms: morphological, syntactic and semantic.</Paragraph>
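The ordering and unification of ψ-terms can be sketched as follows, under the simplifying assumptions of a tree-shaped type hierarchy and no coreference links or equational constraints; all type and feature names are illustrative.

```python
# Sketch of ψ-term-style unification over a small, tree-shaped type hierarchy.
HIERARCHY = {                 # child -> parent; "top" is the most general type
    "noun": "word", "verb": "word", "word": "top",
    "sing": "number", "plur": "number", "number": "top",
}

def ancestors(t):
    out = {t}
    while t in HIERARCHY:
        t = HIERARCHY[t]
        out.add(t)
    return out

def glb(t1, t2):
    # greatest lower bound in a tree hierarchy: one type must subsume the other
    if t1 in ancestors(t2): return t2   # t1 is more general than t2
    if t2 in ancestors(t1): return t1
    return None                          # incompatible types: unification fails

def unify(a, b):
    # a term is a pair ("type", {feature: subterm})
    t = glb(a[0], b[0])
    if t is None: return None
    feats = dict(a[1])
    for f, sub in b[1].items():
        if f in feats:
            r = unify(feats[f], sub)     # recursively unify shared features
            if r is None: return None
            feats[f] = r
        else:
            feats[f] = sub
    return (t, feats)
```

For example, unifying a bare `verb` term with a `word` term carrying `num: sing` yields the most general term with both properties: a `verb` with `num: sing`. Unifying `noun` with `verb` fails, since neither subsumes the other.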
    <Paragraph position="24"> Morphological paradigms bear the category of the word and a few linguistic variables such as gender and number. Syntactic paradigms contain information about the function of the word within its context. The aim is to code the sub-categorization of words, and it is very important for verbs but also for nouns and some adjectives. A semantic paradigm is the semantic concept associated with the word or the logical structure in the case of predicate words.</Paragraph>
    <Paragraph position="25"> Each entry in the lexicon contains a key, which is used to access the entry, and a reference to a paradigm of each type. In order to allow information sharing between ψ-terms, we add to the entry an optional list of equational constraints. For example, for choose, we have: syn.subject.sem = sem.agent and syn.object.sem = sem.choice, saying that usually the subject of the verb is its agent and the object is the choice. The result of morphological parsing of a form is the unification of the three paradigms of each lexicon entry used. For example, for the form chooses, we use the root choose and the ending s (which adds the features person and number to the paradigms of the verb), thus we obtain a ψ-term where the notation @X is used to write reference links (equational constraints).</Paragraph>
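The combination of the root's paradigm with the features contributed by the ending can be sketched as a simple feature merge; the feature names below are illustrative, not the paper's exact notation.

```python
# Sketch: parsing "chooses" as root "choose" + ending "s".
def merge(paradigm, ending_features):
    out = dict(paradigm)
    for k, v in ending_features.items():
        if k in out and out[k] != v:
            return None   # clash: the ending is incompatible with the root
        out[k] = v
    return out

choose_morph = {"cat": "verb"}               # morphological paradigm of the root
ending_s = {"person": 3, "number": "sing"}   # features contributed by the ending
parsed = merge(choose_morph, ending_s)
```

A root already marked plural would fail to merge with the singular ending, which is exactly the filtering a real paradigm unification provides.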
    <Paragraph position="26"> The idea behind paradigms is to allow a great factorization of knowledge: it is obvious for morphological paradigms (in the current dictionary, we have only 400 paradigms for 250,000 forms) and for syntactic paradigms (the number of possible sub-categorizations for verbs is far less than the number of verbs). It is less obvious for semantic paradigms, especially if you want a very fine description of a word: in this case, there is almost a paradigm for each word.</Paragraph>
    <Paragraph position="27"> So the lexicon is essentially built around three ψ-term bases, one for each set of paradigms.</Paragraph>
    <Paragraph position="28"> The bases are accessed by the roots and endings dictionary used by the morphological tools (parser and generator), and we can easily add synonym and antonym links to this dictionary. The key-form correspondence table, required by the similarity key correction technique, cannot easily be embedded in this lexicon structure, but we propose to append it to the lexicon so that any module requiring lexical information must use the multi-purpose lexicon module. This constraint is imposed in view of coherence: each time a root is added to the main dictionary, all key-form pairs obtainable from this root must be added to the table.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5. Distribution
</SectionTitle>
    <Paragraph position="0"> Each module in our system must be viewed as a server which responds to requests sent by any other module. Such an architecture has the classical advantages of modular structures: you can add or remove a module very easily, you can modify a module in a transparent manner as long as you do not change its interface...</Paragraph>
    <Paragraph position="1"> But this structure has another advantage which is very important in the context of linguistic treatments: the linguistic competence of each module can be exploited by the others. We will use two examples to illustrate this point.</Paragraph>
    <Paragraph position="2"> First, in detection and correction of lexical errors, we have implemented classical tools (similarity key and phonetic). Then we decided to implement syntactic checking, so we needed the services of a morphological parser. We added to the system (a prototype called DECOR) our morphological tools, and the availability of these tools gave the idea of using them for detection and correction, so we implemented a third technique of correction: morphological generation.</Paragraph>
    <Paragraph position="3"> Example of correction using morphological generation: foots, although incorrect, may be parsed as foot + s, and the root foot, plus the variables (plural) associated with the s, when passed on to the morphological generator, give the correct form feet.</Paragraph>
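This parse-then-generate correction can be sketched as follows; the toy ending rule and tiny irregular-plural table are purely illustrative stand-ins for the system's roots and endings dictionary.

```python
# Sketch of correction by morphological parsing followed by generation.
def parse(form):
    # naive roots-and-endings parse: a final "s" carries the feature plural
    if form.endswith("s"):
        return form[:-1], {"number": "plural"}
    return form, {"number": "singular"}

IRREGULAR_PLURALS = {"foot": "feet", "mouse": "mice"}   # toy generation data

def generate(root, feats):
    if feats["number"] == "plural":
        return IRREGULAR_PLURALS.get(root, root + "s")
    return root

def correct_by_morphology(form, dictionary):
    if form in dictionary:
        return form                        # already a correct form
    root, feats = parse(form)              # e.g. "foots" -> ("foot", plural)
    candidate = generate(root, feats)      # regenerate from root + variables
    return candidate if candidate in dictionary else None
```

With a dictionary containing foot and feet, the over-regularized form foots is parsed to foot + plural and regenerated as feet.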
    <Paragraph position="4"> As a second example, consider the problem of proposing corrections for agreement errors: when an error occurs, it means that at least two words do not agree, so there are at least two possible corrections depending on which of the two words you choose to correct. The solution for the system is to propose both corrections to the user and let him choose one. Even this simple method requires a linguistic service: a morphological generator is necessary to produce each correction.</Paragraph>
    <Paragraph position="5"> But we think that in most cases the good correction can be chosen automatically, according to criteria(1) such as those considered by \[10\]: * number of errors in a group: little cat are funny pets must be corrected little cats are funny pets rather than little cat is funny pet; * it is better to correct in a way which does not modify the phonetics of the phrase. We give here a French example(2): Les chiens dressées... will be corrected Les chiens dressés... rather than Les chiennes dressées...; * one can give priority to the head of the phrase: cat which are... becomes cat which is...; * writer laziness: a writer sometimes omits an s where one is necessary, but rarely adds one where it is not.</Paragraph>
    <Paragraph position="6"> Such criteria are sometimes contradictory and we propose to use an evaluation method which gives a relative weight to each criterion, so that each possible correction has a probability of being correct. The user is asked for a choice only in cases where both corrections have equivalent probability.</Paragraph>
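The weighted evaluation can be sketched as follows; the criterion functions, weights and tie margin are illustrative assumptions, not values from the paper.

```python
# Sketch of weighted-criteria evaluation for choosing among agreement corrections.
def score(correction, criteria, weights):
    # each criterion returns a value in [0, 1]; the weighted sum approximates
    # the probability that this correction is the intended one
    return sum(w * c(correction) for c, w in zip(criteria, weights))

def choose(corrections, criteria, weights, margin=0.1):
    scored = [(score(c, criteria, weights), i) for i, c in enumerate(corrections)]
    scored.sort(reverse=True)
    if len(scored) > 1 and scored[0][0] - scored[1][0] < margin:
        return None                      # equivalent probabilities: ask the user
    return corrections[scored[0][1]]

# the "number of errors in a group" criterion, applied to the example above
c1 = {"text": "little cats are funny pets", "edits": 1}
c2 = {"text": "little cat is funny pet", "edits": 2}
criteria = [lambda c: 1.0 / c["edits"]]  # fewer corrected words scores higher
weights = [1.0]
best = choose([c1, c2], criteria, weights)
```

Adding criteria such as phonetic preservation or head priority is just a matter of appending more functions and weights; when two corrections score within the margin, the sketch defers to the user, as the paper proposes.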
    <Paragraph position="7"> But, whatever strategy is implemented, it needs the cooperation of various linguistic modules in order to perform the evaluation: phonetic transducer, morphological parser and generator, and our architecture permits the use of the available ones.</Paragraph>
    <Paragraph position="8"> Finally, beyond linguistic justifications, one can find computational justifications: each module of the system can work in parallel with the others and they can even work on different computers, putting the distribution at a physical level.</Paragraph>
    <Paragraph position="9"> (1) Note that these criteria are pertinent for French, where there are a lot of agreement rules (between noun, adjectives and determiner, between subject and verb,...). (2) A similar English example might be The skis slides, which is corrected The ski slides rather than The skis slide.</Paragraph>
  </Section>
</Paper>