<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2107"> <Title>A computerized dictionary : Le Tresor de la langue francaise informatise</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Presentation </SectionTitle> <Paragraph position="0"> Le Tresor de la langue francaise, The Treasury of the French language, existed first as a paper version. It is a dictionary of the 19 th and 20 th century vocabulary, in 16 volumes. The first volume was published in 1971 and the last one in 1993. It contains about 100 000 head words with their etymology and history, that means 270 000 definitions, 430 000 examples with their source, the majority of them are extracted from the Frantext database.</Paragraph> <Paragraph position="1"> The computerized version of the dictionary, the TLFi (Tresor de la langue francaise informatise), contains the same data as the paper version, with its 350 million characters. With the help of very sophisticated automata, we have been able to insert in the text a very complex set of XML tags in such a way that every textual object is clearly identified and that the hierarchy containing these objects is clearly designed. With this tag set and thanks to its software Stella, it can be seen as a lexical finely structured database.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Stella: the Toolbox for the TLFi </SectionTitle> <Paragraph position="0"> exploitation As well as all the textual resources of the laboratory, the textual database FRANTEXT</Paragraph> <Paragraph position="2"> the dictionary of Academie francaise and several others lexical database (Bernard et col, 2001 et 2002), the TLFi runs on its own specially software program STELLA, written in the laboratory, Stella allows a compact data storage (with a mathematically demonstrable optimality) of structured texts (Dendien, 1996). Above all, it offers, to developpers, very powerful tools for the access and handling of textual data including several XML hierarchical taggings.</Paragraph> <Paragraph position="3"> Stella offers the users: - An environment to make the requests. The interface is very friendly, with a lot of help on line. It offers fine-grained request possibilities, allowing precision in the results.</Paragraph> <Paragraph position="4"> - An optimal response time to all requests. - A good quality of service: Stella contains a linguistic &quot;knowledge&quot; (flexions, categorized databases) which allows a user to make complex requests.</Paragraph> <Paragraph position="5"> - A powerful capacity of interrogation: a user can write parametrable grammars to be used and re-used in different contexts.</Paragraph> <Paragraph position="6"> - A possibility of hypernavigation throughout all the databases interconnected under Stella.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Specificities of the TLFi </SectionTitle> <Paragraph position="0"> Its originality is based, firstly, on its wordlist, which is rich of about 100 000 entries, present either in our funds or in other dictionaries. 
<Paragraph position="6"> 2.2.3 The different levels of queries: Three levels of queries are possible, depending on the user's needs.</Paragraph>
<Paragraph position="7"> 2.2.3.1 First level: simple visualization of an article.</Paragraph>
<Paragraph position="8"> You can access the article dedicated to a specific headword in three ways.</Paragraph>
<Paragraph position="9"> Firstly, you can type the word with mistakes if you do not know its correct spelling. This is very useful, for instance, for users who do not remember the right accents (acute, grave or circumflex). All kinds of mistakes (wrong or omitted accents, single/double consonants, missing hyphens) are allowed; more generally, any mistake is tolerated as long as the pronunciation remains correct. For instance, if you type &quot;ornitorink&quot;, the article &quot;ornithorynque&quot; will be found. It is also possible to enter an inflected form of a verb (e.g. danseront) or of an adjective or noun (e.g. generaux), or even a phonetic equivalent of such a form (e.g. jenero), even with wrong accents.</Paragraph>
<Paragraph position="10"> Secondly, you can browse the list of the main articles contained in the TLFi; this allows the user to discover unknown words, just as if he were turning the pages of the paper version of the dictionary.</Paragraph>
<Paragraph position="11"> Thirdly, the user can find an article by selecting sounds rather than alphabetical characters. At this level of consultation, you read the dictionary article by article, yet with easy ways of finding a word.</Paragraph>
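As a rough illustration of this spelling tolerance, the following Python sketch maps both the stored headwords and the user's input to a crude pronunciation key (accents stripped, plus a few toy French spelling-to-sound rules). The word list and the rewrite rules are invented for the example; the TLFi relies on far finer phonetic knowledge.

# Toy accent- and spelling-tolerant headword lookup; the normalisation
# rules are a crude approximation of French pronunciation, for illustration.
import re
import unicodedata

HEADWORDS = ["ornithorynque", "danser", "general"]   # toy word list

def pronunciation_key(word: str) -> str:
    w = unicodedata.normalize("NFD", word.lower())
    w = "".join(c for c in w if not unicodedata.combining(c))   # drop accents
    rules = [
        (r"qu", "k"), (r"c(?=[eiy])", "s"), (r"c", "k"),
        (r"ph", "f"), (r"th", "t"), (r"h", ""),
        (r"y", "i"), (r"(.)\1", r"\1"),   # collapse double letters
        (r"e$", ""),                      # drop final mute e
    ]
    for pattern, repl in rules:
        w = re.sub(pattern, repl, w)
    return w

INDEX = {pronunciation_key(h): h for h in HEADWORDS}

def lookup(query: str):
    return INDEX.get(pronunciation_key(query))

print(lookup("ornitorink"))      # -> ornithorynque
print(lookup("ornithorynque"))   # exact spellings still work

A real system would also need to map inflected forms (danseront, generaux) back to their lemma, which is the kind of linguistic knowledge Stella embeds.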
<Paragraph position="12"> 2.2.3.2 Second level: aided queries.</Paragraph>
<Paragraph position="13"> At this level you can use the dictionary as a textual knowledge database and make queries across the 16 volumes in one mouse click. One can query graphic forms, inflected forms, sequences of words, and so on.</Paragraph>
<Paragraph position="14"> The queries can be mono-criterion or multi-criteria. Examples of mono-criterion queries: all the words borrowed from Spanish, all the words of a specific domain, all the onomatopoeias, all the metaphors, and so on. By specifying several criteria, one can extract from the dictionary all the nouns of a specific domain, or all the verbs which are used with a given stylistic indicator (for instance &quot;popular&quot;) and which have been used by a given author (for instance Victor Hugo). Another example: one can also extract all the definitions which contain a given word (for instance instrument), which at the same time do not contain the word measure, and which belong to the domain of optics, and so on.</Paragraph>
<Paragraph position="15"> 2.2.3.3 Third level: complex queries. The user, at this level, can search for a set of textual objects, requiring them to conform to a set of constraints combining the type, the contents and the relations between objects.</Paragraph>
<Paragraph position="18"> Type and content specifications are sometimes possible in the query systems of other computerized dictionaries. They make it possible to find articles dealing with architecture, or articles containing an example taken from Zola. Suppose now that we are looking for articles in which an example from Zola is related to the domain of architecture. Simply combining the two criteria (the article must deal with architecture and contain examples from Zola) is not enough: perhaps some part of the article is not devoted to architecture, and if the example from Zola lies in such a part, the article is not relevant.</Paragraph>
<Paragraph position="19"> In the TLFi, the problem is solved with a new kind of constraint, based on the scope of objects: we state that the example must be hierarchically inferior to the domain indication. Thus, all the articles in which the example is not in the scope of the domain will be discarded.</Paragraph>
<Paragraph position="20"> This feature is nothing but a direct reflection of the XML tags representing the hierarchy, and it gives the TLFi query system a remarkable accuracy.</Paragraph>
<Paragraph position="21"> Strangely enough, it seems to be ignored in other computerized dictionaries, with very poor quality results as a consequence.</Paragraph>
<Paragraph position="22"> This powerful feature, together with many others, such as the possibility of building lists of words in many ways (manually, by automatic generation of the inflected forms of a given lemma, or by selection with high-level regular expressions) and of reusing them in queries, combined with the rich content of the TLFi and a very user-friendly interface with online help, allows very complex querying with pertinent results.</Paragraph> </Section> </Section> </Paper>
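To make the second- and third-level querying more concrete, here is a small Python sketch that filters a toy collection of articles: first by plain criteria (domain architecture and an example by Zola), then with the additional hierarchical constraint that the example must lie within the scope of the domain indication. The markup and helper names are invented simplifications of the TLFi structure, not its actual query language.

# Sketch of a scope-constrained query: keep only articles in which an
# example from Zola lies inside the scope of the domain "architecture".
# The markup is an invented, simplified stand-in for the TLFi tags.
import xml.etree.ElementTree as ET

ARTICLES = {
    "arcade": """
      <article headword="arcade">
        <block>
          <dom>architecture</dom>
          <def>a series of arches</def>
          <exemple author="Zola">...</exemple>
        </block>
      </article>""",
    "pilier": """
      <article headword="pilier">
        <block>
          <dom>architecture</dom>
          <def>a vertical support</def>
        </block>
        <block>
          <def>figurative sense</def>
          <exemple author="Zola">...</exemple>
        </block>
      </article>""",
}

def in_scope_of(root, outer, inner):
    """True if `inner` sits inside the smallest <block> enclosing `outer`."""
    parent = {c: p for p in root.iter() for c in p}
    block = outer
    while block is not root and block.tag != "block":
        block = parent[block]
    node = inner
    while node is not block and node is not root:
        node = parent[node]
    return node is block

def matches(xml_text):
    root = ET.fromstring(xml_text)
    domains = [d for d in root.iter("dom") if d.text == "architecture"]
    examples = [e for e in root.iter("exemple") if e.get("author") == "Zola"]
    if not domains or not examples:            # second level: plain criteria
        return False
    return any(in_scope_of(root, d, e)         # third level: scope constraint
               for d in domains for e in examples)

print([word for word, xml_text in ARTICLES.items() if matches(xml_text)])
# -> ['arcade']: in "pilier" the Zola example is outside the architecture block

In the second article, combining the two plain criteria alone would wrongly accept it; it is the scope constraint that discards it, exactly as described for the TLFi above.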