<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1095"> <Title>INTEX: A CORPUS PROCESSING SYSTEM</Title> <Section position="3" start_page="0" end_page="579" type="intro"> <SectionTitle> 1. LINGUISTIC TOOLS </SectionTitle> <Paragraph position="0"> The user first loads a text and selects the working language 1. INTEX counts the number of tokens in the text and the number of different ones, and sorts them by frequency. Then the user selects linguistic tools to parse the text. Tools are either dictionaries or finite state transducers (FSTs).</Paragraph> <Section position="1" start_page="0" end_page="579" type="sub_section"> <SectionTitle> 1.1. Dictionaries </SectionTitle> <Paragraph position="0"> INTEX is based on two large-coverage built-in dictionaries: -- the DELAF dictionary contains over 700,000 simple words, basically all the simple words of the language 2. Each entry in the DELAF is associated with explicit morphological information: its canonical form (e.g. the infinitive for verbs), its part of speech (e.g. Noun), and some inflectional information (e.g. first person singular present). Here are three entries of the French DELAF: a, avoir. V:P3s abacas, abaca. N:mp abaissa, abaisser. V:J3s The token 'a' is the Verb 'avoir' conjugated in the third person singular present (P3s); 'abacas' is the masculine plural of the Noun 'abaca'; 'abaissa' is a form of the Verb 'abaisser' conjugated in the third person singular &quot;Passé simple&quot; (J3s). 1. At this moment, English, French and Italian dictionaries have already been included in INTEX. German, Spanish and Portuguese compatible dictionaries are under construction. We will give French examples.</Paragraph> <Paragraph position="1"> 2. For a discussion of the completeness of the DELAF dictionary, see [Courtois; Silberztein 1989], [Clemenceau 1993]. Since the morphological analysis of each
token is performed by a simple lookup routine, INTEX guarantees an error-free result (there is no guessing algorithm and no 'probabilistic' result).</Paragraph> <Paragraph position="2"> INTEX includes a few other dictionaries for proper names, toponyms, acronyms, etc.; -- the DELACF dictionary contains over 150,000 compounds, mostly nouns 3. Each entry in the DELACF is associated with its canonical form, its part of speech, and some inflectional information. Here are three entries of the French DELACF: à tout de suite, à tout de suite. ADV cartes bleues, carte bleue. N:fp pomme de terre, pomme de terre. N:fs INTEX includes a few other dictionaries for compound proper names. The user may add his/her own dictionaries for simple words and compounds. 1.2. Finite State Transducers FSTs are represented in INTEX by recursive graphs. Basically, the &quot;input&quot; part of an FST is used to identify patterns in texts; the &quot;output&quot; part of an FST is used to associate each identified occurrence with information. In many cases, FSTs represent words more naturally than dictionaries. For example, numerical determiners, such as trente-cinq mille neuf cents trente-quatre, formally are compounds which are naturally represented by graphs (see the graph Dnum in Appendix 1). FSTs may also be used to bring together graphical variants of a word in order to check spelling consistency, to associate all the variants of a term with a unique canonical entry in an index, to represent families of derived words (see the graph France in Appendix 1), to associate synonyms of a term in an information retrieval system, etc. In the graph editor, gray nodes are graph names; tags written in white nodes are the inputs of the FSTs, and outputs are written below nodes 4. The user draws graphs directly 3. For a discussion of the completeness of the DELACF, see [Courtois; Silberztein 1989].</Paragraph> <Paragraph position="3"> 4.
For a description of the graph editor of INTEX, see [Silberztein 1993].</Paragraph> <Paragraph position="4"> on the screen; the resulting graphs are interpreted as FSTs by INTEX.</Paragraph> <Paragraph position="5"> By selecting and applying dictionaries and FSTs to a text, the user builds the dictionary of the words of the text. Appendix 1 shows the resulting dictionary, as well as the list of all unknown tokens. Generally, these tokens are either spelling errors or proper names.</Paragraph> </Section> </Section> </Paper>