<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1095"> <Title>INTEX: A CORPUS PROCESSING SYSTEM</Title> <Section position="4" start_page="579" end_page="580" type="metho"> <SectionTitle> 2. LOCATING PATTERNS </SectionTitle> <Paragraph position="0"> After having built the dictionary of the words of the text, the user can locate morpho-syntactic patterns in the corpus, and index or build a concordance for all occurrences of the pattern. Patterns may be: --a word, or a list of words. For example, one can locate in a text all occurrences of the verb faire (even when inflected), all the compound nouns (since most of them are non-ambiguous terms, their list constitutes a good index); --a given category, such as verb conjugated in the third person singular (V:3s), or noun in the feminine plural (N:fp), etc. Here are several examples of categories: A:p (adjective in plural), ADV (adverb), DET:f (feminine determiner), V:Kms (past participle, masculine singular), etc.</Paragraph> <Paragraph position="1"> --a syntactic pattern represented by a regular expression or a graph; the following is a regular expression:</Paragraph> <Paragraph position="3"> This pattern matches any sequence beginning with a conjugated form of the verb être, optionally followed by an adverb (&lt;E&gt; stands for the null word), followed by a determiner and then a noun. Note that categories match simple and compound words. In particular, &lt;ADV&gt; also matches compound adverbs. More generally, the user may apply to the text grammars expressed by recursive graphs; graphs typically represent: -- sets of synonymous expressions, such as : per- null ent languages can be linked, so that each matching sequence in the source language could be automatically associated with the corresponding graph in the target language (e.g. lose one's head, mind, bearings, etc.). 
A graph may represent all the expressions which designate an entity, or a process; indexing such graphs allows one to retrieve information in large corpora; -- pieces of a large-coverage grammar of the language. Recursive graphs are easily edited; standard operations on graphs (union, intersection, difference, etc.) help to build an easily maintained system of hundreds of elementary graphs.</Paragraph> <Paragraph position="4"> This construction has begun at LADL; we already have graphs describing adverbial complements which express a measure (temperature, speed, length, etc.), a time or a date (e.g. le 17 février 1993, le premier lundi du mois de juin) (Maurel 1989), some locative structures (Garrigues 1993), etc.</Paragraph> </Section> <Section position="5" start_page="580" end_page="580" type="metho"> <SectionTitle> 3. REMOVING AMBIGUITIES </SectionTitle> <Paragraph position="0"> In order to disambiguate words in texts, INTEX uses cache dictionaries and local grammars.</Paragraph> <Section position="1" start_page="580" end_page="580" type="sub_section"> <SectionTitle> 3.1. Cache dictionaries </SectionTitle> <Paragraph position="0"> Since the DELAF and DELACF dictionaries included in INTEX have a very large coverage, they contain a number of words which only occur in some specific domains; in addition, some frequent words may be associated with generally inappropriate information. For instance, par is usually a preposition in French, but in some cases it may be a noun (a technical term in golf). By default, each occurrence of this token will be considered ambiguous (preposition or noun). Cache dictionaries are used as filters: if INTEX finds a word in a cache dictionary, it will not look it up in the selected dictionaries and FSTs. If the user knows that in a given corpus, the token par is always a preposition, he/she enters the following entry in a cache dictionary: par,par.
PREP Hence, the user can avoid unnecessary ambiguities by putting frequent words (or conversely, specific terms) in cache dictionaries adapted to each processed text.</Paragraph> <Paragraph position="1"> Most compounds are ambiguous, since they formally are sequences of simple words; for instance, the sequence pomme de terre is not necessarily a compound noun in the following sentence: Luc recouvre une pomme de terre cuite (Luc covers a cooked potato) (Luc covers an apple with scorched earth). However, a number of compounds are not ambiguous, either because they contain a non-autonomous constituent (e.g. aujourd'hui), or because they are technical terms (e.g. un tube cathodique, un sous-marin nucléaire). By entering these non-ambiguous compounds in a cache dictionary, the user prevents INTEX from looking up dictionaries and FSTs for simple words; hence INTEX does not process these compounds as ambiguous.</Paragraph> </Section> <Section position="2" start_page="580" end_page="580" type="sub_section"> <SectionTitle> 3.2. Local grammars </SectionTitle> <Paragraph position="0"> A local grammar is a two-part rule: if a given sequence of words is matched, then each word in the sequence is tagged in the proper way. For instance, in the sequence s'en donne, s' is a pronoun (not a conjunction), en is a pronoun (not a preposition), and donne is a verb (not a noun).</Paragraph> <Paragraph position="1"> The corresponding local grammar would be: s'/&lt;PRO&gt; en/&lt;PRO&gt; &lt;MOT&gt;/&lt;V&gt;, where &lt;MOT&gt; stands for any word. Local grammars are represented by FSTs, hence their length and their complexity have no limit. Any number of local grammars may be used at the same time to disambiguate texts (FSTs merge easily); hence it is best to create small ones. Local grammars use the dictionary of the words of the texts, so they correctly handle sequences with compounds. Appendix 2 shows a few local grammars. 
INTEX includes a dozen &quot;perfect&quot; local grammars, that is, grammars that will never give incorrect tagging solutions; the user may add his/her own perfect (or probabilistic) disambiguating grammars.</Paragraph> <Paragraph position="2"> 3.3. The result of the parsing After having selected linguistic tools (either dictionaries or FSTs), the user can parse the text, that is, insert in the text all the linguistic information required by a syntactic parser. For instance, the text: il la donne would at this step be represented by the following expression: il, PRO (la, PRO:fs + la, DET:fs) (donne, N:fs + donner, V:P1s + donner, V:P3s + donner, V:S1s + donner, V:S3s + donner, V:Y2s). Here la can be a pronoun or a determiner; donne is a noun, or one of 5 conjugated forms of the verb donner.</Paragraph> <Paragraph position="3"> INTEX then builds the corresponding minimal automaton: the number of transitions of this automaton corresponds to the number of lexical ambiguities of the text (in the above example: 9 transitions). By selecting and applying local grammars to the text, the user effectively removes transitions in the resulting automaton. For instance, thanks to a simple local grammar (which describes the preverbal particles), the above text can be parsed to give the following expression: il, PRO la, PRO:fs (donner, V:P3s + donner, V:S3s). The remaining ambiguity corresponds to the mood of the verb: indicative or subjunctive present. The corresponding automaton has only 4 transitions. Hence, the number of transitions can be used as a quantitative tool to measure the efficiency of the removal of ambiguities. By selecting one local grammar at a time, or by merging several, the user is able to apprehend exactly how each grammar covers the text, and performs in terms of deleting transitions.</Paragraph> </Section> </Section> </Paper>