File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/99/w99-0511_metho.xml
Size: 20,002 bytes
Last Modified: 2025-10-06 14:15:31
<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0511"> <Title>A Computational Lexicon of Portuguese for Automatic Text Parsing Elisabete RANCHHOD FLUL/CAUTL-IST</Title> <Section position="4" start_page="0" end_page="76" type="metho"> <SectionTitle> 2 Portuguese Electronic Dictionaries </SectionTitle> <Paragraph position="0"> By electronic dictionary, we mean a computerized lexicon specifically designed for use in automatic text parsing operations (indexing, recognition of complex words, technical and common, etc.). Large-coverage electronic dictionaries were built for Portuguese for that purpose. The set of lexical data is organized according to the formal complexity of the lexical units. The Portuguese DELAS is the central element of the dictionary system. It contains more than 110,000 simple words, whose grammatical attributes are systematically described and encoded. The set of compound words is structured in the Portuguese DELAC. At the moment, it consists of a lexicon of 22,000 compound nouns and 3,000 frozen adverbs, so it is still far from adequate. The lexical entries of DELAS have the following general structure: <word>, <formal description> where word represents the canonical form (the lemma) of a simple lexical unit (in general the masculine singular for nouns and adjectives, the infinitive for verbs), and formal description corresponds to an alphanumeric code containing information on the grammatical attributes of the entries: their grammatical class (possibly a sub-class) and their morphological behavior. The inflected forms are automatically generated from the association of a lemma with an inflectional code; the list of all inflected words constitutes the Portuguese DELAF (1,250,000 word forms). In Portuguese, the major grammatical classes (nouns, adjectives and verbs) have inflected forms: - nouns and adjectives can appear in the feminine and/or in the plural; they can receive diminutive and augmentative suffixes; the superlative degree of the adjectives can 
be expressed by morphological means (suffixes); - verbs are conjugated (mood, tense, person, number); furthermore, some verbal forms can undergo formal modifications induced by the presence of a clitic pronoun. Thus, the DELAS entries gato, N01D1 gordo, A01D1S1 (where N and A indicate that gato (cat) is a noun and gordo (fat) is an adjective, 01 corresponds to the inflection rule for masculine, feminine, singular and plural, and D1 and S1 make explicit the type of diminutive and superlative suffixes that can be accepted by these entries) produce the following inflected forms (DELAF entries): gato, gato N ms (cat). As for the verbs, for instance dar (to give), the entry dar, V02t gives rise to a list of 73 inflected forms that correspond to the normal conjugation of a non-defective verb; in addition, dar can be constructed with clitic pronouns (t), in the position of accusative and dative complements. So, in (1) Nós demos o livro à Maria (Lit.: We gave the book to Maria) the verb form demos expresses indicative mood, past tense, and first person plural. From a syntactic point of view, dar is constructed with three arguments: the subject Nós (we) and two complements, o livro (the book) and à Maria (to Maria). The complement syntactic positions can be filled by clitic pronouns, respectively o (it), accusative, and lhe (her), dative, as in (2) Nós demo-lo à Maria (Lit.: We gave it to Maria) (3) Nós demos-lhe o livro (Lit.: We gave her the book) (4) Nós demos-lho (Lit.: We gave her_it). In (2), the direct object has been cliticized and, for historical phonetic reasons, both the accusative pronoun and the verb have undergone formal modifications: o > lo, demos > demo. In (4), both pronouns (dative and accusative) are obligatorily agglutinated, forming the contraction lho (< lhe + o). So, even though the analysis of verb-clitic combinations is a syntactic matter, given the morphological changes induced by such combinations in Portuguese, a first description had to be made at the morphological level. On the other hand, the example in (4) illustrates a 
case where the formal notion of simple word does not correspond to an adequate linguistic analysis. Indeed, the form lho results from the contraction of two independent pronouns, lhe + o. In Portuguese, contracted forms resulting from the agglutination of two different words (and two different grammatical categories) are commonly observed. We give some simple examples of contractions resulting from the merging of prepositions with determiners, pronouns and adverbs: pel(o, a, os, as) < por + (o, a, os, as) (by the); del(e, a, es, as) < de + (ele, ela, eles, elas) (of (him, her, them)); daqui < de + aqui (from here). The relationships between contractions and their base constituent categories are established by finite-state transducers (see below).</Paragraph> <Section position="1" start_page="75" end_page="76" type="sub_section"> <SectionTitle> 2.2 Dictionaries for Compounds </SectionTitle> <Paragraph position="0"> Compound words, i.e., lexical units that are constituted by a fixed combination of simple words, represent a large part of the lexicon of any language. One has only to underline in a text the sequences of words that are frozen together to some extent to realize that compounds constitute an important percentage of the text (footnote 3). It is therefore illusory to envisage any sort of automatic processing before a significant lexical coverage is achieved. 
The issue is even more acute if one considers the description of scientific or technical texts or any specialized lexicon, where the number of compounds can rise to appalling figures. As said in section 2, compounds are structured in the Portuguese DELAC. Priority was given to the listing and formalization of compound nouns, which can inflect: lua de mel - luas de mel (honeymoon), and of compound adverbs, which are invariable: de repente (suddenly). From the point of view of the lexicon, the main focus, especially as far as compound nouns are concerned, has been the everyday, not too technical, lexicon. In order to identify compound words, and distinguish them from formally identical free word combinations, a set of morpho-syntactic criteria was adopted (Ranchhod (1991), Baptista (1995)). In short, compounds are the sequences of words that present restrictions to the combinatorial properties that they were supposed to have. The formalization of compound dictionary entries is similar to that of simple words. Since compound adverbs, prepositions and conjunctions do not inflect, their formats are rather simple. Footnote 3: See section Parsing Texts Using INTEX Tools.</Paragraph> <Paragraph position="2"> Compound nouns, however, generally have inflected forms. The rules for the inflection of compound nouns presented by grammarians do apply to some cases, but most compound nouns exhibit inflectional restrictions on gender or number that cannot be accounted for by the morphological properties of their constituents. In the DELA format, the inflectional properties of compound nouns are specified according to the same criteria as in the dictionary of simple words. Thus, given the following nominal entries of the [...] The first two compound nouns, ser humano and guerra fria, have an internal structure Noun Adjective (NA), the most productive class in Portuguese; visita de estudo is a compound of structure Noun de Noun, also a very productive one. Each entry is characterized by the possibility (+) or impossibility (-) of gender and number 
inflection, respectively; the elements of the compound that can be inflected receive the inflectional code that they have in the DELAS: both constituents of ser humano inflect (in number) according to, respectively, the rules 21 and 01: ser humano - seres humanos; guerra fria is invariable; and the noun visita de estudo only allows the inflection of visita: visita de estudo - visitas de estudo. As for other languages (e.g. French), additional information is being added, namely semantic</Paragraph> </Section> <Section position="2" start_page="76" end_page="76" type="sub_section"> <SectionTitle> 2.3 Local Grammars </SectionTitle> <Paragraph position="0"> Most local linguistic phenomena, as well as many complex sentences, are represented in a natural way by the formalism of finite-state automata (FSA). For instance, frozen or semi-frozen structures are very naturally described by graphs that represent FSAs (Silberztein (1997)). We illustrate the use of graphs with an elementary example, selected from the libraries of Portuguese local grammars. This grammar describes a family of adverbial expressions (dates), which refer to a period of time around the middle of the months (or, by extension, of some years), as in the underlined expression: Isso aconteceu nos idos de Março (That happened on the ides of March).</Paragraph> <Paragraph position="2"> The following examples show how transducers are used to analyze contractions, ambiguities and compound numerical determiners.</Paragraph> </Section> <Section position="3" start_page="76" end_page="76" type="sub_section"> <SectionTitle> 3.1 Analysis of Contracted Words </SectionTitle> <Paragraph position="0"> As stated above (2.1), contracted forms resulting from the agglutination of two independent words are commonly observed in Portuguese. To properly analyze these entries we built finite-state transducers (FST) that, given a contracted form, produce an output corresponding to the decomposition of the contraction into its base constituents. 
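
As a rough illustration of this idea, and not the INTEX transducers themselves, the following Python sketch compiles a small contraction table into a character-level transducer and uses it to split a contracted form into its base constituents. The table entries are contractions cited in this paper; the function names (build_transducer, decompose) and the simplified category tags (PREP, ADV, PRO, DET) are assumptions made for the example and are not the DELAF codes.

```python
# Illustrative contraction table (forms taken from the paper).
# The tags are simplified assumptions, not the actual DELAF codes.
CONTRACTIONS = {
    "daqui": [("de", "PREP"), ("aqui", "ADV")],
    "dele": [("de", "PREP"), ("ele", "PRO")],
    "pelo": [("por", "PREP"), ("o", "DET")],
    "lho": [("lhe", "PRO"), ("o", "PRO")],
}


def build_transducer(table):
    """Compile the table into a character trie.

    State 0 is the initial state. transitions[(state, char)] gives the
    next state, and outputs[state] holds the decomposition emitted when
    an input form ends in that state.
    """
    transitions, outputs = {}, {}
    next_state = 1
    for form, parts in table.items():
        state = 0
        for char in form:
            key = (state, char)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        outputs[state] = parts
    return transitions, outputs


def decompose(word, transducer):
    """Return the tagged constituents of a contraction.

    Words that the transducer does not accept are returned unchanged,
    paired with None instead of a tag.
    """
    transitions, outputs = transducer
    state = 0
    for char in word.lower():
        state = transitions.get((state, char))
        if state is None:
            return [(word, None)]
    return outputs.get(state, [(word, None)])


FST = build_transducer(CONTRACTIONS)
print(decompose("daqui", FST))
```

Under these assumptions, decompose("daqui", FST) yields de tagged PREP followed by aqui tagged ADV, while a form that is absent from the table, such as gato, is returned unchanged with no tag. Since the list of contractions is a closed set, an exhaustive table of this kind is a plausible starting point.
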
For instance, the FST (Fig. 2 - Analysis of the contracted form daqui) decomposes daqui (a contraction of the preposition de (from) and the adverb aqui (here)) into its base constituents and, simultaneously, associates with them the grammatical information of the dictionary.</Paragraph> </Section> <Section position="4" start_page="76" end_page="76" type="sub_section"> <SectionTitle> 3.2 Disambiguation </SectionTitle> <Paragraph position="0"> Disambiguation can be done at different moments of parsing. a) Disambiguation during normalization. The normalization of texts for linguistic analysis uses FSTs to identify sentences and unambiguous compounds, and to solve contractions and elisions. As an example of disambiguation at this level, we still use the case of contractions. The form dele results from the contraction of de (of) with the ambiguous personal pronoun ele (he, him), which can be either a subjective (coded N) or a genitive form (coded O): ele, eu PRO+Pes N3ms O3ms. However, only genitive forms can occur in the contraction dele (de + ele). So, the FST</Paragraph> <Paragraph position="2"> (Fig. 3 - Analysis of the contracted forms dele, dela, deles, delas) (Fig. 1 - AdvIdos.grf) This set of adverbial phrases corresponds to a linguistic object of clearly finite-state nature, but linguistic phenomena of a more complex nature can be efficiently described by such formalisms (Gross (1997, 1995)). From the graphs of the local grammars, parsers (FSTs) can be automatically constructed that, applied to texts in combination with the dictionaries, allow the detection of a large variety of linguistic patterns (see below).</Paragraph> </Section> </Section> <Section position="5" start_page="76" end_page="78" type="metho"> <SectionTitle> 3 Transducers 
</SectionTitle> <Paragraph position="0"> Finite-state automata and transducers can be efficiently applied at various levels of linguistic</Paragraph> <Section position="1" start_page="76" end_page="77" type="sub_section"> <SectionTitle> analysis </SectionTitle> <Paragraph position="0"> is used not only to decompose the contracted form dele into its base constituents but also to disambiguate the pronouns ele, ela, eles, elas. Identical FSTs can be used to analyze more complex situations where both constituents of a contraction can inflect independently (footnote 4). b) Disambiguation for tagging. In Portuguese, a word such as compra can be either a noun or a verb; the form o can be a determiner, a demonstrative pronoun or a personal pronoun. So, the linear combination of these elements allows six different analyses. However, in sentences like Ela compra-o hoje (She buys it today), compra is only a verb, and o is only a personal pronoun, bound to the verb by a hyphen. The following FST (Fig. 4 - FST for the disambiguation of verbs and clitics) was built to solve these ambiguities: the five erroneous analyses are not taken into account, and compra and o receive the correct tags.</Paragraph> </Section> <Section position="2" start_page="77" end_page="77" type="sub_section"> <SectionTitle> 3.3 Numerical Determiners </SectionTitle> <Paragraph position="0"> The Portuguese numerical determiners from dois (2) to novecentos e noventa e nove mil novecentos e noventa e nove (999,999) are plural forms. However, some of them can inflect in gender:</Paragraph> <Paragraph position="2"> trezentos e vinte e dois <livros> (three hundred and twenty-two <books>) trezentas e vinte e duas <cadeiras> (three hundred and twenty-two <chairs>) Footnote 4: That is the case of aqueloutra, which is the contraction of the demonstrative pronouns aquela + outra (that (fs) + other (fs)). In Portuguese, even though contracted words are numerous, the list of contractions is still a closed set. So its description with FSTs is possible. However, this solution would 
not be adequate to describe productive phenomena involving agglutination, as is probably the case of most compound nouns in German, for instance. Others are invariant with respect to gender. Numerical determiners such as dois, duas and vinte are simple words and therefore they are formalized in the DELAF dictionary; numerical determiners such as trezentos e vinte e dois, trezentas e vinte e duas and mil e sete can be seen as special compound words that are more adequately described by FST. The first FST in figure</Paragraph> <Paragraph position="4"> Fig. 5 - FST for the identification of Numerical</Paragraph> </Section> <Section position="3" start_page="77" end_page="78" type="sub_section"> <SectionTitle> Determiners </SectionTitle> <Paragraph position="0"> describes all the compound numerical determiners from vinte e um (21) to novecentos e noventa e nove mil novecentos e noventa e nove (999,999), including feminine and invariant forms, associating with each of them the grammatical category and the corresponding numerical value, as in the examples: trezentos e vinte e dois, trezentos e vinte e dois DET+Num+Val=322 mp; trezentas e vinte e duas, trezentas e vinte e duas DET+Num+Val=322 fp; mil e sete, mil e sete DET+Num+Val=1007 mfp. The shaded nodes of the FST refer to embedded FSTs; for instance, CentenasMF refers to the sub-graph that represents all invariant compound determiners from cento e três (103) to cento e noventa e nove (199), and UnidadesMF represents all invariant units from três (3) to nove (9).</Paragraph> </Section> </Section> <Section position="6" start_page="78" end_page="78" type="metho"> <SectionTitle> 4 Parsing Texts Using INTEX Tools </SectionTitle> <Paragraph position="0"> The linguistic resources that we briefly described have been imported into INTEX, which applies them to large texts. We give here some examples of text processing, using a small text. a) Recognition of all compound words of the text. À semelhança de um código de barras que permite identificar uma infinidade de produtos, dependendo 
da sequência de números, o genoma humano também encerra quase todos os nossos segredos e, grosso modo, basta uma ligeira mutação num gene para que se manifeste uma doença ou, pelo contrário, uma resistência à mesma. A toda a hora novos genes são identificados: um dia é um gene associado à repulsa do tabaco, noutro um que traduz uma maior susceptibilidade de se ficar infectado por determinado vírus. Há um código para tudo. Mas todos estes dados constituem apenas 10 por cento do património genético humano conhecido. Um facto que deverá ser alterado em Fevereiro do próximo ano, se se puderem cumprir as previsões dos responsáveis pelo ambicioso Projecto do Genoma Humano. In the example, the compound words have been underlined. b) Indexing all occurrences of a given word. All the forms associated with the infinitive of the verb ser (to be): ão identificados um dia é um gene associado à rep / toda a hora novos genes são identificados um dia / o Um facto que deverá ser alterado em Fevereiro were identified and extracted into a concordance. c) Indexing a morphological pattern. The rational expression <DET+Art+Ind fs> (<E>+<A fs>) <N fs> (<E>+<PREP><N>) or the equivalent FST identifies feminine singular (fs) noun phrases that are specified by a determiner (DET) belonging to the class of indefinite articles (Art+Ind); the head of the noun phrase is a feminine singular noun (N fs), optionally (E) modified by an adjective in pre-nominal position or by a prepositional phrase (PREP N). In the first paragraph of the sample text, the NPs corresponding to those structures are (underlined): À semelhança de um código de barras que permite identificar uma infinidade de produtos, dependendo da sequência de números, o genoma humano também encerra quase todos os nossos segredos e, grosso modo, basta uma ligeira mutação num gene para que se manifeste uma doença ou, pelo contrário, uma resistência à mesma. A toda a hora novos genes são identificados: um dia é um gene associado à repulsa do tabaco, noutro um que traduz uma maior 
susceptibilidade de se ficar infectado por determinado vírus. d) Locating lexico-syntactic patterns. A regular expression (or a local grammar) of the form (<dever>+<poder>) (<ADV>+<E>) <V W> corresponds to syntactic constructions with the modal verbs dever, poder (must, can). The main verbs are in the infinitive form <V W>; an insert or an adverb (simple or compound) can occur between the two verbs. In the text, there are two constructions of this type: conhecido Um facto que deverá ser alterado em Fe / ro do próximo ano, se se puderem cumprir as previs</Paragraph> </Section> <Section position="7" start_page="78" end_page="78" type="metho"> <SectionTitle> 5 Maintaining and Increasing Dictionaries, using INTEX features 5.1 Simple words </SectionTitle> <Paragraph position="0"> To evaluate the coverage of the existing dictionary we apply it to varied corpora: the non-recognition of a word form indicates in general that (i) it is not in the dictionary, (ii) it was incorrectly formalized, (iii) it is a proper name, [...] remain in the lexicon of the language) are formalized and added to the dictionary, (ii) the erroneous entries must be corrected, (iii) proper names must be listed in special dictionaries, built from the exploration of existing catalogs. However, a lot of proper nouns are homographs of common ones that in some contexts are written in capitals (Bush and Rose can be either a proper noun or a common one), (iv) acronyms (if they have good prospects of surviving) must be listed and associated with the words that they represent. In general, acronyms are formally simple words, but they represent compounds. Our experiment of building such dictionaries indicates that the association of both types of lexical units is not a trivial task.</Paragraph> <Section position="1" start_page="78" end_page="78" type="sub_section"> <SectionTitle> 5.2 Compound nouns </SectionTitle> <Paragraph position="0"> The dictionaries of compound nouns are being enlarged in a semi-automatic way. We write regular expressions that correspond to typical 
patterns of compound nouns (e.g. <N ms> <A ms>), and then we ask INTEX to extract from texts (to which dictionaries have been applied previously) all patterns that match that structure. The resulting lists, integrated into a concordance, contain not only the combinations of a noun and an adjective but also compound nouns of that form that are followed by an adjective. Linguists interactively validate the lists of candidates for binary or ternary compounds.</Paragraph> </Section> </Section> </Paper>