<?xml version="1.0" standalone="yes"?> <Paper uid="C80-1090"> <Title>A METHOD TO REDUCE LARGE NUMBER OF CONCORDANCES.</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> I Introduction </SectionTitle> <Paragraph position="0"> In compiling a dictionary, those who define each word must study its set of concordances very carefully, so that no meaning or use is missed.</Paragraph> <Paragraph position="1"> There are, of course, some difficulties. On the one hand, the sample is never large enough to ensure that all the different meanings and uses of every word to be defined occur in it.</Paragraph> <Paragraph position="2"> This problem is solved by consulting other dictionaries and experts on the particular subject. On the other hand, some words have a very large number of occurrences, which makes their analysis very difficult, since it is not possible to keep in mind everything that is being analysed. 
At first thought this could be solved by taking a smaller number of concordances at random; however, reducing in this way risks losing the grammatical and semantic information contained in the concordances that are discarded; hence a method had to be devised that retains the maximum possible information.</Paragraph> <Paragraph position="3"> To solve this problem, the DEM presents a method whose aim is to obtain optimal information from the minimum number of concordances.</Paragraph> <Paragraph position="4"> For each concordance, the method analyses and compares the four words to the left and to the right of the word W, together with their associated grammatical categories, and establishes which of them are identical to which others in a particular context. A tree structure is generated.</Paragraph> <Paragraph position="5"> Once this is known, the number of concordances is reduced by selecting some of them that are considered representative.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> II Preliminary Requirements </SectionTitle> <Paragraph position="0"> Our sample (Corpus del Español Mexicano Contemporáneo: CEMC) consists of 1,973,151 occurrences, resulting in 65,200 different types, 1 whose frequencies vary from 1 to 68,252. 2 Some preliminary work has been done, consisting of the automatic labelling of every word of the corpus with its grammatical category, 2 in which, of the total number of occurrences, 1,083,945 were resolved automatically and the rest had to be resolved by hand; the results were then fed to the computer, yielding the complete labelled sample. 
We took advantage of this work, since otherwise it would have been impossible to try to reduce the number of concordances in terms of the same grammatical category.</Paragraph> <Paragraph position="1"> The next step was to implement a programme that produces, for any given word, its set of concordances, each word stating its own grammatical category.</Paragraph> <Paragraph position="2"> This set is stored in a file called CONCUERDA, organized as follows. Every concordance has three lines, each consisting of: - 6 characters (nnnnnn) reserved for the number of the occurrence.</Paragraph> <Paragraph position="3"> - 12 characters (tttppplll) reserved for the register of that line according to the original text, stating text code, page and line.</Paragraph> <Paragraph position="4"> - 72 characters reserved for the actual text. - 18 characters for the label of each word of the line, stating the grammatical category code. The first two characters indicate the number of words in the line.</Paragraph> <Paragraph position="5"> Figure number 1 shows part of file CONCUERDA and its organization.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> III The Algorithm </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Association of the i-Concordance to table </SectionTitle> <Paragraph position="0"> ORDENA.</Paragraph> <Paragraph position="1"> For each concordance, a table ORDENA is associated in the following way: - The word in question is located in the middle line and associated with ORDENA(5). - Four words are selected to the right and to the left of W, since they are supposed to carry the most significant grammatical and semantic information about the word W. 
3 We took this idea from the work of the Centre du Trésor de la Langue Française on the treatment of binary groups. Each of the next four words to the right of W will take its place in ORDENA(i+1) if and only if</Paragraph> <Paragraph position="3"> as they are considered to break up the continuity of a context.</Paragraph> <Paragraph position="4"> In a similar way, the words to the left of W are associated with their places in ORDENA. Figure No. 2 shows how to construct table ORDENA from a given concordance.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Generation of a Tree Structure starting </SectionTitle> <Paragraph position="0"> from ORDENA.</Paragraph> <Paragraph position="1"> Once this set of up to nine words has been obtained, a tree structure is constructed for the words to the right of W and another for the words to the left of W.</Paragraph> <Paragraph position="2"> Only the construction of the right branch of the tree is described here. The left one is generated immediately afterwards, in symmetric form: - The tree has a root node, which is the word W itself, and has five levels, the root being at level 5.</Paragraph> <Paragraph position="3"> - A direct descendant of a node w_i is given by the word w_j such that w_i w_j are adjacent, i.e. 
if w_i ∈ ORDENA(i) and w_j ∈ ORDENA(i+1), then w_j is a direct descendant of w_i.</Paragraph> <Paragraph position="4"> - The label of each node consists of: - The associated word w.</Paragraph> <Paragraph position="5"> - Its grammatical category.</Paragraph> <Paragraph position="6"> - Its frequency.</Paragraph> <Paragraph position="7"> And pointers to: - Direct ascendant.</Paragraph> <Paragraph position="8"> - First direct descendant.</Paragraph> <Paragraph position="9"> - Next node whose direct ascendant is the same as its own.</Paragraph> <Paragraph position="10"> - Another file called CONCORD, which stores the number of the concordance or concordances where that word in that</Paragraph> <Paragraph position="12"> particular context came from, thus making the retrieval operation possible.</Paragraph> <Paragraph position="13"> - A node has as many branches as there are different words found to be direct descendants of that word, with the same grammatical category, through the whole set of concordances.</Paragraph> <Paragraph position="14"> The process repeats itself until the last concordance has been processed.</Paragraph> <Paragraph position="15"> Figure No. 3 shows, for a set of 14 concordances, the left and right trees generated. [Figure: sample concordances of the word EDAD.]</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 The algorithm to select significant </SectionTitle> <Paragraph position="0"> concordances.</Paragraph> <Paragraph position="1"> Once the tree is fully constructed, the actual reduction is made. Some facts have to be considered beforehand: The more words are repeated in exactly the same context, the greater the probability that the meaning of the word W in that context is the same.</Paragraph> <Paragraph position="2"> A set of words repeated a small number of times may be more significant than another one repeated a larger number of times, since there are not so many different meanings or grammatical functions of a word W followed by the same set of words.</Paragraph> <Paragraph position="3"> The procedure is described next: In order to analyse the tree, a left-most path is followed.</Paragraph> <Paragraph position="4"> - A 6th level branch of the tree is analysed first (remember that the root is at level 5, and that the tree to the right of W is being analysed). 
If the frequency is greater than 1, then its leftmost direct descendant is analysed in the same way.</Paragraph> <Paragraph position="5"> If a 9th level node is reached in this way and the frequency n > 1, it means that the word W followed by these four words occurred n times, in n different concordances. As was said before, there is a good probability that the meaning of the word W in this particular context is the same in all n concordances; hence, by taking only one or two of them, by means of a random function, we obtain a significant concordance, and the remaining (n-1) or (n-2) can safely be omitted from the final output.</Paragraph> <Paragraph position="6"> - If at some intermediate level the frequency of the word associated with that node is found to be 1, the analysis of that branch would have to stop; however, it was thought that a further reduction was possible, not by identical words but by identical grammatical category. All direct descendants of its own direct ascendant with the same frequency and grammatical category are then found, and the number of these concordances is reduced.</Paragraph> <Paragraph position="7"> Clearly, the closer the level of reduction is to 5, the less significant the context; hence a larger number of concordances has to be chosen to maintain the required quality of information.</Paragraph> <Paragraph position="8"> After some study and many trials, our team of linguists* decided empirically that a reasonable pattern of reduction was the following: If the level of reduction is 4 or 6 and the frequency F~ 30 then the number of concordances selected Q would be</Paragraph> <Paragraph position="10"> * We thank in particular Paulette Levy for her valuable discussions and interesting suggestions.</Paragraph> <Paragraph position="11"> It has to be mentioned here that this pattern of reduction may be changed according to the word 
analysed, so as to obtain the best results each time. Once the number of concordances to be chosen is known (Q out of F), they are selected, again by means of a random function, and each one is marked as such, to prevent any of them from being selected more than once.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Output. </SectionTitle> <Paragraph position="0"> The final output indicates the group of repeated words, the grammatical category of the last word when applicable, and the frequency.</Paragraph> <Paragraph position="1"> Next, the Q concordances chosen are listed.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> IV The Computational System </SectionTitle> <Paragraph position="0"> The system was implemented in the University of Norway version of ALGOL 60, NUALGOL, for a UNIVAC 1106 computer of the &quot;Centro de Procesamiento Arturo Rosenblueth&quot; of the Secretaría de Educación</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Pública (Ministry of Education), </SectionTitle> <Paragraph position="0"> with 262K words of 36 bits of central memory and 8,000,000 characters of disc storage.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Data Storage. 
</SectionTitle> <Paragraph position="0"> We made use of 3 files: a) File CONCUERDA, described above, where the whole set of concordances of the word W was stored.</Paragraph> <Paragraph position="1"> b) Files ARBOL and CONCORD: these two files contain the information obtained while generating the right and left trees.</Paragraph> <Paragraph position="2"> ARBOL: Each node of the tree is stored in a line composed of 72 characters, distributed in the following way: 7 for the address of the next direct descendant of its own direct ascendant (i.e. the next brother), 7 for the address of the first direct descendant, 4 for the number of direct descendants (i.e. the number of branches emerging from it) and 6 for the address in file CONCORD that stores the number of the concordance it comes from. From the computational point of view, each of the trees is generated in the following way: - The root, whose associated node is the word W, is at a fixed address, and it will be present in every concordance. 
This word is taken from ORDENA(5). - The next word in ORDENA is stored by means of a hash function, and it is taken to be the same node as one previously stored if and only if the word, its grammatical category, level and direct ascendant are exactly the same; in that case the frequency is incremented by one and the number of this concordance is stored in file CONCORD in addition to the previous one.</Paragraph> <Paragraph position="3"> Otherwise it will be a new node.</Paragraph> <Paragraph position="4"> Figure No 5 shows part of file ARBOL; EDAD (AGE) is being processed.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> V Results And Applications </SectionTitle> <Paragraph position="0"> The first results were very encouraging, since for words with a medium number of concordances, say up to 600, we were able to reduce the number between 30% and 40%, depending on the word in question.</Paragraph> <Paragraph position="1"> No loss of information was reported (by comparing the original set of concordances with the reduced version). It is expected that for words with higher frequency, the method</Paragraph> <Paragraph position="2"> described here will be more efficient.</Paragraph> <Paragraph position="3"> However, from the computational point of view, there are still some difficulties, since the generation of each tree becomes very time consuming as the frequency of the word in question increases. We are still working to optimize it.</Paragraph> <Paragraph position="4"> The most important application, besides the original main objectives, is that this method makes it possible to find expressions and patterns of language that are repeated and used consistently.</Paragraph> </Section> </Paper>