<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1103"> <Title>Semiautomatic creation of taxonomies</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Extracting candidate tuples </SectionTitle> <Paragraph position="0"> The core of the original method was the extraction of a huge number of Spanish/English word pairs, Hbil, from both parts of a bilingual dictionary (footnote 4: ...gles/Espanol. Biblograf S.A. Barcelona, 1992).</Paragraph> <Paragraph position="1"> Then 17 automatic methods were constructed that, from Hbil, generated 17 sets of connections between Spanish words and WN synsets. These methods followed different pairing criteria, so the resulting sets presented diverse degrees of overlapping and different degrees of quality measured in terms of precision and coverage. For a complete explanation of the methods, see (Atserias et al., 1997). The 17 automatic methods can be grouped as shown below:</Paragraph> <Paragraph position="2"> - Methods 1, 2, 3, 4, 5, 6, 7, 8: connect Spanish words to WN synsets taking into account the multiplicity of the connection (1:1, 1:N, M:1, M:N) and the polysemy of English words in WN (mono, poly).</Paragraph> <Paragraph position="3"> - Methods 9, 10, 11, 12: correspond to different cases in which a Spanish word has more than one translation and the respective translations are linked by taxonomic relations in WN.</Paragraph> <Paragraph position="4"> - Methods 13, 14, 15: exploit the semantic relations in WN, using it as a semantic space for measuring conceptual distance between different elements (English translations of Spanish words co-occurring in definitions of a head-word entry, English translations of a head-word and its genus).
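The multiplicity and polysemy criteria of methods 1-8 can be sketched as follows. The function name, the data shapes, and the Spanish:English reading of the multiplicity label are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

def classify_pairs(pairs, wn_sense_count):
    """Label each Spanish-English pair with its connection multiplicity
    (1:1, 1:N, M:1, M:N) and the WN polysemy of the English word.
    Assumed shapes: pairs is a list of (spanish, english) tuples and
    wn_sense_count maps an English word to its number of WN synsets."""
    es_to_en = defaultdict(set)
    en_to_es = defaultdict(set)
    for es, en in pairs:
        es_to_en[es].add(en)
        en_to_es[en].add(es)
    labels = {}
    for es, en in pairs:
        left = "1" if len(en_to_es[en]) == 1 else "M"   # Spanish side count
        right = "1" if len(es_to_en[es]) == 1 else "N"  # English side count
        poly = "mono" if wn_sense_count.get(en, 0) == 1 else "poly"
        labels[(es, en)] = (left + ":" + right, poly)
    return labels
```

For instance, a Spanish word with two translations, each of which has that word as its only Spanish source, would be labeled 1:N.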
See (Agirre and Rigau, 1995).</Paragraph> <Paragraph position="5"> - Method 16: generates connections exploiting the coincidence of English words from the same translation in the same synset.</Paragraph> <Paragraph position="6"> - Method 17: generates connections exploiting thematic tags in the bilingual dictionary. Each of these methods generates a list of synset-word pairs. A given connection can be generated by more than one method.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Selecting the most appropriate tuples: first approach </SectionTitle> <Paragraph position="0"> In order to evaluate the quality of the different methods, a random stratified sample of around 2,500 links covering all the sets was extracted and verified manually. The results of this evaluation are presented in the 3rd column of table 1.</Paragraph> <Paragraph position="1"> All of the methods that reached a correctness of 85% or better were selected to build what was called Subset1. This yielded an initial set of 10,982 connections between Spanish words and WN synsets, out of a potential volume of 66,609.</Paragraph> <Paragraph position="2"> We assumed that if a connection is a joint solution of two methods, its probability of being correct is higher, so the joint accuracy of both methods will be higher than that of each method separately. Having checked the high degree of intersection between the solutions of the different methods, we decided to add to Subset1 those connections occurring as simultaneous solutions of two of the methods not considered in the previous phase, increasing coverage without losing precision.</Paragraph> <Paragraph position="3"> The links of those methods not selected for Subset1 were crossed, and the volume of each intersection set was calculated.
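Crossing the link sets and measuring intersection volumes, as just described, amounts to pairwise set intersection. A minimal sketch, with an assumed input shape:

```python
from itertools import combinations

def intersection_volumes(method_links):
    """Count the volume of every pairwise intersection between the link sets
    of the methods not selected for Subset1. Illustrative sketch: the input,
    a dict from method id to a set of (synset, word) links, is an assumption."""
    return {
        (a, b): len(method_links[a] & method_links[b])
        for a, b in combinations(sorted(method_links), 2)
    }
```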
The percentage of correct solutions in each set can be understood as an estimate of the probability that a connection proposed by a given pair of methods is correct. The solutions of certain pairs of methods presented accuracies equal to or above 85%, but some pairs lacked a sample large enough to ensure a reliable estimate of that probability.</Paragraph> <Paragraph position="4"> In this second step of the sampling we completed the manual evaluation of those groups that seemed promising. The results are summarized in table 2. The table identified 14 intersections with an accuracy equal to or above 85% (in bold). All connections belonging to those cells were selected to form the set of Intersections.</Paragraph> <Paragraph position="7"> Table 3 compares volumes and accuracies between the first sampling stage (Subset1), the connections extracted in the second stage of sampling (Intersections), and the set resulting from the fusion of both (Subset2). The cardinality of the fused set is less than the sum of the cardinalities because some connections belong to both sets. This suggests that the degree of intersection is far greater than 2, which is worth studying. Subset2 was the Spanish WordNet finally included within EuroWordNet (1999). This is the origin of the present work.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Extending the coverage </SectionTitle> <Paragraph position="0"> Spanish WordNet has been further developed, adding new synsets and variants and manually correcting the links in Subset2, reaching a total of 54,753 links, almost all manually verified.
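The fusion step of section 3, keeping every intersection cell at or above the 85% accuracy estimate and merging it with Subset1, can be sketched as below; all names and container shapes are assumptions:

```python
def fuse(subset1, cells, accuracy, threshold=0.85):
    """Merge Subset1 with all intersection cells whose estimated accuracy
    reaches the threshold. Because shared links are counted once, the
    cardinality of the union stays below the sum of the cardinalities.
    cells: (method_a, method_b) -> set of links; accuracy: same key -> float."""
    selected = set()
    for pair, links in cells.items():
        if accuracy.get(pair, 0.0) >= threshold:
            selected |= links
    return subset1 | selected
```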
As a result we now have a larger and more accurate database that allows us to perform more robust estimations of confidence scores for the different methods.</Paragraph> <Paragraph position="1"> We will call this manually verified database Subset3; it was extracted in December 2001. See table 6.</Paragraph> <Paragraph position="2"> Of the manually verified links, 20,013 correspond to connections extracted by the automatic methods. The difference between the two figures (15,535 and 20,013) is due to the insertions performed to obtain Subset3. Those inserted connections, as has already been pointed out, do not occur in Subset2 but can belong to the result set of some method not selected so far.</Paragraph> <Paragraph position="3"> With those 20,013 connections, all manually evaluated as correct (OK) or incorrect (KO), we constructed a third set to be used as a test set for re-evaluating the whole population, and tried to obtain, by means of a more detailed study of intersections, a Subset4 that would improve on the existing results in Subset3.</Paragraph> <Paragraph position="4"> Of the 20,013 connections, 17,140 are correct and 2,873 incorrect, giving an accuracy ratio above 85%. This result in some way validates the previous work, as our intention was to obtain a large set of connections with an accuracy above 85%.</Paragraph> <Paragraph position="5"> In obtaining Subset2, the high degree of intersection between the different methods was evident (a degree much larger than 2), but only the intersections of two methods were studied.
Our goal is to study whether the intersections can be exploited to extract an individual evaluation for each connection, and to derive a formula that calculates this value for new connections depending on the set of methods that propose them.</Paragraph> <Paragraph position="6"> Concretely, the aim of the present work is to study the statistical behavior of the links with respect to the set of methods supporting them, and not only intersections of two methods. To do this, all the data has been condensed into a matrix of 66,609 vectors, one vector per link, of the kind link m1 m2 ... m16 m17 eval where the mi are booleans indicating membership of the link in the solution set of method i, eval is the manual evaluation taking one of two values (OK for correct, KO for incorrect), and link is the pair (WN1.5 synset, Spanish word). From this matrix the 20,013 rows with manual verification have been extracted to be studied separately, with the aim of obtaining some statistical measure that allows us to select the good connections of the set, in order to later apply the statistic to the whole population.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Descriptive analysis </SectionTitle> <Paragraph position="0"> Using the set of 20,013 validated links, the accuracy of each method can be re-evaluated more precisely than in the first stage of the sampling (table 1). The results of the re-evaluation are shown in the 4th column of table 1.</Paragraph> <Paragraph position="1"> Comparing both columns, the methods show different accuracies. While some methods (1, 15, 16) keep a similar accuracy, some suffer a slight decrease (2, 3, 4), and most show a significant increase (5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 17). These variations must be interpreted with caution.</Paragraph> <Paragraph position="2"> On one side, methods 1, 2, 3, 4 and 16 were accepted completely in the first stage of the sampling.
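Given vectors of the kind described above, the per-method re-evaluation reduces to counting OK links among those each method proposes. A sketch with an assumed row layout:

```python
def method_accuracies(rows, n_methods=17):
    """Re-estimate each method's accuracy from manually verified rows of the
    form (link, membership_flags, evaluation), where membership_flags is a
    sequence of n_methods booleans and evaluation is "OK" or "KO".
    This layout mirrors the vectors in the text but is an assumption."""
    ok = [0] * n_methods
    tot = [0] * n_methods
    for _link, flags, evaluation in rows:
        for i, proposed in enumerate(flags):
            if proposed:
                tot[i] += 1
                if evaluation == "OK":
                    ok[i] += 1
    # None marks methods with no verified links at all
    return [ok[i] / tot[i] if tot[i] else None for i in range(n_methods)]
```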
Thus for these methods it is fair to study the changes observed directly. Methods 1 and 16 maintain the same levels, while methods 2, 3 and 4 suffer a slight decrease, more pronounced in the cases of methods 3 and 4.</Paragraph> <Paragraph position="3"> The rest of the methods were not selected completely, but only through intersections with other methods, and must be analyzed from another point of view. The general increase for these methods comes from the connections added manually during the revision process.</Paragraph> <Paragraph position="4"> Another assumption made in the generation of the set was that accuracy increases with the number of methods that propose a connection. Figure 1 shows the percentage of correct links by the number of coincident methods, and it does not support the assumption, as there does not seem to be a correlation between the number of methods that propose a connection and the percentage of correct solutions.</Paragraph> <Paragraph position="5"> Some methods are indeed highly correlated, and adding their evidence does not increase the global evidence. Figure 1 depicts only how many methods intersect, without distinguishing which methods intersect in each case. Some methods may have higher predictive quality than others, and this should be taken into account. It therefore seems better to analyze the relationship between methods according to the coincident links they share. If a study is performed using the vectors previously shown, each connection is evaluated against the set of methods supporting it.
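The statistic plotted in figure 1, the percentage of correct links grouped by the number of methods proposing them, can be computed as follows (same assumed row layout as before):

```python
from collections import defaultdict

def accuracy_by_support(rows):
    """Group verified links by how many methods propose them and return the
    fraction correct per group; an illustrative sketch of the figure 1
    computation, not the authors' code."""
    ok = defaultdict(int)
    tot = defaultdict(int)
    for _link, flags, evaluation in rows:
        k = sum(1 for f in flags if f)  # number of coincident methods
        tot[k] += 1
        if evaluation == "OK":
            ok[k] += 1
    return {k: ok[k] / tot[k] for k in tot}
```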
In this analysis, the fact that in the second stage of the sampling the less promising solution sets of certain pairs of methods were under-represented poses no risk, since each connection is now considered correct or not in relation to the full set of methods that propose it.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Logistic regression model </SectionTitle> <Paragraph position="0"> We first applied Principal Components Analysis, trying without success to find a spatial dispersion that separated correct and incorrect connections. We chose instead to apply a statistical method more appropriate to the problem: logistic regression, which is used to obtain an estimate of the accuracy of a connection from the set of methods that propose it.</Paragraph> <Paragraph position="1"> The set of boolean variables m1 ... m17 defines 2^17 possible combinations of methods that propose a certain connection. Associating to each connection a vector of this kind, the same description is used for all connections proposed by the same group of methods, and they collapse into the same point. Thus a new matrix can be constructed that, for the 2^17 possibilities, keeps the number of correct evaluations and the total number of evaluations, the number of incorrect ones being the difference between both values: m1 m2 ... m16 m17 nok ntot where nok is the number of correct evaluations for the solution set of each group of methods, and ntot accumulates the total number of evaluations.</Paragraph> <Paragraph position="2"> It is clear that the probability that a link is correct can be estimated by the expression p(OK) = nok / ntot (considering that the probability is the limit of the relative frequency).
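Collapsing the verified vectors into one row per combination of methods, with nok and ntot as defined above, can be sketched as:

```python
from collections import defaultdict

def collapse(rows):
    """Build the combination matrix described in the text: for each distinct
    pattern of method membership, keep (nok, ntot) so that p(OK) can be
    estimated as nok / ntot. Row layout is assumed as before."""
    counts = defaultdict(lambda: [0, 0])  # pattern -> [nok, ntot]
    for _link, flags, evaluation in rows:
        key = tuple(bool(f) for f in flags)
        counts[key][1] += 1
        if evaluation == "OK":
            counts[key][0] += 1
    return {k: (nok, ntot) for k, (nok, ntot) in counts.items()}
```

The estimate for a pattern is then simply nok divided by ntot.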
Logistic regression is a technique for finding a model (in the mathematical sense) that approximates p(OK) on the basis of a set of explanatory variables (in our case m1, m2, ... m17). In order to fit the model p(OK) = 1 / (1 + exp(-(b0 + b1*m1 + ... + b17*m17))),</Paragraph> <Paragraph position="4"> a maximum likelihood criterion is used to estimate each coefficient bi. It was observed that the analysis is disturbed by a series of combinations of methods with very low frequency in the sample (very low ntot), and it was decided to restrict (in this first phase) the study to those combinations of methods better represented in the sample; the analysis was then redone for the rows with ntot above a minimum frequency. This elimination means the loss of 1,210 evaluations, about 6% of the total, leaving a set of 18,803.</Paragraph> <Paragraph position="5"> The results can be refined further. The P value of each of the factors shows which methods are significant for explaining the probability of a link being OK.</Paragraph> <Paragraph position="6"> For a P value lower than 0.05, the method is significant in the model; this means that being supported by this method influences the probability of OK for a given link. The objective should be to find the minimal set of significant methods. This is an exponential problem, as all combinations would have to be tested. The backward method finds a local optimum by iteratively deleting the least informative variable among the non-significant ones. In the present study, two of the methods have a P value very close to 1, meaning they are not significant; performing the backward method, both of them are eliminated.
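A minimal maximum-likelihood logistic fit, written from scratch for illustration; the authors presumably used a statistical package that also reports the P values discussed above, which this sketch does not compute:

```python
import math

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit p(OK) = 1 / (1 + exp(-(b0 + sum_i b_i * m_i))) by maximizing the
    log-likelihood with plain batch gradient ascent. Illustrative sketch;
    X is a list of 0/1 method-membership vectors, y the 0/1 evaluations."""
    n = len(X[0])
    b0, b = 0.0, [0.0] * n
    for _ in range(epochs):
        g0, g = 0.0, [0.0] * n
        for xi, yi in zip(X, y):
            z = b0 + sum(bj * xj for bj, xj in zip(b, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = yi - p  # gradient of the log-likelihood w.r.t. z
            g0 += err
            for j in range(n):
                g[j] += err * xi[j]
        b0 += lr * g0 / len(X)
        b = [bj + lr * gj / len(X) for bj, gj in zip(b, g)]
    return b0, b

def predict(b0, b, xi):
    """Probability that a link supported by the methods in xi is correct."""
    return 1.0 / (1.0 + math.exp(-(b0 + sum(bj * xj for bj, xj in zip(b, xi)))))
```

On a toy data set where links proposed by a method are mostly correct, the fitted probabilities approach the empirical OK rates of each pattern.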
All P values for the remaining methods are close to 0 and the Pearson goodness-of-fit test gives a P value of 0.000, meaning the model and all its methods are significant.</Paragraph> <Paragraph position="7"> To evaluate the diagonal tendency of the model, a plot was made clustering the points in seven ranked groups, showing each point as a circle with a diameter proportional to the size of the group. It is shown in figure 2. In the graph the diagonal tendency can be clearly seen, showing that there is a correlation between both measures, the calculated and the real one. This is a good indicator, as it means that the method approximates the problem quite well. More importantly, the groups with the larger sizes are positioned closer to the diagonal than the poorly evaluated ones.</Paragraph> <Paragraph position="8"> 5 Selecting the most appropriate tuples: final approach After selecting only the methods that are significant to the model, there is a general formula to calculate the probability from the results of the regression model: p(OK) = 1 / (1 + exp(-(b0 + sum_i bi*mi))), where the values of the coefficients are presented in table 4.</Paragraph> <Paragraph position="9"> The formula obtained was applied to the total set of 66,604 connections proposed by the automatic methods, in order to obtain an individual evaluation for each of the points. By means of this estimate of the probability of correctness of each point, a global percentage can be estimated for the whole population; in this case it gives 84.86%. We now have an estimate of goodness for every proposed connection. We can thus study a thresholding mechanism for deciding whether or not to accept a given connection. The results are presented in table 5.
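Scoring every connection with the fitted formula, estimating the global percentage as the mean of the individual probabilities, and thresholding can be sketched as below; the coefficient values are stand-ins for those of table 4:

```python
import math

def score_connections(connections, b0, coefs):
    """Apply p(OK) = 1 / (1 + exp(-(b0 + sum_i b_i * m_i))) to every
    connection and estimate the global percentage of correct links as the
    mean of the individual probabilities. Sketch with assumed shapes:
    connections maps a link to its 0/1 vector of method membership."""
    scores = {}
    for link, methods in connections.items():
        z = b0 + sum(c * m for c, m in zip(coefs, methods))
        scores[link] = 1.0 / (1.0 + math.exp(-z))
    global_pct = 100.0 * sum(scores.values()) / len(scores)
    return scores, global_pct

def select(scores, threshold):
    """Keep only connections whose estimated p(OK) reaches the threshold."""
    return {link for link, p in scores.items() if p >= threshold}
```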
For each threshold the following have been calculated: the total number of connections selected, the number of synsets and Spanish words involved, the number of OK and KO evaluations, the estimate of the real global percentage (nok / (nok + nko)), the global estimate from the formula, and the degree of polysemy of the Spanish words (number of connections divided by number of words). From the results of table 5 several issues can be observed. On the one hand, in the initial set of connections there is a degree of polysemy of 4.99, whereas the global polysemy of Spanish nouns is 1.8, see (Rigau, 1998). This value would correspond to a threshold between 0.90 and 0.91. If we accept that Spanish senses may translate to more than one sense in WN, a higher polysemy degree could be accepted, but never one reaching 4.99.</Paragraph> <Paragraph position="10"> It is interesting to note the decreasing evolution of the illustrative variable polysemy as the threshold increases, reaching at the limit a value of 1.2, very close to monosemy. In order to choose a threshold dividing the population between good and bad results, the polysemy degree of the total population is a good indicator, and surprisingly well correlated with the formula.</Paragraph> <Paragraph position="11"> In the row at threshold 0.86 a drastic drop in the number of connections occurs, but it is not followed by a drop in the number of words or synsets: only the polysemy degree decreases, to 2.77, without a great loss of representativity of the whole set, which maintains a global estimate of 90.76%. This drop is caused by the disappearance of nearly 16,000 connections uniquely generated by method 8. The second jump takes place in the row at 0.91, where, losing a significant number of words and synsets, the polysemy drops to 1.57 with a global estimate of 93.03%.
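The per-threshold summary of table 5 can be sketched as follows; the reading of the polysemy degree as connections per Spanish word is an assumption about the garbled formula in the source:

```python
def threshold_stats(scores, threshold):
    """Summarize one threshold row in the style of table 5: connections kept,
    distinct synsets and Spanish words, and the polysemy degree, computed
    here as connections per Spanish word (an assumption, see lead-in).
    scores: dict mapping (synset, spanish_word) to estimated p(OK)."""
    kept = [link for link, p in scores.items() if p >= threshold]
    synsets = {s for s, _w in kept}
    words = {w for _s, w in kept}
    polysemy = len(kept) / len(words) if words else 0.0
    return {"connections": len(kept), "synsets": len(synsets),
            "words": len(words), "polysemy": polysemy}
```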
Beyond this point the loss in coverage is much greater than the gain in precision.</Paragraph> <Paragraph position="12"> What is really interesting is that, independently of the global estimate for the set, each connection has its own estimate. Thus, if a threshold of 86% is chosen, each of the 29,644 connections selected carries its own estimate, and this is important information; a verification procedure can be designed around the estimates of the points, for example starting with the lowest ones, to eliminate a larger number of errors, or starting with the highest ones, to ensure a higher proportion of correct results.</Paragraph> <Paragraph position="13"> Similarly, if the threshold of 86% were selected, for each of the 2,653 Spanish words lost the connection with the highest estimate (necessarily below 86%) could be included, trying to rescue the largest possible number of words by placing into the set at least one sense for each of those words. Alternatively, senses for those words could be examined from the highest estimate downward until one correct sense is found.</Paragraph> <Paragraph position="14"> This means that, once this result has been obtained, there is a wide range of possible ways to optimize the effort made and to maximize the number of correct solutions per manual evaluation.</Paragraph> <Paragraph position="15"> Finally, it must be noted that this process does not remove the need for a final round of hand-evaluating the connections obtained, in order to delete the wrong results that have been included in the final set.</Paragraph> <Paragraph position="16"> In order to give a final result, and to compare it with the results previously obtained, a threshold of 0.858 was selected as the cut point, and all the results above this limit were accepted. Table 6 shows the results. The correctness of the increment is unknown but falls within the limits of Subset4. The correctness of the regression comes from table 5.
The correctness of Subset4 has been estimated from the union of Subset3 and Regress.</Paragraph> </Section> </Section> </Paper>