<?xml version="1.0" standalone="yes"?> <Paper uid="J03-3006"> <Title>Automatic Association of Web Directories with Word Senses</Title> <Section position="4" start_page="488" end_page="492" type="intro"> <SectionTitle> 2. Algorithm </SectionTitle> <Paragraph position="0"> Overall, the system takes a WordNet 1.7 noun as input, generates a set of queries and submits them to the ODP, filters the information obtained from the search engine, and returns a set of ODP directories classified as (1) pseudo-domain labels for some word sense, (2) noise, or (3) salient noise (i.e., directories that are not suitable for any sense in WordNet but could reveal and characterize a new relevant sense of the noun). In case (1), the association between the WordNet sense and the ODP directory also receives a probability score. A detailed description of the algorithm steps follows.</Paragraph> <Section position="1" start_page="488" end_page="489" type="sub_section"> <SectionTitle> 2.1 Querying ODP Structure </SectionTitle> <Paragraph position="0"> For every sense w_i of the noun w, a query q_i is generated, including w as a compulsory term, the synonyms and direct hypernyms of w_i as optional terms, and the synonyms of the other senses of w as negated (forbidden) terms. These queries are submitted to the ODP, and a set of directories is retrieved.
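As an illustration only, the query construction just described can be sketched in Python. The Sense container, the build_query function, and the two-sense toy inventory are hypothetical names and data introduced for this sketch; the output imitates the bracketed query format of the paper's examples, and the ODP search interface itself is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    """Hypothetical container for one WordNet sense of the target noun."""
    synonyms: list[str]                                  # synonyms other than the noun itself
    hypernyms: list[str] = field(default_factory=list)   # direct hypernyms


def quote(term: str) -> str:
    """Wrap multiword terms in double quotes, as in the example queries."""
    return f'"{term}"' if " " in term else term


def build_query(noun: str, senses: list[Sense], i: int) -> str:
    """Build the query for the sense at index i (0-based)."""
    # Optional terms: synonyms and direct hypernyms of the chosen sense.
    optional = senses[i].synonyms + senses[i].hypernyms
    # Negated terms: synonyms of every other sense of the noun.
    forbidden = []
    for j, other in enumerate(senses):
        if j != i:
            forbidden.extend(other.synonyms)
    parts = ["+" + noun]                                  # compulsory term
    parts += [quote(t) for t in optional]
    parts += ["-" + quote(t) for t in forbidden if t not in optional]
    return "[" + " ".join(parts) + "]"


# Toy inventory (assumed data, not taken from WordNet).
senses = [
    Sense(["electrical circuit", "electric circuit"], ["electrical device"]),
    Sense(["tour"], ["journey"]),
]
print(build_query("circuit", senses, 1))
```

With this toy inventory the call prints [+circuit tour journey -"electrical circuit" -"electric circuit"]. The sketch follows the prose and negates only synonyms; the example queries also negate "electrical device", which looks like a hypernym rather than a synonym, so the actual system may negate additional terms.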
For instance, for circuit, the following queries are generated and sent to the ODP search engine: Santamaría, Gonzalo, and Verdejo Association of Web Directories with Word Senses q3= [+circuit path route itinerary -&quot;electrical circuit&quot; -&quot;electric circuit&quot; -&quot;electrical device&quot; -tour -&quot;racing circuit&quot; -lap -circle] q4= [+circuit group grouping -&quot;electrical circuit&quot; -&quot;electric circuit&quot; -&quot;electrical device&quot; -tour -&quot;racing circuit&quot; -lap -circle] q5= [+circuit &quot;racing circuit&quot; racetrack racecourse raceway track -&quot;electrical circuit&quot; -&quot;electric circuit&quot; -&quot;electrical device&quot; -tour -lap -circle] q6= [+circuit lap circle locomotion travel -&quot;electrical circuit&quot; -&quot;electric circuit&quot; -&quot;electrical device&quot; -tour -&quot;racing circuit&quot; -lap -circle]</Paragraph> </Section> <Section position="2" start_page="489" end_page="489" type="sub_section"> <SectionTitle> 2.2 Representing Retrieved Directory Descriptions </SectionTitle> <Paragraph position="0"> For every directory d, a list of words l(d) is obtained by removing stopwords and preserving all content words in the directory path.
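A minimal Python sketch of this step, under stated assumptions: the stopword list is a small illustrative one (the list actually used by the system is not given here), and directory_words is a hypothetical helper name.

```python
import re

# Illustrative stopword list (an assumption, not the system's own list).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "or", "the"}


def directory_words(path: str) -> list[str]:
    """Return l(d): the content words of a directory path, in order."""
    words = []
    for component in path.lower().split("/"):
        # Keep alphanumeric tokens; punctuation between words is skipped.
        for token in re.findall(r"[a-z0-9-]+", component):
            if token not in STOPWORDS:
                words.append(token)
    return words


d = "business/industries/electronics and electrical/contract manufacturers"
print(directory_words(d))
```

For the directory d used in the code, the sketch returns the six words business, industries, electronics, electrical, contract, manufacturers.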
For instance, one of the directories produced by the circuit queries is d = business/industries/electronics and electrical/contract manufacturers, which is characterized by the following word list: l(d)= [business, industries, electronics, electrical, contract, manufacturers]</Paragraph> </Section> <Section position="3" start_page="489" end_page="489" type="sub_section"> <SectionTitle> 2.3 Representing WordNet Senses </SectionTitle> <Paragraph position="0"> For every sense w_j,</Paragraph> <Paragraph position="2"> a list l(w_j) of words is made. For sense 1 of circuit (the electrical circuit sense), l(w_1)=[electrical circuit, electric circuit, electrical device, bridge, bridge circuit, Wheatstone bridge, bridged-T, closed circuit, loop, parallel circuit, shunt circuit, computer circuit, gate, logic gate, AND circuit, AND gate, NAND circuit, NAND gate, OR circuit, OR gate, X-OR circuit, XOR circuit, XOR gate, integrated circuit, (..) instrumentality, instrumentation, artifact, artefact, object, physical object, entity]</Paragraph> </Section> <Section position="4" start_page="489" end_page="490" type="sub_section"> <SectionTitle> 2.4 Sense/Directory Comparisons </SectionTitle> <Paragraph position="0"> For every sense description l(w_j), a comparison is made with the terms in the directory description l(d). This comparison is based on the hypothesis that the terms in an appropriate directory for a word sense will have some correlation with the sense description via WordNet semantic relations. In other words, our assumption is that the path to the directory in the ODP topical structure will have some degree of overlap with the hyponymy path to the word sense in the WordNet hierarchical structure. For this comparison, we simply count the number of co-occurrences between words in l(w_j) and words in l(d). Repeated terms are not discarded, as repetition is correlated with stronger associations.
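The counting step can be sketched in Python as follows. This is one possible reading rather than the system's implementation: multiword sense terms are split into single words, every occurrence on the sense side is counted (so repeated words add up), and normalize is a crude, assumed stand-in for whatever morphological normalization lets singular and plural forms match.

```python
from collections import Counter


def normalize(word: str) -> str:
    """Assumed normalization: lowercase and strip one final 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def count_matches(sense_terms, directory_words):
    """Count occurrences of sense-description words that appear in l(d)."""
    dir_set = {normalize(w) for w in directory_words}
    # Split multiword terms and keep repetitions on the sense side.
    sense_counts = Counter(
        normalize(tok) for term in sense_terms for tok in term.split()
    )
    return sum(n for word, n in sense_counts.items() if word in dir_set)


l_d = ["business", "industries", "electronics", "electrical",
       "contract", "manufacturers"]
# Toy sense description (a short, assumed excerpt, not the full list).
l_w = ["electrical circuit", "electric circuit",
       "electrical device", "electronic device"]
print(count_matches(l_w, l_d))
```

On this toy excerpt the count is 3: two occurrences of electrical and one of electronic, which matches electronics after normalization.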
Other, better-grounded comparisons, such as the cosine between l(w_j) and l(d), were discarded on empirical grounds because the average vectors are small and overlap little.</Paragraph> </Section> <Section position="5" start_page="490" end_page="490" type="sub_section"> <SectionTitle> Computational Linguistics Volume 29, Number 3 2.5 Candidate Sense/Directory Associations </SectionTitle> <Paragraph position="0"> The association vector v(d, w) has as many components as senses for w in WordNet 1.7; the ith component, v(d, w)_i, represents the number of matches between the directory description l(d) and the sense descriptor l(w_i). For instance, the association vector of business/industries/electronics and electrical/contract manufacturers with circuit is v(d, circuit)=(6, 0, 0, 0, 0, 0), that is, six coincidences for sense 1 (the electric circuit sense), which has the associated vector shown in the previous section (which includes five occurrences of electrical and one occurrence of electronic). The rest of the sense descriptions have no coincidences with the directory description.</Paragraph> <Paragraph position="1"> v(d, w) is the basis for making candidate assignments of suitable senses for directory d: If exactly one component v(d, w)_j is not null, we assign the sense w_j to the directory d. If all components are null, the directory is provisionally classified as noise or as a new sense. If more than one component is not null, the senses i with maximal</Paragraph> <Paragraph position="3"> v(d, w)_i are all considered candidates.
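The candidate selection rule can be sketched in Python as follows. The function names are hypothetical, matching is by exact word identity for brevity, and sense numbers are returned 1-based to follow the WordNet numbering used in the text.

```python
def association_vector(directory_words, sense_lists):
    """One component per sense: occurrences of sense words found in l(d)."""
    dir_set = set(directory_words)
    return [
        sum(1 for term in sense_words for tok in term.split() if tok in dir_set)
        for sense_words in sense_lists
    ]


def candidate_senses(vector):
    """Return the sense numbers (1-based) with the maximal non-null component."""
    best = max(vector, default=0)
    if best == 0:
        return []  # all components null: provisionally noise or a new sense
    return [i + 1 for i, value in enumerate(vector) if value == best]


print(candidate_senses([6, 0, 0, 0, 0, 0]))  # vector from the circuit example
print(candidate_senses([2, 0, 2]))           # a tie keeps both senses
```

The first call returns [1], the single candidate of the circuit example; the second returns [1, 3], since tied maximal components are all kept as candidates. An empty list signals the provisional noise or new-sense case.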
These candidate assignments are confirmed or discarded after passing a number of filters and receiving a confidence score C(d, w_j), both of which are described below.</Paragraph> </Section> <Section position="6" start_page="490" end_page="491" type="sub_section"> <SectionTitle> 2.6 Filters </SectionTitle> <Paragraph position="0"> Filters are simple heuristics that contribute to a more accurate classification of the relations predicted by the co-occurrence vector v(d, w). We are currently using two filters: One differentiates nouns and noun modifiers to prevent wrong associations, and another detects sense specializations.</Paragraph> <Paragraph position="1"> 2.6.1 Modifiers. Frequently, the ODP search engine retrieves directories in which the noun to be searched, w, plays the role of a noun modifier. Such cases usually produce erroneous associations. For instance, the directory library/sciences/animals &amp; wildlife/mammals/tamarins/golden lion tamarin is erroneously associated with the mammal sense of lion, which is here a modifier for tamarin.</Paragraph> <Paragraph position="2"> Modifiers are detected with a set of simple patterns, as the syntactic properties of descriptions in directories are quite simple. In particular, we discard most cases using the structure of the ODP hierarchy, as in this case. The filter analyzes the structure of the directory, detects that the parent category of golden lion tamarin is tamarin, therefore assumes that golden lion tamarin is a specialization of tamarin, and assigns the directory to a suitable sense of tamarin (tamarin 1 in WordNet).</Paragraph> <Paragraph position="3"> An additional filter (weaker than the previous one) discards compounds according to position (the searched noun precedes another noun), as in personal/kids/arts &amp; entertainment/movies/animals/lion king. This directory could be associated with lion 1 because it contains the word animal, but the assignment is rejected because of the modifier filter.
In general, on such occasions the searched noun plays a modifier role (as adjective or noun); discarding all such cases favors precision over recall. In this case, the label is classified as noise. A directory may also be seen as a characterization of a sense specialization for some of the word senses being considered; our algorithm tries to detect such cases, creating a hyponym of the sense and characterizing the directory with the hyponym.</Paragraph> <Paragraph position="4"> The filter identifies a directory as a candidate hyponym if it explicitly contains a modifier w pattern (where w is the noun being searched). This filter detects explicit specializations, such as office chair as a hyponym of chair 1, or fox family channel as a hyponym of channel 7, but fails to identify, for instance, memorial day as a hyponym of holiday.</Paragraph> <Paragraph position="5"> If the candidate hyponym, as a compound, is not present in WordNet, then it is incorporated and described with the directory. If it is already present in WordNet, an additional check of the hyponymy relation is made. For instance, the directory business/industries/electronics and electrical/components/integrated circuits is assigned to the WordNet entry integrated circuit, because integrated circuit is already a hyponym of circuit in WordNet.</Paragraph> </Section> <Section position="7" start_page="491" end_page="492" type="sub_section"> <SectionTitle> 2.7 Confidence Score </SectionTitle> <Paragraph position="0"> Finally, a confidence score C(d, w_j) for every potential association (d, w_j) is calculated using four empirical criteria: 1. Checking whether d was directly retrieved for the query associated with w_j.</Paragraph> <Paragraph position="1"> 2. Checking whether the system associates d with one or more senses of the word w.</Paragraph> <Paragraph position="2"> 3.
Checking the number of coincidences between l(d) and l(w_j).</Paragraph> <Paragraph position="3"> 4. Comparing the previous number with the number of coincidences between l(d) and the other sense descriptions {l(w_i), i ≠ j}.</Paragraph> <Paragraph position="5"> The confidence score is a linear combination of these factors, weighted according to an empirical estimation of their relevance:</Paragraph> <Paragraph position="7"> where v is the association vector v(d, w), n the number of senses, k the number of senses for which v</Paragraph> <Paragraph position="9"> are coefficients empirically adjusted to (a cannot take negative values, because, as (d, w_j) is a candidate association, v</Paragraph> <Paragraph position="11"> Let us see an example of how this confidence measure works, calculating C(d, w</Paragraph> <Paragraph position="13"> . This directory has been retrieved from the query q1= [+circuit &quot;electrical circuit&quot; &quot;electric circuit&quot; &quot;electrical device&quot; -tour -&quot;racing circuit&quot; -lap -circle] corresponding to circuit 1, which agrees with the association made by the system. Hence C . As all other components of v are null, the highest value of the components different from sense 1 is also null (max</Paragraph> <Paragraph position="15"> compared with the other possibilities. It decreases when v(d, w) includes more than one non-null coordinate, and their values are similar.</Paragraph> <Paragraph position="17"> coefficients, we obtain C(d, circuit 1)=0.975.</Paragraph> <Paragraph position="18"> The confidence score can be used to set a threshold for accepting/discarding associations. A higher threshold should produce a lower number of highly precise associations; a lower threshold would produce more associations with less accuracy.
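Thresholding itself is straightforward; a Python sketch follows, under the assumption that associations are held as (directory, sense, confidence) triples. The function name and the second triple are invented for illustration; only the 0.975 score comes from the example above.

```python
def accept_associations(scored, threshold):
    """Keep the (directory, sense, confidence) triples scoring at least threshold."""
    return [item for item in scored if item[2] >= threshold]


scored = [
    # Association and score from the circuit example in the text.
    ("business/industries/electronics and electrical/contract manufacturers",
     "circuit 1", 0.975),
    # Invented entry with a made-up score, for illustration only.
    ("hypothetical/directory", "circuit 5", 0.40),
]
print(len(accept_associations(scored, 0.5)))
print(len(accept_associations(scored, 0.0)))
```

With a threshold of 0.5 only the first association survives; with a threshold of 0.0 both are kept, which corresponds to retaining every directory regardless of its score.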
For the evaluation below, we have retained all directories, regardless of their confidence score, in order to assess how well this empirical measure correlates with correct and useful assignments.</Paragraph> <Paragraph position="19"> An example of the results produced by the algorithm can be seen in Table 1. The system assigns directories to senses 1, 2, and 5 of circuit (six, two, and three directories, respectively). Some of them are shown in the table, together with a sense specialization, integrated circuit, for sense 1 (electrical circuit). Senses 3, 4, and 6, which did not receive any directory association, do not appear to have domain specificity, but are instead general terms.</Paragraph> </Section> </Section> </Paper>