<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1135">
  <Title>Extracting Hyponyms of Prespecified Hypernyms from Itemizations and Headings in Web Documents</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Previous Work: AHRAI
</SectionTitle>
    <Paragraph position="0"> The Algorithm for Hyponymy Relation Acquisition from Itemization (AHRAI) acquires hyponymy relations from HTML documents according to three assumptions. Assumption A Expressions included in the same itemization or listing in an HTML document are likely to have a common hypernym.</Paragraph>
    <Paragraph position="1"> Assumption B Given a set of hyponyms that have a common hypernym, the hypernym appears in many documents that include the hyponyms.</Paragraph>
    <Paragraph position="2"> Assumption C Hyponyms and their hypernyms are semantically similar.</Paragraph>
    <Paragraph position="3"> We call expressions in an itemization hyponym candidates. A set of the hyponym candidates extracted from a single itemization or list is called a hyponym candidate set (HCS). For the itemization in Figure 1 (A), we would treat Toyota, Honda, and Nissan as hyponym candidates, and regard them as members of the same HCS.</Paragraph>
    <Paragraph position="4"> The procedure consists of the following four steps. Note that Steps 1-3 correspond to Assumptions A-C. Step 1 Extraction of hyponym candidates from itemized expressions in HTML documents.</Paragraph>
    <Paragraph position="5"> Step 2 Selection of a hypernym candidate by using document frequencies and inverse document frequencies.</Paragraph>
    <Paragraph position="6"> Step 3 Ranking of hypernym candidates and HCSs based on semantic similarities between hypernym and hyponym candidates.</Paragraph>
    <Paragraph position="7"> Step 4 Application of a few additional heuristics to elaborate computed hypernym candidates and hyponym candidates.</Paragraph>
    <Paragraph position="8"> Step 1 is performed by using a rather simple algorithm operating on HTML tags. See Shinzato and Torisawa, 2004, for more details. The other steps are described in the following.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Step 2
</SectionTitle>
      <Paragraph position="0"> In Step 2, the procedure selects a common hypernym candidate for an HCS. First, two sets of documents are prepared. The first is a large set of HTML documents that are randomly selected and downloaded. This set is called the global document set, and is assumed to indicate the general tendencies of word frequencies. Then the procedure downloads the documents that include each hyponym candidate in a given HCS by using an existing search engine (see footnote 1). This set is called the local document set, and is used to determine the strength of the association of nouns with the hyponym candidates.</Paragraph>
      <Paragraph position="1"> Let us denote a given HCS as C, the local document set obtained from all the items in C as LD(C), and the global document set as G. N is the set of nouns that can be hypernym candidates (see footnote 2). A hypernym candidate for C, denoted as h(C), is obtained through the following formula.</Paragraph>
      <Paragraph position="2"> $$h(C) = \operatorname{argmax}_{n \in N} hS(n, C), \qquad hS(n, C) = df(n, LD(C)) \cdot idf(n, G)$$</Paragraph>
      <Paragraph position="3"> df(n, D) is a document frequency, which is the number of documents including a noun n in a document set D. idf(n, G) is an inverse document frequency, which is defined as $\log(|G| / df(n, G))$.</Paragraph>
      <Paragraph position="4"> Footnote 1: As in Shinzato and Torisawa, 2004, we used the search engine &quot;goo&quot; (http://www.goo.ne.jp). Note that we enclosed the strings to be searched in double quotation marks so that the engine does not split them into words automatically.</Paragraph>
      <Paragraph position="5"> Footnote 2: We simply used the most frequent nouns observed in a large corpus as N.</Paragraph>
      <Paragraph position="6"> The score hS has a large value for a noun that appears in a large number of documents in the local document set and is found in a relatively small number of documents in the global document set. This reflects Assumption B given above.</Paragraph>
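      <Paragraph> As a minimal sketch of Step 2 (an illustration, not the original implementation), the following Python code ranks candidate nouns by hS, assuming the score is the product df(n, LD(C)) * idf(n, G) that the description above suggests. Representing each document as a set of nouns, the toy data, and all function names are assumptions made for this example; returning 0.0 for a noun that never occurs in G is likewise an arbitrary choice.

```python
import math

def df(noun, docs):
    """Document frequency: number of documents in `docs` that contain `noun`."""
    return sum(1 for doc in docs if noun in doc)

def idf(noun, global_docs):
    """Inverse document frequency log(|G| / df(n, G)); 0.0 if the noun is unseen in G."""
    count = df(noun, global_docs)
    return math.log(len(global_docs) / count) if count else 0.0

def hs(noun, local_docs, global_docs):
    """Assumed form of hS: local document frequency times global idf."""
    return df(noun, local_docs) * idf(noun, global_docs)

def ranked_candidates(nouns, local_docs, global_docs):
    """Nouns sorted by descending hS; the first element plays the role of h(C)."""
    return sorted(nouns, key=lambda n: hs(n, local_docs, global_docs), reverse=True)

# Toy data, invented for illustration: each document is the set of nouns it contains.
global_docs = [{"price", "car"}, {"price", "weather"}, {"price"}, {"music"}]
local_docs = [{"car", "price", "Toyota"}, {"car", "Honda"}, {"car", "price", "Nissan"}]
candidates = ["car", "price", "music"]

print(ranked_candidates(candidates, local_docs, global_docs))
```

      With this toy data, "car" ranks first because it occurs in all three local documents but in only one of the four global documents, whereas "price" is frequent in both sets and is therefore penalized by its low idf.</Paragraph>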
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Step 3
</SectionTitle>
      <Paragraph position="0"> By Step 2, the procedure can produce pairs of a hypernym candidate and an HCS, which are denoted by $\{\langle h(C_i), C_i \rangle\}_{i=1}^{m}$. Here, $C_i$ is an HCS, and $h(C_i)$ is a common hypernym candidate for the hyponym candidates in $C_i$.</Paragraph>
      <Paragraph position="1"> In Step 3, the similarity between hypernym candidates and hyponym candidates is considered to exclude non-hypernyms that are strongly associated with hyponym candidates from the hypernym candidates obtained by h(C), according to Assumption C. For instance, the non-hypernym &quot;price&quot; may be a value of $h(\{Toyota, Honda\})$ because it is strongly associated with the words Toyota and Honda in HTML documents. Such non-hypernyms are excluded based on the assumption that non-hypernyms have relatively low semantic similarities to the hyponym candidates, while the behavior of true hypernyms should be semantically similar to the hyponyms. In the &quot;price&quot; example, the similarity between &quot;price&quot; and &quot;Toyota&quot; is relatively low, and we can expect that &quot;price&quot; is excluded from the output. The semantic similarities between hyponym candidates in an HCS C and a hypernym candidate n are computed using a cosine measure between co-occurrence vectors:</Paragraph>
      <Paragraph position="2"> $$sim(n, C) = \frac{ho(C) \cdot hy(n)}{|ho(C)| \, |hy(n)|}$$</Paragraph>
      <Paragraph position="3"> Here, ho(C) denotes a co-occurrence vector of hyponym candidates, while hy(n) is the co-occurrence vector of a hypernym candidate n. Assume that all possible argument positions are denoted as $\{p_1, \cdots, p_l\}$ and $\{v_1, \cdots, v_o\}$ denotes a set of verbs. Then, the above vectors are defined as follows.</Paragraph>
      <Paragraph position="4"> $$ho(C) = \langle f_h(C, p_1, v_1), f_h(C, p_2, v_1), \cdots, f_h(C, p_l, v_o) \rangle, \qquad hy(n) = \langle f(n, p_1, v_1), f(n, p_2, v_1), \cdots, f(n, p_l, v_o) \rangle$$</Paragraph>
      <Paragraph position="5"> Here, $f_h(C, p, v)$ denotes the frequency of the hyponym candidates in an HCS C occupying an argument position p of a verb v in a local document set, and $f(n, p, v)$ is the frequency of a noun n occupying an argument position p of a verb v in a large document set.</Paragraph>
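      <Paragraph> The similarity computation can be sketched in Python as follows. This is an illustration under stated assumptions, not the original implementation: co-occurrence vectors are stored as sparse counts keyed by (argument position, verb), the input is a list of already parsed (noun, position, verb) triples, and the triples, position labels, and function names are invented for the example.

```python
import math
from collections import Counter

def cooccurrence_vector(triples, targets):
    """Count (argument position, verb) pairs over occurrences of any noun in `targets`."""
    return Counter((pos, verb) for noun, pos, verb in triples if noun in targets)

def cosine(u, v):
    """Cosine of two sparse vectors stored as dicts; 0.0 if either vector is all zeros."""
    dot = sum(weight * v.get(key, 0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy (noun, argument position, verb) triples, invented for illustration.
local_triples = [("Toyota", "subj", "build"), ("Honda", "subj", "build"),
                 ("Toyota", "obj", "buy")]
large_triples = [("company", "subj", "build"), ("company", "obj", "buy"),
                 ("price", "subj", "rise"), ("price", "obj", "pay")]

# ho(C) is built from the local document set, hy(n) from the large document set.
ho = cooccurrence_vector(local_triples, {"Toyota", "Honda"})
sim_company = cosine(ho, cooccurrence_vector(large_triples, {"company"}))
sim_price = cosine(ho, cooccurrence_vector(large_triples, {"price"}))
print(sim_company, sim_price)
```

      In this toy setting "price" shares no (position, verb) context with the hyponym candidates, so its similarity is zero and it would be pushed down the ranking, which mirrors the &quot;price&quot; example above.</Paragraph>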
      <Paragraph position="6"> The procedure sorts the hypernym-HCS pairs $\{\langle h(C_i), C_i \rangle\}_{i=1}^{m}$ using the value $sim(h(C_i), C_i) \cdot hS(h(C_i), C_i)$. Then, the top elements of the sorted pairs are likely to contain a hypernym candidate and an HCS that are semantically similar to each other. The final output of AHRAI is the top $k$ pairs in this ranking after some heuristic rules are applied to it in Step 4. In other words, the procedure discards the remaining $m - k$ pairs in the ranking because they tend to include erroneous hypernyms. The rules (Figure 2) are as follows. Rule 1 If the number of documents that include a hypernym candidate is less than the sum of the numbers of the documents that include an item in the HCS, then discard both the hypernym candidate and the HCS from the output.</Paragraph>
      <Paragraph position="7"> Rule 2 If a hypernym candidate appears as a substring of an item in its HCS and it is not a suffix of the item, then discard both the hypernym candidate and the HCS from the output. If a hypernym candidate is a suffix of its hyponym candidate, then half of the members of an HCS must have the hypernym candidate as their suffixes. Otherwise, discard both the hypernym candidate and its HCS from the output.</Paragraph>
      <Paragraph position="8"> Rule 3 If a hypernym candidate is an expression belonging to the category of place names, then replace it by &quot;place name.&quot; Recognition of place names was done by an existing morpho-</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Step 4
</SectionTitle>
      <Paragraph position="0"> The steps described up to now can produce a hypernym for hyponym candidates with a certain precision. However, Shinzato et al. reported that the rules shown in Figure 2 can contribute to higher accuracy. In general, we can expect that a hypernym is used in a wider range of contexts than its hyponyms, and that the number of documents including the hypernym candidate should be larger than the number of web documents including the hyponym candidates. This justifies Rule 1. Rule 2 is effective because Japanese is a head-final language, and the semantic head of a complex noun phrase is its last noun.</Paragraph>
      <Paragraph position="1"> Rule 3 was justified by the observation that when a set of place names is given as an HCS, the procedure tends to produce the name of the region that includes all the places designated by the hyponym candidates.</Paragraph>
      <Paragraph position="2"> (See Shinzato and Torisawa, 2004 for more details.) Recall that in Step 3, the ranked pairs of an HCS and its common hypernym are obtained. By applying the above rules to these, some pairs are removed from the ranked pairs, or are modified. For some given integer k, the top k pairs of the obtained ranked pairs become the final output of our procedure, as mentioned before.</Paragraph>
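      <Paragraph> A minimal Python sketch of how Rules 1 and 2 might be applied to the ranked pairs is given below. It is an illustration under assumptions rather than the original implementation: document frequencies are supplied as a precomputed dictionary, &quot;half of the members&quot; is read as &quot;at least half&quot;, Rule 3 is omitted because it depends on an external place-name recognizer, and the English toy data and function names are invented for the example.

```python
def passes_rule1(hypernym, hcs, doc_freq):
    """Rule 1: keep the pair only if df(hypernym) is at least the summed df of the items."""
    return doc_freq[hypernym] >= sum(doc_freq[item] for item in hcs)

def passes_rule2(hypernym, hcs):
    """Rule 2: substring and suffix checks on the items of the HCS."""
    # Discard if the hypernym occurs inside some item without being its suffix.
    if any(hypernym in item and not item.endswith(hypernym) for item in hcs):
        return False
    suffix_count = sum(1 for item in hcs if item.endswith(hypernym))
    if suffix_count > 0:
        # Assumed reading: at least half of the members must share the suffix.
        return 2 * suffix_count >= len(hcs)
    return True

def final_output(ranked_pairs, doc_freq, k):
    """Filter the ranked (hypernym, HCS) pairs with Rules 1 and 2, then keep the top k."""
    kept = [(h, hcs) for h, hcs in ranked_pairs
            if passes_rule1(h, hcs, doc_freq) and passes_rule2(h, hcs)]
    return kept[:k]

# Toy ranked pairs and document frequencies, invented for illustration.
ranked_pairs = [
    ("car", ["sports car", "used car", "truck"]),
    ("art", ["party", "cartoon"]),
    ("maker", ["Toyota", "Honda"]),
]
doc_freq = {"car": 1000, "sports car": 100, "used car": 200, "truck": 300,
            "art": 900, "party": 50, "cartoon": 40,
            "maker": 50, "Toyota": 400, "Honda": 300}

print(final_output(ranked_pairs, doc_freq, k=2))
```

      Here the pair for "art" is discarded by Rule 2 (it occurs inside "party" without being a suffix), and the pair for "maker" is discarded by Rule 1 (it appears in fewer documents than its items combined), so only the "car" pair survives.</Paragraph>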
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Proposed Method: HEAIH
</SectionTitle>
    <Paragraph position="0"> Our proposed method, Hyponym Extraction Algorithm from Itemizations and Headings (HEAIH), reuses some steps of AHRAI. The HEAIH procedure is given a set of $l$ hypernyms, denoted by $X = \{x_i\}_{i=1}^{l}$, where $x_i$ is a hypernym, and finds hyponyms for the hypernyms. The basic behavior of HEAIH is summarized as follows.</Paragraph>
    <Paragraph position="1"> First, it downloads the documents which are likely to contain itemizations consisting of hyponyms of the given hypernyms. This is done by generating possible headings or explanations of the itemizations by using prespecified linguistic patterns and by searching for documents that include these expressions with an existing search engine. Second, the procedure applies Steps 1 and 2 of AHRAI and computes a ranked list of hypernym candidates for each HCS extracted from the itemizations in the downloaded documents.</Paragraph>
    <Paragraph position="2"> The list is ranked in descending order of the hS score values. Note that the ranked list is generated independently from a given hypernym.</Paragraph>
    <Paragraph position="3"> We assume that a given hypernym is likely to be a true hypernym if the top elements of the ranked list of hypernym candidates contain many substrings of the hypernym. The procedure computes a score value, which is designed so that it has a large value when many substrings of the given hypernym are included in the list. Then, the pairs of a given hypernym and a corresponding HCS are sorted by the score value, and only the top k pairs are provided as the output of the whole procedure.</Paragraph>
    <Paragraph position="4"> More precisely, HEAIH consists of Steps A-E, each of which is described below.</Paragraph>
    <Paragraph position="5"> Step A For each of the given hypernyms, denoted by $x_i$, generate a set of strings which are typically used in headings, such as &quot;List of $x_i$,&quot; by using the prespecified patterns listed in Figure 3. The set of generated strings for a hypernym $x_i$ is denoted by $Hd(x_i)$. Give each string in $Hd(x_i)$ to an existing search engine and pick the string that has the maximum hit count in $Hd(x_i)$. Then, download the documents in the ranking produced by the engine for the selected string. In our experiments, we downloaded the top 25 documents for each hypernym if the ranking contained more than 25 documents. Otherwise, all the documents were downloaded.</Paragraph>
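    <Paragraph> Step A can be sketched in Python as follows. This is only an illustration: the search engine is replaced by two hypothetical callables (one returning a hit count, one returning a ranked document list), the second pattern is a made-up stand-in because only &quot;List of x&quot; is quoted in the text, and all names and data are assumptions.

```python
def heading_queries(hypernym, patterns):
    """Instantiate each heading pattern with the hypernym to obtain Hd(x)."""
    return [pattern.format(x=hypernym) for pattern in patterns]

def best_heading(hypernym, patterns, hit_count):
    """Pick the generated string with the maximum hit count."""
    return max(heading_queries(hypernym, patterns), key=hit_count)

def documents_for(hypernym, patterns, hit_count, search, limit=25):
    """Return at most `limit` top-ranked documents for the best heading string."""
    query = best_heading(hypernym, patterns, hit_count)
    return search(query)[:limit]

# Hypothetical stand-ins for the patterns and for the search engine.
PATTERNS = ["List of {x}", "Guide to {x}"]
FAKE_HITS = {"List of cars": 120, "Guide to cars": 45}
FAKE_RANKING = {"List of cars": ["doc%d" % i for i in range(30)]}

def fake_hit_count(query):
    return FAKE_HITS.get(query, 0)

def fake_search(query):
    return FAKE_RANKING.get(query, [])

docs = documents_for("cars", PATTERNS, fake_hit_count, fake_search)
print(best_heading("cars", PATTERNS, fake_hit_count), len(docs))
```

    Slicing the ranking with the limit covers both cases described above: a ranking with more than 25 documents is cut to the top 25, and a shorter ranking is returned in full.</Paragraph>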
    <Paragraph position="6"> Step B Identify the itemizations in the downloaded documents and extract the expressions in them by using Step 1 of AHRAI. The results obtained in this step are denoted by $B(X) = \{\langle x'_h, C_h \rangle\}_{h=1}^{m}$, where $x'_h$ is one of the given hypernyms and $C_h$ is an HCS extracted from a document downloaded for $x'_h$.</Paragraph>
    <Paragraph position="7"> Step C Apply Step 2 of AHRAI to each HCS $C_h$ such that $\langle x'_h, C_h \rangle \in B(X)$, and then obtain a ranked list that contains the top $p$ words according to the hS values and is ranked by the values. We denote the list as $HCList(C_h)$.</Paragraph>
    <Paragraph position="8"> Step D Sort the set $B(X) = \{\langle x'_h, C_h \rangle\}_{h=1}^{m}$ in descending order of the hSC value, which is given below.</Paragraph>
    <Paragraph position="9"> $$hSC(x, C) = sim(x, C) \cdot \sum_{n \in HCList(C),\ n \text{ is a substring of } x} hS(n, C)$$</Paragraph>
    <Paragraph position="10"> In short, the score hSC is the sum of the hS score values for the substrings of a given hypernym that are contained in the top $p$ elements of the ranked list produced by Step 2 of AHRAI. In our experiments, we assumed $p = 10$. In addition, the score is weighted by the similarity measure $sim(x, C)$.</Paragraph>
    <Paragraph position="11"> Step E Apply Rules 1 and 2 of AHRAI to each element of the sorted list obtained in Step D, and produce the top k pairs that survived the check by the rules as the final output. In our experiments, we assumed k = 200, while we obtained B(X) consisting of 2,034 pairs.</Paragraph>
    <Paragraph position="12"> Note that the weighting factor $sim(x, C)$ in hSC contributed to high accuracy in our experiments using a development set.</Paragraph>
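    <Paragraph> Steps C and D can be sketched in Python as follows, assuming (as the description of hSC suggests) that the score is sim(x, C) multiplied by the summed hS values of those top-p candidates that are substrings of the given hypernym. The hS and sim values are taken as precomputed inputs, and the function names and English toy data are invented for illustration; the rule check of Step E is left out.

```python
def hc_list(hs_scores, p=10):
    """Step C: top p candidate nouns by hS, as (noun, score) pairs in descending order."""
    return sorted(hs_scores.items(), key=lambda item: item[1], reverse=True)[:p]

def hsc(hypernym, ranked, sim_value):
    """Step D (assumed form): sim(x, C) times the summed hS of listed substrings of x."""
    return sim_value * sum(score for noun, score in ranked if noun in hypernym)

def rank_pairs(scored_pairs, k):
    """Sort (hypernym, HCS, hSC) triples by descending hSC and keep the top k."""
    return sorted(scored_pairs, key=lambda triple: triple[2], reverse=True)[:k]

# Toy hS scores for two HCSs downloaded for the hypernym "car maker" (invented values).
scores_a = {"maker": 4.0, "price": 3.0, "car": 2.5}
scores_b = {"price": 5.0, "shop": 1.0}

score_a = hsc("car maker", hc_list(scores_a), sim_value=0.8)
score_b = hsc("car maker", hc_list(scores_b), sim_value=0.9)

pairs = [("car maker", ("Shop A", "Shop B"), score_b),
         ("car maker", ("Toyota", "Honda"), score_a)]
print(rank_pairs(pairs, k=1))
```

    In this toy case the second HCS is ranked first because two substrings of the given hypernym ("maker" and "car") appear in its candidate list, while the first HCS receives a score of zero since none of its top candidates is a substring of the hypernym.</Paragraph>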
  </Section>
</Paper>