<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1135"> <Title>Extracting Hyponyms of Prespecified Hypernyms from Itemizations and Headings in Web Documents</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Hyponymy relations can play a crucial role in various NLP systems, and there have been many attempts to develop automatic methods to acquire hyponymy relations from text corpora (Hearst, 1992; Caraballo, 1999; Imasumi, 2001; Fleischman et al., 2003; Morin and Jacquemin, 2003; Ando et al., 2003). Most of these techniques have relied on particular linguistic patterns, such as &quot;NP such as NP.&quot; The frequencies of use for such linguistic patterns are relatively low, though, and there can be many expressions that do not appear in such patterns even if we look at large corpora. The effort of searching for other clues indicating hyponymy relations is thus significant.</Paragraph> <Paragraph position="1"> Our aim is to extract hyponyms of prespecified hypernyms from the WWW. We use itemizations (or lists) in HTML documents, such as the one in Figure 1(A), and their headings ('Car Company List' in the figure). In a similar attempt, Shinzato and Torisawa proposed an automatic method to obtain a common hypernym of expressions in the same itemizations in HTML documents (Shinzato and Torisawa, 2004) by using statistical measures such as document frequencies and inverse document frequencies. In the following, we call this method the Algorithm for Hyponymy Relation Acquisition from Itemizations (AHRAI). On the other hand, the method we propose in this paper is called Hyponym Extraction Algorithm from Itemizations and Headings (HEAIH).</Paragraph> <Paragraph position="2"> The difference between AHRAI and HEAIH is that HEAIH uses the headings attached to itemizations, while AHRAI obtains hypernyms without looking at the headings. 
This difference has a significant consequence in the acquisition of hyponymy relations. A hyponym tends to have more than one hypernym. For instance, &quot;Toyota&quot; can have at least two hypernyms, &quot;company&quot; and &quot;car.&quot; AHRAI may be able to obtain &quot;company,&quot; for instance, from the itemization presented in Figure 1(A), but it cannot simultaneously obtain &quot;car.&quot; Consider the itemization in Figure 1(B). Although the heading of the itemization says that the items are cars, AHRAI will provide &quot;company&quot; as a hypernym of the items. This is because AHRAI does not use the headings as clues for finding hypernyms and the itemizations in (A) and (B) are actually identical. Of course, the method could produce the hypernym &quot;car&quot; from different itemizations; this is unlikely, though, because the itemizations suggesting that &quot;Toyota&quot; is a &quot;car&quot; are likely to again include the names of other car manufacturers such as &quot;Nissan&quot; and &quot;Honda,&quot; so such itemizations must be more or less similar to the ones in the figure. In such situations, the procedure is likely to consistently produce &quot;company&quot; instead of &quot;car.&quot; On the other hand, HEAIH can simultaneously recognize &quot;Toyota&quot; as a hyponym of the two hypernyms by using the headings. Given a set of hypernyms whose hyponyms we would like to acquire, HEAIH finds the headings (or, more precisely, heading candidates) that include the given hypernyms, and extracts the itemizations located near the headings. The procedure then produces hyponymy relations under the assumption that the expressions in the itemizations are hyponyms of the given hypernym. 
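The basic procedure just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the Itemization class, the extract_hyponyms function, the plain substring test, and the rule of keeping only the longest matching hypernym per heading are all introduced here for illustration, and the sketch naively trusts that each heading candidate really belongs to its itemization.

```python
from dataclasses import dataclass


@dataclass
class Itemization:
    """Hypothetical container: one list plus its nearby heading candidate."""
    heading: str
    items: list


def extract_hyponyms(hypernyms, itemizations):
    """Return (hyponym, hypernym) pairs for headings that contain a hypernym."""
    relations = set()
    for itemization in itemizations:
        heading = itemization.heading.lower()
        # Plain substring test; a real system would need proper tokenization.
        matches = [h for h in hypernyms if h.lower() in heading]
        if not matches:
            continue
        # Assumption: when several given hypernyms match, keep the longest.
        hypernym = max(matches, key=len)
        for item in itemization.items:
            relations.add((item, hypernym))
    return relations


# Toy data modeled on the two headings discussed in the text.
pages = [
    Itemization("Car Company List", ["Toyota", "Honda", "Nissan"]),
    Itemization("Car List", ["Toyota", "Honda", "Nissan"]),
]
for hyponym, hypernym in sorted(extract_hyponyms(["car", "car company"], pages)):
    print(hyponym, "is a hyponym of", hypernym)
```

In this toy setting the same item can be paired with different hypernyms depending on the heading it appears under, which is the behavior that distinguishes HEAIH from AHRAI. 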
For example, given &quot;car&quot; and &quot;car company&quot; as hypernyms, the procedure finds documents including headings such as &quot;Car Company List&quot; and &quot;Car List.&quot; With luck, it finds documents such as those in Figure 1, and extracts the expressions &quot;Toyota,&quot; &quot;Honda,&quot; and &quot;Nissan&quot; from the itemizations near the headings. It then infers that &quot;Toyota&quot; is a hyponym of &quot;car company&quot; from document (A) in the figure, and that &quot;Toyota&quot; is a hyponym of &quot;car&quot; from (B). However, the task is not that simple. A problem is that we do not know how to identify the correspondence between itemizations and their headings precisely. One might expect, for instance, that the distance between an itemization and (candidates of) its heading in the HTML file could serve as a clue for finding the correspondence. However, we empirically show that this is not the case. To solve this problem, we use a modified version of AHRAI. This method can produce a ranked list of hypernym candidates from the itemizations only. We assume that if the top n elements of a ranked list produced by AHRAI include many substrings of a given hypernym, the heading including the hypernym is attached to the itemization. Note that the original AHRAI produces the top element of the ranked list as a hypernym, while HEAIH recognizes a string as a hypernym if its substrings are included in the top n elements of the ranked list. This helps HEAIH to acquire hyponymy relations that AHRAI cannot. Consider the itemizations in Figure 1. AHRAI may produce &quot;company&quot; as the top element of a ranked list for both (A) and (B). But if &quot;car&quot; is in the top n elements of the list as well, HEAIH can recognize &quot;car&quot; as a hypernym for (B).</Paragraph> <Paragraph position="3"> This paper is organized as follows. Section 2 describes AHRAI. 
Our proposed method, HEAIH, is presented in Section 3. The experimental results obtained by using Japanese HTML documents are presented in Section 4, where we compare our method with alternative methods.</Paragraph> </Section> </Paper>