<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1193"> <Title>Acquiring an Ontology for a Fundamental Vocabulary</Title> <Section position="4" start_page="0" end_page="2" type="metho"> <SectionTitle> 3 Ontology Extraction </SectionTitle> <Paragraph position="0"> In this section we present our work on creating an ontology. Past research on knowledge acquisition from definition sentences in Japanese has primarily dealt with the task of automatically generating hierarchical structures. Tsurumaru et al. (1991) developed a system for automatic thesaurus construction based on information derived from analysis of the terminal clauses of definition sentences. It was successful in classifying hyponym, meronym, and synonym relationships between words. However, it lacked any concrete evaluation of the accuracy of the hierarchies created, and it linked only words, not senses. More recently, Tokunaga et al. (2001) created a thesaurus from a machine-readable dictionary and combined it with an existing thesaurus (Ikehara et al., 1997).</Paragraph> <Paragraph position="1"> For other languages, early work for English linked senses by exploiting dictionary domain codes and other heuristics (Copestake, 1990), and more recent work links senses for Spanish and French using more general WSD techniques (Rigau et al., 1997). Our goal is similar. We wish to link each word sense in the fundamental vocabulary into an ontology. The ontology is primarily a hierarchy of hyponym (is-a) relations, but it also contains several other relationships, such as abbreviation, synonym and domain.</Paragraph> <Paragraph position="2"> We extract the relations from the semantic output of the parsed definition sentences. The output is written in Minimal Recursion Semantics (Copestake et al., 2001).
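The relation types described above (hypernym, abbreviation, synonym and domain) can be pictured as simple sense-to-target links. A minimal sketch in Python; the field names and sample entries are our own illustration, not the actual Lexeed data:

```python
from collections import namedtuple

# One extracted link: a headword sense, a relation type, and a target word.
# Relation types follow the text: hypernym (the default), plus
# abbreviation, synonym and domain.  Sense numbers are illustrative.
Relation = namedtuple("Relation", "source relation target")

links = [
    Relation("doraiba_2", "hypernym", "hito"),    # driver2 is-a person
    Relation("doraiba_3", "domain", "gorufu"),    # driver3 used in golf
]

# Group the extracted links by relation type.
by_type = {}
for link in links:
    by_type.setdefault(link.relation, []).append(link)

print(sorted(by_type))
```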
Previous work has successfully used regular expressions, both for English (Barnbrook, 2002) and Japanese (Tsurumaru et al., 1991; Tokunaga et al., 2001).</Paragraph> <Paragraph position="3"> Regular expressions are extremely robust and relatively easy to construct. However, we use a parser, for four reasons. The first is that it makes our knowledge acquisition more language independent. If we have a parser that can produce MRS, and a machine-readable dictionary for that language, the knowledge acquisition system can easily be ported. The second reason is that we can go on to use the parser and acquisition system to acquire knowledge from non-dictionary sources. Fujii and Ishikawa (2004) have shown how it is possible to identify definitions semi-automatically; however, these sources are not as standardized as dictionaries and are thus harder to parse using only regular expressions. The third reason is that we can more easily acquire knowledge beyond simple hypernyms, for example, identifying synonyms through common definition patterns as proposed by Tsuchiya et al. (2001). The final reason is that we are ultimately interested in language understanding, and thus wish to develop a parser. Any effort spent in building and refining regular expressions is not reusable, while creating and improving a grammar has intrinsic value.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.1 The Extraction Process </SectionTitle> <Paragraph position="0"> To extract hypernyms, we parse the first definition sentence for each sense. The parser uses the stochastic parse ranking model learned from the Hinoki treebank, and returns the MRS of the first-ranked parse.
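The MRS the parser returns can be thought of, in much-simplified form, as a hook plus a bag of elementary predicates. A sketch under our own simplification (the field and predicate names are invented for illustration, and scoping constraints are ignored entirely):

```python
# A much-simplified MRS: a hook index and a bag of elementary
# predicates, each with a relation name and role arguments.
def make_mrs(hook_index, eps):
    return {"hook": hook_index, "eps": eps}

# "somebody who drives a car": hito/person is the EP that shares
# the hook index, i.e. the genus term.
mrs = make_mrs("x1", [
    {"pred": "hito_n", "arg0": "x1"},    # person(x1)
    {"pred": "unten_v", "arg0": "e2", "arg1": "x1", "arg2": "x3"},
    {"pred": "kuruma_n", "arg0": "x3"},  # car(x3)
])

print(mrs["hook"])
```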
Currently, just over 80% of the sentences can be parsed.</Paragraph> <Paragraph position="1"> An MRS consists of a bag of labeled elementary predicates and their arguments, a list of scoping constraints, and a pair of relations that provide a hook into the representation: a label, which must outscope all the handles, and an index (Copestake et al., 2001). The MRSs for the definition sentence for doraibā2 and its English equivalent are given in Figure 2. The hook's label and index are shown first, followed by the list of elementary predicates. The figure omits some details (message type and scope have been suppressed).</Paragraph> <Paragraph position="3"> In most cases, the first sentence of a dictionary definition consists of a fragment headed by the same part of speech as the headword.</Paragraph> <Paragraph position="4"> Thus the noun driver is defined as a noun phrase. The fragment consists of a genus term (somebody) and differentia (who drives a car).1 The genus term is generally the most semantically salient word in the definition sentence: the word with the same index as the index of the hook. For example, for sense 2 of the word doraibā, the hypernym is hito "person" (Figure 2). Although the actual hypernym appears in very different positions in the Japanese and English definition sentences, it is the hook in both semantic representations.</Paragraph> <Paragraph position="5"> For some definition sentences (around 20%), further parsing of the semantic representation is necessary. The most common case is one where the index is linked to a coordinate construction.</Paragraph> <Paragraph position="6"> In that case, the coordinated elements have to be extracted, and we build two relationships.</Paragraph> <Paragraph position="7"> Other common cases are those where the relationship between the headword and the genus is given explicitly in the definition sentence: for example in (1), where the relationship is given as abbreviation.
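The selection of the genus term, and the splitting of a coordinated genus into two links, can be sketched as follows. This is a toy version of the procedure just described, operating on the simplified MRS dictionaries of our earlier sketch; the predicate and feature names are our own:

```python
def genus_terms(mrs):
    """Return the genus predicates: EPs whose ARG0 is the hook index.

    If the hook points at a coordination EP, follow its left and right
    conjunct indices instead, yielding one link per conjunct.
    """
    index = mrs["hook"]
    eps_by_arg0 = {}
    for ep in mrs["eps"]:
        eps_by_arg0.setdefault(ep["arg0"], []).append(ep)

    results = []
    for ep in eps_by_arg0.get(index, []):
        if ep["pred"] == "coord":  # coordinate construction
            for conj in (ep["l-index"], ep["r-index"]):
                results.extend(e["pred"] for e in eps_by_arg0.get(conj, []))
        else:
            results.append(ep["pred"])
    return results

# "the Alps or the Japanese Alps" as a coordination of two names:
mrs = {"hook": "x0", "eps": [
    {"pred": "coord", "arg0": "x0", "l-index": "x1", "r-index": "x2"},
    {"pred": "alps_n", "arg0": "x1"},
    {"pred": "japanese_alps_n", "arg0": "x2"},
]}
print(genus_terms(mrs))
```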
We initially process the relation, ryaku "abbreviation", yielding the coordinate structure. This in turn gives two words: an abbreviation for the Alps or the</Paragraph> </Section> <Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Japanese Alps </SectionTitle> <Paragraph position="0"> The extent to which non-hypernym relations are included as text in the definition sentences, as opposed to being stored as separate fields, varies from dictionary to dictionary. For knowledge acquisition from open text, we cannot expect any labeled features, so the ability to extract information from plain text is important.</Paragraph> <Paragraph position="1"> We also extract information that is not explicitly labeled. The phrase representing the domain has wide scope: in effect the definition means "In golf, [a driver3 is] a club for playing long strokes". The phrase that specifies the domain should modify a non-expressed predicate. To parse this, we added a construction to the grammar that allows an NP fragment heading an utterance to have an adpositional modifier. We then extract these modifiers and take the head of the noun phrase to be the domain. Again, this is hard to do reliably with regular expressions, as an initial NP followed by de could be a copula phrase, or a PP that attaches anywhere within the definition; not all such initial phrases restrict the domain. Most of the domains extracted fall under a few superordinate terms, mainly sport, games and religion. Other, more general domains are marked explicitly in Lexeed as features. Japanese equivalents of the following words have a sense marked as being in the domain golf: approach, edge, down, tee, driver, handicap, pin, long shot.</Paragraph> <Paragraph position="2"> We summarize the links acquired in Table 2, grouped by coarse part of speech.
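The domain extraction step can be sketched as: if the NP fragment heading the utterance carries an adpositional modifier, take the head noun of that modifier as the domain. This is a toy approximation of the grammar-based analysis above; the dictionary keys are ours:

```python
def extract_domain(fragment):
    """Toy domain extraction from a parsed NP-fragment definition.

    fragment: {"head": noun, "modifiers": [{"type": ..., "np_head": ...}]}
    Only adpositional modifiers of the fragment head are treated as
    domain restrictions, mirroring the grammar construction above.
    """
    for mod in fragment.get("modifiers", []):
        if mod["type"] == "adposition":
            return mod["np_head"]
    return None

# driver3: "in golf, a club for playing long strokes"
driver3 = {
    "head": "club",
    "modifiers": [{"type": "adposition", "np_head": "golf"}],
}
print(extract_domain(driver3))
```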
The first three lines show hypernym relations: implicit hypernyms (the default), explicitly indicated hypernyms, and explicitly indicated hyponyms.</Paragraph> <Paragraph position="4"> The next three lines show other relations: abbreviations, names and domains. Implicit hypernyms are by far the most common relations: fewer than 10% of entries are marked with an explicit relationship.</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.2 Verification with Goi-Taikei </SectionTitle> <Paragraph position="0"> We verified our results by comparing the hypernym links to the manually constructed Japanese ontology Goi-Taikei. It is a hierarchy of 2,710 semantic classes, defined for over 264,312 nouns (Ikehara et al., 1997). Because the semantic classes are only defined for nouns (including verbal nouns), we can only compare nouns. Senses are linked to Goi-Taikei semantic classes by the following heuristic: look up the semantic classes C for both the headword (wi) and the genus term(s) (wg). If at least one of the index word's semantic classes is subsumed by at least one of the genus' semantic classes, then we consider their relationship confirmed (1).</Paragraph> <Paragraph position="2"> In the event of an explicit hyponym relationship indicated between the headword and the genus, the test is reversed: we look for an instance of the genus' class being subsumed by the headword's class (cg ⊂ ch). Our results are summarized in Table 3. The total is 58.5% (15,888 confirmed out of 27,146). Adding in the named and abbreviation relations, the coverage is 60.7%. This is comparable to the coverage of Tokunaga et al.
(2001), who get a coverage of 61.4%, extracting relations using regular expressions from a different dictionary.</Paragraph> </Section> <Section position="4" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.3 Extending the Goi-Taikei </SectionTitle> <Paragraph position="0"> In general we are extracting pairs with more information than the Goi-Taikei hierarchy of 2,710 classes. For 45.4% of the confirmed relations, both the headword and its genus term were in the same Goi-Taikei semantic class. In particular, many classes contain a mixture of class names and instance names: buta niku "pork" and niku "meat" are in the same class, as are doramu "drum" and dagakki "percussion instrument", which we can now distinguish. This conflation has caused problems in applications such as question answering as well as in fundamental research on linking syntax and semantics (Bond and Vatikiotis-Bateson, 2002).</Paragraph> <Paragraph position="1"> An example of a more detailed hierarchy deduced from Lexeed is given in Figure 4. All of the words come from the same Goi-Taikei semantic class, ⟨842:condiment⟩, but are given more structure by the thesaurus we have induced.</Paragraph> <Paragraph position="2"> There are still some inconsistencies: ketchup is directly under condiment, while tomato sauce and tomato ketchup are under sauce. This reflects the structure of the original machine-readable dictionary.</Paragraph> </Section> </Section> <Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 4 Discussion and Further Work </SectionTitle> <Paragraph position="0"> From a language engineering point of view, we found the ontology extraction an extremely useful check on the output of the grammar/parser.</Paragraph> <Paragraph position="1"> Treebanking tends to focus on the syntactic structure, and it is all too easy to miss a malformed semantic structure.
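One concrete check of this kind is to verify that every argument variable in the semantic output is actually bound by some elementary predicate. A minimal sketch of such a check over our simplified MRS dictionaries; this is our own illustration, not the actual Hinoki tooling:

```python
def unbound_arguments(mrs):
    """Return argument variables not introduced (as ARG0) by any EP.

    A malformed semantic structure often shows up as an argument that
    no elementary predicate binds; such variables are flagged here.
    """
    introduced = set()
    used = set()
    for ep in mrs["eps"]:
        for role, value in ep.items():
            if role == "pred":
                continue
            if role == "arg0":
                introduced.add(value)
            else:
                used.add(value)
    return sorted(used - introduced)

good = {"eps": [{"pred": "hito_n", "arg0": "x1"},
                {"pred": "unten_v", "arg0": "e2", "arg1": "x1"}]}
bad = {"eps": [{"pred": "unten_v", "arg0": "e2", "arg1": "x9"}]}
print(unbound_arguments(good), unbound_arguments(bad))
```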
Parsing the semantic output revealed numerous oversights, especially in binding arguments in complex rules and lexical entries.</Paragraph> <Paragraph position="2"> It also reveals some gaps in the Goi-Taikei coverage. For the word doraibā "driver" (shown in Figure 1), the first two hypernyms are confirmed. However, doraibā in Goi-Taikei only has two semantic classes: ⟨942:tool⟩ and ⟨292:driver⟩. It does not have the semantic class ⟨921:leisure equipment⟩. Therefore we cannot confirm the third link, even though it is correct and the domain is correctly extracted.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Further Work </SectionTitle> <Paragraph position="0"> There are four main areas in which we wish to extend this research: improving the grammar, extending the extraction process itself, further exploiting the extracted relations, and creating a thesaurus from an English dictionary.</Paragraph> <Paragraph position="1"> As well as extending the coverage of the grammar, we are investigating making the semantics more tractable. In particular, we are investigating the best way to represent the semantics of explicit relations such as isshu "a kind of".2 2 These are often transparent nouns: those nouns which are transparent with regard to collocational or selectional relations between their dependent and the external context. We are extending the extraction process by adding new explicit relations, such as teineigo "polite form". For word senses such as driver3, where there is no appropriate Goi-Taikei class, we intend to estimate the semantic class by using the definition sentence as a vector and looking for words with similar definitions (Kasahara et al., 1997).</Paragraph> <Paragraph position="2"> We are extending the extracted relations in several ways. One way is to link the hypernyms to the relevant word sense, not just the word.
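The definition-as-vector idea mentioned above can be sketched with a simple bag-of-words cosine similarity. This is our own minimal illustration with toy English stand-ins, not the actual method of Kasahara et al. (1997):

```python
import math
from collections import Counter

def cosine(def_a, def_b):
    """Cosine similarity between two definitions as bags of words."""
    a, b = Counter(def_a.split()), Counter(def_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# A sense with no Goi-Taikei class inherits the class of the most
# similar already-classified definition (toy stand-in definitions):
driver3 = "a club for playing long strokes in golf"
candidates = {
    "putter": "a club for playing short strokes in golf",
    "pork": "the meat of a pig used as food",
}
best = max(candidates, key=lambda w: cosine(driver3, candidates[w]))
print(best)
```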
If we know that kurabu "club" falls under ⟨921:leisure equipment⟩, then this rules out the card-suit "clubs" and the "association of people with similar interests" senses. Other heuristics have been proposed by Rigau et al.</Paragraph> <Paragraph position="3"> (1997). Another way is to use the thesaurus to predict which words name explicit relationships that need to be extracted separately (like abbreviation).</Paragraph> </Section> </Section> </Paper>