<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1901">
  <Title>The Hinoki Treebank: Working Toward Text Understanding</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In this paper we describe the current state of a new lexical resource: the Hinoki treebank. Its motivation and initial construction were described in detail in Bond et al. (2004a). The ultimate goal of our research is natural language understanding: we aim to create a system that can parse text into a useful semantic representation, ideally one whose output can be used directly to update our semantic models. This is an ambitious goal, and this paper does not present a completed solution, but rather a road-map to the solution, with some progress along the way.</Paragraph>
    <Paragraph position="1"> The mid-term goal is to build a thesaurus from dictionary definition sentences and use it to enhance a stochastic parse ranking model that combines syntactic and semantic information. To this end, the Hinoki project combines syntactic annotation with word sense tagging. This will make it possible to test similarity-based and/or class-based approaches together with symbolic grammars and statistical models. Our aim in this is to alleviate data sparseness. In the Penn Wall Street Journal treebank (Taylor et al., 2003), for example, the words stocks and skyrocket never appear together. However, their superordinate concepts, capital (for stocks) and move upward (for skyrocket), often do.</Paragraph>
    <Paragraph position="2"> We are constructing the ontology from the machine-readable dictionary Lexeed (Kasahara et al., 2004). This is a hand-built, self-contained lexicon: it consists of headwords and their definitions for the 28,000 most familiar words of Japanese. This set is large enough to include most basic-level words and covers over 75% of the common word tokens in a sample of Japanese newspaper text. In order to make the system self-sustaining, we base the first growth of our treebank on the dictionary definition sentences themselves. We then train a statistical model on the treebank and parse the entire lexicon.</Paragraph>
    <Paragraph position="3"> From this we induce a thesaurus. We are currently tagging the definition sentences with senses. We will then use this information and the thesaurus to build a model that combines syntactic and semantic information. We will also produce a richer ontology, for example by extracting selectional preferences. In the last phase, we will look at ways of extending our lexicon and ontology to less familiar words.</Paragraph>
    <Paragraph position="4"> In this paper we present the results from treebanking 38,900 dictionary sentences. We also highlight two uses of the treebank: building the statistical models and inducing the thesaurus.</Paragraph>
  </Section>
</Paper>