<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1008">
  <Title>Finding Parts in Very Large Corpora</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> We present a method of extracting parts of objects from wholes (e.g. &amp;quot;speedometer&amp;quot; from &amp;quot;car&amp;quot;). To be more precise, given a single word denoting some entity that has recognizable parts, the system finds and rank-orders other words that may denote parts of the entity in question. Thus the relation found is strictly speaking between words, a relation Miller \[1\] calls &amp;quot;meronymy.&amp;quot; In this paper we use the more colloquial &amp;quot;part-of&amp;quot; terminology.</Paragraph>
    <Paragraph position="1"> We produce words with 55degPS accuracy for the top 50 words ranked by the system, given a very large corpus. Lacking an objective definition of the part-of relation, we use the majority judgment of five human subjects to decide which proposed parts are correct.</Paragraph>
    <Paragraph position="2"> The program's output could be scanned by an end-user and added to an existing ontology (e.g., Word-Net), or used as a part of a rough semantic lexicon.</Paragraph>
    <Paragraph position="3"> To the best of our knowledge, there is no published work on automatically finding parts from unlabeled corpora. Casting our nets wider, the work most similar to what we present here is that by Hearst \[2\] on acquisition of hyponyms (&amp;quot;isa&amp;quot; relations). In that paper Hearst (a) finds lexical correlates to the hyponym relations by looking in text for cases where known hyponyms appear in proximity (e.g., in the construction (NP, NP and (NP other NN)) as in &amp;quot;boats, cars, and other vehicles&amp;quot;), (b) tests the proposed patterns for validity, and (c) uses them to extract relations from a corpus. In this paper we apply much the same methodology to the part-of relation. Indeed, in \[2\] Hearst states that she tried to apply this strategy to the part-of relation, but failed. We comment later on the differences in our approach that we believe were most important to our comparative success.</Paragraph>
    <Paragraph position="4"> Looking more widely still, there is an evergrowing literature on the use of statistical/corpus-based techniques in the automatic acquisition of lexical-semantic knowledge (\[3-8\]). We take it as axiomatic that such knowledge is tremendously useful in a wide variety of tasks, from lower-level tasks like noun-phrase reference, and parsing to user-level tasks such as web searches, question answering, and digesting. Certainly the large number of projects that use WordNet \[1\] would support this contention. And although WordNet is hand-built, there is general agreement that corpus-based methods have an advantage in the relative completeness of their coverage, particularly when used as supplements to the more labor-intensive methods.</Paragraph>
  </Section>
class="xml-element"></Paper>