<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1008">
  <Title>Finding Parts in Very Large Corpora</Title>
  <Section position="4" start_page="0" end_page="57" type="metho">
    <SectionTitle>
2 Finding Parts
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="57" type="sub_section">
      <SectionTitle>
2.1 Parts
</SectionTitle>
      <Paragraph position="0"> Webster's Dictionary defines &amp;quot;part&amp;quot; as &amp;quot;one of the often indefinite or unequal subdivisions into which something is or is regarded as divided and which together constitute the whole.&amp;quot; The vagueness of this definition translates into a lack of guidance on exactly what constitutes a part, which in turn translates into some doubts about evaluating the results of any procedure that claims to find them. More specifically, note that the definition does not claim that parts must be physical objects. Thus, say, &amp;quot;novel&amp;quot; might have &amp;quot;plot&amp;quot; as a part.</Paragraph>
      <Paragraph position="1"> In this study we handle this problem by asking informants which words in a list are parts of some target word, and then declaring majority opinion to be correct. We give more details on this aspect of the study later. Here we simply note that while our subjects often disagreed, there was fair consensus that what might count as a part depends on the nature of the  word: a physical object yields physical parts, an institution yields its members, and a concept yields its characteristics and processes. In other words, &amp;quot;floor&amp;quot; is part of &amp;quot;building&amp;quot; and &amp;quot;plot&amp;quot; is part of &amp;quot;book.&amp;quot;</Paragraph>
    </Section>
    <Section position="2" start_page="57" end_page="57" type="sub_section">
      <SectionTitle>
2.2 Patterns
</SectionTitle>
      <Paragraph position="0"> Our first goal is to find lexical patterns that tend to indicate part-whole relations. Following Hearst \[2\], we find possible patterns by taking two words that are in a part-whole relation (e.g, basement and building) and finding sentences in our corpus (we used the North American News Corpus (NANC) from LDC) that have these words within close proximity. The first few such sentences are: ... the basement of the building.</Paragraph>
      <Paragraph position="1"> ... the basement in question is in a four-story apartment building ...</Paragraph>
      <Paragraph position="2"> ... the basement of the apartment building.</Paragraph>
      <Paragraph position="3"> From the building's basement ...</Paragraph>
      <Paragraph position="4"> ... the basement of a building ...</Paragraph>
      <Paragraph position="5"> ... the basements of buildings ...</Paragraph>
      <Paragraph position="6"> From these examples we construct the five patterns shown in Table 1. We assume here that parts and wholes are represented by individual lexical items (more specifically, as head nouns of noun-phrases) as opposed to complete noun phrases, or as a sequence of &amp;quot;important&amp;quot; noun modifiers together with the head. This occasionally causes problems, e.g., &amp;quot;conditioner&amp;quot; was marked by our informants as not part of &amp;quot;car&amp;quot;, whereas &amp;quot;air conditioner&amp;quot; probably would have made it into a part list. Nevertheless, in most cases head nouns have worked quite well on their own.</Paragraph>
      <Paragraph position="7"> We evaluated these patterns by observing how they performed in an experiment on a single example.</Paragraph>
      <Paragraph position="8"> Table 2 shows the 20 highest ranked part words (with the seed word &amp;quot;car&amp;quot;) for each of the patterns A-E. (We discuss later how the rankings were obtained.) Table 2 shows patterns A and B clearly outperform patterns C, D, and E. Although parts occur in all five patterns~ the lists for A and B are predominately parts-oriented. The relatively poor performance of patterns C and E was ant!cipated, as many things occur &amp;quot;in&amp;quot; cars (or buildings, etc.) other than their parts. Pattern D is not so obviously bad as it differs from the plural case of pattern B only in the lack of the determiner &amp;quot;the&amp;quot; or &amp;quot;a&amp;quot;. However, this difference proves critical in that pattern D tends to pick up &amp;quot;counting&amp;quot; nouns such as &amp;quot;truckload.&amp;quot; On the basis of this experiment we decided to proceed using only patterns A and B from Table 1.</Paragraph>
      <Paragraph position="9"> A. whole NN\[-PL\] 's POS part NN\[-PL\]  ... building's basement ...</Paragraph>
      <Paragraph position="10"> B. part NN\[-PL\] of PREP {theIa } DET roods \[JJINN\]* whole NN ... basement of a building...</Paragraph>
      <Paragraph position="11"> C. part NN in PREP {thela } DET roods \[JJINN\]* whole NN ... basement in a building ...</Paragraph>
      <Paragraph position="12"> D. parts NN-PL of PREP wholes NN-PL ... basements of buildings ...</Paragraph>
      <Paragraph position="13"> E. parts NN-PL in PREP wholes NN-PL ... basements in buildings ...</Paragraph>
      <Paragraph position="14"> Format: type_of_word TAG type_of_word TAG ...</Paragraph>
      <Paragraph position="16"> Table h Patterns for partOf(basement,building)</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="57" end_page="59" type="metho">
    <SectionTitle>
3 Algorithm
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="57" end_page="58" type="sub_section">
      <SectionTitle>
3.1 Input
</SectionTitle>
      <Paragraph position="0"> We use the LDC North American News Corpus (NANC). which is a compilation of the wire output of several US newspapers. The total corpus is about 100,000,000 words. We ran our program on the whole data set, which takes roughly four hours on our network. The bulk of that time (around 90%) is spent tagging the corpus.</Paragraph>
      <Paragraph position="1"> As is typical in this sort of work, we assume that our evidence (occurrences of patterns A and B) is independently and identically distributed (lid). We have found this assumption reasonable, but its breakdown has led to a few errors. In particular, a drawback of the NANC is the occurrence of repeated articles; since the corpus consists of all of the articles that come over the wire, some days include multiple, updated versions of the same story, containing identical paragraphs or sentences. We wrote programs to weed out such cases, but ultimately found them of little use. First, &amp;quot;update&amp;quot; articles still have substantial variation, so there is a continuum between these and articles that are simply on the same topic.</Paragraph>
      <Paragraph position="2"> Second, our data is so sparse that any such repeats are very unlikely to manifest themselves as repeated examples of part-type patterns. Nevertheless since two or three occurrences of a word can make it rank highly, our results have a few anomalies that stem from failure of the iid assumption (e.g., quite appropriately, &amp;quot;clunker&amp;quot;).</Paragraph>
    </Section>
    <Section position="2" start_page="58" end_page="58" type="sub_section">
      <SectionTitle>
Pattern A
</SectionTitle>
      <Paragraph position="0"> headlight windshield ignition shifter dashboard radiator brake tailpipe pipe airbag speedometer converter hood trunk visor vent wheel occupant engine tyre Pattern B trunk wheel driver hood occupant seat bumper backseat dashboard jalopy fender rear roof windshield back clunker window shipment reenactment</Paragraph>
      <Paragraph position="2"> passenger gunmen leaflet hop houseplant airbag gun koran cocaine getaway motorist phone men indecency person ride woman detonator kid key Pattern D import caravan make dozen carcass shipment hundred thousand sale export model truckload queue million boatload inventory hood registration trunk ten Pattern E airbag packet switch gem amateur device handgun passenger fire smuggler phone tag driver weapon meal compartment croatian defect refugee delay  Our seeds are one word (such as &amp;quot;car&amp;quot;) and its plural. We do not claim that all single words would fare as well as our seeds, as we picked highly probable words for our corpus (such as &amp;quot;building&amp;quot; and &amp;quot;hospital&amp;quot;) that we thought would have parts that might also be mentioned therein. With enough text, one could probably get reasonable results with any noun that met these criteria.</Paragraph>
    </Section>
    <Section position="3" start_page="58" end_page="58" type="sub_section">
      <SectionTitle>
3.2 Statistical Methods
</SectionTitle>
      <Paragraph position="0"> The program has three phases. The first identifies and records all occurrences of patterns A and B in our corpus. The second filters out all words ending with &amp;quot;ing', &amp;quot;ness', or &amp;quot;ity', since these suffixes typically occur in words that denote a quality rather than a physical object. Finally we order the possible parts by the likelihood that they are true parts according to some appropriate metric.</Paragraph>
      <Paragraph position="1"> We took some care in the selection of this metric. At an intuitive level the metric should be something like p(w \[ p). (Here and in what follows w denotes the outcome of the random variable generating wholes, and p the outcome for parts. W(w) states that w appears in the patterns AB as a whole, while P(p) states that p appears as a part.) Metrics of the form p(w I P) have the desirable property that they are invariant over p with radically different base frequencies, and for this reason have been widely used in corpus-based lexical semantic research \[3,6,9\]. However, in making this intuitive idea someone more precise we found two closely related versions: p(w, W(w) I P) p(w, w(~,) I p, e(p)) We call metrics based on the first of these &amp;quot;loosely conditioned&amp;quot; and those based on the second &amp;quot;strongly conditioned&amp;quot;.</Paragraph>
      <Paragraph position="2"> While invariance with respect to frequency is generally a good property, such invariant metrics can lead to bad results when used with sparse data. In particular, if a part word p has occurred only once in the data in the AB patterns, then perforce p(w \[ P) = 1 for the entity w with which it is paired. Thus this metric must be tempered to take into account the quantity of data that supports its conclusion. To put this another way, we want to pick (w,p) pairs that have two properties, p(w I P) is high and \[ w, pl is large. We need a metric that combines these two desiderata in a natural way.</Paragraph>
      <Paragraph position="3"> We tried two such metrics. The first is Dunning's \[10\] log-likelihood metric which measures how &amp;quot;surprised&amp;quot; one would be to observe the data counts I w,p\[,\[ -,w, pl, \[ w,-,pland I-'w,-'Plifone assumes that p(w I P) = p(w). Intuitively this will be high when the observed p(w I P) &gt;&gt; p(w) and when the counts supporting this calculation are large.</Paragraph>
      <Paragraph position="4"> The second metric is proposed by Johnson (personal communication). He suggests asking the question: how far apart can we be sure the distributions p(w \[ p)and p(w) are if we require a particular significance level, say .05 or .01. We call this new test the &amp;quot;significant-difference&amp;quot; test, or sigdiff. Johnson observes that compared to sigdiff, log-likelihood tends to overestimate the importance of data frequency at the expense of the distance between p(w I P) and</Paragraph>
    </Section>
    <Section position="4" start_page="58" end_page="59" type="sub_section">
      <SectionTitle>
3.3 Comparison
</SectionTitle>
      <Paragraph position="0"> Table 3 shows the 20 highest ranked words for each statistical method, using the seed word &amp;quot;car.&amp;quot; The first group contains the words found for the method we perceive as the most accurate, sigdiff and strong conditioning. The other groups show the differences between them and the first group. The + category means that this method adds the word to its list, means the opposite. For example, &amp;quot;back&amp;quot; is on the sigdiff-loose list but not the sigdiff-strong list.</Paragraph>
      <Paragraph position="1"> In general, sigdiff worked better than surprise and strong conditioning worked better than loose conditioning. In both cases the less favored methods tend to promote words that are less specific (&amp;quot;back&amp;quot; over &amp;quot;airbag&amp;quot;, &amp;quot;use&amp;quot; over &amp;quot;radiator&amp;quot;). Furthermore, the</Paragraph>
    </Section>
    <Section position="5" start_page="59" end_page="59" type="sub_section">
      <SectionTitle>
Sigdiff, Strong
</SectionTitle>
      <Paragraph position="0"> airbag brake bumper dashboard driver fender headlight hood ignition occupant pipe radiator seat shifter speedometer tailpipe trunk vent wheel windshield  Sigdiff, Loose + back backseat oversteer rear roof vehicle visor - airbag brake bumper pipe speedometer tailpipe vent Surprise, Strong + back cost engine owner price rear roof use value window - airbag bumper fender ignition pipe radiator shifter speedometer tailpipe vent Surprise, Loose + back cost engine front owner price rear roof side value version window - airbag brake bumper dashboard fender ignition pipe radiator shifter speedometer tailpipe vent  combination of sigdiff and strong conditioning worked better than either by itself. Thus all results in this paper, unless explicitly noted otherwise, were gathered using sigdiff and strong conditioning combined.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>