<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1008"> <Title>Finding Parts in Very Large Corpora</Title> <Section position="6" start_page="59" end_page="60" type="evalu"> <SectionTitle> 4 Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="59" end_page="59" type="sub_section"> <SectionTitle> 4.1 Testing Humans </SectionTitle> <Paragraph position="0"> We tested five subjects (all of whom were unaware of our goals) for their concept of a &quot;part.&quot; We asked them to rate sets of 100 words, of which 50 were in our final results set. Tables 6-11 show the top 50 words for each of our six seed words, along with the number of subjects who marked the word as a part of the seed concept. The scores of individual words vary greatly, but there was relative consensus on most words. We put an asterisk next to words that the majority of subjects marked as correct. Lacking a formal definition of part, we can only define those words as correct and the rest as wrong. While the scoring is admittedly not perfect 1, it provides an adequate reference result.</Paragraph> <Paragraph position="1"> Table 4 summarizes these results. There we show the number of correct part words in the top 10, 20, 30, 40, and 50 parts for each seed (e.g., for &quot;book&quot;, 8 of the top 10 are parts, and 14 of the top 20). Overall, about 55% of the top 50 words for each seed are parts, and about 70% of the top 20 for each seed. The reader should also note that we tried one ambiguous word, &quot;plant&quot;, to see what would happen. Our program finds parts corresponding to both senses, though given the nature of our text, the industrial use is more common. Our subjects marked both kinds of parts as correct, but even so, this produced the weakest part list of the six words we tried.</Paragraph> <Paragraph position="2"> As a baseline, we also tried using as our &quot;pattern&quot; the head nouns that immediately surround our target word.
We then applied the same &quot;strong conditioning, sigdiff&quot; statistical test to rank the candidates. This performed quite poorly. Of the top 50 candidates for each target, only 8% were parts, as opposed to the 55% for our program.</Paragraph> </Section> <Section position="2" start_page="59" end_page="60" type="sub_section"> <SectionTitle> 4.2 WordNet </SectionTitle> <Paragraph position="0"> Table 5: + door, engine, floorboard, gear, grille, horn, mirror, roof, tailfin, window; - brake, bumper, dashboard, driver, headlight, ignition, occupant, pipe, radiator, seat, shifter, speedometer, tailpipe, vent, wheel, windshield. We also compared our parts list to that of WordNet. Table 5 shows the parts of &quot;car&quot; in WordNet that are not in our top 20 (+) and the words in our top 20 that are not in WordNet (-). There are definite tradeoffs, although we would argue that our top-20 set is both more specific and more comprehensive. Two notable words our top 20 lack are &quot;engine&quot; and &quot;door&quot;, both of which occur before rank 100. More generally, all WordNet parts occur somewhere before rank 500, with the exception of &quot;tailfin&quot;, which never occurs with car. It would seem that our program would be a good tool for expanding WordNet, as a person can scan and mark the list of part words in a few minutes. 1 For instance, &quot;shifter&quot; is undeniably part of a car, while &quot;production&quot; is only arguably part of a plant.</Paragraph> </Section> </Section> </Paper>
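<!--
The scoring described in this section (majority vote among five subjects, counts of correct parts in the top k, and the +/- comparison against a reference list such as WordNet) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy vote counts and word lists are hypothetical, and a strict majority (at least 3 of 5) is an assumed reading of "the majority of subjects".

```python
from typing import Dict, List, Sequence, Tuple

NUM_SUBJECTS = 5  # the section reports five human judges


def majority_correct(votes: int, num_subjects: int = NUM_SUBJECTS) -> bool:
    """Assumed rule: a word is a part if more than half the subjects marked it."""
    return 2 * votes > num_subjects


def correct_in_top_k(ranked_votes: Sequence[int],
                     cutoffs: Sequence[int]) -> Dict[int, int]:
    """Count majority-approved words among the first k ranked candidates."""
    labels = [majority_correct(v) for v in ranked_votes]
    return {k: sum(labels[:k]) for k in cutoffs}


def compare_with_reference(top_words: Sequence[str],
                           reference: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Return (reference words missing from the list, list words not in the reference)."""
    top, ref = set(top_words), set(reference)
    missing = sorted(ref - top)  # the "+" row of a Table 5 style comparison
    extra = sorted(top - ref)    # the "-" row of a Table 5 style comparison
    return missing, extra


# Hypothetical vote counts for ten ranked candidates (not data from the paper).
toy_votes = [5, 4, 0, 3, 2, 5, 1, 3, 4, 0]
counts = correct_in_top_k(toy_votes, cutoffs=(5, 10))

# Hypothetical word lists, chosen only to exercise the comparison.
missing, extra = compare_with_reference(["wheel", "seat", "door"], ["door", "engine"])
```

Dividing each count by its cutoff gives the percentage figures quoted in the text; for example, the reported 14 correct parts in the top 20 for "book" corresponds to 70%.
-->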