<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1008">
  <Title>Finding Parts in Very Large Corpora</Title>
  <Section position="7" start_page="60" end_page="60" type="concl">
    <SectionTitle>
5 Discussion and Conclusions
</SectionTitle>
    <Paragraph position="0"> The program presented here can find parts of objects given a word denoting the whole object and a large corpus of unmarked text. The program is about 55% accurate for the top 50 proposed parts for each of six examples upon which we tested it. There does not seem to be a single cause for the 45% of the cases that are mistakes. We present here a few problems that have caught our attention.</Paragraph>
    <Paragraph position="1"> Idiomatic phrases like &amp;quot;a jalopy of a car&amp;quot; or &amp;quot;the son of a gun&amp;quot; provide problems that are not easily weeded out. Depending on the data, these phrases can be as prevalent as the legitimate parts.</Paragraph>
    <Paragraph position="2"> In some cases problems arose because of tagger mistakes. For example, &amp;quot;re-enactment&amp;quot; would be found as part of a &amp;quot;car&amp;quot; using pattern B in the phrase &amp;quot;the re-enactment of the car crash&amp;quot; if &amp;quot;crash&amp;quot; is tagged as a verb.</Paragraph>
    <Paragraph position="3"> The program had some tendency to find qualities of objects. For example, &amp;quot;driveability&amp;quot; is strongly correlated with car. We try to weed out most of the qualities by removing words with the suffixes &amp;quot;ness&amp;quot;, &amp;quot;ing&amp;quot;, and &amp;quot;ity&amp;quot;. The most persistent problem is sparse data, which is the source of most of the noise. More data would almost certainly allow us to produce better lists, both because the statistics we are currently collecting would be more accurate and because larger counts would let us find other reliable indicators. For example, idiomatic phrases might be recognized as such: we see &amp;quot;jalopy of a car&amp;quot; (twice) but not, of course, &amp;quot;the car's jalopy&amp;quot;. Words that appear in only one of the two patterns are suspect, but to use this rule we need sufficient counts on the good words to be sure we have a representative sample. At 100 million words, the NANC is not exactly small, but we were able to process it in about four hours with the machines at our disposal, so still larger corpora would not be out of the question.</Paragraph>
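The two filters described above, dropping quality-denoting words by suffix and flagging words seen in only one of the two extraction patterns, might be sketched as follows. This is a minimal illustration, not the authors' code; the function names, the example words, and the `min_count` threshold are our own assumptions.

```python
# Hypothetical sketch of the suffix filter and the two-pattern
# reliability heuristic discussed in the text.

# Words ending in these suffixes tend to name qualities
# ("driveability") rather than parts; the paper removes them.
QUALITY_SUFFIXES = ("ness", "ing", "ity")

def is_likely_quality(word):
    """Heuristic: True if the word looks like a quality, not a part."""
    return word.lower().endswith(QUALITY_SUFFIXES)

def filter_candidates(candidates):
    """Drop candidate parts that look like qualities."""
    return [w for w in candidates if not is_likely_quality(w)]

def suspect_single_pattern(counts_a, counts_b, min_count=2):
    """Words appearing in only one of the two patterns are suspect
    (e.g. "jalopy of a car" occurs, "the car's jalopy" never does),
    provided the observed count is large enough to be representative.
    min_count is an assumed threshold, not a value from the paper."""
    suspect = set()
    for w in set(counts_a) | set(counts_b):
        a, b = counts_a.get(w, 0), counts_b.get(w, 0)
        if (a == 0) != (b == 0) and max(a, b) >= min_count:
            suspect.add(w)
    return suspect

print(filter_candidates(["driveability", "wheel", "steering", "engine"]))
print(suspect_single_pattern({"jalopy": 2, "wheel": 5}, {"wheel": 7}))
```

With more data, the single-pattern check becomes usable because legitimate parts accumulate enough counts in both patterns to stand apart from idioms.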
    <Paragraph position="4"> Finally, as noted above, Hearst [2] tried to find parts in corpora but did not achieve good results.</Paragraph>
    <Paragraph position="5"> She does not say what procedures were used, but assuming that they closely paralleled her work on hyponyms, we suspect that our relative success was due to our very large corpus and the use of more refined statistical measures for ranking the output.</Paragraph>
  </Section>
</Paper>