<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2612">
  <Title>FBI Uniform Crime Reporting: Data Collection Guidelines</Title>
  <Section position="6" start_page="0" end_page="0" type="concl">
    <SectionTitle>
6 Discussion, Ongoing and Future Efforts
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Short versus Long Texts
</SectionTitle>
      <Paragraph position="0"> Our integration mechanism appears to work well when textual definitions are short texts with not very long sentences and phrases. This is the case with standard dictionaries and our system acquired knowledge which, by design, acquires knowledge in small portions, short, few sentence-long texts. We can accomplish this because for short texts, parsing and computing meaning-level representation is possible and can be done with high levels of precision.</Paragraph>
      <Paragraph position="1"> Full integration of larger texts such as many page encyclopedic entries or complete newspaper articles is currently not really possible because parsing long sentences and computing meaning-level representation of large texts with high levels of precision remains an open research problem.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Integrate or Not
</SectionTitle>
      <Paragraph position="0"> A really hard part is the integrate-or-not decision. In general, it is hard both for humans and systems to decide who is right and which piece of knowledge is correct. So despite having a system capable of fully automatic integration, we involve a human in the loop.</Paragraph>
      <Paragraph position="1"> We look at the system's recommendation, the alignment of different definitions, similarity metrics etc. and then make this decision by hand.</Paragraph>
      <Paragraph position="2"> The only alternative appears to make an arbitrary assumption that particular sources are (always) right or more right than some others.</Paragraph>
      <Paragraph position="3"> We have a mechanism that, in principle, allows us to integrate definitions from all existing sources. In practise, we consider a safer road of choosing two existing sources and updating them only with knowledeg acquired automatically by our system from corpora of &amp;quot;respectable&amp;quot; texts.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.3 Dictionary Entries as Summaries, Generalizations
</SectionTitle>
      <Paragraph position="0"> Our investigation leads us to believe that dictionary entries may be summaries and generalizations of words' uses over certain contexts. As such, they would constitute derived, and not primary, resource in people and machines. We plan to continue developing knowledge acquisition and learning methods to automatically create dictionaries/knowledge bases from corpora of texts. Our approach is to let the system acquire as much as possible and as specific as possible pieces of information and knowledge. We then generate dictionary-like, short, context-relevant definitions via our summarization/generalization mechanism.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.4 Text Generation
</SectionTitle>
      <Paragraph position="0"> Even with shorter text, we encounter many problems with generating naturally-sounding English text from our representation. One problem is that integration results in increasingly heavier phrases. Breaking long phrases into separate sentences with shorter phrases sometimes produces akward texts.</Paragraph>
      <Paragraph position="1"> Another problem is naturalness, which may mean different things in different contexts. In case of synonymous relations, we use two criteria. The first is frequency, commonality-based with preference given to the more commonly used, relative to a corpus, subject matter, or overall. For example, the word &amp;quot;infectious&amp;quot; will be preferred to &amp;quot;pathogenic&amp;quot;, and the phrase &amp;quot;extremely small&amp;quot; to &amp;quot;submicroscopic&amp;quot;. The second criterion is based on simplicity and size of utterance. For example, the word &amp;quot;submicroscopic&amp;quot; will be preferred to &amp;quot;extremely small&amp;quot;.</Paragraph>
      <Paragraph position="2"> It is clear that with progress on processing larger texts, the text generation problems will intensify.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>