<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1204">
  <Title>A Lexically-Intensive Algorithm for Domain-Specific Knowledge Acquisition René Schneider * Text Understanding Systems</Title>
  <Section position="4" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Challenges in Information Extraction
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The Acquisition Bottleneck
</SectionTitle>
      <Paragraph position="0"> Generally, I_E-Systems are built for a rather restricted task and work on a more or less limited domain. This keeps their knowledge bases and the rules that are needed to process the texts, e.g. the syntactic rules, quite compact. But nevertheless, the changes that have to be done whenever a working system is applied to another domain are remarkably high, in some cases leading to the construction of a almost completely new knowledge base.</Paragraph>
      <Paragraph position="1"> Both, the construction of a new knowledge base and their maintenance need a certain time and lots of efforts have to be done by highly-skilled staff knowing the system and the domain it is built for. On the other hand, texts or messages that are written for a very specific purpose show the phenomena of Sub-languages (Harris, 1982), with less ambiguities and varieties than unrestricted language but still more freedom in expression than Controlled Languages.</Paragraph>
      <Paragraph position="2"> This fact strengthens the need for the automatic acquisition of linguistic knowledge, esp. the construction of a lemmatisation and a shallow parsing component. null Statistical learning algorithms are usually applied to processing large corpora, but in real life, huge samples are hard to find for commercial and industrial applications. In our case, the corpora usually consist of small samples of fewer than 150 very short texts and the whole sample must be split into a training and a test corpora. This disadvantages are compensated by the use of a domain-specific sublanguage. Any sublanguage shows some use of typical vocabulary, styles, and grammatical constructions, and it can be said that the more specific the domain is, the stronger are the restrictions of the sublanguage. But even in categories where these restrictions are weak, the essential and relevant information is carried by some typical words and located in a few kernel phrases, so that even simple statistics like frequency lists, distance measures and weighted collocation patterns may overcome parts of the acquisition problem (see section 4).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The Noisy-Channel-Problem
</SectionTitle>
      <Paragraph position="0"> The second major problem is concerned with the fact that still a remarkably high number of paperbound texts have to be pre-processed by an OCR-System in order to convert them into machine-readable code. This problem can be compared to the well known problem of a noisy channel, as indicated in Figure 2.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>