<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1204">
<Title>A Lexically-Intensive Algorithm for Domain-Specific Knowledge Acquisition. René Schneider, Text Understanding Systems</Title>
<Section position="9" start_page="163" end_page="163" type="concl">
<SectionTitle>6 Conclusion</SectionTitle>
<Paragraph position="0"> In this paper we discussed the construction of a statistical learning algorithm, based on restricted domains and their underlying sublanguages, in order to automatically build a linguistic knowledge base for information extraction tasks with the aid of very simple arithmetic procedures. The method is based on weighted frequency lists of word forms and syntactic patterns. Although very little information about the texts and the domain is known a priori, and only two functional dependencies (see Hypotheses 1 and 2) have been postulated, the algorithm automatically learns to build a very compact knowledge base from small and noisy text corpora. The method was tested empirically on several English, German, and Spanish corpora and yields the same results for noisy as well as for correct domain-specific corpora.</Paragraph>
<Paragraph position="1"> A comparison of the core lexicon with common frequency analyses (Francis and Kučera, 1982) for correct texts shows that, even with a very small text sample, the information about linguistically allowed alterations of a lexical base form is acquired automatically. Additional information is gained by subsuming linguistically incorrect variants. The acquired knowledge is stored in a compact and dynamic knowledge base whose structure is modified with every significant change in a lexeme's probabilistic properties and relations. The knowledge base is quite compact and allows very quick analysis of unknown texts.</Paragraph>
<Paragraph position="2"> First tests with different corpora and different languages (German and Spanish) show that this algorithm can be applied to different domains and other languages, and is thus a useful tool for the extension of IE systems that work with OCR data. Although the results of the algorithm depend very much on the data, i.e. the limits or sharpness of the domain used, the underlying ideas may be applied to any information extraction purpose and to other applications such as lexicography, information retrieval, or terminology extraction.</Paragraph>
</Section>
</Paper>