<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2136"> <Title>Automatic Acquisition of Domain Knowledge for Information Extraction</Title> <Section position="7" start_page="944" end_page="945" type="concl"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> The development of a variety of information extraction systems over the last decade has demonstrated their feasibility but also the limitations on their portability and performance.</Paragraph> <Paragraph position="1"> Preparing good patterns for these systems requires considerable skill, and achieving good coverage requires the analysis of a large amount of text. These problems have been impediments to the wider use of extraction systems.</Paragraph> <Paragraph position="2"> These difficulties have stimulated research on pattern acquisition. Some of this work has emphasized interactive tools to convert examples to extraction patterns (Yangarber and Grishman, 1997); much of the research has focused on methods for automatically converting a corpus annotated with extraction examples into patterns (Lehnert et al., 1992; Fisher et al., 1995; Miller et al., 1998). These techniques may reduce the level of system expertise required to develop a new extraction application, but they do not lessen the burden of studying a large corpus in order to find relevant candidates.</Paragraph> <Paragraph position="3"> The prior work most closely related to our own is that of (Riloff, 1996), who also seeks to build patterns automatically without the need to annotate a corpus with the information to be extracted. However, her work differs from our own in several important respects. First, her patterns identify phrases that fill individual slots in the template, without specifying how these slots may be combined at a later stage into complete templates. In contrast, our procedure discovers complete, multi-slot event patterns. 
Second, her procedure relies on a corpus in which the documents have been classified for relevance by hand (it was applied to the MUC-3 task, for which over 1500 classified documents are available), whereas ExDisco requires no manual relevance judgements. While classifying documents for relevance is much easier than annotating documents with the information to be extracted, it is still a significant task, and places a limit on the size of the training corpus that can be effectively used.</Paragraph> <Paragraph position="4"> Our research has demonstrated that for the studied scenarios automatic pattern discovery can yield extraction performance comparable to that obtained through extensive corpus analysis. There are many directions in which the work reported here needs to be extended: * using larger training corpora, in order to find less frequent examples, and in that way hopefully exceeding the performance of our best hand-trained system * capturing the word classes which are generated as a by-product of our pattern discovery procedure (in a manner similar to (Riloff and Jones, 1999)) and using them to discover less frequent patterns in subsequent iterations * evaluating the effectiveness of the discovery procedure on other scenarios. In particular, we need to be able to identify topics which can be most effectively characterized by clause-level patterns (as was the case for the business domain), and topics which can be better characterized by other means. We would also like to understand how the topic clusters (of documents and patterns) which are developed by our procedure line up with pre-specified scenarios.</Paragraph> </Section> </Paper>