<?xml version="1.0" standalone="yes"?> <Paper uid="X96-1036"> <Title>Integration of Document Detection and Information Extraction</Title> <Section position="2" start_page="0" end_page="195" type="metho"> <SectionTitle> LEXlCO-SEMANTIC PATTERN MATCHING FOR SHALLOW NLP </SectionTitle> <Paragraph position="0"> The lexico-semantic pattern matching method allows for capturing of word sequences in text using a simple pattern language that can be compiled into a set of non-deterministic finite automata. Each automoton represents a single rule within the language, with several related rules forming a package. As a result of matching a rule against the input, a series of variables within the rule are bound to lexical elements in text. These bindings are subsequently used to generate single-word and/or multiple-word terms for indexing.</Paragraph> <Paragraph position="1"> Long phrasal terms are decomposed into pairs in two phases as follows. In the first phase, only unambiguous pairs are collected, while all longer and potentially structurally ambiguous noun phrases are passed to the second phase. In the second phase, the distributional statistics gathered in the first phase are used to predict the strength of alternative two-word sub-components within long phrases. For example, we may have multiple unambiguous occurrences of &quot;insider trading&quot;, while very few of &quot;trading case&quot;. At the same time, there are numerous phrases such as =insider trading case&quot;, =insider trading legislation&quot;, etc., where the pair =insider trading&quot; remains stable while the other elements get changed, and significantly fewer cases where, say, &quot;trading case&quot; is constant and the other words change.</Paragraph> <Paragraph position="2"> The experiments performed on a subset of U.S.</Paragraph> <Paragraph position="3"> PTO's patent database show healthy 10%+ increase in average precision over baseline SMART system.</Paragraph> <Paragraph position="4"> The average precision (11-point) has increased from 49% SMART baseline on the test sample to 56%.</Paragraph> <Paragraph position="5"> Precision at 5 top retrieved documents jumped from 48% to 52%. We also noticed that phrase disambiguation step was critical for improved precision.</Paragraph> </Section> <Section position="3" start_page="195" end_page="195" type="metho"> <SectionTitle> INDEXING WITH MUC-6 CONCEPTS </SectionTitle> <Paragraph position="0"> In these experiments we used actual MUC organization and people name spotter (from Lockheed Martin) to annotate and index a subset of TREC-4 collection. We selected 17 queries out of 250 TREC topics which explicitely mentioned some organizations by names. The following observations were made: 1.Different queries require different concepts to be spotted: concepts that are universal enough to be important in most domains are hard to find, or not discriminating enough.</Paragraph> <Paragraph position="1"> 2.These differences are frequently queryspecific, not just domain-specific, which makes MUC-style extraction impractical 3.The role that a concept plays in a query can affect its usefullness in retrieval: concepts found in focus appear to be radically more discriminating than those found in background roles.</Paragraph> <Paragraph position="2"> Initial results show that targeted concept indexing can be extremely effective, however, random annotation may in fact cause loss of performance. 
<Section position="3" start_page="195" end_page="195" type="metho">
<SectionTitle> INDEXING WITH MUC-6 CONCEPTS </SectionTitle>
<Paragraph position="0"> In these experiments we used an actual MUC organization and person name spotter (from Lockheed Martin) to annotate and index a subset of the TREC-4 collection. We selected 17 queries out of 250 TREC topics that explicitly mentioned some organizations by name. The following observations were made: 1. Different queries require different concepts to be spotted: concepts that are universal enough to be important in most domains are hard to find, or not discriminating enough.</Paragraph>
<Paragraph position="1"> 2. These differences are frequently query-specific, not just domain-specific, which makes MUC-style extraction impractical. 3. The role that a concept plays in a query can affect its usefulness in retrieval: concepts found in focus roles appear to be radically more discriminating than those found in background roles.</Paragraph>
<Paragraph position="2"> Initial results show that targeted concept indexing can be extremely effective; however, random annotation may in fact cause a loss of performance. Overall, the average precision improved by only 3%; however, some queries, namely those where the indexed concepts were in focus roles, benefited dramatically. For example, the query about Mitsubishi gained about 25% in precision relative to the SMART baseline (from 42% to 52%).</Paragraph>
<Paragraph position="3"> Typical results are summarized in a table comparing four runs: words, annotations, both, and merge.</Paragraph>
</Section>
<Section position="4" start_page="195" end_page="195" type="metho">
<SectionTitle> MIXED BOOLEAN/SOFT RETRIEVAL MODEL </SectionTitle>
<Paragraph position="0"> We allow strict-match terms to be included in search queries in a specially designated field. The hard/soft query mechanism allows a user to specify, in either interactive or batch mode, a Boolean-type query that restricts the documents returned by a vector-space model match. Documents not satisfying the Boolean query are deemed non-relevant for the query.</Paragraph>
<Paragraph position="1"> A two-pass retrieval has been implemented in SMART to allow the proper interpretation of such queries. In interactive mode, a normal vector query can be entered using the 'run' command. When the first results are returned, using 'boolean' will place you in an editor mode (similar to 'run'). Construct the Boolean query and terminate it with a period on a line by itself. The documents returned by the latest 'run' command are then filtered, and only those satisfying the Boolean query are redisplayed. Using 'more' will always retrieve 'num_wanted' documents unless there are insufficient documents remaining that are relevant to the initial vector query.</Paragraph>
</Section>
<Section position="5" start_page="195" end_page="195" type="metho">
<SectionTitle> RECOMMENDATIONS FOR AN INTEGRATED SYSTEM </SectionTitle>
<Paragraph position="0"> The following were determined to be crucial in building an integrated extraction/detection system: 1. A large variety of extraction capabilities, best if they could be generated rapidly on an ad hoc basis. 2. Rapid discourse analysis for role determination of semantically significant terms. 3. A well-defined equivalence relation on the annotations produced by an extraction system.</Paragraph>
<Paragraph position="1"> 4. Use of a mixed Boolean/soft retrieval model.</Paragraph>
</Section>
</Paper>
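Finally, the hard/soft query mechanism described in the MIXED BOOLEAN/SOFT RETRIEVAL MODEL section can be illustrated with a small two-pass sketch in Python. The toy inverted index, the overlap-count scoring of the soft pass, and the reduction of the Boolean query to a simple conjunction of required terms are assumptions made for the example; they are not SMART's actual scoring or query language.

    def vector_run(query_terms, index, num_wanted=10):
        """Soft pass: rank documents by a simple term-overlap score."""
        scores = {}
        for term in query_terms:
            for doc_id in index.get(term, ()):
                scores[doc_id] = scores.get(doc_id, 0) + 1
        ranked = sorted(scores, key=scores.get, reverse=True)
        return ranked[:num_wanted]

    def boolean_filter(doc_ids, required_terms, index):
        """Hard pass: keep only documents containing every strict-match term."""
        return [d for d in doc_ids
                if all(d in index.get(term, ()) for term in required_terms)]

    # Toy inverted index mapping terms to the documents that contain them.
    index = {"insider": {1, 2, 3}, "trading": {1, 2}, "mitsubishi": {3}}
    soft_hits = vector_run(["insider", "trading"], index)
    hard_hits = boolean_filter(soft_hits, ["trading"], index)  # drops document 3
    print(soft_hits, hard_hits)

As in the two-pass mode described above, the Boolean filter never adds documents; it only removes vector-pass hits that fail the strict-match constraint.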