<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-2001">
  <Title>A Patent Document Retrieval System Addressing Both Semantic and Syntactic Properties</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
[Equation fragment continued from the preceding section; recoverable symbols: $1,\; b,\; 2,\; \ldots,\; b$]
</SectionTitle>
    <Paragraph position="0"> Since all the patent documents are provided with a formal abstract, we assume that each abstract is equivalent in content to its document, so that both the abstract and the document should be retrieved among the documents similar to the supplied query. We show below how the DLSI technique can be set up to yield a more robust scheme. We have previously shown how the shortcoming of a global projection-based LSI scheme can be remedied by making the best use of the differences of two vectors, adapting to the unique characteristics of each document (Chen et al., 2001).</Paragraph>
    <Paragraph position="1"> A Differential Document Vector is defined as $I_1 - I_2$, where $I_1$ and $I_2$</Paragraph>
    <Paragraph position="3"> are normalized document vectors of particular types of documents. An Exterior Differential Document Vector in particular is defined as the Differential Document Vector $I = I_1 - I_2$, where $I_1$ and $I_2$</Paragraph>
    <Paragraph position="5"> constitute two normalized document vectors of any two different documents. An Interior Differential Document Vector is defined as the Differential Document Vector $I = I_1 - I_2$, where $I_1$ and $I_2$</Paragraph>
    <Paragraph position="7"> constitute two different normalized document vectors of the same document. Different document vectors of the same document may be taken from parts of the document, including its abstract, or may be produced by different summarization schemes, or from queries. The Exterior Differential Term-Document Matrix is defined as a matrix each column of which is an Exterior Differential Document Vector. The Interior Differential Term-Document Matrix is defined as a matrix each column of which is an Interior Differential Document Vector.</Paragraph>
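The definitions above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each document supplies two raw term-frequency vectors (one from its abstract, one from its body); the names `normalize`, `differential`, `interior_matrix`, `exterior_matrix` and the toy data are hypothetical and not part of the original system. The choice of pairing an abstract with the body of another document for the exterior case is likewise an assumption.

```python
import math
from itertools import combinations

def normalize(vec):
    """Scale a term-frequency vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def differential(v1, v2):
    """Differential document vector: difference of two normalized vectors."""
    return [a - b for a, b in zip(normalize(v1), normalize(v2))]

def transpose(columns):
    """Turn a list of column vectors into a row-major matrix."""
    return [list(row) for row in zip(*columns)]

def interior_matrix(docs):
    """One column per document: its abstract vector minus its own body vector."""
    return transpose([differential(d["abstract"], d["body"]) for d in docs])

def exterior_matrix(docs):
    """One column per pair of different documents (assumed pairing:
    abstract of the first against body of the second)."""
    pairs = combinations(docs, 2)
    return transpose([differential(a["abstract"], b["body"]) for a, b in pairs])

# Toy term-frequency vectors over a three-term vocabulary (made-up data).
docs = [
    {"abstract": [2, 0, 1], "body": [4, 1, 2]},
    {"abstract": [0, 3, 1], "body": [1, 5, 2]},
    {"abstract": [1, 1, 0], "body": [2, 3, 0]},
]
D_I = interior_matrix(docs)  # 3 terms by 3 documents
D_E = exterior_matrix(docs)  # 3 terms by 3 document pairs
```

Two vectors pointing in the same direction yield a zero differential vector after normalization, which is why interior vectors tend to be short compared with exterior ones.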
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Details of a DLSI Model
</SectionTitle>
      <Paragraph position="0"> Any differential term-document matrix, say an $m$-by-$n$ matrix $D$ of rank $r \le q = \min(m, n)$, can be decomposed into a product of three matrices, namely $D = U S V^{T}$</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> , such that $U$ and $V$ are $m$-by-$q$ and $q$-by-$n$ unitary matrices respectively, where the first $r$ columns of $U$ and $V$ are the eigenvectors of $D D^{T}$ and $D^{T} D$ respectively [...] equal to $D$. Each differential document vector $q$ can be projected onto the $k$-dimensional differential latent semantic fact space spanned by the $k$ columns of $U$ [...] the covariance of the distribution computed from the training set. Assuming that the differential document vectors follow a high-dimensional Gaussian distribution, the likelihood of any differential document vector $x$ will be given by</Paragraph>
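For reference, the general density of a multivariate Gaussian, on which a likelihood of this kind is based, has the following standard form. The symbols $d$ (dimension), $\mu$ (mean) and $\Sigma$ (covariance) are assumptions of this sketch and are not necessarily the notation or the exact expression of the DLSI model, which is given in (Chen et al., 2001).

```latex
% Standard d-dimensional Gaussian density; d, \mu and \Sigma are assumed symbols.
p(x) = \frac{1}{(2\pi)^{d/2}\, \lvert \Sigma \rvert^{1/2}}
       \exp\!\left( -\tfrac{1}{2}\, (x - \mu)^{T}\, \Sigma^{-1}\, (x - \mu) \right)
```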
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Algorithm
</SectionTitle>
      <Paragraph position="0"> 1. Text preprocessing: Identify words and noun phrases as well as stop words.</Paragraph>
      <Paragraph position="1"> 2. System term construction: Set up the term list as well as the global weights.</Paragraph>
      <Paragraph position="2"> 3. Set up the document vectors of all the collected documents in normalized form.</Paragraph>
      <Paragraph position="3"> 4. Construct the interior differential term-document matrix [...] 1. A query is treated as a document: a document vector is set up from its terms and their frequencies of occurrence, and a normalized document vector is thus obtained for the query. Each document in the database is processed by the procedures in items 2-5 below.</Paragraph>
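Steps 1 to 3, and the treatment of a query as a document, can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list is a toy one, noun-phrase detection is omitted, and the global weight is taken to be an inverse-document-frequency weight, which the text above does not specify. The function names are hypothetical.

```python
import math
import re
from collections import Counter

# Toy stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "are", "and", "to", "for"}

def tokenize(text):
    """Step 1: lowercase the text, split it into words, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def build_terms(corpus):
    """Step 2: term list plus one global weight per term.
    The IDF-style weight is an assumption of this sketch."""
    tokenized = [tokenize(doc) for doc in corpus]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    terms = sorted(doc_freq)
    n_docs = len(corpus)
    weights = {t: math.log(n_docs / doc_freq[t]) + 1.0 for t in terms}
    return terms, weights

def document_vector(text, terms, weights):
    """Step 3: weighted term frequencies scaled to unit length."""
    counts = Counter(tokenize(text))
    vec = [counts[t] * weights[t] for t in terms]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

corpus = [
    "A valve for controlling the flow of fluid in a pipe",
    "A pump and a valve assembly for fluid transfer",
    "An antenna for wireless signal transfer",
]
terms, weights = build_terms(corpus)
vectors = [document_vector(doc, terms, weights) for doc in corpus]
# A query goes through exactly the same pipeline as a document.
query_vector = document_vector("fluid valve", terms, weights)
```

Terms that occur in fewer documents receive a larger global weight here, so a rare term such as "antenna" contributes more to a vector than a common one such as "valve".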
      <Paragraph position="4"> Each patent document is usually provided with an abstract. The abstract can be used for content-based information retrieval using the DLSI method described above. As mentioned before, a content-based information retrieval system built on LSI analysis is not robust enough to be applied directly to a real system. We therefore use the DLSI method only to narrow down the search space in a first stage of filtering, and then resort to a form-based search strategy to pin down the patent document.</Paragraph>
      <Paragraph position="5"> Now that the content-based DLSI search scheme has narrowed down the search space in content, the form-based search strategy we employ next need not pay attention to synonymous expressions of the search terms or sentences.</Paragraph>
      <Paragraph position="6"> This first stage of filtering is thus implemented without the tedious process of handling synonymous expressions through synonym dictionaries, which are hard to develop and to use. Even if we succeeded in treating synonyms, polysemy in natural language would further reduce the advantage of a synonym dictionary, because two words that are synonymous in one situation might not be so in another, depending on context.</Paragraph>
      <Paragraph position="7"> In view of the lengthy sentences used in patent documents, including their abstracts, we want to emphasize that an automaton-based template structure is an extremely efficient way of expressing lengthy sentences together with their synonymous expressions.</Paragraph>
      <Paragraph position="8"> We will demonstrate this point by way of examples below. For the sentence &quot;There are beautiful parks in Japan across the nation&quot;, we can use a template such as that of figure 1, where a variety of synonymous expressions are explicitly represented.</Paragraph>
      <Paragraph position="9"> The problem here is how to obtain the template for the abstract of a patent document. First, we regard the original abstract itself as the simplest template. Then, we register queries into the matched template structures by combining each pair of matched terms into one node. This is illustrated by the example procedure in figures 1-3. The original template of an abstract is shown in figure 1; when the query of figure 2, namely &quot;There are lovely parks across Japan&quot;, is matched to the template of figure 1, the template can be modified to the new structure of figure 3.</Paragraph>
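The registration step described above can be sketched as a small word graph. This is a hypothetical illustration, not the template structure of figures 1-3: the class name `Template`, the use of `difflib.SequenceMatcher` for aligning the query with the abstract, and the restriction that every query is aligned only against the original abstract are all assumptions of this sketch.

```python
from difflib import SequenceMatcher

class Template:
    """Word graph whose START-to-END paths are the accepted sentences."""

    START, END = "START", "END"

    def __init__(self, abstract):
        # The original abstract is the simplest template: a single chain.
        self.backbone = abstract.split()
        # Node ids 0..n-1 are the words of the original abstract.
        self.words = dict(enumerate(self.backbone))
        self.edges = {}
        self._add_path(list(range(len(self.backbone))))

    def _add_path(self, node_ids):
        path = [self.START] + node_ids + [self.END]
        for src, dst in zip(path, path[1:]):
            self.edges.setdefault(src, set()).add(dst)

    def register(self, query):
        """Merge a query: matched words reuse abstract nodes, others get new nodes."""
        q_words = query.split()
        matcher = SequenceMatcher(None, self.backbone, q_words, autojunk=False)
        aligned = {}
        for block in matcher.get_matching_blocks():
            for offset in range(block.size):
                aligned[block.b + offset] = block.a + offset
        node_ids = []
        for pos, word in enumerate(q_words):
            if pos in aligned:
                node_ids.append(aligned[pos])
            else:
                new_id = len(self.words)
                self.words[new_id] = word
                node_ids.append(new_id)
        self._add_path(node_ids)

    def accepts(self, sentence):
        """True if the sentence spells out some START-to-END path."""
        frontier = {self.START}
        for word in sentence.split():
            frontier = {
                nxt
                for node in frontier
                for nxt in self.edges.get(node, ())
                if nxt != self.END and self.words[nxt] == word
            }
            if not frontier:
                return False
        return any(self.END in self.edges.get(node, ()) for node in frontier)

template = Template("There are beautiful parks in Japan across the nation")
template.register("There are lovely parks across Japan")
print(template.accepts("There are beautiful parks across Japan"))
print(template.accepts("There are ugly streets in Japan"))
```

Because the alignment is monotonic, the merged graph stays acyclic, and sentences that mix parts of the abstract and of the registered query become paths of the template, while an unregistered sentence does not.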
      <Paragraph position="10"> Suppose that the query sentence is &quot;There are ugly streets in Japan&quot;. Although we could locate a matching pattern similar to that of figure 2, we have to rule it out so that we do not end up with a template which includes the above sentence as a path, or as part of a path. This mechanism should be established from users' responses. We will explain it in Section 4.1.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 The Flow of the Search Process
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 The Entire Flow of the Complete Search Process
</SectionTitle>
      <Paragraph position="0"> Before starting the search process, we should set up the DLSI model for all the patent documents.</Paragraph>
      <Paragraph position="1"> 1. Locate the query in the DLSI space. 2. Using the DLSI matching algorithm, find and select those patent documents whose abstract vectors lie in a neighborhood of the query vector and thus have semantic similarity to sentences such as those of figure 1. 3. For each of the abstracts obtained in step 2, use the template matching algorithm of (Chen and Tokuda, 2003a) to calculate the similarity between the abstract and the query, and select the documents whose abstracts have the highest similarity to the query.</Paragraph>
      <Paragraph position="2">  4. Show the result to the user.</Paragraph>
      <Paragraph position="3"> 5. Modify the abstracts in the database according to users' responses.</Paragraph>
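Steps 1 to 4 above amount to a two-stage filter, which can be sketched as follows. The sketch makes several assumptions: cosine similarity with a fixed threshold stands in for the DLSI matching algorithm, a word-level `difflib.SequenceMatcher` ratio stands in for the template matching algorithm of (Chen and Tokuda, 2003a), and the vectors, the threshold `radius` and the record layout are made up for illustration. Step 5, the update from users' responses, is not covered.

```python
import math
from difflib import SequenceMatcher

def cosine(u, v):
    """Cosine similarity, used here as a stand-in for DLSI matching."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def template_similarity(abstract, query):
    """Stand-in for the template matcher: word-level sequence similarity."""
    matcher = SequenceMatcher(None, abstract.split(), query.split(), autojunk=False)
    return matcher.ratio()

def search(query_text, query_vec, database, radius=0.5, top_n=1):
    # Steps 1-2: keep documents whose abstract vector is close to the query vector.
    candidates = [d for d in database if cosine(d["vector"], query_vec) >= radius]
    # Step 3: rank the survivors by form-based similarity of abstract and query.
    ranked = sorted(
        candidates,
        key=lambda d: template_similarity(d["abstract"], query_text),
        reverse=True,
    )
    # Step 4: return the best matches for display to the user.
    return ranked[:top_n]

# Made-up records: ids, abstracts and low-dimensional vectors are illustrative.
database = [
    {"id": "P1", "abstract": "There are beautiful parks in Japan across the nation",
     "vector": [0.9, 0.1, 0.0]},
    {"id": "P2", "abstract": "There are many rivers in Japan",
     "vector": [0.8, 0.3, 0.1]},
    {"id": "P3", "abstract": "A valve for controlling fluid flow",
     "vector": [0.0, 0.1, 0.9]},
]
results = search("There are lovely parks across Japan", [1.0, 0.2, 0.0], database)
print([d["id"] for d in results])
```

The first stage discards P3 on content alone, so the second stage only has to compare the wording of the two remaining abstracts with the query.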
    </Section>
  </Section>
</Paper>