A Novel Method for Content Consistency and Efficient Full-text 
Search for P2P Content Sharing Systems 
 
Hideki Mima 
University of Tokyo 
7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
mima@biz-model.t.u-tokyo.ac.jp 
Hideto Tomabechi 
Cognitive Research Lab 
7-8-25 Roppongi, Minato-ku, Tokyo, Japan 
Hideto_Tomabechi@crl.co.jp 
Abstract 
A problem associated with current P2P (peer-to-peer) 
systems is that the consistency between copied contents 
is not guaranteed. Additionally, the limitation of full-
text search capability in most of the popular P2P 
systems hinders the scalability of P2P-based content 
sharing systems. We propose a new P2P content 
sharing system in which the consistency of contents in 
the network is maintained after updates or modifications 
have been made to the contents. Links to the 
downloaded contents are maintained on a server. As a 
result, the updates and modifications to the contents can 
be instantly detected and hence get reflected in future 
P2P downloads. Natural language processing, including 
morphological analysis, is distributed among the P2P 
clients, and the inverted index on the server is updated 
concurrently to provide an efficient full-text search. 
We describe the scheme and report a preliminary 
experimental result. 
1   Introduction 
P2P content sharing systems can distribute large 
amounts of contents with limited resources. By utiliz-
ing this exceptional feature, the P2P content sharing 
model is expected to be one of the major means for 
exchanging contents. 
However, the presently available P2P content shar-
ing systems are mainly used to illegally copy movies 
and music contents. In some cases, the service provid-
ers are accused of such illegal data exchange. 
We have identified the following technical 
problems, which may contribute to the above-mentioned 
misuse of P2P. 
First, the presently available commercial P2P con-
tent sharing systems do not provide sufficient func-
tions to track the exchange of contents among users. 
Due to this, service providers cannot monitor the 
illegal exchange or tampering of shared contents 
among users. 
Second, the presently available commercial P2P 
content sharing systems provide only simple search 
functions, such as keyword search; therefore, they are 
unsuitable for contents that are frequently updated 
or that consist of text. In practice, current P2P content 
sharing systems are mainly used to share movies 
and music because such contents are not frequently 
updated. An appropriate search method is required 
before P2P content sharing systems can be applied to 
text contents and to retrieving the latest version of 
contents. 
In order to solve these technical problems, we are 
developing a content consistency maintenance method 
and an information search technique for P2P content 
sharing systems. Our content consistency maintenance 
method consists of a technique that prevents the tam-
pering of contents and a method that maintains consis-
tency between the following: 
1. how users exchange contents on a P2P contents 
sharing system and 
2. how the service provider recognizes the 
exchange of contents. 
Finally, we aim to standardize the result of previ-
ous research  [10].  
In order to handle content updates, the P2P 
content sharing system that we are developing maintains 
a digital signature for each version of the content. Our 
system uses a download protocol based on asymmetric 
key encryption to maintain content consistency. So that 
clients can obtain the latest version of contents, even 
after an update, the system keeps links to the 
original and the downloaded contents. These links are 
managed on a central server. 
In order to efficiently implement a full-text search, 
clients connected to our system perform morphological 
analysis and summarization of the text to generate 
the text information necessary for building a reverse 
(inverted) index on a central server. This text information 
is stored on the central server whenever the content is 
updated. To reduce the load of full-text search, the 
search results are cached on clients. With these 
techniques, we can distribute the load of natural language 
processing among clients and keep text search fast even 
when contents are updated. 
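The division of labor described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in our system: the names `extract_terms` and `IndexServer` are hypothetical, a lowercased word split stands in for morphological analysis, and client-side caching of search results is omitted.

```python
import re
from collections import defaultdict


def extract_terms(text: str) -> set:
    """Client side: stand-in for morphological analysis (lowercased word split)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class IndexServer:
    """Server side: keeps the reverse (inverted) index up to date."""

    def __init__(self):
        self.index = defaultdict(set)   # term -> IDs of documents containing it
        self.terms_of = {}              # document ID -> terms currently indexed

    def update(self, doc_id: str, terms: set) -> None:
        # Drop the postings of the previous version, then add the new ones.
        for term in self.terms_of.get(doc_id, set()):
            self.index[term].discard(doc_id)
        for term in terms:
            self.index[term].add(doc_id)
        self.terms_of[doc_id] = set(terms)

    def search(self, term: str) -> set:
        return set(self.index.get(term, set()))


server = IndexServer()
# The client analyzes each version locally and sends only the terms.
server.update("doc-1", extract_terms("Peer to peer content sharing"))
server.update("doc-1", extract_terms("Full-text search for shared content"))
```

After the second update, terms that appear only in the first version no longer match the document, so a search reflects the latest version.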
In this paper, we briefly describe the P2P content 
sharing system that we are developing and the tech-
niques used in it, namely, a content consistency main-
tenance method and a full-text search method. We 
also report the result of a preliminary experiment on 
load balancing of full-text search by our technique. 
This paper is structured as follows: Section 2 de-
scribes related work. Section 3 briefly describes the 
P2P content sharing system that we are developing. 
Sections 4 and 5 describe techniques for content con-
sistency maintenance and full-text search, respectively. 
Finally, Section 6 presents the conclusion and future 
work. 
2   Related Work 
Two lines of research are related to our work: 
research on content consistency maintenance and 
research on information search in a P2P environment. 
In this paper, we refer to a hybrid P2P system that 
uses a central server, such as Napster, as a P2P system, 
although it is not entirely decentralized. This is because 
even a hybrid P2P system has an important 
advantage in terms of content sharing: it can distribute 
large amounts of contents with less bandwidth 
consumption on the service provider's side. 
2.1   Contents Consistency Maintenance  
Since the contents are stored on clients in a P2P 
content sharing system, malicious clients can tamper 
with the contents if no protection method against 
tampering is provided. 
The Napster protocol [4] uses the MD5 hash function: 
a content publisher sends the hash value of a content 
to the central server when it publishes the content. 
Freenet [2] prevents tampering with the content 
by using the hash value of a content as its key. 
This technique is effective in preventing the tam-
pering of static content such as a movie or music 
content. However, when this technique is applied to 
frequently updated contents, each version is treated as 
a separate content because different versions have 
different keys. To handle such frequently updated 
contents, Freenet introduced indirect files in which the 
hash values of the contents are stored. By retrieving 
an indirect file, a user can retrieve the last updated 
content in two steps. In order to share frequently up-
dated contents, we need to provide a mechanism that 
associates the content ID with the hash value of a 
particular version of the content, as in the case of 
Freenet. 
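The two-step retrieval through an indirect file can be sketched as follows. This is a minimal illustration, not Freenet's actual implementation: the `ContentStore` class, its method names, and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib


class ContentStore:
    """Toy store: immutable blobs keyed by hash, plus a mutable indirection table."""

    def __init__(self):
        self.blobs = {}      # hash value -> content bytes (one entry per version)
        self.indirect = {}   # content ID -> hash value of the latest version

    def publish(self, content_id: str, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blobs[key] = data
        self.indirect[content_id] = key   # point the ID at the new version
        return key

    def fetch_latest(self, content_id: str) -> bytes:
        key = self.indirect[content_id]   # step 1: resolve the ID to a hash
        data = self.blobs[key]            # step 2: fetch the blob by its hash
        # The key doubles as an integrity check against tampering.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("content does not match its key")
        return data


store = ContentStore()
v1 = store.publish("report", b"first draft")
v2 = store.publish("report", b"second draft")
```

Each version gets its own key, while the content ID stays stable and always resolves to the most recently published version.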
Another problem of P2P content sharing systems is 
that the provider of a content sharing service cannot 
trace the exchange of contents among users. 
Napster, which is a centralized P2P content sharing 
system similar to our system, uses a download proto-
col by which the clients send a download request to 
the central server before they download the content 
from another client. After this, the central server does 
not participate in the download process of the content. 
Using this protocol, the central server cannot identify 
whether a download has been carried out successfully 
or not. A malicious client can send the same informa-
tion to the central server and pretend that a download 
request has been made by another client. It is also 
possible to send tampered content to another client 
without being detected by the central server. 
2.2   Information Search in P2P Environment 
The two types of search techniques that are widely 
used in P2P content sharing systems include using a 
central search server  [4] and flooding of search re-
quests  [6]. 
The problems of using a central server, such as 
the poor scalability of a central search server and the 
vulnerability that arises from a single point of failure, 
are widely known. The flooding of search requests also 
has scalability problems: as the number of nodes in a 
network increases, the flooded search requests 
consume a major part of the bandwidth. In order 
to reduce search requests, many flooding-based systems 
limit the search range with heuristic methods. 
As a result, these systems cannot assure that all 
existing contents in a network will be found. 
In order to solve the problems associated with the 
above mentioned techniques, several search methods 
based on distributed hash tables (DHT) have been 
proposed  [5] [7]. These methods are scalable to a con-
siderable extent. A characteristic of these methods is 
that exact match key search can be done with $O(\log n)$ 
or $O(n^{a})$ hops. 
Reynolds and Vahdat proposed a method for im-
plementing full-text search by distributing the reverse 
index on a DHT. In this method, a key in a hash table 
corresponds to a particular keyword in a document, 
and a value in a hash table corresponds to a document 
that contains a keyword. A client that publishes a 
document notifies the nodes that correspond to the 
keywords contained in the document and updates the 
reverse indexes on these nodes. In this method, the 
load of the full-text search can be distributed among 
the nodes. We can also expect that the reverse indexes 
on the nodes can be updated rapidly by pushing the 
latest keywords in the contents from a client. 
On the other hand, this method has several limita-
tions. For example, when an AND search is per-
formed by this method, the search results must be 
transferred between the nodes. Li estimated the 
amount of resources that is necessary to implement a 
full-text search engine based on this method and 
pointed out that it is difficult to implement a large-
scale search engine, such as Google, by this method 
 [8]. 
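The keyword-to-node mapping and the cost of an AND search can be sketched as follows. This is a simplified illustration rather than the method of Reynolds and Vahdat itself: `NUM_NODES`, `node_for`, `publish`, and `and_search` are hypothetical names, and a plain modulo over a hash stands in for DHT routing.

```python
import hashlib

NUM_NODES = 4  # hypothetical number of DHT nodes


def node_for(keyword: str) -> int:
    """Map a keyword to the node responsible for its posting list."""
    digest = hashlib.sha256(keyword.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES


# One partial reverse index per node: keyword -> set of document IDs.
nodes = [dict() for _ in range(NUM_NODES)]


def publish(doc_id: str, keywords: set) -> None:
    # The publishing client notifies the node of every keyword in the document.
    for kw in keywords:
        nodes[node_for(kw)].setdefault(kw, set()).add(doc_id)


def and_search(kw_a: str, kw_b: str):
    """Return matching documents and the number of IDs shipped between nodes."""
    postings_a = nodes[node_for(kw_a)].get(kw_a, set())
    postings_b = nodes[node_for(kw_b)].get(kw_b, set())
    # If the keywords live on different nodes, one posting list must travel.
    transferred = len(postings_a) if node_for(kw_a) != node_for(kw_b) else 0
    return postings_a & postings_b, transferred


publish("d1", {"p2p", "search"})
publish("d2", {"p2p", "index"})
hits, transferred = and_search("p2p", "search")
```

The transfer count grows with the length of the posting list, which is why frequent keywords make AND searches expensive in this design.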
Furthermore, if this method were applied to a P2P 
content sharing system, the problem of low availabil-
ity of nodes would arise because the users’ PCs would 
be used as nodes in such a system. To keep the reverse 
indexes stored on such nodes available, we would have 
to replicate them. This would require even more 
resources than Li estimated. 
Based on the above mentioned reasons, we believe 
that a full-text search technique using a central search 
server that manages reverse indexes is more feasible 
than a distributed reverse index technique for imple-
menting a full-text search engine in a P2P environ-
ment. 
3   System Architecture 
Figure 1 shows the architecture of our system. As 
described earlier, we chose a central server architec-
ture to provide a full-text search of the contents. 
The public keys of the clients are stored on a cen-
tral server. By sending a request to the server, a client 
can obtain a public key of another client that is con-
nected with the central server. The central server also 
has private and public keys. Its public key is available 
to all the clients. 
Each client has a unique ID. When a client connects 
to the central server, it sends its own IP address. 
Another client can then obtain that IP address by 
querying the server with the client ID. The central 
server also provides a content consistency maintenance 
mechanism and a full-text search engine. These 
mechanisms are described in the following sections. 
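The directory role of the central server can be sketched as follows. This is a minimal illustration under stated assumptions: the `CentralServer` class and its method names are hypothetical, keys are treated as opaque bytes, and the IP address is taken from a range reserved for documentation.

```python
from dataclasses import dataclass, field


@dataclass
class CentralServer:
    public_keys: dict = field(default_factory=dict)  # client ID -> public key
    addresses: dict = field(default_factory=dict)    # client ID -> current IP address

    def register_key(self, client_id: str, public_key: bytes) -> None:
        self.public_keys[client_id] = public_key

    def connect(self, client_id: str, ip_address: str) -> None:
        # A client reports its current IP address each time it connects.
        self.addresses[client_id] = ip_address

    def lookup_key(self, client_id: str) -> bytes:
        return self.public_keys[client_id]

    def lookup_address(self, client_id: str) -> str:
        return self.addresses[client_id]


server = CentralServer()
server.register_key("client-a", b"public-key-of-a")
server.connect("client-a", "192.0.2.10")
```

Another client that knows only the ID `client-a` can then ask the server for the key and the address before contacting that client directly.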
4   Content Consistency Maintenance 
4.1   Data Structure for Content Management 
In this system, the publisher of a document digitally 
signs the document with its private key and registers the 
signature on the central search server with its unique ID. 
When the document is a text document, the client also 
performs morphological analysis to generate search 
keywords from the document. 
The content IDs and the digital signatures corresponding 
to the different versions are managed on the central 
search server. A client can obtain the digital signature 
of a document by querying the central server with the 
ID and version of the document. Verifying this signature 
reveals whether a malicious client has tampered 
with the document. 
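Signing each version and verifying a downloaded copy can be sketched as follows. This is a toy illustration only, not the scheme used in our system: it uses textbook RSA without padding and with a key far too small for real use, and the names `sign`, `verify`, and `signatures` are hypothetical.

```python
import hashlib
from math import gcd

# Toy RSA key pair for illustration only; far too small for real use.
P, Q = 2**61 - 1, 2**89 - 1          # two Mersenne primes
N, E = P * Q, 65537                  # public key (modulus, exponent)
PHI = (P - 1) * (Q - 1)
assert gcd(E, PHI) == 1
D = pow(E, -1, PHI)                  # private exponent (needs Python 3.8+)


def digest(document: bytes) -> int:
    """Hash the document and reduce the hash into the range of the modulus."""
    return int.from_bytes(hashlib.sha256(document).digest(), "big") % N


def sign(document: bytes) -> int:
    """Publisher side: sign the document hash with the private key."""
    return pow(digest(document), D, N)


def verify(document: bytes, signature: int) -> bool:
    """Downloader side: check the signature with the publisher's public key."""
    return pow(signature, E, N) == digest(document)


# Server-side table: (content ID, version) -> signature.
signatures = {}
signatures[("doc-1", 1)] = sign(b"version one")
signatures[("doc-1", 2)] = sign(b"version two")
```

A client that downloads version 2 from another client fetches `signatures[("doc-1", 2)]` from the server and calls `verify` on the received bytes; a modified or outdated copy fails the check.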
A search result obtained from the central server is 
also digitally signed. As described in detail in Section 5, 
search results are cached on clients, where they could 
be modified. To make such modification detectable, a 
search result comprises the IDs of contents and a 
digital signature. 
In this system, a client can obtain the latest version 
of a document, even after the document is updated, by 
querying the central server with the document's ID. A 
limitation of this method is that only the latest 
version of a document can be obtained. By contrast, 
with indirect files and content hash values, as 
in Freenet, a previous version of a 
document can be obtained by directly specifying the 
hash value of that version. However, Freenet assures 
neither that the latest version is always obtained nor 
that a particular earlier version is obtained, 
because a previous version may be deleted if there is 
no request for it in a certain period. In our system, we 
consider only the latest version of a document, which 
can be obtained at any time. Thus, we define our 
document query protocol to obtain the latest 
version. 
In order to prevent the concentration of download 
requests on a certain client, our system manages a list 
of clients that have downloaded the latest version of a 
document and distributes download sources to these 
clients using this list. 
In this method, the ID of a client that downloads 
the latest version of a document is added to the list 
associated with the ID of that document. When a 
client sends a request to the central server to 
download a document, the central server selects an 
appropriate client from the document's downloader list 
and returns its ID to the requesting client. When the 
publisher updates a document, the list corresponding to 
that document is reset so that it contains only the 
publisher. 
We describe this procedure with the following 
pseudocode, where download is a function that 
requests the download of a document, nodeId is the 
ID of the client that makes the request, update is 
a function that requests the update of a document, and 
getNodeId is a function that returns the ID of a client 
that has downloaded the document whose ID is docId. 
 
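  nodeIdList: document ID x node ID list  
 
  // Record that nodeId now holds the latest version of docId. 
  download(docId, nodeId) { 
    nodeIdList[docId].add(nodeId); } 
 
  // A new version exists only on the updating client; reset the list. 
  update(docId, nodeId) { 
    nodeIdList[docId] = {nodeId}; } 
 
  // Pick a download source at random; rand() is assumed to return 
  // a value in [0, 1), so the index must be rounded down. 
  getNodeId(docId) { 
    index = floor(rand() * nodeIdList[docId].length); 
    return nodeIdList[docId][index]; } 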
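For readers who want to run the procedure, the pseudocode can be transcribed into Python roughly as follows. This is a sketch under assumptions, not the code of our system: the `DownloadTracker` class name is hypothetical, uniform random selection stands in for whatever policy chooses an appropriate client, and duplicate entries are skipped, which the pseudocode leaves open.

```python
import random


class DownloadTracker:
    def __init__(self):
        # document ID -> IDs of clients holding the latest version
        self.node_id_list = {}

    def update(self, doc_id, node_id):
        # A new version invalidates all copies; only the updating client has it.
        self.node_id_list[doc_id] = [node_id]

    def download(self, doc_id, node_id):
        holders = self.node_id_list.setdefault(doc_id, [])
        if node_id not in holders:
            holders.append(node_id)

    def get_node_id(self, doc_id):
        # Spread download requests over all current holders.
        return random.choice(self.node_id_list[doc_id])


tracker = DownloadTracker()
tracker.update("doc-1", "publisher")
tracker.download("doc-1", "client-a")
source = tracker.get_node_id("doc-1")
```

After another call to `update`, the list shrinks back to the publisher, so stale copies are never offered as download sources.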
4.2   Tracing How Contents are Exchanged 
In a P2P content sharing system that uses a simple 
download protocol, such as Napster, when a service 
Figure 1. System Architecture. The central server holds the client public keys, contents certificates, links to contents, and the full-text search index; the clients hold the contents. 
