A SECURE AND DYNAMIC
MULTI-KEYWORD RANKED SEARCH SCHEME OVER ENCRYPTED CLOUD DATA
Abstract—Due
to the increasing popularity of cloud computing, more and more data owners are
motivated to outsource their data to cloud servers for great convenience and reduced
cost in data management. However, sensitive data should be encrypted before outsourcing
for privacy requirements, which obsoletes data utilization like keyword-based document retrieval. In this
paper, we present a secure multi-keyword ranked search scheme over encrypted
cloud data, which simultaneously supports dynamic update operations like
deletion and insertion of documents. Specifically, the vector space model and
the widely-used TF_IDF model are combined in the index construction and
query generation. We construct a special tree-based index structure and propose
a “Greedy Depth-first Search” algorithm to provide efficient multi-keyword
ranked search. The secure kNN algorithm is utilized to encrypt the index and
query vectors, and meanwhile ensure accurate relevance score calculation
between encrypted index and query vectors. In order to resist statistical
attacks, phantom terms are added to the index vector for blinding search
results . Due to the use of our special tree-based index structure, the
proposed scheme can achieve sub-linear search time and deal with the deletion
and insertion of documents flexibly. Extensive experiments are conducted to
demonstrate the efficiency of the proposed scheme.
EXISTING
SYSTEM:
Searchable encryption schemes enable the
clients to store the encrypted data to the cloud and execute keyword search
over ciphertext domain. Due to different cryptography primitives, searchable
encryption schemes can be constructed using public key based cryptography or symmetric key based cryptography. Song et
al. proposed the first symmetric
searchable encryption (SSE) scheme, and the search time of their scheme is
linear to the size of the data collection. Goh proposed formal security definitions for SSE
and designed a scheme based on Bloom filter. The search time of Goh’s scheme is
O (n), where n is the cardinality of the document
collection. Curtmola et al. proposed
two schemes (SSE-1 and SSE-2) which achieve the optimal search time. Their
SSE-1 scheme is secure against chosen-keyword attacks (CKA1) and SSE-2 is secure
against adaptive chosen-keyword attacks (CKA2) schemes, which are very simple
in terms of functionality. Afterward, abundant works have been proposed under different
threat models to achieve various search functionality, such as single keyword
search, similarity search ,
multi-keyword boolean search , ranked search and multi-keyword ranked search,
etc. Multi-keyword boolean search allows the users to input multiple query
keywords to request suitable documents. Among these works, conjunctive keyword
search schemes only return the documents that contain all of the query
keywords. Disjunctive keyword search schemes return all of the documents that
contain a subset of the query keywords. Predicate search schemes are
proposed to support both conjunctive and disjunctive search. All these
multikeyword search schemes retrieve search results based on the existence of
keywords, which cannot provide acceptable result ranking functionality. Ranked
search an enable quick search of the
most relevant data. Sending back only the top-k most relevant documents
can effectively decrease network traffic. Some early works have realized the ranked search using
order-preserving techniques, but they are designed only for single keyword
search. Cao et al. realized the
first privacy-preserving multi-keyword ranked search scheme, in which documents
and queries are represented as vectors of dictionary size. With the “coordinate
matching”, the documents are ranked according to the number of matched query
keywords. However, Cao et al.’s scheme does not consider the importance
of the different keywords, and thus is not accurate enough. In addition, the
search efficiency of the scheme is linear with the cardinality of document
collection. Sun et al. presented
a secure multi-keyword search scheme that supports similarity-based ranking. The
authors constructed a searchable index tree based on vector space model and
adopted cosine measure together with TF×IDF to provide ranking results
PROPOSED
SYSTEM:
Due to the special structure of our tree-based
index, the proposed search scheme can flexibly achieve sub-linear search time
and deal with the deletion and insertion of documents. The secure kNN algorithm
is utilized to encrypt the index and query vectors, and meanwhile ensure
accurate relevance score calculation between encrypted index and query vectors.
To resist different attacks in different threat models, we construct two secure
search schemes: the basic dynamic multi-keyword ranked search (BDMRS) scheme in
the known ciphertext model, and the enhanced dynamic multi-keyword ranked
search (EDMRS) scheme in the known background model. Our contributions are
summarized as follows:
1) We design a searchable encryption scheme that supports
both the accurate multi-keyword ranked search and flexible dynamic operation on
document collection.
2) Due to the special structure of our tree-based
index, the search complexity of the proposed scheme is fundamentally kept to
logarithmic. And in practice, the proposed scheme can achieve higher search efficiency
by executing our “Greedy Depth-first Search” algorithm. Moreover, parallel
search can be flexibly performed to further reduce the time cost of search
process.
Module
1
The System and Threat Models
The system model in this paper involves
three different entities: data owner, data user and cloud server.
Data owner has
a collection of documents F ={f1;
f2; :::; fn} that
he wants to outsource to the cloud server in encrypted form while still keeping
the capability to search on them for effective utilization. In our scheme, the
data owner firstly builds a secure searchable tree index I from document
collection F,
and then generates an encrypted document collection C for F. Afterwards, the data
owner outsources the encrypted collection C and the secure index I to the cloud server, and
securely distributes the key information of trapdoor generation (including
keyword IDF values) and document decryption to the authorized data users. Besides,
the data owner is responsible for the update
operation of his documents stored in the
cloud server. While updating, the data owner generates the update information
locally and sends it to the server.
Data users are
authorized ones to access the documents of data owner. With t query
keywords, the authorized user can generate a trapdoor TD according to
search control mechanisms to fetch k encrypted documents from cloud
server. Then, the data user can decrypt the documents with the shared secret
key.
Cloud server stores
the encrypted document collection C and the encrypted searchable tree index I for data owner. Upon
receiving the trapdoor TD from the data user, the cloud server executes
search over the index tree I,
and finally returns the corresponding collection of top-k ranked
encrypted documents. Besides, upon receiving the update information from the
data owner, the server needs to update the index I and document collection
C according to the
received information. The cloud server in the proposed scheme is considered as
“honest-but-curious”, which is employed by lots of works on secure cloud data
search. Specifically, the cloud server honestly and correctly executes
instructions in the designated protocol. Meanwhile, it is curious to infer and
analyze received data, which helps it acquire additional information. Depending
on what information the cloud server knows, we adopt the two threat models
proposed by Cao et al..
Known Ciphertext Model. In
this model, the cloud server only knows the encrypted document collection C, the searchable index
tree I, and the search
trapdoor TD submitted by the authorized user. That is to say, the cloud
server can conduct ciphertext-only attack (COA) in this model.
Known Background Model. Compared
with known ciphertext model, the cloud server in this stronger model is
equipped with more knowledge, such as the term frequency (TF) statistics of the
document collection. This statistical information records how many documents
are there for each term frequency of a specific keyword in the whole document
collection, as shown in Fig. 2, which could be used as the keyword identity.
Equipped with such statistical information, the cloud server can conduct TF
statistical attack to deduce or even identify certain keywords through
analyzing histogram and value range of the corresponding frequency
distributions.
Module
2
Design Goals
To enable secure, efficient, accurate
and dynamic multikeyword ranked search over outsourced encrypted cloud data
under the above models, our system has the following design goals.
Dynamic: The
proposed scheme is designed to provide not only multi-keyword query and
accurate result ranking, but also dynamic update on document collections.
Search Efficiency: The
scheme aims to achieve sublinear search efficiency by exploring a special
tree-based index and an efficient search algorithm.
Privacy-preserving: The
scheme is designed to prevent the cloud server from learning additional
information about the document collection, the index tree, and the query. The
specific privacy requirements are summarized as follows,
1) Index Confidentiality and Query Confidentiality: The underlying
plaintext information, including keywords in the index and query, TF values of
keywords stored in the index, and IDF values of query keywords, should be
protected from cloud server;
2) Trapdoor Unlinkability: The cloud server should not be able to
determine whether two encrypted queries (trapdoors) are generated from the same
search request;
3) Keyword Privacy: The cloud server could not identify the
specific keyword in query, index or document collection by analyzing the
statistical information like term frequency. Note that our proposed scheme is
not designed to protect access pattern, i.e., the sequence of returned
documents.
Module
3
Search Process of UDMRS Scheme
The search process of the UDMRS scheme
is a recursive procedure upon the tree, named as “Greedy Depthfirst Search
(GDFS)” algorithm. We construct a result list denoted as RList, whose
element is defined as ⟨RScore;
FID⟩.
Here, the RScore is the relevance score of the document fFID to
the query, which is calculated according to Formula (1). The RList stores
the k accessed documents with the largest relevance scores to the query.
The elements of the list are ranked in descending order according to the RScore,
and will be updated timely during the search process.
(2). RScore(Du;Q) – The function
to calculate the relevance score for query vector Q and index vector Du
stored in node u.
kthscore –
The smallest relevance score in current RList, which is initialized as
0.
hchild –
The child node of a tree node with higher relevance score.
lchild – The child node of
a tree node with lower relevance score.
Since the possible largest relevance
score of documents rooted by the node u can be predicted, only a part of
the nodes in the tree are accessed during the search process.
Module
4
BDMRS Scheme
Based on the UDMRS scheme, we construct
the basic dynamic multi-keyword ranked search (BDMRS) scheme by using the
secure kNN algorithm. The BDMRS scheme is designed to achieve the goal of
privacypreserving in the known ciphertext model, and the four algorithms
included are described as follows:
• SK ← Setup() Initially, the
data owner generates the secret key set SK, including 1) a randomly generated m-bit
vector S where m is equal to the cardinality of dictionary, and
2) two (m×m)
invertible matrices M1 and M2. Namely, SK = {S;M1;M2}.
• I ← GenIndex(F; SK) First, the unencrypted
index tree T is
built on F by
using T ← BuildIndexTree(F). Secondly, the data
owner generates two random vectors {Du′;Du′′} for
index vectorDu in each node u, according to the secret vector S.
Specifically, if S[i] = 0, Du′[i] and Du′′[i] will be set
equal to Du[i]; if S[i] = 1, Du′[i] and Du′′[i] will be set
as two random values whose sum equals to Du[i]. Finally, the
encrypted index tree I is
built where the node u stores two encrypted index vectors Iu = {MT1
Du′;MT2
Du′′}.
• TD
← GenTrapdoor(Wq; SK) With keyword set Wq,
the unencrypted query vector Q with length of m is generated. If wi
∈ Wq,
Q[i] stores the normalized IDF value of wi; else Q[i]
is set to 0. Similarly, the query vector Q is split into two random
vectors Q′ and
Q′′.
The difference is that if S[i] = 0, Q′[i] and Q′′[i] are set to
two random values whose sum equals to Q[i]; else Q′[i] and Q′′[i] are set as
the same as Q[i]. Finally, the algorithm returns the trapdoor
TD = {M−11 Q′;M−12 Q′′}.
• RelevanceScore
← SRScore(Iu;TD)
With the trapdoor TD, the cloud server computes the relevance score of
node u in the index tree I to the query. Note that the relevance score
calculated from encrypted vectors is equal to that from unencrypted vectors.
Module
5
EDMRS Scheme
The BDMRS scheme can protect the Index
Confidentiality and Query Confidentiality in the known ciphertext model. However, the cloud server is able to
link the same search requests by tracking path of visited nodes. In addition,
in the known background model, it is possible for the cloud server to identify
a keyword as the normalized TF distribution of the keyword can be exactly
obtained from the final calculated relevance scores. The primary cause is that
the relevance score calculated from Iu and TD is exactly equal to
that from Du and Q. A heuristic method to further improve the
security is to break such exact equality. Thus, we can introduce some tunable randomness
to disturb the relevance score calculation. In addition, to suit different
users’ preferences for higher accurate ranked results or better protected
keyword privacy, the randomness are set adjustable.
CONCLUSION AND FUTURE WORK
In this paper, a secure, efficient and
dynamic search scheme is proposed, which supports not only the accurate multi-keyword
ranked search but also the dynamic deletion and insertion of documents. We
construct a special keyword balanced binary tree as the index, and propose a
“Greedy Depth-first Search” algorithm to obtain better efficiency than linear
search. In addition, the parallel search process can be carried out to further reduce
the time cost. The security of the scheme is protected against two threat
models by using the secure kNN algorithm. Experimental results demonstrate the efficiency
of our proposed scheme. There are still many challenge problems in symmetric SE
schemes. In the proposed scheme, the data owner is responsible for generating
updating information and sending them to the cloud server. Thus, the data owner
needs to store the unencrypted index tree and the information that are
necessary to recalculate the IDF values. Such an active data owner may not be
very suitable for the cloud computing model. It could be a meaningful but
difficult future work to design a dynamic searchable encryption scheme whose
updating operation can be completed by cloud server only, meanwhile reserving the
ability to support multi-keyword ranked search. In addition, as the most of
works about searchable encryption, our scheme mainly considers the challenge
from the cloud server. Actually, there are many secure challenges in a
multi-user scheme. Firstly, all the users usually keep the same secure key for
trapdoor generation in a symmetric SE scheme. In this case, the revocation of
the user is big challenge. If it is needed to revoke a user in this scheme, we
need to rebuild the index and distribute the new secure keys to all the
authorized users. Secondly, symmetric SE schemes usually assume that all the
data users are trustworthy. It is not practical and a dishonest data user will
lead to many secure problems. For example, a dishonest data user may search the
documents and distribute the decrypted documents to the unauthorized ones. Even
more, a dishonest data user may distribute his/her secure keys to the
unauthorized ones. In the future works, we will try to improve the SE scheme to
handle these challenge problems.
REFERENCES
[1] K. Ren, C.Wang, Q.Wang et al.,
“Security challenges for the public cloud,” IEEE Internet Computing,
vol. 16, no. 1, pp. 69–73, 2012.
[2] S. Kamara and K. Lauter,
“Cryptographic cloud storage,” in Financial Cryptography and Data Security.
Springer, 2010, pp. 136–149.
[3] C. Gentry, “A fully homomorphic
encryption scheme,” Ph.D.dissertation, Stanford University, 2009.[4] O.
Goldreich and R. Ostrovsky, “Software protection and simulation on oblivious
rams,” Journal of the ACM (JACM), vol. 43, no. 3, pp. 431–473, 1996.
[5] D. Boneh, G. Di Crescenzo, R.
Ostrovsky, and G. Persiano, “Public key encryption with keyword search,” in Advances
in Cryptology- Eurocrypt 2004. Springer, 2004, pp. 506–522.
[6] D. Boneh, E. Kushilevitz, R.
Ostrovsky, and W. E. Skeith III, “Public key encryption that allows pir
queries,” in Advances in Cryptology-CRYPTO 2007. Springer, 2007, pp.
50–67.
[7] D. X. Song, D. Wagner, and A.
Perrig, “Practical techniques for searches on encrypted data,” in Security
and Privacy, 2000. S&P 2000. Proceedings. 2000 IEEE Symposium on. IEEE,
2000, pp. 44– 55.
[8] E.-J. Goh et al., “Secure
indexes.” IACR Cryptology ePrint Archive, vol. 2003, p. 216, 2003.
[9] Y.-C. Chang and M. Mitzenmacher,
“Privacy preserving keyword searches on remote encrypted data,” in Proceedings
of the Third international conference on Applied Cryptography and Network
Security. Springer-Verlag, 2005, pp. 442–455.
[10] R. Curtmola, J. Garay, S. Kamara,
and R. Ostrovsky, “Searchable symmetric encryption: improved definitions and
efficient constructions,” in Proceedings of the 13th ACM conference on
Computer and communications security. ACM, 2006, pp. 79–88.
[11] J. Li, Q. Wang, C. Wang, N. Cao, K.
Ren, and W. Lou, “Fuzzy keyword search over encrypted data in cloud computing,”
in INFOCOM, 2010 Proceedings IEEE. IEEE, 2010, pp. 1–5.
[11] K. Schnaitter, S. Abiteboul, T.
Milo, and N. Polyzotis, “On-line index selection for shifting workloads,” in Proc.
IEEE 23rd Int. Conf. Data Eng. Workshop, Istanbul, Turkey, 2007.
[12] P. Unterbrunner, G. Giannikis, G.
Alonso, D. Fauser, and D. Kossmann, “Predictable performance for unpredictable
workloads,” in Proc. VLDB, Lyon, France, 2009.
[13] R. Cheng, J. Chen, and X. Xie,
“Cleaning uncertain data with quality guarantees,” in Proc. VLDB,
Auckland, New Zealand, 2008.
No comments:
Post a Comment