Searchable Encryption: My understanding on: "Search, Find and Resolve: Towards a Taxonomy for Searchable Encryption Schemes"
Hello there. Isaac here.
If you don't know what Searchable Encryption (SE) is, do look up what it is first. The previous post is an introduction on SE, though it's just what I understand and might not be accurate. If you want more accurate answers I suggest you read the Wikipedia article on Searchable Symmetric Encryption (SSE).
This post will pretty much talk about what I understand from this paper: "Search, Find and Resolve: Towards a Taxonomy for Searchable Encryption Schemes" by Ines Kramer, Silvia Schmidt, Mathias Tausig and Manuel Koschuch. Here is the DOI link to it: https://doi.org/10.5220/0007753804140421
What I love about this paper is how much easier it is to understand for a beginner compared to most papers. It provides a general overview on the categories SE can be divided into, which is really useful for someone trying to understand SE. I highly suggest you read the paper for yourself. Do note that quite a bit of this post is copied from the paper.
Anyway, the authors reviewed SE academia published up to 2016. They identified and categorized 6 different domains for the 393 articles they've read. Here are the domains:
- number of users (writer/reader)
- cryptographic primitives
- types of search
- search criteria
- types of data
- security
Number of users (writer/reader)
There are two different users in an SE scheme: writers and readers. They're also called data owners and data users respectively. Writers can generate and transmit encrypted data to the server and readers submit their search query and interpret the results.
Number of users can be divided into 4:
- S/S (single writer/single reader)
- S/M (single writer/multiple readers)
- M/S (multiple writers/single reader)
- M/M (multiple writers/multiple readers)
Which of these to use depends on the use case you're trying to solve.
From what I understand S/S schemes are pretty much just a single user. Usually these schemes use a symmetric key to encrypt and hash data. I mean, you can share this key, but that isn't exactly the most secure usage. I believe that if you have a trusted server though (such as a server you can trust on-premise), you can use that server as a user for the S/S scheme that connects with an untrusted server like a cloud storage server. This of course means that the server needs to be able to do authorization, which is extra work.
For maximum usability though, I believe M/M is the way to go. Schemes with multiple users do tend to be more complex though, like requiring key distribution and access controls. Again depends on the use case. If your already have a system for file authorization, then you can just plug a filesystem adapter for an S/S SE scheme and you should be pretty much done. (*Not advice)
Cryptographic primitives
I'm still quite confused on what cryptographic primitives actually mean. From what I understand in an SE context it refers to the type of algorithm or hash functions you're using.
What I do know is that how the data is encrypted isn't that important in an SE context. It's how the primitives are used in SE related operations (e.g. search queries, index) that's relevant. There are two big types of schemes: Searchable Symmetric Encryption (SSE) and Public-key Encryption with Keyword Search (PEKS). I know that one paper (Chamili, 2017) categorizes SE into more types than just 2, but I haven't dived into that paper yet.
Types of search
This is another one of the important domains. There are basically two types of search in SE: full-domain search and index-based search.
In full-domain search, the search is performed over all data. I have no idea how these work since I haven't looked at any papers utilizing this, but it should be much slower than index-based search, especially when you're not storing a single document but thousands or even tens of thousands.
I am more familiar with index-based searches though. In index-based searches, you basically have a dictionary with pre-defined keywords. This type of search is much easier in terms of data encryption since you practically decouple the data and the search function. There are generally two types of index-based searches: forward index and inverted index.
In a forward index, each data item has a list of keywords. Below is an example by Kramer et al. (2019):
Let's say you want to search the keyword "white". The search function would have to go through each document, then its keywords until it hits the keyword. So document 1, hit on third word, record. Document 2, hit on second word, record. Document 3, go through all keywords but no hits. You then return the recorded answers.
Let's look at an inverted index first before we compare them. An inverted index is... well, an inverted forward index basically. The "key" in the table is now the keyword while the "value" are the data items. Below is also by Kramer et al. (2019):
Same example: "white". You go through each keyword until you hit "white". You then return all the documents. Simple as that.Now obviously inverted index isn't the miracle fruit that has nothing but pros. The problem with it is updating the index. Say you add another document with its own keywords. You have to go through each keyword and update its value with the new document. If you delete a document, you have to go through every keyword and its value to determine if the document to delete exists for that keyword. It's not the fastest, as you can probably see.
With a forward index though, you just remove the associated key-pair, or add a new key-pair into the index, and you're done.
From what I understand, which index you choose depends on your use case. If you have a write-heavy storage, then you'll want a scheme with forward index. If your use case is read-heavy, then inverted index might just be for you.
The paper does mention a third subtype though: hybrid index. I have no idea how these look like; most likely different hybrid indexes have different structures.
Search criteria
This domain is referring to what kind of queries you can make to an SE scheme, and how the query is resolved. Some schemes only accept single keyword queries, but the paper claims that most recent systems support at the very least multiple keywords.
There are equality queries, which I assume are exact match queries. There's also boolean queries, which support AND, OR, and NOT. Ranked search returns results ranked by a pre-defined order. Fuzzy keyword search can work with typos and incomplete keywords. In my opinion, fuzzy keyword searches are the most user-friendly.
Types of data
Basically what kind of data it can work with. I feel that any data should be able to work with most SEs except for data in a database. Like usually you would think of text files, since those are the easiest to get keywords from. But I feel that any file, including audio or even images should work, as long as you can generate an index for them. I don't think generating the keywords for files is direct SE work, so as long as the index can be generated most SE schemes should work.
I still haven't wrapped my head around data inside databases and if they can work with most SE schemes. My initial assumption is that relational databases might need its own schemes, but document databases might work?
Security
There are many different attacks on SE schemes. I found 3 papers on it, though I haven't read them yet. What I do understand is that the encrypted data isn't too relevant to an SE scheme. Most security issues with SE is how an attacker can glean information off of the queries and its results. Stuff like size pattern, search pattern, and access pattern.
The index/size pattern refers to the information that can be deduced from the stored ciphertext/index. This includes documents or index size, number of documents or keywords, document lengths and similarity. The search pattern reveals the same keyword by comparing the matched records of two queries. By examining the history of result-sets in repeated queries the access pattern can be inferred.
The above paragraph is ripped off straight from the paper. I can't think of a simpler way to phrase it.
The paper also classified security with forward privacy (forward secure) and backward privacy (backward secure). Forward privacy is the hiding of the fact that new data has been added to the server, at least until a query reveals its existence. Backward privacy is about queries that won't leak any information on deleted data.
And that is what I understand and feel are the most important parts of the paper. The rest are also important, like how popular each of the types are in each of the domains, so do check the paper out if you would like to know more. Thanks for reading!
Comments
Post a Comment