Searchable Encryption: An Introduction (from someone who's still trying to read and understand the academia behind it)

Hello there. Isaac here.

So I'm currently doing research for my Master's program and am thinking of doing something on searchable encryption. Not sure what exactly I'm planning to do yet, but I'm sure it'll come as I read more.

Since this is my first post on this topic, let me explain what I understand about searchable encryption. Searchable encryption (SE) is a scheme to solve a certain problem. Say we have sensitive data that we want to keep hidden from unauthorized parties. To do that, we can encrypt the data. The problem is that with the data encrypted, we can't really search through the data. This is where SE comes in. SE let's us search through data that has been encrypted.

You might be wondering, if we can use SE to search through the data, doesn't this mean that SE itself isn't encrypted and can leak details on the data? Indeed it can. That is why whatever data that is used to be searched through has to be itself unreadable to unauthorized parties or even everyone including the owner.

Now that you kind of have a picture of what SE is and why it's used, let me give you an example implementation of a simple (but insecure) SE scheme. Note that this scheme is heavily influenced by the first article on SE that I understood (80% of).

Let's set the scenario. There are three actors: Us (the data owner (DO)), a server hosted by a cloud service provider (CSP, e.g. AWS, Google Cloud), and an adversary. The term I usually see in articles when describing CSPs is "honest but curious". This means that a cloud server won't deviate from the protocol that we set, but will try to learn as much as they can from what they can observe. The adversary can listen in on the communication between us and the CSP.

First we encrypt the sensitive data that we want to store in the server on our side. Then we send this encrypted data to the server. Since it's encrypted, neither the CSP nor the adversary can see what the data we sent is. From what I understand, we don't need to think too much about how the data is encrypted when talking about SE. Any secure encryption algorithm should work.

A typical SE scheme involves an index. Think of it like a table telling us where to search for stuff. In this scheme, to build our index, we first analyze the text in the data that we have and extract the keywords. We then build an index with these keywords. Say on the first column is a keyword and the second column a list of identifiers to documents containing that keyword. There's actually two different types of index from what I know, but I'll talk about it in the future.

Now that we have an index, we hash each of the keywords to permanently disfigure them. Now no one can know what the keywords are anymore, not even us. We then send this index to the server so they can store it.

Days pass, and now we want to search for a specific keyword and get data related to it. Note that if we sent a keyword in plaintext, an adversary can read it and know what our document contains. Like if we just sent "puzzle", "price", and "first-person-shooter", our adversary might be able to guess what kind of documents we are storing. So to prevent them from knowing, we can hash our keyword, which the server can just use to check in the table with the already hashed keywords and return us the documents we want. Once we get the documents we can just decrypt them to get the documents. Voila, we get to store them encrypted while simultaneously having the ability to search through them.

If you are experienced on SE schemes and attacks though, you might realize that when I mentioned this scheme is insecure, it really is insecure. Yes, they won't be able to see what the data or keyword is, but they can gain quite a bit of knowledge. Like when we just send a hash of a keyword, while we don't know what the keyword is, we will know if it's the same keyword sent since hashes of the same word are identical. There are other problems like size pattern leakage and access pattern leakage, which I will not get into here because I don't fully understand it yet.

Anyway, I hope you understand what searchable encryption is. That's all from me for this post. Thanks for reading :)

Next post on SE