Explaining Search Engines Like You're 5: A Beginner-Friendly Guide
I want to start a series of posts about how to build a search engine from scratch. The goal of these posts will be foster discussion and learning and by no means the "only" way to implement it. I would love to learn from others who has experience building real world search engine.
The Basic Setup
A typical search engine aims to help people do quick text search among huge amount of data. Given LinkedIn as an example, we maybe interested in posts that talks about recent Meta layoff. In order to achieve this goal, we leverage a data structure called "inverted index".
Example: LinkedIn Posts
Assume we have the following LinkedIn posts:
post 1: i love burger
post 2: meta layoff sucks
post 3: my cat is awesome
post 4: stay warm
post 5: I get affected by meta layoff
post 6: pay your cat tax
.....
post 11: meta layoff happens now
Building the Inverted Index
In order to find list of posts related to meta layoff efficiently, we want to build a term/token to post id mapping like the following. With this setup, we can quickly do a GET("meta layoff")
query that returns the post ids: 2, 5, 11 as the result.
{ "meta layoff": [post 2, post 5, post 11], "cat": [post 3, post 6], "burger": [post 1], "tax": [post 6] }
Building an Online Search Service
With this basic knowledge, let's see how we can build an online service that serves the live search traffic. Assume we only have limited amount of data, here's what we can do:
- Write some sorts of offline job (Spark/Map Reduce) that does the counting of all the posts we have and write the mapping result to a data lake storage (e.g., HDFS/S3)
- Leverage some web framework like FastAPI to write a search resource that given a query, try to execute
dict.get()
and returns the list of post Ids and deploy it to one of the cloud providers - During boot up, we load the data from data lake to in memory dict
To make this scalable, we can start a fleet of instances/pods that has the same in memory data as read replica.
Voilà, you have your toy search engine!
Note: Obviously, there's many limitations of this approach. For example:
- What if the blog post data is too big to fit into one machine's RAM?
- Search results are not intuitive since you only get the post ids instead of actual post content.
I will talk more about these challenges in the next post.
Cheers!
Yijie :)