Explaining Search Engines Like You're 5: A Beginner-Friendly Guide
Technology

2025-02-05
3 min read

I want to start a series of posts about how to build a search engine from scratch. The goal of these posts is to foster discussion and learning; they are by no means the "only" way to implement one. I would love to learn from others who have experience building real-world search engines.

The Basic Setup

A typical search engine aims to help people run fast text searches over huge amounts of data. Taking LinkedIn as an example, we may be interested in posts that talk about the recent Meta layoffs. To achieve this, we leverage a data structure called an "inverted index".

Example: LinkedIn Posts

Assume we have the following LinkedIn posts:

post 1: i love burger
post 2: meta layoff sucks
post 3: my cat is awesome
post 4: stay warm
post 5: I get affected by meta layoff
post 6: pay your cat tax
.....
post 11: meta layoff happens now

Building the Inverted Index

In order to find the list of posts related to the Meta layoffs efficiently, we want to build a term/token-to-post-ID mapping like the following. With this setup, a quick GET("meta layoff") lookup returns post IDs 2, 5, and 11 as the result.

{
  "meta layoff": [post 2, post 5, post 11],
  "cat":         [post 3, post 6],
  "burger":      [post 1],
  "tax":         [post 6]
}
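To make this concrete, here is a minimal single-machine sketch in Python. It indexes individual whitespace-separated tokens rather than whole phrases (one common design choice), so a multi-word query like "meta layoff" is answered by intersecting the per-token post-ID lists:

```python
from collections import defaultdict

# The example posts from above, keyed by post ID.
posts = {
    1: "i love burger",
    2: "meta layoff sucks",
    3: "my cat is awesome",
    4: "stay warm",
    5: "I get affected by meta layoff",
    6: "pay your cat tax",
    11: "meta layoff happens now",
}

def build_inverted_index(posts):
    """Map each lowercase token to the sorted list of post IDs containing it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for token in text.lower().split():
            index[token].add(post_id)
    return {token: sorted(ids) for token, ids in index.items()}

def search(index, query):
    """Return IDs of posts containing every token in the query (AND semantics)."""
    per_token_ids = [set(index.get(token, [])) for token in query.lower().split()]
    if not per_token_ids:
        return []
    return sorted(set.intersection(*per_token_ids))

index = build_inverted_index(posts)
print(search(index, "meta layoff"))  # [2, 5, 11]
print(search(index, "cat"))          # [3, 6]
```

Tokenization here is deliberately naive (lowercase plus whitespace split); a production tokenizer would also handle punctuation, stemming, and so on.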

Building an Online Search Service

With this basic knowledge, let's see how we can build an online service that serves live search traffic. Assuming we only have a limited amount of data, here's what we can do:

  1. Write some sort of offline job (Spark/MapReduce) that builds the mapping over all the posts we have and writes the result to data lake storage (e.g., HDFS/S3)
  2. Leverage a web framework like FastAPI to write a search resource that, given a query, executes dict.get() and returns the list of post IDs, and deploy it to one of the cloud providers
  3. During boot-up, load the data from the data lake into an in-memory dict

To make this scalable, we can run a fleet of instances/pods that each hold the same in-memory data, acting as read replicas.

Voilà, you have your toy search engine!

Note: Obviously, there are many limitations to this approach. For example:

  • What if the post data is too big to fit into one machine's RAM?
  • Search results are not intuitive, since you only get post IDs instead of the actual post content.

I will talk more about these challenges in the next post.

Cheers!

Yijie :)