On-the-fly Document Search

Michael Kramer
Course Hero Engineering
4 min read · Jun 17, 2022


Creating a microservice to index large amounts of content on-the-fly and return accurate results

To say Course Hero has a large library of content would be an understatement. Course Hero hosts hundreds of millions of documents, questions, and other study materials, and users find this content in a variety of ways, including through search engines like Google.

The page you land on shows a preview of the document you are interested in. If the phrase or subject you searched for is not part of that preview, you might bounce, never knowing that it could still appear later in the document. Or you might subscribe to get the document, assuming exactly what you want is waiting just behind the preview, only to be annoyed to find that the full document was not what you were looking for.

[Image: full document search feature]

We are testing an in-document search bar that allows the user to search the entire document, not just the text in the preview, building confidence that the document contains relevant information before a user subscribes for full access.

Our use case is simple but somewhat unusual: allow searching the entire contents of a single document quickly, in a typeahead fashion.

We need to find all fragments that match the user’s query for the given document, and since we are doing typeahead, we don’t want to remove stop words or punctuation. We do, however, want to help with spelling mistakes, and support synonyms.

[Image: typeahead functionality for document search]

We evaluated solutions such as Elasticsearch, AWS OpenSearch, and Algolia. These are all geared toward searching across millions of documents, not really searching within an individual document. They also require pre-indexing our entire content library, a cost we don't want to pay up front, especially without knowing whether our users would find enough value to justify it.

We need a service that can take in a user’s query for a given document, load the index for that document (or index the document on-the-fly) and return results, all within a second or two.

We ended up using an open source library called Bleve to handle the full-text search. Bleve has a lot of out-of-the-box features that make our job much easier, such as handling partial matches, synonyms, and stop words. It's also customizable enough to let us write our own tokenizers, character filters, and fragmenters to make sure we get all the results we care about. Most importantly, it allows us to create an index for a single document and immediately search that index.
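To make that concrete, here is a minimal sketch of the core idea: a custom analyzer that keeps stop words (there is simply no stop-word filter in the chain), an in-memory index covering one uploaded document, and a fuzzy match query that tolerates small typos. The field name, fragment granularity, and fuzziness setting are illustrative assumptions, not our production configuration:

```go
package main

import (
	"fmt"
	"log"

	"github.com/blevesearch/bleve/v2"
	"github.com/blevesearch/bleve/v2/analysis/analyzer/custom"
	"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
	"github.com/blevesearch/bleve/v2/mapping"
)

// newMapping builds an analyzer that lowercases tokens but applies no
// stop-word filter, so a query like "to be or not to be" still matches.
func newMapping() *mapping.IndexMappingImpl {
	m := bleve.NewIndexMapping()
	err := m.AddCustomAnalyzer("docsearch", map[string]interface{}{
		"type":          custom.Name,
		"tokenizer":     unicode.Name,
		"token_filters": []string{lowercase.Name},
	})
	if err != nil {
		log.Fatal(err)
	}
	m.DefaultAnalyzer = "docsearch"
	return m
}

func main() {
	// An in-memory index scoped to one uploaded document; each Bleve
	// "document" is a fragment (here, a sentence) of that upload.
	index, err := bleve.NewMemOnly(newMapping())
	if err != nil {
		log.Fatal(err)
	}
	fragments := []string{
		"Photosynthesis converts light energy into chemical energy.",
		"The light-dependent reactions occur in the thylakoid membranes.",
	}
	for i, text := range fragments {
		if err := index.Index(fmt.Sprintf("fragment-%d", i), map[string]string{"text": text}); err != nil {
			log.Fatal(err)
		}
	}

	// Fuzziness tolerates small typos: "thylakoyd" still matches "thylakoid".
	query := bleve.NewMatchQuery("thylakoyd membranes")
	query.SetFuzziness(1)
	req := bleve.NewSearchRequest(query)
	req.Highlight = bleve.NewHighlight() // return highlighted fragments

	res, err := index.Search(req)
	if err != nil {
		log.Fatal(err)
	}
	for _, hit := range res.Hits {
		fmt.Println(hit.ID, hit.Fragments["text"])
	}
}
```

The same pattern extends to custom character filters and fragmenters; the key property is that the index is cheap to build because it only ever covers one document.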

During our initial proof of concept we determined that we would be able to download, index, and return relevant results within 1–2 seconds (p95) for ~90% of our document library. This means we only need to pre-index the largest 10% of our documents, saving a lot of time and money.

The service architecture is as follows:

[Image: on-the-fly document search service architecture]

We have AWS Simple Queue Service (SQS) backend workers that are notified any time a document is uploaded. These workers determine whether the document is large enough that it should be pre-indexed, and if so, begin the process.
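In sketch form, that worker loop might look like the following. The message schema, queue URL, and 5 MiB threshold are made up for illustration, and preIndex stands in for the download/index/upload pipeline described below:

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sqs"
)

// uploadEvent is an assumed message shape; the real event schema may differ.
type uploadEvent struct {
	DocumentID string `json:"document_id"`
	SizeBytes  int64  `json:"size_bytes"`
}

// preIndexThreshold is an assumed cutoff: documents above it are indexed
// ahead of time; everything else is indexed on the fly.
const preIndexThreshold = 5 << 20 // 5 MiB

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := sqs.NewFromConfig(cfg)
	queueURL := "https://sqs.us-east-1.amazonaws.com/123456789012/doc-uploads" // placeholder

	for {
		out, err := client.ReceiveMessage(ctx, &sqs.ReceiveMessageInput{
			QueueUrl:            aws.String(queueURL),
			MaxNumberOfMessages: 10,
			WaitTimeSeconds:     20, // long polling
		})
		if err != nil {
			log.Printf("receive: %v", err)
			continue
		}
		for _, msg := range out.Messages {
			if msg.Body == nil {
				continue
			}
			var ev uploadEvent
			if err := json.Unmarshal([]byte(*msg.Body), &ev); err != nil {
				continue
			}
			if ev.SizeBytes > preIndexThreshold {
				preIndex(ctx, ev.DocumentID) // build the index now and store it in S3
			}
			client.DeleteMessage(ctx, &sqs.DeleteMessageInput{
				QueueUrl:      aws.String(queueURL),
				ReceiptHandle: msg.ReceiptHandle,
			})
		}
	}
}

// preIndex is a stand-in for the real pipeline: download the document,
// index it with Bleve, and upload the resulting index to S3.
func preIndex(ctx context.Context, documentID string) {}
```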

When a request comes in for a given document, our documentsearch REST/gRPC service checks if that index is loaded into memory already. If so, it serves the results immediately. If not, it then checks if the index exists in S3, and if so, it downloads and loads the index, again returning the results.

If this is a new document that has never been indexed, then the service will download the original document, load it into the index, and serve the results, while at the same time, uploading that index to S3 so that it can be used in the future.
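Put together, the lookup logic is roughly the following. The cache interface and storage helpers are illustrative stand-ins, and locking is left out here because it's covered next:

```go
package docsearch

import (
	"context"
	"errors"

	"github.com/blevesearch/bleve/v2"
)

// indexCache and server are illustrative stand-ins, not our actual types.
type indexCache interface {
	Get(docID string) (bleve.Index, bool)
	Add(docID string, idx bleve.Index)
}

type server struct {
	cache indexCache
}

// getIndex resolves a document's index through the three tiers:
// in-process memory, a prebuilt index in S3, then on-the-fly indexing.
func (s *server) getIndex(ctx context.Context, docID string) (bleve.Index, error) {
	// 1. Already in memory? Serve results immediately.
	if idx, ok := s.cache.Get(docID); ok {
		return idx, nil
	}

	// 2. Prebuilt index in S3? Download it to a local path and open it.
	if path, err := s.downloadIndexFromS3(ctx, docID); err == nil {
		idx, err := bleve.Open(path)
		if err != nil {
			return nil, err
		}
		s.cache.Add(docID, idx)
		return idx, nil
	}

	// 3. Never indexed: fetch the original document and index it now.
	doc, err := s.downloadDocument(ctx, docID)
	if err != nil {
		return nil, err
	}
	idx, err := buildIndex(doc)
	if err != nil {
		return nil, err
	}
	s.cache.Add(docID, idx)

	// Upload the fresh index to S3 in the background so the next cold
	// request for this document can skip step 3.
	go s.uploadIndexToS3(docID, idx)

	return idx, nil
}

// Storage stubs: the real service talks to S3 here.
func (s *server) downloadIndexFromS3(ctx context.Context, docID string) (string, error) {
	return "", errors.New("not implemented")
}
func (s *server) downloadDocument(ctx context.Context, docID string) ([]byte, error) {
	return nil, errors.New("not implemented")
}
func (s *server) uploadIndexToS3(docID string, idx bleve.Index) {}

func buildIndex(doc []byte) (bleve.Index, error) {
	return nil, errors.New("not implemented")
}
```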

If the index is not loaded into memory, the service obtains a lock, which blocks additional requests for that same index. Once the index is loaded, the lock is released and all blocked requests are allowed through. Indexes are stored in an LRU cache, where we only keep a certain number of indexes in memory, evicting ones that are no longer being actively searched.
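One way to get exactly that behavior in Go is singleflight for the per-index lock plus an LRU cache with an eviction hook. This is a sketch of the pattern, not our exact implementation:

```go
package docsearch

import (
	"github.com/blevesearch/bleve/v2"
	lru "github.com/hashicorp/golang-lru/v2"
	"golang.org/x/sync/singleflight"
)

// loader collapses concurrent loads of the same index and keeps a
// bounded number of open indexes in memory. singleflight and the LRU
// size are assumed choices; the post only requires that duplicate
// loads block and that idle indexes get evicted.
type loader struct {
	group singleflight.Group
	cache *lru.Cache[string, bleve.Index]
}

func newLoader(maxOpen int) (*loader, error) {
	// Close evicted indexes so we don't leak file handles.
	cache, err := lru.NewWithEvict[string, bleve.Index](maxOpen,
		func(docID string, idx bleve.Index) { idx.Close() })
	if err != nil {
		return nil, err
	}
	return &loader{cache: cache}, nil
}

// get returns the cached index, or loads it exactly once while all
// other requests for the same docID block on the singleflight call.
func (l *loader) get(docID string, load func() (bleve.Index, error)) (bleve.Index, error) {
	if idx, ok := l.cache.Get(docID); ok {
		return idx, nil
	}
	v, err, _ := l.group.Do(docID, func() (interface{}, error) {
		idx, err := load()
		if err != nil {
			return nil, err
		}
		l.cache.Add(docID, idx)
		return idx, nil
	})
	if err != nil {
		return nil, err
	}
	return v.(bleve.Index), nil
}
```

singleflight guarantees the expensive load runs once per key while concurrent callers block and share the result, which is precisely the blocking behavior described above.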

Finally, we run our service inside Kubernetes using Istio as our service mesh. Istio provides built-in support for stickiness, letting us direct requests for the same document to the same instance of our service. With typeahead search, we expect rapid queries for the same document to arrive every few seconds as the user types. The initial request might take 1–2 seconds, but all subsequent requests should be served in a few milliseconds, because they hit an instance that already has the index in memory.
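For illustration, this kind of stickiness can be expressed as a consistent-hash load balancing policy in an Istio DestinationRule; the header and host names here are hypothetical:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: documentsearch-sticky
spec:
  host: documentsearch.default.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      consistentHash:
        # Route requests carrying the same document id header to the
        # same pod, so repeat typeahead queries hit a warm index.
        httpHeaderName: x-document-id
```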

With this architecture, we are able to serve search results with on-the-fly indexing for 90% of our library and pre-index anything that would be too large, providing a fast search experience for our users.

If solving complex technical challenges like this interests you, we’re hiring!
