Search Engine Algorithms and Open Source Technologies

2020 · Posted by Sebastien Lemieux-Codere

Contents

  1. Search Signals
  2. Indexing
  3. Matching
  4. Ranking
  5. Evaluation and Monitoring

Search engines aim to generate search results based on a search context that satisfy a user’s information needs while also balancing how the search results impact the organization’s needs. The search context often but not necessarily includes a search query and can include other things such as the user’s location, language, search history and device type.

For the purpose of this article, the things being searched are called documents. These documents could be web pages, products, social media post, videos, news articles, etc

Search engines often generate search results in 3 steps: an indexing step, a matching step and a ranking step. Although not every search engine is designed using this 3 step search process, this article is organized in alignment with those 3 steps. It describes multiple different types of algorithms and corresponding open source technologies that can be used to implement them.

Across many of these steps, search signals are often used as indicators of how different documents align with the needs and objectives of users and of the organization. For example, relevance signals can be used as an indicator of how relevant a document is for a given search query based on a measure of how similar the search query’s text is to a document’s text. Moreover, other signals like a document’s popularity or recency can sometimes be used as indicators of a document’s quality

In the indexing phase, the documents are loaded from the original data source into the search engine. This is an ETL process that involves extracting the document information from the data source, applying a variety of transformations and enrichments to it to make it searchable and finally storing it into a database that allows for efficient searching.

The matching phase narrows down the set of documents to a more manageable number which are then passed on to the second ranking step. For example, if the search context is the search query “apple”, then the matching phase might retrieve every document that contains the word “apple”. It could also involve a nearest neighbor or approximate nearest neighbor search using document and query embeddings. The matching phase can also filter documents based on various criteria such as price and category. When the total number of documents to be searched is manageable, the matching step can be skipped and all documents are then passed on to the ranking step.

In the ranking phase, the results returned in the matching phase are ordered based on various scoring signals as well as other factors. The scoring signals can include various things like how well a document matches a search query and the popularity, recency, average rating and click through rate of the document. It can also involve hard coded rules and algorithms that are designed to generate other desired search results characteristics beyond just high scores for the search results such as novelty, diversity and exploration. Furthermore, it can incorporate some personalization to prioritize documents that are better for the user based on available information on the user (such as search history, current location, etc).

Search engines also often have evaluation and monitoring systems that collect and analyze data on how users interact with search results. This data can be used to measure the effectiveness of the search engine, to detect operational issues with the systems (such as abnormally high latency or abnormally low click through rate) and to train machine learning models to generate better results.

This article focuses on algorithms and open source technologies for search signals, search engine indexing, search engine matching, search engine ranking and search engine evaluation/monitoring. It is important to note that search engines also often have many other important features not covered in this article such as faceted search, match highlighting, featured snippets and query autocomplete and query suggestions.



Search Signals

Search signals that can be used in both the matching phase and the ranking phase of search engines. For example, in the matching phase, the documents most relevant to a search query based on a relevance signal can be retrieved as matches and passed on to the ranking phase for further processing. In the ranking phase, the relevance signal can be combined with a signal corresponding to the popularity of the results in order to prioritize popular results. Incorporating a popularity signal can be useful in some scenarios because popular results may be more likely to be what the user is looking for and may also be of higher quality. Search engines often combine many different signals for each document.

The rest of this section will focus on 3 different types of signals: relevance signals, quality related signals and personalization signals.


Relevance Scoring Signals

Relevance scoring signals are measures of how documents or parts of documents are related to a search query and context. A single search query can incorporate multiple different relevance signals, which might correspond to different ways to measure document relevance. Some relevance signals may only look at specific parts of the document such as the title or description. There can also be multiple signals for a single part of the document, representing different ways of measuring relevance. This section will cover 3 types of relevance signal: bag of words models, embedding based models and scoring model based approaches.

Bag of Words Approaches

In bag of words approaches, search queries and document fields are represented as the set of tokens they contain. Given the tokens for a search query and documents, various mathematical models can be used to generate relevance signals about how relevant the documents are to the query.

To generate these token representations from textual content, queries and documents are tokenized into tokens which represent the content features of the document. For example, the sentence “The apple is red!” could be tokenized into a list of 4 tokens: [“the”, “apple”, “is”, “red”]. The transformation of text and other types of data into tokens is called text analysis. Text analysis is a broad topic involving many methods to achieve desirable tokenization results for a given application. For example, in some applications, it could make sense to map all the words “apple”, “apples”, “Apple” and “APPLE” to the token “apple” since all of these words represent the same concept (although sometimes in different quantities). Text analysis can be adjusted in many ways including the transformations done in relation to quantities, locations, synonyms, acronyms, item taxonomies, different languages and special characters. Techniques like stemming, lemmatizing and synonym expansion can also be used to generate tokens from text. Tokens can also be generated from non-textual content such as images or sounds using deep learning methods. To learn about text analysis and some related open source technologies in more depth, I would recommend reading the book Relevant Search.

Popular bag of words models used to generate vector that can be compared to produce relevance signals include:

TF-IDF (Term frequency–Inverse Document Frequency) Vector Space Model
TF-IDF vectors can be generated to represent the contents of documents and queries based which words they contain and in what quantity. Then, the TD-IDF vector for a query can be compared to the TF-IDF vectors for documents to generate a relevance signals. The dimensions of a TF-IDF vector correspond to a measure of how important a specific token is to the meaning of the document or query. That measure is the product of 2 statistics related to term frequency \$ tf(t, d)\$ and inverse document frequency \$ idf(t, D) \$. $$ tfidf(t, d, D) = tf(t, d) \times idf(t, D) $$ There are many different ways of defining \$ tf(t, d)\$ and \$ idf(t, D) \$. A common way of doing it is to set \$ tf(t, d)\$ to the number of times term \$t\$ appreared in document \$d\$ and to set \$ idf(t, D) = log \frac{N}{ 1 + f(d) } \$ where \$ N \$ is the total number of documents and \$ f(d) \$ is the number of documents where the term t appears at least once. TF-IDf vectors are often compared using cosine similarity to generate relevance signals: $$Cosine\ Similarity(\vec{v_1}, \vec{v_2}) = \frac{ \vec{v_1} \cdot \vec{v_2} } { || \vec{v_1} || \times || \vec{v_2} || }$$
Okapi BM25
Okapi BM25 Score $$ Score = \sum_i^{n} IDF(q_i) \frac{ f(q_i, d) \times ( k1 + 1) }{ f(q_i, d) + k1 \times \big( 1 - b + b \times \frac{FieldLength}{AverageFieldLength} \big) } $$ where the search query has \$n\$ terms \$q_1, q_2, ... q_n \ \ \$, the \$ f(q_i, d) \$ term is the number of times the token \$ q_i \$ appeared in document \$ d \$, the configurable constant \$k1\$ controls the non-linear token frequency saturation effect, the configurable constant \$b\$ controls the field length normalization effect. For BM25, the \$ IDF(q_i) \$ is commonly defined as: $$ IDF(q_i) = ln \bigg( 1+ \frac{N - f(q_i) + 0.5 }{ f(q_i) + 0.5 } \bigg) $$ where \$ N \$ is the number of document and \$ f(q_i) \$ is the number of documents where the term \$ q_i \$ appears at least once.

Embedding Models

Another way to generate relevance scores is to compare an embedding generated from the query with embeddings generated from the documents to be searched. In this approach, important aspects of documents and queries are represented as dense vectors called embeddings. For a given query, documents that have an embedding similar to the query’s embedding based on a similarity measure are considered to be relevant. Recent advances in deep learning have led to dramatic improvements in the quality and versatility of embedding models. The dimensions of the embedding vectors can vary but just to give some context, it is often between 10 and 2000. Embeddings can be generated for all sorts of things including words, sentences, text documents, images, sounds and user preferences.


Embedding models for individual words include:


Embedding models for sentences include:


Popular methods to compare embedding vectors to generate relevance signals include: $$Cosine\ Similarity(\vec{v_1}, \vec{v_2}) = \frac{ \vec{v_1} \cdot \vec{v_2} } { || \vec{v_1} || \times || \vec{v_2} || } $$ and $$Euclidian\ Distance(\vec{v_1}, \vec{v_2}) = || \vec{v_1} - \vec{v_2} || $$


Embeddings for Everything: Search in the Neural Network Era:




The main advantages of word and sentence embedding models over the traditional bag of words models include:

  • They can handle synonyms and similar words in a natural way by mapping synonyms (and similar words) to similar embeddings.
  • Sentence embeddings can take into account the ordering and relationships between words.
  • They can power multilingual applications by using embeddings that are aligned across multiple languages. For example, it is possible to train embeddings such that words that have the same meaning in different languages have very similar embeddings.
  • For certain applications, it can be more natural to represent content as embeddings rather than as tokens.

Disadvantages of embedding models over bag of words models include:

  • The open source infrastructure technologies that can support the use of embedding models in a high availability and high scalability setting are not as mature as the technologies for traditional bag of words models.
  • When using embedding models in the matching phase, approximate nearest neighbor search search methods sometimes have to be used for scalability reasons. In contrast, (exact) inverted indices can often be used in the matching phase with bag of words models.

Deep Learning for Search (book) provides more information about using embeddings for search (including images, multilingual, etc.).


Model Based Scoring

In model based approaches, a machine learning model is trained to output a relevance signal by taking as input features derived for the search query, search context and a document information (the model could also incorporate other data as well). The input features can include all sorts of things including raw text data from the document/query, the outputs of vector space and embedding based models or other types of signals such as quality and personalization signals. The types of models that could be used include deep learning models, linear regression, logistic regression and regression trees. A major disadvantage with model based approaches is that they are often less scalable than bag of words and embedding based approaches. This is why they are often only used in the ranking phase on the narrowed down list of documents returned in the matching phase. In contrast, vector space models and embedding based models are more scalable and can often be used in the matching phase as well as the ranking phase. Furthermore, model based approaches require labelled training data whereas vector space models and embedding based models can typically be fitted on unlabelled training data.


Incorporating Other Natural Language Processing Algorithms

Natural processing methods like Text Classification, Named Entity Recognition and Intent Classification can be used to improve relevance signals further. For example, intent classification could be used to detect in the search query that the user is looking for a business in a specific location. Then, Named Entity Recognition could be used to extract that location (for example, new york city) and a relevance signal could then be incorporated to highly prioritize businesses in or near that location. The way natural language processing algorithms can be incorporated to create better relevance signals can be highly situation dependent.


Quality Related Scoring Signals

Document quality signals can be used to boost the rankings of high quality documents to give the search engine a tendency to return higher quality top results. Examples of quality related signals include:

  • Popularity: in some applications, it might make sense to boost popular items using popularity signals. Reasons why it might make sense to do this include that high quality causes items to become popular and that popular items are more likely to be what the user is looking for.
  • Recency: in some applications, newer content tends to be better. An example of an application where this might be the case is for news data, where users might prefer recent news compared to old news. Similarly, recently published or udpated technology content is more likely to be up to date and/or cutting edge.
  • Ratings: It might make sense to boost higher rated items.
  • User interaction metrics: for example click through rate, purchase rate, time on page and bounce rate could be used as indicators of quality.
  • Number of links or references to an item: this can be used as an indicator of popularity and/or quality. Google’s search engine famously used the PageRank Algorithm to generate a web page quality signal based on links between web pages.
  • Business Objectives: it can sometimes make sense to incorporate signals related business objectives. For example, in an e-commerce website, it could make sense to boost results that have higher profit margins or that reflect well on the store's brand.


Personalization Signals

For some applications, personalizing search engine results using information about the user can vastly improve relevance. This is called personalized search. Some examples of personalization signals include:

  • User information such as the user’s location, age or gender. For example, in a restaurant search engine, a user’s location can be used to prioritize restaurants that are near the user’s location.
  • User history such as a user’s past purchases or browsing history. For example, an e-commerce website could use collaborative filtering to recommend products a user is more likely to buy based on that user’s browsing and purchase history using a model trained from historical browsing and purchase history from all users.

To learn more about search personalization algorithms, it’s useful to look into the topic of recommendation systems. This article is about search engines so it won’t go into much depth on this other than to briefly mention 4 ways to generate personalization signals, which correspond to the 4 main types of recommendation systems:



Indexing

The indexing step is an ETL process (Extract, Transform, Load). First, documents are retrieved from their original source. This original data source could be a database, files, an API, etc. Then, they are transformed and possibly enriched into a format that is useful for search. This could involve text analysis, generating embeddings from text, images or other data types, applying classification or content tagging machine learning models, language detection, joining with other data sources, etc. Finally, the transformed documents are loaded into data structures that allow for efficient searching.

Architectural and Data Engineering Notes

Search infrstructure is often created as a service and kept separate from the original data source for a variety of reasons, including:

  • Allowing for more flexible experimentation: you can reindex data from the original data source into a new search engine index without worrying about corrupting the original data source or affecting the search index currently being used in production.
  • Many search technologies and databases are not appropriate for transactional workloads since they are not ACID compliant (they don't have all the Atomicity, Consistency, Isolation and Durability properties). Some search technologies are not able to perform document updates efficiently or at all.

Having the search engine separated from the original data source does have some disadvantages, including:

  • More complex architecture.
  • Limitations stemming from the CAP and PACELC theorems. This is why some search engines services return the document IDs of the desired search results, which are then used to get up to date information from the original data source. Thus, the original data source is considered to be the “source of truth” from which the search index and often also information displayed in the search results is fetched. With this design, the search engine determines which documents are in the search results but it is the original data source that is used to get the document information that will be displayed in the search results.
  • The information in the search engine may not be up to date.

If the original source of data is evolving and the search engine is kept separate from the original source of data, there is a need to keep the data synchronized between the 2 systems. The ways in which this can be done is highly dependent on the nature of the source data and the requirements of the search engine, including:

  • Updating document information in the search index. This can be done synchronously as changes are made to the source data or asynchronously using update queues. It’s important to note that some search technologies are not able to perform document updates efficiently or at all which may prevent this approach from being used.
  • Periodically creating a new version of the search index and reindexing all of the up to date documents from the original data source into it, and then redirecting the search engine to the new version of the index.


Matching

There are multiple different methods that can be used for the matching step. This article discusses 3 of them: inverted index, embedding nearest neighbor search and filtering. Some systems can have multiple matching engines to generate candidate search results based on different aspects. With this approach, the union of all of the matches from the different matching systems is then forwarded to the ranking engine to generate the final search results.

Inverted Index

Inverted Index: document fields are token sized into sets of 1 or more tokens which represent the content features of the document. For example, the sentence “The apple is red.” could be tokenized into a list of 4 tokens: [“the”, “apple”, “is”, “red”]. The tokens for every document can be placed into an inverted index data structure which maps tokens to every document that contains them.

There are popular open source databases that use inverted indexes include ElasticSearch and Apache Solr. Both ElasticSearch and Solr are built on top of the Lucene java library. Furthermore, they also incorporate search relevance signals from vector space models including BM25. ElasticSearch and Solr are very mature projects and have many powerful features beyond inverted index based search, including many ways to incorporate many other types of search signals into the ordering of the search results. Both ElasticSearch and Solr (using SolrCloud) can be configured to operate in a distributed fashion on clusters of computers to provide greater availability and scalability.

In lucene based search engines, the transformation of text and other types of data into tokens is called text analysis. Text analysis is a broad topic involving many methods to achieve desirable tokenization results for a given application. In Lucene based Search Databases, text analysis is done in 3 consecutive stages:

  1. A chain of zero or more character filters add, remove, or change characters,
  2. A single tokenizer receives the resulting stream of characters and breaks it up into individual tokens
  3. Finally, a chain of zero or more token filters receive a token stream and may add, remove, or change tokens.

Text analysis can be adjusted for a given application to better take into account quantities, locations, synonyms, acronyms, taxonomies, different languages and special characters. To learn about this topic in more depth, I would recommend reading the book Relevant Search.

Other types of data can also be analyzed to generate tokens for the inverted index. For example, machine learning models can be used to look at an image and to produce tokens that correspond to the content of the image. Speech to text models can be used to generate text data from speech data which can then be analyzed using standard text analysis methods.

ElasticSearch is often also used as an analytics, log analysis and system monitoring platform. It provides many features for those purposes including anomaly detection algorithms, log ingestion utilities and a program called Kibana for quickly creating dashboards and performing other cluster management tasks.

Embedding Nearest Neighbor Search

Embedding Nearest neighbor search is another algorithm that can be used for matching. Key aspects of documents and queries are represented as vectors called embeddings. For a given query, documents that have an embedding similar to the query’s embedding are considered to be matches. Recent advances in deep learning has led to dramatic improvements in the quality. For text data, the embedding nearest neighbor search approach has the benefit of being able to take into consideration things like the ordering of words in a sentence, synonyms and related concepts. The embedding nearest neighbor search approach also has the benefit of being able to be used on any type of data whose desired aspect can be represented by a suitable embedding, including images, videos, sounds, etc

Performing matching with embedding nearest neighbor search has 2 main challenges: First, representing document and search queries with embedding that have the desired effect when they are compared using a similarity measure. Second, indexing the embeddings in a way that allows for efficiently finding the documents embeddings that are a good match for a given query embedding in terms of the chosen similarity measure. Popular similarity measures for embedding vector similarity include: \$Cosine\ Similarity(\vec{v_1}, \vec{v_2}) = \frac{ \vec{v_1} \cdot \vec{v_2} } { || \vec{v_1} || \times || \vec{v_2} || } \$ and \$Euclidian\ Distance(\vec{v_1}, \vec{v_2}) = || \vec{v_1} - \vec{v_2} || \$ .

The first challenge was discussed briefly in the Embedding Based Relevance Signal portion of this article.

For the second challenge, there are both exact and approximate data structures that can be used. Exact nearest neighbor search data structures are guaranteed to find the nearest embedding to a query. In contrast, approximate nearest neighbor search data structures trade recall for speed. Instead of finding the exact K closest document embeddings, they find document embeddings that are typically close but not necessarily the closest. These data structures can often have a configurable speed-recall tradeoff that allows speed to be sacrificed for greater recall. Whether or not approximate nearest neighbor search is appropriate depends on the specific requirements of the search engine organization. Popular nearest neighbor and approximate nearest neighbor open source libraries include:

  • FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It allows for L2 and/or dot product vector search.
  • ANNOY (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data. It allows for Euclidean distance, Manhattan distance, cosine distance, Hamming distance, or Dot (Inner) Product distance.
  • NMSLIB (Non-Metric Space Library) is a similarity search library for searching in generic and non-metric spaces. It provides generic and approximate search methods for both metric and non-metric spaces.
  • Pysparnn is a library for Approximate Nearest Neighbor Search for Sparse Data in python.
  • FLANN is a library for performing fast approximate nearest neighbor searches in high dimensional spaces.
  • Scikit-Learn’s Neighbor Module also provides some nearest neighbor search functionality using BallTree, KDTree, and brute-force algorithms.

In terms of databases, ElasticSearch and Solr provide vector similarity search functionality although they are specialized and optimized for inverted index type searches so they have limited options for nearest neighbor search algorithms and how they can be incorporated in queries. Milvus is a young open source Vector Similarity Search Engine. It is powered by the Faiss, NMSLIB and Annoy libraries and can be deployed in a distributed environment to increase scalability and availability.

Filtering

Another way to narrow down the number match results is to apply filters. These filters could be based on price, category, date range, popularity, etc. Both ElasticSearch and Solr, as well as SQL databases with indexes can be used to implement many types of filtering efficiently.



Ranking

In the ranking phase, the matches returned in the matching phase are ordered based on various scoring signals as well as other factors. Typically, the scoring signals are combined into a single scoring signal that can be used to rank the documents to produce ordered search results. The search signals can include various things like how well a document matches a search query and the popularity, recency, average rating and click through rate of the document. Furthermore, the ranking phase can incorporate some personalization to prioritize documents that are thought to be more likely to be liked by the user based on information collected on the user (such as search history, current location, etc). In some search applications, the ranking phase can also boost certain search results in order to ensure the search results have certain desirable characteristics such as novelty, diversity and exploration. For example, in a clothing search application, if the user searches for dresses, it might be desirable to ensure that there are many different styles of dresses in the top search results (diversity of style) in order to maximize the chance that one of the styles will match the user’s tastes instead of only returning the most popular dresses which could all be of a specific style that the current user happens to not like. Depending on the search application, here are five such search results characteristic that might be worth measuring and possibly using rules and algorithms to adjust as part of the ranking phase:

  • Novelty: Showing some search results that are different from what the user has seen in the past
  • Diversity: Showing items that are less similar to each other in order to maximize the chance that one of them matches the user’s needs. The way similarity is measured between documents is highly dependent on the application. There could also be multiple different measures of diversity taken into account in the ranking phase.
  • Exploration: Returning new documents in order to generate feedback data about how users respond to them. This data can then be used to better score and rank those documents in the future.
  • Serendipity: Generating search results that surprise the user in an unexpected but potentially positive way. This can help the user discover new things they like.
  • Coverage: the percent of items that can make it in the top search results. For example, it might be very unlikely or even impossible for certain items to make it in the top search results. This might be a good thing or bad thing depending on the application.

The way to adjust the ranking phase to induce such desired search results characteristics is highly dependent on the specific application but could involve writing hard coded rules or using greedy and/or clustering algorithms. In some cases, it might not be necessary to do anything specific to ensure suitable search result characteristics since the ranking algorithm might produce suitable search results by default.

The rest of this section will discuss the 2 main ways to combine search signals into a single score that can be used for sorting documents in the ranking phase: formula based and model based (learning to rank).

Formula Based Scoring

In formula based scoring, a mathematical formula is used to transform and combine search signals into a score. For example, a media content search application might combine a relevance signal with a popularity signal (based on the number of views), a recency signals (based on the number hours since published) and a personalization signal (based on browsing history) using the following formula:

$$ Score = RelevanceSignal \times \big( w_1 + w_2 \times PopularitySignal + w_3 \times RecencySignal + w4 \times PersonalizationSignal \big)$$ where \$ w_1 \$, \$ w_2 \$, \$ w_3\$ amd \$ w_4 \$ are weights that determine the relative impact the different signals. The popularity and recency signals could be: $$ PopularitySignal = log_{10}(1 + NumViews)$$ and $$ RecencySignal = exp\bigg[ - \frac{ ln(DecayRate) } { TimeScale } \times max(0, NumHoursSincePublished - Offset ) \bigg] $$ The personalization signal could be the output of a matrix factorization collaborative filtering model.

These scoring formulas are often designed based on intuition about desired search result characteristics and well as trial and error. Weights and constants in formulas can be determined using a combination of trial and error, domain knowledge, A/B testing and by finding optimal weights based on some data and criteria.

Formulas often combine signals by adding and multiplying them together and with weights in a way that makes intuitive sense. Before being combined, signals are often transformed using transformations like logarithms, square roots, normalization, standardization, percentiles, step functions and decay functions (gaussian, linear, exponential, etc.).

The formula based approach to combining search signals has many advantages including:

  • High level of control: formulas can be designed and calibrated manually to enforce certain desired properties
  • Interpretability: developers can work through the formula to figure out why certain scores were produced and can manually make improvements/changes in the search results by manually adjusting the formula.
  • Fast development: a reasonable formula can often be designed and manually fine tuned quickly and without accumulated user interaction datasets or judgment lists.

Even relatively simple scoring formulas can take into account complex aspects of search since their search signal inputs can be the result of complex models. Furthermore, it is possible to customize the formulas based approach even more by using different formulas in different situations based on some criteria (such as the output of an intent classification model and/or named entity recognition model) or even to combine the results yielded by different formulas in the search results (for example by interleaving them). However, to take into account complex relationships between signals, a model based approach to combining them may be required. That being said, even when the model based approach is appropriate, it often still makes sense to implement the formula based approach as a baseline and/or to collect a dataset which can then be used as training data for the model based approach.

Model Based Scoring

In the model based approach, search signals are combined into a score using machine learning models. The types of models that can be used in the model based approach include deep learning models, logistic regression and gradient boosted trees. There are 2 main ways to formulate this process into a machine learning problem:

  • Predictive Formulation: a machine learning model can be trained to use the search signals to predict measurable metrics for each potential result such as click through rate if shown, purchase probability if shown, etc. The predicted metric can then be used as a score (or possibly transformed into a score if necessary). This approach works well with many standard machine learning approaches.
  • Learning to Rank Formulation: in this formulation, machine learning models are trained to produce good document orderings without necessarily having any specific meaning. For example, if document A is a better search result than document B, the score produced by the model for document A should be higher than the score produced for Document B. Unlike in traditional machine learning models, the specific value of the score doesn’t matter. Rather, it is the document ordering produced when sorting using the scores that matters. A few things:


Evaluation and Monitoring

The evaluations and monitoring of search engines and their underlying algorithms can be done using a variety of metrics. These metrics can be used to evaluate, calibrate and compare different search engine algorithms as well as to detect production issues. There are 2 main types of metrics:

  • Offline metrics are metrics that can be computed using stored datasets without observing real users interacting with the search engine. For example, a dataset consisting of search queries and corresponding relevant documents that should be returned could be used to compute the Mean Precision@K metric, which is the mean percentage of the top K documents returned for each query that are relevant. The advantage of offline metrics is that they can be computed conveniently and quickly in an offline setting without requiring real user interactions, that they tend to be statistically robust and easy to interpret. This makes them very useful for quick experimentation and fine tuning of search engine algorithms. The disadvantages of offline metrics include that they do not measure how real users actually interact with the search engine and that it can be very time consuming to create a dataset that can be used to compute certain offline metrics. Other popular offline metrics include Mean Average Precision, Mean Reciprocal Rank and Normalized Discounted Cummulative Gain (NDCG)
  • Online metrics are metrics that are computed using the interactions of real users with the search engine. Examples of online metrics include click through rate (the percentage of time a certain result is selected by a user), bounce rate (the percentage a user selects a result but appears to change their mind later, for example by quickly going back to the search results page to choose another result) and session abandonment rate (percentage of users who leave without clicking on anything). Online metrics can also include actual business objectives such as sales per user in an ecommerce store. Online metrics can be used to compare the effectiveness of different search engine algorithms using methods like A/B testing. They can also be used to evaluate changes in other aspects of the search engine such as the user interface. Furthermore, they can be used to detect issues in production systems. For example, an abnormal increase in search result page latency or increase in zero results rate could be indicators for issues in production systems.

For more information about evaluation and monitoring:



The Author

Sebastien Lemieux-Codere

Sebastien is a Data Scientist and Software Developer.