Boolean vs Keyword/Lexical search vs Semantic — keeping things straight
As a librarian I received a minimal amount of education in information retrieval and most of my knowledge is purely practical.
Over the years, I’ve noticed that this has led to some confusion about what terms like “keyword search”, “lexical search” and “semantic search” mean. This is my attempt to keep them straight.
This is worth clarifying because I expect more and more academic search will be moving away from Boolean or even keyword/lexical search to outright Semantic Search.
I go in depth into Boolean search, non-Boolean lexical/keyword search and particularly semantic search, and explain why in practice it is hard to tell non-Boolean lexical search and semantic search apart.
As always, there are likely to be errors and misunderstandings because I am not an expert in this area.
If you want a piece by an expert that covers very similar ground, see The Basics of AI-Powered (Vector) Search by Cameron Wolfe or this blog series from the Vespa blog.
Boolean search
If there is any single piece of knowledge I expect every librarian to know, it is Boolean search. It is the bread and butter of every librarian: we are champions of searching keywords with Boolean operators like AND, OR and, to a lesser extent, NOT and proximity operators.
From the theoretical Information retrieval viewpoint, this is based on the Boolean model of information retrieval dating back to the 1950s.
By definition, search engines that support strict Boolean searching are all keyword or lexical based searches. But not all keyword/lexical based searches are or support Boolean searches (see later)
It is also interesting to note that technically the classic Boolean retrieval model is a purely binary model. Either a document matches your Boolean criteria or it does not, and all matching documents are considered equally relevant.
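To make that binary nature concrete, here is a minimal sketch (my own toy illustration, not any particular system's implementation) of the classic Boolean model over an inverted index, where AND, OR and NOT are just set operations and every matching document is equally relevant:

```python
# Minimal sketch of the classic Boolean retrieval model over a toy corpus.
# Document IDs and text are made up for illustration.
docs = {
    1: "cats and dogs in singapore",
    2: "dogs in argentina",
    3: "cats in argentina and singapore",
}

# Build an inverted index: term -> set of document IDs containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# Boolean operators are just set operations; results are unranked.
cats_and_dogs = index["cats"] & index["dogs"]            # AND -> {1}
cats_or_dogs = index["cats"] | index["dogs"]             # OR  -> {1, 2, 3}
cats_not_singapore = index["cats"] - index["singapore"]  # NOT -> set()

print(cats_and_dogs, cats_or_dogs, cats_not_singapore)
```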
TF-IDF/BM25 and the vector space model
Of course, today you will be hard pressed to find a search that does not have any relevancy ranking.
In fact, library catalogs in the 2000s often did not have relevancy ranking at all; they tended to default to sorting on fields like accession number or publication date! As web search engines became popular, relevancy ranking became a “must have”, so much so that today it is odd to think of a search without it.
So, a typical search engine today first uses Boolean to find items that match the Boolean search and then uses an additional algo to do relevancy ranking — a popular example is Elasticsearch.
The most popular relevancy ranking used is TF-IDF (or its updated variant BM25).
TF-IDF is intuitive to understand. Imagine you are trying to match a query with two keyword terms, "cat" AND "dog", to the most relevant document.
Term frequency (TF) is how many times the keyword from the query appears in the document. If document 1 mentions "cat" five times but document 2 mentions it only once, you expect document 1 to be ranked higher. You would do the same for "dog".
However, when you think about it not all terms are equally important. For example, in a set of documents about Singapore, probably terms like “Singapore”, “Singaporean” would be in most if not all the documents, so matching those terms aren’t very useful.
At the most extreme, consider "stop words" like "the", "are", "is": these appear in so many documents that they add very little signal and are removed in classic information retrieval.
Imagine that across 10,000 documents in your database, "Singapore" appears in 9,000 documents (90% of them) while "Argentina" appears in only 100 documents (1% of them).
All things being equal, assuming you want to match both "Argentina" and "Singapore", the weight you put on documents that match "Argentina" should be higher than the weight on documents that match "Singapore", because the "signal" from matching "Singapore" is low when most documents in the database already contain it.
In terms of the formula, what you want is inverse document frequency (IDF): the more documents a term appears in, the lower its score. There are a few ways to calculate IDF; in the image below we use a log function.
As the graph above shows, the more documents a word can be found in (out of 1,000 in this case), the lower the score. At 1,000, where every document has the word, the inverse document frequency is 0.
One thing to note is that with TF-IDF, you need a TF-IDF score for each keyword in each document. But how do you compare against the query, which itself has multiple keywords? You can simply sum the document's TF-IDF scores across the query terms,
e.g. for the query Singapore schools, total relevancy score of document A = TF-IDF("Singapore") of document A + TF-IDF("schools") of document A.
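As a toy sketch of that calculation (an illustration only, not how any production engine implements it), using the log-based IDF described above:

```python
import math

# Toy corpus; documents are made up for illustration.
docs = {
    "A": "singapore schools singapore education",
    "B": "argentina schools football",
}

def tf(term, text):
    # Raw term frequency: how often the term appears in the document.
    return text.split().count(term)

def idf(term, corpus):
    # Log-based inverse document frequency: rarer terms score higher.
    n_docs = len(corpus)
    df = sum(1 for text in corpus.values() if term in text.split())
    return math.log(n_docs / df) if df else 0.0

def tfidf_score(query, doc_text, corpus):
    # Relevancy score = sum of TF * IDF over all query terms.
    return sum(tf(t, doc_text) * idf(t, corpus) for t in query.split())

for doc_id, text in docs.items():
    print(doc_id, round(tfidf_score("singapore schools", text, docs), 3))
```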
But is there a way to take into account all the query keywords at the same time?
This is where the Vector space model comes along, which we will cover in the next section.
BM25 is the current standard and an evolution of TF-IDF which corrects for some of its issues. For example, TF-IDF assumes relevancy always increases as term frequency increases; in practice there should be a saturation point beyond which it shouldn't. BM25 also normalizes for document length, which TF-IDF does not (otherwise longer documents are favoured). Lastly, BM25 allows for fine-tuning through parameters (like k1 and b) that control the saturation and length normalization aspects, providing more flexibility in adjusting the algorithm to the specific characteristics of a document collection. [See Video]
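For reference, the commonly used form of the BM25 scoring function is (notation varies slightly between implementations):

```latex
\text{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot
\frac{f(q_i, D)\,(k_1 + 1)}
     {f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```

where f(q_i, D) is the frequency of query term q_i in document D, |D| is the document length, avgdl is the average document length in the collection, and k1 and b are the saturation and length-normalization parameters mentioned above.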
Vector space model and its relationship to TF-IDF
Vector space models were the next type of information retrieval model to follow the Boolean model.
Vector space models represent documents and queries as vectors (long series of numbers) and run a similarity function over them to find the closest documents.
In the toy example below, again imagine there are only two words in your universe, "dog" and "cat".
Imagine Document 1 includes the word "dog" many times and "cat" rarely, while Document 2 is the opposite. When plotted on the diagram, you can see how Document 1 and Document 2 are represented in two-dimensional space as vectors. In this case each is a simple vector made up of two numbers.
Once represented as vectors you can measure how similar two vectors are by looking at the angle between them.
In the above example, is the query more similar to Document 1 or Document 2? This is determined by the angle between the vectors: the smaller the angle, the more similar they are.
In fact, we typically use the cosine of the angle to measure similarity. As the two vectors get closer, the cosine of the angle increases towards one; at the limit, when both vectors overlap, it equals one. As the angle increases, the cosine falls, reaching the minimum similarity of zero when the vectors are 90 degrees apart.
You might be thinking this isn’t very useful, because typically there are more than two or even three terms in documents and then you would need 4D, 5D or more spaces.
The cool thing is that this idea and the math still work when you extend them to n-dimensional space, where n is the number of unique words or terms in the documents of your database.
Below shows the exact formula you use for cosine similarity.
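In case the image does not render, the standard cosine similarity formula for two vectors A and B is:

```latex
\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}
             = \frac{\sum_{i=1}^{n} A_i B_i}
                    {\sqrt{\sum_{i=1}^{n} A_i^2}\;\sqrt{\sum_{i=1}^{n} B_i^2}}
```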
How do we do TF-IDF exactly?
One other way to represent TF-IDF with the vector space model is to look at the data in a table format. Below shows a document-term matrix, where each row represents a document, and each column represents a unique term/keyword.
Each document can then be represented as a long series of numbers (one for each unique term in the corpus) or in other words a vector. This is called the one-hot vector method and is the simplest way to represent words as vectors.
But the cells in the table can be filled up or weighted in multiple ways including
a) binary, e.g. for the column "abuse", if the word appears in the document at least once the cell is set to one, otherwise zero (one-hot vector)
b) by frequency count (also known as bag of words), e.g. if the column is "abuse" and the document includes the word "abuse" twice, the value is two, if thrice, three, etc. This was our initial example above using cats and dogs
c) by TF-IDF, e.g. you calculate a separate TF-IDF for each document and each term, and the cells are populated with TF-IDF weights
If you do c), you can run TF-IDF by
a) Constructing a document-term matrix as above but fill the cells with TF-IDF weights
b) Do the same for the query (but for IDF reuse the earlier IDF weights without adding the effect of query)
c) Calculate cosine similarity function as shown above.
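Putting steps a) to c) together, here is a minimal sketch using scikit-learn (just one of many possible toolkits; the documents and query are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; in a real system this would be your document database.
docs = [
    "open access citation advantage in scholarly publishing",
    "dogs and cats as household pets",
    "citation analysis of open access journals",
]

# a) Build the document-term matrix, with cells filled with TF-IDF weights.
vectorizer = TfidfVectorizer()
doc_term_matrix = vectorizer.fit_transform(docs)

# b) Vectorize the query with the same vocabulary and IDF weights.
query_vec = vectorizer.transform(["open access citation"])

# c) Rank documents by cosine similarity to the query vector.
scores = cosine_similarity(query_vec, doc_term_matrix)[0]
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(round(score, 3), doc)
```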
Technically speaking while BM25 can be seen as an evolution of TF-IDF, it is considered a part of probabilistic information retrieval models and not Vector Space models. Probabilistic information retrieval models rank documents based on the probability that the document is relevant to the query. It is derived from the probabilistic relevance framework, which is based on the Probability Ranking Principle. This principle asserts that documents should be ranked by their probability of relevance to a given search query, with the goal of maximizing the effectiveness of the retrieval process.
BM25, specifically, is an extension of the Binary Independence Model (BIM), which is one of the earliest probabilistic models used in information retrieval. BIM operates on the assumption that the presence or absence of each term in a document is independent of the presence or absence of any other term in the document, given the relevance of the document.
BM25 can be considered within the broader context of the vector space model (VSM) framework, especially when we think about how documents are represented and compared to queries. However, the theoretical underpinning of BM25 itself is rooted in the probabilistic retrieval model and fundamentally differs from the traditional vector space model in its approach to document scoring and ranking. The ranking function of the probabilistic models is grounded in probability theory, while the ranking function of vector space models, cosine similarity, is grounded in vector algebra.
Lexical or keyword search is not necessarily Boolean
While currently most academic search is a blend of Boolean AND TF-IDF/BM25, particularly systems built on Elasticsearch, it isn't strictly necessary to have both.
I think the conflation of lexical/keyword search with Boolean search began when search engines started implementing "implicit AND". Early search engines made it mandatory to use the Boolean operator "AND": the query X AND Y was not the same as the query X Y, the latter being the equivalent of a tighter phrase search. Today, due to the influence of web search engines, the two queries are treated as equivalent, so there isn't a clear distinction around what Boolean search is.
There are in fact two ways I am aware of in which modern academic search engines make the search a little non-Boolean.
For example, it is perfectly possible for a search to just implement TF-IDF without first filtering with a Boolean search, resulting in documents being surfaced even though they do not meet the strict Boolean criteria.
e.g. a document might have only two of the three terms used in the search query, but because those two terms have such high term frequency and high IDF (i.e. low document frequency), it might still rank higher than another document with all three terms searched for!
You might have seen this in some Google searches where you search for 6 terms and get the top few pages only have 5 terms (even after taking into account stemming, keywords in URLs etc). I’ve seen Google engineers in the past refer to this as “soft AND”.
Another possibility is that the search system may have “smart” or “helpful” features that automatically trigger on different conditions which may affect the search and make it less Boolean.
For example, years ago I remember a ruckus in a mailing list for a library discovery service where a librarian found that a search that was (A OR B) AND C resulted in fewer results than (A AND B AND C).
Eventually the answer given for this behavior was that the system had a rule where if the number of results fell below a certain threshold, it would automatically do stemming, and this can of course increase results.
There can be dozens if not hundreds of such additional rules that trigger depending on the situation, occasionally making the search results less explainable, though often the results remain more or less Boolean.
WAND (Weak AND or Weighted AND)
Many search engines also apply something called WAND (Weak AND or Weighted AND) which is designed to help speed up retrieval and ranking.
The rough idea of WAND is that you can often speed up search if you can intelligently select and score documents.
The most straightforward way to do BM25/TF-IDF is to exhaustively evaluate every candidate document against the query terms and calculate a relevancy score for each, one document at a time. This can be efficient enough for queries with a small number of terms or when the document collection is small.
In contrast, the WAND algorithm uses dynamic pruning: it keeps precomputed upper bounds on how much each query term can contribute to a score, which allows it to skip over large sets of documents that cannot possibly make it into the top-ranking results, thereby saving time and computational resources.
Intuitively, imagine the query is APPLE OR ORANGE OR PEAR and you only want to rank the top 10 documents with the highest scores in order. Depending on the scoring function, you may be able to tell that a certain document, given its partial score after evaluating two terms, has zero chance of exceeding the lowest score among the top 10 documents found so far, so you can skip it.
In practice, you need to be able to calculate upper bounds for each term. If these are exact, you get "safe", guaranteed results. But for some terms or functions, e.g. complex query terms such as phrases or proximity pairs, the upper bounds must be estimated, so you may not get guaranteed results.
In the above example, we assumed we only want to find and rank the top 10 (ignoring exact scores) and be guaranteed that no paper is missed and that the order of the top 10 is 100% correct. But there are other goals (see below) where you accept approximate rankings (the top 10 are the true top 10, but the order within them might not be correct) and/or are willing to miss papers (false negatives) that should have been ranked.
See also Magic WAND: Faster Retrieval of Top Hits in Elasticsearch, Efficient query evaluation using a two-level retrieval process and Accelerated OR search using the WAND algorithm
Notice that with Google, or Google Scholar, you can't actually see all of the reported xxk results; usually the results stop after around 1k… This is why!
In practice, you can set the threshold θ = F · m, where m is the minimum score in the current top-k heap and F is a threshold factor. If F > 1, the optimization is "unsafe" and results are not guaranteed.
For “Safe” optimization
The initial threshold is set based on the query type. For an OR query, or for a free-text query, the initial threshold is set to zero. The approximate score of any document that contains at least one of the query terms would exceed this threshold and would thus be returned as a candidate. Once the heap is full and a more realistic threshold is set, we only fully evaluate documents that have enough terms to yield a high score. For an AND query, the initial threshold can be set to the sum of all term upper bounds. Only documents containing all query terms would have a high enough approximate score to be considered candidates.
For “unsafe” optimization
So far we have only described methods in which we are guaranteed to return accurate results for the query (safe evaluation). However, the threshold can additionally be used to expedite the evaluation process by being more opportunistic in selecting candidate documents for full evaluation. In this case, the threshold is set to a value larger than the minimum score in the heap. By increasing the threshold, the algorithm can dynamically prune documents during the approximation step and thus fully evaluate fewer candidates overall, but ones with higher potential. The cost of dynamic pruning is the risk of missing some high-scoring documents, and thus the results are not guaranteed to be accurate.
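Here is a highly simplified sketch of the pruning idea only (real WAND implementations, like the one in the Elasticsearch post above, work over posting lists with pivot selection, which is considerably more involved; all names and numbers here are made up):

```python
import heapq

def wand_like_topk(query_terms, candidate_docs, score_term, upper_bound,
                   k=10, F=1.0):
    """Return the top-k (score, doc_id) pairs, skipping full evaluation of
    documents whose optimistic upper-bound score cannot beat the threshold.

    score_term(term, doc_id) -> actual per-term score (e.g. BM25 contribution)
    upper_bound(term)        -> precomputed maximum possible score of the term
    F                        -> threshold factor; F > 1 is the "unsafe" mode
    """
    heap = []  # min-heap of (score, doc_id) holding the current top-k
    for doc_id, doc_terms in candidate_docs.items():
        m = heap[0][0] if len(heap) == k else 0.0  # lowest score in top-k so far
        theta = F * m
        # Optimistic score: assume every matching term hits its upper bound.
        optimistic = sum(upper_bound(t) for t in query_terms if t in doc_terms)
        if optimistic <= theta:
            continue  # cannot make the top-k, skip the full evaluation
        score = sum(score_term(t, doc_id) for t in query_terms if t in doc_terms)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)
```

With F = 1 and exact upper bounds this skips only documents that truly cannot enter the top-k ("safe"); raising F prunes more aggressively at the risk of missing high-scoring documents ("unsafe").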
Still, at the end of the day, academic search systems, even Google Scholar (but not Google, as we shall see), are mostly predictable (loose or near Boolean) and clearly lexical search systems based on matching keywords (though those keywords might be stemmed etc.).
Semantic search and word representations/embeddings
The story so far is the story of keyword-based search or lexical search.
Depending on the source, some say keyword-based search is the same as lexical search; others say keyword search means exact phrase matching while lexical search involves more advanced techniques like lemmatization and stemming, or vice versa. But regardless of the definition used, it is clear both are distinct from semantic search.
While keyword or lexical search may not always follow strict Boolean matching, it still relies on essentially matching keyword terms, which means missing relevant documents if there is a mismatch between the query and the document even when they are semantically the same (this is sometimes called the "vocabulary mismatch problem"). Of course, we are much cleverer than just matching the exact phrase, and modern lexical or keyword search may do clever things like rule-based query expansion, stemming and lemmatization to try to overcome these limitations, but at the end of the day it is still matching closely on the actual words used in the document.
A true semantic search would go beyond these methods and see connections and concepts the way humans do. Imagine searching for “Star Wars” and it would still retrieve relevant documents that don’t use those words but were about Jedis and lightsabers which it “knows” or “associates” with Star Wars.
Or a search that can look at a search query like
It is not raining
and “understand” that “not raining” should be interpreted together and not toss the “not” away as a stop word.
Semantic search can be done on title and abstract, for example see Litmaps but it probably gets more powerful if you have full-text.
Semantic search is not direct extraction of answers from passages; that is a Q&A task, and while we are seeing this feature increasingly integrated into search, it typically uses retrieval augmented generation techniques combining search with large language models. Semantic search is often a component, used to understand queries and to match and find documents or text segments that might answer the question, but the final answer comes from the large language model's generative capability.
The first embeddings — Word2vec, GloVe, Fasttext
It’s easy to say we want “Semantic Search” which should understand the meaning of the words or terms and not just matching the word or linguistic form. But how?
Let’s go back to the document-term matrix. This is a very large table with many columns, one column for each unique term in the document.
You can reduce the columns used by doing stemming and lemmatization (so multiple word variants are merged to one column) but you will still have many columns and most documents will be rows with many zeros in them because most documents don’t have all the terms. This applies whether you are just using pure term frequencies or TF-IDF weights.
In other words, with the traditional vector space model you end up with a very long vector where most of the values are zero, and you can see why these are often called sparse vectors or sparse embeddings…
Can we do better? What we want is some “magical way” to even further reduce the number of terms/columns used.
For example, imagine a magical solution that automatically merges "cat" and "feline" into one column because it understands that both have the same semantic meaning.
If done sufficiently, you would get a much smaller document-term matrix and what is called a dense vector embedding could be used, because most values in the vector would be non-zero.
There have been earlier approaches to do something like this, such as LSA/LSI, but it was in 2013 with the invention of Word2Vec (followed by others like GloVe and FastText) that semantic search started to look promising.
See this for an accessible but technical discussion of Word2Vec, GloVe
The main trick Word2Vec etc. used was not to use words directly in the document-term matrix.
Instead, they use deep learning to create representations of words, called embeddings, from a large text corpus (e.g. Google News, Wikipedia), and these embeddings are used in place of the raw words.
The reason a much shorter vector suffices is that the machine learning model learns the latent or hidden meaning of words, which allows a large vocabulary to be represented with far fewer numbers, i.e. a shorter vector.
For example, a widely used standard Word2vec embedding trained on Google News has a dimension/length of 300 yet represents 3 million words (the vocabulary size).
This creates a very dense vector embedding of course, because there are only 300 “positions” despite covering 3 million words.
How are embeddings created?
Part of the issue with searching is the problem of synonyms: how do we know that when you search for "cars" you are also searching for "automobiles"? Being able to look for words that are semantically "close" is clearly especially useful.
Rule based expansion or using things like thesaurus can solve some of this, but there are more subtle issues that require a lot of context to solve. For example, issues like negation, NOT X
So, you can see how an automated way to train embeddings or representations for words that automatically encode such relationships can be especially useful.
But how are such embeddings like Word2Vec created? In general, the idea of Word2Vec embeddings is that similar words tend to be used in remarkably similar contexts.
This is known as the distributional hypothesis — that is “to capture meaning” and “to capture contexts” are inherently the same.
Intuitively, if you have a sentence with a missing word and two words X and Y can both be used in place of the missing word, and this holds across other contexts too, then X and Y are related in meaning and should have similar embeddings.
In the example above both “sad” and “unhappy” would fit into the sentence and hence should have similar embeddings and meaning.
This is the idea behind Word2Vec embeddings, where a neural net model is fed a large chunk of text and learns to predict such patterns.
Below shows part of the training data that is used to train using the Continuous Bag of Words Model.
It shows how you can use a chunk of text to train in a self-supervised way, where the aim is to predict the target word (in red above) from the context words (in green). This particular training uses a context window of 2, so it always tries to use the context of 2 words in front of and behind the target word (if there are only 1 or 0 words in front of the target, it uses whatever is available).
Given, say, the context
"number", "of" <target> "words", "to", the neural net is trained to predict the target word "surrounding".
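As a rough illustration of how such (context, target) training pairs are generated (a toy sketch; libraries like gensim handle this internally when training Word2Vec):

```python
def cbow_training_pairs(text, window=2):
    """Yield (context_words, target_word) pairs for CBOW-style training."""
    tokens = text.lower().split()
    for i, target in enumerate(tokens):
        # Take up to `window` words on each side of the target (fewer at the edges).
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield context, target

sentence = "use the number of surrounding words to predict the target"
for context, target in cbow_training_pairs(sentence):
    print(context, "->", target)
```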
Once trained this embedding model will have weights that can be used to convert any text into embeddings.
How do we know the embedding approach really captures semantics?
The Word2vec paper made quite a stir with the now famous image shown above.
You can clearly see some sort of relationship or meaning in the embedding. Mathematically you could do something like
vector(‘king’) - vector(‘man’) + vector(‘woman’) results in a vector close to vector(‘queen’).
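You can try this yourself with an off-the-shelf pretrained model, e.g. via gensim (a sketch only; the pretrained Google News vectors are a large download):

```python
import gensim.downloader as api

# Load pretrained Word2vec vectors trained on Google News (a ~1.6 GB download).
model = api.load("word2vec-google-news-300")

# vector('king') - vector('man') + vector('woman') should be close to vector('queen')
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```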
Sparse vs Dense embeddings
One major difference between embeddings and traditional vectors created using TF-IDF is that embeddings have a fixed length that does not depend on the number of unique words in the collection. This is because each number in the vector doesn't represent a literal word but is a representation of some concept or meaning.
This results in dense embedding vectors, where most of the vector is non-zero, as opposed to methods like TF-IDF which result in sparse vectors, where most of the vector is zero.
That said, dense embedding vectors are almost impossible to interpret, since machine learning is used to automatically map latent or hidden meaning to each of the numbers in the vector (the analogue of columns in the document-term matrix). Unlike sparse vectors/embeddings, the columns are not interpretable! For example, you cannot tell what a 1 in the 20th position/column actually means.
So, one way around this is a two-stage search-and-rerank system, where a sparse retriever first retrieves a large set of results which is then reranked using a dense reranker. This ensures the set of documents retrieved is explainable, even if their ordering is a black box.
Also, of course, if you export all the relevant results for screening, it doesn't really matter how they are relevance ranked!
For systematic review librarians reading this, PubMed actually does a BM25 first stage search followed by a rerank using a Learning to Rank technique LambdaMART on top 500 results.
Important note about Semantic search vs lexical search
It is not necessarily true that lexical methods or sparse embeddings are worse, as you will see later. State-of-the-art approaches for sparse embeddings like SPLADE try to have the best of both worlds: sparse embeddings enhanced by learning how to add new terms (expansion) and/or remove existing terms (compression) automatically using advanced language models like BERT. See also this explanation
The most significant advantage of SPLADE is not necessarily that it can do term expansion but instead that it can learn term expansions. Traditional methods required rule-based term expansion which is time-consuming and fundamentally limited. Whereas SPLADE can use the best language models to learn term expansions and even tweak them based on the sentence context. See https://medium.com/@zz1409/sparse-representations-of-text-for-search-f231301eacf
State of art — contextual embeddings — BERT and more
The main issue with older methods was that they couldn't capture meaning (semantics) automatically, and Word2vec and similar embeddings were a first step towards that ideal. But they still didn't fully solve the issue, as most of these methods, whether Word2vec or GloVe, did not take into account word order in documents or queries.
Take the query (this example was taken from Google's announcement of their use of BERT in 2019):
2019 brazil traveler to usa need visa
The system needs to consider the order of the words and understand that the query is about someone from Brazil going to the USA, not vice versa. (Technically speaking, Google is probably using Sentence-BERT or sBERT rather than vanilla BERT; see this great article or this worked example on how to use sentence embeddings.) To solve such issues, every word and its order count, and the latest embeddings (from 2018 onwards) further improve on Word2Vec-type embeddings through improvements like self-attention mechanisms and positional embeddings.
Another issue was the problem of polysemy (one word can have multiple meanings) and homonymy (sound or spell the same but different meaning).
For example, in Word2Vec the embedding “bank” is a weighted embedding that represents its use in two contexts — in the sense of “river bank” and in the sense of “financial bank”.
Obviously, using the overall embedding for “bank” isn’t ideal since depending on the context it is used, they are different.
This is where transformer-based language models (the transformer architecture was invented in 2017) were found to produce even better embeddings, sometimes known as contextual embeddings, thanks to improvements like self-attention mechanisms and positional embeddings that allow such systems to take into account the position/sequence of words.
I won't try to explain how self-attention works beyond noting that the neural net is trained to take into account which other words the target word (say "bank") should "pay attention" to, and this is what allows the system to distinguish different contexts of use.
For example, in the sentence
“I went to the bank to make a deposit”
Once trained, for the word "bank" it will know to "pay attention" to the word "deposit", which shifts the embedding for "bank" slightly away from the initial static embedding towards the financial sense, while in the sentence
“I went to the opposite bank and started to swim”
It will do the same for the word "swim" instead, and this difference is what makes it able to distinguish between the two uses of "bank". In other words, such embeddings take into account the context in which the word is used.
How does it know what to pay attention to? Partly this is based on similarity e.g. in the sentence “bank” and “swim” are semantically close.
The most famous example of these new embeddings, introduced in 2018, was Bidirectional Encoder Representations from Transformers (BERT), which was quickly adopted everywhere, leading to huge gains in the state of the art.
BERT models are built on the transformer architecture just like the better-known GPT models. The main difference is that BERT models are encoder-only models, while GPT models are decoder-only…
Further improvements on contextual embeddings
But even BERT models are just the beginning.
Today besides BERT family of models, there are many other advanced embedding models that have been further fine-tuned to do well for semantic search and reranking including
- Variants based on BERT / SBERT — MiniLM, mpnet-base family (light weight high performance)
- Variants using Google’s T5 as a base
- GPT based e.g. OpenAI’s text-embedding-ada family, Cohere embedding etc
- Domain-specific models trained on scholarly publications, e.g. BERT models fine-tuned with papers such as SciBERT or FinBERT, or even vendor-specific ones like Dimensions General Science-BERT. There are even models pretrained with citation signals, e.g. SPECTER by the Allen Institute for AI and SPECTER2
- And more
Here's a brief overview of further improvements
These days you are not even limited to using embeddings from encoder-only models like BERT; providers of popular GPT-type decoder-only models, like OpenAI and Cohere, also offer embeddings. For quick and easy use, OpenAI's text-embedding-ada-002 is extremely popular! These embeddings are generated from even larger text sources than BERT models and have more dimensions (costly!), but it's unclear how they match up against domain-specific BERT models.
Firstly, how good embeddings are (even ones from state-of-the-art large language models) depends highly on the data they were trained on. For example, if mostly journal articles were used to train the embedding model, it may not work well when used to search for similar articles in the domain of news.
In fact, most standard BERT base models were/are trained largely on Wikipedia text, which may not be so suitable for academic tasks like search or ranking, so it is possible to create embeddings from scratch or fine-tune existing embedding models with data from academic articles or even specific domains like chemistry (ChemBERT), finance (FinBERT), or science generally (SciBERT). SPECTER by the Allen Institute for AI even considers citations!
In-domain training to further push the limits
BERT models have a pretraining phase where they are trained using a masked language model (MLM) objective, in which words in the text are [mask]ed (somewhat like CBOW from Word2vec but not exactly the same), together with a next sentence prediction task.
There can be other tasks, e.g. ELECTRA [Clark et al., 2020b] can be described as a BERT variant that attempts to improve pretraining by substituting the masked language model pretraining task with a replaced token detection task, in which the model predicts whether a given token has been replaced with a token produced by a separate generator model.
Unlike the GPT models, BERT models are encoder-only models, and they do not generate anything but embeddings (because they lack a decoder). You are also generally expected to do further fine-tuning of the model for the task you want to use it for.
In this case, how or what should you fine-tune?
For example, while the sentence "Biden is the president of the US" is likely to be semantically similar to "The President of America is Biden", when doing retrieval, why would you expect the query "Who was the president of USA in 2021?" to be semantically matched to "Joe Biden"? Or consider how a query and the best document or text segment answering it have very different lengths (asymmetric search), so in pure similarity terms they may be viewed as different.
In the MTEB benchmark, which ranks how good embeddings are at different tasks, one of the tasks is "Semantic Textual Similarity", which is distinguished from semantic search/retrieval and reranking tasks.
Also, even if query passages and document passages are similar it does not mean they are the answers or documents needed.
That is why embeddings should be further fine-tuned with question and answer pairs, and indeed you CAN improve performance further by training the model on specifically labelled documents and queries. Such a dataset would typically have queries, documents that are relevant (at least one per query) and documents that are not relevant.
This is where supervised learning techniques on large datasets with queries and answers like MS MARCO are done to push the limits of performance further.
The success of transformer models like BERT and GPT lies in the fact that you do not need specially labelled datasets; you just do self-supervised training on a text corpus. However, fine-tuning for retrieval requires hard-to-find labelled datasets with queries and relevance judgements. This is why MS MARCO, released by Microsoft with millions of such queries and relevance judgements from Bing searches, is such an important dataset and even today is commonly used as a base for training models.
However, as useful as MS MARCO is, it is highly domain specific (it comes from Bing searches), and not all domains have the labelled data needed for supervised learning. As such, there is now active research into self-supervised techniques and distant (also called weak) supervision techniques to auto-label datasets. More advanced techniques like negative sampling/contrastive learning (which focuses on maximizing similarity within a class and minimizing similarity between classes), or even using language models to generate synthetic data for such domains, are ideas being explored.
Also, in most cases, fine-tuning on domain-specific examples might hurt performance if the model is used out of domain, as the vocabulary (tokens) encountered might not have been seen in training.
As you will see later, this is why many of these bi-encoder type embeddings outperform top lexical search methods like BM25 when tested on in-domain queries but fail badly when tested "out of domain" (e.g. if the embeddings were fine-tuned with query-document judgements on academic papers, they would do badly when tested on queries over newspapers).
In this sense BM25 is more "universal": across a wide variety of contexts, it is a hard baseline to beat on average.
Vector embedding, Vector stores and Vector search methods
By now you should pretty much understand how semantic search works. Typically, the most obvious way to do it is to
- Convert the query into an embedding (series of numbers) by running it through a pretrained embedding model either Word2vec or a more advanced contextual word embedding like BERT
- For every document do the same and convert it into an embedding using the same method
- Find the documents most similar to the query using a similarity measure like cosine similarity
Step 2 needs to be done over all the documents in your database, so rather than doing it on the fly, you typically precompute all the embeddings.
Obviously, even with precomputed embeddings, if you have a million of them it still takes a lot of work to compare them with the query embedding. Using exact k-nearest neighbours (KNN) you can find the closest embedding with 100% accuracy, but it is very slow.
In practice, for large numbers of documents the embeddings are stored in a vector store, often using an inverted file (IVF) or similar index. Approximate nearest neighbour (ANN) methods are used for speedups when matching, but these do not always guarantee the closest embedding will be found.
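Here is a minimal sketch of that basic pipeline (using the sentence-transformers library and a small off-the-shelf model purely as an example; a real system would keep the precomputed embeddings in a vector store and use ANN as just described):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example pretrained bi-encoder

# Step 2: precompute embeddings for every document in the collection.
docs = [
    "Jedi knights wield lightsabers in a galaxy far, far away.",
    "A review of open access citation advantage studies.",
]
doc_embeddings = model.encode(docs, convert_to_tensor=True)

# Step 1: convert the query into an embedding with the same model.
query_embedding = model.encode("Star Wars", convert_to_tensor=True)

# Step 3: rank documents by cosine similarity to the query embedding.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
for doc, score in zip(docs, scores):
    print(round(float(score), 3), doc)
```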
Bi-encoder vs cross-encoder
Another important concept to understand is the difference between a bi-encoder and a cross-encoder. This impacts how the query and the documents interact.
The example shows a system trying to decide whether two sentences are similar using sentence embeddings based on BERT. In the context of search, one sentence would be the query and the other would be from a potentially matching document.
The most common type of system implements a Bi-encoder where both sentence A (the query) and sentence B (potential matching sentence in document) will be converted into embeddings separately before they are checked for similarity using a function like Cosine-similarity/dot product.
The main advantage of this method is that you can precompute the embeddings for all the documents of your database in advance and store it in a vector database. So, when the search is run, you just need to convert the query into an embedding and then you can use methods that are reasonably fast (approximate nearest Neighbour) to quickly find the embeddings that are closest.
Now consider the alternative: the cross-encoder. The cross-encoder has even better performance, since you push both query and document into one model, which allows the model to take into account interactions between the two (as opposed to the bi-encoder method, where you create two separate embeddings and allow interaction only via something like a dot product).
Empirically, a bi-encoder using vanilla BERT embeddings actually performs worse than even older embeddings like GloVe! That said, sBERT-style bi-encoders trained for the purpose do come close to matching cross-encoder BERT. Though when they are trained on a few domains with one domain held out, they tend to struggle, showing they might struggle when used in out-of-domain contexts…
But cross-encoders are clearly more costly computationally, since you cannot precompute document embeddings as you can for a bi-encoder. If the database has 100k documents, you need to score all 100k document + query pairs on the fly during the search!
A workaround to get the best of both worlds is to use some other method to first cut down the number of top candidate results, then rerank these top candidates using a cross-encoder.
One obvious method is to use the bi-encoder to grab the top 100 and then the cross-encoder to rerank them.
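Here is a sketch of that two-stage setup using the sentence-transformers library (the model names are just popular examples; treat this as an illustration rather than any particular vendor's pipeline):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")                   # example models
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "does open access increase citations"
docs = [
    "Open access articles receive more citations than paywalled ones.",
    "A study of dog ownership and wellbeing.",
    "The citation advantage of open access publishing remains debated.",
]

# Stage 1: bi-encoder retrieval (document embeddings would normally be precomputed).
doc_emb = bi_encoder.encode(docs, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, doc_emb, top_k=2)[0]

# Stage 2: cross-encoder rescores each (query, candidate) pair jointly.
pairs = [(query, docs[hit["corpus_id"]]) for hit in hits]
for score, (q, doc) in sorted(zip(cross_encoder.predict(pairs), pairs),
                              reverse=True):
    print(round(float(score), 3), doc)
```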
ColBERT — better performance and interpretability?
There is also a late interaction model, e.g. ColBERT (Contextualized Late Interaction over BERT), that sits in between the bi-encoder and the cross-encoder, hoping to get the advantages of a cross-encoder at lower cost.
With a regular bi-encoder embedding model, you actually do get individual embeddings for each token/word of the document or query. But instead of storing these embeddings, you do "pooling", a kind of averaging across all the token embeddings at the last output layer, combining them into one embedding per query or document.
While storing/computing a single embedding per document or query is cheaper, you are forcing the algorithm to represent the whole query or document in just one way, even though a document may need to be represented in multiple ways. This may not be the best way to do things.
How ColBERT works
ColBERT, unlike bi-encoders, stores embeddings per token, and these are used for comparisons.
While this is more costly, as you will have a larger index and the calculations take longer, you get two benefits: better performance and greater interpretability.
How do these token embeddings interact? They use a function called MaxSim; I believe the algorithm works like this.
1. Take the first query token embedding
2. Calculate the cosine similarity of the query token against each of the token embedding in the document we want to score
3. Keep only the highest cosine similarity between that query token and any document token.
4. Repeat with the next query token embedding
5. When done, sum up all these maximum similarities to get the overall score of the document against the query.
Still unsure?
So here's an over-simplified example.
Imagine your query has 4 tokens, "Open", "Access", "Citation", "Advantage", and you are trying to match a document with, say, the tokens "Open", "bibliometric", "meta-data", …
For the first query token "Open", calculate the cosine similarity against the document tokens "Open", "bibliometric", "meta-data" and so on; the highest score will likely be for "Open" (since it matches exactly). Say the score is 0.7.
Continue for the other query tokens, e.g. "Citation" and "bibliometric" might have the highest cosine similarity, say 0.5.
Then the MaxSim score is just the sum of these maximum cosine similarities.
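Here is the MaxSim idea as a small numpy sketch (toy random embeddings just to show the mechanics; real ColBERT token embeddings come from a trained model):

```python
import numpy as np

def maxsim_score(query_tok_embs, doc_tok_embs):
    """ColBERT-style late interaction score between token embedding matrices."""
    # Normalise rows so that dot products become cosine similarities.
    q = query_tok_embs / np.linalg.norm(query_tok_embs, axis=1, keepdims=True)
    d = doc_tok_embs / np.linalg.norm(doc_tok_embs, axis=1, keepdims=True)
    sim = q @ d.T                 # cosine similarity of every query/doc token pair
    return sim.max(axis=1).sum()  # best match per query token, then sum

rng = np.random.default_rng(0)
query_embs = rng.normal(size=(4, 128))   # e.g. "Open", "Access", "Citation", "Advantage"
doc_embs = rng.normal(size=(50, 128))    # 50 document tokens
print(round(float(maxsim_score(query_embs, doc_embs)), 3))
```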
Better performance of ColBERT and better interpretability
Empirically, ColBERT results are clearly better than using a normal bi-encoder with performance approaching that of a cross-encoder. This makes sense given how the model works, since doing token level interactions is more nuanced than averaging all the token embeddings into one singular document embedding like in a bi-encoder.
Another interesting property of ColBERT is that the results are more interpretable.
A bi-encoder model computes a cosine similarity between the document and the query and comes up with a score. The problem with this is that all you get is a score; you are not able to "explain", even after the fact, why a certain document is scored highest or closest in meaning/aboutness to the query.
With ColBERT, you can see the token embedding interactions, which gives you more insight.
For example, with the MaxSim function, for each query token embedding you can see which document token embedding it scored highest against and thus "matched".
Some companies have even done ColBERT visualizations…
It is important to note that while ColBERT is more interpretable compared to bi-encoders, it is not as interpretable as a lexical keyword search, much less a strict boolean search.
This is because, at the end of the day, the token "matching" is done using cosine similarity of embeddings, and this is not easily predictable. For example, cosine similarity of embeddings may "decide" that the query token "Star Wars" is closest to the "Jedi" token in the document and score that highly, which "makes sense", but whether it does so depends on the context and cannot be predicted in advance of searching.
But at the end of the day, if you want “Semantic search”, you have to give up some interpretability and predictability.
More sources
Are the newer dense embeddings better than sparse embeddings?
In general, dense embeddings are better than standard sparse embeddings in specific domains but tend to be weaker if used out of domain according to the 2021 BEIR paper.
So, for example, to take a crude example if your dense embedding is trained (either supervised with labelled pairs or self-supervised) on journal articles and you are trying to do newspaper searches it can fail.
Indeed, a 2021 paper found that across a wide variety of settings (domains and tasks), such as in the BEIR test, BM25 is a very strong baseline to beat. In short, if you want something that works well across a large variety of different domains, or you cannot be sure what you will be searching, old-school BM25 is your best bet!
This makes sense intuitively, since dense embeddings will have problems with out-of-domain vocabulary compared to BM25.
However, this may be a less relevant problem in academic search where the domain tends to be quite restricted.
Besides, technology marches on, and a June 2023 blog post surveying results from the MTEB benchmark (an evolved version of BEIR) concludes that while BM25 was a very strong baseline in 2021, it is no longer a clear winner in 2023 and is starting to be exceeded by the latest dense embeddings like E5 and newer embedding methods like ColBERT (Vespa's ColBERT), SPLADEv2 and ELSER (though these methods are currently computationally expensive compared to lexical methods).
Still, BM25 remains a very strong baseline, showing that lexical search, while simple, is still very effective, particularly if you want the system to work decently well across a wide variety of contexts.
Why choose? Hybrid Search
Dense vectors are effective for encoding local syntactic and semantic cues, leveraging recent advances in contextualized text encoding (Devlin et al., 2019), while sparse vectors are superior at encoding precise lexical information. I recommend this article that compares BM25 vs dense embedding methods in depth; it even includes a live demo where you can compare two sets of results side by side using a subset of newspapers from Australia's TROVE (details here).
As such, hybrid approaches help offset the weakness of dense embedding approaches, namely being weaker on out-of-domain tasks.
There are multiple ways to do this:
- Dense first search — Retrieve candidates using a dense model and rerank using the sparse model
- Sparse first search — Retrieve candidates using the sparse model and rerank using a dense model
- Hybrid search — Retrieve candidates from both lists and combine them as a post-processing step using methods like reciprocal rank fusion, or rerank the combined list using a powerful but expensive cross-encoder.
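Reciprocal rank fusion, mentioned in the last option, is simple enough to sketch in a few lines (k = 60 is the constant commonly used in the literature; the document IDs below are made up):

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Combine several ranked lists of doc IDs into one fused ranking."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            # Each list contributes 1 / (k + rank) for every document it ranked.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse_results = ["d1", "d3", "d2"]   # e.g. from BM25
dense_results = ["d3", "d4", "d1"]    # e.g. from a dense bi-encoder
print(reciprocal_rank_fusion([sparse_results, dense_results]))
```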
It seems the current evidence (2023) and industry practice favour the hybrid approach of pooling documents from both methods before reranking, as doing either dense-first or sparse-first risks missing relevant documents; see this article for a discussion of why.
Another, more practical issue is that users often want control and predictability in their searches: what happens if they search and also require that the documents be from after 2020 only? Or if they want an exact match in the title field only?
All this points to the need for a hybrid system using both semantic and lexical features… See also
How do we know we are dealing with a Semantic Search vs Lexical search vs Boolean search?
This is the question librarians ask the most, but it is actually a surprisingly hard question to answer even if you have a decent knowledge of information retrieval.
Note in practice a lot of systems combine result sets from both semantic and lexical search to try to compensate for weaknesses of both. So in practice what you are trying to figure out is if a search system is purely lexical search based or lexical plus semantic.
You can, with some effort, tell if something isn't strict Boolean by simply noticing that it is pulling in results that cannot be "explained", either because a result doesn't have the term you queried (even after taking into account stemming/lemmatization etc.) or the reverse, or because the number of results doesn't make sense when compared across queries, for example A AND B AND C getting more results than (A OR B) AND C.
However, the line between non-Boolean lexical search that does liberal (rule-based) query expansion or uses vector space models (e.g. TF-IDF/BM25) alone without Boolean, and "semantic search", is much greyer if you just look at the output. Both methods are capable of returning results that cannot be "explained".
How so?
Take the following search, where I entered the names of a series of new, emerging discovery tools. I am pretty sure there isn't any peer-reviewed paper in JSTOR that has all these terms, and the default JSTOR search shows no results.
JSTOR's generative AI search now offers an experimental beta search that
understands your query and provides more relevant results, even if you don’t use the exact words
Basically, semantic search.
And when used, lo and behold, it finds 20 documents that might be relevant.
This clearly isn't a strict, or probably even a loose, Boolean search, as you will quickly notice that the documents surfaced do not have even half, much less all, of the terms searched. Sometimes they have none!
Is this semantic search or just lexical search that is non-boolean?
In theory it can be either, since both methods essentially find the closest match/most similar document to the query. For example, TF-IDF alone can still rank documents even if not all, or even most, keywords are matched; among all documents, some will still score higher on TF-IDF than others.
In theory, if you are not doing Boolean, any search query could yield some results, because based on your similarity metric (TF-IDF or embedding) some document must be closest! Of course, in practice some search engines have a cut-off where, if the similarity score is too low, they show no results, but many do not.
Some searches even allow, or encourage, you to dump in long paragraphs of text. For example, take a search like Elicit.com, where I paste in a whole paragraph of text.
In a strict Boolean or even a loose Boolean with query expansion the search will yield zero results.
Yet in Elicit and many of the newer AI powered search tools you get a result.
Boolean vs Lexical vs Semantic — recap & conclusion
Boolean — This is the standard database search librarians are all familiar with. Boolean searching is lexical or keyword searching; however, it is possible to have lexical/keyword searching without strict Boolean.
Lexical/Keyword search — This simply means the matching and relevancy revolve around the keywords/terms you type. The extremely popular TF-IDF/BM25 by itself may not follow Boolean (TF-IDF can rank a document on top with missing query terms if the TF and IDF of the matched terms are high enough), though it is popular to combine Boolean and TF-IDF/BM25. You can also get loose Boolean systems, where the search mostly follows Boolean matching but under certain conditions the system might be "helpful" or "smart" and do rule-based query expansion.
Semantic Search — This is the holy grail of search: search that can understand meaning and concepts. As opposed to lexical/keyword search, it should take word order into account and link related concepts and synonyms automatically, rather than just matching terms/n-grams.
State-of-the-art semantic search is based on embeddings, which are word representations learnt by deep learning from huge corpora of text, typically by self-supervised learning of held-out words. The latest embeddings are contextual rather than static, unlike word2vec.
Word embeddings provided by word2vec or fastText have a vocabulary (dictionary) of words. The elements of this vocabulary are words and their corresponding word embeddings. Hence, given a word, its embedding is always the same in whichever sentence it occurs; the pre-trained word embeddings are static.
Contextual embeddings, however, are generally obtained from transformer-based models by passing the entire sentence to the pre-trained model. Here there is still a vocabulary of words, but the vocabulary does not contain the contextual embeddings themselves. The embedding generated for each word depends on the other words in the given sentence (the other words are referred to as the context; transformer-based models work on the attention mechanism, and attention is a way of looking at the relation between a word and its neighbours). Thus, a given word does not have a static embedding; its embeddings are dynamically generated by the pre-trained (or fine-tuned) model.
The main disadvantage, from a librarian's viewpoint, of semantic search using state-of-the-art dense embeddings is that you give up interpretability and predictability.
Newer methods like ColBERT or SPLADE provide more interpretability than the usual bi-encoder methods by letting you look into the token-level interactions, but at the end of the day they are still less predictable than traditional lexical keyword search, much less Boolean search, because at their base they are still doing cosine similarity matches of token embeddings, which is not predictable. See http://musingsaboutlibrarianship.blogspot.com/2024/06/can-semantic-search-be-more.html for more details.
Hybrid searches — As you have seen in the examples above, a lot of systems are hybrid. For example, many search systems combine results from both keyword search (sparse embeddings) and semantic search (dense embeddings) into a first-cut list before using a cross-encoder reranker. Other systems may use keyword searching (sparse embeddings) as a first cut, followed by semantic search (dense embeddings), etc.
Telling if a search is semantic search or not
In practice it is relatively easy to tell if a search is non-Boolean by tossing in a chunk of text and seeing if there are still results. But it is hard to tell without looking under the hood if it is just non-Boolean lexical search or semantic search.
The line between the two isn't that clear anyway.