Large Language Models — More developments and thoughts

Aaron Tay
10 min read · Apr 5, 2023

In my last long blog post, How Q&A systems based on large language models (eg GPT4) will change things if they become the dominant search paradigm, I shared my first thoughts about this new class of search systems.

But things are really moving fast around large language models (LLMs) and after playing with them more, here are some further thoughts and observations about some of the latest developments.

1. The narrative is slowly changing as people realize you probably don’t want to use LLMs alone for info retrieval.

2. Multiple ways of combining web browsing/search with LLMs are emerging.

3. Not just search: large language models are going to gain multiple capabilities quickly via plugins, as integration with APIs is really easy.

4. For information retrieval tasks, it is unclear when searching will reduce the quality of answers from LLMs.

5. Large language models can be used in many different ways for information retrieval.

6. What is used to train these large language models? (How much, if any, academic research is used?)

7. Adverts are coming to these new search engines!

8. Momentous changes to society may be coming (e.g. job displacements at least, doomsday at worst)

1. The narrative is slowly changing as people realize you probably don’t want to use LLMs alone for info retrieval.

It’s nice to see the narrative slowly changing, as people are starting to explain that besides answering directly from what was learned in their weights during training, LLMs can also extract answers from databases/documents and put them into the prompt to try to answer.

In Crucial difference between web search vs ChatGPT, the author quotes the CEO of OpenAI, Sam Altman, as saying that

ChatGPT is actually more about human-like cognition than providing and even collating regurgitated content… The reason that LLMs can’t always regurgitate — which would be useful for web search — but makes them prone to hallucination is that nowhere in the neural network is a single piece of training data stored verbatim.

That’s not how neural networks work. They are actually kind of compression / decompression algos.

In other words, large language models are good at reasoning but are not meant to memorize answers or facts. Kinda like a human, in fact.

Another excellent piece is “GPT-4 Is a Reasoning Engine”, which makes the same point: by using “a version of ChatGPT that can use web searches to ground its answers with what it finds on the internet”, the author gets far more reliable answers than by using GPT-4 alone.

Here’s my stab at explaining it.

2. Multiple ways of combining web browsing/search with LLMs are becoming mainstream

Initially, the idea of combining LLMs with search was fairly obscure. Besides Simon Willison’s experiments doing Q&A over documents with GPT-3 embeddings, and specialised academic search engines doing the same, the idea was mostly unknown.

But now people are releasing things like PDF GPT, where you can “ask questions of a PDF”. Not everyone is sure this is useful, particularly if you do it one PDF at a time.
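To make the “Q&A over documents” idea concrete, here is a minimal sketch of the retrieval step such tools perform: split the document into chunks, rank the chunks against the question, and stuff the best ones into the prompt for the LLM to answer from. This toy uses bag-of-words overlap as a stand-in for real model embeddings (actual tools use embedding APIs), and the sample chunks and question are made up for illustration.

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": a bag-of-words count vector. Real systems use
    # dense model embeddings, which capture meaning beyond shared words.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_prompt(question, chunks, top_k=2):
    # Rank chunks by similarity to the question, keep the best ones,
    # and paste them into the prompt that would be sent to the LLM.
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    context = "\n".join(ranked[:top_k])
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

chunks = [
    "The library opens at 9am on weekdays.",
    "Interlibrary loan requests take three days.",
    "The cafeteria serves lunch from noon.",
]
prompt = build_prompt("When does the library open?", chunks, top_k=1)
print(prompt)
```

The key point is that the LLM never sees the whole document, only the retrieved chunks, which is why answering “one PDF at a time” is really just retrieval plus prompting.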

But obviously it was the release of Bing+GPT/Chat that introduced this idea of combining search and LLMs to the masses (as did lesser-known earlier precursors).

I suspect the newly announced ChatGPT plugins are what will push this idea even more mainstream, as they introduce web search and/or retrieval-over-documents plugins.

3. Not just search, but Large Language Models are going to gain multiple capabilities quickly as integration with APIs is easy.

Of course, the idea of ChatGPT plugins is far more far-reaching than just allowing web search or retrieval-over-documents plugins.

Plugins like the code interpreter, and support for Wolfram Alpha and Zapier, greatly expand the capabilities of LLMs.

Moreover, people looking at the documentation on how to create plugins are raving about how easy it is to implement them, as long as an API is available.

So yes, you just describe, in natural language, what the API should be used for, and the model knows when to use it!

This is akin to explaining the use of an API to another human, who will then decide when to use it to extend their abilities.
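As a concrete illustration, here is roughly the shape of a ChatGPT plugin manifest at launch: the key field is a plain-English description_for_model telling the model when to reach for the API. The plugin name, descriptions, and URL below are hypothetical examples, and the exact schema may have changed since.

```json
{
  "schema_version": "v1",
  "name_for_human": "Library Catalog",
  "name_for_model": "library_catalog",
  "description_for_human": "Check your loans and search the library catalog.",
  "description_for_model": "Use this when the user asks about their library loans, due dates, or wants to search the library catalog for books and articles.",
  "auth": { "type": "none" },
  "api": {
    "type": "openapi",
    "url": "https://example.edu/openapi.yaml"
  }
}
```

Everything else (the actual endpoints, parameters, and responses) is described by the linked OpenAPI spec; the model reads these descriptions and decides for itself when a user’s request calls for the plugin.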

For example, GPT models are known to be not that good at mathematics; by hooking one up to Wolfram Alpha, this weakness is mostly reduced. As the tweet below shows, this enables the best of both worlds from two opposing AI paradigms: symbolic AI and connectionist AI (neural nets).

This is going to make it near trivial to add additional capabilities to large language models. For the librarians reading, this means adding plugins to check loans, search specialised paywalled databases, etc.

4. It’s unclear when searching helps improve the quality of answers

Let’s assume a scenario where the LLM can either generate text based on its own weights OR choose to search (let’s ignore the other capabilities it might have from other plugins).

As an academic librarian with an interest in search engines, I tend to be biased to see most things through the lens of search. And while it is indeed true that adding search to LLMs helps ground the answer and reduce hallucinations, for some tasks using search hurts performance.


I’ve found the same.

The tendency of these search+LLM systems to search seems to differ. For example, I’ve found Bing+Chat tends to search a lot more, even when it is not necessary, leading to poor results. In Bing+Chat, the mode used (Creative vs Balanced vs Precise) also affects results.

Still, it is not always clear when searching will hurt results, so it might be a good idea for such systems to offer both options. You can always prompt it not to search, as suggested above, but I find it does not always comply.

Or possibly a hybrid where it generates an answer but still shows relevant documents below the generated answer.

Do note Google already provides generated direct answers in certain situations by drawing on the Google Knowledge Graph or Google featured snippets.

5. Large language models can be used in many different ways for information retrieval.

As I see different systems start to incorporate LLM capabilities for information retrieval, I am struck by the different ways they are implemented.

Scenario 1 — Using native LLM capabilities only — eg ChatGPT, GPT4 without plugins

Up until recently, this is what most people were familiar with. The LLM generates text based on the weights it obtained from training. The major weaknesses of using this for information retrieval are the lack of citations (so you can’t verify the answer) and the lack of currency.

Scenario 2 — LLM which is capable of searching but does not always search — eg Bing+Chat, scite assistant, ChatGPT+plugin

This is a newer class of tools where at its core it is a large language model, but depending on the prompt it may “decide” to search and use those results to answer queries. For example, scite assistant answers prompts both with and without searching.


In the example above, scite assistant replies normally without searching to my first prompt. Then it starts to search and gets results from scite’s database to try to answer the question.

Scenario 3 — Search engine with LLM that always searches — eg Elicit

These are outright search engines, or what are traditionally called Q&A systems, that only search. They take all input as search queries, try to find relevant documents, and extract the answers. This can be in the form of answers extracted from the documents, or a generated paragraph of answers.


Unlike, say, the new Bing+Chat, every query you enter in Elicit is definitely interpreted as a search. The large language model is used to extract answers, or may be used to extract information about individual papers such as main findings, methodology, region, etc.

Scenario 4 — Using large language models to interrogate individual papers

This overlaps with some of the earlier scenarios but is worth noting. The idea here is to use LLMs’ natural language understanding capabilities to interrogate a paper for details.

This can be done on a per-paper basis, as in Scispace. For this search engine, the initial search works like a normal search; however, if the results are indexed full-text, you can query the paper with prompts (similar to ChatPDF).

You can highlight text to summarise or explain or even ask it to explain tables!

Similarly, Elicit uses language models to extract various details about papers such as region, main finding, population, etc; but instead of doing it one paper at a time, it does it for the top papers found and displays the results in a table.

Lastly, would it surprise you to know that even “traditional search engines” that show links to relevant documents, like Google, use LLMs? Many search engines are now using language models like BERT to improve query and result relevancy, as these embeddings “understand” queries better and produce better relevance rankings. However, these are not visible to users.

6. What is used to train these large language models?

Even though we have stressed that these language models work well together with search to extract answers, there is still some curiosity and discussion over whether GPT3, ChatGPT or GPT4 is trained on academic papers (either Open Access or paywalled).

Some of the discussion on Twitter is captured here. We know from the GPT3 paper what datasets were used for training.

It’s unclear how much, if any, of the Common Crawl includes Open Access articles. Based on my testing, it seems unlikely the models have seen many paywalled articles, though others have suggested the training data might include illegal full-text from sources like Libgen and Sci-Hub, particularly if “The Pile” was used.

There have been many encoder (BERT-type) LLMs trained on academic papers, but far fewer decoder (GPT-type) models. One exception is Meta’s Galactica.

Some have argued that Google, via Google Scholar, would definitely have access to paywalled articles (as possibly would Bing), but it’s unclear if the public versions of their LLMs are trained on them.

If so, is it a problem that LLM training does not include closed-access PDFs?

Of course, we have been arguing throughout this post that by attaching a search engine we can get more reliable answers even if the LLM isn’t trained on paywalled content. But still, I wonder: if there is some sort of bias in what is Open Access and what isn’t, and this affects what the LLM is trained on, could the “reasoning” capabilities be biased?

OpenAI has been coy about GPT-4’s training sources, except for mentioning “partners” in an interview; one wonders if this might include academic papers.

7. Adverts are coming to these new search engines!

It is well known that the cornerstone of the current AI revolution is the Transformer architecture. More specifically, the late-2017 paper “Attention Is All You Need” was the seminal paper that proposed it, and of its 8 authors, 6 were with Google at the time.

Together with the acquired startup DeepMind, Google undoubtedly has world-class AI talent. Yet it seems they have been slow to commercialize AI, and their Bard product is clearly only a response to OpenAI’s ChatGPT.

Many suspect that part of the reason they have not been keen to deploy such technologies, despite the obvious applications for search and information retrieval and organization, is that it would destroy their advertising-based business model.

After all, if you have a system that automatically extracts answers directly from websites, why would people go to those websites? I’m not sure this reasoning is rock solid, but the argument goes that Microsoft, which is far less dependent on advertising, is willing to blow up this model by adding GPT4 to Bing.

That said, people are spotting ads appearing in Bing+Chat.

This is scary of course.

8. Momentous changes to society may be coming (e.g. job displacements at least, doomsday at worst)

To end this on a science-fiction-like note.

Elon Musk and a number of computer scientists have called for “all AI labs to immediately pause for at least 6 months the training of AI systems more powerful than GPT-4”.

The open letter asks us to consider some of the risks more powerful AI could cause:

Contemporary AI systems are now becoming human-competitive at general tasks,[3] and we must ask ourselves: Should we let machines flood our information channels with propaganda and untruth? Should we automate away all the jobs, including the fulfilling ones? Should we develop nonhuman minds that might eventually outnumber, outsmart, obsolete and replace us? Should we risk loss of control of our civilization? Such decisions must not be delegated to unelected tech leaders.

If you think AI taking over the world or even destroying humanity is too far out, consider the more mundane scenario of huge job displacements.

Goldman Sachs, for example, predicts generative AI could automate up to 25% of the work currently being done in the United States and Europe.

I personally think more radical AI doom scenarios are not that improbable given a time horizon of 10 to 20 years.

For example, LLMs alone currently do not have any agency; they wait passively for your prompt.

But with some simple modifications, people are making LLMs autonomous: you can give one a few goals and it will keep going until it thinks it has achieved them. Using capabilities like web browsing given to it by plugins, it really starts to look eerily like an agent in the real world.
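The loop behind these autonomous experiments can be sketched in a few lines: the model repeatedly picks an action, a tool executes it, and the observation is fed back until the model declares the goal achieved. This is a toy sketch, with a scripted stand-in for the model and a fake search tool; real systems plug in an actual LLM call and real tools.

```python
def run_agent(goal, tools, llm, max_steps=5):
    """Loop until the model says the goal is done, or we hit max_steps."""
    history = []
    for _ in range(max_steps):
        action, arg = llm(goal, history)  # the model picks the next action
        if action == "finish":
            return arg                    # model declares the goal achieved
        observation = tools[action](arg)  # eg a web-search plugin
        history.append((action, arg, observation))
    return None                           # gave up without finishing

# Scripted stand-in for the LLM: search once, then finish with the result.
def scripted_llm(goal, history):
    if not history:
        return ("search", goal)
    return ("finish", history[-1][2])

# Fake tool standing in for a real web-search plugin.
tools = {"search": lambda q: f"Top result for: {q}"}

answer = run_agent("latest GPT-4 news", tools, scripted_llm)
print(answer)
```

The unsettling part is how little scaffolding this takes: the “agency” is just a loop around the model, with plugins supplying the tools.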



Aaron Tay

A Librarian from Singapore Management University. Into social media, bibliometrics, library technology and above all libraries.