Why Blindly Applying Prompt Engineering Techniques to RAG Systems as a User Is a Bad Idea
Note: This is a GPT4-summarised version, edited by me, of a much longer and more detailed post — Prompt engineering with Retrieval Augmented Generation systems — tread with caution! (Read that one if you want the gory details.)
As a librarian, I've been excited to explore the potential of prompt engineering in helping users get the most out of language models.
However, after watching a librarian give a talk on their experiments with teaching prompt engineering, I realized that librarians are now teaching, or may be tempted to blindly reuse, the prompt engineering techniques they have learnt on LLMs like GPT4 on academic search systems that use Retrieval Augmented Generation (RAG), such as Scopus AI, Elicit.com, SciSpace, Web of Science Research Assistant (upcoming), Primo Research Assistant (upcoming), Statista's brand new research AI and many others (see list).
What's the problem?
Prompt engineering techniques developed for directly querying Large Language Models (LLMs) like ChatGPT may not work in RAG systems.
RAG systems work differently: they use the input to find relevant sources and then generate an answer based on those sources. This means that specific techniques like role prompting ("You are an expert in X") and emotion prompting (trying to bribe or threaten the model), which may work when directly querying LLMs, may not be effective in RAG systems and may even harm the retrieval of relevant sources.
I am unsure whether chain-of-thought techniques or n-shot prompting (e.g. showing examples of what is relevant and what isn't) work in RAG, but those aren't often used by ordinary users when they do "prompt engineering" anyway.
Trying to specify layout and formatting (e.g. "I want 5 paragraphs", "I want results in a table") will have mixed results because of uncertain interactions with the default prompt used by the RAG system.
More general "tips" like being "clear" and "specific" are unlikely to hurt, but those are common-sense tips you should know anyway!
How RAG systems work
RAG systems do not directly query the LLM.
Instead they first use the input to find relevant sources, which are then fed to the LLM to generate an answer with reference to the sources found.
For our purposes it doesn't matter exactly how the relevant sources are found, but a very simplified way the retrieval could work is this.
The input is converted into a dense embedding, which is used to find documents (and text chunks within them) with similar embeddings.
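To make that concrete, here is a minimal sketch of embedding-based retrieval. The library, model name, and example chunks are illustrative assumptions rather than what any particular academic RAG system actually uses. Note that the entire input string is embedded, which is why extra instructions mixed into the input can shift what gets retrieved.

```python
# Minimal sketch of dense-embedding retrieval (illustrative only; real systems
# use vector databases, chunking pipelines, and their own embedding models).
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding library

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

chunks = [
    "Open access mandates have reshaped journal publishing since 2018.",
    "University presses report declining monograph sales.",
    "Transformative agreements now cover a growing share of journal output.",
]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return the top_k chunks whose embeddings are most similar to the query."""
    query_vec = model.encode([query])[0]
    chunk_vecs = model.encode(chunks)
    # Cosine similarity between the query embedding and every chunk embedding.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    ranked = np.argsort(sims)[::-1][:top_k]
    return [chunks[i] for i in ranked]

# The whole query string is embedded, instructions and all.
print(retrieve("How has academic journal publishing changed recently?"))
```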
Other ways to find relevant text chunks are possible (e.g. asking the LLM to come up with a keyword search that is then run).
No matter how the top sources are found, they are then wrapped in a predetermined RAG prompt, which is used to generate the output. This process is quite different from directly prompting an LLM, which generates text based on the input alone.
The example above, from Bing Chat, shows how results found by the Bing search engine are fed to GPT4, which generates the answer. A very similar process happens in academic RAG search systems.
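Here is a minimal sketch of what "wrapping the sources in a predetermined prompt" could look like. The template wording is invented for illustration; real systems keep their own (usually hidden) default prompts.

```python
# Illustrative RAG prompt template -- the wording is made up; each system has
# its own hidden default prompt.
RAG_TEMPLATE = """You are a research assistant. Answer the question using ONLY
the sources below, and cite them as [1], [2], ...

Sources:
{sources}

Question: {question}

Answer:"""

def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Wrap the retrieved chunks and the user's input inside the system's default prompt."""
    sources = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    return RAG_TEMPLATE.format(sources=sources, question=question)

# The user's input ends up as just one slot in a larger, fixed prompt, which is
# what the LLM actually sees.
print(build_rag_prompt(
    "How has academic journal publishing changed over the last 5 years?",
    ["Open access mandates have reshaped journal publishing since 2018.",
     "Transformative agreements now cover a growing share of journal output."],
))
```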
The dangers of applying prompt engineering techniques to RAG systems
Now that you understand how RAG works, you can see that when we apply irrelevant prompt engineering techniques to RAG systems, we risk introducing noise into the system, as the RAG system uses the entire input to retrieve sources.
Instructions to control the generation of text, such as asking for a particular format or tone or for results in a table, may not be followed, since the RAG system itself has its own default prompts for working with what is found.
Mixing instructions into the input can lead to inconsistent results. In the example below, I prompt SciSpace with the instruction "Present key statistics in a table format." Notice how it gamely tries, but this system is only meant to output text, not tables, hence the strange result.
For the technically inclined, trying to send instructions like this to override the default prompt is very close to the idea of prompt injection!
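Continuing the sketch above (again with an invented default prompt), this is roughly why: the user's "instructions" travel inside the question slot of the system's own prompt and have to fight it from within, which is the same mechanism prompt injection exploits.

```python
# The user's input, instructions and all, lands inside the system's fixed
# default prompt (an invented one is used here for illustration).
DEFAULT_PROMPT = (
    "Answer the question using ONLY the sources below and cite them.\n\n"
    "Sources:\n{sources}\n\nQuestion: {question}\n\nAnswer:"
)

user_input = (
    "How has academic publishing changed over the last 5 years? "
    "Present key statistics in a table format. Ask me questions before you answer."
)

print(DEFAULT_PROMPT.format(sources="[1] ...retrieved text chunk...",
                            question=user_input))
# The LLM now sees two competing sets of instructions: the system's ("answer
# from the sources, cite them") and the user's ("give me a table, ask me
# questions first"). Which one wins is unpredictable, and upstream the whole
# user_input string was also used for retrieval.
```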
A problematic prompt
Let's take a look at an example of a prompt that includes instructions to control the generation of text:
"How has the academic publishing industry changed over the last 5 years? Do not talk about book publishing, instead prioritise academic journals. I am an academic librarian and am familiar with the industry, so tailor your response to this. I want a five-paragraph response I can quote from during a webinar. Ensure that your answer is unbiased and avoids relying on stereotypes. Present key statistics in a table format. Ask me questions before you answer."
The most obviously problematic part is the instruction "Ask me questions before you answer." While this works if you are prompting a multi-turn chatbot/LLM, for academic search engines like Elicit, Scopus AI, etc., it obviously won't work, because the system is designed to return an answer straight away!
This prompt includes instructions to control the generation of text, such as the format and tone. However, these instructions may not be followed, as the RAG system already has its own default prompt, and they may even harm the retrieval of relevant sources.
As shown above, even if the formatting instructions are followed, in systems like SciSpace that are designed to return only text, asking for results in a table leads to weird output.
It's unclear to me whether negative statements like "Do not talk about book publishing, instead prioritise academic journals" are helpful. Embedding-based search is usually smart enough to "understand" negatives, but it might still get them wrong. It might help at the generation stage, though.
What can we do instead?
So, what can we do instead? Here are a few recommendations:
1. Be cautious when applying prompt engineering techniques to RAG systems. Test rigorously, with multiple runs and different variants, to confirm the techniques are effective, rather than assuming they work because they were tested on ChatGPT or similar LLMs.
"We do a lot of our own prompt engineering under the hood, so if you're using ChatGPT for your own research [and] you've got a bunch of prompts [phrased] in very specific ways, it won't necessarily play as directly nicely with, like, [the] assistant, because we use different language models in our own set of prompts to kind of manipulate the behavior, so you'll probably need to kind of play around with it."
Remember, these LLM-based search systems are very fault tolerant, and even if you enter a lot of unnecessary text, the results are almost always decent. This does not mean they are BETTER. Always test a simple, off-the-top-of-your-head input against your much longer engineered prompt (a sketch of such a comparison follows after these recommendations). You will be surprised how often there is little difference.
2. Reach out to the RAG system designers for advice before suggesting prompt engineering tactics to users.
For example, Scite.ai now offers an "AI Prompt Handbook for Scite Assistant: Optimize AI Outputs & Accelerate Your Research". Notably, besides a tip that boils down to being specific in your input, it does not mention any fancy tricks like role prompting, giving incentives, or even asking for specific output layouts. It does mention using the Scite Assistant advanced settings to control what is being cited.
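To make the testing recommendation concrete, here is a minimal sketch of comparing a plain query against an "engineered" one over several runs. The ask_rag_system function is a hypothetical placeholder, not a real API of Elicit, SciSpace, or any other tool; in practice you may simply run both queries by hand a few times and compare the answers and cited sources yourself.

```python
# Hypothetical test harness: ask_rag_system is a placeholder, not a real API.
# In practice, substitute manual runs in Elicit, SciSpace, Scopus AI, etc.
def ask_rag_system(query: str) -> str:
    """Stand-in for submitting a query to a RAG search tool and getting its answer."""
    raise NotImplementedError("Run the query in the tool you are evaluating.")

plain_query = "How has academic journal publishing changed over the last 5 years?"
engineered_query = (
    "You are an expert in scholarly communication. "
    + plain_query
    + " Present key statistics in a table format. Ask me questions before you answer."
)

runs = 5  # RAG answers vary between runs, so collect several of each
results = {"plain": [], "engineered": []}
for _ in range(runs):
    results["plain"].append(ask_rag_system(plain_query))
    results["engineered"].append(ask_rag_system(engineered_query))

# Then compare the two sets of answers (and the sources they cite):
# are the "engineered" answers actually better, or just different or noisier?
```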
Conclusion
In conclusion, while prompt engineering can be a powerful tool for improving the performance of language models, blindly applying these techniques to RAG systems can backfire. By understanding how RAG systems work and being cautious when applying prompt engineering techniques, we can avoid misleading our users and ensure they get the most out of these powerful tools.