
Can RAG eliminate AI hallucinations or do we smell a RAT?

One of the many well-known problems with large language models (LLMs) is their tendency to hallucinate – or in plainer language, to make stuff up. In a legal context, lawyers must apply facts, citations of previous cases or relevant laws, and legal reasoning. Judges tend to have a low tolerance for stuff that is made up. As a result, generic LLMs on their own are not particularly reliable for legal tasks such as answering questions on points of law or finding citations to support an argument.

Some legal technology vendors propose a method called retrieval augmented generation (RAG) as a solution for this hallucination problem. However, vendors’ claims that RAG can entirely eliminate hallucinations are just more AI hype.

[Image: a cyberpunk rat reading legal books] This rat does legal research. It’s adorable but you might be seeing things. Picture: AI generated by Stable Diffusion.

What’s RAG?

LLMs are designed to produce plausible answers to questions by building huge mathematical models of which words frequently go together in the material they have been trained on. In the case of ChatGPT and other commercial LLMs, the training sets were vast quantities of data scraped from the internet. Some of that training material, you may be shocked to learn, is wrong. And the LLM doesn’t know whether the answer it generates is true or false – it can’t reason logically or fact-check. It just uses a mathematical model to generate a sentence that sounds like it answers the question. (This is why one recent paper in the journal Ethics and Information Technology suggested AI hallucinations would be better described as ‘bullshit’, because AI models don’t know and don’t care whether their answers are true.)
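
To see why next-word prediction is truth-blind, here is a deliberately tiny Python sketch. It is nothing like a production LLM (those use neural networks trained on tokens, not word-pair counts), but it shows the core move: pick whichever word tends to follow the current one in the training material, with no check on whether the result is true.

```python
# A toy next-word predictor: count which word follows which in a tiny
# "training set", then generate text by repeatedly sampling a likely
# successor. It has no notion of truth, only of co-occurrence.
import random
from collections import Counter, defaultdict

training_text = (
    "the court granted the motion to dismiss "
    "the court denied the motion to dismiss "
    "the appeal was dismissed by the court"
)

# Count which word follows which in the training text.
follows = defaultdict(Counter)
words = training_text.split()
for current_word, next_word in zip(words, words[1:]):
    follows[current_word][next_word] += 1

def generate(start: str, length: int = 6) -> str:
    """Produce plausible-sounding text one word at a time, truth-blind."""
    out = [start]
    for _ in range(length):
        candidates = follows.get(out[-1])
        if not candidates:
            break
        choices, weights = zip(*candidates.items())
        out.append(random.choices(choices, weights=weights)[0])
    return " ".join(out)

print(generate("the"))  # e.g. "the court granted the motion to dismiss"
```

This toy model will happily tell you the court both granted and denied the same motion, depending on the roll of the dice. Scale the same idea up to a neural network with billions of parameters and you get far more fluent text, but the indifference to truth remains.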

RAG works by splitting the process of answering a question into two parts:

  • First it uses the user’s question to search a knowledge base of known and verified material.
  • Then it passes the results of that search to an LLM, which uses those sources to generate the final response.

This is illustrated in the diagram below from a preprint paper, Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools, by Stanford University’s RegLab research centre.

[Diagram] RAG answers questions by querying a knowledge base and using the retrieved documents as sources for an LLM to generate the response.

In addition, RAG tools are programmed to cite the sources they use when generating answers.
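
Put together, the retrieve-then-generate flow looks something like the minimal Python sketch below. Everything in it is an invented stand-in: the two documents, the crude word-overlap retriever, and the call_llm stub that takes the place of a commercial model API. Real legal research tools use semantic search over large curated databases, but the shape of the pipeline is the same.

```python
# Minimal sketch of the two-step RAG flow: retrieve verified sources, then
# ask the model to answer from those sources only, citing them.

KNOWLEDGE_BASE = [
    {"id": "case-001", "text": "Example v Example: the court granted the motion to dismiss ..."},
    {"id": "statute-042", "text": "Section 42 of the Example Act provides that ..."},
]

def retrieve(question: str, top_k: int = 2) -> list[dict]:
    """Step 1: search the knowledge base of known and verified material."""
    q_words = set(question.lower().split())
    scored = [(len(q_words & set(doc["text"].lower().split())), doc)
              for doc in KNOWLEDGE_BASE]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

def call_llm(prompt: str) -> str:
    """Stand-in for a call to a commercial LLM API."""
    return "(model-generated answer citing the supplied source IDs)"

def answer(question: str) -> str:
    """Step 2: pass the retrieved sources to the LLM to generate the response."""
    sources = retrieve(question)
    context = "\n\n".join(f"[{doc['id']}] {doc['text']}" for doc in sources)
    prompt = (
        "Answer the question using ONLY the sources below, citing the "
        f"source IDs you rely on.\n\nSources:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)

print(answer("Was the motion to dismiss in Example v Example granted?"))
```

Notice that the instruction to cite sources lives in the prompt: the tool can point at its sources, but nothing in the pipeline forces the model to use them faithfully, which is where the trouble discussed below comes in.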

If RAG only uses accurate source material to generate its answers, and it allows the user to fact-check, does that solve the hallucination problem?

RAGs to riches? Not so fast

The Stanford paper found that RAG-based legal research tools were more accurate at answering legal questions than general-purpose LLMs such as ChatGPT, but they still hallucinated. “Over 1 in 6 of our queries caused Lexis+ AI and Ask Practical Law AI to respond with misleading or false information. And Westlaw hallucinated substantially more—one-third of its responses contained a hallucination.”

The paper goes into a lot of detail about how it defines hallucination and why the authors think RAG still generates incorrect answers – it’s well worth a read if you have the time. I should also mention that the vendors examined in the paper criticised its methodology, which led the authors to publish a revised version. However, a lot of the vendors’ criticism was quibbling over the definition of ‘hallucinations’ or claiming that unpublished internal research found much lower error rates.


In essence, RAG addresses only one of the two causes of LLM hallucinations, and it adds a new potential point of failure. The RAG approach can’t do anything about LLMs’ tendency to make up plausible but false claims; instead, it tries to limit hallucinations by providing only reliable source material.

Jeremy Pickens, head of applied science at Redgrave Data, told LegalTech News that he avoids the word ‘reasoning’ when it comes to how the RAG process works. The LLM is still performing its core function, ‘generating the next word in the sequence’, but this time it is generating within the contexts given to it by the user’s external documents.

However, the sources retrieved from the knowledge base, even though they’re accurate, may not be the right ones to answer the question. The paper gives the example of a question about an airline’s motion to dismiss a wrongful death suit arising from a plane crash. One of the RAG tools retrieved sources that were about motions to dismiss in other contexts, such as bankruptcy and patent litigation, but could not identify that the core issue was tort negligence.

With those factors combined, the Stanford paper catalogues four different ways RAG can go wrong:

  • Naive retrieval – failing to find the most relevant sources available to address the question, as in the airline question above (see the sketch after this list).
  • Inapplicable authority – citing a document that is not legally applicable to the question, for example because it’s for the wrong jurisdiction, statute, or court, or has been overruled.
  • Sycophancy – a tendency to agree with the person asking the question even when they are mistaken, for example asking about a law or case that doesn’t exist.
  • Reasoning errors – elementary errors of reasoning and fact such as claiming a case supported a conclusion that it didn’t.
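
To make the naive retrieval failure concrete, here is a toy sketch with invented documents and a deliberately crude word-overlap retriever. The query about the plane crash shares surface wording with motions to dismiss in bankruptcy and patent cases but never uses the word ‘negligence’, so the one genuinely relevant source ranks last. Real tools use semantic search rather than word counting, but the same kind of mismatch between a query’s wording and the legally relevant concept can still occur.

```python
# Toy illustration of "naive retrieval": ranking by surface word overlap
# favours documents that share phrasing with the query over the document
# that addresses the actual legal issue.
query = "airline motion to dismiss a wrongful death suit after a plane crash"

knowledge_base = {
    "bankruptcy-case": "motion to dismiss a chapter 11 bankruptcy petition",
    "patent-case": "motion to dismiss a patent infringement claim",
    "tort-case": "negligence standard for carrier liability in fatal accidents",
}

def overlap_score(question: str, text: str) -> int:
    """Count how many words the query and the document have in common."""
    return len(set(question.lower().split()) & set(text.lower().split()))

ranked = sorted(knowledge_base.items(),
                key=lambda item: overlap_score(query, item[1]),
                reverse=True)
for doc_id, text in ranked:
    print(doc_id, overlap_score(query, text))
# Output order: bankruptcy-case (4), patent-case (4), tort-case (0).
# The negligence source never surfaces because the query never says "negligence".
```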

Useful but far from perfect

As we’ve said before, EDT is evaluating a variety of applications for LLMs, and some are very promising. But it’s important to stay aware of the potential pitfalls as well as the promise and to avoid overhyping.

I have no doubt the RAG approach will have time-saving applications in the legal industry, but I’m sceptical that it can eliminate mistakes entirely – especially naive retrieval and inapplicable authority errors. Lawyers have to spend a long time studying to be lawyers because the law is complex and legal reasoning is hard. Even if LLMs sometimes exhibit behaviour that looks or works like reasoning, finding the right answers to tricky legal questions is something only humans can do … for now.