Correcting Hallucinations in Large Language Models
In this blog post, we share the results of our initial experiments aimed at correcting hallucinations generated by Large Language Models (LLMs). Our focus is on the open-book setting, which encompasses tasks such as summarization and Retrieval-Augmented Generation (RAG).
Overview
In the context of LLMs, hallucination refers to the phenomenon where the model makes up information when responding to a user’s prompt or question. While the exact causes of hallucinations remain unclear and are the subject of ongoing research, these occurrences can have significant real-world consequences, especially in enterprise applications and even more so with Agentic RAG.
As we mention in our hallucination detection blog post, one of the most effective methods for reducing hallucinations is grounding LLM responses in a set of provided documents (also called references): in other words, augmenting generation with retrieval, a.k.a. RAG. However, even RAG is not a catch-all solution to the hallucination problem. As Vectara’s Hallucination Leaderboard shows, modern LLMs hallucinate anywhere from 1% to nearly 30% of the time even when generating outputs based on reference sources, a setting known as open-book generation.
Vectara has historically provided the Hughes Hallucination Evaluation Model (HHEM) as a detector of hallucinations for the open book generation use case. Today, we are excited to share our early findings on the next frontier: hallucination correction.
The Hallucination Correction Model
We have been exploring a specific technique for hallucination correction called “post-editing”, in which the model corrects hallucinations after a summary has been generated.
Here’s how it works. The Hallucination Correction Model (aka HCM) receives the reference documents and the generated response to the user query, and generates a “corrected” response, as shown in Figure 1 below:
Note that we do not use the query as input to HCM. Furthermore, HCM does not generate a whole new response; instead, it attempts to fix only those parts of the original response that are hallucinated, leaving the other parts and the overall response structure intact.
Because our model functions as a post-editing tool, it can be added on as a post-processing stage to any existing RAG (or other open-book generation) pipeline without requiring modifications to other components.
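To make the integration point concrete, here is a minimal sketch of what such a post-processing stage could look like. The retrieve, generate, and correct callables are hypothetical stand-ins for your retriever, generator LLM, and a post-editing correction model like HCM; this is an illustration, not a Vectara API.

```python
from typing import Callable, List

def answer_with_correction(
    query: str,
    retrieve: Callable[[str], List[str]],       # fetches reference passages for the query
    generate: Callable[[str, List[str]], str],  # drafts a response grounded in those passages
    correct: Callable[[List[str], str], str],   # post-edits the draft (e.g., a model like HCM)
) -> str:
    """A RAG pipeline with hallucination correction bolted on as a final post-processing stage."""
    references = retrieve(query)
    draft = generate(query, references)
    # The corrector sees only the references and the draft (not the query) and is
    # expected to return a minimally edited response grounded in the references.
    return correct(references, draft)
```

Because the correction stage consumes only the references and the draft response, none of the upstream retrieval or generation components need to change.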
Hallucination Correction Model Performance
We report the performance of HCM on a few different datasets and quantify how well it’s able to correct hallucinations from several different open-weight and commercially available LLMs.
We also report a brief error analysis and some interesting failure modes of the model.
HHEM Leaderboard Benchmark
Vectara maintains a hallucination evaluation leaderboard, which serves as our first data source for evaluating HCM. We select a diverse range of models from the leaderboard and run their generated summaries through HCM to correct any hallucinations, as described above. To assess the effectiveness of our corrections, we use HHEM (specifically HHEM-2.1) to measure the Factuality Rate (FR, defined as the complement of the hallucination rate, i.e., factuality_rate = 1.0 – hallucination_rate) both before and after the corrections. The results of this evaluation are presented in Figure 2.
As shown in the plot, the summaries corrected by HCM demonstrate a significant improvement in FR. The enhancements are especially notable for models that initially had low FR. Even for models with higher original FR, HCM still provides a boost in scores.
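The scoring step itself is straightforward to reproduce. Below is a minimal sketch of computing a factuality rate with the open HHEM-2.1 checkpoint on Hugging Face; it assumes the checkpoint’s predict() helper and the 0.5 decision threshold described on its model card, and the (source, summary) pairs are toy placeholders rather than leaderboard data.

```python
# Sketch: factuality rate with HHEM-2.1-Open (vectara/hallucination_evaluation_model).
from transformers import AutoModelForSequenceClassification

# Toy (source, summary) pairs; in practice these would be reference documents
# paired with original or HCM-corrected summaries.
pairs = [
    ("The capital of France is Paris.", "Paris is the capital of France."),  # consistent
    ("The capital of France is Paris.", "Lyon is the capital of France."),   # hallucinated
]

hhem = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)
scores = hhem.predict(pairs)  # one consistency score in [0, 1] per pair; higher = more factual

factual = [score.item() > 0.5 for score in scores]  # 0.5 threshold, per the model card
factuality_rate = sum(factual) / len(factual)       # = 1.0 - hallucination_rate
print(f"Factuality Rate: {factuality_rate:.0%}")
```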
To measure whether our model introduces unnecessary alterations to the original response during the correction process, we calculate both the ROUGE score and BERTScore between the original and corrected responses. These metrics measure lexical and semantic similarity, respectively, and range from 0 to 1, where a score of 1 indicates a perfect match (HCM made no changes) and a score of 0 indicates a complete mismatch.
For our purposes, we aim for high scores, since we want HCM to minimize changes to the original summaries; our working assumption is that hallucinations make up only a small portion of the original summary. Figure 3 illustrates the ROUGE and BERTScore on the HHEM Leaderboard for different models.
We see that ROUGE and BERTScore are both high for most models, which supports the observation that HCM leaves the majority of each summary intact.
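Both similarity metrics are available in off-the-shelf packages. The sketch below compares a toy original/corrected pair using the rouge_score and bert_score libraries; it is illustrative only and not the exact configuration behind Figure 3.

```python
# Sketch: lexical (ROUGE-L) and semantic (BERTScore) similarity between
# an original summary and its post-edited version.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

original = "The company was founded in 1999 and is headquartered in Berlin."
corrected = "The company was founded in 2001 and is headquartered in Berlin."

# ROUGE-L F-measure: longest-common-subsequence overlap between the two texts.
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = rouge.score(original, corrected)["rougeL"].fmeasure

# BERTScore F1: token-level similarity computed from contextual embeddings.
_, _, f1 = bert_score([corrected], [original], lang="en")

print(f"ROUGE-L: {rouge_l:.3f}  BERTScore F1: {f1.item():.3f}")
```

A pair like this, differing by a single token, scores close to 1 on both metrics; that is the regime we want HCM corrections to stay in.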
FAVABENCH and NonFactS Datasets Benchmark
We also evaluate our model’s performance on two other publicly available datasets: FAVABENCH and NonFactS. One point to note is that the hallucinations in these datasets are not natural hallucinations produced by LLMs; instead, the datasets are constructed by artificially injecting hallucinations. For example, FAVABENCH prompts ChatGPT to insert errors drawn from a taxonomy of error types, while NonFactS uses a fine-tuned BART-base model to inject hallucinations by passing it incomplete information during inference. These hallucinations therefore do not fully match the distribution of real-world hallucinations.
We asked HCM to correct the hallucinated summaries in these datasets, and the FR of the corrected answers is plotted in Figure 4.
As we can see, HCM shows strong capability in correcting nonfactual information.
Figure 5 plots the ROUGE and BERTScore between original and corrected summaries for the FAVABENCH and NonFactS datasets. As previously noted, the hallucinations in these datasets are more “artificial” and therefore easier to detect, which allows HCM to make more substantial changes to the original responses. This results in the lower ROUGE and BERTScore values seen in Figure 5.
RAGTruth Dataset
We also evaluate HCM’s performance on RAGTruth, a publicly available, human-annotated hallucination dataset. RAGTruth is arguably the hardest dataset on which to detect and correct hallucinations.
Unlike datasets such as FAVABENCH and NonFactS, which inject hallucinations synthetically, RAGTruth contains hallucinations that were naturally generated by LLMs; they tend to be subtle and more representative of the hallucinations LLMs produce in the real world. Hallucination correction techniques that improve performance on this benchmark should therefore indicate generalizable FR gains.
For this analysis, we used the entire test split of the RAGTruth summarization task. The performance comparison with and without HCM is plotted in Figure 6, and the ROUGE and BERTScore for the individual models in the dataset are plotted in Figure 7.
It is clear from Figures 6 and 7 that there is a notable improvement in the HHEM factuality rate of post-edited responses originally generated by Mistral-7B-Instruct and Llama-2-Chat (7B, 13B, and 70B).
GPT models, which typically exhibit a high factuality rate by default, continued to demonstrate strong performance with the application of HCM, indicating that the model is able to let factual summaries pass through unchanged.
Analysis
In this section we show some examples where HCM does not succeed in correcting hallucinations, and try to analyze the performance in different scenarios.
One particular scenario where our model fails to achieve a high FR (see Figure 2) is when correcting answers generated by the Falcon-7B-Instruct model on the HHEM Leaderboard. Manual analysis reveals that Falcon-7B-Instruct often deviates from the instructions it is given and draws on its own parametric knowledge to add information while generating answers. This extra information is not necessarily hallucinated per se; we usually find it to be grounded in the real world, but it is not directly inferred from the provided documents.
From the perspective of retrieval augmented generation, where you want your answers to only depend on the supporting documents, this is a form of hallucination. Table 1 below shows a few examples of such cases.
Table 1: Analysis of Falcon-7B-Instruct on HHEM Leaderboard.
As is clearly evident, Falcon-7B-Instruct often generates information that is not directly supported by the provided documents. These examples are problematic because HCM aims to make minimal, targeted edits to the non-factual pieces of information that lack direct support in the provided documents, while maintaining the overall structure of the original answer. As a result, the model is not accustomed to making larger changes to summaries, such as deleting or adding large blocks of text, and thus fails to correct the outputs generated by the Falcon-7B-Instruct model.
Conclusion
In this blog, we presented our Hallucination Correction Model (HCM) and evaluated its performance on several public benchmarks as well as our HHEM leaderboard.
Our analysis revealed a significant improvement in the factuality rate of the generated summaries across all datasets and leading LLMs. Additionally, we examined some edge cases where our model encounters challenges and identified several shortcomings that we intend to address in future iterations.
Reducing hallucinations in LLMs, especially in enterprise RAG pipelines, remains an important area of research, and we consider HCM a substantial step towards that goal.
As we work to further improve our offering in this field, we are excited to share these initial results with the community.
Appendix
We randomly selected ~100 examples with original and corrected summaries, along with their references, from our evaluation datasets and have made them available on Hugging Face.