In the battle of ideas between AI advocates and skeptics, there seems to be one major sticking point that simply won’t go away, always in the back pocket of the latter group, a silver bullet ready to pull out at any moment. That point is, in a word, hallucinations.
It’s a hard point for pro-AI thinkers to counter, as it’s not something that can be ironed out of modern AI tools. To use a simplistic metaphor, hallucinations are baked into LLMs, and the detractors of artificial intelligence can always point to the fact that – in a completely unpredictable way – AI can go rogue and invent something unreal and quite possibly dangerous. This is simply a hazard of using the technology.
But how big a problem are hallucinations, really? Whenever something goes wrong with an AI tool, it can simply be chalked up to a ‘hallucination’, but that isn’t always the case. Salesforce often says that AI is only as good as the data it’s built on. This is, of course, a plug for its own Data 360 (formerly Data Cloud) platform, but it’s also a fair comment.
So, when AI – including Salesforce’s own suite, Agentforce – goes wrong, is it due to the inevitability of hallucinations? Or, is it far more likely to be your own bad data?
“To My Knowledge, We Never Once Had a Hallucination”
At Dreamforce ‘25, we spoke to several high-ups at Salesforce about how big a problem hallucinations are with Agentforce.
Brad Arkin, Chief Trust Officer at Salesforce, said that the impact of hallucinations depends on the use case, saying that they are more of a usability concern than a security one in many contexts – especially with strict permission settings.
When asked if hallucinations were a big security concern, Brad said: “Use cases are going to determine how much a hallucination would be a problem. If it’s a hotel concierge scenario, you might show up to dinner, and there’s no reservation for you; that’s annoying. If it’s a healthcare scenario, there are other scenarios where it’s a much bigger problem.”
He added that usability hallucinations are a focus for Salesforce, but from a security perspective, there are other things to worry about – for instance, if the agent has a lot of permissions, can the user trick it into doing something it shouldn’t?
Brad said: “In most of the use cases that I have, because we use a lot of Agentforce within my security team, we just don’t have any trouble at all for hallucinations in the real world with the way we’re using it.”
These use cases include things like summarizing tickets and creating timelines, with Agentforce performing these tasks “thousands of times a week” within Brad’s security team.
“To my knowledge, we never once had a single hallucination; it just doesn’t happen,” Brad detailed. “So those real-world things for me are not a problem. But then there are other concerns around managing the permission sets and optimizing that.”
Chief Legal Officer at Salesforce, Sabastian Niles, stressed the point – which was a theme among people we asked – that it’s important to ground AI in trusted data.
“We’ve been dealing with this long before AI,” Sabastian explained. “You ask the person a question, they give you an answer that is inaccurate. Why? Did they just make it up? Did they just misinterpret some of the relevant things? Did they just have incomplete data?”
Sabastian added that one approach to dealing with the issue of hallucinations is to work with customers to make sure everything is grounded in trusted, accurate, relevant data – which he said is “very, very important and addresses many issues”.
“We Thought the Salesforce Help Agent Was Hallucinating”
Shibani Ahuja, SVP of Enterprise IT Strategy at Salesforce, also spoke to us at Dreamforce, outlining just how much infrastructure agents need, to the point where onboarding one is almost comparable to onboarding a human employee.
AI agents, much like human employees, will need to be set up with specific roles, channels, job descriptions, access controls, and even KPIs, with regular performance monitoring – not too dissimilar to quarterly check-ins and annual reviews.
She made a similar point to one made by Sabastian – who we spoke to in a separate meeting – that what might initially appear to be a hallucination might in fact be the inevitable result of conflicting or chaotic data. This is something that happened when Salesforce initially rolled out Agentforce on help.salesforce.com.
Shibani said: “We’re so early on the agentic journey. When we went live with help.salesforce.com, we actually thought that the agent was hallucinating because the responses were coming back inconsistently. We quickly shut it down, but then analyzed what was going wrong. It wasn’t the agent; it was the underlying data. It was the knowledge articles that were conflicting.
“This problem has persisted for as long as we’ve had our website, but we didn’t identify it until that agent was put on, and that surfaced something. So then we turned that agent inwards to identify the anomalies, which is a brilliant use case. Now you’ve got this guardian agent that is overseeing and helping you clean your data, or do QA, QC. That’s like another value or another learning of failing and just trying it.”
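To make that “guardian agent” idea a little more concrete, here is a minimal sketch of what a data-QA pass over knowledge articles might look like. To be clear, this is not Salesforce’s implementation and uses no Salesforce APIs; the articles are just topic-and-text pairs, and the overlap heuristic is deliberately crude. It simply illustrates the pattern Shibani describes: point the tooling at your own data and let it surface the conflicts.

```python
# Illustrative sketch only: flag knowledge articles on the same topic that
# overlap enough to answer the same question, yet differ enough that an
# agent drawing on them could give inconsistent answers.
from difflib import SequenceMatcher
from itertools import combinations

articles = [
    ("password reset", "Reset your password from Settings > Security."),
    ("password reset", "To reset your password, contact your Salesforce admin."),
    ("billing", "Invoices are emailed on the 1st of each month."),
]

def similarity(a: str, b: str) -> float:
    """Rough textual similarity between two article bodies (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_conflicts(items, low=0.2, high=0.9):
    """Flag same-topic pairs that are similar, but not near-duplicates."""
    conflicts = []
    for (topic_a, text_a), (topic_b, text_b) in combinations(items, 2):
        if topic_a == topic_b and low < similarity(text_a, text_b) < high:
            conflicts.append((topic_a, text_a, text_b))
    return conflicts

for topic, first, second in find_conflicts(articles):
    print(f"[{topic}] possible conflict:\n  - {first}\n  - {second}")
```

In practice, you would likely score candidate conflicts with embeddings or a model acting as a judge rather than raw string similarity, but the principle is the same: the agent surfaces data problems rather than creating them.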
“Enterprises Pay for Consistency”
At some point, every generative AI model will hallucinate. But not all hallucinations are created equal.
If you ask ChatGPT an open question that it lacks the contextual data to answer properly, you are far more likely to get an incorrect response or hallucination, due to the broad scope of what is being considered.
The key difference for Salesforce is scope, says Paul Battisson, Salesforce MVP and Founder & CEO of Groundwork Apps. He says that if, as with Agentforce, you provide well-structured, clean, contextual data to ground the model, the scope for hallucinations decreases and the answers should be more accurate.
“What Salesforce are saying is true – having clean and well-structured trusted data will help your agents’ answers be better,” Paul told Salesforce Ben.
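What “grounding” looks like in practice is simple enough to sketch. The snippet below is illustrative only, not Agentforce or any particular vendor’s API; `retrieve_articles` and `call_llm` are hypothetical stand-ins. The shape of the idea is what matters: the model only sees curated, trusted context, and is told to admit ignorance when that context doesn’t contain the answer.

```python
# A minimal sketch of grounding an LLM in trusted data.
# `retrieve_articles` and `call_llm` are hypothetical placeholders.

def retrieve_articles(question: str) -> list[str]:
    """Hypothetical lookup against a curated, de-duplicated knowledge base."""
    return ["Refunds are processed within 5 business days of approval."]

def call_llm(prompt: str) -> str:
    """Placeholder: swap in whatever model endpoint you actually use."""
    raise NotImplementedError

def build_grounded_prompt(question: str, context: list[str]) -> str:
    """Constrain the model to the supplied context and allow it to decline."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using ONLY the context below. If the context does not "
        "contain the answer, say that you don't know.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}"
    )

def answer(question: str) -> str:
    prompt = build_grounded_prompt(question, retrieve_articles(question))
    return call_llm(prompt)
```

The narrower and cleaner the context, the less room the model has to improvise, which is exactly Paul’s point about scope.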
Earlier this year, we reported on the news that Marc Benioff claimed 93% agent accuracy. As our Technical Content Director, Peter Chittum, and Journalist, Thomas Morgan, wrote at the time: “Salesforce is used by hundreds of thousands of companies across different industries that deal with very serious use cases, including financial institutions, drug manufacturers, and government agencies.
“Imagine a 93% accurate agent telling you your current credit limit, or advising a pharmacist on drug interactions, or identifying your immigration status. For situations like these, there’s no real room for hallucinations, as they could cause serious consequences for a company’s brand and, more importantly, their customer or patient.”
Paul says that the key question here is how much of that other 7% could be reduced by better data and tuning, and what level of accuracy is good enough for you.
He added: “I heard a quote recently around AI-developed products, where, when paraphrased, it is ‘enterprises pay for consistency’; that’s why companies still expect Salesforce to provide 99.9% uptime.
“Is getting your reservation right 95% of the time good enough for you? What about 99% of the time? If it is for a financial services company or bank, what is the impact of 1% of your queries being answered incorrectly? If you have 99.9% accuracy and get 1000 customer queries per day, that is still one wrong every day.
“What Brad says resonates most for me – if we have a hallucination, what is the impact of it? A ticket summary being slightly off is ‘okay’ because a human will still be reading the ticket. But if the agent is truly autonomous without any oversight, you need to ensure you have accounted for hallucinations and are aware of their implications.”
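The arithmetic behind Paul’s point is worth making explicit. A quick back-of-the-envelope calculation, using the 1,000-queries-a-day figure from his example:

```python
# Expected wrong answers per day = daily queries x (1 - accuracy).
daily_queries = 1000

for accuracy in (0.93, 0.95, 0.99, 0.999):
    wrong_per_day = daily_queries * (1 - accuracy)
    print(f"{accuracy:.1%} accurate -> ~{wrong_per_day:.0f} wrong answers per day")

# 93.0% accurate -> ~70 wrong answers per day
# 95.0% accurate -> ~50 wrong answers per day
# 99.0% accurate -> ~10 wrong answers per day
# 99.9% accurate -> ~1 wrong answers per day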
Final Thoughts: A Familiar Comparison
Perhaps the real question is not whether these inevitable hallucinations are a problem, but what the acceptable level of risk is. As mentioned above, a dinner reservation going wrong is frustrating and might ruin your evening plans, but a stray 0 when there should be a 1 (so to speak) in crucial financial transactions could be utterly catastrophic.
I mentioned in the introduction to this article that AI skeptics always have the ‘H-word’ in their back pocket, with hallucinations still very much “baked into” generative AI models by design. But, as Sabastian alluded to in his comments, humans make errors too, and this is often the pro-AI lobby’s response to the hallucination issue.
Yes, they might argue, we should minimize AI hallucinations wherever and however possible, but we’re not comparing an AI’s (let’s say) 93% accuracy to a hypothetical 100% accuracy model. We’re comparing it to the accuracy level of a human being – who is also capable of messing up your dinner reservation, and of making some catastrophic error in a crucial financial transaction.
Errors are baked into generative AI, it’s true. But big companies like Salesforce wouldn’t have a legal department if human error wasn’t baked into our programming, too.
So, yes, addressing your bad data will minimize AI hallucinations – but the problem persists. The question we, as a species, are reckoning with right now is: ‘How faultless does AI have to be, when compared to a human?’
Answers will vary from industry to industry, and, in my opinion, we will find out the implications of this AI transformation sooner rather than later.