
Why Most Enterprise AI Lacks Value and How Salesforce Leaders Can Fix It in 2026

By Connor O’Brien

Enterprises are finding themselves months into high-profile AI initiatives with little to show for them. While studies on AI adoption by major firms are frequently cited, the reality on the ground is that unreliable output often leads to rework: initiatives that report time savings tend to underreport the “verification” work required to confirm accuracy.

Despite the bad press, the challenges with implementing AI and agents within the enterprise can be mitigated with strong governance. This article examines how Salesforce leaders can overcome the paradox of adoption and return on investment (ROI) by employing three primary methods: evaluations, guardrails, and measurement.

Note: This article shares lessons from the founding of Colby AI, an Agentic Salesforce Co-Pilot for Asset Managers. As the founder of Colby, the author is passionate about the company; however, this article is intended as an educational resource on governance and should not be viewed as a promotion of Colby. There is no affiliation between Colby AI and Salesforce Ben.

Are We in an AI Bubble?

Many industry titans have suggested that we are in an AI bubble. Still, bubbles do not mean the underlying technology isn’t capable, and “popping the bubble” won’t reduce funding for clearly high-ROI projects. 

Salesforce reports that some clients are seeing over 200% ROI from customer service automation with Agentforce, showing that routine tasks can be automated with massive payoffs. The best firms take a blended governance approach, treating AI agents as both software systems and human employees to ensure effectiveness.

Evals: From Pilots to Production

AI isn’t magic (although sometimes it feels that way). To scale, you must evaluate it deliberately – as you would any other software – especially when embedding it into business systems like Salesforce. This is where “evals” come in.

What Is an Eval?

An eval (evaluation) is a systematic test that compares AI agent outputs against known-good examples or human decisions. Think of it as unit testing for AI behavior – instead of testing whether code compiles, you’re testing whether the agent makes the right decision, executes the correct action, or provides an acceptable response.

For example, if a case comes in requesting a pipeline forecast report, the input is the request and the expected output is that specific report. If the agent returns exactly that report, the result is a “pass.”
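The pass/fail comparison above can be sketched as a minimal eval harness. This is an illustrative stand-in, not a real Agentforce API: `run_agent` and the test cases are hypothetical placeholders for your deployed agent and your known-good examples.

```python
# Minimal eval harness sketch: compare an agent's output against a
# known-good expected output for each test case.

def run_agent(request: str) -> str:
    """Placeholder agent: in practice this would call your deployed agent."""
    canned = {"Pull the pipeline forecast report": "Pipeline Forecast Q3"}
    return canned.get(request, "UNKNOWN")

def run_evals(cases):
    """Return (passes, total) for a list of (input, expected) pairs."""
    passes = 0
    for request, expected in cases:
        if run_agent(request) == expected:  # exact match counts as a "pass"
            passes += 1
    return passes, len(cases)

cases = [
    ("Pull the pipeline forecast report", "Pipeline Forecast Q3"),
    ("Pull the churn report", "Churn Report"),
]
passes, total = run_evals(cases)
print(f"{passes}/{total} evals passed")  # 1/2 here: the churn case fails
```

In practice, exact-match scoring works for deterministic outputs like report names; fuzzier outputs (free-text answers) usually need a similarity score or a human-graded rubric instead.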

Evaluation must be continuous, not a one-time check. Rather than just checking if it works once, you need to track specific metrics over time:

  • Task completion accuracy: Did the agent complete the task correctly end-to-end?
  • Rate of agent override: How often do humans need to step in to correct the work?
  • Escalation frequency: Is the agent appropriately escalating edge cases it cannot handle?

Unlike traditional software, AI agents operate probabilistically and won’t always produce identical outputs for identical inputs. These metrics help you understand not only if your agent works, but also how reliably it works.
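The three metrics above can be computed from a log of agent task records. The record fields (`completed`, `overridden`, `escalated`) are hypothetical names chosen for illustration; adapt them to whatever your logging layer captures.

```python
# Sketch of continuous metric tracking over a log of agent task records.

def compute_metrics(records):
    """Compute the three tracking metrics as fractions of all tasks."""
    total = len(records)
    return {
        "task_completion_accuracy": sum(r["completed"] for r in records) / total,
        "override_rate": sum(r["overridden"] for r in records) / total,
        "escalation_frequency": sum(r["escalated"] for r in records) / total,
    }

log = [
    {"completed": True,  "overridden": False, "escalated": False},
    {"completed": False, "overridden": True,  "escalated": False},
    {"completed": True,  "overridden": False, "escalated": True},
    {"completed": True,  "overridden": False, "escalated": False},
]
print(compute_metrics(log))
# {'task_completion_accuracy': 0.75, 'override_rate': 0.25, 'escalation_frequency': 0.25}
```

Computing these as rolling windows (e.g. per week) rather than all-time totals makes regressions visible after a prompt or model change.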

Implementing Proper Guardrails With Trust

Successful AI is built on trust. Guardrails define what the AI is allowed to do and under what constraints, analogous to permissions for a human employee.

Why Guardrails Matter

Without guardrails, AI agents can make decisions that are technically possible but organizationally unacceptable. Imagine giving a new employee full admin access on day one, with no training, supervision, or boundaries. The potential for errors, security breaches, or misaligned actions would be unacceptable. AI agents require similar constraints, such as:

  • Permission boundaries to prevent unauthorized data access.
  • Action limits to avoid bulk operations that could corrupt data.
  • Validation rules to ensure outputs meet quality standards.
  • Escalation triggers to route complex decisions to humans.

Fortunately, many of these guardrails are already built into Salesforce. Maintaining robust validation rules and formula fields ensures that agents and users continue to operate by the same rules.

For operations where consistency is uncertain or risk is high, implement human-in-the-loop patterns. This ensures a person is involved in the decision-making or approval process before the agent executes a task. Common patterns include:

  • Approval workflows: The agent proposes an action, and a human approves it before execution.
  • Confidence thresholds: The agent auto-executes only when its confidence exceeds a defined threshold (e.g. 90%); otherwise, it escalates.
  • Audit trails: Every agent action is logged with reasoning for human review.
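A confidence-threshold gate with an approval fallback can be sketched in a few lines. This is a simplified illustration under assumed names: `dispatch` and the `approve` callback are hypothetical, standing in for your agent runtime and your human-approval step.

```python
# Sketch of a human-in-the-loop gate: auto-execute high-confidence
# actions, route everything else through human approval.

CONFIDENCE_THRESHOLD = 0.90

def dispatch(action: str, confidence: float, approve) -> str:
    """Execute, approve-then-execute, or escalate an agent action."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return f"executed: {action}"
    if approve(action):  # human-in-the-loop approval callback
        return f"executed after approval: {action}"
    return f"escalated: {action}"

# Usage: a confident update runs; a risky bulk delete is held back.
print(dispatch("update record", 0.97, approve=lambda a: False))
print(dispatch("bulk delete", 0.55, approve=lambda a: False))
```

In a real system, every branch here would also write an audit-trail entry with the agent's reasoning, covering the third pattern above.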

At Colby, our agents operate under constrained permissions that mirror the user’s, and no delete operation is allowed without human-in-the-loop approval. By disallowing reckless operations, we lower the chance of catastrophic errors.

Proving ROI at the Task Level With Measurement

Without measurement, AI remains speculative. McKinsey reports that fewer than one in five organizations track defined KPIs for their AI use cases, despite this behavior being highly predictive of performance. To break the paradox, each AI agent must be treated like a mini-investment with measurable returns.

Rather than focusing solely on model metrics such as accuracy or throughput, measure the time or cost saved per task. At Colby AI, we instrument this by tracking per-task tool calls:

Task                     | Human Time | Agent Time Saved
Pulling a report         | 3 minutes  | 2 minutes saved
Updating a record        | 1 minute   | 1 minute saved
Researching a contact    | 10 minutes | 9 minutes saved
Triaging a case queue    | 15 minutes | 14 minutes saved
Creating a dashboard     | 10 minutes | 5 minutes saved

These micro-savings aggregate into macro-level return on investment (ROI). If your agent handles 50 tasks per day, that is hours of productivity gained daily.
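Aggregating those micro-savings is simple arithmetic. The per-task savings below come from the table above; the daily task counts are illustrative assumptions, not real Colby data.

```python
# Aggregate per-task time savings (from the table above) into a daily total.

SAVINGS_MINUTES = {
    "pull_report": 2,
    "update_record": 1,
    "research_contact": 9,
    "triage_case_queue": 14,
    "create_dashboard": 5,
}

def daily_minutes_saved(task_counts):
    """Sum minutes saved across a day's task counts."""
    return sum(SAVINGS_MINUTES[task] * n for task, n in task_counts.items())

# Hypothetical day of 50 agent-handled tasks.
day = {"pull_report": 20, "update_record": 15, "research_contact": 10,
       "triage_case_queue": 3, "create_dashboard": 2}
minutes = daily_minutes_saved(day)
print(f"{minutes} minutes ~ {minutes / 60:.1f} hours saved")  # 197 minutes ~ 3.3 hours
```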

Once you have the time data, you should convert it into dollars. The formula is straightforward: multiply the hours saved by the hourly cost of labor to get annual savings, subtract the cost of the agent infrastructure, then divide by that cost to get net ROI.

For example, if a team of five admins costs $75 an hour and the AI saves a collective ten hours a week, you are looking at $39,000 in annual savings. If the agent costs $15,000, your net ROI is 160%. Presenting AI in ‘dollars saved’ can make it more difficult for management to ignore.
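The worked example above can be reproduced directly. The function name and a 52-week year are my assumptions; the inputs ($75/hour, 10 collective hours/week, $15,000 agent cost) come from the article's example.

```python
# The article's ROI arithmetic: annual savings and net ROI as a fraction.

def annual_roi(hourly_rate, hours_saved_per_week, agent_cost, weeks=52):
    """Return (annual dollar savings, net ROI) for a given agent."""
    savings = hourly_rate * hours_saved_per_week * weeks
    net_roi = (savings - agent_cost) / agent_cost
    return savings, net_roi

savings, roi = annual_roi(hourly_rate=75, hours_saved_per_week=10,
                          agent_cost=15_000)
print(f"${savings:,.0f} saved, net ROI {roi:.0%}")  # $39,000 saved, net ROI 160%
```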

Final Thoughts

Many organizations are using AI, but few are extracting value. The missing link is management, not models. The small percentage of companies that succeed don’t just deploy AI; they leverage it effectively by evaluating it in real workflows, governing it with strict guardrails, and measuring it like a financial investment.

Salesforce professionals are especially well-positioned to lead this transition because you already understand structured workflows and data integrity. Start small: pick one repetitive workflow, map it, evaluate the agent’s performance, add guardrails, and measure the savings. The enterprises that win won’t have the best models; they’ll have the best governance.

The Author

Connor O'Brien

Connor is the Founder of Colby, where he helps asset managers and distribution teams increase capital formation using AI.
