Understanding how to harness large language models (LLMs) can be a complex task – despite the field's rapid growth, each LLM has its pros and cons, and most models are still in their infancy. This means it can be difficult for businesses to know how much trust to place in generative AI tools and which ones best fit their requirements.
To combat this, Salesforce has announced the world’s first-ever CRM AI benchmark, allowing businesses to evaluate LLMs and decide which ones work best for their own CRM needs. Let’s take a look at what this means in more detail…
What Is an LLM Benchmark?
Salesforce’s LLM benchmark acts as a recommendation tool to help businesses decide which LLM will be most effective for individual use cases. The benchmark considers these four factors:
- Accuracy: The benchmark looks at factuality, completeness, conciseness, and the ability to follow instructions to determine an LLM's value to organizations.
- Cost: This considers how cost-effective an LLM is, categorizing costs into high, medium, and low.
- Speed: How quickly and efficiently does the LLM deliver responses?
- Trust and Safety: Arguably the most important metric, this looks at whether an LLM can shield sensitive customer data, adhere to data privacy regulations, secure information, and avoid bias and toxicity in CRM use cases.

The benchmark then scores an LLM on how well each factor aligns with your business needs. For example, if a business needs an LLM for tasks such as generating sales emails, the benchmark helps discover which LLM does this specifically well while also considering how affordable and safe it is to use.
Similarly, if a company wants to use an LLM to respond to customer support emails, it could use Salesforce's benchmark to compare different LLMs and decide which is most suitable across the four metrics above.
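To make the idea concrete, here is a minimal sketch of how a business might weigh benchmark scores against its own priorities. All of the model names, scores, and weights below are illustrative assumptions, not real benchmark data or a Salesforce API – the four metric names simply mirror the factors listed above.

```python
# Hypothetical sketch: ranking LLMs by weighted benchmark scores.
# Scores and weights are made up for illustration only.

def rank_models(scores, weights):
    """Return model names ordered by weighted score, best first."""
    def weighted(model):
        return sum(scores[model][metric] * w for metric, w in weights.items())
    return sorted(scores, key=weighted, reverse=True)

# Illustrative 0-10 scores per factor for two unnamed models.
scores = {
    "model_a": {"accuracy": 8, "cost": 5, "speed": 7, "trust": 9},
    "model_b": {"accuracy": 7, "cost": 9, "speed": 8, "trust": 6},
}

# A customer-support use case might weight cost and trust heavily.
weights = {"accuracy": 0.3, "cost": 0.3, "speed": 0.1, "trust": 0.3}

print(rank_models(scores, weights))  # → ['model_b', 'model_a']
```

With these particular weights, the cheaper model edges out the more accurate one – change the weights and the ranking flips, which is exactly the kind of trade-off the benchmark is meant to surface.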
Another important aspect of the benchmark relates to Salesforce itself: the Einstein Platform lets you customize your LLM usage based on specific tasks. Instead of being stuck with a one-size-fits-all solution, you can mix and match LLMs to get the best results for each job. For example, you could use Claude to create a persuasive sales email and then switch to an OpenAI model to summarize a customer account. This flexibility means you can get top-notch performance across various tasks by playing to the key strengths of different LLMs.
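The mix-and-match idea above boils down to routing each task to the model that handles it best. The sketch below shows one way to express that mapping; the task names and model identifiers are hypothetical placeholders, not the Einstein Platform's actual API.

```python
# Hypothetical sketch of per-task model routing.
# Task names and model identifiers are illustrative assumptions.

TASK_MODEL_MAP = {
    "sales_email": "claude",          # persuasive drafting
    "account_summary": "openai-gpt",  # summarization
}

def pick_model(task, default="openai-gpt"):
    """Route a task to its configured model, falling back to a default."""
    return TASK_MODEL_MAP.get(task, default)

print(pick_model("sales_email"))      # → claude
print(pick_model("support_reply"))    # → openai-gpt (fallback)
```

Keeping the routing table in one place makes it easy to swap models per task as benchmark results evolve, without touching the code that calls them.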
Why Is This Important?
Artificial intelligence is all the talk in the technology world this year, but there is still a significant trust gap with users in terms of how far we can trust AI to deliver as expected and protect customer data. Providing a benchmark encourages discussion over the quality of LLMs and gives organizations clear insights into how to apply them effectively.
It’s also important to note that this process isn’t automated – it is spearheaded by experts who have researched and identified the criteria being used to evaluate these LLMs. Silvio Savarese, EVP of Salesforce’s AI research, believes this allows for the most comprehensive and trustworthy understanding of how to practically use these LLMs in business environments.
“We want to ensure the qualities of these generative processes are aligned with the CRM goals. The idea is that if a customer has certain needs about use cases, or costs to serve, or latency, they can look at our results, tabular data, and plots and graphs, and they can make an informed decision.” – Silvio Savarese, EVP & Chief Scientist, Salesforce AI Research
Savarese also states that this is just the beginning for Salesforce’s LLM benchmark:
“We’re committed to continuing this investigation. We want to expand with more metrics, use cases, data, and more annotations.”
Ultimately, Salesforce’s benchmark could become the Gartner Magic Quadrant for LLMs – this innovative step positions them as a go-to CRM for an LLM-agnostic experience.
Summary
Implementing an LLM benchmark is a remarkable first step in understanding how to utilize AI properly for business use cases. However powerful LLMs become, giving organizations the opportunity to fully understand what they're using and to choose the most effective tool will only encourage trust in their capabilities.
Make sure to leave your thoughts in the comments below!