How to Find Duplicates in Salesforce by Using Machine Learning

As a Salesforce admin, you have probably wondered what the best way to find duplicates in Salesforce is. When we think of machine learning, we tend to think of robotic process automation, virtual assistants, and self-driving cars. Machine learning, however, can also simplify everyday tasks, such as identifying duplicates in your Salesforce org. Just like with autonomous vehicles, the algorithms that power deduping can be trained to produce the desired outcome.

In this guide, I will give you an overview of how machine learning algorithms are trained to dedupe not only Salesforce, but any unstructured data – plus, the advantages of this approach over existing rule-based methods.

How Does Machine Learning Match Two Records?

If we take a look at the two records shown below, it is pretty clear that these are duplicates:

| First Name | Last Name | Address            |
| Michael    | Bolton    | 123 Lockwood Drive |
| Mike       | bolton    | 123 Lockwood Dr    |

However, a machine doesn’t have the ability or background to make the same determination. In fact, it is actually much harder than it might seem. We might start by pointing out all of the similarities. Since there are obviously so many of them, we can conclude that these are duplicates. While this may be a good first step, we would then need to stipulate exactly what we mean by the word “similar.” Is there a range where something may be considered not similar at all or very similar? How would a machine go about identifying these similarities?

One of the ways researchers “teach” similarities to machines is through string metrics. A string metric takes two strings and returns a number that is low if the strings are similar and high if they are dissimilar. There are many string metrics, and one of the best known is the Hamming distance, which counts the number of substitutions required to turn one string into another of the same length. For example, for the Last Name in the example above, the Hamming distance is only 1, since you need to change just one letter to convert “Bolton” to “bolton.”
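To make the idea concrete, here is a minimal Python sketch of the Hamming distance. The function name is my own choice for illustration. Note that the metric is only defined for strings of equal length, so it can compare “Bolton” with “bolton” but not “Michael” with “Mike”.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is only defined for strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


# The Last Name values from the example records differ only in the first letter.
print(hamming_distance("Bolton", "bolton"))  # 1
```

For strings of different lengths, an edit distance that also allows insertions and deletions is the usual alternative.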

Another variation is learnable distance metrics, which take into account that different edit operations have varying significance in different domains. For example, substituting a digit makes a huge difference in a street address, since it effectively changes the entire address. A single-letter substitution, however, may not be that significant, because it is more likely to be caused by a typo or an abbreviation. Therefore, adapting string edit distance to a particular domain requires assigning different weights to different edit operations. We will drill down into these concepts at a later point in this article. For now, let’s take a look at how all of these metrics are used to dedupe Salesforce.
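The sketch below shows what a domain-weighted edit distance might look like in Python. It is only an illustration: the cost values (3.0 for any edit that touches a digit, 0.1 for a case-only change, 1.0 otherwise) are numbers I picked by hand, whereas a learnable metric would fit them from labelled examples. The function names are hypothetical as well.

```python
def indel_cost(ch: str) -> float:
    """Cost of inserting or deleting one character (hand-picked, illustrative)."""
    return 3.0 if ch.isdigit() else 1.0


def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing character a with character b (hand-picked, illustrative)."""
    if a == b:
        return 0.0
    if a.lower() == b.lower():
        return 0.1  # case-only difference, e.g. "B" vs "b"
    if a.isdigit() or b.isdigit():
        return 3.0  # changing a digit changes the address
    return 1.0      # ordinary letter substitution, likely a typo


def weighted_edit_distance(s: str, t: str) -> float:
    """Edit distance computed row by row with per-operation costs."""
    # prev[j] holds the cost of turning the processed prefix of s into t[:j].
    prev = [0.0]
    for ch in t:
        prev.append(prev[-1] + indel_cost(ch))
    for cs in s:
        curr = [prev[0] + indel_cost(cs)]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + indel_cost(cs),                  # delete cs
                curr[j - 1] + indel_cost(ct),              # insert ct
                prev[j - 1] + substitution_cost(cs, ct),   # substitute or match
            ))
        prev = curr
    return prev[-1]


print(weighted_edit_distance("Bolton", "bolton"))                    # 0.1
print(weighted_edit_distance("Bolton", "Bolten"))                    # 1.0
print(weighted_edit_distance("123 Lockwood Dr", "128 Lockwood Dr"))  # 3.0
```

With these costs, a one-digit change in the address counts three times as much as a one-letter typo in the name, which is the behaviour the paragraph above describes.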

Deduping Salesforce With Machine Learning Algorithms

There are a couple of ways to look at a Salesforce record. Let’s start by treating it as a single block of text (as shown below):

| Record 1                          | Record 2                    |
| Michael Bolton 123 Lockwood Drive | Mike bolton 123 Lockwood Dr |

Another option is to compare each field individually:

| Field      | Record 1           | Record 2        |
| First Name | Michael            | Mike            |
| Last Name  | Bolton             | bolton          |
| Address    | 123 Lockwood Drive | 123 Lockwood Dr |

With the “single block” approach, every field is treated equally, which is inconvenient if you want to place more emphasis on a specific field, such as Last Name. The “field by field” approach lets you do this by assigning a weight to each field, with the most important fields receiving the highest weights. Salesforce deduping tools that use machine learning will allow you to set the weights for each field and then create a model, so that the approach is codified and applied consistently.
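Here is a rough Python sketch of the “field by field” idea. The field keys, the weights, and the choice of `difflib.SequenceMatcher` from the standard library as the per-field similarity are all my assumptions for illustration, not what any particular deduping tool does.

```python
from difflib import SequenceMatcher

# Hypothetical weights: Last Name counts the most, First Name the least.
FIELD_WEIGHTS = {"first_name": 0.2, "last_name": 0.5, "address": 0.3}


def field_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between 0.0 (different) and 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict = FIELD_WEIGHTS) -> float:
    """Weighted average of the per-field similarities."""
    total_weight = sum(weights.values())
    weighted_sum = sum(
        weight * field_similarity(r1[field], r2[field])
        for field, weight in weights.items()
    )
    return weighted_sum / total_weight


record_1 = {"first_name": "Michael", "last_name": "Bolton", "address": "123 Lockwood Drive"}
record_2 = {"first_name": "Mike", "last_name": "bolton", "address": "123 Lockwood Dr"}

print(round(record_similarity(record_1, record_2), 3))
```

Because Last Name carries half of the total weight here, the exact (case-insensitive) match on “Bolton” pulls the overall score up even though “Michael” and “Mike” are only partly similar.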

What is the Advantage of Using Machine Learning to Dedupe Salesforce?

Every company’s dataset is unique and has its own challenges when it comes to deduplication. Whenever a human determines whether a set of records are duplicates (or not), the system will “learn” from these actions and tweak the machine learning algorithm to identify future duplicates without human interaction. This process, known as “active learning,” will continue to modify the weights assigned to each field, based on user interaction. Consequently, it will improve duplicate detection.
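As a small illustration of the active learning loop, the sketch below picks which candidate pair to show to a human next. It assumes a strategy called uncertainty sampling, where the pair the model is least sure about (probability closest to 0.5) is reviewed first. That strategy is my assumption, since tools may choose pairs differently, and the probabilities are made-up values rather than real model output.

```python
def most_uncertain(scored_pairs):
    """Return the (pair, probability) entry whose probability is closest to 0.5."""
    return min(scored_pairs, key=lambda item: abs(item[1] - 0.5))


# Hypothetical candidate pairs with made-up duplicate probabilities.
candidates = [
    (("Michael Bolton", "Mike bolton"), 0.93),
    (("Michael Bolton", "Michelle Bolt"), 0.52),
    (("Michael Bolton", "Jane Smith"), 0.04),
]

pair, probability = most_uncertain(candidates)
print(pair, probability)  # ('Michael Bolton', 'Michelle Bolt') 0.52
```

The human's answer for that pair becomes a new labelled example, the model is retrained, and the loop repeats.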

It is important to point out that setting an accurate weight for each field has its own challenges. For example, is the Last Name field twice as important as the First Name, or 1.5 times as important? It would be very difficult for any individual to determine this, since a person simply cannot process that much data. Computers using machine learning, on the other hand, can crunch very large amounts of data quickly and efficiently; the only limitation is the available computing power. A machine learning algorithm can estimate the weight of each field from your own dataset. One technique used for this is regularized logistic regression.
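To show how field weights can be learned rather than guessed, here is a minimal, dependency-free Python sketch of L2-regularized logistic regression trained by gradient descent. Each row holds per-field similarity scores for a pair of records (first name, last name, address), and the label says whether a human marked the pair as a duplicate. The training data is a toy set I made up for illustration, and the hyperparameters (`l2`, `learning_rate`, `epochs`) are arbitrary assumptions.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features) -> float:
    """Estimated probability that a pair of records is a duplicate."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train(X, y, l2=0.01, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent with an L2 penalty on the weights."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, label in zip(X, y):
            error = predict(weights, bias, features) - label
            for k, x in enumerate(features):
                grad_w[k] += error * x
            grad_b += error
        # The L2 term shrinks each weight toward zero; the bias is not penalized.
        weights = [
            w - learning_rate * (g / n_samples + l2 * w)
            for w, g in zip(weights, grad_w)
        ]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


# Made-up similarity scores: [first name, last name, address]
X = [
    [0.55, 1.00, 0.91],  # duplicate
    [0.90, 1.00, 0.95],  # duplicate
    [1.00, 0.95, 0.40],  # duplicate (address changed)
    [0.30, 0.20, 0.10],  # not a duplicate
    [1.00, 0.10, 0.15],  # not a duplicate (same first name only)
    [0.20, 0.15, 0.85],  # not a duplicate (same address only)
]
y = [1, 1, 1, 0, 0, 0]

weights, bias = train(X, y)
print("learned field weights:", [round(w, 2) for w in weights])
print("duplicate probability:", round(predict(weights, bias, [0.55, 1.00, 0.91]), 2))
```

The learned weights play the role of the per-field importance discussed above. In practice a library implementation would be the sensible choice, and the result depends entirely on how representative the labelled pairs are.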

Added Value of Deduping With Machine Learning

With rule-based tools, every time a duplicate record is identified, a Salesforce admin will need to create an additional rule to prevent it from recurring. Not only is this process highly time-consuming, but it’s also nearly impossible to account for every possible “fuzzy” duplicate. You can try to set all of the weightings for each field yourself, or use other metrics to catch the duplicates. In the end, it is very time-consuming and ineffective in catching all the issues. Machine learning does all of this for you – saving you time and hassle.

There are many other advantages of using machine learning. The algorithm is fully customizable and there is no need for a complicated setup. Remember, if you are using a tool that relies on complex rules, someone needs to set up the rules and then maintain them.

A machine learning tool removes most of this effort, allowing you to simply download the product and start using it right away. DataGroomr is one example, and it offers a free 14-day trial.
