Getting Started With Salesforce Einstein Vision


When asked what Einstein Vision is about, people tend to mix ‘learning with cat pictures’ with ‘AI’ and ‘cool apps to modify pictures’. While Einstein Vision is indeed fun and about ‘recognising images’, it goes much further than playing around with pictures.

Thanks to the Einstein Platform services, Einstein Vision combines ease of use with powerful technology to experiment and bring new use cases to life. Let’s have a look at what is needed to make it work – including some tips and a look at which Salesforce Clouds you can use Einstein Vision with.


What is Einstein Vision? Part of the Einstein Platform

Einstein is more than a Salesforce product. It is an inspiring name, a cool mascot, and, in practice, a family of AI products. Together, they cover a wide range of functionalities. Some are focused on the advanced Artificial Intelligence side of things. Others offer a solid dose of down-to-earth Augmented Intelligence to facilitate your work.

Prediction Builder, for example, helps to rapidly build data science insights through different AI models. Einstein Next Best Action tells you the next best thing to do to convert, for example, an opportunity. Many other small Einstein functionalities exist as well.

Einstein Vision shares this same DNA, blending ease of use with powerful technology.

Image Classification, Object Detection, and OCR

When asked what Einstein Vision is about, people tend to talk about cat pictures, being able to make the distinction between dog faces and raspberry muffins, or Trailhead challenges to save bears.

Einstein Vision is indeed about ‘recognising images’ – but in its broadest sense. The Einstein Vision APIs made available by Salesforce cover three subfamilies: Image Classification, Object Detection, and OCR.

[Picture Source: Mariya Yao]
OCR is the most recent member of the Vision family. It focuses on converting characters within small images to text. Typically, it is used to capture label information from the picture of a product tag. You can also scan business cards or images that contain tabular data. Generally available (GA) since this year, it is gaining momentum.

Image Classification is about categorizing images. Single-label classification tells you which single category an image belongs to (e.g. ‘dogs’ or ‘muffins’). Multi-label classification tells you one or more categories that the image belongs to (e.g. ‘dogs’ and ‘muffins’ and ‘Salesforce gathering’).
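
To make the difference concrete, here is a minimal Python sketch of how the two flavours could be read from a set of per-label scores. The scores, the 0.5 threshold and the function names are illustrative assumptions; they are not part of the Einstein Vision API.

```python
from typing import Dict, List


def single_label(scores: Dict[str, float]) -> str:
    """Single-label: keep only the label with the highest score."""
    return max(scores, key=scores.get)


def multi_label(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Multi-label: keep every label whose score reaches the threshold."""
    kept = [label for label, score in scores.items() if score >= threshold]
    return sorted(kept, key=scores.get, reverse=True)


# Made-up scores for one picture, for illustration only.
scores = {"dogs": 0.91, "muffins": 0.72, "Salesforce gathering": 0.10}

print(single_label(scores))  # dogs
print(multi_label(scores))   # ['dogs', 'muffins']
```

The threshold is a design choice: a lower value returns more labels per picture at the cost of more false positives.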

Einstein Object Detection focuses on extracting and contextualizing objects in images. It tells you that there are three muffins and one dog in the picture. It also provides the x/y coordinates of these objects.
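
As a rough illustration of what can be done with such a result, the sketch below counts detected objects per label. The shape of the detection records (a label plus x, y, width and height) is an assumption made for this example, not the exact response format of the service.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical detections for one picture; field names are assumptions.
detections: List[Dict] = [
    {"label": "muffin", "x": 12, "y": 40, "width": 80, "height": 75},
    {"label": "muffin", "x": 110, "y": 42, "width": 78, "height": 74},
    {"label": "dog", "x": 220, "y": 10, "width": 150, "height": 160},
    {"label": "muffin", "x": 30, "y": 140, "width": 82, "height": 76},
]


def count_objects(items: List[Dict]) -> Counter:
    """Count how many detected objects carry each label."""
    return Counter(item["label"] for item in items)


counts = count_objects(detections)
print(dict(counts))  # {'muffin': 3, 'dog': 1}
```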

How to Set up Einstein Vision (Image Classification)

Getting Einstein Vision up to speed requires you to go through different phases: preparing data, training a model, doing the predictions, and then creating value by using the insights.

The essence is summarised in the picture below (thank you Trailhead)!

Stage 0 – The Preparation Work

In practice, you need to go through a stage 0. Stage 0 is about defining what the focus should be. While gaining experience can be done with any use case, Stage 0 is really important to make sure that you can combine exciting Einstein Vision work with something that will really help the organisation you are working for.

Do you want to be able to distinguish beaches from mountains? Damaged car doors from new doors? A sports(wo)man’s good body position from a bad one? 2-door fridges from 3-door fridges? The only limit to the use cases is your imagination, really.

Let’s imagine that you want to distinguish damaged doors from new doors. You will create one folder with training data for ‘damaged doors’, and one for ‘new doors’. For each label, you need 200-1000 good-quality images.
To ease your work, consider one or more of the following tips:

    • Unsurprisingly, a wide variety of images makes the model more accurate.
    • Ideally, focus on diversity based on the following factors:
      • Color
      • Black and white
      • Blurred
      • With other objects the object might typically be seen with
      • If applicable, with text and without text
    • In a binary dataset, include images in the negative label that look similar to images in the positive label. For example, if your positive label is a tennis ball, include some balls from soccer, basketball, baseball and other sports in your negative label.
    • For a multi-label model, include images with objects that appear in different areas within the image.
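
Before uploading anything, it can help to check how many pictures each label folder actually contains. The sketch below assumes one sub-folder per label inside a dataset folder, as described above; the function names and the list of accepted file extensions are illustrative choices.

```python
from pathlib import Path
from typing import Dict, List

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed extensions
MIN_IMAGES, MAX_IMAGES = 200, 1000          # range quoted in the text above


def count_images_per_label(dataset_root: Path) -> Dict[str, int]:
    """Count image files in each label folder directly under dataset_root."""
    counts: Dict[str, int] = {}
    for folder in sorted(p for p in dataset_root.iterdir() if p.is_dir()):
        counts[folder.name] = sum(
            1
            for f in folder.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


def labels_outside_range(
    counts: Dict[str, int], low: int = MIN_IMAGES, high: int = MAX_IMAGES
) -> List[str]:
    """Return the labels whose image count falls outside [low, high]."""
    return [label for label, n in counts.items() if not low <= n <= high]
```

For example, `labels_outside_range(count_images_per_label(Path("doors")))` would list the label folders under a hypothetical `doors` directory that still need more (or fewer) pictures.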

Creating data sets is the most time-consuming step. Is there a way around it? Yes. If you are creative with coding, you can also write a query that searches the internet for appropriate pictures. While you will still need to check whether the captured pictures fit your needs, this small technical workaround helps to save time.

Stage 1: Training And Analysing The First Predictions

Once you have the necessary pictures, the real Einstein platform work starts.
First, you need to connect to the Vision platform. The different steps are:

    • Register online to get your private Einstein Vision key. The key file will be named einstein_platform.pem.
    • Upload it and get your token.
    • Upload the private key in the org and add the remote site settings via Setup.
    • Install Git (and cURL).

This is it! You can now create labelled folders and add the pictures to kick-start the learning.

Click on the Train button, grab a cup of tea (or a refreshing summer drink) and wait for the model to be ready. Time till completion varies. The status gets updated to SUCCEEDED when everything is ready.
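
If you prefer to script the waiting, the idea can be sketched as a simple polling loop. Here `get_status` is a hypothetical callable that wraps whatever request returns the training status; only SUCCEEDED comes from the description above, while the other status names and the timings are assumptions.

```python
import time
from typing import Callable

FINAL_STATUSES = ("SUCCEEDED", "FAILED")  # "FAILED" is an assumed status name


def wait_for_training(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
) -> str:
    """Poll get_status until a final status is returned, or give up."""
    for _ in range(max_polls):
        status = get_status()
        if status in FINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("training did not reach a final status in time")


# Simulated status sequence, standing in for real calls to the platform.
fake_statuses = iter(["QUEUED", "RUNNING", "SUCCEEDED"])
print(wait_for_training(lambda: next(fake_statuses), poll_seconds=0))  # SUCCEEDED
```

Passing the status lookup in as a function keeps the loop independent of how you talk to the platform (cURL wrapper, HTTP library, or a test stub as shown here).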

As a fully hosted managed service, the Einstein platform is designed to take all the deep learning hassle off your shoulders. This means that the training requires little more data science than what has been described above. Technically, cURL helps to speed up both the process and your personal learning.

Stage 2: Evaluating Quality With Iterations

Close attention is needed when evaluating the quality of your model. This is done by launching the first ‘predictions’. To do so, load your first pictures. For each picture, the model will indicate the % of fit to label A and the % of fit to label B. For aficionados with more data science skills, the % results can also be translated into raw results.
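
One simple way to summarise such results is to run pictures with a known label through the model and count how often the best-scoring label is the right one. The sketch below does this on made-up numbers; the data structure and function names are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# (true label, score per label) for each test picture
Result = Tuple[str, Dict[str, float]]


def top_label(scores: Dict[str, float]) -> str:
    """Return the label with the highest score."""
    return max(scores, key=scores.get)


def accuracy(results: List[Result]) -> float:
    """Share of pictures whose best-scoring label matches the true label."""
    correct = sum(1 for truth, scores in results if top_label(scores) == truth)
    return correct / len(results)


def misclassified(results: List[Result]) -> List[Tuple[str, str]]:
    """List (true label, predicted label) pairs for the wrong predictions."""
    return [
        (truth, top_label(scores))
        for truth, scores in results
        if top_label(scores) != truth
    ]


# Made-up results for four test pictures, for illustration only.
results: List[Result] = [
    ("damaged doors", {"damaged doors": 0.93, "new doors": 0.07}),
    ("damaged doors", {"damaged doors": 0.41, "new doors": 0.59}),
    ("new doors", {"damaged doors": 0.12, "new doors": 0.88}),
    ("new doors", {"damaged doors": 0.30, "new doors": 0.70}),
]

print(f"{accuracy(results):.0%}")  # 75%
print(misclassified(results))      # [('damaged doors', 'new doors')]
```

Looking at the misclassified pictures themselves is usually the quickest way to see which kind of image is missing from the training folders.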

To refine your metrics, useful tips can be found on Metamind’s page.

With the results, you will have to judge for yourself whether the model is strong enough. Often, people ask what can be done to be sure that the outcome is good. The answer boils down to the very essence of data science: the better the data fits your use case(s), the better the outcome will be.

Expect the training of the model to be an iterative exercise. For data quality, there is no magic, even with Einstein. The fact that Einstein Vision is very user friendly really smooths the process, though.

Stage 3: Activating The Knowledge

You are happy about your model? The latest predictions look good? Congratulations! That means that you are ready to connect the last dots.

As the learning and the predictions are done on the deep learning platform, you need to build your own API connections to feed the predictions into your Salesforce org.

It will be tempting to anticipate this work while the data collection is ongoing. While there is no advice against it, functional and technical team members alike should keep in mind that the data training iterations can lead to changes in the envisioned use cases. Scope shifts are part of the data science process – also with Einstein Vision.

With a given data set, it is possible, for example, that a Business-to-Business opportunity use case for the car door analysis could change into a more valuable use case for Business-to-Consumer service management. This will not involve more complicated APIs, but might involve other parts of the overall solution design.

Object Detection (more technical)

For Einstein Object Detection, the logic is similar:

    • Preparing data and uploading it to the Einstein Platform
    • Training the model
    • Predicting
    • Creating value by using the insights

Technically, it is more work as the goal is to also obtain the number of objects within the images, and their location.

Once you have collected all images, you have to provide the coordinates of the bounding boxes around the objects that you want to detect.

To do so, log the x-y coordinates of the top-left corner of each bounding box, as well as its width and height in pixels. Multiple objects are allowed within the same image. Store the exact image names and the corresponding box coordinates in an annotations CSV file. This file needs to be loaded along with the images to train the model.
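
Writing such a file by hand quickly becomes tedious, so a small script can help. The sketch below assumes a layout with one row per image, an `image_file` column and one JSON-encoded box per additional column; treat this layout and all names as assumptions, and check them against the current Einstein Object Detection documentation before uploading.

```python
import csv
import io
import json
from typing import Dict, List

Box = Dict[str, object]  # label, x, y, width, height


def build_annotations_csv(annotations: Dict[str, List[Box]]) -> str:
    """Build the CSV text: image name first, then one JSON box per column."""
    max_boxes = max(len(boxes) for boxes in annotations.values())
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
    writer.writerow(["image_file"] + [f"box{i}" for i in range(max_boxes)])
    for image_file, boxes in annotations.items():
        cells = [json.dumps(box) for box in boxes]
        cells += [""] * (max_boxes - len(cells))  # pad images with fewer boxes
        writer.writerow([image_file] + cells)
    return buffer.getvalue()


# Made-up coordinates: x/y is the top-left corner, sizes are in pixels.
annotations = {
    "picture1.jpg": [
        {"label": "muffin", "x": 10, "y": 20, "width": 50, "height": 40},
        {"label": "dog", "x": 120, "y": 15, "width": 200, "height": 180},
    ],
    "picture2.jpg": [
        {"label": "muffin", "x": 33, "y": 48, "width": 52, "height": 41},
    ],
}

print(build_annotations_csv(annotations))
```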

Which Salesforce Clouds Should I Use Einstein Vision With?

While the applications of Einstein Vision are virtually unlimited, certain use cases tend to be more common. Typically, these involve:

    • Audits: Streamlining in-store audits (e.g. on-shelf availability, share of shelf) with product identification
    • Part Identification: Fast-tracking sales and service support with automatic part identification
    • Diagnostics: Automating insurance claims by identifying damaged parts and severity
    • Brand Recognition: Measuring brand impact across social media

In terms of ‘cloud scope’, this means that applications will typically arise in orgs using Sales Cloud, Service Cloud or Marketing Cloud.

Sales and Service Cloud

For Sales and Service Cloud environments, the pictures are typically taken by sales or service representatives of the company. Increasingly, customers can also be asked to submit the pictures for further analysis by the company. Typical examples are car damage processes or guest feedback.

Marketing Cloud

For Marketing Cloud, a wide range of Einstein features is already available, whether driven by Vision or by other Einstein intelligence (an entire article could be dedicated to that). Einstein Vision for Social Studio, for instance, allows marketing departments to monitor the activity of competing brands in an integrated way.


As we have seen, Einstein Vision is less about advanced coding or finding the optimal combination with other products. It really starts with defining good use cases and staying focused on data quality. Once you have those and are willing to iterate, Einstein Vision will do the rest. With experience, ideas will pop up, as well as tricks to ease your work.

So make sure to take many pictures during the next months. Your holiday memories might become the perfect input for your first model… In the meantime, Einstein questions and ideas are welcome. It is a vast world, worth exploring further.
