Architect’s Guide: Data Profiling to Assess and Monitor Data Reliability

By Mehmet Orun

I loved Salesforce’s recent blog that stated: “Bad data is junk food for AI.” However, after years of research showing that more than half (55%) of business leaders do not trust their data, the challenge remains. I believe one key reason is the gap between how IT and the business perceive data reliability. In fact, if you look closely at the recent Salesforce survey, 57% of Data and Analytics leaders are confident in data accuracy, whereas only 42% of Sales and Service leaders feel the same. How can you be confident data is reliable if you don’t understand how the business uses it?

It’s incumbent on Salesforce Architects to bridge this gap and guide their organization on the necessary capabilities to address data reliability challenges effectively.

In this blog, I’ll explain why data profiling solutions are an essential part of the Salesforce solution architecture. I’ll describe the pros and cons of different architectural approaches to data profiling, guide you through selection criteria, and share best practices and anti-patterns.

Capabilities Required for Sustainable Data Reliability

Salesforce provides a framework for administering data quality in CRM orgs:

My observation, having spent years at Salesforce advising customers on data strategy, is this: most organizations have implemented solutions for duplicate management, standardization, and to some extent data validation. Where organizations have implemented data profiling, it was mostly for point-in-time needs (for example, “what are my unused fields?”) rather than for ongoing data reliability. Very few organizations have put monitoring in place.

Data reliability starts with assessing your CRM org’s data and associated technical metadata health. Without a quantitative understanding of your data health, you cannot determine if your data reliability is sufficient to meet different business objectives. Even if your data is sufficient to meet your business needs today, you cannot detect unexpected deviations that put your business outcomes at risk without ongoing monitoring.

CRM data and metadata ailments will impact every downstream initiative: enterprise AI, data unification in Data Cloud or Tableau, automation with Flow, etc. Data quality is an ongoing need with moving targets as the business evolves. You must have an effective data profiling and monitoring solution for your stakeholders.

What is Data Profiling and Why Do You Need It?

The Data Management Association defines data profiling as “statistical analysis of data set contents to understand format, completeness, consistency, validity, and structure of the data.” For projects involving a significant amount of data, like Salesforce, they recommend a data profiling tool as the most efficient means of conducting this analysis.
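To make the definition concrete, here is a minimal Python sketch of two of the most basic profiling statistics, fill rate and distinct value count, for a single field. The `profile_field` helper, the sample records, and the choice to treat blank strings as empty are all assumptions for illustration; this is not how any particular profiling product works.

```python
from collections import Counter

def profile_field(records, field):
    """Return basic profiling statistics for one field across a list of dict records."""
    total = len(records)
    # Assumption for this sketch: None and blank strings both count as "not populated".
    populated = [r.get(field) for r in records if r.get(field) not in (None, "")]
    counts = Counter(populated)
    return {
        "field": field,
        "total": total,
        "fill_rate": len(populated) / total if total else 0.0,
        "distinct_values": len(counts),
        "top_values": counts.most_common(3),
    }

# Hypothetical sample records standing in for Account rows.
accounts = [
    {"Name": "Acme", "Industry": "Retail"},
    {"Name": "Globex", "Industry": None},
    {"Name": "Initech", "Industry": "Retail"},
    {"Name": "Umbrella", "Industry": ""},
]

stats = profile_field(accounts, "Industry")
print(stats)
```

A profiling tool repeats this kind of calculation for every field on an object, which is why doing it by hand does not scale.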

Organizations require scalable profiling solutions for assessing data in Salesforce CRM. These tools must cater to various stakeholder needs across different record types (Sales, Partner Sales, Vendor Management, HR, etc). Similarly, companies need profiling solutions to effectively unify data from different sources in Salesforce Data Cloud. It is crucial to be able to perform rapid assessments and continuous data trend monitoring.

Select a Profiling Solution That Meets Your Stakeholders’ Needs

Data profiling is not a new concept and there are many tools available in the industry. In general, there are 3 deployment architectures for data profiling solutions.

  • Native solutions that profile data within the boundaries of the business application.
  • External tools that profile data from applications based on data exports.
  • Hybrid solutions that have Salesforce user interfaces but process assessment and analytics outside of the org, often through APIs.

Understanding the pros and cons of each architecture is essential to drive approval and adoption within your organization.

Native Data Profiling

Built and hosted on the Salesforce platform, native solutions offer four key advantages.

  1. Data security: Native profiling offers the greatest data security of the three architectures because:
    • Data never leaves the org.
    • Native apps can leverage Salesforce’s field-level security and sharing rules.
    • These apps also come with the added assurance of having passed Salesforce’s security review process.
  2. Current data: Because native profiling solutions run in the org, they always profile real-time data and metadata.
  3. Easy installation and upgrades: The AppExchange makes installation and upgrades seamless.
  4. Familiar user interface: You and your users will have the advantage of the familiar Salesforce user interface.

Examples of native data profiling solutions include Cuneiform for CRM, Field Pro, FieldSpy, and Field Trip.

External Profiling Tools

External tools have historically been the purview of IT departments. They can assess data from any source and may have more specialized features. However, external tools have four key disadvantages:

  1. Data security: These tools typically require data to be exported outside of the business application to be analyzed, meaning analysis occurs outside of the security controls built into your CRM environment. This comes with further disadvantages:
    • Data context is lost. Exports standardize data types (dropping details such as string vs. picklist values), and associated configuration metadata is not available to external profiling tools.
    • Out-of-date data. The analysis is limited to the contents of the export, which also makes trend analysis more expensive at best.
    • Increased data governance complexity. Data copies and when they are exported must be tracked, security and access controls need to be maintained across different technologies, and data access control and deletion processes must be expanded.
  2. Not accessible to business users or CRM admins: As IT tools, these applications are seldom accessible to CRM admins, never mind business users, who are primarily responsible for effective metadata configuration and data maintenance.
  3. Higher costs: Licensing costs aside, the need to handle data security concerns, additional integrations and processes, and the learning curve of external tools come with a higher cost than native solutions.
  4. Difficult to understand data trends: Your organization’s data retention policies may require purging data from the external solution before you have time to effectively monitor and respond to trends, yet retaining results over time is essential to building an effective history of the org’s data reliability.

Examples of external data profiling solutions include Ataccama, IBM InfoSphere, Informatica Data Explorer, and Talend Open Studio.

Hybrid Profiling Tools

Also found on the AppExchange, hybrid solutions have Salesforce user interfaces and have passed a Salesforce Security Review. However, they do process and persist data outside of your org.

Hybrid solutions have the advantage of ease of use but require much more security review rigor and effort. They share the same disadvantages as external tools when it comes to data governance, data trending, and acquisition costs.

Examples of hybrid solutions include Hubbl Process Analytics and Metazoa Snapshot.

What About Reports and Queries?

When I heard at a recent Circle of Success that “you can create custom reports to identify empty fields or field value frequency,” I was initially confused. It is possible to create custom solutions that mimic data profiling results but these would be time-consuming with the additional burden of ongoing maintenance. I for one would not want to build a separate query for every single field I may need to assess, when my objects may have 200-500+ custom fields.
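To illustrate the scale of that maintenance burden, the following Python sketch generates one SOQL count query per field. The object name and the `Custom_Field_N__c` field names are hypothetical, and the sketch only builds query strings; it does not connect to an org. It also assumes every field can be filtered on, which is not true for all field types.

```python
def null_count_queries(object_name, field_names):
    """Build one SOQL count query per field to find records where the field is empty."""
    return [
        f"SELECT COUNT() FROM {object_name} WHERE {field} = null"
        for field in field_names
    ]

# Hypothetical custom field names; a real object may have hundreds.
fields = [f"Custom_Field_{i}__c" for i in range(1, 301)]
queries = null_count_queries("Account", fields)

print(len(queries))  # one query per field
print(queries[0])
```

Each of these queries would still need to be run, its result stored, and the list kept in sync as fields are added or retired, and this covers only empty-field counts, not value frequency.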

External query applications, e.g. DBeaver and SoqlXplorer, while favorite tools on my laptop, have the same challenges as above, so I would not recommend these as a scalable “business solution”.

Evaluation Guide

When evaluating data profiling solutions, keep these patterns and anti-patterns in mind.

Do Start With Native Data Profiling Solutions

Begin by evaluating native data profiling apps. In addition to the advantages described above, many native apps are also free, making them an ideal starting point for evaluation.

Do Assess Profiling Solutions for Data Security and Access

Most effective data profiling solutions assess both CRM data and associated metadata. It’s also important to understand how the solution manages access and purging.

Does the app require the Administrator profile with read-all permissions or can it run under the user’s own permission levels? I prefer the latter. It ensures users only see the objects, fields, and records they are allowed to see.

Does the app support a read-only view of only the profiling results? This empowers data specialists to find patterns even under the most restrictive permission models.

Do Evaluate Solution Performance With Representative Data

Performance levels and feature breadth of data profiling solutions available on the AppExchange may vary. Performance assessments are key, especially in larger orgs.

Start by asking about the maximum field and record count that the solution provider has certified for their solution.

If you have a full copy Sandbox:

  1. Identify your largest objects by size as well as by the number of fields.
  2. Create and run the same profiling definition for the object(s) in each tool. Document the execution time and whether the tool can process the entire object without timing out.

If you cannot use real production data to evaluate tools, use synthetic data with a comparable number of fields and rows. A tool like Mockaroo will enable you to create synthetic data with custom parameters. Use the CSV to create a temporary custom object and assess.
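If you prefer to script the synthetic data yourself, a minimal Python sketch using only the standard library could look like the following. The `write_synthetic_csv` function, the `Field_N` column names, the 30% empty-value rate, and the random eight-letter values are all assumptions for illustration; adjust them to resemble the shape of your own objects.

```python
import csv
import io
import random
import string

def write_synthetic_csv(out, num_rows, num_fields, null_rate=0.3, seed=42):
    """Write num_rows rows of random text with num_fields columns to a file-like object."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same data
    writer = csv.writer(out)
    writer.writerow([f"Field_{i}" for i in range(1, num_fields + 1)])
    for _ in range(num_rows):
        row = [
            # Leave a share of values empty so fill rates are not all 100%.
            "" if rng.random() < null_rate
            else "".join(rng.choices(string.ascii_uppercase, k=8))
            for _ in range(num_fields)
        ]
        writer.writerow(row)

# Write to an in-memory buffer here; use open(path, "w", newline="") for a real file.
buffer = io.StringIO()
write_synthetic_csv(buffer, num_rows=1000, num_fields=50)
lines = buffer.getvalue().splitlines()
print(len(lines), "lines,", len(lines[0].split(",")), "columns")
```

Random text will not reproduce realistic value distributions, so treat results from such data as a test of throughput rather than of profiling insight.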

Of course, always remember how governor limits may skew initial results.

As a general benchmark, the latest native data profiling tools can assess 10 million records with 500+ fields in 20 minutes or less.

Do Use a Data Profiling Features Evaluation Matrix

You may not initially utilize all of the below features. However, identifying feature gaps can assist in making your initial selection and facilitate future growth and expansion over time.

| Capability | Description | Example | Why? |
|---|---|---|---|
| Data Profiling Fundamentals | Does the tool capture: field fill rate; field distinct rate | A field is 97% null with 3 distinct values | Foundational capability every profiling tool should have. |
| Data with Configuration | How observed data compares to the config: distinct values vs. active picklist values; distinct values vs. data type; fill rates vs. field usage | Picklist field has 10 distinct values but 14 active value configurations; Field is string data type but only has 7 distinct values; Field is not used but is in 4 UIs and 8 reports | Assessing data and configuration metadata together can quickly provide insights on potential usability and data reliability challenges. |
| Data Profiling Granularity | Ability to create multiple profiling scenarios to assess an object’s contents | Customer Accounts vs. Partner Accounts; Open Opportunities vs. Closed Won Opportunities | Salesforce’s flexible data model means it is common for multiple functions or business units to be supported by the same object. Granular scenarios are essential to assess business-specific data quality considerations. |
| Advanced Profiling Features | Ability to infer additional insights based on data and metadata, e.g. net population rate | The field’s default value represents 90% of populated values | Demonstrates tool benefits to assess probable reliability |
| Data Governance and Dictionary Support | Ability to capture Definition, Help Text, Data Owner, Data Classification, Data Management Rules, Encryption, and Usage details in a common view | Field is classified as Confidential, PII but not encrypted | Data Governance features in Salesforce CRM are spread across multiple parts of the setup tree. Consolidating all insights in a common UI and data model simplifies visualization and monitoring. |
| Trend Monitoring | Ability to take snapshots and compare profiling insights over time | Profiling definition shows 12% record volume growth but field completeness has dropped | Snapshotting and trend analysis are critical for understanding data changes over time and can aid in proactive data management. |
| Data Quality KPIs | Ability to incorporate data quality formulas into the assessment | Billing Address completeness; Incorrect Account detection; Junk account detection | The inclusion of KPIs for data quality provides a structured approach to measuring and improving data integrity. |
| Reports and Dashboards | Out-of-the-box reports; ability to create custom reports with Salesforce tools | Fill rate visualizations | Accelerates value realization with common tools for stakeholder engagement. |
| Data Health Scoring | Application providing a snapshot of data health based on out-of-the-box or configurable scoring models | Data Dictionary health is 47/100; Account object data health is 72/100 | Offering a holistic view of data health at a glance is crucial for quick assessments and prioritization. |
| Data Quality Improvement Recommendations | Ability to look across all relevant/profiled objects to identify tactical next steps to improve data and org configuration | Convert string field to picklist; Encrypt sensitive field | Every org has data quality and configuration health challenges. The ability to correlate findings to actions quickly brings value faster. |
| User Experience and Usability | How user-friendly is the tool? Does it use the latest UX patterns (e.g. LDS)? | N/A | A good UI/UX can significantly affect adoption rates and the effectiveness of data profiling activities. |
| Customization and Flexibility | Ability to expose insights, e.g. data quality scores, within other CRM applications | Can the UI or data be exposed in other parts of the app? | The ability to customize the tool to fit specific organizational needs. |
| Scalability | How well does the tool scale with the growing amount of data? | What are certified data volume and field counts per object? Does the vendor publish performance statistics? | Ensure the tool can scale with the growing amount of data and evolving business requirements. |
| Shield Support | Can the solution profile encrypted fields? Are there limitations? | N/A | Some customer orgs may have Shield turned on. Understand limits. |
| Compliance and Certifications | Can the vendor demonstrate current certifications or compliance? | SOC2; ISO27001; HIPAA; FedRamp | Certain industries may require these certifications. |
| Support and Community | The availability of support options, documentation, user community | Is the product well documented? Does the vendor offer free and premier support? | Effective documentation and an active user community can be valuable resources for troubleshooting and best practices. |
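
As an illustration of the Trend Monitoring capability in the matrix above, here is a minimal Python sketch that compares two profiling snapshots and flags fields whose fill rate has dropped. The snapshot structure, the `compare_snapshots` helper, the sample numbers, and the five-percentage-point threshold are assumptions made for this example, not the output format of any specific tool.

```python
def compare_snapshots(previous, current, drop_threshold=0.05):
    """Compare two profiling snapshots and flag fields whose fill rate dropped."""
    volume_growth = (
        (current["record_count"] - previous["record_count"]) / previous["record_count"]
    )
    alerts = []
    for field, prev_rate in previous["fill_rates"].items():
        curr_rate = current["fill_rates"].get(field)
        if curr_rate is None:
            continue  # field was not profiled in the current snapshot
        if prev_rate - curr_rate > drop_threshold:
            alerts.append((field, prev_rate, curr_rate))
    return volume_growth, alerts

# Hypothetical snapshots of the same profiling definition, one month apart.
january = {"record_count": 100_000, "fill_rates": {"BillingCity": 0.92, "Industry": 0.80}}
february = {"record_count": 112_000, "fill_rates": {"BillingCity": 0.91, "Industry": 0.68}}

growth, alerts = compare_snapshots(january, february)
print(f"Record volume growth: {growth:.0%}")
for field, before, after in alerts:
    print(f"{field}: fill rate fell from {before:.0%} to {after:.0%}")
```

The point of the comparison is the combination: volume is growing while completeness of a key field is falling, which neither snapshot shows on its own.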

Don’t Let IT Tooling Get in the Way of Data Reliability

Your IT department may have already procured a data profiling tool to support their integration development initiatives. More mature IT departments may even possess internal data quality monitoring capabilities built on other systems (data warehouse, enterprise message bus, etc).

Stay vigilant and make the business case for why your Salesforce Admins, Data Specialists, and Business Data Stewards need to assess and monitor their data and metadata health. Asking “How many of these users are users of the IT tool?” is often an effective way to get the point across.

Do Work With Your Admins to Unlock the Full Potential of Data

As an architect, you guide the overall Salesforce architecture and roadmap. This includes how to identify and address scalability, security, integration, and data quality concerns to meet the organization’s strategic goals. Admins have always been key allies. As primary users of data profiling solutions, they can ensure your data assessments are impactful and help to maintain data quality and reliability over time.

If you do not have a data profiling solution in your CRM org, partner with and educate your admins on the benefits:

  • Assessing data and metadata quality to guide tactical actions to improve data reliability, application usability, and maintainability. E.g:
    • Identifying unused/underutilized fields and field values.
    • Using profiling insights to build better data models (e.g. picklist conversions, deactivating unused picklist values, or splitting a field into many to capture more granular data).
    • Understanding data dictionary health.
  • Identifying fields that can predict successful business outcomes to focus user adoption on key business data.
  • Implementing monitoring solutions to catch deviations and ensure data remains reliable for the life of your Salesforce solutions.

Collaborate closely with admins to communicate the architectural vision, ensuring practical application and maintenance. This partnership ensures that the system not only meets current needs but is also poised for future growth and change, using tools like data profiling to uphold high data quality and system efficiency.

Do Profile Every Production Instance

Your evaluation of data profiling solutions will likely happen in your full or partial copy Sandbox. To take advantage of data quality monitoring, you will need to deploy data profiling solutions in production. This way, you can take snapshots of your data growth and correlate that to metadata changes over time.

If your organization has multiple instances, deploy the tooling across each org. This assessment can significantly demonstrate the importance of enterprise-wide data reliability and governance strategies. It will also illustrate the importance of data unification for AI, analytics, automation, and activation initiatives, including but not limited to Data Cloud.

When Do You Augment Native Data Profiling?

My simple answer is when there is a specific set of insights that are impactful to the business outcomes you want to achieve and you are not able to get these insights from your native data profiling solution.

I prefer to have the smallest set of solutions working together, starting with native and hybrid solutions and then moving to external tooling due to the above-mentioned reasons. While I am comfortable with hybrid tools that analyze metadata (e.g. dependencies), I would have to have a very good reason to do data assessments outside of my org, especially given the importance of ongoing data quality monitoring, contextual root cause analysis, and in-app response.

Summary

Architects must help their organizations assess and improve data reliability to unlock the full potential of their Salesforce data ecosystem and drive long-term success. Putting in place the right data profiling and ongoing monitoring solution is a key component in achieving this outcome.

The first step to ensuring data reliability is assessing the data and associated metadata against business outcomes.

Start with a native data profiling solution from the AppExchange that can support multiple business scenarios. Assess the technical data health of your key objects and business data reliability for one scenario to build the foundation.

Set up data quality formulas based on data that matters to your business. Set up monitoring and alerts to detect and respond to unexpected changes. Scale to additional business use cases over time and, if you have them, to other orgs as well.
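
As a rough sketch of what a data quality formula with an alert might look like, the Python example below computes Billing Address completeness and checks it against a target. The choice of address fields, the 80% target, and the `check_kpi` helper are assumptions for illustration; your own formulas should reflect the data that matters to your business.

```python
# Assumption: these four standard Account fields define a "complete" billing address.
BILLING_FIELDS = ("BillingStreet", "BillingCity", "BillingPostalCode", "BillingCountry")

def billing_address_completeness(records):
    """Share of records in which every billing address field is populated."""
    if not records:
        return 0.0
    complete = sum(
        1 for r in records
        if all(r.get(f) not in (None, "") for f in BILLING_FIELDS)
    )
    return complete / len(records)

def check_kpi(name, value, minimum):
    """Return an alert message when a KPI falls below its minimum, else None."""
    if value < minimum:
        return f"ALERT: {name} is {value:.0%}, below the {minimum:.0%} target"
    return None

# Hypothetical Account records; two of the four have a complete billing address.
accounts = [
    {"BillingStreet": "1 Main St", "BillingCity": "Austin", "BillingPostalCode": "78701", "BillingCountry": "US"},
    {"BillingStreet": "2 Elm St", "BillingCity": "", "BillingPostalCode": "10001", "BillingCountry": "US"},
    {"BillingStreet": "3 Oak Ave", "BillingCity": "Denver", "BillingPostalCode": "80202", "BillingCountry": "US"},
    {"BillingStreet": "4 Pine Rd", "BillingCity": "Boston", "BillingPostalCode": None, "BillingCountry": "US"},
]

score = billing_address_completeness(accounts)
message = check_kpi("Billing Address completeness", score, minimum=0.8)
print(message)
```

Running the same check on a schedule and recording each score is what turns a one-time formula into monitoring.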

Also implement data profiling in Data Cloud, so you are not only ensuring your CRM data is reliable, but you are also monitoring data reliability across any data source that is powering your applications, AI, automation, analytics, and activation initiatives.

The Author

Mehmet Orun

Salesforce Veteran and Data Management SME, working with Salesforce since 2005 as a Customer, Employee, Practice Lead, and Partner. Now GM and Data Strategist for PeerNova, an ISV partner focused on data reliability, as well as Data Matters Global Community Leader.