Schema Enforcement Strategy for Salesforce to Blob to Data Flow Pipelines

Question

Bilesh Ganguly 20

We have a common Azure Data Factory pipeline that extracts Salesforce data via SOQL, stages it as JSON in Blob Storage, and then loads it through Mapping Data Flows.

We observed a case where a field such as BillingPostalCode was interpreted as integer instead of string.

Could you clarify the following:

Can a published Mapping Data Flow ever change a column's datatype at runtime based on the incoming data (for example, string → integer) when schema drift is disabled?
- If yes, what is the recommended way to prevent this?
If the issue originates from Salesforce-to-JSON serialization (for example, some records being written as numeric values and others as strings), what is the Microsoft-recommended approach for enforcing consistent datatypes in a reusable/dynamic ingestion framework?
- Should the schema be enforced during extraction, staging, source projection, or within the Data Flow transformations?
For fields such as postal codes, identifiers, account numbers, and similar business keys, what is the recommended pattern to guarantee they are always treated as strings throughout the pipeline?

Answer accepted by question author

Answer 1

A mapping data flow with schema drift disabled does not change a column’s data type at runtime. With drift disabled, the data flow uses the defined projection (early binding) and data types remain as designed. Type changes at runtime are associated with schema drift and late binding, where drifted columns can be auto‑typed or inferred.

Given the behavior described, the inconsistency is most likely introduced before or at the JSON staging layer (Salesforce → JSON), not by a published mapping data flow with schema drift disabled.
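One way to confirm where the inconsistency originates is to scan the staged JSON before it reaches the data flow. The following is a minimal sketch, not part of ADF: it assumes the staged file holds one JSON document per line, and the function names and sample records are hypothetical.

```python
import json
from collections import defaultdict


def field_types(json_lines):
    """Map each field name to the set of Python type names seen for its values."""
    seen = defaultdict(set)
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the staged file
        record = json.loads(line)
        for name, value in record.items():
            seen[name].add(type(value).__name__)
    return dict(seen)


def mixed_type_fields(json_lines):
    """Return only the fields serialized with more than one type (nulls ignored)."""
    mixed = {}
    for name, types in field_types(json_lines).items():
        non_null = types - {"NoneType"}
        if len(non_null) > 1:
            mixed[name] = non_null
    return mixed


# Hypothetical staged records: the same field written as string and as number.
sample = [
    '{"Id": "001A", "BillingPostalCode": "02134"}',
    '{"Id": "001B", "BillingPostalCode": 94105}',
    '{"Id": "001C", "BillingPostalCode": null}',
]

print(mixed_type_fields(sample))
```

If a field such as BillingPostalCode shows up with both `str` and `int`, the mixed typing is already present in the staged file, which points to the serialization step rather than the data flow.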

For the follow‑up points:

Can a published Mapping Data Flow change a column’s datatype at runtime when schema drift is disabled?

No. When schema drift is disabled, the source projection is taken from the dataset schema and used as the fixed contract for the flow. The documentation notes that:

- Source projection is created from the dataset schema definition.
- Schema drift is explicitly defined as reading columns that are not in that projection and treating them as drifted columns.

Without schema drift, the data flow does not dynamically re‑infer types based on incoming values.

If yes, how to prevent it?

The answer above is no, but because type changes are tied to schema drift and inference, the following settings keep typing fixed:

- Ensure Allow schema drift is unchecked on both source and sink transformations.
- Ensure Infer drifted column types is not used (this only applies when drift is enabled).
- Keep a well‑defined projection and avoid late‑binding patterns for critical fields.

Where to enforce schema in a reusable/dynamic ingestion framework?

The documentation describes several enforcement points and their characteristics:

Source projection / dataset schema
- When a dataset is selected, the service “automatically take[s] the schema from the dataset and create[s] a projection from that dataset schema definition.”
- This is the primary early‑binding contract for mapping data flows.
- For JSON, the “Use projected schema” optimization explicitly does not work with schema drift, reinforcing that projection is the enforcement point when drift is off.
Within mapping data flow transformations
- The Cast transformation is designed to “modify the data types of individual columns in a data flow” and to check for casting errors.
- The Stringify transformation converts complex types to string when needed.

Given this, the Microsoft‑aligned pattern for a reusable/dynamic framework is:

- Define and maintain a strong schema in the dataset/source projection for known fields.
- Disable schema drift for those sources/sinks where strict typing is required.
- Use a Cast transformation early in the flow to normalize types for critical columns if there is any ambiguity from upstream systems.

Enforcement at extraction (Salesforce → JSON) is outside the scope of this answer. Within Azure Data Factory, enforcement is primarily at:

- Dataset/source projection, and
- Mapping data flow transformations (Cast/Stringify) over that projection.
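For a reusable framework, the same "fixed contract" idea can also be checked against staged records before the data flow runs. This is only an illustrative sketch that runs outside ADF: the contract set, the function name, and the sample records are assumptions, not part of any ADF API.

```python
# Hypothetical contract: fields that must always arrive as strings.
EXPECTED_STRING_FIELDS = {"BillingPostalCode", "AccountNumber"}


def contract_violations(records, string_fields=EXPECTED_STRING_FIELDS):
    """Return (record_index, field_name, actual_type) for every non-string value.

    Missing fields and nulls are not treated as violations in this sketch.
    """
    violations = []
    for index, record in enumerate(records):
        for name in sorted(string_fields):  # sorted for a stable report order
            value = record.get(name)
            if value is not None and not isinstance(value, str):
                violations.append((index, name, type(value).__name__))
    return violations


# Hypothetical parsed records from the staged JSON.
records = [
    {"BillingPostalCode": "02134", "AccountNumber": "A-1"},
    {"BillingPostalCode": 94105, "AccountNumber": 77},
]

for violation in contract_violations(records):
    print(violation)
```

A check like this does not replace the dataset schema or the Cast transformation; it only surfaces upstream inconsistencies early, so a pipeline can fail fast or log them.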

Recommended pattern for postal codes, identifiers, account numbers, business keys

For fields that must always be treated as strings (postal codes, identifiers, account numbers, business keys), the recommended pattern within ADF mapping data flows is:

In the dataset and source projection:
- Define these columns explicitly as string in the dataset schema so that the projection is string‑typed.
- Keep Allow schema drift disabled for these sources to avoid late‑binding behavior.
In the mapping data flow:
- Add a Cast transformation near the start of the flow and explicitly cast these columns to string, even if they already appear as string in the projection. This provides a defensive normalization layer against any upstream inconsistencies.
- If any of these fields are complex (arrays, maps, structs), use Stringify to convert them into a single string representation.
In sinks:
- Define sink schemas so that these columns are string (or equivalent text type) and keep Allow schema drift disabled for strict enforcement.
- If schema drift must be enabled for other columns, use rule‑based mapping so that these key fields are explicitly mapped as strings and not left to auto‑mapping.

This combination (early‑bound dataset schema, disabled schema drift for critical paths, and explicit Cast/Stringify transformations) provides a reusable pattern for keeping business keys consistently typed as strings throughout the pipeline.
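The defensive cast to string can be illustrated outside ADF with a short sketch. This is an analogy to the Cast transformation, not its implementation, and the function name and field list are hypothetical. It also shows a limitation: JSON numbers cannot carry leading zeros, so a postal code that was serialized as a number has already lost them, and a later cast to string cannot restore them.

```python
def normalize_keys_to_string(record, string_fields):
    """Return a copy of the record with the listed fields coerced to str.

    Nulls and existing strings are left untouched; other values go through str().
    """
    normalized = dict(record)
    for name in string_fields:
        value = normalized.get(name)
        if value is None or isinstance(value, str):
            continue
        normalized[name] = str(value)
    return normalized


key_fields = ["BillingPostalCode", "AccountNumber"]  # hypothetical business keys

already_string = normalize_keys_to_string({"BillingPostalCode": "02134"}, key_fields)
was_numeric = normalize_keys_to_string({"BillingPostalCode": 2134}, key_fields)

print(already_string["BillingPostalCode"])  # string value preserved as written
print(was_numeric["BillingPostalCode"])     # now a string, but the leading zero is gone
```

This is why the cast works best as a safety net behind a string‑typed projection: it makes the type consistent, but it cannot recover information that was dropped when the value was written as a number upstream.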
