- Yes. A single Azure Data Factory pipeline can run multiple activities in parallel, including multiple Execute Azure ML Pipeline activities, as long as no dependency is configured between them.
- The correct pattern in ADF is to fan out after the backup step:
  - Keep each segment’s internal sequence (Copy → Data Flow → Execute ML Pipeline → Data Flow) chained with success dependencies.
  - Connect the backup activity’s success output to the first activity of each of the three segments.
  - Do not chain Segment A → Segment B → Segment C. With no cross-segment dependencies, ADF will schedule the three branches in parallel and the pipeline run will complete only after all branches finish.
- On the Azure ML side, parallelism is governed by the compute configuration and job type:
  - Each Azure ML pipeline triggered from ADF is an independent job.
  - If all three ML pipelines target the same compute cluster, Azure ML will schedule their steps onto that cluster according to available nodes and cores.
  - To achieve true parallelism, the cluster must have sufficient capacity (node count and per-node resources) so that the steps from different pipelines can be placed concurrently instead of queueing.
  - Azure ML supports parallel jobs and multinode compute; parallelization works by splitting work into mini-batches and dispatching them across nodes. If the cluster is undersized relative to the total parallel workload (three pipelines × their internal parallel steps), some steps will wait for resources.
  - Therefore, to move from sequential to concurrent execution, increase the cluster’s maximum node count, and the underlying core quota if that is the limit, so that the combined workload of the three pipelines fits without contention.
- Additional considerations:
  - In ADF, if Execute ML Pipeline activities are configured with Wait on completion, each branch will wait for its ML pipeline to finish before moving to its next activity, but branches remain independent and can still run in parallel.
  - When using Mapping Data Flows in parallel branches, be aware that each data flow activity spins up its own Spark cluster when executed in parallel, which increases concurrent compute usage and cost.
  - For any shared status or tracking tables, ensure that updates from the three branches are either to distinct rows or are coordinated to avoid conflicts. ADF itself does not serialize access; concurrency control must be handled in the database logic.
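The fan-out pattern described above can be sketched in Python by building the `dependsOn` entries that an ADF pipeline definition uses and checking that no activity depends on another segment. This is only an illustration under assumptions: the activity names (`Backup`, `SegA_Copy`, and so on), the choice of `Copy` as the backup activity's type, and both helper functions are hypothetical, and a real pipeline definition needs many more properties (linked services, datasets, ML pipeline IDs).

```python
from collections import defaultdict


def activity(name, type_, depends_on=None):
    """Build a minimal ADF-style activity dict that waits for its
    upstream activities to succeed."""
    return {
        "name": name,
        "type": type_,
        "dependsOn": [
            {"activity": dep, "dependencyConditions": ["Succeeded"]}
            for dep in (depends_on or [])
        ],
    }


def build_fan_out(segments):
    """Backup first, then one independent chain per segment.
    Names and the backup activity's type are placeholders."""
    steps = [
        ("Copy", "Copy"),
        ("Prep", "ExecuteDataFlow"),
        ("RunML", "AzureMLExecutePipeline"),
        ("Post", "ExecuteDataFlow"),
    ]
    activities = [activity("Backup", "Copy")]
    for seg in segments:
        previous = "Backup"  # every branch starts from the backup step
        for suffix, type_ in steps:
            name = f"{seg}_{suffix}"
            activities.append(activity(name, type_, [previous]))
            previous = name  # chain only within the same segment
    return activities


def dependency_depths(activities):
    """Group activity names by dependency depth. Activities at the same
    depth have no ordering between them. ADF does not run in lockstep:
    each activity starts as soon as its own dependencies succeed."""
    by_name = {a["name"]: a for a in activities}
    depth = {}

    def resolve(name):
        if name not in depth:
            deps = [d["activity"] for d in by_name[name]["dependsOn"]]
            depth[name] = 1 + max((resolve(d) for d in deps), default=-1)
        return depth[name]

    groups = defaultdict(list)
    for a in activities:
        groups[resolve(a["name"])].append(a["name"])
    return [groups[i] for i in sorted(groups)]


pipeline_activities = build_fan_out(["SegA", "SegB", "SegC"])
waves = dependency_depths(pipeline_activities)
```

Here `waves[1]` holds the first activity of all three segments, which shows that nothing in the dependency graph forces one segment to wait for another.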