Edit

Architecture strategies for designing a workload development supply chain

Applies to this Azure Well-Architected Framework Operational Excellence checklist recommendation:

OE:06 Build and optimize your workload supply chain by using predictable, automated pipelines that test and promote changes across environments. Design these pipelines to consistently meet your reliability, security, cost-efficiency, and performance targets.

To provide a predictable, standardized way to maintain your workload, design your workload development supply chain around continuous integration and continuous delivery (CI/CD). Maintain a single, standardized supply chain and implement it by using automated CI/CD pipelines. You can use multiple pipelines as long as they all adhere to the same supply chain.

Use a standardized supply chain to protect your workload from the risks of unmanaged changes. Maintain continuous visibility into workload state to avoid unpredictable behavior and costly retracing of untracked changes when problems occur. To minimize these risks, standardize the processes and tools that define your supply chain, and ensure that your workload team fully commits to their use.

Definition

Term Definition
Supply chain In cloud workloads, a supply chain is a standardized suite of tools and processes that you use to implement infrastructure and application change across environments.

Note

The recommendations in this guide refer to workload environments in a code promotion chain. Sandbox or other exploratory and proof-of-concept (POC) environments require less rigor and structure.

The following recommendations can help you define the core tenets of your supply chain.

Enforce a strict policy of automated template-based deployments

Make all proposed workload changes through your supply chain's processes and tools, and enforce automated, template-based deployments. This approach keeps configurations standardized, well defined, and tightly controlled across environments. For environments in a code promotion chain, prohibit manual updates or direct interaction with the cloud control plane, such as the portal or APIs. Apply all changes to the environment through pipelines that follow your defined deployment practices. To enforce this policy, limit access to read-only and use authorization gates to grant write access when needed.

An important aspect of this tenet is that all changes are proposed changes until they're deployed into production. Through automated testing, like integration and smoke testing, you enable your supply chain to automatically reject changes.

AI opportunity: Manual diagnosis of deployment failures slows teams, causes frustration, and creates knowledge silos. Improve productivity and resilience by first evaluating off-the-shelf DevOps tools that provide AI-assisted failure analysis. To extend these capabilities, layer a custom agent on your existing toolchain to identify and resolve common problems. For advanced scenarios, consider an agentic solution that monitors pipeline telemetry and historical incident data to detect recurring failure patterns and recommend proactive remediations.

Deploy repeatable and immutable infrastructure as code

Deploy repeatable, immutable infrastructure by using infrastructure as code (IaC). Manage infrastructure as declarative, version-controlled definitions that mirror your application source. Applying the same IaC consistently creates the same environment every time, similar to how the same source code creates the same binary when you compile it.

To standardize and automate how you deploy and configure infrastructure, adopt IaC. Replace person-dependent deployments with fully automated pipelines. This approach shifts responsibility from individuals to tooling and reduces manual effort. Enable authorized team members to initiate deployments through this automated process to maintain consistency and quality across environments.

Design your workload as a logical group of components that you can bundle into one template to make deployments simple and repeatable. You can think of these bundles as stamps or units of scale. For more information, see Deployment Stamps pattern. When you need to deploy your workload to scale out into another region or zone within the same region, deploy a stamp by using a pipeline. Depending on how you design your stamps, you might deploy a subset of your workload instead of the entire workload. To ensure that your deployment stamps automatically connect to existing resources, include networking components in your IaC pipelines.

To optimize your IaC pipeline for consistency and efficiency, design an immutable infrastructure rather than a mutable infrastructure. Implement an immutable infrastructure to ensure that all systems in scope are replaced with the updated configuration simultaneously and identically with each deployment.

AI opportunity: Streamline pipeline management and reduce manual effort by using AI-powered deployment optimization. GitHub Copilot can accelerate code generation in your CI/CD pipelines. AI integration can validate and enforce organizational standards, recommend rollback strategies, and detect drift by comparing current and intended infrastructure states. Enhance monitoring with AI models that anticipate problems and surface them early for intervention.

Maintain accurate, continuously monitored resource inventories across environments, and include approval steps and audit trails to track AI actions.

Use the same set of deployment artifacts across all environments

Use the same set of code assets and artifacts across all environments and pipelines. A common challenge for organizations is when nonproduction environments differ from production environments. Manually building production and nonproduction environments can result in mismatched configurations between the environments. This mismatch slows testing and increases the likelihood that changes might harm a production system. An IaC approach reduces these problems. When you use IaC automation, you can use the same infrastructure configuration files for all environments to produce nearly identical environments. You can add parameters to the infrastructure configuration files and adjust them to meet the requirements for each environment.

To control costs, there's typically a variance between production and nonproduction environments. You typically don't need the same level of redundancy or performance in nonproduction, so resource counts and SKUs can differ. Ensure that you control and understand the variance by using standardized parameters to help you maintain predictability as you make changes.

Reflect the organizational structure in the supply chain

Reflect your organizational structure in your supply chain and pipeline designs. Your organization might be siloed among teams. You can organize your teams by function (such as networking, data, and compute) or integrate them as DevOps teams that manage infrastructure and applications. There are many ways to organize the teams that are involved in a supply chain. Regardless of structure, your supply chain relies on collaboration across all teams. Ensure that all teams follow standard processes and use standard tools to make the supply chain as efficient as possible.

Your supply chain might rely on external vendors if you outsource parts of the workload life cycle. These vendors are as critical to the success of your supply chain as internal resources. Ensure that there's a mutual agreement across all teams about the processes and tools that you use.

Choose the right deployment method

Standardize your deployment method. Talk to the product owner about the acceptable amount of production downtime for your workload. Depending on how much downtime is acceptable, you can choose the deployment method that fits your requirements. Ideally, perform deployments during a maintenance window to reduce complexity and risk. If no downtime is acceptable, use a blue-green deployment method.

Use a progressive-exposure approach, such as canary deployments, to reduce the risk of introducing undetected bugs to your customers. Deploy changes to small, controlled groups in stages so that you can detect problems before a broad release. The initial rollout group might be a subsection of your customers that are aware of the rollout strategy. This subsection of customers can tolerate some amount of unexpected behavior and provide feedback. Or it might be a group of internal users, which helps limit the impact of bugs during the rollout.

When you define your deployment method, adopt a standard policy of releasing the smallest viable change in each deployment. Determine what qualifies as the smallest viable change based on your workload's criticality and the complexity of its components. If you use an immutable infrastructure, the smallest viable change is to redeploy resources with the latest configuration to replace those resources that run the previous version. If you use a mutable infrastructure, you might decide that the smallest viable change is only a single, scoped update on the group of resources.

AI opportunity: AI-driven analysis identifies usage patterns and recommends optimal deployment times, which eliminates repetitive manual log inspection. Avoid guessing low-usage periods, deploying during high traffic times, or causing user disruption. Start with a low-effort generative AI (GenAI) approach that connects models directly to your workload telemetry. This approach enables interactive pattern detection and insight generation. For large-scale, high-transaction workloads, scale to machine learning-based predictive models that forecast optimal deployment windows as usage trends evolve.

Follow a layered approach

Follow a layered approach to reflect different life cycles. Workload layers are expected to change throughout the life cycle, but you should keep the life cycle operations static and consistent. Use separate deployment pipelines for each layer so that you can apply changes independently and appropriately.

A landing zone is located at the lowest layer of your architecture. It's a logical grouping of foundational elements, such as subscriptions, management groups, resource groups, governance controls, and network topology, that you can use to deploy and operate workloads consistently. It gives central operations or platform teams a repeatable environment configuration. To ensure consistency, Azure landing zones include common design areas, a reference architecture, a reference implementation, and a process to tailor deployments to your design requirements. The design principles recommend practices based on policy-driven governance alongside subscription democratization. A landing zone should require minimal changes over the course of your workload life cycle. To see an example of a landing zone with recommended practices for Azure, see Azure landing zone.

Keep your core infrastructure stable, including ingress and egress network controllers, load balancers, network routing solutions, Domain Name System (DNS), and core servers, by making major changes only when necessary. Expect and plan for more frequent configuration updates to these components.

Your application and data layers typically require frequent configuration updates and infrastructure changes. Use the most dynamic deployment pipelines for these components.

Incorporate comprehensive types of testing

Plan for a holistic testing strategy. A core tenet of system reliability is the shift-left principle. The software development process has steps that move from left to right. Don't wait until the end to test. Move tests as early as possible so that you detect problems when they cost less to fix. Defects that you discover late in the life cycle can be costly or even impossible to correct.

Test all code, including application code, infrastructure templates, and configuration scripts. Version-control the environment that runs applications and deploy it through the same mechanisms that you use for application code. Use the same testing approaches that your teams use for application code to validate the environment.

Use automated testing when possible to ensure consistency. Include the following types of testing in your supply chain design.

  • Unit testing: Run unit tests as part of a continuous integration routine. Unit tests should be extensive and quick. They should ideally cover 100% of the code and run in under 30 seconds.

    Implement unit testing to verify that the syntax and functionality of individual modules of code work the way that they should, such as producing a defined output for a known input. You can also use unit tests to verify that IaC assets are valid.

    Apply unit tests to all code assets, including templates and scripts.

  • Smoke testing: Smoke tests verify that a workload can be set up in a test environment and performs as expected. Smoke tests don't verify the interoperability of components.

    Smoke tests verify that the deployment methodology for the infrastructure and the application works, and that the system responds as intended after the process finishes.

  • Integration testing: Integration tests ensure that the application components operate individually and determine whether components can interact with each other as intended.

    It can take a considerable amount of time to run a large integration test suite. For this reason, it's best to incorporate the shift-left principle and test early in the software development life cycle (SDLC). Reserve integration tests for scenarios that you can't test by using a smoke test or unit test.

    You can run long-running test processes on a regular interval if needed. A regular interval provides a good compromise and detects interoperability problems between application components no later than one day after they're introduced.

    Some testing scenarios benefit from manual runs. Use manual testing when you need to introduce human interactivity elements into tests.

  • Acceptance testing: Depending on the context, you can manually perform acceptance testing. It can be partially or fully automated. Acceptance testing determines whether the software system meets the requirement specifications.

    The main purpose of this test is to evaluate the system's compliance with the business requirements and determine whether the system meets the required criteria for delivery to end users.

AI opportunity: Enhance your testing strategy to find customer-focused edge cases and scenarios that are hard to detect and often overlooked. Start by using your existing analytics and reporting to identify coverage gaps, then use tools like Copilot to generate new test cases and scripts. Optimize your test suite by removing redundant tests to improve speed and efficiency. Consider an agentic AI solution that analyzes production usage, monitoring data, and historical defects to find patterns and automatically create tests in your codebase that match your organization's standards.

Implement quality gates in code promotion processes

Implement quality gates throughout your code promotion process via testing. Promote your code through lower environments, such as development and testing, before you advance to higher environments like staging and production. Define clear quality gates and targets for each stage and only promote changes to production after they meet those criteria. Your business requirements determine the focus of your quality gates. Also consider the fundamental Well-Architected Framework principles: Reliability, Security, and Performance Efficiency.

Integrate approval workflows into your quality gates. Clearly define and automate approval workflows when appropriate. Define quality acceptance criteria into your automation so that you can move through your gates efficiently and safely.

AI opportunity: Eliminate bottlenecks during reviews by automatically routing PRs to the right subject matter experts (SMEs). Begin with Copilot to assess scope and impact. Then add an agentic CI/CD integration to evaluate code changes, configurations, and approval patterns. Give least-privilege access only to required artifacts, such as repositories, pipelines, configuration data, incident history, and approval logs. With this access, you can assess impact, assign reviewers, identify bottlenecks, and recommend auto approval or extra review. For high-value workloads, train on historical data to predict deployment risk and approval outcomes. Keep humans in the loop and be cautious with auto approval.

Azure facilitation

Azure DevOps is a collection of services that help you build a collaborative, efficient, and consistent development practice.

Azure Pipelines provides build and release services to support CI/CD in your applications.

GitHub Actions for Azure integrates with Azure to simplify deployments. Use GitHub Actions to automate CI/CD processes. You can create workflows that build and test every pull request to your repository, or deploy merged pull requests to production.

You can use Terraform, Bicep, and Azure Resource Manager for IaC deployments. Depending on your requirements and your team's familiarity with the tools, you might use one or more of these tools for your deployments and management of the resources.

Example

For an example that shows how to use Azure Pipelines to build a CI/CD pipeline, see Azure Pipelines baseline architecture.

Operation Excellence checklist

Refer to the complete set of recommendations.