Tag: DevOps

  • Azure Policy as Code: How to Govern Cloud Resources at Scale Without Losing Your Mind

    Azure Policy as Code: How to Govern Cloud Resources at Scale Without Losing Your Mind

    If you’ve spent any time managing a non-trivial Azure environment, you’ve probably hit the same wall: things drift. Someone creates a storage account without encryption at rest. A subscription gets spun up without a cost center tag. A VM lands in a region you’re not supposed to use. Manual reviews catch some of it, but not all of it — and by the time you catch it, the problem has already been live for weeks.

    Azure Policy offers a solution, but clicking through the Azure portal to define and assign policies one at a time doesn’t scale. The moment you have more than a handful of subscriptions or a team larger than one person, you need something more disciplined. That’s where Policy as Code (PaC) comes in.

    This guide walks through what Policy as Code means for Azure, how to structure a working repository, the key operational decisions you’ll need to make, and how to wire it all into a CI/CD pipeline so governance is automatic — not an afterthought.


    What “Policy as Code” Actually Means

    The phrase sounds abstract, but the idea is simple: instead of managing your Azure Policies through the portal, you store them in a Git repository as JSON or Bicep files, version-control them like any other infrastructure code, and deploy them through an automated pipeline.

    This matters for several reasons.

    First, Git history becomes your audit trail. Every policy change, every exemption, every assignment — it’s all tracked with who changed it, when, and why (assuming your team writes decent commit messages). The Azure activity log records some of the who and when, but not the why or the review discussion behind a change.

    Second, you can enforce peer review. If someone wants to create a new “allowed locations” policy or relax an existing deny effect, they open a pull request. Your team reviews it before it goes anywhere near production.

    Third, you get consistency across environments. A staging environment governed by a slightly different set of policies than production is a gap waiting to become an incident. Policy as Code makes it easy to parameterize for environment differences without maintaining completely separate policy definitions.

    Structuring Your Policy Repository

    There’s no single right structure, but a layout that has worked well across a variety of team sizes looks something like this:

    azure-policy/
      policies/
        definitions/
          storage-require-https.json
          require-resource-tags.json
          allowed-vm-skus.json
        initiatives/
          security-baseline.json
          tagging-standards.json
      assignments/
        subscription-prod.json
        subscription-dev.json
        management-group-root.json
      exemptions/
        storage-legacy-project-x.json
      scripts/
        deploy.ps1
        test.ps1
      .github/
        workflows/
          policy-deploy.yml

    Policy definitions live in policies/definitions/ — these are the raw policy rule files. Initiatives (policy sets) group related definitions together in policies/initiatives/. Assignments connect initiatives or individual policies to scopes (subscriptions, management groups, resource groups) and live in assignments/. Exemptions are tracked separately so they’re visible and reviewable rather than buried in portal configuration.

    Writing a Solid Policy Definition

    A policy definition file is JSON with a few key sections: displayName, description, mode, parameters, and policyRule. Here’s a practical example — requiring that all storage accounts enforce HTTPS-only traffic:

    {
      "displayName": "Storage accounts should require HTTPS-only traffic",
      "description": "Ensures that all Azure Storage accounts are configured with supportsHttpsTrafficOnly set to true.",
      "mode": "Indexed",
      "parameters": {
        "effect": {
          "type": "String",
          "defaultValue": "Audit",
          "allowedValues": ["Audit", "Deny", "Disabled"]
        }
      },
      "policyRule": {
        "if": {
          "allOf": [
            {
              "field": "type",
              "equals": "Microsoft.Storage/storageAccounts"
            },
            {
              "anyOf": [
                {
                  "field": "Microsoft.Storage/storageAccounts/supportsHttpsTrafficOnly",
                  "exists": "false"
                },
                {
                  "field": "Microsoft.Storage/storageAccounts/supportsHttpsTrafficOnly",
                  "equals": "false"
                }
              ]
            }
          ]
        },
        "then": {
          "effect": "[parameters('effect')]"
        }
      }
    }

    A few design choices worth noting. The effect is parameterized — this lets you assign the same definition with Audit in dev (to surface violations without blocking) and Deny in production (to actively block non-compliant resources). Hardcoding the effect is a common early mistake that forces you to maintain duplicate definitions for different environments.
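    An assignment file then supplies the parameter value for its scope. Here is a minimal sketch, assuming the definition above was deployed under the name storage-require-https; the subscription ID and assignment name are placeholders:

```json
{
  "name": "deny-http-storage-prod",
  "properties": {
    "displayName": "Deny HTTP traffic on storage accounts (production)",
    "policyDefinitionId": "/subscriptions/<sub-id>/providers/Microsoft.Authorization/policyDefinitions/storage-require-https",
    "parameters": {
      "effect": { "value": "Deny" }
    }
  }
}
```

    The dev assignment would be identical apart from its scope and an "Audit" effect value, so both environments share one definition.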

    The mode of Indexed means this policy only evaluates resource types that support tags and location. For policies targeting resource group properties or subscription-level resources, use All instead.

    Grouping Policies into Initiatives

    Individual policy definitions are powerful, but assigning them one at a time to every subscription is tedious and error-prone. Initiatives (also called policy sets) let you bundle related policies and assign the whole bundle at once.

    A tagging standards initiative might group together policies for requiring a cost-center tag, requiring an owner tag, and inheriting tags from the resource group. An initiative like this assigns cleanly at the management group level, propagates down to all subscriptions, and can be updated in one place when your tagging requirements change.
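    As a sketch, an initiative file lists its member definitions by ID. The IDs, reference IDs, and the tagName parameter below are illustrative — they assume a single generic require-resource-tags definition deployed at a management group:

```json
{
  "name": "tagging-standards",
  "properties": {
    "displayName": "Tagging standards",
    "policyDefinitions": [
      {
        "policyDefinitionReferenceId": "require-cost-center-tag",
        "policyDefinitionId": "/providers/Microsoft.Management/managementGroups/<mg-id>/providers/Microsoft.Authorization/policyDefinitions/require-resource-tags",
        "parameters": { "tagName": { "value": "cost-center" } }
      },
      {
        "policyDefinitionReferenceId": "require-owner-tag",
        "policyDefinitionId": "/providers/Microsoft.Management/managementGroups/<mg-id>/providers/Microsoft.Authorization/policyDefinitions/require-resource-tags",
        "parameters": { "tagName": { "value": "owner" } }
      }
    ]
  }
}
```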

    Define your initiatives in a JSON file and reference the policy definitions by their IDs. When you deploy via the pipeline, definitions go up first, then initiatives get built from them, then assignments connect initiatives to scopes — order matters.

    Testing Policies Before They Touch Production

    There are two kinds of pain with policy governance: violations you catch before deployment, and violations you discover after. Policy as Code should maximize the first kind.

    Linting and schema validation can run in your CI pipeline on every pull request. Tools like the Azure Policy VS Code extension or Bicep’s built-in linter catch structural errors before they ever reach Azure.
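    Beyond raw JSON parsing, a CI step can assert your repo's own structural conventions. A minimal sketch in Python — the required keys mirror the example definition earlier in this article, so adjust them to whatever your repo actually enforces:

```python
"""Minimal CI-side structural check for policy definition files.

A sketch, not a full schema validator: it only asserts the keys this
article's example definitions use.
"""
import json
from pathlib import Path

REQUIRED_KEYS = {"displayName", "description", "mode", "policyRule"}


def validate_policy(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the document passes."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - doc.keys())]
    if doc.get("mode") not in ("Indexed", "All"):
        problems.append(f"unexpected mode: {doc.get('mode')!r}")
    rule = doc.get("policyRule", {})
    if not ({"if", "then"} <= rule.keys()):
        problems.append("policyRule must contain 'if' and 'then'")
    return problems


def validate_dir(path: str) -> dict[str, list[str]]:
    """Validate every *.json under path; maps filename -> problems found."""
    return {
        p.name: validate_policy(json.loads(p.read_text()))
        for p in Path(path).rglob("*.json")
    }
```

    Running this on every pull request turns "someone forgot the effect parameter" from a production surprise into a red check on the PR.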

    What-if analysis is available for some deployment scenarios. More practically, deploy to a dedicated governance test subscription first. Assign your policy with Audit effect, then run your compliance scripts and check the compliance report. If expected-compliant resources show as non-compliant, your policy logic has a bug.

    Exemptions are another testing tool — if a specific resource legitimately needs to be excluded from a policy (legacy system, approved exception, temporary dev environment), track that exemption in your repo with a documented justification and expiry date. Exemptions that live only in the portal are invisible and tend to become permanent by accident.
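    A tracked exemption file might look like this sketch — the property names follow Azure's policyExemptions resource shape, while the assignment ID and dates are placeholders:

```json
{
  "name": "storage-legacy-project-x",
  "properties": {
    "displayName": "Project X legacy storage account",
    "description": "Legacy client cannot use HTTPS yet; migration is planned and tracked.",
    "policyAssignmentId": "/subscriptions/<sub-id>/providers/Microsoft.Authorization/policyAssignments/deny-http-storage-prod",
    "exemptionCategory": "Waiver",
    "expiresOn": "2026-06-30T00:00:00Z"
  }
}
```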

    Wiring Policy Deployment into CI/CD

    A minimal GitHub Actions workflow for policy deployment looks something like this:

    name: Deploy Azure Policies
    
    on:
      push:
        branches: [main]
        paths:
          - 'policies/**'
          - 'assignments/**'
          - 'exemptions/**'
      pull_request:
        branches: [main]
    
    jobs:
      validate:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Validate policy JSON
            run: |
              find policies/ assignments/ exemptions/ -name '*.json' -print0 | xargs -0 -I {} python3 -c "import json; json.load(open('{}'))" && echo "All JSON valid"
    
      deploy:
        runs-on: ubuntu-latest
        needs: validate
        if: github.ref == 'refs/heads/main'
        steps:
          - uses: actions/checkout@v4
          - uses: azure/login@v2
            with:
              creds: ${{ secrets.AZURE_CREDENTIALS }}
          - name: Deploy policy definitions
            run: ./scripts/deploy.ps1 -Stage definitions
          - name: Deploy initiatives
            run: ./scripts/deploy.ps1 -Stage initiatives
          - name: Deploy assignments
            run: ./scripts/deploy.ps1 -Stage assignments

    The key pattern: pull requests trigger validation only; merges to main trigger the actual deployment. Add branch protection rules on main so policy changes can’t bypass review by being pushed directly.

    For Azure DevOps shops, the same pattern applies using pipeline YAML with environment gates — require a manual approval before the assignment stage runs in production if your organization needs that extra checkpoint.

    Common Pitfalls Worth Avoiding

    Starting with Deny effects. The first instinct when you see a compliance gap is to block it immediately. Resist this. Start every new policy with Audit for at least two weeks. Let the compliance data show you what’s actually out of compliance before you start blocking things. Blocking before you understand the landscape leads to surprised developers and emergency exemptions.

    Scope creep in initiatives. It’s tempting to build one giant “everything” initiative. Don’t. Break initiatives into logical domains — security baseline, tagging standards, allowed regions, allowed SKUs. Smaller initiatives are easier to update, easier to understand, and easier to exempt selectively when needed.

    Not versioning your initiatives. When you change an initiative — adding a new policy, changing parameters — bump a version in the initiative’s metadata (or its display name) and maintain a changelog. Initiatives that silently change are hard to reason about in compliance reports.

    Forgetting inherited policies. If you’re working in a larger organization where your management group already has policies assigned from above, those assignments interact with yours. Map the existing policy landscape before you assign new policies, especially deny-effect ones, to avoid conflicts or redundant coverage.

    Not cleaning up exemptions. Exemptions with no expiry date live forever. Add an expiry review process — even a simple monthly script that lists exemptions older than 90 days — and review whether they’re still justified.

    Getting Started Without Boiling the Ocean

    If you’re starting from scratch, a practical week-one scope is:

    1. Pick three policies you know you need: require encryption at rest on storage accounts, require tags on resource groups, deny resources in non-approved regions.
    2. Stand up a policy repo with the folder structure above.
    3. Deploy with Audit effect to a dev subscription.
    4. Fix the real violations you find rather than exempting them.
    5. Set up the CI/CD pipeline so future changes require a pull request.

    That scope is small enough to finish and large enough to prove the value. From there, building out a full security baseline initiative and expanding to production becomes a natural next step rather than a daunting project.

    Policy as Code isn’t glamorous, but it’s the difference between a cloud environment that drifts toward chaos and one that stays governable as it grows. The portal will always let you click things in. The question is whether anyone will know what got clicked, why, or whether it’s still correct six months later. Code and version control answer all three.

  • Terraform vs. Bicep vs. Pulumi: How to Choose the Right IaC Tool for Your Azure and Cloud Infrastructure

    Terraform vs. Bicep vs. Pulumi: How to Choose the Right IaC Tool for Your Azure and Cloud Infrastructure

    Why Infrastructure as Code Tool Choice Still Matters in 2026

    Infrastructure as code has been mainstream for years, yet engineering teams still debate which tool to use when they start a new project or migrate an existing environment. Terraform, Bicep, and Pulumi represent three distinct philosophies about how infrastructure should be described, managed, and maintained. Each has earned its place in the ecosystem — and each comes with trade-offs that can make or break a team’s productivity depending on context.

    This guide breaks down the real-world differences between Terraform, Bicep, and Pulumi so you can choose the right tool for your team’s skills, cloud footprint, and long-term operations requirements — rather than defaulting to whatever someone on the team used at their last job.

    Terraform: The Multi-Cloud Standard

    HashiCorp Terraform has been the dominant open-source IaC tool for most of the past decade. It uses a declarative configuration language called HCL (HashiCorp Configuration Language) that reads cleanly and is approachable for practitioners who are not software engineers. Terraform’s provider ecosystem is enormous — covering AWS, Azure, Google Cloud, Kubernetes, GitHub, Cloudflare, Datadog, and hundreds of other platforms in a consistent interface.

    Terraform’s state file model is one of its most consequential design choices. All deployed resources are tracked in a state file that Terraform uses to calculate diffs and plan changes. This makes drift detection and incremental updates precise, but it also means your team needs a reliable remote state backend — usually Azure Blob Storage, AWS S3, or Terraform Cloud — and must handle state locking carefully in team environments. State corruption, while uncommon, is a real operational concern.
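    For Azure teams, the usual backend is a blob container. A minimal sketch of the configuration, with placeholder names throughout (the azurerm backend handles state locking via blob leases):

```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-tfstate"       # placeholder
    storage_account_name = "sttfstatedemo"    # placeholder; must be globally unique
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"
  }
}
```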

    The licensing change HashiCorp made in 2023 — moving Terraform from the Mozilla Public License to the Business Source License (BSL) — prompted the community to fork the project as OpenTofu under the Linux Foundation. By 2026, most enterprises using Terraform have evaluated whether to migrate to OpenTofu or accept the BSL terms. For most teams using Terraform without commercial redistribution, the practical impact is limited, but the shift has added a layer of strategic consideration that was not present before.

    When Terraform Is the Right Choice

    Terraform excels when your organization manages infrastructure across multiple cloud providers and wants a single tool and workflow. Its declarative approach, mature module ecosystem, and broad community support make it the default choice for teams that are not already deeply invested in a specific cloud vendor’s native tooling. If your platform engineers have Terraform experience and your infrastructure spans more than one provider, Terraform (or OpenTofu) is a natural fit.

    Bicep: Azure-Native and Designed for Simplicity

    Bicep is Microsoft’s domain-specific language for deploying Azure resources. It is a declarative language that compiles down to ARM (Azure Resource Manager) JSON templates, which means anything expressible in ARM can be expressed in Bicep — just with dramatically less verbose syntax. Bicep integrates tightly with the Azure CLI, Azure DevOps, and GitHub Actions, and it ships with first-class Visual Studio Code support, including real-time type checking, autocomplete, and inline documentation.
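    To give a feel for the syntax, here is a minimal storage account resource sketched in Bicep with illustrative names; the equivalent raw ARM JSON runs several times longer:

```bicep
param location string = resourceGroup().location

resource storage 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: 'stexampledata001' // placeholder; storage account names must be globally unique
  location: location
  sku: { name: 'Standard_LRS' }
  kind: 'StorageV2'
  properties: {
    supportsHttpsTrafficOnly: true
  }
}
```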

    One of Bicep’s most underappreciated advantages is that it has no external state file. Azure Resource Manager itself is the state store — Azure tracks what was deployed and what it should look like, so there is no separate file to manage or corruption to recover from. For teams that operate exclusively in Azure and want the lowest possible infrastructure overhead, this is a meaningful operational simplification.

    Bicep is also the tool Microsoft recommends for Azure Policy assignments, deployment stacks, and subscription-level deployments. If your team is already using Azure DevOps and managing Azure subscriptions as the primary cloud environment, Bicep’s deep integration with the Azure toolchain reduces the number of moving parts in your CI/CD pipeline.

    When Bicep Is the Right Choice

    Bicep is the clear winner when your organization is Azure-only or Azure-primary and your team wants the closest possible alignment with Microsoft’s supported tooling and roadmap. It requires no third-party toolchain to manage, no state backend to configure, and no provider versions to pin. For organizations subject to strict software supply chain requirements or those that prefer to minimize external open-source dependencies in production tooling, Bicep’s native Microsoft support is a genuine advantage.

    Pulumi: Infrastructure as Real Code

    Pulumi takes a different approach from both Terraform and Bicep: it lets you define infrastructure using general-purpose programming languages — TypeScript, Python, Go, C#, and Java (a simplified YAML option also exists for declarative use cases). Rather than learning a configuration language, engineers write infrastructure definitions using the same language patterns, testing frameworks, and IDE tooling they use for application code. This makes Pulumi particularly compelling for platform engineering teams with strong software development backgrounds who want to apply standard software engineering practices — unit tests, code reuse, abstraction patterns — to infrastructure code.
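    A sketch of what that looks like in practice, assuming the @pulumi/azure-native SDK (resource names are illustrative, and the program runs inside the Pulumi engine rather than as a standalone script):

```typescript
import * as azure from "@pulumi/azure-native";

const rg = new azure.resources.ResourceGroup("platform-rg");

// Ordinary language constructs replace copy-pasted declarative blocks:
// one storage account per environment, driven by a plain array.
for (const env of ["dev", "staging", "prod"]) {
  new azure.storage.StorageAccount(`st${env}`, {
    resourceGroupName: rg.name,
    sku: { name: "Standard_LRS" },
    kind: "StorageV2",
    enableHttpsTrafficOnly: true,
  });
}
```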

    Pulumi uses its own state management system, which can be hosted in Pulumi Cloud (the managed SaaS offering) or self-hosted in a cloud storage bucket. Like Terraform, Pulumi tracks resource state explicitly, which enables precise drift detection and update planning. The Pulumi Automation API is a standout feature: it allows teams to embed infrastructure deployments directly into their own applications and scripts without shelling out to the Pulumi CLI, enabling sophisticated orchestration scenarios that are difficult to achieve with declarative-only tools.

    The trade-off with Pulumi is that the expressiveness of a general-purpose language cuts both ways. Teams with disciplined engineering practices will find Pulumi enables clean, testable, maintainable infrastructure code. Teams with less structure may produce infrastructure that is harder to read and audit than equivalent Terraform HCL — especially for operators who are not comfortable with the chosen language. Code review complexity scales with language complexity.

    When Pulumi Is the Right Choice

    Pulumi shines for platform engineering teams building internal developer platforms, composable infrastructure abstractions, or complex multi-cloud environments where the expressiveness of a real programming language delivers a genuine productivity advantage. It is also a natural fit when the same team is responsible for both application and infrastructure code and wants to apply consistent engineering practices across both. If your team is already writing TypeScript or Python and wants infrastructure that lives alongside application code with the same testing and review workflows, Pulumi is worth serious evaluation.

    Side-by-Side: Key Differences That Should Influence Your Decision

    Understanding the practical distinctions across a few key dimensions makes the trade-offs clearer:

    • Cloud scope: Terraform and Pulumi support multiple cloud providers; Bicep is Azure-only.
    • State management: Bicep uses Azure as the implicit state store. Terraform and Pulumi require explicit state backend configuration.
    • Language: Terraform uses HCL; Bicep uses a purpose-built DSL; Pulumi uses TypeScript, Python, Go, C#, or Java.
    • Testing: Pulumi offers the richest native testing story using standard language test frameworks. Terraform supports unit and integration testing via the testing framework added in 1.6. Bicep testing relies primarily on Azure deployment validation and Pester-based test scripts.
    • Community and ecosystem: Terraform has the largest existing module ecosystem. Pulumi has growing component libraries. Bicep relies on Azure-maintained modules and the Bicep registry.
    • Licensing: Bicep is MIT-licensed. Pulumi is Apache 2.0. Terraform is BSL post-1.5; OpenTofu is MPL 2.0.

    Migration and Adoption Considerations

    Switching IaC tools mid-project carries real risk and cost. Before committing to a tool, consider how your existing infrastructure was provisioned, what your team already knows, and what your CI/CD pipeline currently supports.

    Terraform can import existing Azure resources with terraform import or the newer import block syntax introduced in Terraform 1.5. Bicep supports ARM template decompilation to bootstrap Bicep files from existing deployments. Pulumi offers import commands and a pulumi convert utility that can translate Terraform HCL into Pulumi programs in supported languages, which meaningfully reduces the migration cost for teams moving from Terraform.
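    The 1.5+ import block turns adoption into a plan-reviewable change rather than a one-off CLI command. A sketch with placeholder names and IDs:

```hcl
# Adopt an existing resource group into Terraform management.
import {
  to = azurerm_resource_group.legacy
  id = "/subscriptions/<sub-id>/resourceGroups/rg-legacy"
}

resource "azurerm_resource_group" "legacy" {
  name     = "rg-legacy"
  location = "westeurope"
}
```

    Running terraform plan then shows the import alongside any configuration drift before anything is committed to state.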

    For greenfield projects, the choice is mostly about team skills and strategic direction. For existing environments, assess the cost of migrating state, rewriting definitions, and retraining the team against the benefits of the target tool before committing.

    The Honest Recommendation

    There is no universally correct answer here — which is exactly why this debate persists in engineering teams across the industry. The decision should be driven by three questions: What cloud providers do you need to manage? What skills does your team already have? And what level of infrastructure-as-software sophistication does your use case actually require?

    If you manage multiple clouds and want a proven, widely understood tool with a massive community, use Terraform or OpenTofu. If you are Azure-focused and want Microsoft-supported simplicity with zero external state management, use Bicep. If your team is software-engineering-first and wants to apply proper software development practices to infrastructure — unit tests, abstraction, automation APIs — give Pulumi a serious look.

    All three tools are production-ready, actively maintained, and used successfully by engineering teams at scale. The right choice is the one your team will actually use well.