Tag: Azure Policy

  • Azure Policy as Code: How to Govern Cloud Resources at Scale Without Losing Your Mind

    If you’ve spent any time managing a non-trivial Azure environment, you’ve probably hit the same wall: things drift. Someone creates a storage account that still accepts unencrypted HTTP traffic. A subscription gets spun up without a cost center tag. A VM lands in a region you’re not supposed to use. Manual reviews catch some of it, but not all of it, and by the time you catch it, the problem has already been live for weeks.

    Azure Policy offers a solution, but clicking through the Azure portal to define and assign policies one at a time doesn’t scale. The moment you have more than a handful of subscriptions or a team larger than one person, you need something more disciplined. That’s where Policy as Code (PaC) comes in.

    This guide walks through what Policy as Code means for Azure, how to structure a working repository, the key operational decisions you’ll need to make, and how to wire it all into a CI/CD pipeline so governance is automatic — not an afterthought.


    What “Policy as Code” Actually Means

    The phrase sounds abstract, but the idea is simple: instead of managing your Azure Policies through the portal, you store them in a Git repository as JSON or Bicep files, version-control them like any other infrastructure code, and deploy them through an automated pipeline.

    This matters for several reasons.

    First, Git history becomes your audit trail. Every policy change, every exemption, every assignment is tracked with who changed it, when, and why (assuming your team writes decent commit messages). The portal’s activity log can tell you who changed what and when, but not the reasoning behind the change or who reviewed it.

    Second, you can enforce peer review. If someone wants to create a new “allowed locations” policy or relax an existing deny effect, they open a pull request. Your team reviews it before it goes anywhere near production.

    Third, you get consistency across environments. A staging environment governed by a slightly different set of policies than production is a gap waiting to become an incident. Policy as Code makes it easy to parameterize for environment differences without maintaining completely separate policy definitions.

    Structuring Your Policy Repository

    There’s no single right structure, but a layout that has worked well across a variety of team sizes looks something like this:

    azure-policy/
      policies/
        definitions/
          storage-require-https.json
          require-resource-tags.json
          allowed-vm-skus.json
        initiatives/
          security-baseline.json
          tagging-standards.json
      assignments/
        subscription-prod.json
        subscription-dev.json
        management-group-root.json
      exemptions/
        storage-legacy-project-x.json
      scripts/
        deploy.ps1
        test.ps1
      .github/
        workflows/
          policy-deploy.yml

    Policy definitions live in policies/definitions/ — these are the raw policy rule files. Initiatives (policy sets) group related definitions together in policies/initiatives/. Assignments connect initiatives or individual policies to scopes (subscriptions, management groups, resource groups) and live in assignments/. Exemptions are tracked separately so they’re visible and reviewable rather than buried in portal configuration.

    Writing a Solid Policy Definition

    A policy definition file is JSON with a few key sections: displayName, description, mode, parameters, and policyRule. Here’s a practical example — requiring that all storage accounts enforce HTTPS-only traffic:

    {
      "displayName": "Storage accounts should require HTTPS-only traffic",
      "description": "Ensures that all Azure Storage accounts are configured with supportsHttpsTrafficOnly set to true.",
      "mode": "Indexed",
      "parameters": {
        "effect": {
          "type": "String",
          "defaultValue": "Audit",
          "allowedValues": ["Audit", "Deny", "Disabled"]
        }
      },
      "policyRule": {
        "if": {
          "allOf": [
            {
              "field": "type",
              "equals": "Microsoft.Storage/storageAccounts"
            },
            {
              "field": "Microsoft.Storage/storageAccounts/supportsHttpsTrafficOnly",
              "notEquals": true
            }
          ]
        },
        "then": {
          "effect": "[parameters('effect')]"
        }
      }
    }

    A few design choices worth noting. The effect is parameterized — this lets you assign the same definition with Audit in dev (to surface violations without blocking) and Deny in production (to actively block non-compliant resources). Hardcoding the effect is a common early mistake that forces you to maintain duplicate definitions for different environments.

    The mode of Indexed means this policy only evaluates resource types that support tags and location. For policies targeting resource group properties or subscription-level resources, use All instead.
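
    The two design choices above can be checked automatically on every pull request. What follows is a minimal sketch, assuming definition files use the flat layout shown earlier; the function name lint_definition and the sample dict are illustrative, not part of any Azure tooling.

```python
def lint_definition(definition: dict) -> list[str]:
    """Return human-readable problems found in one policy definition dict."""
    problems = []
    # The five sections named in the text above.
    for key in ("displayName", "description", "mode", "parameters", "policyRule"):
        if key not in definition:
            problems.append(f"missing section: {key}")
    # A parameterized effect looks like "[parameters('effect')]".
    effect = definition.get("policyRule", {}).get("then", {}).get("effect")
    if not (isinstance(effect, str) and effect.startswith("[parameters(")):
        problems.append(f"effect {effect!r} is hardcoded; use a parameter instead")
    return problems


# Hypothetical definition with a fixed effect, used only to exercise the check.
hardcoded = {
    "displayName": "Example",
    "description": "Example definition with a fixed effect",
    "mode": "Indexed",
    "parameters": {},
    "policyRule": {
        "if": {"field": "type", "equals": "Microsoft.Storage/storageAccounts"},
        "then": {"effect": "Deny"},
    },
}
for problem in lint_definition(hardcoded):
    print(problem)
```

    A check like this is cheap enough to run alongside JSON validation, and it catches the hardcoded-effect mistake before anyone has to maintain duplicate definitions.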

    Grouping Policies into Initiatives

    Individual policy definitions are powerful, but assigning them one at a time to every subscription is tedious and error-prone. Initiatives (also called policy sets) let you bundle related policies and assign the whole bundle at once.

    A tagging standards initiative might group together policies for requiring a cost-center tag, requiring an owner tag, and inheriting tags from the resource group. An initiative like this assigns cleanly at the management group level, propagates down to all subscriptions, and can be updated in one place when your tagging requirements change.

    Define your initiatives in a JSON file and reference the policy definitions by their IDs. When you deploy via the pipeline, definitions go up first, then initiatives get built from them, then assignments connect initiatives to scopes — order matters.
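
    As a rough illustration of referencing definitions by ID, the sketch below assembles an initiative body from definition names. It assumes the custom definitions were deployed at a management group (the name contoso-root is hypothetical) and that every member definition exposes an effect parameter; build_initiative and DEPLOY_ORDER are illustrative names.

```python
import json

# Stage order described above: definitions, then initiatives, then assignments.
DEPLOY_ORDER = ["definitions", "initiatives", "assignments"]


def build_initiative(display_name: str, definition_names: list[str], management_group: str) -> dict:
    """Build an initiative (policy set) body that references definitions by resource ID."""
    scope = f"/providers/Microsoft.Management/managementGroups/{management_group}"
    members = []
    for name in definition_names:
        members.append({
            "policyDefinitionReferenceId": name,
            "policyDefinitionId": f"{scope}/providers/Microsoft.Authorization/policyDefinitions/{name}",
            # Pass the initiative-level effect through to each member definition.
            "parameters": {"effect": {"value": "[parameters('effect')]"}},
        })
    return {
        "displayName": display_name,
        "parameters": {"effect": {"type": "String", "defaultValue": "Audit"}},
        "policyDefinitions": members,
    }


initiative = build_initiative(
    "Security baseline",
    ["storage-require-https", "allowed-vm-skus"],
    "contoso-root",  # hypothetical management group name
)
print(json.dumps(initiative, indent=2))
```

    Because the IDs only resolve once the definitions exist, the initiative stage has to run after the definitions stage, which is why the order is fixed.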

    Testing Policies Before They Touch Production

    There are two kinds of pain with policy governance: violations you catch before deployment, and violations you discover after. Policy as Code should maximize the first kind.

    Linting and schema validation can run in your CI pipeline on every pull request. Tools like the Azure Policy VS Code extension or Bicep’s built-in linter catch structural errors before they ever reach Azure.

    What-if analysis can preview the resource changes a template deployment would make, but it does not show how a new policy will evaluate resources that already exist. More practically, deploy to a dedicated governance test subscription first. Assign your policy with the Audit effect, trigger a compliance scan (for example with az policy state trigger-scan), and check the compliance report. If resources you expect to be compliant show as non-compliant, your policy logic has a bug.
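
    The comparison between expected and reported compliance can be scripted. This is a sketch under the assumption that compliance records have been exported as a list of dicts with resourceId and complianceState keys; the resource IDs below are made up for illustration.

```python
def find_false_positives(expected_compliant: set[str], compliance_states: list[dict]) -> list[str]:
    """Return resource IDs that were expected to pass but are reported NonCompliant."""
    flagged = {
        state["resourceId"].lower()
        for state in compliance_states
        if state.get("complianceState") == "NonCompliant"
    }
    # Resource IDs are compared case-insensitively.
    return sorted(rid.lower() for rid in expected_compliant if rid.lower() in flagged)


# Hypothetical test fixtures: one storage account built to pass, one built to fail.
GOOD = "/subscriptions/0000/resourcegroups/rg-policy-test/providers/microsoft.storage/storageaccounts/stgood"
BAD = "/subscriptions/0000/resourcegroups/rg-policy-test/providers/microsoft.storage/storageaccounts/stbad"

states = [
    {"resourceId": GOOD, "complianceState": "NonCompliant"},
    {"resourceId": BAD, "complianceState": "NonCompliant"},
]
for resource_id in find_false_positives({GOOD}, states):
    print(f"expected compliant, reported non-compliant: {resource_id}")
```

    Any output from this check points at the policy rule rather than at the resource, since the fixture was deliberately built to comply.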

    Exemptions belong in the same review loop. If a specific resource legitimately needs to be excluded from a policy (legacy system, approved exception, temporary dev environment), track that exemption in your repo with a documented justification and expiry date. Exemptions that live only in the portal are invisible and tend to become permanent by accident.
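
    A pull-request check can refuse exemption files that lack a justification or an expiry date. The sketch below assumes each file in exemptions/ is a flat JSON object using the exemption property names description, exemptionCategory, and expiresOn; the file layout and the check_exemption helper are assumptions, not something the repo structure above prescribes.

```python
from datetime import datetime, timezone


def check_exemption(exemption: dict, now: datetime) -> list[str]:
    """Return problems that should block an exemption file from merging."""
    problems = []
    if not (exemption.get("description") or "").strip():
        problems.append("missing justification (description)")
    if exemption.get("exemptionCategory") not in ("Waiver", "Mitigated"):
        problems.append("exemptionCategory must be Waiver or Mitigated")
    expires = exemption.get("expiresOn")
    if not expires:
        problems.append("missing expiresOn")
    else:
        parsed = datetime.fromisoformat(expires.replace("Z", "+00:00"))
        if parsed.tzinfo is None:
            # Treat timestamps without an offset as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        if parsed <= now:
            problems.append("expiresOn is already in the past")
    return problems


# Hypothetical exemption content for illustration.
sample = {
    "description": "Legacy project X storage migrates off HTTP in Q3",
    "exemptionCategory": "Waiver",
    "expiresOn": "2025-09-30T00:00:00Z",
}
print(check_exemption(sample, datetime(2025, 1, 1, tzinfo=timezone.utc)))
```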

    Wiring Policy Deployment into CI/CD

    A minimal GitHub Actions workflow for policy deployment looks something like this:

    name: Deploy Azure Policies
    
    on:
      push:
        branches: [main]
        paths:
          - 'policies/**'
          - 'assignments/**'
          - 'exemptions/**'
      pull_request:
        branches: [main]
    
    jobs:
      validate:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Validate policy JSON
            # Parse every JSON file in all three deployed folders; xargs exits non-zero if any file fails.
            run: |
              find policies assignments exemptions -name '*.json' -print0 \
                | xargs -0 -r -n1 python3 -m json.tool > /dev/null && echo "All JSON valid"
    
      deploy:
        runs-on: ubuntu-latest
        needs: validate
        # Deploy only on pushes to main; pull requests stop after validation.
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
        steps:
          - uses: actions/checkout@v4
          - uses: azure/login@v2
            with:
              creds: ${{ secrets.AZURE_CREDENTIALS }}
          # deploy.ps1 is a PowerShell script, so each step runs under pwsh instead of the default bash shell.
          - name: Deploy policy definitions
            shell: pwsh
            run: ./scripts/deploy.ps1 -Stage definitions
          - name: Deploy initiatives
            shell: pwsh
            run: ./scripts/deploy.ps1 -Stage initiatives
          - name: Deploy assignments
            shell: pwsh
            run: ./scripts/deploy.ps1 -Stage assignments

    The key pattern: pull requests trigger validation only. Merges to main trigger the actual deployment. Policy changes that bypass review by going directly to main can be prevented with branch protection rules.

    For Azure DevOps shops, the same pattern applies using pipeline YAML with environment gates — require a manual approval before the assignment stage runs in production if your organization needs that extra checkpoint.

    Common Pitfalls Worth Avoiding

    Starting with Deny effects. The first instinct when you see a compliance gap is to block it immediately. Resist this. Start every new policy with Audit for at least two weeks. Let the compliance data show you what’s actually out of compliance before you start blocking things. Blocking before you understand the landscape leads to surprised developers and emergency exemptions.

    Scope creep in initiatives. It’s tempting to build one giant “everything” initiative. Don’t. Break initiatives into logical domains — security baseline, tagging standards, allowed regions, allowed SKUs. Smaller initiatives are easier to update, easier to understand, and easier to exempt selectively when needed.

    Not versioning your initiatives. When you change an initiative by adding a new policy or changing parameters, bump a version number, either in the display name or in a metadata version field as the built-in definitions do, and maintain a changelog in the repo. Initiatives that silently change are hard to reason about in compliance reports.

    Forgetting inherited policies. If you’re working in a larger organization where your management group already has policies assigned from above, those assignments interact with yours. Map the existing policy landscape before you assign new policies, especially deny-effect ones, to avoid conflicts or redundant coverage.

    Not cleaning up exemptions. Exemptions with no expiry date live forever. Add an expiry review process — even a simple monthly script that lists exemptions older than 90 days — and review whether they’re still justified.
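
    The monthly review script mentioned above could be as small as the following sketch. It assumes exemptions have been exported as dicts with name and createdAt keys (ISO 8601 timestamps with an offset); both key names and the sample records are illustrative.

```python
from datetime import datetime, timedelta, timezone


def stale_exemptions(exemptions: list[dict], now: datetime, max_age_days: int = 90) -> list[dict]:
    """Return exemptions created more than max_age_days ago, oldest first."""
    cutoff = now - timedelta(days=max_age_days)
    stale = [e for e in exemptions if datetime.fromisoformat(e["createdAt"]) < cutoff]
    return sorted(stale, key=lambda e: datetime.fromisoformat(e["createdAt"]))


# Hypothetical export used to exercise the report.
exported = [
    {"name": "storage-legacy-project-x", "createdAt": "2025-01-15T00:00:00+00:00"},
    {"name": "sandbox-public-ip", "createdAt": "2025-05-20T00:00:00+00:00"},
]
for exemption in stale_exemptions(exported, datetime(2025, 6, 1, tzinfo=timezone.utc)):
    print(f"review needed: {exemption['name']} (created {exemption['createdAt']})")
```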

    Getting Started Without Boiling the Ocean

    If you’re starting from scratch, a practical week-one scope is:

    1. Pick three policies you know you need: require HTTPS-only traffic on storage accounts, require tags on resource groups, deny resources in non-approved regions.
    2. Stand up a policy repo with the folder structure above.
    3. Deploy with Audit effect to a dev subscription.
    4. Fix the real violations you find rather than exempting them.
    5. Set up the CI/CD pipeline so future changes require a pull request.

    That scope is small enough to finish and large enough to prove the value. From there, building out a full security baseline initiative and expanding to production becomes a natural next step rather than a daunting project.

    Policy as Code isn’t glamorous, but it’s the difference between a cloud environment that drifts toward chaos and one that stays governable as it grows. The portal will always let you click things in. The question is whether anyone will know what got clicked, why, or whether it’s still correct six months later. Code and version control answer all three.

  • How to Use Azure Policy Exemptions for AI Workloads Without Turning Guardrails Into Suggestions

    Azure Policy is one of the cleanest ways to keep AI platform standards from drifting across subscriptions, resource groups, and experiments. The trouble starts when delivery pressure collides with those standards. A team needs to test a model deployment, wire up networking differently, or get around a policy conflict for one sprint, and suddenly the word exemption starts sounding like a productivity feature instead of a risk decision.

    That is where mature teams separate healthy flexibility from policy theater. Exemptions are not a failure of governance. They are a governance mechanism. The problem is not that exemptions exist. The problem is when they are created without scope, without evidence, and without a path back to compliance.

    Exemptions Should Explain Why the Policy Is Not Being Met Yet

    A useful exemption starts with a precise reason. Maybe a vendor dependency has not caught up with private networking requirements. Maybe an internal AI sandbox needs a temporary resource shape that conflicts with the normal landing zone baseline. Maybe an engineering team is migrating from one pattern to another and needs a narrow bridge period. Those are all understandable situations.

    What does not age well is a vague exemption that effectively says, “we needed this to work.” If the request cannot clearly explain the delivery blocker, the affected control, and the expected end state, it is not ready. Teams should have to articulate why the policy is temporarily impractical, not merely inconvenient.

    Scope the Exception Smaller Than the Team First Wants

    The easiest way to make exemptions dangerous is to grant them at a broad scope. A subscription-wide exemption for one AI prototype often becomes a quiet permission slip for unrelated workloads later. Strong governance teams default to the smallest scope that solves the real problem, whether that is one resource group, one policy assignment, or one short-lived deployment path.

    This matters even more for AI environments because platform patterns spread quickly. If one permissive exemption sits in the wrong place, future projects may inherit it by accident and call that reuse. Tight scoping keeps an unusual decision from becoming a silent architecture standard.
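
    One way to make scope breadth visible during review is to classify the requested scope from its resource ID. The sketch below relies only on the standard Azure resource ID shapes; the function names and the rule that subscription-level and management-group-level scopes need extra review are assumptions for illustration.

```python
def scope_level(scope: str) -> str:
    """Classify an Azure scope string by how broad it is."""
    parts = [p for p in scope.strip("/").split("/") if p]
    lowered = [p.lower() for p in parts]
    if len(parts) == 4 and lowered[:3] == ["providers", "microsoft.management", "managementgroups"]:
        return "management group"
    if lowered[:1] == ["subscriptions"]:
        if len(parts) == 2:
            return "subscription"
        if len(parts) == 4 and lowered[2] == "resourcegroups":
            return "resource group"
        if len(parts) > 4:
            return "resource"
    return "unknown"


def needs_extra_review(scope: str) -> bool:
    """Flag exemption scopes broader than a single resource group."""
    return scope_level(scope) in {"management group", "subscription"}


# Hypothetical scopes for illustration.
for requested in (
    "/subscriptions/0000",
    "/subscriptions/0000/resourceGroups/rg-ai-prototype",
):
    print(requested, "->", scope_level(requested), "| extra review:", needs_extra_review(requested))
```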

    Every Exemption Needs an Owner and an Expiration Date

    An exemption without an owner is just deferred accountability. Someone specific should be responsible for the risk, the follow-up work, and the retirement plan. That owner does not have to be the person clicking approve in Azure, but it should be the person who can drive remediation when the temporary state needs to end.

    Expiration matters for the same reason. A surprising number of “temporary” governance decisions stay alive because nobody created the forcing function to revisit them. If the exemption is still needed later, it can be renewed with updated evidence. What should not happen is an open-ended exception drifting into permanent policy decay.

    Document the Compensating Controls, Not Just the Deviation

    A good exemption request does more than identify the broken rule. It explains what will reduce risk while the rule is not being met. If an AI workload cannot use the preferred network control yet, perhaps access is restricted through another boundary. If a logging standard cannot be implemented immediately, perhaps the team adds manual review, temporary alerting, or narrower exposure until the full control lands.

    This is where governance becomes practical instead of theatrical. Leaders do not need a perfect environment on day one. They need evidence that the team understands the tradeoff and has chosen deliberate safeguards while the gap exists.

    Review Exemptions as a Portfolio, Not One Ticket at a Time

    Individual exemptions can look reasonable in isolation while creating a weak platform in aggregate. One allows broad outbound access, another delays tagging, another bypasses a deployment rule, and another weakens log retention. Each request sounds manageable. Together they can tell you that a supposedly governed AI platform is running mostly on exceptions.

    That is why a periodic exemption review matters. Security, platform, and cloud governance leads should look for clusters, aging exceptions, repeat patterns, and teams that keep hitting the same friction point. Sometimes the answer is to retire the exemption. Sometimes the answer is to improve the policy design because the platform standard is clearly out of sync with real work.
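
    A portfolio review can start from a simple count of exemptions per policy and per team. This sketch assumes each exemption record carries team and policy fields, which are hypothetical names for whatever your tracking data actually uses; the threshold of three is arbitrary.

```python
from collections import Counter


def exemption_clusters(exemptions: list[dict], threshold: int = 3) -> dict:
    """Return policies and teams that account for at least `threshold` exemptions."""
    by_policy = Counter(e["policy"] for e in exemptions)
    by_team = Counter(e["team"] for e in exemptions)
    return {
        "policies": {name: n for name, n in by_policy.items() if n >= threshold},
        "teams": {name: n for name, n in by_team.items() if n >= threshold},
    }


# Hypothetical portfolio used to exercise the grouping.
portfolio = [
    {"team": "vision", "policy": "deny-public-ip"},
    {"team": "vision", "policy": "require-tags"},
    {"team": "nlp", "policy": "deny-public-ip"},
    {"team": "search", "policy": "deny-public-ip"},
    {"team": "vision", "policy": "log-retention"},
]
print(exemption_clusters(portfolio))
```

    A policy that clusters across teams suggests the standard is out of sync with real work; a team that clusters across policies suggests a workload that needs a closer look.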

    Final Takeaway

    Azure Policy exemptions are not the enemy of governance. Unbounded exemptions are. When an exception is narrow, time-limited, owned, and backed by compensating controls, it helps serious teams ship without pretending standards are frictionless. When it is broad, vague, and forgotten, it turns guardrails into suggestions.

    The right goal is not “no exemptions ever.” The goal is making every exemption look temporary on purpose and defensible under review.

  • How to Use Azure Policy to Keep AI Sandbox Subscriptions From Becoming Production Backdoors

    AI teams often start in a sandbox subscription for the right reasons. They want to experiment quickly, compare models, test retrieval flows, and try new automation patterns without waiting for every enterprise control to be polished. The problem is that many sandboxes quietly accumulate permanent exceptions. A temporary test environment gets a broad managed identity, a permissive network path, a storage account full of copied data, and a deployment template that nobody ever revisits. A few months later, the sandbox is still labeled non-production, but it has become one of the easiest ways to reach production-adjacent systems.

    Azure Policy is one of the best tools for stopping that drift before it becomes normal. Used well, it gives platform teams a way to define what is allowed in AI sandbox subscriptions, what must be tagged and documented, and what should be blocked outright. It does not replace identity design, network controls, or human approval. What it does provide is a practical way to enforce the baseline rules that keep an experimental environment from turning into a permanent loophole.

    Why AI Sandboxes Drift Faster Than Other Cloud Environments

    Most sandbox subscriptions are created to remove friction. That is exactly why they become risky. Teams add resources quickly, often with broad permissions and short-term workarounds, because speed is the point. In AI projects, this problem gets worse because experimentation often crosses several control domains at once. A single proof of concept may involve model endpoints, storage, search indexes, document ingestion, secret retrieval, notebooks, automation accounts, and outbound integrations.

    If there is no policy guardrail, each convenience decision feels harmless on its own. Over time, though, the subscription starts to behave like a shadow platform. It may contain production-like data, long-lived service principals, public endpoints, or copy-pasted network rules that were never meant to survive the pilot stage. At that point, calling it a sandbox is mostly a naming exercise.

    Start by Defining What a Sandbox Is Allowed to Be

    Before writing policy assignments, define the operating intent of the subscription. A sandbox is not simply a smaller production environment. It is a place for bounded experimentation. That means its controls should be designed around expiration, isolation, and reduced blast radius.

    For example, you might decide that an AI sandbox subscription may host temporary model experiments, retrieval prototypes, and internal test applications, but it may not store regulated data, create public IP addresses without exception review, peer directly into production virtual networks, or run identities with tenant-wide privileges. Azure Policy works best after those boundaries are explicit. Without that clarity, teams usually end up writing rules that are either too weak to matter or so broad that engineers immediately look for ways around them.

    Use Deny Policies for the Few Things That Should Never Be Normal

    The strongest Azure Policy effect is `deny`, and it should be used carefully. If you try to deny everything interesting, developers will hate the environment and the policy set will collapse under exception pressure. The better approach is to reserve deny policies for the patterns that should never become routine in an AI sandbox.

    Good examples include blocking deployments to unsupported regions, blocking unrestricted public IP deployment, and disallowing resource types that create uncontrolled paths to sensitive systems. You can also deny deployments that are missing required tags such as data classification, owner, expiration date, and business purpose. These controls are useful because they stop the easiest forms of drift at creation time instead of relying on cleanup later.
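
    To show what a missing-tag rule might look like, the sketch below generates the policyRule section from a list of tag names. The tag keys DataClassification, Owner, ExpiresOn, and BusinessPurpose are assumed spellings of the tags named above, and require_tags_rule is an illustrative helper; the effect stays parameterized so the same rule can run as audit before it runs as deny.

```python
import json


def require_tags_rule(tag_names: list[str]) -> dict:
    """Build a policyRule that matches any resource missing one of the given tags."""
    return {
        "if": {
            # anyOf: a single missing tag is enough to trigger the effect.
            "anyOf": [
                {"field": f"tags['{name}']", "exists": "false"} for name in tag_names
            ]
        },
        "then": {"effect": "[parameters('effect')]"},
    }


# Assumed tag key spellings; adjust to your own tagging standard.
rule = require_tags_rule(["DataClassification", "Owner", "ExpiresOn", "BusinessPurpose"])
print(json.dumps(rule, indent=2))
```

    Tag names that themselves contain an apostrophe would need escaping, which this sketch does not handle.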

    Use Audit and Modify to Improve Behavior Without Freezing Experimentation

    Not every control belongs in a hard block. Some are better handled with `audit`, `auditIfNotExists`, or `modify`. Those effects help teams see drift and correct it while still leaving room for legitimate testing. In AI sandbox subscriptions, this is especially helpful for operational hygiene.

    For instance, you can audit whether diagnostic settings are enabled, whether Key Vault purge protection is enabled, whether storage accounts restrict public access, or whether approved tags are present on inherited resources. The `modify` effect can automatically add or normalize tags when the fix is straightforward. That gives engineers useful feedback without turning every experiment into a support ticket.

    Treat Network Exposure as a Policy Question, Not Just a Security Review Question

    AI teams often focus on model quality first and treat network design as something to revisit later. That is how sandbox environments end up with public endpoints, broad firewall exceptions, and test services that are reachable from places they should never be reachable from.

    Azure Policy can help force the right conversation earlier. You can use it to restrict which SKUs, networking modes, or public access settings are allowed for storage, databases, and other supporting services. You can also audit or deny resources that are created outside approved network patterns. This matters because many AI risks do not come from the model itself. They come from the surrounding infrastructure that moves prompts, files, embeddings, and results across environments with too little friction.

    Require Expiration Signals So Temporary Environments Actually Expire

    One of the most practical sandbox controls is also one of the least glamorous: require an expiration tag and enforce follow-up around it. Temporary environments rarely disappear on their own. They survive because nobody is clearly accountable for cleaning them up, and because the original test work slowly becomes an unofficial dependency.

    A policy initiative can require tags such as `ExpiresOn`, `Owner`, and `WorkloadStage`, then pair those tags with reporting or automation outside Azure Policy. The value here is not the tag itself. The value is that a sandbox subscription becomes legible. Reviewers can quickly see whether a deployment still has a business reason to exist, and platform teams can spot old experiments before they turn into permanent access paths.
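
    The reporting side that sits outside Azure Policy can be sketched in a few lines. This example assumes a resource inventory has already been exported as dicts with id and tags keys, and that `ExpiresOn` values use the YYYY-MM-DD format; both are assumptions, as are the sample resource names.

```python
from datetime import date


def expired_resources(resources: list[dict], today: date) -> list[tuple[str, str]]:
    """Return (resource id, reason) pairs for missing, unreadable, or past ExpiresOn tags."""
    findings = []
    for resource in resources:
        raw = (resource.get("tags") or {}).get("ExpiresOn")
        if raw is None:
            findings.append((resource["id"], "missing ExpiresOn tag"))
            continue
        try:
            expires = date.fromisoformat(raw)
        except ValueError:
            findings.append((resource["id"], f"unparseable ExpiresOn value {raw!r}"))
            continue
        if expires < today:
            findings.append((resource["id"], f"expired on {expires.isoformat()}"))
    return findings


# Hypothetical inventory for illustration.
inventory = [
    {"id": "search-poc", "tags": {"ExpiresOn": "2025-03-31", "Owner": "nlp"}},
    {"id": "embedding-cache", "tags": {"ExpiresOn": "2025-12-31", "Owner": "vision"}},
    {"id": "notebook-vm", "tags": None},
    {"id": "scratch-storage", "tags": {"ExpiresOn": "next quarter"}},
]
for resource_id, reason in expired_resources(inventory, date(2025, 6, 1)):
    print(f"{resource_id}: {reason}")
```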

    Keep Exceptions Visible and Time Bound

    Every policy program eventually needs exceptions. The mistake is treating exceptions as invisible administrative work instead of as security-relevant decisions. In AI environments, exceptions often involve high-impact shortcuts such as broader outbound access, looser identity permissions, or temporary access to sensitive datasets.

    If you grant an exception, record why it exists, who approved it, what resources it covers, and when it should end. Even if Azure Policy itself is not the system of record for exception governance, your policy model should assume that exceptions are time-bound and reviewable. Otherwise the exception process becomes a slow-motion replacement for the standard.

    Build Policy Sets Around Real AI Platform Patterns

    The cleanest policy design usually comes from grouping controls into a small number of understandable initiatives instead of dumping dozens of unrelated rules into one assignment. For AI sandbox subscriptions, that often means separating controls into themes such as data handling, network exposure, identity hygiene, and lifecycle governance.

    That structure helps in two ways. First, engineers can understand what a failed deployment is actually violating. Second, platform teams can tune controls over time without turning every policy update into a mystery. Good governance is easier to maintain when teams can say, with a straight face, which initiative exists to control which class of risk.

    Final Takeaway

    Azure Policy will not make an AI sandbox safe by itself. It will not fix bad role design, weak approval paths, or careless data handling. What it can do is stop the most common forms of cloud drift from becoming normal operating practice. That is a big deal, because most AI security problems in the cloud do not begin with a dramatic breach. They begin with a temporary shortcut that nobody removed.

    If you want sandbox subscriptions to stay useful without becoming production backdoors, define the sandbox operating model first, deny only the patterns that should never be acceptable, audit the rest with intent, and make expiration and exceptions visible. That is how experimentation stays fast without quietly rewriting your control boundary.

  • How to Use Azure Policy Without Turning Governance Into a Developer Tax

    Azure Policy is one of those tools that can either make a cloud estate safer and easier to manage, or make every engineering team feel like governance exists to slow them down. The difference is not the feature set. The difference is how you use it. When policy is introduced as a wall of denials with no rollout plan, teams work around it, deployments fail late, and governance earns a bad reputation. When it is used as a staged operating model, it becomes one of the most practical ways to raise standards without creating unnecessary friction.

    Start with visibility before enforcement

    The fastest way to turn Azure Policy into a developer tax is to begin with broad deny rules across subscriptions that already contain drift, exceptions, and legacy workloads. A better approach is to start with audit-focused initiatives that show what is happening today. Teams need a baseline before they can improve it. Platform owners also need evidence about where the biggest risks actually are, instead of assuming every standard should be enforced immediately.

    This visibility-first phase does two useful things. First, it surfaces repeat problems such as untagged resources, public endpoints, or unsupported SKUs. Second, it gives you concrete data for prioritization. If a rule only affects a small corner of the estate, it does not deserve the same rollout energy as a control that improves backup coverage, identity hygiene, or network exposure across dozens of workloads.
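
    Turning audit results into a priority list can be as simple as counting distinct affected resources per policy. The sketch assumes compliance records exported as dicts with policyDefinitionName, resourceId, and complianceState keys; the sample records are invented.

```python
from collections import defaultdict


def rank_controls(states: list[dict]) -> list[tuple[str, int]]:
    """Rank policies by the number of distinct non-compliant resources, largest first."""
    affected = defaultdict(set)
    for state in states:
        if state["complianceState"] == "NonCompliant":
            affected[state["policyDefinitionName"]].add(state["resourceId"].lower())
    # Sort by count descending, then by name so ties are stable.
    return sorted(
        ((name, len(ids)) for name, ids in affected.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical audit export.
audit = [
    {"policyDefinitionName": "require-tags", "resourceId": "/r1", "complianceState": "NonCompliant"},
    {"policyDefinitionName": "require-tags", "resourceId": "/r2", "complianceState": "NonCompliant"},
    {"policyDefinitionName": "require-tags", "resourceId": "/R1", "complianceState": "NonCompliant"},
    {"policyDefinitionName": "deny-public-ip", "resourceId": "/r3", "complianceState": "NonCompliant"},
    {"policyDefinitionName": "require-tags", "resourceId": "/r4", "complianceState": "Compliant"},
]
for name, count in rank_controls(audit):
    print(f"{name}: {count} non-compliant resources")
```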

    Write policies around platform standards, not one-off preferences

    Strong governance comes from standardizing the things that should be predictable across the platform. Naming patterns, required tags, approved regions, private networking expectations, managed identity usage, and logging destinations are all good candidates because they reduce ambiguity and improve operations. Weak governance happens when policy gets used to encode every opinion an administrator has ever had. That creates clutter, exceptions, and resistance.

    If a standard matters enough to enforce, it should also exist outside the policy engine. It should be visible in landing zone documentation, infrastructure-as-code modules, architecture patterns, and deployment examples. Policy works best as the safety net behind a clear paved road. If teams can only discover a rule after a deployment fails, governance has already arrived too late.

    Use initiatives to express intent at the right level

    Individual policy definitions are useful building blocks, but initiatives are where governance starts to feel operationally coherent. Grouping related policies into initiatives makes it easier to align controls with business goals like secure networking, cost discipline, or data protection. It also simplifies assignment and reporting because stakeholders can discuss the outcome they want instead of memorizing a list of disconnected rule names.

    • A baseline initiative for core platform hygiene such as tags, approved regions, and diagnostics.
    • A security initiative for identity, network exposure, encryption, and monitoring expectations.
    • An application delivery initiative for approved service patterns, backup settings, and deployment guardrails.

    The list matters less than the structure. Teams respond better when governance feels organized and purposeful. They respond poorly when every assignment looks like a random pile of rules added over time.

    Pair deny policies with a clean exception process

    Deny policies have an important place, especially for high-risk issues that should never make it into production. But the moment you enforce them, you need a legitimate path for handling edge cases. Otherwise, engineers will treat the platform team as a ticket queue whose main job is approving bypasses. A clean exception process should define who can approve a waiver, how long it lasts, what compensating controls are expected, and how it gets reviewed later.

    This is where governance maturity shows up. Good policy programs do not pretend exceptions will disappear. They make exceptions visible, temporary, and expensive enough that teams only request them when they genuinely need them. That protects standards without ignoring real-world delivery pressure.

    Shift compliance feedback left into delivery pipelines

    Even a well-designed policy set becomes frustrating if developers only encounter it at deployment time in a shared subscription. The better pattern is to surface likely violations earlier through templates, pre-deployment validation, CI checks, and standardized modules. When teams can see policy expectations before the final deployment stage, they spend less time debugging avoidable issues and more time shipping working systems.

    In practical terms, this usually means platform teams invest in reusable Bicep or Terraform modules, example repositories, and pipeline steps that mirror the same standards enforced in Azure. Governance becomes cheaper when compliance is the default path rather than a separate clean-up exercise after a failed release.
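
    A pipeline step that mirrors the enforced standards might look like the sketch below, which inspects a compiled ARM template before deployment. The approved regions and required tags are placeholder values, precheck_template is an illustrative name, and values written as template expressions are skipped because they cannot be evaluated offline.

```python
ALLOWED_LOCATIONS = {"eastus", "westeurope"}  # placeholder approved regions
REQUIRED_TAGS = {"Owner", "ExpiresOn"}        # placeholder tagging standard


def precheck_template(template: dict) -> list[str]:
    """Return likely policy violations in a compiled ARM template."""
    resources = template.get("resources", [])
    if isinstance(resources, dict):
        # Some newer compiled templates use a name-keyed object instead of a list.
        resources = list(resources.values())
    findings = []
    for res in resources:
        label = f"{res.get('type')}/{res.get('name')}"
        location = res.get("location")
        # Only literal locations can be checked; "[...]" is a template expression.
        if isinstance(location, str) and not location.startswith("["):
            if location.lower() not in ALLOWED_LOCATIONS:
                findings.append(f"{label}: location '{location}' is not approved")
        tags = res.get("tags")
        if tags is None or isinstance(tags, dict):
            missing = sorted(REQUIRED_TAGS - set(tags or {}))
            if missing:
                findings.append(f"{label}: missing tags {', '.join(missing)}")
    return findings


# Hypothetical template fragment.
template = {
    "resources": [
        {
            "type": "Microsoft.Storage/storageAccounts",
            "name": "stsandbox01",
            "location": "brazilsouth",
            "tags": {"Owner": "ml-team"},
        }
    ]
}
for finding in precheck_template(template):
    print(finding)
```

    A check like this is deliberately weaker than the policy engine. Its job is to give fast feedback on the common cases, while Azure Policy remains the authority at deployment time.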

    Measure whether policy is improving the platform

    Azure Policy should produce operational outcomes, not just dashboards full of non-compliance counts. If the program is working, you should see fewer risky configurations, faster environment provisioning, less debate about standards, and better consistency across subscriptions. Those are platform outcomes people can feel. Raw violation totals only tell part of the story, because they can rise temporarily when your visibility improves.

    A useful governance review looks at trends such as how quickly findings are remediated, which controls generate repeated exceptions, which subscriptions drift most often, and which standards are still too hard to meet through the paved road. If policy keeps finding the same issue, that is usually a platform design problem, not just a team discipline problem.
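
    Remediation speed is one of the easier trends to compute. The sketch below assumes findings are tracked as dicts with control, detectedOn, and an optional resolvedOn date in YYYY-MM-DD format; the field names and sample data are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def median_days_by_control(findings: list[dict]) -> dict:
    """Return the median days from detection to resolution for each control."""
    durations = defaultdict(list)
    for finding in findings:
        if finding.get("resolvedOn"):  # open findings are excluded
            delta = date.fromisoformat(finding["resolvedOn"]) - date.fromisoformat(finding["detectedOn"])
            durations[finding["control"]].append(delta.days)
    return {control: median(days) for control, days in durations.items()}


# Hypothetical findings log.
log = [
    {"control": "require-tags", "detectedOn": "2025-01-01", "resolvedOn": "2025-01-05"},
    {"control": "require-tags", "detectedOn": "2025-01-10", "resolvedOn": "2025-01-20"},
    {"control": "deny-public-ip", "detectedOn": "2025-02-01", "resolvedOn": "2025-02-03"},
    {"control": "deny-public-ip", "detectedOn": "2025-02-10", "resolvedOn": None},
]
for control, days in median_days_by_control(log).items():
    print(f"{control}: median {days} days to remediate")
```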

    Governance works best when it feels like product design

    The healthiest Azure environments treat governance as part of platform product design. The platform team sets standards, publishes a clear path for meeting them, watches the data, and tightens enforcement in stages. That approach respects both risk management and delivery speed. Azure Policy is powerful, but power alone is not what makes it valuable. The real value comes from using it to make the secure, supportable path the easiest path for everyone building on the platform.