Kubernetes RBAC is a reasonable starting point for access control, and it stays reasonable right up until the cluster grows beyond a single team with a single set of concerns. The limitations that surface after that point are structural properties of a model built around static role bindings, one that lacks the vocabulary for conditional logic, environmental context, or cross-cutting policy constraints. I wrote recently about the trust model Kubernetes assumes, and access control is where that trust model meets operational reality most painfully, because “who can do what” turns out to be a much harder question than four resource types can answer.
RBAC in Practice: From Role Definitions to Role Explosion
The core RBAC primitives are straightforward: a Role (namespaced) or ClusterRole (cluster-wide) defines a set of permissions, and a RoleBinding or ClusterRoleBinding attaches that permission set to a user, group, or service account. What matters here is what happens to them over time.
A well-scoped CI/CD role starts clean:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: cicd-deployer
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["create", "update", "patch"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list"]
Six months later, after a series of incident responses and on-call escalations, the same role has drifted into something unrecognizable:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: cicd-deployer
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["*"]
The wildcards arrive one at a time: a failed deployment needs get on pods, log tailing needs pods/log, and someone adds pods/exec to debug a crashloop at speed, with the rollback never making it onto the backlog. Eventually, scoping individual resources feels slower than opening everything up, and a wildcard replaces a list whose contents the team has long forgotten.
The privilege escalation vectors matter more than simple over-permissioning. The escalate verb lets a subject create or update roles containing permissions it does not itself hold, and bind lets it attach roles it could never have authored; a binding that confers either verb hands over the keys:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: production
  name: dangerous-binding
subjects:
- kind: ServiceAccount
  name: platform-automation
  namespace: production
roleRef:
  kind: ClusterRole
  name: rbac-escalation-capable
  apiGroup: rbac.authorization.k8s.io
With bind and escalate, that service account can grant itself cluster-admin equivalent permissions through a self-referential binding chain. The impersonate verb is another escalation path: any subject with impersonation privileges can act as any other user or service account, bypassing every RBAC check scoped to the impersonated identity.
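As a sketch of how little YAML the impersonation path takes, a ClusterRole granting blanket impersonation looks like this (the role name is illustrative):

```yaml
# Illustrative ClusterRole: any subject bound to it can act as any
# user, group, or service account in the cluster.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: impersonator
rules:
- apiGroups: [""]
  resources: ["users", "groups", "serviceaccounts"]
  verbs: ["impersonate"]
```

Once bound, a request like kubectl get secrets -A --as=some-admin-user is authorized against the impersonated identity's permissions, not the caller's own.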
At scale, the combinatorial growth becomes the primary security problem. Twenty namespaces with fifteen roles each, plus their corresponding bindings, is six hundred RBAC objects before a single ClusterRole is counted, and auditing that volume for least-privilege compliance without automated tooling is a task that teams abandon before completion. Aggregated ClusterRoles (using label selectors to compose multiple ClusterRoles into one) reduce the object count, albeit at the cost of opacity: tracing which individual rules compose an aggregated role requires label-chasing across the cluster, and most teams don’t build that observability.
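For reference, an aggregated ClusterRole is just a label selector; the label key below is illustrative:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring
aggregationRule:
  clusterRoleSelectors:
  - matchLabels:
      rbac.example.com/aggregate-to-monitoring: "true"
rules: []  # the controller overwrites this with rules from matching ClusterRoles
```

Any ClusterRole carrying that label contributes its rules automatically, which is exactly the indirection that makes "what does this role actually grant" a cluster-wide query rather than a single kubectl get.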
Secret enumeration through list and get on secrets resources is the permission combination that keeps biting teams. A service account that can list secrets across namespaces can enumerate every credential, API key, and TLS certificate stored in the cluster. Combine that with get access to individual secret objects, and the attacker has the plaintext values. This permission pair shows up in cluster security assessments with remarkable regularity, and it’s been granted unintentionally in every case I’ve had the chance to trace back.
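A tighter alternative, where operationally feasible, is to grant get on named secrets and drop list entirely; note that resourceNames cannot constrain list, which is precisely why the list-plus-get pair is dangerous. A sketch (role name illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: read-db-creds             # illustrative name
rules:
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["database-credentials"]
  verbs: ["get"]                  # no "list": resourceNames cannot restrict list
```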
ABAC: The Model Kubernetes Deprecated and Clouds Still Use
Kubernetes supported Attribute-Based Access Control before RBAC became the default in v1.6. ABAC policies are static JSON files loaded at API server startup, and any policy change requires restarting kube-apiserver, which makes iterating on access control a disruptive operational event rather than a routine kubectl apply.
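Concretely, ABAC is wired in through kube-apiserver flags, with the policy file living on the control plane host (paths illustrative):

```shell
# Each line of the policy file is one JSON policy object; changing it
# means editing the file on the node and restarting kube-apiserver.
kube-apiserver \
  --authorization-mode=ABAC,RBAC \
  --authorization-policy-file=/etc/kubernetes/abac-policy.jsonl
```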
A well-scoped ABAC policy looks straightforward:
{
  "apiVersion": "abac.authorization.kubernetes.io/v1beta1",
  "kind": "Policy",
  "spec": {
    "user": "jane",
    "namespace": "frontend",
    "resource": "pods",
    "readonly": true
  }
}
Once the policy file grows, wildcard attributes turn a targeted policy into a universal allow:
{
  "apiVersion": "abac.authorization.kubernetes.io/v1beta1",
  "kind": "Policy",
  "spec": {
    "user": "*",
    "namespace": "*",
    "resource": "*",
    "readonly": false
  }
}
Staleness kills ABAC in practice. The policy file lives on the API server’s filesystem, outside of any version-controlled deployment pipeline in most configurations. Decommissioned user accounts persist because removing them requires editing the file and restarting the API server, so teams defer the cleanup. Over months, the policy file accumulates entries for users who no longer exist in the organization, and if a new service account or authenticating proxy happens to present a matching username, it inherits whatever permissions the original user had. Attribute spoofing through a misconfigured authenticating proxy is the other major vector: if the proxy passes unvalidated headers that the API server trusts as identity attributes, any client that can reach the proxy can assert arbitrary identity claims. ABAC also lacks deny rules entirely; it’s an allow-only model where the most permissive matching policy wins.
Kubernetes deprecated ABAC in favor of RBAC’s declarative model, whilst every major cloud provider built their entire IAM architecture around attribute-based conditions, the same primitive expressed through different operational ergonomics.
AWS, GCP, and Azure all rely heavily on attribute conditions that ABAC pioneered. An AWS IAM policy using aws:PrincipalTag conditions evaluates dynamic attributes at request time, achieving what Kubernetes ABAC attempted with static files but couldn’t operationalize cleanly:
{
  "Effect": "Allow",
  "Action": "s3:GetObject",
  "Resource": "arn:aws:s3:::team-buckets/*",
  "Condition": {
    "StringEquals": {
      "aws:PrincipalTag/Team": "${s3:prefix}"
    }
  }
}
Runtime evaluation versus static file loading is what separates a viable access control model from one that Kubernetes correctly moved away from.
Policy-Based Access Control: OPA and Kyverno
Policy-as-code evaluated at admission time represents the next evolution: access decisions that can reference arbitrary context, enforce cross-cutting constraints, and express conditional logic that RBAC’s verb-resource-subject model can’t represent. These systems hook into the Kubernetes API server through ValidatingAdmissionWebhooks, evaluating every API request against a policy engine before the request takes effect.
OPA (Open Policy Agent) with Gatekeeper is the most established option. A ConstraintTemplate defines the policy logic in Rego, and a Constraint applies that logic to specific API resources:
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package k8srequiredlabels

      violation[{"msg": msg}] {
        provided := {label | input.review.object.metadata.labels[label]}
        required := {label | label := input.parameters.labels[_]}
        missing := required - provided
        count(missing) > 0
        msg := sprintf("Missing required labels: %v", [missing])
      }
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-team-labels
spec:
  match:
    kinds:
    - apiGroups: ["apps"]
      kinds: ["Deployment"]
  parameters:
    labels: ["team", "cost-center", "environment"]
Rego’s learning curve is genuine. The declarative evaluation model, where rules are unordered and variables unify rather than assign, trips up engineers coming from imperative languages. The := operator is local assignment whilst = is unification, and confusing the two produces policies that silently evaluate to unexpected results. Negation is where the most dangerous bugs hide: a negated condition in Rego that references undefined input values can silently fail to match, which in a Gatekeeper context means the violation rule never fires and a missing field silently satisfies a constraint intended to deny requests.
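A minimal sketch of that failure mode, assuming a Gatekeeper-style input document: the intent is to reject an empty team label, but when the label is absent entirely the reference is undefined, the rule body never succeeds, and the violation silently never fires:

```rego
package example

# Intent: reject Deployments whose "team" label is empty.
# Bug: if the label is absent, labels["team"] is undefined, the body
# fails to evaluate, and the missing label passes admission untouched.
violation[{"msg": "team label must not be empty"}] {
  input.review.object.metadata.labels["team"] == ""
}
```

The correct version must handle the undefined case explicitly, which is the habit Rego demands and imperative intuition resists.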
Kyverno takes a different approach entirely, expressing policies in pure YAML without a separate policy language:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-labels
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-labels
    match:
      any:
      - resources:
          kinds:
          - Deployment
    validate:
      message: "Deployments must have team, cost-center, and environment labels."
      pattern:
        metadata:
          labels:
            team: "?*"
            cost-center: "?*"
            environment: "?*"
Choosing between Rego and YAML comes down to how much expressiveness you need. Kyverno’s YAML patterns are immediately readable by any Kubernetes operator, and that familiarity dramatically lowers adoption barriers. Rego can express policies that YAML patterns can’t: cross-resource validation, arithmetic constraints, external data lookups. Most teams won’t exceed what YAML can express, at least not initially, and the migration cost if they do is worth factoring in early.
For regulated workloads requiring application-level authorization, an OPA sidecar running alongside the application provides fine-grained access decisions beyond what infrastructure-level admission control can reach. This pattern matters when compliance requirements demand provable policy properties or when access decisions depend on application-domain context (patient records, financial transactions) that Kubernetes resources don’t model. Formal verification of policy properties, proving that no combination of inputs can produce a given access result, is an active area of development in this space.
The security concerns with admission-time policy enforcement are operational. Setting failurePolicy: Ignore on a ValidatingAdmissionWebhook means that if the policy engine is unreachable (network partition, pod crash, resource exhaustion), the API server proceeds without evaluation, and every request during that window bypasses all policy checks. Rego logic errors, particularly in negation, can create policies that appear to enforce constraints whilst actually permitting everything. ConstraintTemplate scope mismatches, where a template targets the wrong API group or resource kind, produce constraints that enforce nothing against the resources you intended to protect. And admission control operates exclusively at request time: a workload that was compliant when created can drift out of compliance without any re-evaluation until the next API mutation.
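The fail-open versus fail-closed choice is a single field on the webhook registration; a sketch with illustrative names:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: policy-checks                # illustrative name
webhooks:
- name: validate.policy.example.com
  failurePolicy: Fail                # fail closed: reject requests when the engine is unreachable
  sideEffects: None
  admissionReviewVersions: ["v1"]
  clientConfig:
    service:
      name: policy-engine            # illustrative Service fronting the engine
      namespace: policy-system
      path: /validate
  rules:
  - apiGroups: ["apps"]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["deployments"]
```

Fail closed trades availability for enforcement: a down policy engine blocks deployments, which is usually the right default for tenant namespaces and worth exempting via a namespaceSelector for cluster-critical system components.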
What Breaks When You Get Each Model Wrong
RBAC fails through quiet drift: a compromised pod running with a service account that has list and get on secrets cluster-wide can enumerate every credential in the cluster:
# Attacker in a compromised pod
kubectl get secrets --all-namespaces -o json | \
jq '.items[] | {namespace: .metadata.namespace, name: .metadata.name}'
# Check what permissions the current service account holds
kubectl auth can-i --list
# Exfiltrate a specific secret
kubectl get secret database-credentials -n production -o jsonpath='{.data}'
The service account was probably granted those permissions for a legitimate operational reason, and the permission outlived the need by months because RBAC has no concept of temporal scoping or access review.
ABAC fails through invisible staleness: a decommissioned account’s username, still present in the static policy file because the team never restarted the API server with a cleaned-up version, gets inherited by a new service that happens to authenticate with the same username through a reconfigured identity provider. The new service receives permissions that were scoped to a completely different operational context, and the connection between the old policy entry and the new identity is invisible without auditing the policy file line by line.
PBAC fails through false confidence: a Rego typo in a negation clause, combined with failurePolicy: Ignore on the webhook configuration, produces a system where the policy engine is running, the constraints exist, the audit logs show the webhook being called, and nothing is actually being enforced. Every indicator says the policy is active whilst every request sails through unchecked.
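One inexpensive guard against that false confidence, assuming a required-labels constraint like the earlier examples, is a periodic canary that submits a deliberately non-compliant object through a server-side dry run, which still passes through admission webhooks without persisting anything:

```shell
# Expect the API server to REJECT this unlabeled deployment.
# If the command succeeds, enforcement is silently broken.
if kubectl create deployment canary --image=nginx --dry-run=server -o name; then
  echo "ALERT: policy engine admitted a non-compliant deployment"
fi
```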
Choosing Your Model: A Decision Framework
RBAC alone works for single-team clusters with fewer than fifty roles and no requirement for conditional access logic. The moment you need to express “this service account can only create deployments with specific labels” or “pods in this namespace must pull images from an approved registry,” you’ve exceeded what RBAC can represent.
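The second of those constraints, sketched in Kyverno with an illustrative registry name, shows the shape of what sits beyond RBAC's vocabulary:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-registries
spec:
  validationFailureAction: Enforce
  rules:
  - name: approved-registry-only
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: "Images must be pulled from registry.example.com."
      pattern:
        spec:
          containers:
          - image: "registry.example.com/*"   # illustrative registry
```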
RBAC combined with policy-as-code (OPA/Gatekeeper or Kyverno) is the multi-tenant production sweet spot. RBAC handles identity binding, determining which subjects map to which permission sets, whilst the admission controller handles constraint enforcement, ensuring that the resources those subjects create comply with organizational policy. The two systems complement each other because they operate at different points in the API request lifecycle.
The third layer is application-level authorization for regulated workloads: an OPA sidecar beside the application makes fine-grained access decisions that infrastructure-level admission control can’t reach, drawing on domain context (patient records, financial transactions) that Kubernetes resources don’t model.
Single-team clusters with fewer than fifty roles and no conditional logic requirements can stay on RBAC alone. Multi-tenant production clusters need the RBAC-plus-policy combination, with team count, compliance requirements, and policy complexity as the three axes that determine how much policy engine overhead is justified. Regulated workloads requiring application-domain access decisions need the full stack. Each layer adds enforcement capability and operational overhead in roughly equal measure, and the right access control architecture is the simplest one that covers your actual threat model, paired with auditing that tells you when you’ve outgrown it.
The trust boundary model I outlined last month is the foundation that access control policies build on. The threads I want to explore next are secrets management as a control plane problem (because access control policies reference credentials whose lifecycle they can’t govern), control plane security (because someone has to secure the policy engine itself), and runtime enforcement at the container level (because access control stops at the API server boundary).
The right time to revisit your access control model is when your audit tooling stops being able to answer “who can access what and why” across the full cluster, not after the incident that proves the gap.