Container orchestration is complex and needs careful tuning to be managed correctly. Every step matters, especially when it comes to security.
In this article, I will discuss how you can harden the security of your GKE clusters.
I would like to point out in advance that some of the concepts discussed here require deep expertise and are only covered superficially.
Table of Contents
- Shielded GKE Nodes
- Private Google Access & Nodes
- Integrity Monitoring
- Auto Repair Nodes
- Control Plane Authorized Networks
- Legacy Metadata Endpoints
- Legacy Authorization (ABAC)
- Legacy Client Authentication
- Container Optimized OS Images
- Kubernetes Dashboard
- Release Channels
- Cluster Monitoring
- Workload Identity
- Pod Security Admission Controller or Gatekeeper
- Network Policies
- Access Scopes
- Metadata Server
- Binary Authorization
- Application-Layer Secrets Encryption
Shielded GKE Nodes
Boot- or kernel-level malware and rootkits are types of malicious software that are particularly dangerous because they operate at a low level of a computer's operating system.
They can control or alter the operations of an operating system without being detected by traditional antivirus software. They persist beyond the infected OS, meaning they can survive even if the operating system is reinstalled. This makes them extremely hard to remove and allows them to cause significant damage or data theft.
Shielded GKE Nodes run firmware which is signed and verified using Google’s Certificate Authority, ensuring that the nodes’ firmware is unmodified and establishing the root of trust for Secure Boot.
GKE node identity is strongly protected via a virtual Trusted Platform Module (vTPM) and verified remotely by the control plane before the node joins the cluster.
The virtual Trusted Platform Module (vTPM)
The vTPM is a virtualized version of a physical TPM chip: a secure cryptoprocessor that can store the cryptographic keys used to protect information.
In the context of Shielded GKE Nodes, the vTPM is used to protect the identity of the node. The node's identity is cryptographically attested and can only be verified by the control plane. This adds an additional layer of security, since it ensures that only verified nodes can join the cluster.
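Shielded GKE Nodes can be enabled on an existing cluster from the command line; a minimal sketch, with CLUSTER_NAME as a placeholder:
gcloud container clusters update CLUSTER_NAME \
    --enable-shielded-nodes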
Private Google Access & Nodes
Private Nodes are nodes without public IP addresses. They can't be reached directly from the internet, only from internal networks, which adds an extra security layer and reduces the chances of unauthorized entry.
Although this seems simple, it has significant effects: attackers must first get into the local network before they can even try to reach the Kubernetes hosts. This raises the bar considerably and makes the Kubernetes hosts far less likely to be compromised.
You should limit exposure of your cluster control plane and nodes to the internet.
By default, the GKE cluster control plane and nodes have internet-routable addresses that can be accessed from any IP address. There are three types of private clusters, offering varying levels of network protection:
Public endpoint access disabled: This is the most secure option, blocking all internet access to both the control plane and the nodes. It is a good choice if you've connected your on-premises network to Google Cloud using Cloud Interconnect or Cloud VPN.
Public endpoint access with authorized networks (recommended): This option permits access to the control plane from specified IPs. It’s ideal for environments without VPN infrastructure, or with remote users or branches connecting via public internet instead of corporate VPN, Cloud Interconnect, or Cloud VPN.
Public endpoint access enabled, authorized networks disabled: This is the default option and permits anyone on the internet to establish network connections to the control plane.
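As an illustration, a private cluster with its public endpoint disabled could be created roughly like this (the cluster name and CIDR range are placeholders):
gcloud container clusters create CLUSTER_NAME \
    --enable-ip-alias \
    --enable-private-nodes \
    --enable-private-endpoint \
    --master-ipv4-cidr 172.16.0.0/28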
Integrity Monitoring
Integrity monitoring is a node pool setting that's enabled by default on GKE. It should remain enabled so that the runtime boot integrity of shielded cluster nodes is monitored and automatically checked using the Cloud Monitoring service.
To check or enable node integrity monitoring in the Google Cloud console:
- On the cluster details page, click the Nodes tab.
- Under Node Pools, click the name of the node pool you want to modify.
- Under Security, select or clear the Enable integrity monitoring checkbox.
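The same setting can be applied from the command line when creating a node pool; a sketch with placeholder names:
gcloud container node-pools create POOL_NAME \
    --cluster CLUSTER_NAME \
    --shielded-integrity-monitoring \
    --shielded-secure-boot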
Auto Repair Nodes
Node auto-repair in GKE ensures that the nodes in your cluster remain healthy and operational. When activated, GKE periodically checks the health of each node in your cluster. If a node consecutively fails these health checks over a prolonged duration, GKE starts a repair process for the affected node.
To check or enable node auto-repair in the Google Cloud console:
- On the cluster details page, click the Nodes tab.
- Under Node Pools, click the name of the node pool you want to modify.
- Under Management, select or clear the Enable auto-repair checkbox.
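From the command line, auto-repair can be enabled on an existing node pool roughly like this (placeholder names):
gcloud container node-pools update POOL_NAME \
    --cluster CLUSTER_NAME \
    --enable-autorepair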
Node repair process
When GKE identifies a node needing repair, the node is drained and re-created. GKE waits up to one hour for the drain to complete; if it doesn't finish in time, the node is re-created anyway. When multiple nodes need repair, GKE may run repairs in parallel, scaling the number of parallel operations to the cluster size and the number of unhealthy nodes. If you disable node auto-repair while a repair is in progress, the ongoing repair is not cancelled.
Control Plane Authorized Networks
Control Plane Authorized Networks is a security feature that blocks access to the cluster control plane from any IP address that is not on an allowlist you define. It helps keep your data safe by stopping unauthorized access from outside. Not all access is blocked, however: addresses inside Google Cloud Platform can still reach the control plane over HTTPS, but they need valid Kubernetes credentials; without those, even Google Cloud addresses can't get in. This keeps your cluster safe while remaining easy to access for authorized clients.
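Turning this on from the command line looks roughly like the following (CLUSTER_NAME and the CIDR range are placeholders):
gcloud container clusters update CLUSTER_NAME \
    --enable-master-authorized-networks \
    --master-authorized-networks 203.0.113.0/24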
Legacy Metadata Endpoints
The instance metadata server, a critical component of system infrastructure, previously exposed legacy v0.1 and v1beta1 endpoints. These particular endpoints do not enforce the use of metadata query headers, a feature that has been significantly improved in the v1 APIs.
The importance of these headers cannot be overstated, as they provide an additional layer of security by making it considerably more challenging for a potential attacker to retrieve crucial instance metadata.
Such attacks could include, for example, Server-Side Request Forgery (SSRF), a common exploit that can lead to serious data breaches. Unless there is a specific, compelling reason for maintaining these legacy APIs within your system, my recommendation would be to disable them. This action will further enhance the security of your data and the robustness of your infrastructure.
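Recent GKE versions disable these endpoints by default; on older node pools you could disable them explicitly via instance metadata, roughly like this (placeholder names):
gcloud container node-pools create POOL_NAME \
    --cluster CLUSTER_NAME \
    --metadata disable-legacy-endpoints=true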
Legacy Authorization
The legacy authorizer in Kubernetes grants broad, statically defined permissions. To ensure that RBAC limits permissions correctly, you must disable the legacy authorizer. RBAC has significant security advantages, can help you ensure that users only have access to cluster resources within their own namespace and is now stable in Kubernetes.
gcloud container clusters update CLUSTER_NAME \
--no-enable-legacy-authorization
Legacy Client Authentication
GKE allows the use of service account bearer tokens, OAuth tokens, and x509 client certificates for connection to the Kubernetes API server.
GKE handles this connection using the OAuth token method. In the past, GKE used one-time generated x509 certificates and static passwords; however, these are no longer advised and have been disabled by default since GKE version 1.12. Static passwords were deprecated and removed entirely in version 1.19.
Basic authentication lets a user authenticate to the cluster with a username and password, which are stored in plain text without any encryption. Disabling basic authentication prevents attacks such as brute force. It's recommended to use IAM for authentication instead.
Existing clusters should switch to OAuth. If an external system needs long-lived credentials, create a Google or Kubernetes service account with the required access rights and export its key.
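For new clusters, both legacy methods can be disabled at creation time (and on current versions these are the defaults); a sketch with CLUSTER_NAME as a placeholder:
gcloud container clusters create CLUSTER_NAME \
    --no-enable-basic-auth \
    --no-issue-client-certificate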
Container Optimized OS Images
Container-Optimized OS from Google is a specialized operating system image for your VMs, optimized for running containers. Backed by Google and based on the open-source Chromium OS project, Container-Optimized OS lets you launch your containers on Google Cloud Platform quickly, efficiently, and securely.
It is recommended to use container-optimized OS images, as they provide improved support, security and stability.
Container-Optimized OS provides:
- Container Launch: Pre-installed with the Docker and containerd runtimes and cloud-init, it enables simultaneous VM and container launch, eliminating setup needs.
- Reduced Attack Surface: Its smaller footprint decreases potential instance attack points.
- Built-in Security: Pre-configured firewall and security settings are included.
- Auto Updates: Instances automatically download weekly updates, applied with a simple reboot.
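To standardize on Container-Optimized OS with containerd, a node pool could be created roughly like this (placeholder names):
gcloud container node-pools create POOL_NAME \
    --cluster CLUSTER_NAME \
    --image-type COS_CONTAINERD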
Kubernetes Dashboard
You should disable the Kubernetes Web UI (Dashboard) when running on Kubernetes Engine.
The Kubernetes Web UI (Dashboard) is backed by a highly privileged Kubernetes service account. The Cloud Console provides much of the same functionality, so you don't need the Dashboard.
gcloud container clusters update CLUSTER_NAME \
--update-addons=KubernetesDashboard=DISABLED
Release Channels
Google Kubernetes Engine (GKE) has different release channels to provide users with options for how quickly they receive updates and new features. These release channels help users balance the need for stability and reliability with the desire to access the latest Kubernetes features.
The Regular channel, upgrading every few weeks, is for users needing features beyond the Stable channel. These versions pass internal validation but lack historical stability data. Known issues usually have workarounds.
The Stable channel, upgrading every few months, is for users prioritizing stability and avoiding frequent upgrades. These versions pass internal validation and have proven stable and reliable in production. All channels receive critical security patches.
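Subscribing an existing cluster to a channel is a single update; a sketch, with CLUSTER_NAME as a placeholder:
gcloud container clusters update CLUSTER_NAME \
    --release-channel stable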
Cluster Monitoring
Google Kubernetes Engine (GKE) integrates with Cloud Logging and Cloud Monitoring, including the Google Cloud Managed Service for Prometheus.
This allows monitoring of GKE clusters, management of logs, performance assessment using profiling and tracing, and includes a dashboard for observability. Security logs, including audit logs, are available for GKE and other Google Cloud services, with or without Cloud Logging enabled.
You should enable cluster monitoring and use a monitoring service so your cluster can export metrics about its activities.
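On recent GKE versions, logging and monitoring collection can be configured roughly like this (CLUSTER_NAME is a placeholder):
gcloud container clusters update CLUSTER_NAME \
    --logging=SYSTEM,WORKLOAD \
    --monitoring=SYSTEM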
Workload Identity
Enabling Workload Identity lets workloads authenticate to Google Cloud as IAM service accounts, so you no longer have to distribute and rotate service account keys yourself.
Kubernetes workloads should not use cluster node service accounts to authenticate to Google Cloud APIs. Each Kubernetes Workload that needs to authenticate to other Google services using Cloud IAM should be provisioned a dedicated Service account.
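A sketch of the setup, assuming placeholder names (CLUSTER_NAME, PROJECT_ID, GSA_NAME, NAMESPACE, KSA_NAME):
# Enable Workload Identity on the cluster
gcloud container clusters update CLUSTER_NAME \
    --workload-pool=PROJECT_ID.svc.id.goog
# Allow the Kubernetes service account to impersonate the IAM service account
gcloud iam service-accounts add-iam-policy-binding GSA_NAME@PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]"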
Pod Security Admission Controller or Gatekeeper
If you want to continue using Pod-level security controls in GKE, Google recommends one of the following solutions:
Pod Security Admission Controller
Enforce Pod Security Standards on your GKE Standard and Autopilot clusters using the Pod Security admission controller. Pod Security Standards are predefined security policies that cover the range of Pod security needs in Kubernetes; the policies are cumulative, ranging from highly permissive to highly restrictive.
PodSecurityPolicy (beta) is deprecated in Kubernetes 1.21, set to be removed in 1.25. In GKE clusters running 1.25 or later, PodSecurityPolicy can’t be used and must be disabled prior to upgrading. To migrate your existing PodSecurityPolicy configuration to PodSecurity, refer to Migrate from PodSecurityPolicy.
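Pod Security Standards are enforced by labeling namespaces; a minimal sketch, with NAMESPACE as a placeholder:
kubectl label namespace NAMESPACE \
    pod-security.kubernetes.io/enforce=restricted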
Gatekeeper
Standard GKE clusters enable you to apply security policies using Gatekeeper. Gatekeeper not only allows you to enforce the same functionalities as PodSecurityPolicy but also offers additional features like dry-run, gradual rollouts, and auditing.
Here is a simple example of applying Gatekeeper to a GKE cluster:
You can create a ConstraintTemplate that enforces the policy and the schema of the constraint.
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        # Schema for the `parameters` field
        openAPIV3Schema:
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels
        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("you must provide labels: %v", [missing])
        }
After the ConstraintTemplate is created, you can create a constraint that uses it.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: pods-must-have-gk
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    labels: ["gatekeeper"]
This example ensures that all pods have a label called gatekeeper.
Network Policies
In a Kubernetes setup, pods can talk to each other by default. This is to make sure everything is accessible. But you can change this by applying a NetworkPolicy.
A NetworkPolicy sets rules for how groups of pods can talk to each other and other network points. When it’s applied to a pod, it controls what traffic the pod can get.
Restricting network access to services makes it much more difficult for attackers to move laterally within your cluster, and also offers services some protection against accidental or deliberate denial of service.
Google recommends two ways to control traffic:
Istio
Istio is an open-source platform providing tools for handling, securing, controlling, and observing microservices in container orchestration platforms. Its main purpose is to streamline the management and security of service communication in complex distributed systems.
Kubernetes Network Policy
A Kubernetes Network Policy is a specification that defines how groups of pods are allowed to communicate with each other and other network endpoints. It provides a way to control the traffic flow at the network level within a Kubernetes cluster.
Istio is more focused on service mesh capabilities, traffic management, and application-layer features, while Kubernetes Network Policies are designed for network-level segmentation (Layer 3/4) and control between pods within the cluster.
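On a Standard cluster, enabling Kubernetes Network Policy enforcement is typically a two-step update; a sketch with CLUSTER_NAME as a placeholder:
# Enable the network policy add-on on the control plane
gcloud container clusters update CLUSTER_NAME \
    --update-addons=NetworkPolicy=ENABLED
# Enable enforcement on the nodes (this re-creates the node pools)
gcloud container clusters update CLUSTER_NAME \
    --enable-network-policy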
Access Scopes
If you are not creating a separate service account for your nodes, you should limit the scopes of the node service account to reduce the opportunity for privilege escalation.
This ensures that the default service account does not have permissions beyond those necessary to run your cluster. While the default scopes are limited, they may include scopes beyond the minimally required ones needed to run your cluster.
For example, if you are accessing private images in Google Container Registry, the minimally required scopes are only logging.write, monitoring, and devstorage.read_only.
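A node pool restricted to those scopes could be created roughly like this (POOL_NAME and CLUSTER_NAME are placeholders):
gcloud container node-pools create POOL_NAME \
    --cluster CLUSTER_NAME \
    --scopes=https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring,https://www.googleapis.com/auth/devstorage.read_only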
Metadata Server
Every GKE node stores its metadata on a metadata server. Some of this metadata, such as kubelet credentials and the VM instance identity token, is sensitive and should not be exposed to a Kubernetes workload.
Enabling the GKE Metadata server prevents pods (that are not running on the host network) from accessing this metadata and facilitates Workload Identity. When unspecified, the default setting allows running pods to have full access to the node’s underlying metadata server.
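Switching an existing node pool to the GKE metadata server looks roughly like this (placeholder names; the setting applies per node pool):
gcloud container node-pools update POOL_NAME \
    --cluster CLUSTER_NAME \
    --workload-metadata=GKE_METADATA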
Binary Authorization
Binary Authorization provides software supply-chain security for images that you deploy to GKE from Google Container Registry (GCR) or another container image registry.
Binary Authorization requires images to be signed by trusted authorities during the development process. These signatures are then validated at deployment time. By enforcing validation, you can gain tighter control over your container environment by ensuring only verified images are integrated into the build-and-release process.
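Enforcement can be enabled on an existing cluster; on recent gcloud versions the flag looks like the following (older releases used --enable-binauthz), with CLUSTER_NAME as a placeholder:
gcloud container clusters update CLUSTER_NAME \
    --binauthz-evaluation-mode=PROJECT_SINGLETON_POLICY_ENFORCE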
Application-Layer Secrets Encryption
GKE automatically encrypts all user data at rest, including Secrets, so no additional action is required on your part.
Adding Application-layer Secrets Encryption gives an extra security layer to protect sensitive data. This means it safeguards user-created Secrets and necessary operational Secrets like service account keys, which are stored in etcd.
With this feature, you can use a key from Cloud KMS to encrypt data at the application level. This helps against potential attackers who might get access to etcd.
Make sure the key is located in the same location as the cluster to reduce latency and avoid situations where resources rely on services spread across multiple failure domains. Once you’ve created a key, you can activate the feature on a new or existing cluster by specifying the desired key. Upon enabling this feature, GKE will encrypt all new and existing secrets using your encryption key.
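Enabling the feature on an existing cluster is roughly a single update; the key resource path below is a placeholder:
gcloud container clusters update CLUSTER_NAME \
    --database-encryption-key projects/PROJECT_ID/locations/LOCATION/keyRings/RING_NAME/cryptoKeys/KEY_NAME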