Keycloak High Availability in Kubernetes: A Comprehensive Guide

Keycloak High Availability in Kubernetes: A Comprehensive Guide




Keycloak stands as a prevalent Identity and Access Management (IAM) tool, frequently embraced by customers for its open-source nature and robust functionalities. It has evolved into an industry standard for clients preferring self-hosted IAM solutions over SaaS alternatives like Okta or Azure AD for their Single Sign-On (SSO) necessities.
Being a seasoned Kubernetes practitioner, what if we simply adjust the Keycloak deployment to have 2 or more replicas to achieve HA? Shouldn’t it be pretty straightforward? — Umm .. Well, the answer is NO!

Problem with Just Increasing the replicas?

Out of the box, Keycloak isn’t configured to be deployed with multiple replicas in any environment, let alone in a Kubernetes cluster. By default, Keycloak assumes it’s a Standalone Instance without any knowledge of other Keycloak instances that may be running around it.
It uses a backend database that stores user information and data about your organization, and you set Keycloak to pull from it. If an instance of Keycloak (pod) goes down, Kubernetes spins a new one up and it retrieves the information from the database as normal and keeps on going.
While most data is stored in your backend database, there are a couple of pieces of important data that are stored in an internal Infinispan Cache inside each pod, like ***user authentication sessions. With our load balancer in front, if a user that was already authenticated was somehow directed to the other Keycloak pod, that pod would have no authentication data for them and force them to log in again.***
Infinispan is an open-source in-memory data grid that offers flexible deployment options and robust capabilities for storing, managing, and processing data. It provides a key/value data store that can hold all types of data, from Java objects to plain text, and also distributes your data across elastically scalable clusters to guarantee high availability and fault tolerance, whether you use Infinispan as a volatile cache or a persistent data store.

Setting up Keycloak High Availability

Multi-site deployments

Keycloak supports deployment that consists of multiple Keycloak instances that connect to forma Standalone Cluster *(multi-site deployment)*, using its embedded Infinispan Cache; loadbalancer can distribute the load evenly across those instances. Those setups are intended for a transparent network on a single site, but it entails an additional effort in tailoring deployments to achieve HA.

Deployment, Data storage, and Caching

Two independent Keycloak instances running on different sites are connected with a low-latency network connection. Users, realms, clients, offline sessions, and other entities are stored in a database that is replicated synchronously across the two sites. The data is also cached in the Keycloak embedded Infinispan as local caches. When the data is changed in one Keycloak instance, that data is updated in the database, and an invalidation message is sent to the other site using the replicated work cache.

Architecture Diagram

Keycloak High Availability 001
Essentially, this setup allows multiple Keycloak instances to be made aware of each other so that the Infinispan Cache can be retrieved and shared by multiple instances of Keycloak. It has multiple ways of doing this:
  • DNS\_PING
  • KUBE\_PING
Keycloak Clustering is based on the discovery protocols of JGroups (Keycloak uses Infinispan cache and Infinispan uses JGroups to discover nodes).
On the Kubernetes cluster, we are going to use KUBE\_PING as JGroups discovery protocol. These will discover how many containers are running in that namespace. It will get pod details using JGroups properties and it will sync up with each pod and also cache between one container to another using Infinispan concepts. 

Steps to Make Keycloak HA

The following steps allow us to make a Keycloak run in HA mode:
  • Running Keycloak in Standalone clustered mode[Multi-site deployment]
  • Multiple replicas of deployment — to Tolerate pod(s) failure.
  • Scheduling of pods on different K8S Nodes
  • Enabling Readiness & Liveness probes

Implementation

1. Set up Keycloak with 2 replicas.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keycloak
  namespace: dev
  labels:
    app: keycloak
spec:
  replicas: 2 
2. Set up podAntiAffinity to enforce the scheduling of pods to run on different nodes. 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keycloak
spec:
  replicas: 2
  ... // additional configs truncated
  spec:
    podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - keycloak
            topologyKey: kubernetes.io/hostname 
3. Optionally, add nodeAffinity to schedule the pods on dedicated nodes. 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keycloak
spec:
  replicas: 2
  ... // additional configs truncated
  spec: 
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: dedicated
                  operator: In
                  values:
                    - keycloak 
4. Set the following environment variables in the Keycloak pod to run in Standalone Clustered Mode. 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keycloak
spec:
  replicas: 2
  ... // additional configs truncated
  spec:
    containers:
      - name: keycloak
        image: quay.io/keycloak/keycloak:15.0.2
        ... // additional configs truncated
        env:
          - name: JGROUPS_DISCOVERY_PROTOCOL
            value: "kubernetes.KUBE_PING"
          - name: JGROUPS_DISCOVERY_PROPERTIES
            value: "dump_requests=false,namespace=dev"
          - name: CACHE_OWNERS_COUNT
            value: "2"
          - name: CACHE_OWNERS_AUTH_SESSIONS_COUNT
            value: "2"  
          - name: KEYCLOAK_HTTP_PORT
            value: "80"
          - name: KEYCLOAK_HTTPS_PORT
            value: "443" 
5. Enable Readiness & Liveness probes 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keycloak
spec:
  replicas: 2
  ... // additional configs truncated
  spec:
    containers:
      - name: keycloak
        image: quay.io/keycloak/keycloak:15.0.2
        ... // additional configs truncated
        readinessProbe:
          httpPath: "/auth/realms/master"
          port: http
          initialDelaySeconds: 40
          periodSeconds: 10
          timeoutSeconds: 10
          failureThreshold: 3
        livenessProbe:
          httpPath: "/auth/"
          port: http
          initialDelaySeconds: 40
          periodSeconds: 10
          timeoutSeconds: 10
          failureThreshold: 5 
After making the changes in deployment, let’s apply it to the k8s cluster. 
kubectl apply -f keycloak-deployment.yaml 
You should see the pods running in the mentioned namespace, as follows: 
Keycloak High Availability 002

Verification of Keycloak HA

Here are the checks required to verify Keycloak is running in HA: 
1. Pods running on different nodes
Keycloak High Availability 007
2. [Important] Look for the following lines in the pod’s logs to verify that Keycloak replicas are aware of each other while forming a CLUSTER.
Keycloak High Availability 003
3. [Important] Check the pod’s log to verify that Keycloak is running Standalone-Clustered Mode.
Keycloak High Availability 004
4. The new Keycloak node can join/leave the Keycloak cluster.
Keycloak High Availability 005
5. There is rebalancing happening between Keycloak pods
Keycloak High Availability 006
6. If the Keycloak primary instance goes down, It can fail over to the other instance, and vice versa.
7. Access the Keycloak UI, and make a session by logging into the account. Try restarting one of the Keycloak pods, and make sure the user session is still Valid.
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
That’s all! You’ve made it to the end 🎉… I’d love to hear your thoughts on this solution. Don’t hesitate to ask questions or share suggestions about the post.
Happy learning! 😊

References