Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
141 changes: 116 additions & 25 deletions docs/managing-mirrord/high-availability.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
---
title: High Availability
date: 2026-01-12T00:00:00.000Z
lastmod: 2026-01-12T00:00:00.000Z
lastmod: 2026-06-07T00:00:00.000Z
draft: false
images: []
linktitle: High Availability
Expand All @@ -15,37 +15,128 @@ tags:
description: High availability for mirrord Operator in Enterprise Tier
---

Starting from chart version `1.40.1`, the mirrord Operator is by default highly available.
This means that mirrord sessions should survive transient failures in the cluster,
with respect to the set of advanced mfT features used by the session (see [advanced features](#advanced-features) section).
This includes failures of nodes where the mirrord Operator pods are running.
{% hint style="info" %}
Session persistence — the ability to survive an Operator restart without terminating active mirrord sessions — is available to users on the Enterprise pricing plan.
{% endhint %}

High availability is implemented in a lightweight fasion, by saving sessions' state in the cluster's builtin etcd database (as [Custom Resources](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)). Therefore, no persistent volume is required. At startup, the operator restores and resumes saved sessions.
{% hint style="warning" %}
The Operator's leader election and standby replica infrastructure is active regardless of license tier, so the Operator itself may recover from failures even on non-Enterprise plans. However, session persistence is not guaranteed outside of Enterprise. Sessions created on non-Enterprise plans are tied to a specific operator instance and are not guaranteed to survive a leader transition. We make no promises about HA behaviour for non-Enterprise plans.
{% endhint %}

{% hint style="info" %}
This feature is available to users on the Enterprise pricing plan.
Starting from chart version `1.40.1`, the mirrord Operator is highly available by default. This means mirrord sessions survive transient failures — including failures of the node where the Operator pod is running — without terminating the user's local process.

## How it works

HA is built on two mechanisms that work together:

**State persistence in etcd.** The Operator continuously saves session state as Kubernetes [Custom Resources](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/), which are stored in the cluster's built-in etcd database. No external database or persistent volume is required. When an Operator pod starts (or restarts after a failure), it reads these Custom Resources and restores all in-progress sessions automatically.

**Leader election.** The Operator uses Kubernetes leader election to ensure that exactly one replica is active at any time. The elected leader serves all mirrord sessions. Any additional replicas remain in standby, holding the Kubernetes API watch loop open and ready to acquire leadership the moment the current leader becomes unavailable.

When a leader failure is detected, a standby replica acquires the leader lease, reads the persisted session state, and resumes active sessions. The time to recover depends on the Kubernetes leader-election timeouts and whether a warm standby replica is already running.

## Requirements

- **Enterprise license.** HA is an Enterprise-tier feature.
- **Helm chart `1.40.1` or later.** HA is enabled by default from this version; no additional configuration is required to enable it.
- **Standard Kubernetes cluster.** The Operator relies on the cluster's etcd (via the Kubernetes API) for state storage. No additional infrastructure is required.

## Configuration

### Replica count

By default the Operator runs with a single replica. A single-replica deployment is already highly available in the sense that sessions are persisted and automatically restored after a pod restart. However, with a single replica there is a cold-start delay during recovery: the new pod must be scheduled and started before leadership can be acquired.

Starting from chart version `1.40.3`, you can run multiple replicas for faster failover:

```yaml
# values.yaml
operator:
replicas: 2
```

With multiple replicas, standby pods are already running when a leader failure occurs. The new leader is elected from the existing standby pool, which eliminates the pod scheduling delay and shortens the recovery window.

{% hint style="warning" %}
Additional replicas provide failover redundancy, not additional serving capacity. Only the elected leader handles mirrord sessions. A single Operator instance is a lightweight Kubernetes controller capable of managing hundreds of concurrent sessions across large clusters — additional replicas shorten recovery time after a failure, not throughput. If you are hitting session limits, see [Scalability](scalability.md).
{% endhint %}

## Multiple replicas
### Recommended: pod anti-affinity

When running multiple replicas, schedule them on separate nodes so that a single node failure does not take down both the leader and all standbys at the same time:

```yaml
# values.yaml
operator:
replicas: 2
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: mirrord-operator
topologyKey: kubernetes.io/hostname
```

Using `preferredDuringSchedulingIgnoredDuringExecution` avoids blocking pod scheduling if only one node is available.

### Recommended: PodDisruptionBudget

To prevent Kubernetes from evicting all replicas simultaneously during voluntary disruptions (node drains, cluster upgrades), set a [PodDisruptionBudget](https://kubernetes.io/docs/tasks/run-application/configure-pdb/):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: mirrord-operator-pdb
namespace: mirrord
spec:
minAvailable: 1
selector:
matchLabels:
app: mirrord-operator
```

## Caveats

### Not all advanced features are HA-ready

HA protects the core session lifecycle: the connection between the local process and its agent pod. However, some advanced mirrord for Teams features maintain additional state that has not yet been migrated to the persistent HA model. Sessions that use these features will be **forcefully terminated** if the Operator pod fails, rather than resumed.

The table below shows the current support status:

| Feature | HA | Minimum chart version | Notes |
| --------------------------- | --- | --------------------- | ----- |
| SQS splitting | ✅ | `1.40.3` | |
| RabbitMQ splitting | ✅ | `3.152.0` | |
| GCP Pub/Sub splitting | ✅ | `3.158.0` | |
| Azure Service Bus splitting | ✅ | `3.161.0` | |
| Kafka splitting | ❌ | | |
| Copy target | ❌ | | Opt-in HA available via `COPIED_PODS_HA=true` (see below) |
| DB branching | ❌ | | |

If a session is using a non-HA feature and the Operator pod fails, the user's local process will see the session end. They can restart mirrord to begin a new session once the Operator is back online.

#### Opt-in: Copy target HA

Copy target sessions can be made HA by setting the `COPIED_PODS_HA` environment variable to `true` on the Operator pod:

By default, the mirrord Operator workload runs with a single replica.
Starting from chart version `1.40.3`, this can be configured with the `operator.replicas` setting in the chart.
```yaml
# values.yaml
operator:
extraEnv:
- name: COPIED_PODS_HA
value: "true"
```

Regardless of the configured scale, only one replica acts as a leader, and serves mirrord sessions.
Any additional replicas wait in a standby mode, ready to acquire leadership and resume work, in case of the current leader's failure.
Configuring the operator to run with multiple replicas allows for smoother leadership transitions,
since the new leader candidates are already running.
When enabled, the Operator persists the copied pod name in the session status. After a leader transition, the new leader recovers the session by locating the existing copied pod rather than creating a new one. This is disabled by default because it requires the copied pod to still be running and reachable after the transition.

## Advanced features
### Leadership transition window

Not all of the advanced mfT features have yet been migrated to highly available sessions.
Sessions using such features will be forcefully terminated in case of operator pod failure.
Even with multiple warm standby replicas, there is a brief window during leader failover when no replica is serving sessions. The Operator uses a Kubernetes `Lease` with a 120-second duration, renewed every 60 seconds. A standby replica can acquire the lease as soon as the current holder stops renewing it — up to 120 seconds in the worst case. Sessions that were in-flight during this window may experience a short interruption before the new leader resumes them.

The table below summarizes the current state of development.
### etcd as a dependency

| Feature | HA | Chart version |
| --------------- | ---| ------------- |
| SQS splitting | ✅ | `1.40.3` |
| Kafka splitting | ❌ | |
| Copy target | ❌ | |
| DB branching | ❌ | |
The Operator's HA model stores state in the Kubernetes API (backed by etcd). If the cluster's control plane itself is unavailable, the Operator cannot persist or restore session state. This is expected behavior for any Kubernetes workload that relies on the API server.