How it works

Support, stability, and dependency info

High-availability Namespaces are in Public Preview for Temporal Cloud.

In traditional active/active replication, multiple nodes serve requests and accept writes simultaneously, ensuring strong synchronous data consistency. In contrast, with a Temporal Cloud high-availability Namespace, only the active zone accepts requests and writes at any given time. Workflow history events are written to the active zone first and then asynchronously replicated to the standby zone replica, which keeps the replica closely in sync with the active zone.

[Figures: Before failover; After failover]

Failovers

A failover shifts Workflow Execution processing from an active Temporal Namespace region to a standby Temporal Namespace region during outages or other incidents. Standby Namespace regions use replication to duplicate data and prevent data loss during failover.

What happens during the failover process?

Temporal Cloud initiates a Namespace failover when it detects an incident or outage that raises error rates or latency in the active region of a multi-region Namespace. The failover shifts Workflow processing to a standby region that isn’t affected by the incident. This lets existing Workflows continue and new Workflows start while the incident is being fixed. Once the incident is resolved, Temporal Cloud performs a "failback" by shifting Workflow Execution processing back to the original region.

Info: You can test the failover of your multi-region Namespace by manually triggering one from the Temporal Cloud UI or with the 'tcld' CLI utility. In most scenarios, we recommend that you let Temporal Cloud handle failovers for you.

Health Checks

How does Temporal detect failover conditions?

Temporal Cloud automates failovers by performing internal health checks. This process monitors your request error rates, latencies, and any infrastructure issues that might cause service disruptions, such as request timeouts. It automatically triggers failovers when these indicators exceed our allowed thresholds.
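The threshold check described above can be pictured with a minimal sketch. It is not Temporal Cloud's implementation: the `HealthSample` type, the `should_fail_over` function, and both threshold values are hypothetical, chosen only to illustrate the idea of comparing monitored indicators against allowed limits.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration; Temporal Cloud's real
# thresholds are internal and are not stated in this document.
MAX_ERROR_RATE = 0.05      # fraction of requests that fail
MAX_LATENCY_MS = 2000.0    # request latency in milliseconds


@dataclass(frozen=True)
class HealthSample:
    """One hypothetical health-check observation for the active region."""
    error_rate: float              # 0.0 to 1.0
    latency_ms: float
    infrastructure_healthy: bool   # False on issues such as request timeouts


def should_fail_over(sample: HealthSample) -> bool:
    """Return True when any monitored indicator exceeds its allowed threshold."""
    return (
        sample.error_rate > MAX_ERROR_RATE
        or sample.latency_ms > MAX_LATENCY_MS
        or not sample.infrastructure_healthy
    )
```

A production health check would likely also require the condition to persist across several samples before acting, so that a single noisy measurement does not trigger a failover.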

Replication lag

Multi-region Namespaces use asynchronous replication between regions. Workflow updates in the active region, along with associated history events, are transmitted to the standby region with a short delay. This delay is called the replication lag. Temporal Cloud strives to maintain a P95 replication delay of less than 1 minute. In this context, P95 means that 95% of replicated updates reach the standby region faster than this limit.
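To make the P95 figure concrete, here is a small sketch that computes a 95th percentile with the nearest-rank method and compares it to the 1-minute target. The sample values and the `p95` helper are invented for illustration; they are not Temporal Cloud measurements or APIs.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # 1-based rank = ceil(0.95 * n), computed with integers to avoid
    # floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical replication-lag samples in seconds (illustrative only).
lag_seconds = [0.4, 0.6, 0.5, 1.2, 0.8, 0.7, 2.5, 0.9, 0.3, 75.0,
               0.5, 0.6, 0.4, 1.0, 0.7, 0.8, 0.6, 0.5, 1.1, 0.9]

TARGET_SECONDS = 60  # the "less than 1 minute" target
lag_p95 = p95(lag_seconds)
within_target = lag_p95 < TARGET_SECONDS
```

With these 20 samples the P95 is 2.5 seconds, so the target is met even though one sample lagged by 75 seconds. A P95 target bounds the typical delay, not the worst case.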

Replication lag means that a forced failover may cause in-progress Workflows to roll back. Lag may also cause recently started Workflows to be temporarily unavailable until the active region recovers. Temporal event versioning and conflict resolution mechanisms help guarantee that the Workflow Event History can be replayed. Critical operations like Signals won't get lost.

Failover scenarios

The Temporal Cloud failover mechanism supports several modes to execute Namespace failovers. These modes include graceful failover ("handover"), forced failover, and a hybrid mode. The hybrid mode is Temporal Cloud’s default Namespace behavior.

Graceful failover (handover)

In this mode, replication tasks are fully processed and drained. Temporal Cloud pauses traffic to the Namespace before the failover. This prevents the loss of progress and avoids data conflicts. The Namespace experiences a short period of unavailability, defaulting to 10 seconds.

During this period, existing Workflows stop making progress. Temporal Cloud returns a "Service unavailable" error, which SDKs retry. State transitions do not happen and tasks are not dispatched. User requests, such as starting or signaling a Workflow, are rejected while operations are paused during the handover.
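The retry behavior during a handover can be sketched generically. This is not the Temporal SDK's retry code: `ServiceUnavailableError` and `call_with_retry` are hypothetical names, and the attempt count and delays are assumptions chosen for illustration.

```python
import time


class ServiceUnavailableError(Exception):
    """Hypothetical stand-in for the "Service unavailable" error."""


def call_with_retry(operation, max_attempts=5, initial_delay=0.1,
                    backoff=2.0, sleep=time.sleep):
    """Call operation(), retrying with exponential backoff while it
    raises ServiceUnavailableError. Re-raises after the last attempt."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ServiceUnavailableError:
            if attempt == max_attempts:
                raise
            sleep(delay)      # wait before the next attempt
            delay *= backoff  # grow the wait each time
```

Because the pause is short, a request that is rejected during the handover typically succeeds on a later attempt without any action from the caller.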

This mode favors consistency over availability.

Forced failover

In this mode, a Namespace immediately activates in the standby region. Events not replicated due to replication lag will undergo conflict resolution upon reaching the new active region.

This mode prioritizes availability over consistency.

Hybrid failover mode

While graceful failovers are preferred for consistency, they aren’t always practical. Temporal Cloud’s hybrid failover mode (the default mode) limits an initial graceful failover attempt to 10 seconds or less. During this period, existing Workflows stop making progress. Temporal Cloud returns a "Service unavailable" error, which SDKs retry. If the graceful attempt doesn’t complete within that window, Temporal Cloud automatically switches to a forced failover. This strategy balances consistency and availability requirements.
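The hybrid strategy amounts to "try a graceful handover for a bounded time, then force it". The sketch below shows that control flow only. The `hybrid_failover` function and its `replication_drained` and `force_failover` callables are hypothetical, and the polling interval is an assumption. Only the 10-second limit comes from the text above.

```python
import time

GRACEFUL_TIMEOUT_SECONDS = 10.0  # limit on the graceful attempt


def hybrid_failover(replication_drained, force_failover,
                    timeout=GRACEFUL_TIMEOUT_SECONDS, poll_interval=0.5,
                    clock=time.monotonic, sleep=time.sleep):
    """Attempt a graceful handover, then fall back to a forced failover.

    replication_drained: hypothetical callable, True once all pending
        replication tasks have been processed.
    force_failover: hypothetical callable that activates the standby
        region immediately.
    Returns "graceful" or "forced" to report which path was taken.
    """
    deadline = clock() + timeout
    while True:
        if replication_drained():
            return "graceful"  # nothing left to reconcile
        if clock() >= deadline:
            break              # graceful window exhausted
        sleep(poll_interval)
    force_failover()           # unreplicated events go through conflict resolution
    return "forced"
```

The clock and sleep functions are parameters so the logic can be exercised without waiting in real time.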

See the sections on triggering a failover, Worker deployment, and routing for more information.