Proxmox High Availability (HA) Configuration Guide
Set up high availability in Proxmox VE to ensure your VMs and containers automatically fail over when a node goes down.
What Is High Availability in Proxmox?
High Availability (HA) in Proxmox VE ensures that your critical VMs and containers are automatically restarted on a different node if the host they're running on fails. Instead of manually detecting a node outage and migrating workloads, the HA manager handles everything automatically, typically recovering services within a couple of minutes.
This guide covers everything you need to configure a reliable HA setup in Proxmox, from prerequisites and fencing to HA groups, failover testing, and monitoring.
HA Prerequisites
Before enabling HA, you need the following in place:
- A working Proxmox cluster with at least 3 nodes (for proper quorum)
- Shared storage accessible from all nodes (Ceph, NFS, iSCSI, or similar)
- Fencing configured (so failed nodes can be forcibly stopped)
- VMs/containers stored on shared storage (local disks cannot be used for HA)
- Stable cluster network with redundant links recommended
HA will not work correctly with only 2 nodes unless you have a QDevice providing the third quorum vote. Without proper quorum, the HA manager cannot make safe failover decisions.
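If you are stuck with two nodes, the third quorum vote can come from a small external machine running corosync-qnetd. A sketch of the setup, assuming the external host is reachable at 192.168.1.50 (the IP is illustrative):

```shell
# On the external machine (a Debian-based host outside the cluster):
apt install corosync-qnetd

# On each Proxmox node:
apt install corosync-qdevice

# On one Proxmox node, register the external vote:
pvecm qdevice setup 192.168.1.50

# Confirm the QDevice vote is now counted:
pvecm status
```

The external host does not need to be powerful; it only answers quorum queries, so a small VM or single-board computer outside the cluster is enough.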
Understanding Fencing and STONITH
Fencing is the most critical component of any HA system. When a node becomes unresponsive, the remaining cluster members must ensure the failed node is truly stopped before starting its VMs on another node. Without fencing, you risk "split-brain" scenarios where the same VM runs on two nodes simultaneously, causing data corruption. This forced shutdown is the idea behind the generic term STONITH ("Shoot The Other Node In The Head"): the failed node is powered off or reset before its workloads are recovered elsewhere.
How Proxmox Fencing Works
Proxmox uses a watchdog-based fencing mechanism. Each node runs a watchdog timer that must be regularly reset by the HA manager. If a node loses quorum or the HA stack crashes, the watchdog triggers a hard reset of the node.
By default, Proxmox uses a software watchdog. For production environments, a hardware watchdog (IPMI watchdog) is strongly recommended:
# Check if a hardware watchdog is available
ls -la /dev/watchdog*
# View the current watchdog configuration
cat /etc/default/pve-ha-manager
# To use the IPMI hardware watchdog, point the HA stack at the module
echo 'WATCHDOG_MODULE=ipmi_watchdog' >> /etc/default/pve-ha-manager
# Load the module now; reboot the node so watchdog-mux picks it up
modprobe ipmi_watchdog
Testing the Watchdog
You can verify the watchdog is working by checking the HA manager logs:
# Check HA manager status
systemctl status pve-ha-lrm
systemctl status pve-ha-crm
# View HA logs
journalctl -u pve-ha-crm --since "1 hour ago"
Configuring HA Groups
HA groups define which nodes can run a particular set of HA resources and the preferred order. This lets you control where VMs should normally run and where they can migrate to during failover.
Creating an HA Group via the CLI
# Create an HA group with preferred nodes
ha-manager groupadd production-group \
--nodes pve1,pve2,pve3 \
--nofailback 0 \
--restricted 0
Key group settings:
- nodes: Comma-separated list of nodes with optional priority (e.g., pve1:2,pve2:1, where a higher priority is preferred)
- restricted: If set to 1, resources in this group can only run on the listed nodes. If set to 0, they can run on any node when all listed nodes are unavailable.
- nofailback: If set to 1, the VM will not automatically migrate back to the preferred node when it comes back online.
# Create a group with node priorities
ha-manager groupadd db-group \
--nodes pve1:2,pve2:1,pve3:1 \
--nofailback 0 \
--restricted 1
In this example, VMs in the db-group prefer to run on pve1 (priority 2), will failover to pve2 or pve3 (priority 1), and cannot run on any other node (restricted=1).
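You can read the group definitions back to verify what you created:

```shell
# List all configured HA groups and their settings
ha-manager groupconfig
```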
Adding Resources to HA
Once your groups are set up, add VMs or containers as HA resources:
# Add a VM (VMID 100) to HA with the production-group
ha-manager add vm:100 --group production-group --state started --max_restart 3 --max_relocate 2
# Add a container (CTID 200)
ha-manager add ct:200 --group production-group --state started
You can also add HA resources through the web UI by selecting a VM, going to More > Manage HA, and choosing the group and desired state.
HA Resource States
- started: The resource should always be running. If it stops or the node fails, HA will restart or migrate it.
- stopped: The resource should remain stopped. HA will stop it if someone starts it manually.
- ignored: HA does not manage this resource (useful for temporary maintenance).
- disabled: The resource is stopped and kept stopped; setting a resource to disabled is also how you clear an error state.
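These states are also the mechanism for temporary maintenance on a single guest: switching a resource to ignored stops HA from touching it without deleting its HA configuration. A minimal sketch, using VM 100 as the example:

```shell
# Suspend HA management of VM 100 during maintenance
ha-manager set vm:100 --state ignored

# ...perform the maintenance work...

# Hand the VM back to the HA manager
ha-manager set vm:100 --state started
```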
Recovery Policies
The HA manager uses two key parameters to control recovery behavior:
- max_restart: Maximum number of times to try restarting the VM on the same node before giving up (default: 1)
- max_relocate: Maximum number of times to try migrating the VM to another node (default: 1)
# Set aggressive recovery for a critical VM
ha-manager set vm:100 --max_restart 3 --max_relocate 3
# View current HA configuration for a resource
ha-manager config vm:100
If both restart and relocate limits are exhausted, the resource enters an error state and requires manual intervention.
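Recovering from that error state is a manual two-step process: disabling the resource acknowledges and clears the error, and re-enabling it puts it back under HA control. Using VM 100 as the example:

```shell
# Acknowledge the error by disabling the resource (this clears the error state)
ha-manager set vm:100 --state disabled

# Investigate and fix the underlying cause, then re-enable
ha-manager set vm:100 --state started
```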
Testing Failover
Never put an HA setup into production without testing failover. Here are several approaches:
Graceful Node Shutdown
# Shut down a node to simulate planned maintenance
shutdown -h now
# Watch the remaining nodes pick up the HA resources
ha-manager status
Simulated Node Crash
# Force a kernel panic to simulate a real crash (CAUTION: this will hard-crash the node)
echo c > /proc/sysrq-trigger
After triggering the crash, watch the remaining nodes. The cluster should detect the failure within about a minute; the failed node fences itself via its watchdog, and after the fencing delay (roughly two minutes in total) the HA manager restarts the resources on a surviving node.
Network Isolation Test
# Simulate network failure by blocking corosync traffic
iptables -A INPUT -p udp --dport 5405:5412 -j DROP
iptables -A OUTPUT -p udp --dport 5405:5412 -j DROP
# The isolated node should lose quorum and trigger its watchdog
# Remove the rules when testing is complete; -D deletes the exact rules
# added above, avoiding a full flush of the node's firewall
iptables -D INPUT -p udp --dport 5405:5412 -j DROP
iptables -D OUTPUT -p udp --dport 5405:5412 -j DROP
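While any of these failover tests runs, it helps to watch the cluster react from a surviving node. A simple loop like the following shows resource movement and quorum state side by side:

```shell
# Refresh HA and quorum state every 5 seconds on a surviving node
watch -n 5 'ha-manager status; echo; pvecm status | grep -i quorate'
```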
Monitoring HA Status
Regular monitoring of your HA setup catches problems before they cause outages:
# View the overall HA status
ha-manager status
# Example output:
# quorum OK
# master pve1 (active, Sun Feb 15 10:30:00 2026)
# vm:100 pve1 started
# vm:101 pve2 started
# ct:200 pve3 started
# Check detailed HA manager logs
journalctl -u pve-ha-crm -f
# Monitor fencing status
journalctl -u watchdog-mux
The Proxmox web interface shows HA status under Datacenter > HA > Status, displaying each resource, its current node, and its state. For mobile monitoring, ProxmoxR lets you keep an eye on your HA resource status and node health from anywhere, which is particularly valuable when you need to confirm that a failover event completed successfully while you're away from your workstation.
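For scripting or alerting, the same status is available through the local API via pvesh, which reads the same data the web UI displays:

```shell
# Machine-readable HA status, suitable for cron-based alerting
pvesh get /cluster/ha/status/current --output-format json
```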
Common HA Pitfalls
- Using local storage for HA VMs: HA requires shared storage. VMs on local disks cannot be migrated to another node.
- Running HA on a 2-node cluster without QDevice: Loss of one node means loss of quorum, and the surviving node cannot safely start HA resources.
- Ignoring fencing: Without proper fencing, split-brain scenarios can corrupt VM data.
- Overcommitting resources: If a node fails and the surviving nodes don't have enough CPU/RAM to run the migrated VMs, they may fail to start.
- Not testing failover: Assumptions about HA behavior often differ from reality. Test before you need it.
Conclusion
Proxmox high availability is a powerful feature that keeps your critical workloads running through hardware failures. The keys to a reliable HA setup are proper quorum (3+ nodes), shared storage, working fencing, and thorough failover testing. Start by configuring HA for your most critical VMs, test failover scenarios thoroughly, and monitor your HA status continuously to catch issues before they affect your services.
Take Proxmox management mobile
All the features discussed in this guide — accessible from your phone with ProxmoxR. Real-time monitoring, power control, firewall management, and more.