Proxmox VE Production Readiness Checklist
A comprehensive checklist to ensure your Proxmox VE environment is production-ready. Covers high availability, backups, monitoring, security hardening, update procedures, and disaster recovery.
Before You Go Live
Running Proxmox VE in a homelab and running it in production are fundamentally different things. In a homelab, an unexpected reboot is a learning experience. In production, it is an incident that costs money and trust. This checklist covers everything you need to verify before your Proxmox environment is ready for production workloads.
Print this out, go through it line by line, and do not skip items because they seem tedious. The tedious items are usually the ones that save you at 3 AM.
1. High Availability
- Cluster quorum — You have at least three nodes (or two nodes plus a QDevice) to maintain quorum during a single node failure.
- HA groups configured — Critical VMs and containers are assigned to HA groups with appropriate priorities and failover order.
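- HA tested — You have deliberately shut down a node and verified that HA-managed VMs and containers are restarted cleanly on another node. Recovery after a node failure is a restart elsewhere, not a live migration, so expect a short outage per guest. Do this before production, not during your first real outage.
- Fencing verified — A failed node must be reliably fenced before HA restarts its guests elsewhere. Proxmox VE HA self-fences through a watchdog: the software watchdog (softdog) by default, or a hardware watchdog if you configure one. Confirm that a node which loses quorum actually resets, and keep out-of-band management (IPMI/iLO/iDRAC) available for manual power control. Without working fencing, the same guest can end up running on two nodes at once and corrupt its data.
- Shared or replicated storage available — HA-managed guests need their disks reachable from the node that takes over. That means shared storage (Ceph, NFS, iSCSI, or similar) accessible from all nodes, or local ZFS with storage replication if you can accept losing the changes made since the last replication run. Plain, unreplicated local storage cannot be used for HA guests.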
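The quorum requirement in the first item is simple majority arithmetic, and it can help to see it spelled out. The following is a minimal sketch, assuming one vote per node and one vote for the QDevice, and assuming the QDevice itself stays reachable; the function names are illustrative and not part of any Proxmox tooling.

```python
def quorum_threshold(total_votes: int) -> int:
    """Votes needed for a strict majority of the expected votes."""
    return total_votes // 2 + 1


def survives_failures(nodes: int, failed_nodes: int, qdevice: bool = False) -> bool:
    """Return True if the surviving members still hold a majority of votes.

    Assumption: every node has one vote and the QDevice (if present)
    contributes one vote and remains reachable.
    """
    if not 0 <= failed_nodes <= nodes:
        raise ValueError("failed_nodes must be between 0 and nodes")
    extra = 1 if qdevice else 0
    total = nodes + extra
    remaining = (nodes - failed_nodes) + extra
    return remaining >= quorum_threshold(total)


if __name__ == "__main__":
    for nodes, qdevice in [(2, False), (2, True), (3, False)]:
        ok = survives_failures(nodes, failed_nodes=1, qdevice=qdevice)
        print(f"nodes={nodes} qdevice={qdevice} survives one failure: {ok}")
```

Under these assumptions a plain two-node cluster loses quorum as soon as one node fails, which is why the checklist asks for a third node or a QDevice.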
2. Backup and Recovery
- Automated backup schedule — All production VMs and containers are backed up on a regular schedule using vzdump or Proxmox Backup Server.
- Backups stored off-node — Backups are stored on a separate system, not on the same Proxmox node. A failed node should not take your backups with it.
- Backup retention defined — You have a clear retention policy (daily, weekly, monthly) that balances storage costs with recovery needs.
- Restore tested — You have actually restored a VM from backup and verified it works. If you have never tested a restore, you do not have backups — you have hope.
- Off-site backup copy — At least one copy of your backups exists in a different physical location or datacenter.
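To make the retention item concrete, here is a small sketch of how a daily/weekly/monthly policy picks which backups to keep. It is a simplified model written for illustration, not the exact pruning algorithm used by vzdump or Proxmox Backup Server, and the function name and default values are assumptions.

```python
from datetime import datetime


def select_backups_to_keep(timestamps, keep_daily=7, keep_weekly=4, keep_monthly=6):
    """Return the set of backup timestamps a simple retention policy keeps.

    Simplified model: for each period type, keep the newest backup from
    each of the most recent N periods that contain at least one backup.
    """
    ordered = sorted(timestamps, reverse=True)  # newest first
    rules = [
        (keep_daily, lambda t: t.date()),
        (keep_weekly, lambda t: tuple(t.isocalendar())[:2]),  # (ISO year, ISO week)
        (keep_monthly, lambda t: (t.year, t.month)),
    ]
    keep = set()
    for limit, period_of in rules:
        seen = set()
        for ts in ordered:
            period = period_of(ts)
            if period in seen:
                continue  # a newer backup already represents this period
            if len(seen) >= limit:
                break  # enough periods covered for this rule
            seen.add(period)
            keep.add(ts)
    return keep


if __name__ == "__main__":
    nightly = [datetime(2024, 1, day, 2, 0) for day in range(1, 11)]
    for ts in sorted(select_backups_to_keep(nightly, keep_daily=3, keep_weekly=2, keep_monthly=1)):
        print(ts.isoformat())
```

Running a model like this against your real backup history is a cheap way to check that the policy keeps what you expect before you let anything prune automatically.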
3. Monitoring and Alerting
- Resource monitoring — CPU, memory, disk I/O, and network utilization are tracked on every node. You should know what "normal" looks like so you can detect anomalies.
- Disk health monitoring — SMART data is monitored and alerts fire when drives show signs of failure. ZFS and Ceph health checks are automated.
- Alert destinations configured — Proxmox notifications are sent to email, Slack, or your monitoring platform. Alerts that nobody sees are useless.
- Uptime monitoring — External monitoring checks that your services are reachable. If your monitoring runs on the same infrastructure it monitors, a full outage leaves you flying blind.
- Log aggregation — System logs from all nodes are collected centrally for troubleshooting and audit purposes.
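Knowing what "normal" looks like, as the resource monitoring item puts it, usually comes down to comparing a new reading against a baseline. The sketch below shows one simple way to do that with a standard-deviation threshold. The threshold of 3 and the function name are assumptions, and a real monitoring platform will offer more robust methods.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it is more than `threshold` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two samples to establish a baseline")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is a deviation.
        return current != baseline
    return abs(current - baseline) / spread > threshold


if __name__ == "__main__":
    cpu_percent = [20, 22, 21, 19, 23, 20, 21]  # hypothetical samples
    print(is_anomalous(cpu_percent, 22))
    print(is_anomalous(cpu_percent, 95))
```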
4. Security Hardening
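- SSH hardened — Password authentication disabled and root login restricted to key-based authentication. In a cluster, be careful about disabling root SSH entirely, because Proxmox VE nodes use it to talk to each other for some operations. Avoid exposing SSH directly to the internet; if you must, a non-standard port only reduces log noise and is not a substitute for keys and a firewall.
- Firewall enabled — Proxmox firewall or iptables/nftables rules restrict access to the management interfaces (web UI and API on TCP 8006, SSH on TCP 22) to trusted networks. Only necessary ports are open.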
- fail2ban installed — Brute-force protection on SSH and the Proxmox web interface.
- Two-factor authentication — 2FA enabled for all Proxmox web interface users, especially administrators.
- API tokens scoped — API tokens have the minimum necessary permissions. No full-admin tokens used by automation scripts.
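- TLS certificates valid — The management interface serves a certificate from a trusted CA (public or internal) rather than the one generated at install time, and renewal is automated, for example through the built-in ACME support.
- No-subscription repo configured correctly — If not using a Proxmox subscription, the no-subscription repository is properly configured and the enterprise repository is disabled to avoid update errors. Keep in mind that Proxmox recommends the enterprise repository for production, because no-subscription packages are less thoroughly tested.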
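Several of the items above can be checked automatically. As one example, the SSH item can be verified with a short script that reads an sshd_config and reports weak settings. This is a rough sketch under stated assumptions: it only handles whitespace-separated `Keyword value` lines, ignores `Match` blocks and `Include` files, and treats a missing `PermitRootLogin` as the upstream OpenSSH default of `prohibit-password`. The function name is hypothetical.

```python
def audit_sshd_config(text):
    """Return a list of findings for a few sshd_config settings.

    Only the first occurrence of each keyword counts, which mirrors how
    sshd treats repeated keywords.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (simplified)
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings.setdefault(parts[0].lower(), parts[1].strip().lower())

    findings = []
    # Assumed default when unset: PasswordAuthentication yes
    if settings.get("passwordauthentication", "yes") != "no":
        findings.append("PasswordAuthentication should be 'no'")
    # Assumed default when unset: PermitRootLogin prohibit-password
    root_login = settings.get("permitrootlogin", "prohibit-password")
    if root_login not in ("no", "prohibit-password", "without-password"):
        findings.append("PermitRootLogin should be 'prohibit-password' or 'no'")
    return findings


if __name__ == "__main__":
    with open("/etc/ssh/sshd_config", encoding="utf-8") as handle:
        for finding in audit_sshd_config(handle.read()):
            print("WARNING:", finding)
```

For an authoritative view of the effective configuration, `sshd -T` prints what the daemon actually uses; the script above is only meant to show the idea.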
5. Update and Patch Management
- Update procedure documented — A written procedure for applying Proxmox updates, including order of operations for cluster nodes.
- Maintenance windows defined — Agreed-upon times for updates and reboots that minimize impact on services.
- Rolling update process — Cluster nodes are updated one at a time with verification between each node. Never update all nodes simultaneously.
- Kernel update plan — Kernel updates require reboots. Your plan accounts for live-migrating VMs before rebooting each node.
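- Rollback capability — You can boot a previous kernel if an update causes issues. Proxmox keeps previous kernel versions installed and selectable in the boot menu by default (GRUB, or systemd-boot on some UEFI installations such as ZFS root). Make sure you have console access, physical or out-of-band, so you can actually select one.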
6. Disaster Recovery
- DR plan documented — A written document covering every failure scenario: single disk, single node, all nodes, storage failure, network failure, datacenter loss.
- Recovery time objectives (RTO) defined — How long can each service be down? This determines your DR architecture.
- Recovery point objectives (RPO) defined — How much data can you afford to lose? This determines your backup frequency.
- DR plan tested — A DR plan that has never been tested is a theory, not a plan. Schedule quarterly DR drills.
- Contact list maintained — Phone numbers and escalation paths for every person involved in incident response.
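The link between RPO and backup frequency is plain arithmetic. Here is a minimal sketch, assuming a simple model in which a failure just before the next backup completes loses everything written since the previous backup started. The service names and values are made up for illustration.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval, backup_duration=timedelta(0)):
    """Worst-case loss window under the simple model described above."""
    return backup_interval + backup_duration


def rpo_violations(services, backup_duration=timedelta(0)):
    """Return the names of services whose worst-case loss exceeds their RPO.

    `services` maps a service name to (rpo, backup_interval) timedeltas.
    """
    return sorted(
        name
        for name, (rpo, interval) in services.items()
        if worst_case_data_loss(interval, backup_duration) > rpo
    )


if __name__ == "__main__":
    services = {  # hypothetical services
        "billing-db": (timedelta(hours=1), timedelta(hours=24)),
        "wiki": (timedelta(hours=24), timedelta(hours=24)),
    }
    print(rpo_violations(services))
```

A service that shows up here needs more frequent backups, replication, or an honest renegotiation of its RPO.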
7. Documentation and SLAs
- Network diagram current — An up-to-date diagram showing all nodes, networks, VLANs, storage connections, and external dependencies.
- Runbooks written — Step-by-step procedures for common tasks: adding a node, replacing a disk, recovering from backup, scaling storage.
- SLA defined with stakeholders — Your team and your customers agree on uptime targets, maintenance windows, and incident response times.
- Change management process — Changes to production infrastructure are reviewed, scheduled, and documented. No cowboy deployments.
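When you negotiate uptime targets, it helps to translate percentages into minutes. The following sketch does that conversion; the function name is illustrative, and whether planned maintenance counts against the budget is something your SLA has to define.

```python
def allowed_downtime_minutes(uptime_percent, period_days=365):
    """Minutes of downtime permitted by an uptime target over a period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_days * 24 * 60 * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes per year")
```

A 99.99% target leaves roughly 52.6 minutes of downtime per year, which is less than a single slow node reboot cycle on some hardware.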
Mobile Monitoring with ProxmoxR
Part of production readiness is ensuring you can respond to incidents from anywhere. ProxmoxR gives you mobile access to your entire Proxmox cluster — check node status, view resource usage, manage VMs, and respond to alerts directly from your phone. When your monitoring system pages you at midnight, ProxmoxR lets you assess the situation immediately without finding a laptop.
When the Checklist Feels Overwhelming
This is a long list, and for good reason. Production infrastructure is serious business. Every item here exists because someone, somewhere, learned the hard way that skipping it causes pain.
If this checklist feels overwhelming, consider a managed infrastructure partner such as Binadit, which handles all of this for you. Binadit is an EU-based managed infrastructure provider operating since 2004, with a 99.99% SLA and 24/7 engineer support. They manage the hardware, networking, security, monitoring, backups, updates, and disaster recovery, so your team can focus on building applications instead of maintaining servers.
Whether you handle it yourself or work with a partner, what matters is that every item on this list is addressed before your first production VM goes live.