Proxmox ZFS Pool Degraded: How to Diagnose and Replace Failed Disks
Fix a degraded ZFS pool in Proxmox VE. Covers zpool status, disk replacement, resilvering, scrub errors, CKSUM errors, and zpool clear commands.
Understanding ZFS Pool States
ZFS is a popular storage choice for Proxmox VE because of its built-in redundancy, snapshots, and data integrity verification. When a disk in a redundant ZFS pool (mirror, RAIDZ, RAIDZ2) starts failing, the pool enters a DEGRADED state. This means your data is still accessible but you have lost redundancy. Another disk failure in the same vdev could mean data loss.
A degraded pool is an urgent situation that requires prompt attention. Do not wait.
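The main pool and device states map to actions roughly as follows. This is a small illustrative helper (the function name and message wording are mine, not ZFS output); it assumes the health string comes from something like `zpool list -H -o health rpool`:

```shell
#!/bin/bash
# Map a ZFS health state (as reported by `zpool list -H -o health`)
# to a rough severity note. Helper name and messages are illustrative.
describe_state() {
    case "$1" in
        ONLINE)   echo "healthy: full redundancy" ;;
        DEGRADED) echo "urgent: redundancy lost, replace the failing disk" ;;
        FAULTED)  echo "critical: vdev unusable, pool may be down" ;;
        UNAVAIL)  echo "critical: device cannot be opened" ;;
        OFFLINE)  echo "info: device taken offline by the administrator" ;;
        *)        echo "unknown state: $1" ;;
    esac
}

describe_state DEGRADED
```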
Checking Pool Status
The zpool status command is your primary diagnostic tool. It shows the health of every disk in every pool.
# Check all pool status
zpool status
# Example output for a degraded mirror:
# pool: rpool
# state: DEGRADED
# status: One or more devices has experienced an unrecoverable error. An
# attempt was made to correct the error. Applications are unaffected.
# action: Determine if the device needs to be replaced, and clear the errors
# using 'zpool clear' or replace the device with 'zpool replace'.
# config:
#
# NAME STATE READ WRITE CKSUM
# rpool DEGRADED 0 0 0
# mirror-0 DEGRADED 0 0 0
# sda2 ONLINE 0 0 0
# sdb2 DEGRADED 0 0 4 too many errors
#
# errors: No known data errors
# Check detailed device info
zpool status -v rpool
# View pool I/O statistics
zpool iostat rpool 5
# Check SMART data for the failing drive
smartctl -a /dev/sdb
Understanding Error Columns
The three error columns in zpool status output tell you different things about what is going wrong:
- READ - Number of read errors. Indicates the disk is having trouble returning data.
- WRITE - Number of write errors. The disk failed to accept writes.
- CKSUM - Checksum errors. Data was read but did not match the expected checksum. This often indicates a failing disk, cable, or controller.
# CKSUM errors are particularly concerning because they mean
# data corruption was detected (and corrected from redundancy)
# A few CKSUM errors after a power failure may be transient
# Consistently increasing CKSUM errors mean the disk is dying
# Monitor error counts over time
watch -n 60 "zpool status rpool | grep -E 'READ|WRITE|CKSUM|DEGRADED'"
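As a sketch of automating this check, the device rows of zpool status can be parsed with awk. The sample status text below is hard-coded for illustration; in practice you would pipe in `zpool status <pool>`:

```shell
#!/bin/bash
# Flag any device row in `zpool status` output whose READ, WRITE, or
# CKSUM counter is nonzero. Sample output is hard-coded for illustration.
sample_status='  NAME        STATE     READ WRITE CKSUM
  rpool       DEGRADED     0     0     0
    mirror-0  DEGRADED     0     0     0
      sda2    ONLINE       0     0     0
      sdb2    DEGRADED     0     0     4'

flag_errors() {
    awk '$2 ~ /ONLINE|DEGRADED|FAULTED/ && ($3 > 0 || $4 > 0 || $5 > 0) {
        printf "%s: READ=%s WRITE=%s CKSUM=%s\n", $1, $3, $4, $5
    }'
}

echo "$sample_status" | flag_errors
```

Feeding it the degraded mirror example above reports only sdb2, since it is the one device with a nonzero counter.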
Replacing a Failed Disk
Once you have identified the failing disk and installed its replacement, use zpool replace to initiate the rebuild (resilvering).
# Identify the failing disk's device path
zpool status rpool
# Note the device name, e.g., sdb2
# Find the new replacement disk
lsblk
# Look for the new unpartitioned disk, e.g., sdc
# If replacing a disk in the root pool, copy the partition table from a
# healthy disk, then randomize the new disk's GUIDs
sgdisk /dev/sda -R /dev/sdc
sgdisk -G /dev/sdc
# On Proxmox root pools, also set up booting from the new disk; point
# proxmox-boot-tool at the new disk's ESP (check the number with lsblk)
proxmox-boot-tool format /dev/sdcX
proxmox-boot-tool init /dev/sdcX
# Replace the failed disk using the partition that belongs to the pool
zpool replace rpool /dev/sdb2 /dev/sdc2
# For non-root data pools with whole-disk vdevs
zpool replace tank /dev/sdb /dev/sdc
# Monitor resilvering progress
zpool status rpool
# scan: resilver in progress since Fri Mar 20 10:00:00 2026
#       1.23T scanned at 456M/s, 789G issued at 300M/s, 1.23T total
#       789G resilvered, 62.64% done, 00:26:46 to go
# Watch resilver progress in real time
watch -n 10 "zpool status rpool | grep -A 3 scan"
Do not reboot or power off the server during resilvering unless absolutely necessary. An interrupted resilver will restart from the beginning.
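If you want a rough ETA outside of what zpool prints, the scan line can be parsed and the remaining time computed from the issued size, total size, and issue rate. This is a sketch assuming the modern OpenZFS scan-line format; the sample line is hard-coded for illustration:

```shell
#!/bin/bash
# Estimate resilver time remaining from a `scan:` progress line.
# Assumes the format "<X> issued at <Y>M/s, <Z>T total"; the sample
# line is hard-coded here for illustration.
scan_line='1.23T scanned at 456M/s, 789G issued at 300M/s, 1.23T total'

to_gib() {
    # Convert a size like 789G or 1.23T to whole GiB
    case "$1" in
        *T) echo "$1" | awk '{ sub(/T/, ""); printf "%.0f", $1 * 1024 }' ;;
        *G) echo "$1" | awk '{ sub(/G/, ""); printf "%.0f", $1 }' ;;
    esac
}

issued=$(to_gib "$(echo "$scan_line" | grep -o '[0-9.]*[TG] issued' | awk '{print $1}')")
total=$(to_gib "$(echo "$scan_line" | grep -o '[0-9.]*[TG] total' | awk '{print $1}')")
rate=$(echo "$scan_line" | grep -o '[0-9]*M/s' | tail -1 | tr -d 'M/s')
# remaining GiB converted to MiB, divided by MiB/s, then to minutes
eta_min=$(( (total - issued) * 1024 / rate / 60 ))
echo "~${eta_min} minutes remaining"
```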
Running a Scrub After Replacement
After resilvering completes, run a scrub to verify all data integrity across the pool.
# Start a scrub
zpool scrub rpool
# Check scrub progress
zpool status rpool | grep scan
# Example output:
# scan: scrub in progress since Fri Mar 20 14:00:00 2026
# 500G scanned, 200G issued at 150M/s, 1.2T total
# 0B repaired, 16.67% done, estimated 2:30:00 to go
# After scrub completes:
# scan: scrub repaired 0B in 03:15:30 with 0 errors on Fri Mar 20 17:15:30 2026
# Schedule regular scrubs (recommended monthly)
# Add to crontab:
# 0 2 1 * * /sbin/zpool scrub rpool
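Note that Debian-based Proxmox installs typically already ship a monthly scrub job via the zfsutils-linux package (in /etc/cron.d/zfsutils-linux), so check before adding your own. To verify a completed scrub programmatically, the final scan line can be tested for repairs and errors. A minimal sketch; the sample line and `scrub_ok` helper are mine, and in practice you would feed in `zpool status rpool | grep 'scan:'`:

```shell
#!/bin/bash
# Check the result line of a completed scrub. The sample line is
# hard-coded for illustration.
scan_line='scan: scrub repaired 0B in 03:15:30 with 0 errors on Fri Mar 20 17:15:30 2026'

scrub_ok() {
    # Succeeds only if the scrub repaired nothing and found no errors
    echo "$1" | grep -q 'repaired 0B' && echo "$1" | grep -q 'with 0 errors'
}

if scrub_ok "$scan_line"; then
    echo "scrub clean"
else
    echo "scrub found problems - investigate with zpool status -v"
fi
```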
Using zpool offline/online
Sometimes you need to take a disk offline before physically removing it, or bring a disk back online after resolving a transient issue.
# Take a disk offline (before physical removal)
zpool offline rpool /dev/sdb2
# Bring a disk back online (after fixing cable/connection)
zpool online rpool /dev/sdb2
# If a disk has transient errors but is not actually failing,
# clear the error counters after resolving the issue
zpool clear rpool
# Clear errors for a specific device
zpool clear rpool /dev/sdb2
When zpool clear Is Appropriate
The zpool clear command resets error counters and can bring a disk back from a FAULTED state. Use it only when you have resolved the underlying problem.
# Appropriate uses of zpool clear:
# - After replacing a bad cable
# - After a power event caused transient errors
# - After fixing a SATA/SAS controller issue
# - After reseating a drive
# NOT appropriate:
# - To hide errors from a dying disk
# - To "fix" a disk without addressing the root cause
# After clearing, monitor for recurring errors
zpool clear rpool
# Then watch:
zpool status rpool
# If errors return within hours, the disk needs replacement
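One way to make that follow-up check concrete is to capture the device row before and after the clear and compare CKSUM counters. The sample rows and the `cksum_of` helper are illustrative; in practice you would save `zpool status <pool>` output to files a few hours apart:

```shell
#!/bin/bash
# Compare CKSUM counters from two `zpool status` snapshots taken some
# hours apart after `zpool clear`. Sample rows are hard-coded.
before='      sdb2    ONLINE       0     0     0'
after='      sdb2    ONLINE       0     0     7'

cksum_of() { echo "$1" | awk '{ print $5 }'; }

if [ "$(cksum_of "$after")" -gt "$(cksum_of "$before")" ]; then
    echo "CKSUM errors returned after clear - plan a disk replacement"
else
    echo "no new errors since clear"
fi
```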
Monitoring Pool Health
Proactive monitoring prevents degraded pools from becoming data loss events. Use ProxmoxR alongside local monitoring to keep an eye on ZFS health across all your Proxmox nodes.
# Simple monitoring script
#!/bin/bash
HEALTH=$(zpool status -x)
if [ "$HEALTH" != "all pools are healthy" ]; then
    echo "ZFS ALERT: Pool issue detected"
    zpool status
    exit 1
fi
# Check ZFS event log for recent issues
zpool events -v | tail -30
# Monitor disk temperatures (requires hddtemp or smartctl)
smartctl -A /dev/sda | grep Temperature
smartctl -A /dev/sdb | grep Temperature
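The temperature checks can be turned into a simple threshold alert by reading the raw value from the SMART attribute row. A sketch, assuming the common smartctl -A attribute-table layout where the raw value is the last field; the sample line, threshold, and `temp_of` helper are illustrative:

```shell
#!/bin/bash
# Warn when a drive temperature exceeds a threshold. The smartctl
# attribute line is hard-coded here; in practice use:
#   smartctl -A /dev/sdX | grep -i temperature
smart_line='194 Temperature_Celsius 0x0022 112 105 000 Old_age Always - 38'
MAX_TEMP=45

temp_of() {
    # The raw value is the last field of the attribute row
    echo "$1" | awk '{ print $NF }'
}

t=$(temp_of "$smart_line")
if [ "$t" -gt "$MAX_TEMP" ]; then
    echo "WARNING: drive at ${t}C"
else
    echo "drive temperature OK (${t}C)"
fi
```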
A degraded ZFS pool is a warning, not a crisis, as long as you act on it promptly. Replace the failing disk, let the resilver complete, run a scrub, and set up monitoring so you catch the next failure before it becomes critical.
Take Proxmox management mobile
All the features discussed in this guide — accessible from your phone with ProxmoxR. Real-time monitoring, power control, firewall management, and more.