Proxmox ZFS Pool Degraded: How to Diagnose and Replace Failed Disks
Fix a degraded ZFS pool in Proxmox VE. Covers zpool status, disk replacement, resilvering, scrub errors, CKSUM errors, and zpool clear commands.
Understanding ZFS Pool States
ZFS is a popular storage choice for Proxmox VE because of its built-in redundancy, snapshots, and data integrity verification. When a disk in a redundant ZFS pool (mirror, RAIDZ, RAIDZ2) starts failing, the pool enters a DEGRADED state. This means your data is still accessible but you have lost redundancy. Another disk failure in the same vdev could mean data loss.
A degraded pool is an urgent situation that requires prompt attention. Do not wait.
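The main pool and device states map to actions roughly as follows. This is a small illustrative helper (the function name and message wording are mine, not ZFS output); it assumes the health string comes from something like `zpool list -H -o health rpool`:

```shell
#!/bin/bash
# Map a ZFS health state (as reported by `zpool list -H -o health`)
# to a rough severity note. Helper name and messages are illustrative.
describe_state() {
    case "$1" in
        ONLINE)   echo "healthy: full redundancy" ;;
        DEGRADED) echo "urgent: redundancy lost, replace the failing disk" ;;
        FAULTED)  echo "critical: vdev unusable, pool may be down" ;;
        UNAVAIL)  echo "critical: device cannot be opened" ;;
        OFFLINE)  echo "info: device taken offline by the administrator" ;;
        *)        echo "unknown state: $1" ;;
    esac
}

describe_state DEGRADED
```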
Checking Pool Status
The zpool status command is your primary diagnostic tool. It shows the health of every disk in every pool.
# Check all pool status
zpool status
# Example output for a degraded mirror:
# pool: rpool
# state: DEGRADED
# status: One or more devices has experienced an unrecoverable error. An
# attempt was made to correct the error. Applications are unaffected.
# action: Determine if the device needs to be replaced, and clear the errors
# using 'zpool clear' or replace the device with 'zpool replace'.
# config:
#
# NAME STATE READ WRITE CKSUM
# rpool DEGRADED 0 0 0
# mirror-0 DEGRADED 0 0 0
# sda2 ONLINE 0 0 0
# sdb2 DEGRADED 0 0 4 too many errors
#
# errors: No known data errors
# Check detailed device info
zpool status -v rpool
# View pool I/O statistics
zpool iostat rpool 5
# Check SMART data for the failing drive
smartctl -a /dev/sdb
Understanding Error Columns
The three error columns in zpool status output tell you different things about what is going wrong:
- READ - Number of read errors. Indicates the disk is having trouble returning data.
- WRITE - Number of write errors. The disk failed to accept writes.
- CKSUM - Checksum errors. Data was read but did not match the expected checksum. This often indicates a failing disk, cable, or controller.
# CKSUM errors are particularly concerning because they mean
# data corruption was detected (and corrected from redundancy)
# A few CKSUM errors after a power failure may be transient
# Consistently increasing CKSUM errors mean the disk is dying
# Monitor error counts over time
watch -n 60 "zpool status rpool | grep -E 'READ|WRITE|CKSUM|DEGRADED'"
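As a sketch of automating this check, the device rows of zpool status can be parsed with awk. The sample status text below is hard-coded for illustration; in practice you would pipe in `zpool status <pool>`:

```shell
#!/bin/bash
# Flag any device row in `zpool status` output whose READ, WRITE, or
# CKSUM counter is nonzero. Sample output is hard-coded for illustration.
sample_status='  NAME        STATE     READ WRITE CKSUM
  rpool       DEGRADED     0     0     0
    mirror-0  DEGRADED     0     0     0
      sda2    ONLINE       0     0     0
      sdb2    DEGRADED     0     0     4'

flag_errors() {
    awk '$2 ~ /ONLINE|DEGRADED|FAULTED/ && ($3 > 0 || $4 > 0 || $5 > 0) {
        printf "%s: READ=%s WRITE=%s CKSUM=%s\n", $1, $3, $4, $5
    }'
}

echo "$sample_status" | flag_errors
```

Feeding it the degraded mirror example above reports only sdb2, since it is the one device with a nonzero counter.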
Replacing a Failed Disk
Once you have identified the failing disk and installed its replacement, use zpool replace to initiate the rebuild (resilvering).
# Identify the failing disk's device path
zpool status rpool
# Note the device name, e.g., sdb2
# Find the new replacement disk
lsblk
# Look for the new unpartitioned disk, e.g., sdc
# If replacing a disk in the root pool, copy the partition table from a
# healthy disk, then randomize the new disk's GUIDs
sgdisk /dev/sda -R /dev/sdc
sgdisk -G /dev/sdc
# On Proxmox root pools, also set up booting from the new disk; point
# proxmox-boot-tool at the new disk's ESP (check the number with lsblk)
proxmox-boot-tool format /dev/sdcX
proxmox-boot-tool init /dev/sdcX
# Replace the failed disk using the partition that belongs to the pool
zpool replace rpool /dev/sdb2 /dev/sdc2
# For non-root data pools with whole-disk vdevs
zpool replace tank /dev/sdb /dev/sdc
# Monitor resilvering progress
zpool status rpool
# scan: resilver in progress since Fri Mar 20 10:00:00 2026
#       1.23T scanned at 456M/s, 789G issued at 300M/s, 1.23T total
#       789G resilvered, 62.64% done, 00:26:46 to go
# Watch resilver progress in real time
watch -n 10 "zpool status rpool | grep -A 3 scan"
Do not reboot or power off the server during resilvering unless absolutely necessary. An interrupted resilver will restart from the beginning.
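If you want a rough ETA outside of what zpool prints, the scan line can be parsed and the remaining time computed from the issued size, total size, and issue rate. This is a sketch assuming the modern OpenZFS scan-line format; the sample line is hard-coded for illustration:

```shell
#!/bin/bash
# Estimate resilver time remaining from a `scan:` progress line.
# Assumes the format "<X> issued at <Y>M/s, <Z>T total"; the sample
# line is hard-coded here for illustration.
scan_line='1.23T scanned at 456M/s, 789G issued at 300M/s, 1.23T total'

to_gib() {
    # Convert a size like 789G or 1.23T to whole GiB
    case "$1" in
        *T) echo "$1" | awk '{ sub(/T/, ""); printf "%.0f", $1 * 1024 }' ;;
        *G) echo "$1" | awk '{ sub(/G/, ""); printf "%.0f", $1 }' ;;
    esac
}

issued=$(to_gib "$(echo "$scan_line" | grep -o '[0-9.]*[TG] issued' | awk '{print $1}')")
total=$(to_gib "$(echo "$scan_line" | grep -o '[0-9.]*[TG] total' | awk '{print $1}')")
rate=$(echo "$scan_line" | grep -o '[0-9]*M/s' | tail -1 | tr -d 'M/s')
# remaining GiB converted to MiB, divided by MiB/s, then to minutes
eta_min=$(( (total - issued) * 1024 / rate / 60 ))
echo "~${eta_min} minutes remaining"
```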
Running a Scrub After Replacement
After resilvering completes, run a scrub to verify all data integrity across the pool.
# Start a scrub
zpool scrub rpool
# Check scrub progress
zpool status rpool | grep scan
# Example output:
# scan: scrub in progress since Fri Mar 20 14:00:00 2026
# 500G scanned, 200G issued at 150M/s, 1.2T total
# 0B repaired, 16.67% done, estimated 2:30:00 to go
# After scrub completes:
# scan: scrub repaired 0B in 03:15:30 with 0 errors on Fri Mar 20 17:15:30 2026
# Schedule regular scrubs (recommended monthly)
# Add to crontab:
# 0 2 1 * * /sbin/zpool scrub rpool
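Note that Debian-based Proxmox installs typically already ship a monthly scrub job via the zfsutils-linux package (in /etc/cron.d/zfsutils-linux), so check before adding your own. To verify a completed scrub programmatically, the final scan line can be tested for repairs and errors. A minimal sketch; the sample line and `scrub_ok` helper are mine, and in practice you would feed in `zpool status rpool | grep 'scan:'`:

```shell
#!/bin/bash
# Check the result line of a completed scrub. The sample line is
# hard-coded for illustration.
scan_line='scan: scrub repaired 0B in 03:15:30 with 0 errors on Fri Mar 20 17:15:30 2026'

scrub_ok() {
    # Succeeds only if the scrub repaired nothing and found no errors
    echo "$1" | grep -q 'repaired 0B' && echo "$1" | grep -q 'with 0 errors'
}

if scrub_ok "$scan_line"; then
    echo "scrub clean"
else
    echo "scrub found problems - investigate with zpool status -v"
fi
```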
Using zpool offline/online
Sometimes you need to take a disk offline before physically removing it, or bring a disk back online after resolving a transient issue.
# Take a disk offline (before physical removal)
zpool offline rpool /dev/sdb2
# Bring a disk back online (after fixing cable/connection)
zpool online rpool /dev/sdb2
# If a disk has transient errors but is not actually failing,
# clear the error counters after resolving the issue
zpool clear rpool
# Clear errors for a specific device
zpool clear rpool /dev/sdb2
When zpool clear Is Appropriate
The zpool clear command resets error counters and can bring a disk back from a FAULTED state. Use it only when you have resolved the underlying problem.
# Appropriate uses of zpool clear:
# - After replacing a bad cable
# - After a power event caused transient errors
# - After fixing a SATA/SAS controller issue
# - After reseating a drive
# NOT appropriate:
# - To hide errors from a dying disk
# - To "fix" a disk without addressing the root cause
# After clearing, monitor for recurring errors
zpool clear rpool
# Then watch:
zpool status rpool
# If errors return within hours, the disk needs replacement
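One way to make that follow-up check concrete is to capture the device row before and after the clear and compare CKSUM counters. The sample rows and the `cksum_of` helper are illustrative; in practice you would save `zpool status <pool>` output to files a few hours apart:

```shell
#!/bin/bash
# Compare CKSUM counters from two `zpool status` snapshots taken some
# hours apart after `zpool clear`. Sample rows are hard-coded.
before='      sdb2    ONLINE       0     0     0'
after='      sdb2    ONLINE       0     0     7'

cksum_of() { echo "$1" | awk '{ print $5 }'; }

if [ "$(cksum_of "$after")" -gt "$(cksum_of "$before")" ]; then
    echo "CKSUM errors returned after clear - plan a disk replacement"
else
    echo "no new errors since clear"
fi
```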
Monitoring Pool Health
Proactive monitoring prevents degraded pools from becoming data loss events. Use ProxmoxR alongside local monitoring to keep an eye on ZFS health across all your Proxmox nodes.
# Simple monitoring script
#!/bin/bash
HEALTH=$(zpool status -x)
if [ "$HEALTH" != "all pools are healthy" ]; then
    echo "ZFS ALERT: Pool issue detected"
    zpool status
    exit 1
fi
# Check ZFS event log for recent issues
zpool events -v | tail -30
# Monitor disk temperatures (requires hddtemp or smartctl)
smartctl -A /dev/sda | grep Temperature
smartctl -A /dev/sdb | grep Temperature
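The temperature checks can be turned into a simple threshold alert by reading the raw value from the SMART attribute row. A sketch, assuming the common smartctl -A attribute-table layout where the raw value is the last field; the sample line, threshold, and `temp_of` helper are illustrative:

```shell
#!/bin/bash
# Warn when a drive temperature exceeds a threshold. The smartctl
# attribute line is hard-coded here; in practice use:
#   smartctl -A /dev/sdX | grep -i temperature
smart_line='194 Temperature_Celsius 0x0022 112 105 000 Old_age Always - 38'
MAX_TEMP=45

temp_of() {
    # The raw value is the last field of the attribute row
    echo "$1" | awk '{ print $NF }'
}

t=$(temp_of "$smart_line")
if [ "$t" -gt "$MAX_TEMP" ]; then
    echo "WARNING: drive at ${t}C"
else
    echo "drive temperature OK (${t}C)"
fi
```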
A degraded ZFS pool is a warning, not a crisis, as long as you act on it promptly. Replace the failing disk, let the resilver complete, run a scrub, and set up monitoring so you catch the next failure before it becomes critical.
Take Proxmox management mobile
All the features discussed in this guide — accessible from your phone with ProxmoxR. Real-time monitoring, power control, firewall management, and more.