Need assistance solving unexpected and random I/O errors in homelab

tintedundercollar12@sopuli.xyz · edit-2 2 months ago

Need assistance solving unexpected and random I/O errors in homelab

empireOfLove2@lemmy.dbzer0.com · 2 months ago

My default goto with any stability issue is to first force a new drive self test

smartctl -l selftest /dev/nvme0

And then I would also run a complete extended memory test (memtest86) to ensure bad ram isn’t doing something dumb like corrupting the part of the kernel that handles disk IO. The number of times I’ve had unsolvable issues that traced to an unstable stick of memory is… Surprisingly high.

If the memtest passes try fsck’ing nvme0, if there are corrupted blocks yeah it’s possible the SSD is dying but the controller isn’t reporting it.

Andres@social.ridetrans.it · 2 months ago

@empireOfLove2 @tintedundercollar12 Yeah, that absolutely looks like a hardware issue. Memtest is a good idea, but also reseat the nvme and keep an eye out for overheating (eg, ssh’ing in and keeping the following running in a terminal:
while (sleep 5); do sudo smartctl -a /dev/nvme0|grep ‘Temperature:’; done
). Components on the drive could be failing early when temperatures get high, but not high enough to trigger warning thresholds.