edit: you are right, it’s the I/O WAIT that is destroying my performance:
%Cpu(s): 0,3 us, 0,5 sy, 0,0 ni, 50,1 id, 49,0 wa, 0,0 hi, 0,1 si, 0,0 st
I could clearly see it using nmon > d > l > -, as suggested by @SayCyberOnceMore. Not quite sure what to do about it, as it’s simply my sdb1 drive, which is a Samsung 1TB 2.5" HDD. I have now ordered a 2TB SSD and maybe I am going to reinstall from scratch on that new drive as sda1. I realize that’s just treating the symptom and not the root cause, so I should probably also look for that root cause. But that’s for another Lemmy thread!
I really don’t understand what is causing this. I run a few very small containers, and everything is fine - but when I start something bigger like Photoprism, Immich, or even MariaDB or PostgreSQL, then something causes the CPU load to rise indefinitely.
Notably, the top command doesn’t show anything special: nothing eats RAM, nothing uses 100% CPU. And yet, the load is rising fast. If I leave it be, my ssh session loses connection. Hopping onto the host itself shows a load of over 50, or even over 70. I don’t grok how a system can even get that high at all.
My server is an older Intel i7 with 16GB RAM running Ubuntu 22.04 LTS.
How can I troubleshoot this, when ‘top’ doesn’t show any culprit and it does not seem to be caused by any one specific container?
(this makes me wonder how people can run anything at all off of a Raspberry Pi. My machine isn’t “beefy” but a Pi would be so much less.)
Run top and paste the top portion of the output.
I would suspect it is IO wait. You can get into disk contention if you have multiple containers fighting for disk. You will notice the IO queue is building up, and that shows you are waiting for IO transactions.
%Cpu(s): 67.4 us, 13.0 sy, 0.0 ni, 19.4 id, 0.2 wa, 0.0 hi, 0.0 si, 0.0 st
See the field labeled WA, that is wait time. Basically time you are waiting for IO to complete.
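To watch the wa value over time rather than eyeballing it, the %Cpu(s) summary line can be parsed. This is only a rough sketch: `parse_top_cpu_line` is a hypothetical helper name, and it assumes the line has the shape shown in this thread, with either a dot or a comma as the decimal separator depending on locale.

```python
import re


def parse_top_cpu_line(line: str) -> dict[str, float]:
    """Parse a top '%Cpu(s):' summary line into {field_name: percent}."""
    # Everything after the first colon is a list like '49,0 wa, 0,0 hi, ...'.
    _, _, rest = line.partition(":")
    fields = {}
    # Each entry is a number followed by a two-letter field name.
    # Some locales print '49,0' instead of '49.0', so accept both.
    for value, name in re.findall(r"(\d+[.,]\d+)\s+([a-z]{2})", rest):
        fields[name] = float(value.replace(",", "."))
    return fields
```

For the line `%Cpu(s): 0,3 us, 0,5 sy, 0,0 ni, 50,1 id, 49,0 wa, 0,0 hi, 0,1 si, 0,0 st`, looking up `"wa"` in the result gives 49.0.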
If that is high, you can increase the cache used by Linux, BUT if the system crashes you are at risk of losing writes that have not yet been flushed to disk.
Many people aren’t running containers on a Raspberry Pi … while feasible, it was notoriously poor until the 8GB Pi 4, and it is still easily bounded by SD card I/O. Are there docker stats so you can see the disk + net I/O of each container?
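A minimal sketch of pulling per-container disk and network I/O out of `docker stats`. It assumes the docker CLI is on the PATH and that the JSON output of `--format "{{json .}}"` carries `Name`, `BlockIO` and `NetIO` keys, which is how I understand it to look; check against your Docker version. The function names are made up for illustration.

```python
import json
import subprocess


def parse_docker_stats(output: str) -> list[dict]:
    """Turn line-delimited JSON from `docker stats` into small dicts."""
    rows = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        rows.append({
            "name": entry.get("Name"),
            "block_io": entry.get("BlockIO"),  # e.g. '1.2GB / 300MB' (read / written)
            "net_io": entry.get("NetIO"),
        })
    return rows


def container_io() -> list[dict]:
    """Take one snapshot of all running containers (requires docker)."""
    result = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{json .}}"],
        capture_output=True, text=True, check=True,
    )
    return parse_docker_stats(result.stdout)
```

Calling `container_io()` a few minutes apart and comparing the `block_io` values should hint at which container is hammering the disk.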
It sounds like it could be an IO wait issue; system load will climb a ton without showing much CPU usage.
Make sure you’re not running out of RAM and going into swap space, it doesn’t sound like it though.
iotop might show something useful. And in htop you can add the "PERCENT_IO_DELAY" column, which can be useful.
Yep. IO.
OP, this might be overkill for you but it might be worth standing up a grafana/prometheus stack… You’d be able to see this stuff a lot faster and potentially narrow in on a root cause.
That is definitely an interesting idea! Much, much better than the stupid dashdot container I am running now :-D
Immich and photoprism do AI detection and sorting, right? Until they have scanned through all current photos you are going to have a lot of system load. And my 4 gig pi usually just runs at 1 gig of memory and low load…no Immich, but running Openmediavault for DAAP and SAMBA, and DejaDup and Syncthing, homeassistant, CUPS, etc.
I ran immich on a server that’s substantially faster than a raspberry pi, and after about a day and a half it keeps me from getting in with ssh. Even locally, I have to wait for a long time to get a login prompt.
“load” is not “CPU usage.” It’s “system usage” and includes disk and network activity. Including swapping if you’re low on memory.
vmstat can tell you what your disk io looks like. iotop can help with narrowing it down to a process.
It’s a bit more complicated than that. System load is a count of how many processes are in an R state (either "R"unning or "R"eady). If a process does disk I/O or accesses the network, that is not counted towards load, because as soon as it makes a system call, it’s now in an S (or D) state instead of an R state.
But disk I/O does affect it, which makes it a bit tricky. You mentioned swapping. Swapping’s partner in crime, memory-mapped files, also contribute. In both of those cases, a process tries to access memory (without making a system call) that the kernel needs to do work to resolve, so the process stays in an R state.
I can’t think of a common situation where network activity could contribute to load, though. If your swap device is mounted over NFS maybe?
Anyway, generally load is measuring CPU usage, but if you have high disk usage elsewhere (which is not counted directly) and are under high memory pressure, that can contribute to load. If you’re seeing a high load with low CPU utilization, that’s almost always due to high memory pressure, which can cause both swapping and filesystem cache drops.
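One way to see which process states are actually piling up while the load climbs is to count them straight from /proc. A rough sketch for Linux only; the helper names are made up, and it counts processes rather than individual threads, so the totals will not match the load figure exactly. For what it’s worth, the proc(5) man page describes the /proc/loadavg numbers as jobs in the run queue (state R) or waiting for disk I/O (state D), so a large D count during a load spike points at storage.

```python
from collections import Counter
from pathlib import Path


def state_from_stat(stat_text: str) -> str:
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split after the LAST ')' and take the next field.
    after_comm = stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]


def count_states(proc_root: str = "/proc") -> Counter:
    """Count processes by state letter (R, S, D, ...)."""
    counts = Counter()
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a PID directory
        try:
            counts[state_from_stat((entry / "stat").read_text())] += 1
        except (OSError, IndexError):
            continue  # process exited or stat was unreadable mid-scan
    return counts
```

Printing `dict(count_states())` next to the contents of `/proc/loadavg` every few seconds shows whether the rising load tracks R or D.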
Using network storage to store your swapfile is one of the… um, more interesting ideas I’ve heard today
Adding to list of things to try when one’s network only knows 100Gb
Install nmon - it’s a CLI tool to show system load. Run it and press ‘d’ to show disk usage, then ‘l’ to show a longterm graph, then ‘-’ to speed it up. If your storage is the issue you’ll see it here - and potentially which drive(s) are affected.
I’d try each application one by one. Maybe write a script to monitor load and stop the program if it goes past your desired threshold and notify you.
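The monitoring script idea could look roughly like this. It is a sketch under assumptions: the workload is a Docker container stopped with `docker stop`, the threshold is something you pick yourself, and "notify" is just `print` unless you pass in something better. `watch_load` and its parameters are hypothetical names.

```python
import os
import subprocess
import time


def watch_load(container, threshold, interval=30.0, get_load=None,
               stop=None, notify=print, sleep=time.sleep, max_checks=None):
    """Poll the 1-minute load average and stop `container` once it
    exceeds `threshold`. Returns the load that triggered the stop,
    or None if `max_checks` polls passed without exceeding it."""
    # Defaults: read the real load average and call the docker CLI.
    get_load = get_load or (lambda: os.getloadavg()[0])
    stop = stop or (lambda name: subprocess.run(["docker", "stop", name], check=False))
    checks = 0
    while max_checks is None or checks < max_checks:
        load = get_load()
        if load > threshold:
            stop(container)
            notify(f"load {load:.1f} exceeded {threshold}; stopped {container}")
            return load
        checks += 1
        sleep(interval)
    return None
```

Something like `watch_load("immich_server", threshold=8.0)` would then run until the load crosses 8, where `immich_server` stands in for whatever container is being tested. With storage this slow, `docker stop` itself may take a while to complete once the system is already bogged down, so a lower threshold is probably safer.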
It could also be a setting in some app like photoprism or immich … I think one of them uses tensorflow to classify images. That would increase the load if that’s running in the background.
Maybe try them with an empty directory so there is no data to process and see if you encounter the error. Then add some data and see how the load is.