I have been running the lab for several weeks now and some of the gremlins are starting to show themselves. No platform is 100% stable, so I certainly expected to find things to fix here and there, but this one started falling apart quite quickly.
It started with one node suddenly not responding to
ssh. Rebooting got things working again, but typically, it'd stop responding within 24 hours. Attaching a keyboard and monitor to the machine, I was flooded with journald messages like this:
Searching for answers for the issue, I found this, which looked very similar. Notably, the asker mentioned the issue starting "once in a while", and its fixed with a reboot. Also notable, is the presence of an SSD drive.
The two answers are:
- Upgrade your kernel, but given this is a fresh install of Ubuntu, I'm already at the latest.
- Update the firmware of your SSD. Basically, Ubuntu does a process called "fstrim", which periodically marks unused data segments on a drive as able to be deleted. During this process, the file system gets locked, but since its the root partition, journald is trying to write entries, we see the above issues.
My apologies for anyone finding this post looking for an answer. I took the easy way out: I re-installed Ubuntu with the HDD as the root partition. The SSD is mounted, but basically unused.
One cool thing: since the Micr0k8s cluster was in HA mode, once the node was running again, I joined the cluster and the workloads just kept running. After all this, I'll be happy to start posting about real work, not trying to get some computers working!