Should I change my homelab approach?

2024-01-11

I have been running the setup I described in August and, while I have been fairly happy with it, I don't think I got everything 100% right: I have a stateful VM that doesn't really fit with the rest of my model. I was aware of the issue from the beginning but decided to leave it as something to look into later, and I guess "later" has arrived. Let's review all the layers of my stack: what works well and what could be improved.

Application level

Overall my homelab has been very stable. All the changes I made were to the running Docker containers, DNS and other pieces of configuration, and I made them all via Terraform, which managed these resources without trouble.

Interacting with third-party systems like DNS via an API, or managing resources that can easily be destroyed and recreated like containers, is where Terraform shines, so it is no surprise it worked so well for me. I'm also still happy with the choice of Docker Swarm instead of a flavour of Kubernetes.

Hypervisor

At the bottom of the stack I have Proxmox Virtual Environment. Ideally I am not going to touch it at all except for upgrades, and in fact the only change I made to it was an upgrade to version 8.1. I did it manually and I'm not too concerned about that. After all, I have a single node which has almost no customisation, so this should always be a straightforward process.

Virtual machines

This is the part I think I haven't really got right. Between the hypervisor and the containers I have a VM, managed via Terraform, that acts as the Docker Swarm node.

The problem with my setup is related to keeping it up to date. In Terraform I am specifying the image to be debian-12-genericcloud-amd64-20230612-1409.img which is not super old but eventually I will need to update it.
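To put a number on "not super old", here is a minimal sketch that reads the date out of the image file name and compares it with the date of this post. It assumes the eight digits before the final numeric suffix are a build date in YYYYMMDD form; the helper name image_build_date is made up for illustration.

```python
import re
from datetime import date, datetime


def image_build_date(image_name: str) -> date:
    """Extract the YYYYMMDD stamp from a name like '...-20230612-1409.img'.

    Assumption: the eight digits are the image build date.
    """
    match = re.search(r"-(\d{8})-\d+\.", image_name)
    if match is None:
        raise ValueError(f"no date stamp found in {image_name!r}")
    return datetime.strptime(match.group(1), "%Y%m%d").date()


image = "debian-12-genericcloud-amd64-20230612-1409.img"
post_date = date(2024, 1, 11)  # the date of this post

age_days = (post_date - image_build_date(image)).days
print(f"{image} is {age_days} days old")  # 213 days, roughly seven months
```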

An option would be to manually run the upgrade by ssh-ing into the VM. That would work but then the state of the VM would drift from the Terraform definition. Ouch!

Another option would be to update the image in the configuration file and let Terraform destroy the VM, create a new one and then redeploy the services. This would work, if a bit brutally, for stateless services. Unfortunately I also have stateful containers, so I need to find a way to keep their volumes around. I have backups, but a system upgrade shouldn't involve restoring all the data from backup.

A third option would be something along the lines of:

add a new VM with an updated image and have it join the Docker Swarm;
drain the node running on the existing VM;
once the migration is completed, remove the node and destroy the VM.
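The steps above can be sketched as a dry-run plan. This is only an outline under stated assumptions, not something I have run: the node names and manager address are placeholders, <TOKEN> stands for the output of the first command, and the function rolling_replace_plan is hypothetical. Since the existing VM is the only manager, the sketch also promotes the new node before draining and demotes the old one before removing it. It only prints commands and does nothing about moving volumes.

```python
from typing import List, Tuple

# (host the command runs on, command as an argv list)
Step = Tuple[str, List[str]]


def rolling_replace_plan(old_node: str, new_node: str, manager_addr: str) -> List[Step]:
    """Return the ordered Docker CLI commands to swap one Swarm node for another.

    Nothing is executed here; the caller decides how to run each step.
    """
    return [
        # 1. Add the new VM to the Swarm and make it a manager.
        (old_node, ["docker", "swarm", "join-token", "-q", "worker"]),
        (new_node, ["docker", "swarm", "join", "--token", "<TOKEN>",
                    f"{manager_addr}:2377"]),
        (old_node, ["docker", "node", "promote", new_node]),
        # 2. Drain the old node so its tasks are rescheduled elsewhere.
        (old_node, ["docker", "node", "update", "--availability", "drain", old_node]),
        # 3. Once tasks have moved, retire the old node.
        (new_node, ["docker", "node", "demote", old_node]),
        (old_node, ["docker", "swarm", "leave"]),
        (new_node, ["docker", "node", "rm", old_node]),
    ]


# Placeholder names and a documentation-range IP address.
for host, argv in rolling_replace_plan("swarm-old", "swarm-new", "192.0.2.10"):
    print(f"[{host}] {' '.join(argv)}")
```

Creating the new VM and destroying the old one would still be Terraform's job, before the first step and after the last one.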

I'm pretty sure that to get this to work I would need, at the very least, to decouple the state from the VM, likely adding a NAS to the mix so that the volumes no longer live on the disk of the VM I am going to destroy. This also hints that, while a single-node Docker Swarm works, I am probably starting to reach its limits.

What next?

It feels like I have reached the point where I need to decide how I see my VMs.

One option is to continue with the cloud-native model I have so far: add a NAS or some other system to hold all the state and treat all the VMs as cattle. I would probably also add at least one more node to the Docker Swarm, to have some wiggle room to take a node offline and have a freshly created VM join the Swarm.

Another option is to decide I want pet VMs and dial my use of Terraform back a bit. If I go this route, I could still have infrastructure as code (maybe via NixOS), but that would be a lot of things to learn and change.

Back to the drawing board.

Thanks for reading. Feel free to reach out with any comments or questions.