Leinninger.com

June 21st, 2022:
Homelab Outage and Recovery

Last Wednesday my AC stopped working because of the load buffer module (SMM) that delays turning the AC on when the generator has been activated. Normally, it keeps the AC from switching on immediately and drawing too much juice before the generator has fully started. In this case, it failed and defaulted to “open,” meaning no power to the AC. Hopefully, Generac will extend warranty coverage because of the failed units and replace it with an SMM that defaults to “closed” (so the AC won’t be unavailable if it fails again). I called the generator service company and they scheduled a repairman to visit the next day.


That repair was supposed to take “5 – 10 minutes.” The UPS on my homelab server rack is good for 15 minutes or so. Around 12 minutes into the repair, it started beeping like crazy and my whole lab shut off without a clean shutdown. (That’s the ESXi VM host, which runs the media server, automation server, web server and DNS/Pi-hole VMs, plus the SAS storage array.) Servers don’t like it when that happens. The host wouldn’t POST when I tried to boot it. I really, really should have shut everything down in advance… but I wanted to save my 600-plus days of uptime on the VM host!
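The “should have shut everything down in advance” decision can be reduced to a simple runtime check. The following is only a minimal sketch under stated assumptions: the `UpsStatus` type, the `should_start_shutdown` function, and the 5-minute shutdown duration and 2-minute safety margin are all hypothetical illustration values, not measurements from this lab. It does not talk to a real UPS or to ESXi; it only shows the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class UpsStatus:
    """Hypothetical snapshot of a UPS reading (not tied to any real UPS API)."""
    on_battery: bool
    runtime_remaining_s: float  # estimated seconds of battery left


def should_start_shutdown(
    status: UpsStatus,
    shutdown_duration_s: float = 300.0,  # assumed time to stop VMs and host
    safety_margin_s: float = 120.0,      # assumed extra buffer
) -> bool:
    """Return True once the remaining runtime no longer covers a clean shutdown."""
    if not status.on_battery:
        # On mains power there is nothing to do.
        return False
    return status.runtime_remaining_s <= shutdown_duration_s + safety_margin_s


if __name__ == "__main__":
    # Roughly 15 minutes of total runtime, as described above.
    total_runtime_s = 15 * 60
    for minutes_on_battery in (2, 8, 12):
        remaining = total_runtime_s - minutes_on_battery * 60
        status = UpsStatus(on_battery=True, runtime_remaining_s=remaining)
        print(minutes_on_battery, should_start_shutdown(status))
```

With these assumed values, a clean shutdown would begin once about 7 minutes of runtime remained, well before the 12-minute mark where the rack went dark.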

After I got home from a weekend of racing, I spent several hours getting it back up and running by pulling hardware, re-installing VMware to new SD cards (it boots from internal SD cards), running diagnostics on the RAID arrays, etc. Luckily I had the DNS and Web Server VMs backed up to another server and powered that on to cover me for the weekend… but I thought I was screwed. (None of that effort contributed to the solution.)

The fix was to run the Lifecycle Controller… which is a Dell EMC server feature that walks you through server setup and maintenance. I don’t think I had used it since I set up the server! It walked through the automated firmware upgrade and reinstalled the BIOS, etc. That unexpected process allowed the server to POST. I just had to mark all of the disks in my RAID arrays as “good, offline” and re-scan the foreign configuration, and I was back online.

Lessons learned:

  1. Backups are important! I have the backup server which allowed me to get up and running on Thursday so I could keep working from home. But, I had no home automation or media server… they rely on hardware that doesn’t work in the spare server. If all else failed, I also have online “cold storage” backups for everything, but the recovery time for those can be weeks or months.
  2. Don’t get too dependent on home automation. Two of my automated light switches have gone bad (Insteon units which I need to replace and re-configure with Z-Wave). I was able to control them with Amazon Echo commands though… but only while my server was online. Also, our automated cat feeder is dependent on the automation server. We were gone for a weekend and had to leave bowls of food out for the cats. I bet it was gone in the first 24 hours.
  3. My daughter’s birthday was on Sunday (6 years old!). While at the racetrack, she made up her mind that she wanted to watch DVD rips of the Gummi Bears when she got home. When we got home on Sunday afternoon, her world fell apart because I hadn’t gotten the server back online yet. I’m working on setting her expectations that the world of entertainment offered by the server is a convenience and isn’t always available when she has screen time.
  4. Practice makes perfect. I haven’t had to perform any significant maintenance on my ESXi host for almost 2 years. In the future, I’ll practice more frequent roll-overs to my backup host and add a USB controller that matches the one in the primary host for automation interfaces. Then, I’ll only be without a media server if this happens again (see #3 above).

Like any other hobby, running a homelab can be a lot of work. I’ve become VERY dependent on the main VMs that run on my host. Most of the experimental hosts could be lost and I could easily start over with them to play with Docker and K8s, etc. I’ll have to make the routine maintenance part of the “fun” of my homelab, too.
