Maintenance Procedures

This section covers routine and emergency maintenance procedures for the homelab.

Maintenance Schedule
Update Procedures
1. Operating System Updates
  1. Proxmox Updates
  2. TrueNAS Updates
2. Application Updates
Backup Procedures
1. Backup Methods
2. VM Backup Script
Troubleshooting
1. Common Issues
2. Diagnostic Commands
  1. Kubernetes
  2. System
Recovery Procedures
1. Service Recovery
2. Full System Recovery

Maintenance Schedule

The following table outlines the routine maintenance schedule for the homelab:

Task	Frequency	Duration	Impact
OS Updates	Monthly	1-2 hours	Potential service disruption
Application Updates	Weekly	30-60 minutes	Minor service disruption
Database Backups	Daily	15-30 minutes	None
Full System Backup	Weekly	2-3 hours	None
Hardware Inspection	Quarterly	1 hour	None
UPS Testing	Twice a year	30 minutes	None
Network Audit	Quarterly	1-2 hours	None

Update Procedures

Operating System Updates

Always create a snapshot or backup before performing OS updates.

Proxmox Updates

# Update package list
apt update

# Perform system upgrade
apt dist-upgrade

# Check if reboot is needed
if [ -f /var/run/reboot-required ]; then
  echo "Reboot required"
  # Plan for a maintenance window
fi
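
The `/var/run/reboot-required` flag is only written by certain packages, so it may be absent on some hosts even after a kernel upgrade. As a complementary check, the following sketch compares the running kernel with the newest kernel image installed. It assumes kernel images live in `/boot` as `vmlinuz-<version>` and that `sort -V` is available; the function names are illustrative only.

```shell
#!/bin/sh
# newest_kernel: read kernel image paths on stdin, print the highest version.
newest_kernel() {
  sed -n 's|.*/vmlinuz-||p' | sort -V | tail -n 1
}

# kernel_reboot_needed RUNNING NEWEST: succeed when a newer kernel is installed
# than the one currently running (an empty NEWEST counts as "no reboot").
kernel_reboot_needed() {
  [ -n "$2" ] && [ "$1" != "$2" ]
}

running=$(uname -r)
newest=$(ls /boot/vmlinuz-* 2>/dev/null | newest_kernel)

if kernel_reboot_needed "$running" "$newest"; then
  echo "Reboot required: running $running, newest installed $newest"
else
  echo "No kernel reboot needed"
fi
```

This only detects kernel changes. Updated libraries or services that need a restart are not covered by it.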

TrueNAS Updates

TrueNAS updates are performed through the web UI:

1. Navigate to System > Update
2. Click “Check for Updates”
3. If updates are available, click “Download Updates”
4. Once downloaded, apply the updates during a maintenance window

Application Updates

Most applications run on Kubernetes and are updated with Helm:

# Update Helm repository
helm repo update

# Upgrade a specific application
helm upgrade [release-name] [chart] --namespace [namespace] --values values.yaml

# Example for updating Traefik
helm upgrade traefik traefik/traefik --namespace traefik --values traefik-values.yaml
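
A failed upgrade can leave a release half-applied. One possible safeguard, sketched below, is to pass `--atomic` and `--timeout` so that Helm 3 rolls the release back on its own when the upgrade does not complete in time. The `upgrade_release` wrapper and the `DRY_RUN` switch are hypothetical helpers, not part of Helm; by default the script only prints the command it would run.

```shell
#!/bin/sh
# DRY_RUN=1 (default) prints the command; DRY_RUN=0 executes it.
DRY_RUN=${DRY_RUN:-1}

# upgrade_release RELEASE CHART NAMESPACE VALUES_FILE
upgrade_release() {
  release=$1
  chart=$2
  namespace=$3
  values=$4
  # --atomic rolls back automatically if the upgrade fails or times out
  set -- helm upgrade "$release" "$chart" \
    --namespace "$namespace" --values "$values" \
    --atomic --timeout 5m
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

upgrade_release traefik traefik/traefik traefik traefik-values.yaml
```

If a problem only shows up after the upgrade has succeeded, `helm history traefik --namespace traefik` lists the revisions and `helm rollback traefik [revision] --namespace traefik` returns to an earlier one.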

Backup Procedures

The homelab uses a 3-2-1 backup strategy:

3 copies of data
2 different media types
1 off-site copy

Backup Methods

Type	Tool	Schedule	Retention	Location
VM Backups	Proxmox Backup	Daily	7 daily, 4 weekly	Backup server
Config Backups	Ansible + Git	On change	Unlimited	GitHub + Local
Data Backups	Restic	Daily	7 daily, 4 weekly, 12 monthly	NAS + B2
Kubernetes Volumes	Velero	Daily	7 daily	NAS
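
As an illustration of how the Restic row could translate into commands, the sketch below runs a backup and then applies the "7 daily, 4 weekly, 12 monthly" retention with `restic forget`. The `/srv/data` path, the `run` and `backup_data` helpers, and the `DRY_RUN` switch are assumptions made for the example; the repository and password are expected to come from the `RESTIC_REPOSITORY` and `RESTIC_PASSWORD_FILE` environment variables.

```shell
#!/bin/sh
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 executes them.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# backup_data PATH: back up PATH, then expire old snapshots.
backup_data() {
  run restic backup "$1"
  # Retention from the table: 7 daily, 4 weekly, 12 monthly.
  # --prune also deletes the data that no remaining snapshot references.
  run restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --prune
}

# /srv/data is a hypothetical data path
backup_data /srv/data
```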

VM Backup Script

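#!/bin/bash
# VM backup script: dump every VM into a dated directory
set -euo pipefail

DATE=$(date +%Y-%m-%d)
BACKUP_DIR="/mnt/backup/$DATE"

# Create backup directory
mkdir -p "$BACKUP_DIR"

# Backup VMs
echo "Backing up VMs..."
FAILED=0
# Skip the header row of `qm list` and keep the VMID column
for VM_ID in $(qm list | awk 'NR > 1 {print $1}')
do
  echo "Backing up VM $VM_ID"
  # --compress needs an algorithm; --dumpdir writes into the dated directory
  # (use --storage <name> instead to target a backup-capable Proxmox storage)
  if ! vzdump "$VM_ID" --compress zstd --mode snapshot --dumpdir "$BACKUP_DIR"; then
    echo "Backup of VM $VM_ID failed" >&2
    FAILED=1
  fi
done

if [ "$FAILED" -ne 0 ]; then
  echo "Backup finished with errors" >&2
  exit 1
fi
echo "Backup completed successfully!"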
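
Dated backup directories accumulate unless something removes them. The sketch below keeps only the newest N directories named `YYYY-MM-DD` under a backup root, which would cover the "7 daily" part of the retention policy; it does not implement the weekly tier. It assumes backups land in dated directories under `/mnt/backup`, and `prune_backups` is a hypothetical helper name.

```shell
#!/bin/sh
# prune_backups ROOT KEEP: remove all but the newest KEEP dated directories.
prune_backups() {
  root=$1
  keep=$2
  # YYYY-MM-DD names sort chronologically. The strict pattern keeps
  # unrelated directories out of the delete list.
  ls -1 "$root" | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' | sort -r |
    tail -n "+$((keep + 1))" |
    while read -r dir; do
      echo "Removing $root/$dir"
      rm -rf -- "$root/$dir"
    done
}

# Example: keep the last 7 daily backups
# prune_backups /mnt/backup 7
```

The example call is left commented out because the function deletes data; it is worth trying against a scratch directory first.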

Troubleshooting

Common Issues

Issue	First Response	Next Steps
Service unreachable	Check pod status	Examine logs, restart service
High CPU/Memory	Identify consuming process	Scale resources or optimize
Disk space low	Clear logs and temporary files	Evaluate storage expansion
Network connectivity	Check physical connections	Verify switch ports, VLAN settings

Diagnostic Commands

Kubernetes

# Check pod status
kubectl get pods -n [namespace]

# View pod logs
kubectl logs -n [namespace] [pod-name]

# Describe a pod for detailed information
kubectl describe pod -n [namespace] [pod-name]

# Check node status
kubectl get nodes
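
When many namespaces are involved, it can help to narrow the pod list down to the ones that need attention. The following sketch filters the output of `kubectl get pods -A --no-headers`, where the fourth column is the pod status, and prints every pod that is neither `Running` nor `Completed`. The `unhealthy_pods` name is illustrative.

```shell
#!/bin/sh
# unhealthy_pods: read `kubectl get pods -A --no-headers` output on stdin and
# print "namespace/pod status" for pods that are not Running or Completed.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Usage:
#   kubectl get pods -A --no-headers | unhealthy_pods
```

A pod that is `Running` but not ready (for example `0/1` in the READY column) is not reported by this filter, so `kubectl describe pod` remains the next step for those.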

System

# Check system resource usage
htop

# Check disk usage
df -h

# Check network connectivity
ping -c 4 192.168.1.1

# View logs
journalctl -u [service] -f
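
For the "Disk space low" case, a small filter over `df -P` can flag filesystems before they fill up. The sketch below prints every mount point whose usage is at or above a given percentage. The 85% threshold and the `disks_over` name are assumptions, and mount points containing spaces are not handled.

```shell
#!/bin/sh
# disks_over LIMIT: read `df -P` output on stdin and print
# "mountpoint usage" for filesystems at or above LIMIT percent.
disks_over() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)          # strip the percent sign from the Capacity column
    if (use + 0 >= limit) print $6 " " $5
  }'
}

df -P | disks_over 85
```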

Recovery Procedures

Document all recovery actions taken in the incident log.

Service Recovery

1. Identify the failing service
2. Check logs for errors
3. Attempt service restart
4. If restart fails, restore from backup
5. Verify service functionality
6. Document the incident
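
The restart-then-restore logic of the steps above can be sketched as a small wrapper. The three hook functions are placeholders: the `systemctl` defaults are an assumption for a systemd-managed service, and a Kubernetes workload would use `kubectl rollout restart` and `kubectl rollout status` instead. The restore hook is deliberately left unimplemented because it depends on the backup tool in use.

```shell
#!/bin/sh
# Hooks: replace these with whatever fits the service being recovered.
restart_service() { systemctl restart "$1"; }
verify_service()  { systemctl is-active --quiet "$1"; }
restore_service() { echo "no restore procedure defined for $1" >&2; return 1; }

# recover_service NAME: try a restart first, fall back to a restore,
# and verify the service after each attempt.
recover_service() {
  service=$1
  if restart_service "$service" && verify_service "$service"; then
    echo "$service: recovered by restart"
  elif restore_service "$service" && verify_service "$service"; then
    echo "$service: recovered by restore from backup"
  else
    echo "$service: recovery failed, escalate" >&2
    return 1
  fi
}

# Usage:
#   recover_service [service]
```

The line the function prints is a convenient starting point for the incident log entry.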

Full System Recovery

In case of catastrophic failure:

1. Boot from recovery media
2. Restore system from backup
3. Verify hardware functionality
4. Restore VMs and containers
5. Verify network connectivity
6. Test critical services
7. Document the recovery process
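
For the VM restore step, the sketch below walks a directory of `vzdump` archives and prints one `qmrestore` command per VM, so the plan can be reviewed before anything is restored. It assumes the default archive naming, `vzdump-qemu-<vmid>-<date>-<time>.vma[.zst|.gz|.lzo]`, and restores onto the `local-zfs` storage; both function names are hypothetical.

```shell
#!/bin/sh
# vmid_from_archive PATH: extract the VM ID from a vzdump archive name.
vmid_from_archive() {
  basename "$1" | sed -n 's/^vzdump-qemu-\([0-9][0-9]*\)-.*/\1/p'
}

# print_restore_commands DIR STORAGE: print a qmrestore command for each
# VM archive in DIR, skipping log and notes files.
print_restore_commands() {
  for archive in "$1"/vzdump-qemu-*; do
    case "$archive" in
      *.vma|*.vma.zst|*.vma.gz|*.vma.lzo) ;;
      *) continue ;;
    esac
    vmid=$(vmid_from_archive "$archive")
    if [ -n "$vmid" ]; then
      echo "qmrestore $archive $vmid --storage $2"
    fi
  done
}

# Usage:
#   print_restore_commands /mnt/backup/[date] local-zfs
```

If a directory holds several archives for the same VM, the function prints a command for each, so the list should be trimmed to the wanted restore point before running it.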