Improve docs (2bf39555) · Commits · OOP / deploy / ansible

		# Production-Ready Deployment

!!! warning "Work in Progress"
    This guide is being developed. For production deployments, consult the i2CAT team.

		## Overview

		While the Quick Start and Single OOP deployments are excellent for development and testing, production environments require additional considerations for security, reliability, and scalability.

		## Key Differences from Development

| Aspect | Development | Production |
|--------|-------------|------------|
| Kubernetes | Kind (local) | Managed K8s or production cluster |
| Passwords | Defaults | Strong, unique passwords |
| TLS/SSL | HTTP | HTTPS with valid certificates |
| Storage | Ephemeral | Persistent volumes with backups |
| Monitoring | Basic Prometheus | Full observability stack |
| High Availability | Single node | Multi-node with redundancy |
| Security | Minimal | Hardened, with RBAC and policies |

		## Prerequisites

		Before deploying to production:

		- [ ] Production Kubernetes cluster (not Kind)
		- [ ] Persistent storage solution
		- [ ] SSL/TLS certificates
		- [ ] Secret management system (Vault, etc.)
		- [ ] Monitoring and alerting infrastructure
		- [ ] Backup and disaster recovery plan
		- [ ] Security policies and RBAC configuration
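
The first prerequisite, a cluster that is not Kind, can be checked before any playbook runs. The following is a minimal sketch, assuming Kind's default naming convention where kubeconfig contexts are called `kind-<cluster-name>`; the function names `is_kind_context` and `require_production_context` are hypothetical and not part of the playbooks.

```bash
#!/usr/bin/env bash
# Returns 0 (true) when the given kubeconfig context name looks like a Kind cluster.
# Assumption: Kind's default context naming, "kind-<cluster-name>".
is_kind_context() {
  case "$1" in
    kind-*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Refuses to continue when the context passed in looks like a Kind cluster.
require_production_context() {
  if is_kind_context "$1"; then
    echo "refusing to deploy: context '$1' looks like a Kind cluster" >&2
    return 1
  fi
  echo "context '$1' accepted"
}
```

A typical call would be `require_production_context "$(kubectl config current-context)"`. A renamed Kind context would slip through, so treat this as a guard against accidents rather than a guarantee.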

		## Production Architecture

```
                     ┌─────────────────┐
                     │  Load Balancer  │
                     │    (Ingress)    │
                     └────────┬────────┘
                              │
                 ┌────────────┴────────────┐
                 │                         │
           ┌─────▼─────┐            ┌──────▼──────┐
           │  Master   │            │   Master    │
           │   Node    │            │    Node     │
           │   (HA)    │            │    (HA)     │
           └───────────┘            └─────────────┘
                 │                         │
     ┌───────────┴─────────────────────────┴───────────┐
     │                                                 │
┌────▼─────┐  ┌────────────┐   ┌────────────┐  ┌───────▼────┐
│  Worker  │  │   Worker   │   │   Worker   │  │   Worker   │
│   Node   │  │    Node    │   │    Node    │  │    Node    │
└──────────┘  └────────────┘   └────────────┘  └────────────┘
     │              │                │               │
     └──────────────┴────────────────┴───────────────┘
                             │
                     ┌───────▼────────┐
                     │   Persistent   │
                     │    Storage     │
                     │   (NFS/Ceph)   │
                     └────────────────┘
```

		## Production Checklist

		### Infrastructure

- [ ] Use production-grade Kubernetes cluster (not Kind)
    - Managed: EKS, GKE, AKS
    - Self-managed: kubeadm with HA setup
		- [ ] Multi-node cluster (minimum 3 master, 3 worker nodes)
		- [ ] Persistent storage configured (NFS, Ceph, cloud storage)
		- [ ] Network policies implemented
		- [ ] Load balancer configured
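
The node-count item above can be verified from the output of `kubectl get nodes`. The sketch below assumes the default column layout (`NAME STATUS ROLES AGE VERSION`) and treats every node whose `ROLES` column lacks `control-plane` or `master` as a worker; the function name `check_node_counts` is hypothetical.

```bash
#!/usr/bin/env bash
# Reads `kubectl get nodes --no-headers` output on stdin and checks the
# checklist minimums (default: 3 control-plane and 3 worker nodes).
# Only nodes whose STATUS is exactly "Ready" are counted.
# Usage: kubectl get nodes --no-headers | check_node_counts [min_cp] [min_workers]
check_node_counts() {
  awk -v min_cp="${1:-3}" -v min_w="${2:-3}" '
    $2 != "Ready" { next }                       # skip NotReady or cordoned nodes
    $3 ~ /control-plane|master/ { cp++; next }   # control-plane ("master") node
    { w++ }                                      # anything else counts as a worker
    END {
      printf "control-plane=%d workers=%d\n", cp, w
      exit (cp >= min_cp && w >= min_w) ? 0 : 1
    }'
}
```

On managed services such as EKS, GKE, or AKS the control plane is usually not listed by `kubectl get nodes`, so a call like `check_node_counts 0 3` is more appropriate there.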

		### Security

		- [ ] All default passwords changed
		- [ ] SSL/TLS certificates installed
		- [ ] RBAC policies configured
		- [ ] Network policies enforced
- [ ] Pod Security Standards enforced (PodSecurityPolicy was removed in Kubernetes 1.25)
		- [ ] Secrets encrypted at rest
		- [ ] Service mesh for mTLS (optional but recommended)
		- [ ] Regular security scanning
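
For the first item, replacing default passwords, one possible way to produce strong random values is sketched below. It assumes a Unix-like host with `/dev/urandom`; the function name `generate_password` is hypothetical.

```bash
#!/usr/bin/env bash
# Prints a random alphanumeric password of the requested length (default 32),
# drawn from the kernel's random source.
generate_password() {
  local length="${1:-32}"
  LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "$length"
  echo
}
```

The generated value would then be stored in an encrypted secrets file (for example as `vault_harbor_password`) rather than committed in plain text.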

		### Monitoring & Logging

		- [ ] Prometheus configured with persistent storage
		- [ ] Grafana dashboards customized
		- [ ] Alertmanager rules configured
		- [ ] Log aggregation (ELK, Loki, etc.)
		- [ ] Distributed tracing (Jaeger, Zipkin)
		- [ ] Uptime monitoring

		### Backup & Recovery

		- [ ] Automated backup of persistent volumes
		- [ ] Backup of Kubernetes resources (Velero)
		- [ ] Backup of Harbor registry
		- [ ] Disaster recovery plan documented
		- [ ] Regular backup testing

		### Scalability

		- [ ] Horizontal Pod Autoscaling configured
		- [ ] Cluster autoscaling enabled
		- [ ] Resource limits and requests defined
		- [ ] Load testing performed
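
As an illustration of the autoscaling item, the sketch below prints a `HorizontalPodAutoscaler` manifest that scales on CPU utilization. It assumes the workload is a Deployment in the `operator-platform` namespace; the function name `render_hpa` and the deployment name used in the usage note are hypothetical.

```bash
#!/usr/bin/env bash
# Prints a HorizontalPodAutoscaler manifest (autoscaling/v2) for a Deployment.
# Usage: render_hpa <deployment> <min-replicas> <max-replicas> <target-cpu-percent>
# Assumption: the Deployment lives in the operator-platform namespace.
render_hpa() {
  local name="$1" min="$2" max="$3" cpu="$4"
  if [ "$min" -gt "$max" ]; then
    echo "min replicas ($min) exceeds max replicas ($max)" >&2
    return 1
  fi
  cat <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ${name}
  namespace: operator-platform
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ${name}
  minReplicas: ${min}
  maxReplicas: ${max}
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: ${cpu}
EOF
}
```

For example, `render_hpa example-api 3 10 70 | kubectl apply -f -` would apply the manifest. Utilization-based scaling only works when the containers declare CPU requests and a metrics source such as metrics-server is installed, which is why resource requests appear in the same checklist.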

		## Deployment Steps

		### 1. Prepare Production Cluster

		Ensure you have a production Kubernetes cluster ready. For managed Kubernetes:

		AWS EKS:
		```bash
		eksctl create cluster --name op-production --region us-west-2 --nodes 3
		```

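Google GKE (with `--region`, `--num-nodes` is the node count per zone, so this creates 3 nodes in each zone of the region):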
		```bash
		gcloud container clusters create op-production --num-nodes=3 --region=us-central1
		```

		Azure AKS:
		```bash
		az aks create --resource-group myResourceGroup --name op-production --node-count 3
		```

		### 2. Configure Persistent Storage

		Set up a storage class for persistent volumes:

```yaml
# storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: op-storage
# AWS EBS CSI driver (the in-tree kubernetes.io/aws-ebs provisioner is deprecated);
# use the provisioner that matches your platform
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
```

		### 3. Configure SSL/TLS

		Obtain certificates and create Kubernetes secrets:

```bash
kubectl create secret tls op-tls-cert \
  --cert=path/to/cert.crt \
  --key=path/to/cert.key \
  -n operator-platform
```
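
Before creating the secret, it can help to confirm that the certificate is not about to expire. A minimal sketch, assuming the `openssl` command-line tool is installed and the certificate is PEM-encoded; the function name `cert_valid_for_days` is hypothetical.

```bash
#!/usr/bin/env bash
# Succeeds when the PEM certificate stays valid for at least <days> more days.
# Usage: cert_valid_for_days <cert-file> <days>
cert_valid_for_days() {
  local cert="$1" days="$2"
  # -checkend N exits 0 if the certificate does not expire within N seconds.
  openssl x509 -in "$cert" -noout -checkend "$(( days * 86400 ))" > /dev/null
}
```

For example, `cert_valid_for_days path/to/cert.crt 30 || echo "certificate expires within 30 days"` flags a certificate that needs renewal soon.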

		### 4. Customize Group Variables

		Create production-specific variables:

		```yaml
		# inventory/production/group_vars/all.yml
		harbor_admin_password: "{{ vault_harbor_password }}"
		enable_tls: true
		storage_class: "op-storage"
		replica_count: 3 # for HA
		```

		### 5. Deploy Using Existing Cluster Playbooks

```bash
ansible-playbook playbooks/existing-cluster/deploy.yml \
  -i inventory/production \
  -e @production-secrets.yml \
  --vault-password-file ~/.vault_pass
```
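
Because the command above reads `production-secrets.yml` with a vault password, that file is expected to be encrypted with Ansible Vault. The sketch below checks for the header that `ansible-vault encrypt` writes at the top of an encrypted file; the function name `is_vault_encrypted` is hypothetical.

```bash
#!/usr/bin/env bash
# Succeeds when the file starts with the Ansible Vault header, i.e. the whole
# file has been encrypted with `ansible-vault encrypt`.
is_vault_encrypted() {
  local first_line=""
  IFS= read -r first_line < "$1" || true
  case "$first_line" in
    '$ANSIBLE_VAULT;'*) return 0 ;;
    *)                  return 1 ;;
  esac
}
```

A pre-commit hook or CI job could run `is_vault_encrypted production-secrets.yml || exit 1` to keep plain-text secrets out of version control. The check covers whole-file encryption only; values encrypted inline with the `!vault` tag are not detected.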

		## Post-Deployment

		### Verify Deployment

```bash
# Check all pods are running
kubectl get pods -A

# Verify persistent volumes
kubectl get pv

# Check ingress
kubectl get ingress -A

# Test service endpoints (certificate verification stays on;
# add -k only while testing with a self-signed certificate)
curl --fail https://your-domain.com/healthz
```
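
Scanning the pod list by eye gets error-prone on larger clusters. The sketch below filters the output of `kubectl get pods -A --no-headers`, assuming the default column layout (`NAMESPACE NAME READY STATUS ...`); the function name `list_unhealthy_pods` is hypothetical.

```bash
#!/usr/bin/env bash
# Reads `kubectl get pods -A --no-headers` output on stdin and prints
# "namespace/pod ready status" for every pod that is neither Completed nor
# Running with all containers ready. Exits non-zero if any such pod is found.
list_unhealthy_pods() {
  awk '
    {
      split($3, ready, "/")   # READY column, e.g. "1/2"
      if (($4 != "Running" && $4 != "Completed") ||
          ($4 == "Running" && ready[1] != ready[2])) {
        print $1 "/" $2, $3, $4
        bad = 1
      }
    }
    END { exit bad ? 1 : 0 }'
}
```

Usage: `kubectl get pods -A --no-headers | list_unhealthy_pods`. An empty output with exit status 0 means every pod is either completed or fully ready.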

		### Configure Monitoring

		Set up alerting rules in Prometheus:

```yaml
# prometheus-alerts.yaml
groups:
  - name: operator-platform
    rules:
      - alert: PodDown
        expr: up{job="kubernetes-pods"} == 0
        for: 5m
        annotations:
          summary: "Pod {{ $labels.pod }} is down"
```

		### Set Up Backups

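Configure Velero for Kubernetes backups. With the AWS provider, `velero install` also needs the `--plugins` option (the Velero AWS plugin image) and either `--secret-file` or `--no-secret`; take the values from the Velero documentation for your version.

```bash
velero install \
  --provider aws \
  --bucket op-backups \
  --backup-location-config region=us-west-2

# Run a backup every day at 02:00
velero schedule create daily-backup --schedule="0 2 * * *"
```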

		## Maintenance

		### Regular Tasks

		- Weekly: Review logs and metrics
		- Monthly: Update components, security patches
		- Quarterly: Disaster recovery testing
		- Annually: Review and update security policies

		### Updating Components

		```bash
		# Update Harbor
		ansible-playbook playbooks/tools/harbor/upgrade.yml -i inventory/production

		# Update Kubernetes
		# Follow your cluster provider's upgrade process
		```

		## Troubleshooting Production Issues

		### High Availability Issues

		Check pod distribution across nodes:
		```bash
		kubectl get pods -A -o wide
		```
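
To see at a glance whether pods are spread evenly, the node names can be tallied. A minimal sketch; the function name `count_pods_per_node` is hypothetical, and it expects one node name per line on stdin.

```bash
#!/usr/bin/env bash
# Reads one node name per line on stdin and prints "<node> <pod-count>",
# sorted by node name, so an uneven spread is easy to spot.
count_pods_per_node() {
  awk 'NF { count[$1]++ } END { for (n in count) print n, count[n] }' | LC_ALL=C sort
}
```

One way to feed it is `kubectl get pods -A --no-headers -o custom-columns=NODE:.spec.nodeName | count_pods_per_node`. Pods that have not been scheduled yet show up under `<none>`.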

		### Performance Issues

		Review resource usage:
		```bash
		kubectl top nodes
		kubectl top pods -A
		```
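
The node figures can also be filtered against a threshold. The sketch below assumes the default `kubectl top nodes` column layout (`NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%`); the function name `list_hot_nodes` and the 80% default are illustrative choices, not project settings.

```bash
#!/usr/bin/env bash
# Reads `kubectl top nodes --no-headers` output on stdin and prints
# "<node> <cpu%> <memory%>" for nodes above the threshold (default 80).
# Usage: kubectl top nodes --no-headers | list_hot_nodes [threshold-percent]
list_hot_nodes() {
  awk -v limit="${1:-80}" '
    # Adding 0 converts "95%" to the number 95 before comparing.
    ($3 + 0) > limit || ($5 + 0) > limit { print $1, $3, $5 }'
}
```

`kubectl top` relies on a metrics source such as metrics-server being installed in the cluster.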

		### Security Incidents

		1. Isolate affected components
		2. Review audit logs
		3. Check for unauthorized access
		4. Update security policies
		5. Patch vulnerabilities

		## Best Practices

		1. Never use default passwords in production
		2. Always use TLS/SSL for external endpoints
		3. Implement proper RBAC and least privilege
		4. Monitor everything (metrics, logs, traces)
		5. Test backups regularly
		6. Document your configuration and procedures
		7. Automate deployments (use CI/CD)
		8. Version control all configurations
9. Run regular security audits
10. Stay up to date with security patches

		## Related Documentation

		- [Existing Cluster Deployment](existing-cluster.md)
		- [Managing Secrets](../how-to/manage-secrets.md)
		- [Architecture Overview](../reference/architecture.md)

		---

		Need assistance with production deployment? Contact the i2CAT team for consultation and support.