Commit 2bf39555 authored by Sergio Gimenez

Improve docs

parent 339f8de0
# Production-Ready Deployment

!!! warning "Work in Progress"
    This guide is still being developed. For production deployments, consult the i2CAT team.

## Overview

The Quick Start and Single OOP deployments are designed for development and testing. Production environments add requirements for security, reliability, and scalability.

## Key Differences from Development

| Aspect | Development | Production |
|--------|-------------|------------|
| **Kubernetes** | Kind (local) | Managed K8s or production cluster |
| **Passwords** | Defaults | Strong, unique passwords |
| **TLS/SSL** | HTTP | HTTPS with valid certificates |
| **Storage** | Ephemeral | Persistent volumes with backups |
| **Monitoring** | Basic Prometheus | Full observability stack |
| **High Availability** | Single node | Multi-node with redundancy |
| **Security** | Minimal | Hardened, with RBAC and policies |

## Prerequisites

Before deploying to production:

- [ ] Production Kubernetes cluster (not Kind)
- [ ] Persistent storage solution
- [ ] SSL/TLS certificates
- [ ] Secret management system (Vault, etc.)
- [ ] Monitoring and alerting infrastructure
- [ ] Backup and disaster recovery plan
- [ ] Security policies and RBAC configuration

## Production Architecture

```
                        ┌─────────────────┐
                        │   Load Balancer │
                        │   (Ingress)     │
                        └────────┬────────┘
                                 │
                    ┌────────────┴────────────┐
                    │                         │
              ┌─────▼─────┐           ┌──────▼──────┐
              │  Master   │           │   Master    │
              │  Node     │           │   Node      │
              │  (HA)     │           │   (HA)      │
              └───────────┘           └─────────────┘
                    │                         │
        ┌───────────┴─────────────────────────┴───────────┐
        │                                                   │
   ┌────▼─────┐  ┌────────────┐  ┌────────────┐  ┌───────▼────┐
   │  Worker  │  │   Worker   │  │   Worker   │  │   Worker   │
   │  Node    │  │   Node     │  │   Node     │  │   Node     │
   └──────────┘  └────────────┘  └────────────┘  └────────────┘
        │              │                │               │
        └──────────────┴────────────────┴───────────────┘
                            │
                    ┌───────▼────────┐
                    │  Persistent    │
                    │  Storage       │
                    │  (NFS/Ceph)    │
                    └────────────────┘
```

## Production Checklist

### Infrastructure

- [ ] Use production-grade Kubernetes cluster (not Kind)
  - Managed: EKS, GKE, AKS
  - Self-managed: kubeadm with HA setup
- [ ] Multi-node cluster (minimum 3 control-plane (master) nodes and 3 worker nodes; an odd control-plane count keeps etcd quorum)
- [ ] Persistent storage configured (NFS, Ceph, cloud storage)
- [ ] Network policies implemented
- [ ] Load balancer configured
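
As one way to start on the network policy item above, a namespace can be given a default-deny ingress policy, with explicit allow rules added per service afterwards. This is a minimal sketch: it assumes the platform runs in the `operator-platform` namespace (the one used for the TLS secret in this guide) and that the cluster's CNI plugin enforces `NetworkPolicy`. The policy name is hypothetical.

```yaml
# default-deny-ingress.yaml (illustrative name)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: operator-platform   # assumed namespace
spec:
  podSelector: {}                # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress                    # no ingress rules listed, so all inbound traffic is denied
```

Applying it with `kubectl apply -f default-deny-ingress.yaml` blocks inbound traffic to every pod in the namespace until allow policies are added, so it is safer to roll out alongside those rules.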

### Security

- [ ] All default passwords changed
- [ ] SSL/TLS certificates installed
- [ ] RBAC policies configured
- [ ] Network policies enforced
- [ ] Pod Security Standards enforced (PodSecurityPolicy was removed in Kubernetes 1.25)
- [ ] Secrets encrypted at rest
- [ ] Service mesh for mTLS (optional but recommended)
- [ ] Regular security scanning
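
Pod Security Standards are typically enforced through namespace labels read by the built-in Pod Security Admission controller. The sketch below assumes the `operator-platform` namespace and picks `baseline` enforcement with `restricted` warnings as a starting point; the right levels depend on what the platform's workloads actually require.

```yaml
# namespace-pod-security.yaml (illustrative name)
apiVersion: v1
kind: Namespace
metadata:
  name: operator-platform        # assumed namespace
  labels:
    # Reject pods that violate the baseline profile
    pod-security.kubernetes.io/enforce: baseline
    # Only warn and audit against the stricter profile while workloads are being hardened
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

Starting with `warn` and `audit` at the stricter level surfaces violations without breaking running workloads.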

### Monitoring & Logging

- [ ] Prometheus configured with persistent storage
- [ ] Grafana dashboards customized
- [ ] Alertmanager rules configured
- [ ] Log aggregation (ELK, Loki, etc.)
- [ ] Distributed tracing (Jaeger, Zipkin)
- [ ] Uptime monitoring

### Backup & Recovery

- [ ] Automated backup of persistent volumes
- [ ] Backup of Kubernetes resources (Velero)
- [ ] Backup of Harbor registry
- [ ] Disaster recovery plan documented
- [ ] Regular backup testing

### Scalability

- [ ] Horizontal Pod Autoscaling configured
- [ ] Cluster autoscaling enabled
- [ ] Resource limits and requests defined
- [ ] Load testing performed
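
A Horizontal Pod Autoscaler for one component might look like the sketch below. The Deployment name `op-api` is hypothetical, the thresholds are placeholders to tune with load testing, and CPU-based scaling assumes metrics-server (or an equivalent metrics API) is installed and that the target pods define CPU requests.

```yaml
# hpa.yaml (illustrative name)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: op-api
  namespace: operator-platform   # assumed namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: op-api                 # hypothetical Deployment
  minReplicas: 3                 # keep at least three replicas for availability
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # scale out above 70% of requested CPU
```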

## Deployment Steps

### 1. Prepare Production Cluster

Ensure you have a production Kubernetes cluster ready. For managed Kubernetes:

**AWS EKS:**
```bash
eksctl create cluster --name op-production --region us-west-2 --nodes 3
```
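
The same EKS cluster can also be described declaratively, which makes the definition reviewable and version-controlled. The following is a sketch only: the node group name, instance type, and size limits are assumptions, not values taken from this project.

```yaml
# cluster.yaml (illustrative name), used with: eksctl create cluster -f cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: op-production
  region: us-west-2
managedNodeGroups:
  - name: workers                # hypothetical node group name
    instanceType: m5.large       # assumed instance type; size to your workload
    desiredCapacity: 3
    minSize: 3
    maxSize: 6
```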

**Google GKE:**
```bash
# With --region the cluster is regional: --num-nodes is the node count per zone
gcloud container clusters create op-production --num-nodes=3 --region=us-central1
```

**Azure AKS:**
```bash
az aks create --resource-group myResourceGroup --name op-production --node-count 3
```

### 2. Configure Persistent Storage

Set up a storage class for persistent volumes:

```yaml
# storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: op-storage
provisioner: ebs.csi.aws.com  # AWS EBS CSI driver (needed for gp3); use the provisioner that matches your platform
parameters:
  type: gp3
  encrypted: "true"
```
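
A workload then requests storage from that class through a PersistentVolumeClaim. In this sketch the claim name, namespace, and size are illustrative assumptions; only `op-storage` comes from the storage class above.

```yaml
# pvc.yaml (illustrative name)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: op-data                  # hypothetical claim name
  namespace: operator-platform   # assumed namespace
spec:
  accessModes:
    - ReadWriteOnce              # block volumes such as EBS attach to one node at a time
  storageClassName: op-storage
  resources:
    requests:
      storage: 20Gi              # placeholder size
```

After applying it, `kubectl get pvc -n operator-platform` shows whether the claim has been bound.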

### 3. Configure SSL/TLS

Obtain certificates and create Kubernetes secrets:

```bash
kubectl create secret tls op-tls-cert \
  --cert=path/to/cert.crt \
  --key=path/to/cert.key \
  -n operator-platform
```
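
The secret is then referenced from an Ingress so that TLS terminates at the ingress controller. This is a sketch under several assumptions: an NGINX ingress class named `nginx`, a backend Service called `op-api` listening on port 80, and `your-domain.com` as a placeholder host. None of these names come from the platform's own manifests.

```yaml
# ingress.yaml (illustrative name)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: op-ingress               # hypothetical name
  namespace: operator-platform
spec:
  ingressClassName: nginx        # assumed ingress class
  tls:
    - hosts:
        - your-domain.com        # must match a name in the certificate
      secretName: op-tls-cert    # the TLS secret created above
  rules:
    - host: your-domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: op-api     # hypothetical backend Service
                port:
                  number: 80
```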

### 4. Customize Group Variables

Create production-specific variables:

```yaml
# inventory/production/group_vars/all.yml
harbor_admin_password: "{{ vault_harbor_password }}"
enable_tls: true
storage_class: "op-storage"
replica_count: 3  # for HA
```

### 5. Deploy Using Existing Cluster Playbooks

```bash
ansible-playbook playbooks/existing-cluster/deploy.yml \
  -i inventory/production \
  -e @production-secrets.yml \
  --vault-password-file ~/.vault_pass
```
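
The `production-secrets.yml` file passed with `-e @` is where variables such as `vault_harbor_password` would be defined. A minimal sketch of its layout follows; the value is a placeholder, and it assumes the file is encrypted with Ansible Vault using the same password file as the deploy command.

```yaml
# production-secrets.yml
# Encrypt before committing or sharing:
#   ansible-vault encrypt production-secrets.yml --vault-password-file ~/.vault_pass
# Edit later with:
#   ansible-vault edit production-secrets.yml --vault-password-file ~/.vault_pass
vault_harbor_password: "CHANGE-ME-strong-unique-password"   # placeholder value
```

Keeping the `vault_` prefix on encrypted variables and referencing them from `group_vars` makes it clear which values are secrets.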

## Post-Deployment

### Verify Deployment

```bash
# Check all pods are running
kubectl get pods -A

# Verify persistent volumes
kubectl get pv

# Check ingress
kubectl get ingress -A

# Test service endpoints (no -k: the certificate must validate in production)
curl --fail https://your-domain.com/healthz
```

### Configure Monitoring

Set up alerting rules in Prometheus:

```yaml
# prometheus-alerts.yaml
groups:
  - name: operator-platform
    rules:
      - alert: PodDown
        expr: up{job="kubernetes-pods"} == 0
        for: 5m
        annotations:
          summary: "Pod {{ $labels.pod }} is down"
```
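
Alerts only reach people once Alertmanager routes them to a receiver. The configuration below is a minimal sketch with a single catch-all route. The receiver name and webhook URL are hypothetical, and the timing values are starting points, not recommendations from this project.

```yaml
# alertmanager.yml
route:
  receiver: ops-team             # default receiver for every alert
  group_by: ['alertname', 'namespace']
  group_wait: 30s                # wait to batch alerts that fire together
  group_interval: 5m
  repeat_interval: 4h            # re-notify while an alert keeps firing
receivers:
  - name: ops-team               # hypothetical receiver
    webhook_configs:
      - url: https://alerts.example.com/hook   # placeholder endpoint
```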

### Set Up Backups

Configure Velero for Kubernetes backups:

```bash
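# Velero 1.2+ needs the provider plugin image and credentials at install time.
# <plugin-version> and ./credentials-velero are placeholders to fill in.
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:<plugin-version> \
  --bucket op-backups \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-west-2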

velero schedule create daily-backup --schedule="0 2 * * *"
```
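
The same schedule can be kept in version control as a Velero `Schedule` resource. This sketch assumes Velero runs in its default `velero` namespace, that only the `operator-platform` namespace needs backing up, and that 30 days of retention is acceptable. Adjust all three to your recovery requirements.

```yaml
# velero-schedule.yaml (illustrative name)
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero              # Velero's default install namespace
spec:
  schedule: "0 2 * * *"          # every day at 02:00
  template:
    includedNamespaces:
      - operator-platform        # assumed namespace to back up
    ttl: 720h0m0s                # keep each backup for 30 days
```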

## Maintenance

### Regular Tasks

- **Weekly**: Review logs and metrics
- **Monthly**: Update components and apply security patches
- **Quarterly**: Disaster recovery testing
- **Annually**: Review and update security policies

### Updating Components

```bash
# Update Harbor
ansible-playbook playbooks/tools/harbor/upgrade.yml -i inventory/production

# Update Kubernetes
# Follow your cluster provider's upgrade process
```

## Troubleshooting Production Issues

### High Availability Issues

Check pod distribution across nodes:
```bash
kubectl get pods -A -o wide
```
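
If replicas of one component pile up on a single node, a topology spread constraint in the pod template asks the scheduler to spread them out. The fragment below is a sketch. The `app: op-api` label is hypothetical and must match the labels your pods actually carry.

```yaml
# Fragment of a Deployment spec, not a complete manifest
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                          # allow at most one pod of difference between nodes
          topologyKey: kubernetes.io/hostname # spread across individual nodes
          whenUnsatisfiable: ScheduleAnyway   # prefer spreading, but never block scheduling
          labelSelector:
            matchLabels:
              app: op-api                     # hypothetical pod label
```

Switching `whenUnsatisfiable` to `DoNotSchedule` makes the spread a hard requirement, at the cost of pending pods when nodes are scarce.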

### Performance Issues

Review resource usage:
```bash
kubectl top nodes
kubectl top pods -A
```
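
The usage figures from `kubectl top` are easiest to interpret against explicit requests and limits on each container. The fragment below is a sketch with placeholder values; the container name and image are hypothetical.

```yaml
# Fragment of a pod spec, not a complete manifest
containers:
  - name: op-api                                # hypothetical container
    image: registry.example.com/op-api:1.0.0    # placeholder image
    resources:
      requests:
        cpu: 250m          # what the scheduler reserves on the node
        memory: 256Mi
      limits:
        memory: 512Mi      # the container is OOM-killed above this
```

Sustained usage near the memory limit, or well above the CPU request, usually means the values need revisiting.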

### Security Incidents

1. Isolate affected components
2. Review audit logs
3. Check for unauthorized access
4. Update security policies
5. Patch vulnerabilities
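
For the isolation step, one option is a quarantine policy that cuts all traffic to and from any pod carrying a specific label. This is a sketch: the `quarantine: "true"` label is a convention invented for this example, and the policy only takes effect if the cluster's CNI plugin enforces `NetworkPolicy`.

```yaml
# quarantine-policy.yaml (illustrative name)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine
  namespace: operator-platform   # assumed namespace
spec:
  podSelector:
    matchLabels:
      quarantine: "true"         # hypothetical label applied during an incident
  policyTypes:
    - Ingress
    - Egress                     # no rules listed: all traffic in both directions is denied
```

With the policy in place, `kubectl label pod <pod-name> quarantine=true -n operator-platform` isolates a suspect pod while leaving it running for forensic inspection.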

## Best Practices

1. **Never** use default passwords in production
2. **Always** use TLS/SSL for external endpoints
3. **Implement** proper RBAC and least privilege
4. **Monitor** everything (metrics, logs, traces)
5. **Test** backups regularly
6. **Document** your configuration and procedures
7. **Automate** deployments (use CI/CD)
8. **Version control** all configurations
9. **Audit** security regularly
10. **Stay updated** with security patches
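
As an illustration of least privilege, access can be granted per namespace through a narrowly scoped Role instead of cluster-wide admin rights. The sketch below defines a read-only role and binds it to a group. The role, binding, and `op-viewers` group names are hypothetical, and the group must exist in whatever identity provider your cluster uses.

```yaml
# rbac-readonly.yaml (illustrative name)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: op-readonly
  namespace: operator-platform   # assumed namespace
rules:
  - apiGroups: ["", "apps"]      # core API group plus apps
    resources: ["pods", "services", "deployments"]
    verbs: ["get", "list", "watch"]   # read-only access
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: op-readonly-binding
  namespace: operator-platform
subjects:
  - kind: Group
    name: op-viewers             # hypothetical group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: op-readonly
  apiGroup: rbac.authorization.k8s.io
```

`kubectl auth can-i list pods -n operator-platform --as-group=op-viewers --as=test-user` is a quick way to check that the binding grants what is intended and nothing more.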

## Related Documentation

- [Existing Cluster Deployment](existing-cluster.md)
- [Managing Secrets](../how-to/manage-secrets.md)
- [Architecture Overview](../reference/architecture.md)

---

**Need assistance with production deployment?** Contact the i2CAT team for consultation and support.