Course Outline
EXO Infrastructure as Code
- Overview of EXO deployment patterns: single-node, multi-node, and RDMA clusters
- Automating dependency installation (Xcode, uv, Node.js, Rust) with configuration management
- Using Nix flakes for reproducible EXO builds and developer environments
- Writing Ansible playbooks or shell scripts for unattended cluster provisioning
Reproducible Builds and CI Integration
- Pinning dependencies and building the dashboard in CI pipelines
- Running EXO smoke tests in GitHub Actions or GitLab CI runners
- Creating golden images and snapshot-based rollback workflows for macOS and Linux VMs
- Versioning custom model cards alongside application code
Cluster Discovery and Networking Automation
- Configuring mDNS and static DNS for reliable libp2p node discovery
- Automating network profile creation and Thunderbolt bridge management on macOS
- Using custom namespaces (EXO_LIBP2P_NAMESPACE) to separate dev, staging, and prod clusters
- Firewall rules and network segmentation for multi-tenant environments
Storage and Model Lifecycle Management
- Designing EXO_MODELS_DIRS and EXO_MODELS_READ_ONLY_DIRS strategies
- Mounting NFS or SAN shares as read-only model repositories for fast provisioning
- Garbage collection of stale caches and versioned weight retention policies
- Automating model pre-downloads and health checks before rolling updates
Monitoring and Alerting
- Shipping EXO logs to centralized logging (ELK, Loki, or Splunk)
- Building Grafana dashboards from EXO_TRACING_ENABLED output
- Alerting on cluster membership changes, OOM events, and inference latency spikes
- Correlating macmon hardware telemetry with model performance regressions
Update, Rollback, and Disaster Recovery
- Staging EXO binary updates in a canary node before fleet-wide rollout
- Model-level rollback: switching between quantized versions without re-downloading
- Backing up and restoring cluster state, custom namespaces, and cached weights
- Documenting recovery runbooks for total cluster rebuild scenarios
Security Hardening and Compliance
- Applying TLS at the reverse proxy layer (nginx, traefik) for the dashboard and API
- Implementing API rate limiting and IP whitelisting for EXO endpoints
- Isolating clusters with VLANs and zero-trust network policies
- Auditing access and maintaining an inventory of deployed models and versions
Requirements
- Experience with DevOps practices (CI/CD, IaC, container orchestration)
- Familiarity with macOS or Linux system administration and package management
- Understanding of networking, DNS, and storage concepts
Audience
- DevOps engineers
- Infrastructure architects
- SREs responsible for on-premises AI workloads
Custom Corporate Training
Training solutions designed exclusively for businesses.
- Customized content: We adapt the syllabus and hands-on exercises to the real goals and needs of your project.
- Flexible schedule: Dates and times adapted to your team's agenda.
- Format: Online (live), In-company (at your offices), or Hybrid.
Price per private group (online training) from €4,350 + VAT*
Contact us for an exact quote and to learn about our current promotions
Testimonials (2)
The consultant's knowledge and experience, since the theoretical topics are addressed by applying them to real-world processes. The course offers a highly valuable program on managing information technologies.
Luis Castro Gamboa - Cooperativa De Ahorro Y Credito Ande No. 1 R.L.
Course - Site Reliability Engineering (SRE) Foundation®
That he was very clear on every specification