Site Reliability Engineer
2025 — Present
- Automated lifecycle management of Kubernetes clusters running NVIDIA and AMD GPUs, including health monitoring and auto-remediation of unhealthy nodes.
- Contributed to Cilium (CNCF) — fixed a race condition in the IPSec key rotation path that caused packet drops during agent rolling restarts. Solution used pinned BPF maps to defer peer advertisement until kernel XFRM states were ready (PR #44701).
- Drove the migration from kube-proxy to Cilium's kube-proxy-replacement across all clusters, auditing the Cilium codebase to validate impact and correctness of service routing.
- Led the migration of Kubernetes clusters from Ingress NGINX to Gateway API + Envoy.
- Enabled secure arbitrary code execution for AI workloads by architecting a sandboxed runtime on Kata Containers backed by Firecracker microVMs — VM-level isolation inside Kubernetes for running untrusted code.