Let’s say you have a Kubernetes cluster with Prometheus and multiple workloads running on it. You want to monitor the health of both the cluster and the workloads.
Picture this: I’m sitting in a room packed with the infrastructure team, the vendor, and our developers. Tension is high. We had just gone through a platform bridge change that caused IPs to cycle.…
Managing an internal AI platform (OpenWebUI/LiteLLM) at a 500+ DAU scale is a constant lesson in the high cost of “Optics.” When working with innovative teams that demand the latest Gemini models the…