# Surviving a 100% CPU Database Meltdown in Open WebUI - Fixing a Hidden Full Table Scan

2026-07-06 4 min read

We host a beta version of Open WebUI, an open-source tool, at enterprise scale. We have over 1,000 daily users, with 500+ concurrent users at peak times.

SRE

# Automating Kubernetes Observability: Scaling Your Metrics with Dynamic Discovery

2026-06-15 3 min read

Let’s say you have a Kubernetes cluster with Prometheus and multiple workloads running on it. You want to monitor the health of both the cluster and the workloads.

SRE Infrastructure Observability

# Anatomy of a 3-Hour Outage: How a Single Redis Config DDoS’d Our Own Production

2026-06-05 4 min read

Picture this: I’m sitting in a room packed with the infrastructure team, the vendor, and our developers. Tension is high. We had just gone through a platform bridge change that caused IPs to cycle.…

SRE Infrastructure

# Scaling Multi-Tenant Runtimes: Why Isolation is Mandatory at 500+ DAU

2026-03-07 3 min read

Managing an internal AI platform (Open WebUI/LiteLLM) at a 500+ DAU scale is a constant lesson in the high cost of “Optics.” When working with innovative teams that demand the latest Gemini models the…

Software Architecture SRE Platform Engineering
