the warm pool trap
Every mid-stage startup I've talked to in the last two years has the same origin story for their warm pools. Some Tuesday at 2pm, a product manager noticed that the first request after a quiet period was taking 4-6 seconds, someone filed a P1, and an engineer solved it the fastest way possible: keep a fleet of pre-warmed instances running 24/7. Problem gone. Ticket closed. But that "solution" is now costing you $3,000-$8,000 a month in compute that's sitting at 3% CPU utilization for roughly 20 hours of every day.
The underlying assumption baked into a warm pool is that cold-start latency is a property of your infrastructure, something you solve by throwing always-on capacity at it. It's not. Cold-start latency is a property of your initialization logic, your dependency injection, your database connection setup, and your first-request JIT penalties. You're renting servers to paper over a code problem, and the bill grows every time you add a new environment or a new region.
We see this constantly at steezr, especially with clients who've graduated from a couple of EC2 instances to something ECS or Kubernetes-shaped. The muscle memory from the early "just add more" phase carries over, and nobody stops to ask whether the warm pool is actually the cheapest path to the latency target. Usually it isn't. Usually there's a combination of faster initialization, smarter bursting, and client-side retry behavior that gets you to the same p95 number at a fraction of the cost.
what the alternative actually looks like
The pattern I want to describe has four pieces that fit together, and it only works if you implement all four. Swap one out and you either blow your latency budget or rebuild a warm pool under a different name.
First, you need a Go worker pool with pre-allocated goroutines that keeps a small number of workers genuinely warm, meaning they've already done the expensive initialization work, but auto-scales the pool size based on queue depth rather than a static min-max setting. Second, a Redis 7 token bucket in front of your burst capacity, so you're not hammering downstream services during the scale-up window and you have a clean mechanism for rate-limiting individual tenants without global throttling. Third, KEDA 2.x watching that same Redis queue (or an SQS queue, or whatever you've got) and making scale-to-zero decisions based on actual lag rather than CPU metrics, which are a lagging indicator and nearly useless for this kind of workload. Fourth, and this is the one people skip, idempotent requests with client-side exponential backoff and a retry budget, so that the 150-300ms window during a cold scale-up from zero is invisible to the user.
Those four pieces together get you to sub-200ms p95 on steady traffic, sub-500ms p95 during burst events, and actual scale-to-zero during off-hours instead of a permanently running warm fleet. The math on cost is not subtle. If your warm pool is running 10 x c6g.large instances overnight, you're burning maybe $180/month on that alone, and that's before you count the identical setup you have in staging.
the Go worker pool piece
The worker pool implementation in Go is the part of this that people overthink. You don't need a framework. The standard library has everything you need, and if you reach for a third-party worker pool library before you've profiled your initialization code, you're solving the wrong problem.
The pattern I use keeps a channel of Worker structs, each of which has already established its database connections, loaded its configuration, and warmed whatever LRU caches it needs on startup. The pool manager runs a goroutine that watches queue depth (pulled from Redis via a simple LLEN call every 500ms) and spawns or reclaims workers accordingly. The minimum pool size, the one that stays warm, is determined empirically. You profile your initialization time, you figure out how long KEDA takes to bring a new pod up (typically 15-25 seconds with a reasonably tuned pollingInterval), and you keep enough workers warm to handle traffic during that window without the token bucket firing.
A concrete snippet for the pool resize logic in Go, roughly:
```go
func (p *WorkerPool) adjustSize(ctx context.Context) {
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			depth, err := p.redis.LLen(ctx, p.queueKey).Result()
			if err != nil {
				continue
			}
			target := clamp(
				int(depth/int64(p.cfg.JobsPerWorker)),
				p.cfg.MinWorkers,
				p.cfg.MaxWorkers,
			)
			p.resize(target)
		}
	}
}
```
The clamp function just bounds the target between min and max. resize either starts new workers (each of which does its full initialization in a new goroutine) or signals existing idle workers to shut down gracefully. Nothing exotic here. The whole pool manager is about 200 lines of Go that you can read in 10 minutes.
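For completeness, a minimal version of the clamp helper the snippet above assumes:

```go
package main

import "fmt"

// clamp bounds target between min and max, so the pool never shrinks
// below its warm floor or grows past its configured ceiling.
func clamp(target, min, max int) int {
	if target < min {
		return min
	}
	if target > max {
		return max
	}
	return target
}

func main() {
	fmt.Println(clamp(0, 2, 20), clamp(7, 2, 20), clamp(99, 2, 20))
}
```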
The initialization code path is where you actually save latency. Move every blocking operation, every database connection, every config fetch into the worker's initialize() method, which runs once per worker lifecycle, not once per job. If your worker handles document processing and needs an S3 client, a Postgres connection from pgx/v5, and a Redis client, all three of those get set up in initialize(). The job handler itself should be doing nothing but business logic.
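The shape of that separation, sketched with stand-in types so it runs anywhere (in production the stubs would be the actual pgx/v5 pool, S3 client, and Redis client):

```go
package main

import "fmt"

// Stand-ins for the real clients; placeholders for the actual SDK types.
type pgPool struct{}
type s3Client struct{}
type redisClient struct{}

type Worker struct {
	db    *pgPool
	s3    *s3Client
	redis *redisClient
	inits int // counts initialize runs, for illustration only
}

// initialize runs once per worker lifecycle and does every blocking
// setup step: connections, config fetches, cache warming.
func (w *Worker) initialize() error {
	w.db = &pgPool{}         // e.g. pgxpool.New(ctx, dsn)
	w.s3 = &s3Client{}       // e.g. s3.NewFromConfig(cfg)
	w.redis = &redisClient{} // e.g. redis.NewClient(opts)
	w.inits++
	return nil
}

// handle does business logic only; no connection setup per job.
func (w *Worker) handle(job string) string {
	return "processed:" + job
}

func main() {
	w := &Worker{}
	if err := w.initialize(); err != nil {
		panic(err)
	}
	// Three jobs, one initialization.
	for _, j := range []string{"a", "b", "c"} {
		fmt.Println(w.handle(j))
	}
	fmt.Println("initializations:", w.inits)
}
```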
Redis token buckets for burst control
The token bucket lives in Redis and it serves two purposes that people usually conflate into one messy rate-limiter: it protects downstream services during the burst window, and it gives you per-tenant fairness without a distributed coordination layer.
You can get this from the CL.THROTTLE command in the redis-cell module, or implement it cleanly with a Lua script, which executes atomically on any modern Redis. The Lua approach I prefer looks something like this:
```lua
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local fill_time = capacity / rate
local ttl = math.floor(fill_time * 2)

local last_tokens = tonumber(redis.call('GET', key .. ':tokens') or capacity)
local last_refreshed = tonumber(redis.call('GET', key .. ':ts') or now)

local delta = math.max(0, now - last_refreshed)
local filled = math.min(capacity, last_tokens + (delta * rate))
local new_tokens = filled - requested

if new_tokens >= 0 then
  redis.call('SETEX', key .. ':tokens', ttl, new_tokens)
  redis.call('SETEX', key .. ':ts', ttl, now)
  return 1
else
  return 0
end
```
You call this atomically for each incoming job before it enters the worker pool queue. If the bucket is empty, the request gets a 429 with a Retry-After header calculated from the refill rate, which is exactly the signal the client-side retry logic needs.
Per-tenant keys look like throttle:tenant:{tenant_id}, and you set different capacity and rate values based on the tenant's plan tier. This lets a bursty free-tier tenant get throttled without affecting your paying customers, which is a problem that's genuinely hard to solve cleanly without something like this.
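The Retry-After math is the deficit divided by the refill rate, rounded up to whole seconds. A pure-Go sketch of that calculation (the function name and parameters are illustrative, not from any library):

```go
package main

import (
	"fmt"
	"math"
)

// retryAfterSeconds computes how long a client should wait before the
// bucket has refilled enough tokens: the deficit divided by the refill
// rate, rounded up to a whole second for the Retry-After header.
func retryAfterSeconds(tokens, requested, ratePerSec float64) int {
	deficit := requested - tokens
	if deficit <= 0 {
		return 0 // enough tokens already; no wait needed
	}
	return int(math.Ceil(deficit / ratePerSec))
}

func main() {
	// Bucket holds 0.5 tokens, the job needs 1, refill is 2 tokens/sec:
	// the deficit refills in 0.25s, so we round up to Retry-After: 1.
	fmt.Println(retryAfterSeconds(0.5, 1, 2))
}
```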
KEDA configuration that doesn't lie to you
KEDA 2.x with a Redis scaler is where the scale-to-zero magic happens, and the configuration details matter more than people expect. The default pollingInterval of 30 seconds is too slow for any workload where you care about latency. We run it at 10 seconds for latency-sensitive workloads, which means your pod count can be growing within 10-25 seconds of a queue spike, depending on how fast your pod startup is.
The scaler configuration I'd use for a Redis list queue:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker-deployment
  pollingInterval: 10
  cooldownPeriod: 120
  minReplicaCount: 0
  maxReplicaCount: 50
  triggers:
    - type: redis
      metadata:
        address: redis-service:6379
        listName: job-queue
        listLength: "5"
        activationListLength: "1"
```
The listLength: "5" means KEDA targets one replica per 5 items in the queue, which you tune based on your average job duration and acceptable lag. activationListLength: "1" means KEDA scales from zero to one replica as soon as a polling cycle sees a single item in the queue, instead of waiting for the queue to reach the listLength target of 5.
The cooldownPeriod of 120 seconds is conservative but right for most workloads. If you set it too low, you'll flap between zero and one replicas on bursty but low-volume traffic, and every scale-up burns that 15-25 second initialization window. The Go worker pool's minimum worker count inside the pod covers the gap between the KEDA scale event and your first pod being ready, so you have two levels of warm capacity: pods that are already running (handled by KEDA's minReplicaCount if you can't tolerate full scale-to-zero) and workers inside a running pod that are pre-initialized.
One trap I've seen repeatedly: people configure KEDA against CPU metrics because that's what they're used to from HPA, and then wonder why their latency is spiky. CPU lags behind queue depth by 10-30 seconds on typical async workloads. Queue depth is a leading indicator. Use it.
client-side retry is load-bearing
Every engineering team I've met underestimates how much of their latency "solution" is actually just hiding behind always-on capacity. When you go to scale-to-zero, you have to confront the fact that the first request after a quiet period will hit a pod that's initializing. If your client gives up after one attempt, you have a correctness problem that no amount of KEDA tuning fixes.
The retry contract needs three things: idempotency keys on every mutating request, exponential backoff with jitter, and a retry budget rather than a fixed retry count. Idempotency keys are non-negotiable. If a request might land on a pod that's mid-initialization and responds with a 503, you need the client to retry safely without double-processing. A UUID in the Idempotency-Key header, stored in Redis with a TTL that matches your job TTL, is sufficient.
The backoff curve matters. Linear backoff with a fixed 500ms delay will cause thundering herd problems during the scale-up window when 200 clients are all retrying at the same cadence. Exponential with jitter = random(0, base * 2^attempt) spreads the load across the 15-25 second window and dramatically reduces the chance that you're overloading the first pod that comes up.
A retry budget means you allow, say, 3 retries over a maximum of 8 seconds, not 3 retries with no time bound. If the worker fleet hasn't scaled up within 8 seconds of the first attempt, something is genuinely wrong and you should surface an error, not keep retrying. The budget prevents a bad deployment or a misconfigured KEDA trigger from turning a failed scale-up into a client-side infinite loop.
We've shipped this pattern on a document processing pipeline for a client where p95 was sitting at ~4 seconds with warm pools and a $2,100/month EC2 bill for the worker fleet alone. After the migration, p95 settled at 180ms on steady traffic, 420ms during the rare burst events, and the worker fleet cost dropped to around $340/month. The difference is almost entirely from scale-to-zero working overnight and on weekends, when the workload was near-zero anyway.
where this breaks down
This pattern has failure modes and you should know them before you commit.
The obvious one: if your job initialization time is over 30 seconds, the math stops working. A Go worker that's connecting to a vector database, loading a 2GB model into memory, and pre-compiling a bunch of regex patterns isn't going to be ready within the retry budget. For those workloads, you genuinely do need a small warm floor, minReplicaCount: 2 or so, and you accept the cost as the price of having a slow initialization path. The fix is usually to make the initialization faster, not to expand the warm floor.
Another one: Redis as a queue introduces a durability tradeoff. If your Redis instance goes down during a burst event, your queue disappears. For workloads where job loss is unacceptable, you want SQS or a Postgres-backed queue (we've used pgq and river for Go) with KEDA's corresponding scaler. KEDA has first-class support for SQS via the AWS SQS scaler and for Postgres via the postgresql scaler, both of which work with the same ScaledObject pattern.
The token bucket Lua script assumes your Redis deployment handles atomicity correctly, which it does for standalone Redis but gets complicated with Redis Cluster due to key slot hashing. If you're on Redis Cluster, the braces in throttle:tenant:{tenant_id} double as a hash tag, which is what forces a given tenant's :tokens and :ts keys into the same slot; keep them, or run a single-shard Redis for rate limiting separate from your queue Redis.
Finally, this pattern works well for async and near-async workloads, things like document processing, report generation, webhook delivery, background AI inference. For synchronous request/response APIs where the user is staring at a spinner, scale-to-zero is much harder to sell because even a 400ms cold-start is perceptible. For those, a minimal warm floor is the right call, but you can still apply the token bucket and adaptive pool sizing inside your existing pods to reduce over-provisioning.
