Designing a fail-open Redis layer
May 10, 2026
Redis is everywhere in this platform. It backs the Rails cache, rate-limit counters, flash-sale inventory, and live counters on nearly every page. It is the reason a lot of things are fast.
Which is exactly the problem. A dependency that makes you fast is, by default,
a dependency that takes you down. If every REDIS_POOL.with block can raise,
then a Redis hiccup turns into a wall of 500s. So the rule for this codebase
became simple to state and harder to enforce:
Redis is fail-open. When it is down, the platform degrades. It never 500s, and it never causes a healthy node to be pulled from the load balancer.
Rescue every call — through one door
The first half is easy: rescue every raw Redis call. The trap is doing it
inconsistently — a logger.warn here, a swallowed exception there, a
rescue nil somewhere else. Six months later nobody can answer “how often is
Redis failing?” because the failures are logged six different ways.
So every rescue routes through one object, RedisErrorReporter:
    begin
      REDIS_POOL.with { |r| r.incr(key) }
    rescue => e
      RedisErrorReporter.report(e, context: "LiveCounter#bump",
                                   level: :warn, extra: { key: key })
      nil # fail open: caller treats nil as "no cached value"
    end
One reporter means one log format — [Class#method] Redis error: ... — which
means one alert rule and one greppable string. The reporter is itself
fail-safe: if reporting the error somehow raises, that is swallowed too. The
thing that tells you Redis is broken must not break when Redis is broken.
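The reporter itself is not shown in this post, so here is a minimal sketch of what such an object could look like. Everything beyond the `report(e, context:, level:, extra:)` call and the `[Class#method] Redis error:` prefix is an assumption: the `logger` accessor, the choice to append the error class and message, and the `DownPool` and `BrokenLogger` stand-ins exist only to simulate an outage.

```ruby
require "logger"
require "stringio"

# Hypothetical sketch; the real RedisErrorReporter may differ.
module RedisErrorReporter
  class << self
    attr_writer :logger

    def logger
      @logger ||= Logger.new($stderr)
    end

    # Logs one line in the shared format. Always returns nil and never raises.
    def report(error, context:, level: :warn, extra: {})
      message = "[#{context}] Redis error: #{error.class}: #{error.message}"
      message += " #{extra.inspect}" unless extra.empty?
      logger.public_send(level, message)
      nil
    rescue StandardError
      # If reporting itself fails, swallow that too.
      nil
    end
  end
end

# Stand-in for REDIS_POOL that fails on every call (assumption, for the demo).
class DownPool
  def with
    raise IOError, "connection refused"
  end
end

# Stand-in for a log sink that is itself broken (assumption, for the demo).
class BrokenLogger
  def warn(_message)
    raise "log sink unavailable"
  end
end

def bump(pool, key)
  pool.with { |r| r.incr(key) }
rescue StandardError => e
  RedisErrorReporter.report(e, context: "LiveCounter#bump",
                               level: :warn, extra: { key: key })
  nil
end

# Outage with a working logger: the caller gets nil and one log line is written.
log = StringIO.new
RedisErrorReporter.logger = Logger.new(log)
result = bump(DownPool.new, "views:42")

# Outage with a broken logger: the caller still gets nil.
RedisErrorReporter.logger = BrokenLogger.new
still_nil = bump(DownPool.new, "views:42")
```

The second scenario is the one worth keeping as a regression test: a failure inside the reporter must look exactly like a cache miss to the caller.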
The health check that almost lied
The second half — “never pull a healthy node” — is the part that is easy to get subtly, dangerously wrong.
The instinct is to make /health honest: Redis is down, so report unhealthy.
But think about what the load balancer does with that. Redis is shared. If a
Redis outage makes /health fail, it fails on every node at once. The load
balancer dutifully pulls all of them. A cache outage just became a total
outage — caused entirely by the health check.
So /health reports Redis, but keeps it out of the HTTP verdict:
| Situation | Body | HTTP |
|---|---|---|
| All good | status: "ok" | 200 |
| Redis down, app serving | status: "degraded" | 200 |
| App genuinely broken | status: "error" | 503 |
A degraded node is still serving shoppers, so it stays in rotation. The
dashboards still see degraded and still page someone — the operator finds
out, the shoppers don’t.
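The policy in the table fits in a few lines of code. This is a sketch, not the actual /health endpoint: `health_verdict`, `health_response`, the `app_ok` and `redis_ok` flags, and the extra `redis` field in the body are all hypothetical names, and how each flag gets computed is left out.

```ruby
require "json"

# Hypothetical value object pairing the body status with the HTTP code.
HealthVerdict = Struct.new(:status, :http_code)

# Redis state shapes the body, but only app health can produce a non-200.
def health_verdict(app_ok:, redis_ok:)
  return HealthVerdict.new("error", 503) unless app_ok
  return HealthVerdict.new("degraded", 200) unless redis_ok

  HealthVerdict.new("ok", 200)
end

# Rack-style response triple: [status code, headers, body parts].
def health_response(app_ok:, redis_ok:)
  verdict = health_verdict(app_ok: app_ok, redis_ok: redis_ok)
  body = JSON.generate(status: verdict.status, redis: redis_ok ? "up" : "down")
  [verdict.http_code, { "content-type" => "application/json" }, [body]]
end

code, _headers, body = health_response(app_ok: true, redis_ok: false)
puts code        # 200, so the load balancer keeps the node
puts body.first  # the body still says "degraded" for dashboards and alerts
```

Checking `app_ok` first is deliberate: a node that is broken reports `error` and 503 whether or not Redis is up, so a Redis outage can never hide a real failure.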
What I’d tell someone starting this
Fail-open for a cache is the easy decision. The work is consistency: one reporter, one message format, one explicit health-check policy. Done that way, “Redis is down” stops being an incident and becomes a known, tested, observable state. That is the whole goal — turn a scary failure into a boring one.