# Designing a fail-open Redis layer

*May 10, 2026*

Redis is everywhere in this platform. It backs the Rails cache, rate-limit counters, flash-sale inventory, and live counters on nearly every page. It is the reason a lot of things are fast.

Which is exactly the problem. A dependency that makes you fast is, by default, a dependency that takes you down. If every `REDIS_POOL.with` block can raise, then a Redis hiccup turns into a wall of 500s.

So the rule for this codebase became simple to state and harder to enforce:

> Redis is **fail-open**. When it is down, the platform degrades. It never
> 500s, and it never causes a healthy node to be pulled from the load balancer.

## Rescue every call, through one door

The first half is easy: rescue every raw Redis call. The trap is doing it *inconsistently*: a `logger.warn` here, a swallowed exception there, a `rescue nil` somewhere else. Six months later nobody can answer "how often is Redis failing?" because the failures are logged six different ways.

So every rescue routes through one object, `RedisErrorReporter`:

```ruby
begin
  REDIS_POOL.with { |r| r.incr(key) }
rescue => e
  # Every Redis failure goes through the single reporter.
  RedisErrorReporter.report(e, context: "LiveCounter#bump", level: :warn, extra: { key: key })
  nil # fail open: caller treats nil as "no cached value"
end
```

One reporter means one log format, `[Class#method] Redis error: ...`, which means one alert rule and one greppable string. The reporter is itself fail-safe: if reporting the error somehow raises, that is swallowed too. The thing that tells you Redis is broken must not break when Redis is broken.

## The health check that almost lied

The second half, "never pull a healthy node", is the part that is easy to get subtly, dangerously wrong. The instinct is to make `/health` honest: Redis is down, so report unhealthy.

But think about what the load balancer does with that. Redis is *shared*. If a Redis outage makes `/health` fail, it fails on **every** node at once.
The load balancer dutifully pulls all of them. A cache outage just became a total outage, caused entirely by the health check.

So `/health` reports Redis, but keeps it out of the HTTP verdict:

| Situation | Body | HTTP |
|-----------|------|------|
| All good | `status: "ok"` | `200` |
| Redis down, app serving | `status: "degraded"` | **`200`** |
| App genuinely broken | `status: "error"` | `503` |

A degraded node is still serving shoppers, so it stays in rotation. The dashboards still see `degraded` and still page someone: the operator finds out, the shoppers don't.

## What I'd tell someone starting this

Fail-open for a cache is the easy *decision*. The work is consistency: one reporter, one message format, one explicit health-check policy. Done that way, "Redis is down" stops being an incident and becomes a known, tested, observable state. That is the whole goal: turn a scary failure into a boring one.
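
For anyone who wants something concrete to start from, here is a minimal sketch of the two pieces: a reporter that cannot raise, and a health verdict that keeps Redis out of the HTTP status. The real `RedisErrorReporter` is not shown in this post, so the class body below is an assumption built only from the call signature and log format above. `health_verdict` and its `app_ok:` / `redis_ok:` arguments are hypothetical names chosen for illustration.

```ruby
require "logger"

# Hypothetical sketch; only the `report` signature and the
# "[Class#method] Redis error: ..." format come from the post.
class RedisErrorReporter
  class << self
    attr_writer :logger

    def logger
      @logger ||= Logger.new($stderr)
    end

    # Logs one line in the shared format. Always returns nil, never raises.
    def report(error, context:, level: :warn, extra: {})
      message = "[#{context}] Redis error: #{error.class}: #{error.message}"
      message += " #{extra.inspect}" unless extra.empty?
      logger.public_send(level, message)
      nil
    rescue StandardError
      nil # a failure while reporting is swallowed too
    end
  end
end

# Hypothetical policy helper mirroring the table above: Redis state
# changes the body, but only a broken app changes the HTTP status.
def health_verdict(app_ok:, redis_ok:)
  return { status: "error", http: 503 } unless app_ok
  return { status: "degraded", http: 200 } unless redis_ok

  { status: "ok", http: 200 }
end
```

Keeping the verdict in one small pure function makes the policy explicit and testable without a live Redis: the three rows of the table become three assertions.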