# Designing a fail-open Redis layer

*May 10, 2026*

Redis is everywhere in this platform. It backs the Rails cache, rate-limit
counters, flash-sale inventory, and live counters on nearly every page. It is
the reason a lot of things are fast.

Which is exactly the problem. A dependency that makes you fast is, by default,
a dependency that takes you down. If every `REDIS_POOL.with` block can raise,
then a Redis hiccup turns into a wall of 500s. So the rule for this codebase
became simple to state and harder to enforce:

> Redis is **fail-open**. When it is down, the platform degrades. It never
> 500s, and it never causes a healthy node to be pulled from the load balancer.

## Rescue every call — through one door

The first half is easy: rescue every raw Redis call. The trap is doing it
*inconsistently* — a `logger.warn` here, a swallowed exception there, a
`rescue nil` somewhere else. Six months later nobody can answer "how often is
Redis failing?" because the failures are logged six different ways.

So every rescue routes through one object, `RedisErrorReporter`:

```ruby
begin
  REDIS_POOL.with { |r| r.incr(key) }
rescue => e
  # Every Redis failure goes through the single reporter.
  RedisErrorReporter.report(e, context: "LiveCounter#bump",
                            level: :warn, extra: { key: key })
  nil   # fail open: caller treats nil as "no cached value"
end
```

One reporter means one log format — `[Class#method] Redis error: ...` — which
means one alert rule and one greppable string. The reporter is itself
fail-safe: if reporting the error somehow raises, that is swallowed too. The
thing that tells you Redis is broken must not break when Redis is broken.
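
The reporter's implementation is not shown here, so the following is only a minimal sketch of what a fail-safe reporter along these lines might look like. The `report` signature mirrors the call above; the module layout, the injectable `logger`, and the way `extra` is appended are assumptions for illustration.

```ruby
require "logger"

# Hypothetical sketch, not the platform's actual RedisErrorReporter.
module RedisErrorReporter
  class << self
    # Assumption: the logger is injectable so tests can capture output.
    attr_writer :logger

    def logger
      @logger ||= Logger.new($stderr)
    end

    # Logs a Redis failure in one fixed format and always returns nil.
    def report(error, context:, level: :warn, extra: {})
      message = "[#{context}] Redis error: #{error.class}: #{error.message}"
      message += " #{extra.inspect}" unless extra.empty?
      logger.public_send(level, message)
      nil
    rescue StandardError
      # Reporting must never raise, even if the logger itself is broken
      # or an unknown level was passed.
      nil
    end
  end
end
```

Because `report` always returns `nil`, a caller could also end its rescue clause with the `report` call itself and still fail open.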

## The health check that almost lied

The second half — "never pull a healthy node" — is the part that is easy to
get subtly, dangerously wrong.

The instinct is to make `/health` honest: Redis is down, so report unhealthy.
But think about what the load balancer does with that. Redis is *shared*. If a
Redis outage makes `/health` fail, it fails on **every** node at once. The load
balancer dutifully pulls all of them. A cache outage just became a total
outage — caused entirely by the health check.

So `/health` reports Redis, but keeps it out of the HTTP verdict:

| Situation | Body | HTTP |
|-----------|------|------|
| All good | `status: "ok"` | `200` |
| Redis down, app serving | `status: "degraded"` | **`200`** |
| App genuinely broken | `status: "error"` | `503` |

A degraded node is still serving shoppers, so it stays in rotation. The
dashboards read the response body, so they still see `degraded` and still page
someone: the operator finds out, the shoppers don't.
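
The table above can be sketched as a small pure function. This is a hedged illustration, not the platform's actual endpoint: `HealthVerdict`, its `app_ok:` and `redis_ok:` parameters, and the `redis_ok?` probe are hypothetical names, and the pool is assumed to expose `with` the same way `REDIS_POOL` does.

```ruby
# Hypothetical sketch of the /health policy; names are assumptions.
module HealthVerdict
  # Returns [http_status, body]. Redis state changes the body only,
  # never the HTTP status, so a shared Redis outage cannot pull every node.
  def self.call(app_ok:, redis_ok:)
    return [503, { status: "error" }] unless app_ok
    return [200, { status: "degraded" }] unless redis_ok

    [200, { status: "ok" }]
  end

  # Probes Redis through a pool that responds to `with`.
  # Any failure counts as "Redis down" rather than raising.
  def self.redis_ok?(pool)
    pool.with { |r| r.ping }
    true
  rescue StandardError
    false
  end
end
```

In a Rails controller, the pair could then be passed to something like `render json: body, status: status`, keeping the policy itself free of framework code and easy to test.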

## What I'd tell someone starting this

Fail-open for a cache is the easy *decision*. The work is consistency: one
reporter, one message format, one explicit health-check policy. Done that way,
"Redis is down" stops being an incident and becomes a known, tested,
observable state. That is the whole goal — turn a scary failure into a boring
one.