# Designing a fail-open Redis layer

*May 10, 2026*

Redis is everywhere in this platform. It backs the Rails cache, rate-limit
counters, flash-sale inventory, and live counters on nearly every page. It is
the reason a lot of things are fast.

Which is exactly the problem. A dependency that makes you fast is, by default,
a dependency that takes you down. If every `REDIS_POOL.with` block can raise,
then a Redis hiccup turns into a wall of 500s. So the rule for this codebase
became simple to state and harder to enforce:

> Redis is **fail-open**. When it is down, the platform degrades. It never
> 500s, and it never causes a healthy node to be pulled from the load balancer.

## Rescue every call — through one door

The first half is easy: rescue every raw Redis call. The trap is doing it
*inconsistently* — a `logger.warn` here, a swallowed exception there, a
`rescue nil` somewhere else. Six months later nobody can answer "how often is
Redis failing?" because the failures are logged six different ways.

So every rescue routes through one object, `RedisErrorReporter`:

```ruby
begin
  REDIS_POOL.with { |r| r.incr(key) }
rescue => e
  RedisErrorReporter.report(e, context: "LiveCounter#bump",
                            level: :warn, extra: { key: key })
  nil   # fail open: caller treats nil as "no cached value"
end
```

One reporter means one log format — `[Class#method] Redis error: ...` — which
means one alert rule and one greppable string. The reporter is itself
fail-safe: if reporting the error somehow raises, that is swallowed too. The
thing that tells you Redis is broken must not break when Redis is broken.
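
The reporter's internals are not shown in this post, so the following is only a minimal sketch of what such an object could look like, assuming the `report(e, context:, level:, extra:)` signature from the call site above. The swappable `logger` accessor and the exact message layout after `Redis error:` are assumptions made so the sketch runs outside Rails.

```ruby
require "logger"

# Hypothetical sketch of a single-door reporter; not the platform's actual class.
module RedisErrorReporter
  class << self
    # Assumption: the logger is injectable so the sketch works without Rails.
    attr_writer :logger

    def logger
      @logger ||= Logger.new($stderr)
    end

    # Emits one line shaped like "[Class#method] Redis error: ...".
    # Always returns nil and never raises, so callers can fail open.
    def report(error, context:, level: :warn, extra: {})
      message = "[#{context}] Redis error: #{error.class}: #{error.message}"
      message += " #{extra.inspect}" unless extra.empty?
      logger.public_send(level, message)
      nil
    rescue StandardError
      # Reporting itself failed (bad level, broken logger); swallow it.
      nil
    end
  end
end
```

Because `report` returns `nil` on every path, a rescue block that ends with the `report` call already hands the caller the fail-open value.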

## The health check that almost lied

The second half — "never pull a healthy node" — is the part that is easy to
get subtly, dangerously wrong.

The instinct is to make `/health` honest: Redis is down, so report unhealthy.
But think about what the load balancer does with that. Redis is *shared*. If a
Redis outage makes `/health` fail, it fails on **every** node at once. The load
balancer dutifully pulls all of them. A cache outage just became a total
outage — caused entirely by the health check.

So `/health` reports Redis, but keeps it out of the HTTP verdict:

| Situation | Body | HTTP |
|-----------|------|------|
| All good | `status: "ok"` | `200` |
| Redis down, app serving | `status: "degraded"` | **`200`** |
| App genuinely broken | `status: "error"` | `503` |

A degraded node is still serving shoppers, so it stays in rotation. The
dashboards still see `degraded` and still page someone — the operator finds
out, the shoppers don't.
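
The policy in the table can be sketched as a small pure function. `HealthVerdict`, `probe`, and the `app_ok:` / `redis_ok:` parameters are hypothetical names chosen for illustration; how the real `/health` endpoint decides the app is "genuinely broken" is not covered here.

```ruby
# Hypothetical helper: maps probe results to the body status and HTTP code
# from the table above. Redis can change the body, never the status code.
module HealthVerdict
  module_function

  # Runs a check block and turns any exception into `false`,
  # so a failing dependency cannot crash the health endpoint itself.
  def probe
    !!yield
  rescue StandardError
    false
  end

  def call(app_ok:, redis_ok:)
    return { status: "error", http: 503 } unless app_ok
    return { status: "degraded", http: 200 } unless redis_ok

    { status: "ok", http: 200 }
  end
end
```

A controller could then build the response with something like `HealthVerdict.call(app_ok: true, redis_ok: HealthVerdict.probe { REDIS_POOL.with { |r| r.ping } })`, rendering `status` in the body and `http` as the response code. Keeping the decision in one function makes the policy easy to unit test without a live Redis.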

## What I'd tell someone starting this

Fail-open for a cache is the easy *decision*. The work is consistency: one
reporter, one message format, one explicit health-check policy. Done that way,
"Redis is down" stops being an incident and becomes a known, tested,
observable state. That is the whole goal — turn a scary failure into a boring
one.
