# Redis Resilience

Redis makes the platform fast — it backs caching, rate limiting, flash-sale
inventory, and live counters. But a dependency that makes you fast must not be
a dependency that takes you down. So Redis here is **fail-open everywhere**:
when it is unavailable, the platform degrades; it never returns a 500.

## The rule

> A down Redis degrades gracefully. It never 500s, and it never causes a
> healthy node to be pulled from the load balancer.

This applies to every Redis touchpoint — the raw `REDIS_POOL` connection pool,
the Rails cache store, and the `RACK_ATTACK_CACHE` behind rate limiting.

## Every Redis call is rescued

Raw `REDIS_POOL.with { ... }` blocks never let a connection error escape.
Each rescue routes through a single reporter:

```ruby
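# Inside LiveCounter: the method body is the implicit begin block for the rescue.
def bump(key)
  REDIS_POOL.with { |r| r.incr(key) }
rescue StandardError => e
  RedisErrorReporter.report(e, context: "LiveCounter#bump",
                            level: :warn, extra: { key: key })
  nil   # fail open: caller treats this as "no cached value"
end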
```

`RedisErrorReporter` is the *one* path for Redis failures. It logs with a
consistent message — `[Class#method] Redis error: <class>: <msg>` — and, in
production only, reports to Sentry. The reporter is itself fail-safe: if
*reporting* the error fails, that is swallowed too. There are no bare
`logger.warn` rescues scattered around — consistency is the point, because it
makes Redis failures greppable and alertable.
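
To make the reporter's contract concrete, here is a minimal sketch of what such a class could look like. It is not the project's actual `RedisErrorReporter`: the class name `RedisErrorReporterSketch`, the injected `logger`, the `production` flag, and the `notifier` callable (a stand-in for the Sentry client) are all assumptions made so the example runs on its own.

```ruby
require "logger"
require "stringio"

# Hypothetical sketch of a single Redis failure reporter. The logger, the
# production flag, and the notifier are injected here for illustration; the
# real class presumably reads them from the application environment.
class RedisErrorReporterSketch
  def initialize(logger:, production: false, notifier: nil)
    @logger = logger
    @production = production
    @notifier = notifier # assumed: a callable wrapping the error tracker
  end

  def report(error, context:, level: :warn, extra: {})
    # One message format for every Redis failure keeps the logs greppable.
    @logger.public_send(level, "[#{context}] Redis error: #{error.class}: #{error.message}")
    # Only notify the error tracker in production.
    @notifier.call(error, extra.merge(context: context)) if @production && @notifier
    nil
  rescue StandardError
    # Reporting must never raise into the caller's fail-open path.
    nil
  end
end

log = StringIO.new
reporter = RedisErrorReporterSketch.new(logger: Logger.new(log))
reporter.report(IOError.new("connection refused"),
                context: "LiveCounter#bump", extra: { key: "views:42" })
puts log.string
```

The `rescue` at the method level is what makes the reporter fail-safe: a broken logger or an unreachable error tracker yields `nil`, the same as a successful report.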

## The health check that doesn't lie

The subtle danger is the health check. If `/health` reported the node
unhealthy whenever Redis was down, a Redis outage would make the load balancer
pull **every** node at once — turning a cache outage into a total outage.

So `HealthController#check` reports Redis status but **keeps it out of the HTTP
verdict**:

| Situation | `/health` body | HTTP status |
|-----------|----------------|-------------|
| Everything up | `status: "ok"` | `200` |
| Redis down, app serving | `status: "degraded"` | **`200`** |
| App genuinely broken | `status: "error"` | `503` |

A degraded node is still a *serving* node, so the load balancer keeps it in
rotation. `/up` stays a pure boot check. The dashboards still see `degraded`
and can alert — the operator finds out, the shoppers don't.
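
The table above can be sketched as a small pure function. This is an illustration, not the real `HealthController#check`: the helper names `redis_up?` and `health_verdict`, and the idea that the app and Redis probes reduce to two booleans, are assumptions.

```ruby
# Hypothetical probe wrapper: any error from the block counts as "Redis down".
def redis_up?
  yield
  true
rescue StandardError
  false
end

# Hypothetical mapping from probe results to the /health body and HTTP status.
# Redis state changes the body but never the status code.
def health_verdict(app_ok:, redis_ok:)
  return { status: "error", http_status: 503 } unless app_ok
  return { status: "degraded", http_status: 200 } unless redis_ok

  { status: "ok", http_status: 200 }
end

# Simulate a Redis outage; in the app the block would ping through REDIS_POOL.
redis_ok = redis_up? { raise IOError, "connection refused" }
verdict = health_verdict(app_ok: true, redis_ok: redis_ok)
puts "#{verdict[:status]} #{verdict[:http_status]}" # prints "degraded 200"
```

Because Redis can only move the verdict from `ok` to `degraded`, a Redis outage on its own cannot produce a `503`.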

## Why fail-open is the right default

For a cache, fail-open is almost always correct: the worst case is a slower
request, which is strictly better than a failed one. The work was in making it
*consistent* — one reporter, one message format, one health-check policy — so
that "Redis is down" is a known, tested, observable state instead of a
surprise.
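
The "slower, not failed" trade-off can be shown with a small sketch of a fail-open cache read. Everything here is hypothetical: `DownRedis` simulates an outage, `FakeRedis` simulates a warm cache, and `fetch_fail_open` is an illustrative helper, not a method from the codebase.

```ruby
# Stand-in for a Redis client during an outage: every call raises.
class DownRedis
  def get(_key)
    raise IOError, "connection refused"
  end
end

# Stand-in for a healthy Redis client backed by a plain Hash.
class FakeRedis
  def initialize(data)
    @data = data
  end

  def get(key)
    @data[key]
  end
end

# Read through the cache; on any Redis error, behave as if it were a miss
# and fall back to the block (the slower source of truth).
def fetch_fail_open(redis, key)
  cached = begin
    redis.get(key)
  rescue StandardError
    nil # fail open: an outage looks like a cache miss
  end
  cached || yield
end

value = fetch_fail_open(DownRedis.new, "product:42:name") { "loaded from the database" }
puts value # prints "loaded from the database"
```

In the real code the rescue would also route through `RedisErrorReporter`, so the outage is visible to operators even though the request succeeds.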

## Key files

| Concern | Files |
|---------|-------|
| Error reporting | `RedisErrorReporter` |
| Connection pool | `config/initializers/cache_stores.rb` (`REDIS_POOL`) |
| Health | `HealthController` (`/health`, `/up`) |
| Consumers | `LiveCounter`, `RedisCounterWriter`, `FlashSaleInventoryService`, rate limiting |
