Redis Resilience
Redis makes the platform fast — it backs caching, rate limiting, flash-sale inventory, and live counters. But a dependency that makes you fast must not be a dependency that takes you down. So Redis here is fail-open everywhere: when it is unavailable, the platform degrades; it never returns a 500.
The rule
A down Redis degrades gracefully. It never 500s, and it never causes a healthy node to be pulled from the load balancer.
This applies to every Redis touchpoint — the raw REDIS_POOL connection pool,
the Rails cache store, and the RACK_ATTACK_CACHE behind rate limiting.
Every Redis call is rescued
Raw REDIS_POOL.with { ... } blocks never let a connection error escape.
Each rescue routes through a single reporter:
begin
  REDIS_POOL.with { |r| r.incr(key) }
rescue => e
  RedisErrorReporter.report(e, context: "LiveCounter#bump",
                               level: :warn, extra: { key: key })
  nil # fail open: the caller treats nil as "no cached value"
end
RedisErrorReporter is the one path for Redis failures. It logs with a
consistent message — [Class#method] Redis error: <class>: <msg> — and, in
production only, reports to Sentry. The reporter is itself fail-safe: if
reporting the error fails, that is swallowed too. There are no bare
logger.warn rescues scattered around — consistency is the point, because it
makes Redis failures greppable and alertable.
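To make the reporter's contract concrete, here is a minimal sketch of what such a module could look like. It is not the project's actual RedisErrorReporter: the injectable `logger` and `production` settings are assumptions added so the sketch runs outside Rails, and the `Sentry.capture_exception` call assumes the sentry-ruby gem is loaded in production.

```ruby
require "logger"

# Illustrative sketch only; the real RedisErrorReporter is not shown here.
module RedisErrorReporter
  class << self
    # Assumption: these writers exist only so the sketch runs without Rails.
    # A Rails app would more likely use Rails.logger and Rails.env.production?.
    attr_writer :logger, :production

    def logger
      @logger ||= Logger.new($stderr)
    end

    def production?
      @production.nil? ? ENV["RAILS_ENV"] == "production" : @production
    end

    # Logs "[context] Redis error: <class>: <msg>" at the given level and,
    # in production only, forwards the exception to Sentry.
    # Always returns nil and never raises.
    def report(error, context:, level: :warn, extra: {})
      logger.public_send(level, "[#{context}] Redis error: #{error.class}: #{error.message}")
      if production? && defined?(Sentry)
        Sentry.capture_exception(error, extra: extra.merge(context: context))
      end
      nil
    rescue StandardError
      nil # a failure while reporting must not become a second failure
    end
  end
end
```

Because `report` returns nil on every path, a rescue block can end with the call itself and still fail open.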
The health check that doesn’t lie
The subtle danger is the health check. If /health reported the node
unhealthy whenever Redis was down, a Redis outage would make the load balancer
pull every node at once — turning a cache outage into a total outage.
So HealthController#check reports Redis status but keeps it out of the HTTP
verdict:
| Situation | /health body | HTTP status |
|---|---|---|
| Everything up | status: "ok" | 200 |
| Redis down, app serving | status: "degraded" | 200 |
| App genuinely broken | status: "error" | 503 |
A degraded node is still a serving node, so the load balancer keeps it in
rotation. /up stays a pure boot check. The dashboards still see degraded
and can alert — the operator finds out, the shoppers don’t.
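The policy in the table can be sketched as a small pure function, kept separate from the controller so it is easy to test. Everything here is hypothetical: the `HealthVerdict` module, the `redis` key in the body, and the use of `ping` as the probe are assumptions for illustration, not the actual HealthController code.

```ruby
# Illustrative sketch only; names below are assumptions, not project code.
module HealthVerdict
  module_function

  # Probes Redis through a pool that responds to `with`, as REDIS_POOL does.
  # Any error counts as "down" and never escapes.
  def redis_up?(pool)
    pool.with { |r| r.ping == "PONG" }
  rescue StandardError
    false
  end

  # Maps component state to [http_status, body].
  # Redis state changes the body but never the HTTP status.
  def call(app_ok:, redis_ok:)
    return [503, { status: "error" }] unless app_ok
    return [200, { status: "degraded", redis: "down" }] unless redis_ok

    [200, { status: "ok", redis: "up" }]
  end
end
```

A controller action could then call `HealthVerdict.call(app_ok: true, redis_ok: HealthVerdict.redis_up?(REDIS_POOL))` and render the body with the returned status, so the load balancer only ever sees a 503 when the app itself is broken.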
Why fail-open is the right default
For a cache, fail-open is almost always correct: the worst case is a slower request, which is strictly better than a failed one. The work was in making it consistent — one reporter, one message format, one health-check policy — so that “Redis is down” is a known, tested, observable state instead of a surprise.
Key files
| Concern | Files |
|---|---|
| Error reporting | RedisErrorReporter |
| Connection pool | config/initializers/cache_stores.rb (REDIS_POOL) |
| Health | HealthController (/health, /up) |
| Consumers | LiveCounter, RedisCounterWriter, FlashSaleInventoryService, rate limiting |