# Redis Resilience

Redis makes the platform fast: it backs caching, rate limiting, flash-sale inventory, and live counters. But a dependency that makes you fast must not be a dependency that takes you down. So Redis here is **fail-open everywhere**: when it is unavailable, the platform degrades; it never returns a 500.

## The rule

> A down Redis degrades gracefully. It never 500s, and it never causes a
> healthy node to be pulled from the load balancer.

This applies to every Redis touchpoint: the raw `REDIS_POOL` connection pool, the Rails cache store, and the `RACK_ATTACK_CACHE` behind rate limiting.

## Every Redis call is rescued

Raw `REDIS_POOL.with { ... }` blocks never let a connection error escape. Each rescue routes through a single reporter:

```ruby
begin
  REDIS_POOL.with { |r| r.incr(key) }
rescue => e
  # Route every Redis failure through the single reporter.
  RedisErrorReporter.report(e, context: "LiveCounter#bump", level: :warn, extra: { key: key })
  nil # fail open: caller treats this as "no cached value"
end
```

`RedisErrorReporter` is the *one* path for Redis failures. It logs with a consistent message, `[Class#method] Redis error: : `, and, in production only, reports to Sentry. The reporter is itself fail-safe: if *reporting* the error fails, that is swallowed too. There are no bare `logger.warn` rescues scattered around. Consistency is the point, because it makes Redis failures greppable and alertable.

## The health check that doesn't lie

The subtle danger is the health check. If `/health` reported the node unhealthy whenever Redis was down, a Redis outage would make the load balancer pull **every** node at once, turning a cache outage into a total outage.

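
To make that failure mode concrete, the sketch below models three nodes during a shared Redis outage and compares two health policies. Everything in it (the node hashes, `naive_status`, `fail_open_status`, `in_rotation`) is hypothetical and only illustrates the reasoning; it is not the platform's health-check code.

```ruby
# Naive policy (hypothetical): any failed dependency marks the node unhealthy.
def naive_status(node)
  node[:app_ok] && node[:redis_ok] ? 200 : 503
end

# Fail-open policy (hypothetical): only the app's own state decides the status.
def fail_open_status(node)
  node[:app_ok] ? 200 : 503
end

# Model of a load balancer that keeps only nodes answering 200.
# The block maps a node to the HTTP status its health endpoint would return.
def in_rotation(nodes)
  nodes.select { |node| yield(node) == 200 }.map { |node| node[:name] }
end

# Shared Redis outage: every app process is fine, but none can reach Redis.
nodes = %w[web-1 web-2 web-3].map do |name|
  { name: name, app_ok: true, redis_ok: false }
end

naive_rotation = in_rotation(nodes) { |node| naive_status(node) }
fail_open_rotation = in_rotation(nodes) { |node| fail_open_status(node) }

puts "naive policy keeps:     #{naive_rotation.inspect}"
puts "fail-open policy keeps: #{fail_open_rotation.inspect}"
```

Under the naive policy the rotation is empty even though every app process is healthy; under the fail-open policy all three nodes keep serving.
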
So `HealthController#check` reports Redis status but **keeps it out of the HTTP verdict**:

| Situation | `/health` body | HTTP status |
|-----------|----------------|-------------|
| Everything up | `status: "ok"` | `200` |
| Redis down, app serving | `status: "degraded"` | **`200`** |
| App genuinely broken | `status: "error"` | `503` |

A degraded node is still a *serving* node, so the load balancer keeps it in rotation. `/up` stays a pure boot check. The dashboards still see `degraded` and can alert: the operator finds out, the shoppers don't.

## Why fail-open is the right default

For a cache, fail-open is almost always correct: the worst case is a slower request, which is strictly better than a failed one. The work was in making it *consistent* (one reporter, one message format, one health-check policy) so that "Redis is down" is a known, tested, observable state instead of a surprise.

## Key files

| Concern | Files |
|---------|-------|
| Error reporting | `RedisErrorReporter` |
| Connection pool | `config/initializers/cache_stores.rb` (`REDIS_POOL`) |
| Health | `HealthController` (`/health`, `/up`) |
| Consumers | `LiveCounter`, `RedisCounterWriter`, `FlashSaleInventoryService`, rate limiting |
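
## Sketch: a fail-safe reporter

The reporter's source is not reproduced in this document, so the following is only a minimal sketch of the two properties described under "Every Redis call is rescued": one reporting path, and reporting that cannot itself raise. The names `report_redis_error` and `fail_open`, the injected `logger` and `notifier` arguments (the latter standing in for the Sentry client), and the placement of the error class and message after `Redis error:` are all assumptions made for illustration.

```ruby
require "logger"
require "stringio"

# Hypothetical stand-in for RedisErrorReporter.report; the real method is not shown here.
# logger:   any object that responds to the Logger level methods (:warn, :error, ...)
# notifier: optional callable standing in for the Sentry client (assumption)
def report_redis_error(error, context:, logger:, level: :warn, extra: {}, notifier: nil)
  # Assumed message layout: "[Class#method] Redis error: <error class>: <message>"
  logger.public_send(level, "[#{context}] Redis error: #{error.class}: #{error.message}")
  notifier&.call(error, extra.merge(context: context))
  nil
rescue StandardError
  # Reporting must never raise: a broken logger or notifier is swallowed too.
  nil
end

# Fail-open wrapper: run the block, report any error, and return nil instead of raising.
def fail_open(context:, logger:, notifier: nil, extra: {})
  yield
rescue StandardError => e
  report_redis_error(e, context: context, logger: logger, extra: extra, notifier: notifier)
  nil
end

# Capture log output in memory so the example has no side effects.
log_output = StringIO.new
logger = Logger.new(log_output)

# Simulate a Redis outage with a block that raises a connection-style error.
result = fail_open(context: "LiveCounter#bump", logger: logger, extra: { key: "views" }) do
  raise IOError, "connection refused"
end

puts result.inspect   # the caller sees nil, not an exception
puts log_output.string
```

Passing `notifier: nil` models the non-production case, where only the log line is written. Because the wrapper returns `nil` on failure, callers have to treat `nil` as "no cached value" rather than as an error.
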