


How DoorDash Rearchitected its Cache to Improve Scalability and Performance


DoorDash rearchitected the heterogeneous caching system it was using across all of its microservices, creating a common, multi-layered cache that provides a generic mechanism and solves a number of issues caused by fragmented caching.

Caching is a common mechanism to improve a system's performance without resorting to expensive optimizations. This was especially relevant in DoorDash's case, since implementing business logic has higher priority than performance optimization, explain DoorDash engineers Lev Neiman and Jason Fan.

Unfortunately, different teams at DoorDash relied on different caching systems, including Caffeine, Redis Lettuce, and HashMaps, which also meant they were experiencing and solving the same issues over and over, such as cache staleness, heavy dependency on Redis, inconsistent key schemas, and more. For this reason, a team of engineers set out to create a shared caching library for all of DoorDash's microservices, starting with DashPass, a key service that was experiencing scaling challenges and frequent failures due to increasing traffic levels.

The first step was defining a common API based on two Kotlin interfaces: CacheManager, which creates a new cache for a specific key type and fallback method, and CacheKey, a class abstracting over key types.
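A minimal sketch of what such an API could look like is shown below. Only the CacheManager and CacheKey names come from DoorDash's write-up; the fields, method signatures, and use of coroutines are assumptions for illustration.

```kotlin
// Hypothetical sketch of the shared cache API; not DoorDash's actual code.

// Abstracts over key types: each key knows which logical cache it belongs to
// and how to render itself into a consistent key schema.
abstract class CacheKey<V>(
    val cacheName: String,
    val id: String,
) {
    fun toCacheKeyString(): String = "$cacheName:$id"
}

// Serves reads: on a miss, the fallback fetches the value from the source of
// truth, and the result is stored in the cache before being returned.
interface CacheManager {
    suspend fun <V> withCache(key: CacheKey<V>, fallback: suspend () -> V): V
}
```

Because business logic only ever talks to these interfaces, dependency injection and polymorphism can swap in arbitrary caching behavior behind the scenes while cache calls stay uniform.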

This allows us to use dependency injection and polymorphism to inject arbitrary logic behind the scenes while maintaining uniform cache calls from business logic.

While striving to keep the cache simple, DoorDash engineers opted for a multi-layered design with three layers to push the possibility of performance optimization further. In the first layer, named the request local cache, data resides in a hash map whose lifetime is bound to the request's. In the second layer, the local cache, Caffeine is used to share data among all workers within the same Java virtual machine. The third layer is the Redis cache, visible to all pods in the same Redis cluster and accessed through Redis Lettuce.
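A read through the three layers could be sketched as follows. The class and method names, the Redis accessors, and the TTL value are all assumptions; only the layering (request-local hash map, shared Caffeine cache, Redis) comes from the article.

```kotlin
// Illustrative read path across the three cache layers described above.
import com.github.benmanes.caffeine.cache.Caffeine
import java.util.concurrent.TimeUnit

class LayeredCache<V : Any>(
    private val redisGet: (String) -> V?,      // layer 3: Redis via Lettuce (stub)
    private val redisSet: (String, V) -> Unit,
) {
    // Layer 1: request-local hash map, created per request and discarded with it.
    private val requestLocal = HashMap<String, V>()

    // Layer 2: in-process Caffeine cache shared by all workers in the JVM.
    private val local = Caffeine.newBuilder()
        .expireAfterWrite(1, TimeUnit.MINUTES) // TTL chosen arbitrarily here
        .build<String, V>()

    fun get(key: String, fallback: () -> V): V {
        requestLocal[key]?.let { return it }
        local.getIfPresent(key)?.let { requestLocal[key] = it; return it }
        redisGet(key)?.let { requestLocal[key] = it; local.put(key, it); return it }
        val value = fallback()                 // miss everywhere: source of truth
        requestLocal[key] = value
        local.put(key, value)
        redisSet(key, value)
        return value
    }
}
```

Each layer backfills the layers above it on a hit, so subsequent reads within the same request or JVM avoid the network round trip entirely.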

An important feature of this multi-layered cache system is runtime control, available for each separate layer, to turn the cache on or off, set the cache time to live (TTL), or enable shadow mode, in which a given percentage of cache reads is also compared against the source of truth. Additionally, the cache system includes support for logging and for metrics collection, including hits, misses, and cache latency.
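The per-layer runtime controls and shadow mode could be sketched along these lines. All names and the sampling mechanism are hypothetical; the article only states that each layer can be toggled, given a TTL, and shadowed at a configurable percentage.

```kotlin
// Hypothetical per-layer runtime configuration; field names are illustrative.
data class LayerConfig(
    val enabled: Boolean,          // turn this layer on or off at runtime
    val ttlSeconds: Long,          // time to live for entries in this layer
    val shadowPercentage: Double,  // fraction of reads also checked against the source of truth
)

// In shadow mode, a sampled fraction of cached reads is re-fetched from the
// source of truth and mismatches (i.e. stale entries) are reported as metrics.
fun <V> readWithShadow(
    config: LayerConfig,
    cached: V,
    fetchSourceOfTruth: () -> V,
    reportMismatch: (cached: V, truth: V) -> Unit,
): V {
    if (Math.random() < config.shadowPercentage) {
        val truth = fetchSourceOfTruth()
        if (truth != cached) reportMismatch(cached, truth)
    }
    return cached   // the cached value is served either way
}
```

Shadow mode of this kind lets a team measure cache staleness in production before trusting a layer, without affecting what callers actually receive.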

Once the cache system was ready and working as expected for DashPass, it was progressively rolled out to the rest of the organization, with clear guidance about when and how to use it or not to use it.

According to Neiman and Fan, the new cache system improved scalability and safety across all of their services while also making it simple for teams to adopt a cache when necessary to improve performance.
