Specs: 22 M subs, 12 Tbps @ peak, 10 ms max jitter, 20 ms to first byte
Product Vision & Solution
Low Cost, Short Time To Market
Solution: COTS hardware running open-source software
The legacy video CDN was built on proprietary hardware and software from blue-chip companies. It was expensive and took 10 years to deploy. Commercial Off The Shelf (COTS) hardware would keep costs down while scaling with Moore’s law, and open-source software would give us a large community to support it.
We used Apache Traffic Server, RIAK KV, a small amount of custom code, and an ingenious way to keep things simple.
Seamless Elasticity
We needed the ability to add servers, upgrade drives, and add new clusters of caches frequently.
Solution: Client-side Routing + Distributed Topology Using RIAK
We reduced the configuration to a very small set of parameters. Each server advertised a set of tuples:
Server Configuration = array of { server-name, volume name, number of 128 GB buckets }
We distributed these configurations automatically using RIAK KV. A client could query ANY server to download the topology. It would hash the filename of the requested content and use the hash to pick the right { server-name, volume-name } pair.
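A minimal sketch of this client-side routing, assuming the tuple shape above (server names, volume names, and the choice of MD5 are illustrative):

```python
import hashlib

# Hypothetical topology in the advertised shape:
# { server-name, volume-name, number of 128 GB buckets }
TOPOLOGY = [
    {"server": "cache-01", "volume": "vol-a", "buckets": 8},
    {"server": "cache-02", "volume": "vol-a", "buckets": 8},
    {"server": "cache-03", "volume": "vol-b", "buckets": 16},
]

def route(filename):
    """Hash the filename onto the advertised buckets so that every
    client independently picks the same { server-name, volume-name } pair."""
    # One slot per 128 GB bucket, so capacity weights the distribution.
    slots = [(e["server"], e["volume"])
             for e in TOPOLOGY for _ in range(e["buckets"])]
    h = int(hashlib.md5(filename.encode()).hexdigest(), 16)
    return slots[h % len(slots)]
```

Any client with the same topology computes the same answer, so no routing tier sits in the request path; a production version would use consistent hashing so that adding a server remaps only a small fraction of the content.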
Fault Tolerant to ANY number of failed servers
Since all content was available in one or more origins, we could withstand any number of failed caches. Capacity would be impacted, of course, but the number of servers deployed reflects peak Mother’s Day demand, so we could absorb some failures. Multiple simultaneous failures are extremely rare.
Solution: Uniform RESTful interfaces
If a cache failed, the client could always get the content from the origin: because the southbound interfaces of the ATS cache and the origin are the same, a client could fail over to the origin if needed. Using consistent hashing, the client could also get the content from another ATS node.
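One way to sketch that failover order is a consistent-hash ring with the origin as the fallback of last resort (node names and the replica count are illustrative assumptions):

```python
import hashlib

ORIGIN = "origin.example.net"  # hypothetical origin host

def _h(s):
    """Map a string to an integer position on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, replicas=64):
    """Place each cache node at `replicas` points on the hash ring."""
    return sorted((_h(f"{n}:{i}"), n) for n in nodes for i in range(replicas))

def failover_order(ring, filename):
    """Walk the ring clockwise from the filename's hash, yielding each
    distinct cache once; the origin always terminates the list."""
    start = _h(filename)
    idx = next((i for i, (p, _) in enumerate(ring) if p >= start), 0)
    order, seen = [], set()
    for _, node in ring[idx:] + ring[:idx]:
        if node not in seen:
            seen.add(node)
            order.append(node)
    order.append(ORIGIN)
    return order
```

Because the interfaces are uniform, the client simply retries the same RESTful GET against each entry in turn; when a cache dies, only its share of the keyspace shifts to the next node on the ring.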
No Central Management System
Central systems are hard to keep in sync with reality
Solution: RIAK KV
Configuration is distributed to all nodes, eliminating the need for a management system. Combined with consistent hashing and blacklisting of broken nodes, this let us eliminate central management servers.
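Because every node carries the topology in RIAK KV, a client can bootstrap from any reachable node. A sketch of that (the host names are hypothetical; the URL follows Riak's HTTP interface, which exposes objects at /buckets/&lt;bucket&gt;/keys/&lt;key&gt; on port 8098):

```python
import json
import urllib.request

# Hypothetical seed list; ANY node can answer because RIAK KV
# replicates the topology cluster-wide.
NODES = ["cache-01.example.net", "cache-02.example.net"]

def fetch_topology():
    """Ask nodes in turn; the first reachable one returns the full
    topology, so no central management server is involved."""
    for node in NODES:
        try:
            url = f"http://{node}:8098/buckets/config/keys/topology"
            with urllib.request.urlopen(url, timeout=2) as resp:
                return json.loads(resp.read())
        except OSError:
            continue  # node down or blacklisted; try the next one
    raise RuntimeError("no node reachable")
```

A node that fails repeatedly is simply blacklisted locally and skipped, which is why no central server ever needs to know which nodes are healthy.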