Few organizations, regardless of their size, can claim to have 1.35 billion users, to manage the upload and ongoing management of 930 million photos a day, or to be responsible for the transmission of 12 billion messages daily. Yet these are the challenges that Facebook’s data center IT staff routinely encounter. To respond to them, Facebook is turning to a disaggregated racks strategy to create a next-gen cloud computing data center infrastructure that delivers the agility, scalability and cost-effectiveness it needs to meet its short- and long-term compute and storage needs.
At this past week’s Storage Visions in Las Vegas, NV, held at the Riviera Casino and Hotel, Facebook’s Capacity Management Engineer, Jeff Qin, delivered a keynote that provided valuable insight into how uber-large enterprise data center infrastructures may need to evolve to meet their unique compute and storage requirements. As these data centers may ingest hundreds of TBs of data daily that must be managed, manipulated and often analyzed in near real time, even the most advanced server, networking and storage architectures that exist today break down.
Qin explained that in Facebook’s early days it started out using the same technologies that most enterprises use today. However, the high volumes of data it ingests, coupled with end-user expectations that the data be processed quickly and securely and then managed and retained for years (and possibly forever), exposed the shortcomings of these approaches. Facebook quickly recognized that buying more servers, networking and storage and then scaling them out and/or up resulted in costs and overhead that became onerous. Further, Facebook recognized that the available CPU, memory and storage capacity in each server and storage node was not being used efficiently.
To implement an architecture that most closely aligns with its needs, Facebook is currently in the process of implementing a Disaggregated Rack strategy. At a high level, this approach entails the deployment of CPU, memory and storage in separate and distinct pools. Facebook then creates virtual servers that are tuned to each specific application’s requirements by pulling and allocating resources from these pools. The objective when creating each of these custom application servers is to utilize 90% of the allocated resources, so that each pool is used as optimally as possible.
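The pool-based allocation model described above can be sketched in a few lines of code. This is a minimal, hypothetical illustration: only the three resource pools and the 90% utilization target come from the keynote; the pool sizes, the sizing formula and all names here are assumptions made for the example, not Facebook's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Pool:
    """One disaggregated resource pool (CPU, memory or storage)."""
    name: str
    capacity: float      # total units available in the pool
    allocated: float = 0.0

    def allocate(self, amount: float) -> bool:
        """Reserve `amount` units from the pool if available."""
        if self.allocated + amount > self.capacity:
            return False
        self.allocated += amount
        return True

# The article cites a 90% utilization objective for each allocation.
TARGET_UTILIZATION = 0.90

def size_virtual_server(pools: dict, peak_demand: dict) -> dict:
    """Size a virtual server so its peak demand lands at ~90% of
    what it is allocated, leaving ~10% headroom per resource."""
    allocation = {}
    for resource, demand in peak_demand.items():
        needed = demand / TARGET_UTILIZATION
        if not pools[resource].allocate(needed):
            raise RuntimeError(f"{resource} pool exhausted")
        allocation[resource] = needed
    return allocation

# Assumed pool capacities, purely for illustration.
pools = {
    "cpu": Pool("cpu", capacity=1024),            # cores
    "memory": Pool("memory", capacity=8192),      # GB
    "storage": Pool("storage", capacity=500_000), # GB
}

# A hypothetical application peaking at 90 cores, 720 GB RAM, 45 TB storage.
vm = size_virtual_server(pools, {"cpu": 90, "memory": 720, "storage": 45_000})
print(vm)  # each allocation is demand / 0.9
```

The point of the sketch is the inversion it makes visible: instead of buying fixed server SKUs and hoping workloads fit, each virtual server is carved from shared pools at a size chosen to keep its own utilization near the target.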
Facebook expects that by taking this approach it can, over time, save in the neighborhood of $1 billion. While Qin did not provide an exact road map as to how Facebook would achieve these savings, his other comments in the keynote provided enough hints that one could draw some conclusions as to how they would be achieved.
For example, Facebook already acquires only what it refers to as “vanity free” servers and storage. By this, one may deduce that it does not acquire servers from the likes of Dell or HP, or storage from the likes of EMC, HDS or NetApp (though Qin did mention Facebook initially bought from these types of companies). Rather, it now largely buys its own servers and configures them itself to meet its specific processing and storage needs.
It also appears that Facebook may be, or already is, buying the component parts that make up servers and storage, such as the underlying CPUs, memory, HDDs and network cabling, to create its next-gen cloud computing data center. Qin did say that what he was sharing at Storage Visions represented a two-year strategy for Facebook, so exactly how far down the path it is toward implementing it is unclear.
With Facebook’s vision presented, the presentations at Storage Visions for the remainder of that day and the next were largely spent showing why this is the future at many large enterprise data centers, but also why it will take some time to come to fruition. For instance, there were some presentations on next-generation interconnect protocols such as PCI Express, InfiniBand, iWARP and RoCE (RDMA over Converged Ethernet).
These high-performance, low-latency protocols are needed in order to deliver the levels of performance between these various pools of resources that enterprises will require. As resources get disaggregated, their ability to achieve the same levels of performance as components inside servers or storage arrays diminishes, since there is more distance and communication between them. While latency benchmarks of 700 nanoseconds are already being achieved using some of these protocols, these are in dedicated, point-to-point environments and not in switched fabric networks.
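The gap between point-to-point and switched-fabric latency can be made concrete with some simple arithmetic. Only the 700-nanosecond point-to-point figure comes from the article; the per-switch-hop latency used below is an assumed round number chosen purely to illustrate how each hop in a fabric erodes the benchmark figure, not a measurement of any real switch.

```python
# Benchmark latency cited in the article for dedicated, point-to-point links.
POINT_TO_POINT_NS = 700

def fabric_latency_ns(hops: int, per_hop_ns: int = 300) -> int:
    """One-way latency estimate once traffic must cross `hops` switches.
    The 300 ns per-hop cost is an assumption for illustration only."""
    return POINT_TO_POINT_NS + hops * per_hop_ns

# How the 700 ns figure grows as switch hops are added between
# a compute node and a disaggregated resource pool.
for hops in (0, 1, 3):
    print(f"{hops} hop(s): {fabric_latency_ns(hops)} ns")
```

Even with these generous assumptions, a few switch hops roughly double the latency, which is why the dedicated-link benchmarks cited at the show do not yet translate to the switched fabrics a rack-scale pool architecture would actually run on.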
Further, there was very little discussion of what type of cloud operating system would overlay all of these components to make the creation and ongoing management of these application-specific virtual servers across these pools of resources possible. Even assuming such an OS existed, tools that manage its performance and underlying components would still need to be developed and tested before it could realistically be deployed in most production environments.
Facebook’s Qin provided a compelling early look into what the next generation of cloud computing may look like in enterprise data centers. However, the rest of the sessions at Storage Visions also provided a glimpse into just how difficult it will be for Facebook to deliver on this ideal, as many of the technologies needed are still in their infancy, if they exist at all.