Thursday 2 December 2010

Constraints of the Cloud Provider: The Operator's Dilemma

As someone who has built and operated on-demand services for many years, I see several areas where mainstream thinking around cloud has not yet developed enough to let operators provide services for their customers efficiently and effectively. This frustrates me greatly. Having worked either for or with the often quoted but still poorly understood top Internet service companies (the ones that start with "Y", "A", and "G"), I have seen very advanced techniques that, when used properly, largely overcome these challenges. These approaches differ in many ways from the traditional industry methods that so many vendors tout, so different that they often turn conventional thinking completely on its head and thus are not hindered by limitations that most of the industry has taken as law. This has left me either chuckling or exasperated as so-called experts who have never worked for those sorts of companies jump up and down and insist that the world is flat and that there are dragons at the edge of the horizon.

While I can (and, if my friends and colleagues have their way, probably will) write books on the subject, I will first try to lay out some of the key constraints that I believe must be overcome in order to provide truly effective and efficient cloud services, whether in the large or in the small.

* Flexibility and agility of the stack- this is the one that seems to befuddle people the most, and the one that many companies unnecessarily address by jamming in a hypervisor to coarsely "tune" the stack. Service providers usually experience peak and trough traffic demands, with the occasional spike or "super peak" event interspersed. Understanding how customers use the service, including the standard load period, the amplitude, how the load manifests itself (CPU, I/O, memory, bandwidth, etc. within particular parts of the stack), and (if possible) peak and trough times, allows a provider to calibrate not only the size and type of required capacity but also how quickly demand can increase. Providers can build out to fully accommodate peaks or spikes, but that means that for much of the time there may be underutilized resources. At very small scale this might not be much of an issue, but at large scale, or when resources are otherwise very tight, it can be unnecessarily costly.

If a provider wishes to respond to load increases and decreases more dynamically, it needs to be able to bring capacity up and down quickly, as this response time determines whether the end user experiences degraded service or an outage. The biggest challenge here usually comes down to how finely grained the unit of capacity is. Most people treat the OS "unit", whether it runs on physical hardware or as a virtual guest, as the one (and sometimes only) building block. This is unnecessarily limiting, as it does not easily allow for scaling that discretely addresses the bottlenecked resource. Mechanisms for building and bringing up and down an OS "unit" are cumbersome outside a virtualized environment that has an image library. Without the right frameworks, building a server can be slow and error prone, especially if the physical equipment is not already on hand. Within a virtualized environment, images are not always well maintained, and the instrumentation and service management tools are often either inadequate or improperly used.

Finally, agility needs to be provided in a way that is resource-effective and reliable. It must not require large numbers of staff to respond and maintain, and infrastructure changes need to be known, atomic, reproducible, and revertible as much as possible to maintain consistency and reliability. Agility also needs to include the relocatability of infrastructure and services for contractual, legal, regulatory, or other business reasons. I have seen many companies get these wrong, or worse, hope that merely adopting industry practices such as ITIL, without developing any deeper understanding or refinement of the underlying services and environment, will magically improve agility, reliability, and resource-effectiveness. Until this is overcome, the service will never approach its cloud-like potential.
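
To make the granularity point above concrete, here is a minimal sketch of a scaling decision that keys off whichever resource is actually the bottleneck in a given tier, rather than treating a whole OS "unit" as the only dial. The tiers, metrics, thresholds, and headroom figure are all hypothetical; a real provider would wire this into its own instrumentation and provisioning frameworks.

```python
# Hypothetical sketch: pick a scaling action per tier based on the
# bottlenecked resource, instead of always adding whole OS "units".
# All names, metrics, and thresholds are illustrative.

from dataclasses import dataclass

@dataclass
class TierMetrics:
    cpu_util: float   # 0.0 - 1.0, CPU utilization in this tier
    io_util: float    # 0.0 - 1.0, storage I/O utilization
    mem_util: float   # 0.0 - 1.0, memory utilization
    net_util: float   # 0.0 - 1.0, network bandwidth utilization

# Headroom kept free to absorb a spike while new capacity comes online
# (assumed figure, for illustration only).
HEADROOM = 0.25

def scaling_decision(tier: str, m: TierMetrics) -> str:
    """Return a human-readable scaling action for one tier of the stack."""
    utilization = {
        "cpu": m.cpu_util,
        "io": m.io_util,
        "memory": m.mem_util,
        "network": m.net_util,
    }
    bottleneck, worst = max(utilization.items(), key=lambda kv: kv[1])
    if worst < 1.0 - HEADROOM:
        return f"{tier}: no action, worst resource ({bottleneck}) at {worst:.0%}"
    # Scale the constrained resource discretely where the platform allows it
    # (more spindles, more memory, more front-end processes), and only fall
    # back to adding whole OS "units" when nothing finer-grained exists.
    return f"{tier}: add {bottleneck} capacity (currently at {worst:.0%})"

print(scaling_decision("web", TierMetrics(0.82, 0.30, 0.45, 0.55)))
print(scaling_decision("db",  TierMetrics(0.40, 0.91, 0.60, 0.20)))
```

The point of the sketch is not the thresholds but the shape of the decision: the scarce resource, not the OS image, drives the response.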

* Data Classification and understanding- I covered this in an earlier post. Understanding the nature and usage of the underlying data allows you to work through how best to approach dynamism within your stack. Without this understanding, data can become stuck, fragmented, inconsistent, or insecure, and can negatively affect the reliability of the service.
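
As a rough illustration of acting on that classification, here is a hypothetical sketch in which each data set carries a class that determines how freely it may be cached or replicated as the stack scales. The classes and rules are invented for the example, not a real taxonomy.

```python
# Hypothetical sketch: attach a classification to each data set and use it
# to decide how freely that data may move within a dynamic infrastructure.
# Classes and rules are illustrative placeholders.

PLACEMENT_RULES = {
    "public":       {"may_cache_at_edge": True,  "may_replicate_freely": True},
    "internal":     {"may_cache_at_edge": False, "may_replicate_freely": True},
    "customer-pii": {"may_cache_at_edge": False, "may_replicate_freely": False},
}

def placement_for(dataset: str, classification: str) -> dict:
    """Look up handling rules for a data set; unknown classes get the
    most restrictive treatment rather than the most permissive."""
    rules = PLACEMENT_RULES.get(classification,
                                {"may_cache_at_edge": False,
                                 "may_replicate_freely": False})
    return {"dataset": dataset, "classification": classification, **rules}

print(placement_for("product-catalog", "public"))
print(placement_for("billing-records", "customer-pii"))
```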

* Interconnectivity- a truly dynamic cloud infrastructure needs to communicate efficiently and securely, both among the components within the infrastructure and with the end user community. This means that networks must be flexible, dynamically scalable, and have minimal latency (often referred to as "near wire speed"), while also providing secure "pipes" for moving data. This poses challenges, as traditional network topology is often either rigidly dedicated to enhance security or flat and insecure. Also, dynamic storage networks are still in their infancy, as can be seen by the lack so far of one definitive winning standard, despite the hype around FCoE and iSCSI.
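
One low-tech way to keep the "near wire speed" expectation honest is to probe latency between components continuously and flag anything that drifts past a budget. The sketch below uses TCP connection time as a crude proxy; the host names, ports, and the 1 ms budget are placeholders.

```python
# Hypothetical sketch: a crude round-trip probe between components to check
# that inter-component latency stays near an assumed budget. Hosts, ports,
# and the budget are placeholders.

import socket
import time

LATENCY_BUDGET_MS = 1.0   # assumed intra-site budget, for illustration

def connect_latency_ms(host: str, port: int, timeout: float = 2.0) -> float:
    """Measure the time to establish a TCP connection, as a rough proxy
    for network round-trip latency between two components."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.monotonic() - start) * 1000.0

for host, port in [("app-tier.internal", 8080), ("db-tier.internal", 5432)]:
    try:
        ms = connect_latency_ms(host, port)
        flag = "OK" if ms <= LATENCY_BUDGET_MS else "OVER BUDGET"
        print(f"{host}:{port} {ms:.2f} ms ({flag})")
    except OSError as exc:
        print(f"{host}:{port} unreachable: {exc}")
```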

* Power- this constraint can be defined in any number of ways, whether by power draw, cooling, space efficiency, or redundancy. As compute power continues to grow and miniaturization improves, achieving effective compute densities in a cost-efficient manner is quickly becoming an industry of its own. Innovations such as the Yahoo Computing Coop and Google's shifting datacenters are far out in front; others battle with lower compute densities, more moving parts, and therefore higher overall costs. Not addressing this limits the availability and cost effectiveness of cloud.
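
A common way to put a single number on this constraint is Power Usage Effectiveness (PUE): the ratio of everything the facility draws to the power that actually reaches the IT equipment. The figures in the sketch below are invented purely to show the arithmetic, not measurements from any of the designs mentioned above.

```python
# Sketch: Power Usage Effectiveness (PUE) = total facility power / IT power.
# A facility that spends little on cooling and distribution overhead has a
# PUE close to 1.0. The figures below are invented for illustration.

def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Ratio of everything the facility draws to what the IT gear uses."""
    return total_facility_kw / it_equipment_kw

# A conventional data centre vs. a highly optimised free-air-cooled design.
conventional = pue(total_facility_kw=2000.0, it_equipment_kw=1000.0)   # 2.0
optimised    = pue(total_facility_kw=1100.0, it_equipment_kw=1000.0)   # 1.1

overhead_saved_kw = (conventional - optimised) * 1000.0
print(f"conventional PUE: {conventional:.2f}, optimised PUE: {optimised:.2f}")
print(f"overhead avoided at 1 MW of IT load: {overhead_saved_kw:.0f} kW")
```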

* Security- this constraint is more about perception, combined with a reasonably sophisticated understanding of how the underlying service is used, than the problem that many hawkers of "private clouds" would have you believe it to be. This is not to say that many cloud providers have figured this out already; in fact, some are only slightly less sloppy than the standard IT department. Better providers understand the dynamics of the underlying service and properly classify and, where applicable, segment data within the service. They also properly track and audit access to the data to ensure proper handling.

Where security becomes a more interesting challenge is in the Contractual, Legal and Regulatory (CLR) arena, where requirements may be introduced that are less about security and more about data handling and locale. There is still a great deal of FUD that has built up due to the constant challenges that have arisen in the consumer space. As technology, understanding, and the legal and regulatory landscape evolve, this should improve, allowing customers to enjoy the power of the cloud.
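
To tie the classification, auditing, and locale threads together, here is a hypothetical sketch of an access check that consults a per-classification residency policy and writes an audit record for every decision. The regions, classes, and policy contents are placeholders rather than real contractual or regulatory requirements.

```python
# Hypothetical sketch: gate access to a data set on its classification and
# on where it is allowed to live, and emit an audit record for every
# decision. Regions, classes, and policies are placeholders.

import json
import time

# Which regions a given classification of data may be served from
# (illustrative policy, not real requirements).
RESIDENCY_POLICY = {
    "public":       {"us-east", "eu-west", "ap-south"},
    "customer-pii": {"eu-west"},   # e.g. data that must stay in one locale
}

def access(dataset: str, classification: str, region: str, principal: str) -> bool:
    """Allow or deny an access request and emit an audit log line."""
    allowed_regions = RESIDENCY_POLICY.get(classification, set())
    allowed = region in allowed_regions
    audit_record = {
        "ts": time.time(),
        "principal": principal,
        "dataset": dataset,
        "classification": classification,
        "region": region,
        "decision": "allow" if allowed else "deny",
    }
    # In a real service this would go to an append-only audit store.
    print(json.dumps(audit_record))
    return allowed

access("billing-records", "customer-pii", "eu-west", "svc-invoicing")   # allowed
access("billing-records", "customer-pii", "us-east", "svc-analytics")   # denied
```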
