Thursday 2 December 2010

Constraints of the Cloud Provider: The Operator's Dilemma

As someone who has worked on building and operating on-demand services for many years, I see several areas where mainstream thinking around cloud has not yet developed sufficiently to allow operators to efficiently and effectively provide services for their customers. This frustrates me greatly. Having worked either for or with the often quoted but still poorly understood top Internet service companies out there (like the ones that start with "Y", "A", and "G"), I have seen very advanced techniques that, when used properly, largely overcome these challenges. These approaches are in many ways very different from the traditional industry methods that so many vendors tout, so different that they often turn traditional thinking completely on its head and thus are not hindered by limitations that most of the industry has taken as law. This has left me either chuckling or exasperated as so-called experts who have never worked for those sorts of companies jump up and down and insist that the world is flat and that there are dragons at the edge of the horizon.

While I can (and, if my friends and colleagues have their way, probably will) write books on the subject, I will first try to lay out some of the key constraints that I believe must be overcome in order to provide truly effective and efficient cloud services, whether in the large or in the small.

* Flexibility and agility of the stack- this is the one that seems to befuddle people the most, and the one that many companies try to address by unnecessarily jamming in a hypervisor as a coarse "tuning" knob. Service providers usually experience peak and trough traffic demands, with the occasional spike or "super peak" event interspersed. Understanding how customers use the service, including the standard load period, the amplitude, how the load manifests itself (CPU, I/O, memory, bandwidth, etc. within particular parts of the stack), and (if possible) peak and trough times, allows a provider to calibrate not only the size and type of required capacity, but also how quickly demand can increase. Providers can build out to fully accommodate peaks or spikes, but that means that for much of the time there may be underutilized resources. At very small scale this might not be much of an issue, but at large scale, or when resources are otherwise very tight, it can be unnecessarily costly.

If a provider wishes to respond to load increases and decreases in a more dynamic way, they need to be able to bring capacity up and down quickly, as this response time determines whether the end user experiences degraded service or an outage. The biggest challenge here usually comes down to how finely grained the unit of capacity is. Most people treat the OS "unit", whether it is on physical hardware or a virtual guest, as the one (and sometimes only) building block. This is unnecessarily limiting, as it does not easily allow for scaling that discretely addresses the bottlenecked resource. Mechanisms for building and bringing up and down an OS "unit" are cumbersome outside a virtualized environment that has an image library. Without the right frameworks, building a server can be slow and error-prone, especially if the physical equipment is not already on hand. Within a virtualized environment, images are not always well maintained, and the instrumentation and service management tools are often either inadequate or improperly used.
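To make the point concrete, here is a minimal sketch of what scaling the bottlenecked resource, rather than always adding whole OS "units", might look like. The thresholds, metrics, and provisioning actions are purely illustrative assumptions, not a description of any particular provider's tooling:

    # Hypothetical sketch: scale the resource that is actually the bottleneck,
    # rather than always adding another whole OS "unit". The thresholds and the
    # provisioning actions below are illustrative placeholders only.

    THRESHOLDS = {"cpu": 0.75, "memory": 0.80, "io": 0.70, "bandwidth": 0.65}

    def find_bottleneck(utilization):
        """Return the most saturated resource relative to its threshold, if any."""
        over = {r: utilization.get(r, 0) / THRESHOLDS[r]
                for r in THRESHOLDS if utilization.get(r, 0) >= THRESHOLDS[r]}
        return max(over, key=over.get) if over else None

    def scale_for(bottleneck):
        # Each action targets one resource in one part of the stack.
        actions = {
            "cpu": lambda: print("provision additional app-tier instances"),
            "memory": lambda: print("grow the cache tier"),
            "io": lambda: print("add I/O capacity or rebalance shards"),
            "bandwidth": lambda: print("add front-end capacity or offload static content"),
        }
        actions[bottleneck]()

    if __name__ == "__main__":
        sample = {"cpu": 0.55, "memory": 0.62, "io": 0.88, "bandwidth": 0.40}
        bottleneck = find_bottleneck(sample)
        if bottleneck:
            scale_for(bottleneck)   # here: I/O, without touching CPU or memory
        else:
            print("no resource over threshold; consider scaling down")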

Finally, agility needs to be provided in a way that is resource-effective and reliable. It must not require large numbers of staff to respond and maintain, and infrastructure changes need to be known, atomic, reproducible, and revertible as much as possible to maintain consistency and reliability. Agility also needs to include the relocatability of infrastructure and services for contractual, legal, regulatory or other business reasons. I have seen many companies get these wrong, or, worse, hope that merely adopting industry practices such as ITIL, without developing any deeper understanding or refinement of the underlying services and environment, will magically improve agility, reliability and resource-effectiveness. Until this is overcome, the service will never approach its cloud-like potential.
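Here is a minimal sketch of the "known, atomic, reproducible, and revertible" idea: every change is expressed as desired state, the prior state is captured before the change is applied, and reverting simply restores that captured state. The in-memory config store and the change names are illustrative assumptions only:

    # Hypothetical sketch of known, atomic, reproducible, revertible changes.
    # The dict "store" stands in for whatever configuration system is in use.

    import copy, json

    class ChangeSet:
        def __init__(self, store):
            self.store = store
            self.history = []                      # journal of applied changes

        def apply(self, name, desired):
            before = copy.deepcopy(self.store)     # capture state so we can revert
            self.store.update(desired)             # the change itself is one step
            self.history.append({"name": name, "before": before,
                                 "after": copy.deepcopy(self.store)})
            return json.dumps(desired)             # the change is recorded, hence "known"

        def revert_last(self):
            if not self.history:
                return
            last = self.history.pop()
            self.store.clear()
            self.store.update(last["before"])      # restore the captured state

    if __name__ == "__main__":
        store = {"frontend_instances": 4, "cache_size_gb": 32}
        changes = ChangeSet(store)
        changes.apply("grow-frontend", {"frontend_instances": 8})
        print(store)        # {'frontend_instances': 8, 'cache_size_gb': 32}
        changes.revert_last()
        print(store)        # back to the original state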

* Data Classification and understanding- I covered this in an earlier post. Understanding the nature and usage of the underlying data allows you to work through how best to approach dynamism within your stack. Without this understanding, data can become stuck, fragmented, inconsistent, or insecure, and negatively affect the reliability of the service.
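As a rough illustration of how such an understanding might be put to work, the sketch below tags each data set with how it is used and how sensitive it is, and lets those tags drive placement. The categories and placement rules are illustrative assumptions only:

    # Hypothetical sketch: classify data by access pattern and sensitivity, and
    # let the classification decide where it lives and how freely it can move.

    DATASETS = [
        {"name": "session_state",   "access": "hot",  "sensitivity": "low"},
        {"name": "user_profiles",   "access": "warm", "sensitivity": "personal"},
        {"name": "billing_records", "access": "cold", "sensitivity": "regulated"},
    ]

    def placement(ds):
        if ds["sensitivity"] == "regulated":
            return "encrypted store, pinned to an approved locale, no free migration"
        if ds["access"] == "hot":
            return "in-memory or local tier, freely replicated for scale"
        return "shared durable tier, migratable between sites"

    for ds in DATASETS:
        print(f'{ds["name"]}: {placement(ds)}')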

* Interconnectivity- a truly dynamic cloud infrastructure needs to communicate efficiently and securely, both across the components within the infrastructure and with the end user community. This means that networks must be flexible, dynamically scalable, and have minimal latency (often referred to as "near wire speed"), while also providing secure "pipes" for moving data. This poses challenges, as traditional network topology is often either rigidly dedicated to enhance security or flat and insecure. Also, dynamic storage networks are still in their infancy, as can be seen in the lack, as yet, of one definitive winning standard, despite the hype around FCoE and iSCSI.

* Power- this constraint can be framed in any number of ways, whether as power, cooling, space efficiency, or redundancy. As compute power continues to grow and miniaturization improves, achieving effective compute densities in a cost-efficient manner is quickly becoming an industry of its own. Innovations such as the Yahoo Computing Coop and Google's shifting datacenters are far out in front. Others battle with lower compute densities, more moving parts, and therefore higher overall costs. Not addressing this limits the availability and cost-effectiveness of cloud.

* Security- this constraint is more about perception, combined with a somewhat more sophisticated understanding of how the underlying service is used, than the intractable problem that many hawkers of "private clouds" would have you believe. This is not to say that many cloud providers have figured this out; in fact, some are only slightly less sloppy than the standard IT department. Better providers will understand the dynamics of the underlying service and will properly classify and, where applicable, segment the data within the service. They will also properly track and audit access to the data to ensure proper handling.
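A minimal sketch of what classification-aware access checks with an audit trail might look like follows. The classifications, roles, and audit log format are illustrative assumptions, not any particular provider's scheme:

    # Hypothetical sketch: every access is checked against the data's
    # classification and recorded, whether it was allowed or not.

    import datetime

    CLASSIFICATION = {"public": 0, "internal": 1, "personal": 2, "regulated": 3}
    ROLE_CLEARANCE = {"support": 1, "engineer": 2, "compliance": 3}

    AUDIT_LOG = []   # in a real service this would be durable and append-only

    def access(user, role, record_name, record_class):
        allowed = ROLE_CLEARANCE.get(role, 0) >= CLASSIFICATION[record_class]
        AUDIT_LOG.append({
            "when": datetime.datetime.utcnow().isoformat(),
            "who": user, "role": role,
            "what": record_name, "class": record_class,
            "allowed": allowed,
        })
        return allowed

    if __name__ == "__main__":
        print(access("alice", "support", "billing_records", "regulated"))    # False
        print(access("bob", "compliance", "billing_records", "regulated"))   # True
        for entry in AUDIT_LOG:
            print(entry)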

Where security becomes a more interesting challenge is in the Contractual, Legal and Regulatory (CLR) arena, where requirements may be introduced that are less about security and more about data handling and locale. There is still a great deal of FUD that has built up due to the constant challenges that have arisen in the consumer space. As technology, understanding, and the legal and regulatory landscape evolve, this should improve, allowing customers to enjoy the power of the cloud.

Wednesday 1 December 2010

Cloud and the Edge Device: What about the User

(Those who know me know I have had my hands more than a little full the last couple of months. Hopefully, things will settle down soon enough so that I have more time to devote to this.)

While a lot of the discussion in the last year or so has concentrated on the infrastructure and service side of Cloud, arguably the biggest effect upon the whole stack will be driven by what happens on the edge. As more and more people acquire, use and become comfortable with smart phones, tablets, and various roaming laptop form factors, demand for speed, flexibility and portability of applications and services on such mobile devices will skyrocket. No longer will road warriors be willing to be tethered to the office or a wired network to be productive and stay connected to the business and friends. They will also steadily have less and less patience for those "sticky situations" where the applications and/or data they need are "stuck" on a system in the office. While some might see a VDI approach as a way to overcome this, it is a far from perfect hack that twists a traditional paradigm to create a stop-gap intermediary layer rather than having the edge device communicate directly with the desired service itself.

The first place to go in order to understand the edge is to look at the constraints that exist there. The first is connectivity. In the early days of the Internet, when line speeds were measured in Kbps, it was impractical to run data-hungry applications, and it was not until broadband became more widely available that applications like streaming media became popular. Wifi, LTE and 4G will go a long way towards doing the same for mobile devices; however, reliability and latency issues will still need to be addressed. Uneven coverage, clogged masts and backhaul, as well as coverage holes and shadows, must be dealt with through further improvements in the technology as well as presentation resilience of the service on the edge device itself.
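As a rough sketch of what that resilience might look like on the device, the snippet below retries a flaky connection with backoff and falls back to the last known good copy so the user still sees something useful. The fetch_from_service() call is a placeholder for whatever transport a real client would use:

    # Hypothetical sketch of presentation resilience on the edge: retry with
    # backoff, then degrade gracefully to a cached copy rather than failing.

    import random, time

    _last_good = {}   # simple local cache of the last successful responses

    def fetch_from_service(resource):
        """Stand-in for a network call that sometimes fails or stalls."""
        if random.random() < 0.5:
            raise ConnectionError("coverage hole or clogged backhaul")
        return f"fresh copy of {resource}"

    def resilient_fetch(resource, retries=3, base_delay=0.2):
        for attempt in range(retries):
            try:
                data = fetch_from_service(resource)
                _last_good[resource] = data              # refresh the cache
                return data, "live"
            except ConnectionError:
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
        if resource in _last_good:
            return _last_good[resource], "cached"        # degrade, don't fail
        return None, "unavailable"

    if __name__ == "__main__":
        print(resilient_fetch("inbox"))
        print(resilient_fetch("inbox"))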

The second constraint is form factor. While people have different thresholds for what they are willing to carry and use, it is clear that the edge device needs to be lightweight and relatively easy to carry and stow away while still being big and powerful enough to use comfortably. The advent of devices like the iPhone and iPad has just begun to cross that threshold for many. However, limited processing power, the lack of consistently fast connectivity, and the proprietary nature of iOS have held back the eventual revolution thus far. This has driven the market for specialized applications, but has not yet quite opened the floodgates towards showing what might be possible with cloud services. As it is now clear there is strong demand for these devices, I expect that rapid improvements in technology, as well as the spread of more open operating systems such as Android, will allow operators, consumers and cloud service providers alike to rapidly hone in on this sweet spot, creating an even larger market and driving ever more innovative usage patterns.

The third constraint is power. As people are always on the go, and wifi connections are generally power hogs, battery life is critical. As form factor is also important, these batteries must be small and lightweight, yet powerful, long lasting and easily recharged. While I have been impressed with the iPad (though not with any of my other mobile devices), a lot of progress still needs to be made here.

Finally, the last constraint, which I view as perhaps also the biggest opportunity for Cloud, is security. As regulators start to create ever harsher laws around the loss of data, it will become ever more important to limit the amount of data that is allowed to reside on mobile devices. This may mean creating a much sharper tiered storage policy for sensitive and personal data, one that limits how much is actually stored on the edge device. If connectivity, power, and form factor are improved, this will allow for less resident data to exist on the device. Data can also be keyed back to, or have holes punched in it like a puzzle to be filled by, a central cloud service that allows it to be accessed in whole. As the other constraints are slowly pushed back, there will be ever more creative ways that this can be addressed.
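A minimal sketch of that "holes punched in it" idea follows: the copy held on the device keeps only non-sensitive fields plus opaque references, and the sensitive values are filled in on demand from a central service. The record layout and the central_lookup() call are illustrative assumptions:

    # Hypothetical sketch: the device-resident record carries only safe fields
    # and references; the sensitive pieces live centrally and are fetched on demand.

    CENTRAL_STORE = {          # stands in for the cloud-side service
        "ref:card": "4111 1111 1111 1111",
        "ref:address": "221B Baker Street, London",
    }

    def central_lookup(ref):
        """Placeholder for an authenticated call back to the central service."""
        return CENTRAL_STORE[ref]

    # What actually lives on the device: safe fields inline, sensitive ones as refs.
    device_record = {
        "display_name": "A. Customer",
        "last_order": "2010-11-28",
        "payment_card": {"ref": "ref:card"},
        "shipping_address": {"ref": "ref:address"},
    }

    def resolve(record):
        """Fill in the 'holes' only when the user actually needs the full record."""
        full = {}
        for key, value in record.items():
            if isinstance(value, dict) and "ref" in value:
                full[key] = central_lookup(value["ref"])
            else:
                full[key] = value
        return full

    if __name__ == "__main__":
        print(device_record)           # what a lost or stolen device would expose
        print(resolve(device_record))  # the complete view, assembled via the cloud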