Thursday 2 December 2010

Constraints of the Cloud Provider: The Operator's Dilemma

As someone who has built and operated on-demand services for many years, I see several areas where mainstream thinking around cloud has not yet developed enough to allow operators to efficiently and effectively provide services for their customers. This frustrates me greatly. Having worked either for or with the often quoted but still poorly understood top Internet service companies out there (like the ones that start with "Y", "A", and "G"), I have seen very advanced techniques that, when used properly, largely overcome these challenges. These approaches are in many ways very different from the traditional industry methods that so many vendors tout, so different that they often turn traditional thinking completely on its head and are therefore not hindered by limitations that most of the industry has taken as law. This has left me either chuckling or exasperated as so-called experts who have never worked for those sorts of companies jump up and down and insist that the world is flat and that there are dragons at the edge of the horizon.

While I can (and, if my friends and colleagues have their way, probably will) write books on the subject, I will first try to lay out some of the key constraints that I see must be overcome in order to really provide effective and efficient cloud services, whether in the large or in the small.

* Flexibility and agility of the stack- this is the one that seems to befuddle people the most, and the one that many companies unnecessarily jam in a hypervisor to coarsely "tune". Service providers usually experience peak and trough traffic demands, with the occasional spike or "super peak" event interspersed. Understanding how customers use the service, including the standard load period, amplitude, how the load manifests itself (CPU, I/O, memory, bandwidth, etc. within particular parts of the stack), and (if possible) peak and trough times allows a provider to calibrate not only the size and type of required capacity, but also how quickly demand can increase. Providers can build out to fully accommodate peaks or spikes, but that means that for much of the time there may be underutilized resources. At very small scale this might not be much of an issue, but at large scale, or when resources are otherwise very tight, it can be unnecessarily costly.

If a provider wishes to respond to load increases and decreases in a more dynamic way, they need to be able to bring capacity up and down quickly, as this response time determines whether the end user experiences degraded service or an outage. The biggest challenge here usually comes down to how finely grained the unit of capacity is. Most people treat the OS "unit", whether it is on physical hardware or a virtual guest, as the one (and sometimes only) building block. This is unnecessarily limiting, as it does not easily allow for scaling that discretely addresses the bottlenecked resource (this granularity problem is sketched in code below). Mechanisms for building and bringing up and down an OS "unit" are cumbersome outside a virtualized environment that has an image library. Without the right frameworks, building a server can be slow and error prone, especially if the physical equipment is not already on hand. Within a virtualized environment, images are not always well maintained, and the instrumentation and service management tools are often either inadequate or improperly used.

Finally, agility needs to be provided in a way that is resource-effective and reliable. It must not require large numbers of staff to respond and maintain, and infrastructure changes need to be known, atomic, reproducible, and revertible as much as possible to maintain consistency and reliability. Agility also needs to include the relocatability of infrastructure and services for contractual, legal, regulatory or other business factors. I have seen many companies get these wrong, or worse, hope that merely adopting industry practices such as ITIL, without developing any deeper understanding or refinement of the underlying services and environment, will magically improve agility, reliability and resource-effectiveness. Until this is overcome, the service will never approach its cloud-like potential.
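
To make the granularity point a bit more concrete, here is a minimal sketch, in Python, of what scaling against the bottlenecked resource (rather than treating a whole OS "unit" as the only dial) might look like. The resource names, thresholds, and lead times are purely illustrative assumptions, not any particular provider's tooling.

```python
from dataclasses import dataclass

@dataclass
class ResourceDemand:
    name: str           # e.g. "cpu", "io", "memory", "bandwidth"
    utilisation: float  # current utilisation as a fraction of provisioned capacity
    lead_time_min: int  # how long it takes to bring more of this resource online

SCALE_UP_AT = 0.75      # assumed thresholds; tune per service and per resource
SCALE_DOWN_AT = 0.30

def scaling_actions(demands):
    """Return per-resource actions instead of a blanket 'add another server'."""
    actions = []
    for d in demands:
        if d.utilisation >= SCALE_UP_AT:
            # The longer the lead time, the earlier the provider must act
            # to avoid degraded service or an outage at the peak.
            actions.append((d.name, "scale up", d.lead_time_min))
        elif d.utilisation <= SCALE_DOWN_AT:
            actions.append((d.name, "scale down", d.lead_time_min))
    return actions

if __name__ == "__main__":
    web_tier = [ResourceDemand("cpu", 0.42, 15),
                ResourceDemand("io", 0.81, 45),   # storage is the bottleneck here
                ResourceDemand("memory", 0.55, 15)]
    print(scaling_actions(web_tier))  # only the I/O capacity needs to grow
```

The point of the lead-time field is that the slower a resource is to bring online, the earlier the provider has to act if users are not to see degraded service.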

* Data Classification and understanding- I covered this in an earlier post. Understanding the nature and usage of the underlying data allows you to work through how best to approach dynamism within your stack. Without this understanding, data can become stuck, fragmented, inconsistent, or insecure, and negatively affect the reliability of the service.

* Interconnectivity- a truly dynamic cloud infrastructure needs to communicate efficiently and securely, both across the components within the infrastructure and with the end user community. This means that networks must be flexible, dynamically scalable, and have minimal latency (often referred to as "near wire speed"), while also providing secure "pipes" for moving data. This poses challenges, as traditional network topology is often either rigidly dedicated to enhance security or flat and insecure. Also, dynamic storage networks are still in their infancy, as can be seen by the lack, as yet, of one definitive winning standard despite the hype around FCoE and iSCSI.

* Power- this constraint can be defined in any number of ways, whether by power, cooling, space efficiency, or redundancy. As compute power continues to grow and miniaturization improves, achieving effective compute densities in a cost efficient manner is quickly becoming an industry of its own. Innovations such as the Yahoo Computing Coop and Google's shifting datacenters are far out in front. Others battle with lower compute densities, more moving parts, and therefore higher overall costs. Not addressing this limits the availability and cost effectiveness of cloud.

* Security- this constraint is more about perceptions, along with some additional sophisticated understanding of the usage of the underlying service, than the problem that many hawkers of "private clouds" would have you believe. That is not to say that every cloud provider has figured this out. In fact, some are only slightly less sloppy than the standard IT department. Better providers understand the dynamics of the underlying service and properly classify and, where applicable, segment data within the service. They also properly track and audit access to the data to ensure proper handling.

Where security becomes a more interesting challenge is in the Contractual, Legal and Regulatory (CLR) arena, where requirements may be introduced that are less about security and more about data handling and locale. There is still a great deal of FUD that has built up due to the constant challenges that have arisen in the consumer space. As technology, understanding, and the legal and regulatory landscape evolve, this should improve and allow customers to enjoy the power of the cloud.

Wednesday 1 December 2010

Cloud and the Edge Device: What about the User?

(Those who know me know I have had my hands more than a little full the last couple of months. Hopefully, things will settle down soon enough so that I have more time to devote to this.)

While a lot of the discussion in the last year or so has concentrated on the infrastructure and service side of Cloud, arguably the biggest effect upon the whole stack will be driven by what happens on the edge. As more and more people acquire, use and become comfortable with smart phones, tablets, and various roaming laptop form factors, demands for speed, flexibility and portability of applications and services to such mobile devices will skyrocket. No longer will road warriors want to be tethered to the office or a wired network to be productive and stay connected to the business and friends. They will also steadily have less and less patience for those "sticky situations" where the applications and/or data they need are "stuck" on a system in the office. While some might see a VDI approach as a way to overcome this, it is a far from perfect hack that twists a traditional paradigm to create a stop-gap intermediary layer rather than have the edge device communicate directly with the wanted service itself.

The best place to start in understanding the edge is to look at the constraints that exist there. The first is connectivity. In the early days of the Internet, when line speeds were measured in Kbps, it was impractical to run data-hungry applications, and it was not until broadband became more widely available that applications like streaming media became popular. Wifi, LTE and 4G will go a long way toward doing the same for mobile devices; however, reliability and latency issues will still need to be addressed. Uneven coverage, clogged masts and backhaul, as well as coverage holes and shadows must be dealt with through further improvements in the technology as well as presentation resilience of the service on the edge device itself.

The second constraint is form factor. While people have different thresholds for what they are willing to carry and use, it is clear that the edge device needs to be lightweight and relatively easy to carry and stow away while still being big and powerful enough to use comfortably. The advent of devices like the iPhone and iPad has just begun to cross that threshold for many. However, insufficient processing power and connectivity, along with the proprietary nature of iOS, have limited the eventual revolution thus far. This has driven the market for specialized applications, but has not yet quite opened the floodgates towards showing what might be possible with cloud services. As it is now clear there is strong demand for these devices, I expect rapid improvements in technology, as well as the spread of more open operating systems such as Android, to allow operators, consumers and cloud service providers alike to rapidly home in on this sweet spot, create an even larger market, and drive ever more innovative usage patterns.

The third constraint is power. As people are always on the go, and wifi connections are generally power hogs, battery life is critical. As form factor is also important, these batteries must be small and lightweight, yet powerful, long lasting and easily recharged. While I have been impressed with the iPad (though not with any of my other mobile devices), a lot of progress still needs to be made here.

Finally, the last constraint, which I view as perhaps also the biggest opportunity for Cloud, is security. As regulators start to create ever harsher laws around the loss of data, it will become ever more important to limit the amount of data that is allowed to reside on mobile devices. This may mean creating a much sharper tiered storage policy for sensitive and personal data that limits the amount actually stored on the edge device. If connectivity, power, and form factor are improved, this will allow for less resident data to exist on the device. Data can also be keyed back to, or have holes punched in it like a puzzle to be filled by, a central cloud service so that it can be accessed in whole (a sketch of this idea follows below). As the other constraints are slowly pushed back, there will be ever more creative ways that this can be addressed.
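
As a purely illustrative sketch of that last idea, the snippet below keeps only non-sensitive fields resident on the device and replaces the rest with opaque references that a central cloud service fills in on demand. The field names, the reference format, and the fetch function are all assumptions made for the example.

```python
SENSITIVE_FIELDS = {"card_number", "home_address", "date_of_birth"}

def punch_holes(record):
    """Split a record into what may rest on the device and references to the rest."""
    resident, holes = {}, {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            holes[field] = f"ref://vault/{field}/{record['id']}"  # opaque pointer only
        else:
            resident[field] = value
    return resident, holes

def rehydrate(resident, holes, fetch):
    """Fill the holes from the central service only when the user needs the full record."""
    full = dict(resident)
    for field, ref in holes.items():
        full[field] = fetch(ref)  # fetch() stands in for whatever secure channel is used
    return full

if __name__ == "__main__":
    record = {"id": "42", "name": "A. Customer", "city": "London", "card_number": "4111..."}
    on_device, refs = punch_holes(record)
    print(on_device)  # safe to keep resident on the edge device
    print(rehydrate(on_device, refs, fetch=lambda ref: "<fetched from cloud>"))
```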

Monday 20 September 2010

The Art of Data Classification and Management

Anyone who has tried to build and manage dynamic infrastructure and services knows that data handling is one of the most complex parts. While it is slowly becoming more and more obvious to people how to generalize and template infrastructure, operating systems, application and service stacks, many still get stuck on how best to deal with the underlying data. Data for them feels unduly "sticky". Due to its size it is often difficult to move dynamically en masse, it changes frequently, its nature is often poorly understood, and, to those who treat systems as a black box, it is unpredictable in enough ways to make people nervous.

This perceived uniqueness is a very serious problem. It also highlights one of the biggest weaknesses of most companies that try to tread into the cloud space: not understanding the nature and uses of the entire ecosystem. Being able to deploy, move around, and even virtualize systems is novel and potentially useful, but without a true understanding and model of the end-to-end service and how it is used, gaps form that limit the usefulness of cloud, or worse, run roughshod over the service quality expectations of the user community. This increases the amount of FUD against dynamic services and muddies the water for people trying to realize their real strengths.

Back to data. In my experience, the nature of data in any system can and should be understood, tracked, and put into classifications for better management and usability. One form of classification is the dynamism of the data. Is it static, either not changing at all or only very slowly over time? Things like birth dates fit into this category, as your birthday never changes. Is the data dynamic, meaning that it actively changes, and the patterns of change can be captured, tracked, and replayed 90%+ of the time? Most common transactions in a database fit this pattern, which is why redo logs, and heavy tuning of how they are applied and replayed on backup or replica databases, are an important skill in an HA environment. Is the data interstitial, caught between the customer and the system or service that the customer is interacting with? This is the data that will almost certainly be lost if the underlying service goes away during the interaction, and it needs to be minimized to an amount that the customer is willing to accept. Understanding the patterns by which the customer interacts with the system, and the types of states that the system can be left in during an outage, helps not only set customer expectations but also allows for a much better understanding of ways to more dynamically manage an environment in a way that maximizes service quality, resilience, utilization and cost effectiveness. Mapping the data across these classifications helps with the understanding of the environment in ways that greatly improve the effectiveness of targeted data management techniques.

There are other classifications that may be useful as well. Is the data relational or block? Is it big (such as media) or small (simple text)? Is the data latency tolerant (RSS feeds) or intolerant (financial feeds)? Is it quickly processed with rapid access (a transactional system or distributed hash table) or batch (such as data processing in a map-reduce or data warehouse cluster)? Is the service distributed globally or centrally located? Must the data be absolutely consistent, or is it resilient to lazy updates? Does it need to be secure, or have other regulatory or legal constraints that need to be taken into account while storing and handling the data? Each of these and others helps with understanding the end-to-end system and allows for a much more targeted battery of approaches and tools that can be used to manage the environment more effectively. Often I build a matrix of sorts (sketched below) that allows people to map and understand the nature of the various types of data, to assist with such targeting and to break the mindset that all data is created (and thus treated) equally, or the worst and most common mistake, that it must all be treated at the same lowest common denominator.
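
A minimal sketch of what such a matrix might look like in code follows; the axes, the example data types, and the handling hints are illustrative assumptions rather than a definitive taxonomy.

```python
# Each row maps a type of data onto a few classification axes and a handling hint.
DATA_MATRIX = {
    # name:          (dynamism,       size,    latency,      consistency,   handling hint)
    "birth_date":    ("static",       "small", "tolerant",   "absolute",    "cache widely, replicate lazily"),
    "order_txn":     ("dynamic",      "small", "intolerant", "absolute",    "redo logs, tuned replay to replicas"),
    "checkout_form": ("interstitial", "small", "intolerant", "best-effort", "minimise; may be lost in an outage"),
    "media_asset":   ("static",       "big",   "tolerant",   "lazy",        "bulk object store / edge caches"),
    "clickstream":   ("dynamic",      "big",   "tolerant",   "lazy",        "batch into map-reduce or warehouse"),
}

def by_dynamism(kind):
    """Pull out everything in one dynamism class so it can be handled as a group."""
    return {name: row for name, row in DATA_MATRIX.items() if row[0] == kind}

if __name__ == "__main__":
    for name, row in by_dynamism("dynamic").items():
        print(name, "->", row[-1])
```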

Saturday 21 August 2010

Building Effective Teams

Team building is a lot more difficult than most make it out to be. It is not something that one can easily learn in a book, or in a team building exercise for that matter. It is something that requires collaboration, understanding, time, and patience for it to occur. It tends to come together in the more stressful times, and much like a family involves conflict, misunderstanding, shared successes and failures, and a common thread that pulls everyone together. It is also something that must live and breathe like an organism, for while there might be a nominated leader it takes constant feedback, leadership and effort from everyone to succeed.

I have been part of attempts to build many teams throughout my career. In fact, I really enjoy building them, bringing out the best in people, and building a greater organism that brings with it friendships and learning. Some teams have come together and, through thick and thin, achieved amazing feats, while others behaved worse than a random list of names on a piece of paper. Some acted like elite special forces units, some like an eccentric family, and others like two year olds on the first day of nursery school, or worse, strangers on the Tube. While building a team isn't necessarily something that can be done in a prescriptive way, it can in time be a lot like cooking. In fact, like Iron Chef, it is almost like being given various, sometimes almost random, ingredients to combine in various ways to attempt to create a delicious feast.

The best teams were rarely made up of the best from a talent and skills perspective, but rather of those who were willing to put in the heart and effort, against the odds, to help the group. They came together naturally rather than being artificially put together based on some formula. There were always missing skills, things to learn, and stuff that looked set for failure and sometimes did fail. Roles sometimes seemed fuzzy, and would change frequently to keep the team balanced and on track, sometimes providing the added benefit of ensuring that experiences (and pain) were shared. Nicknames were often given to people, places, things and ideas (I have had many bestowed onto me). Bread was broken, drinks shared, ideas constantly flowed, and it was quite normal for work and personal life to intermingle at times. I have had many key strategies and architectures drawn on a series of napkins and the backs of random junk mail at people's homes and at random family style restaurants. Disagreements would abound. In fact, impassioned arguments always seem to be one of the biggest signs of success: working through the heat of passion of people who really cared and who were allowed to confront each other in the desire to find the best way through. Ideas, discussion, and feedback must be constant, like blood flowing through the body. If feedback doesn't flow, or worse puddles in a silo, the team dies.

Teams have come in a wide variety of sizes. I have had organizations of many hundreds, sometimes consisting of several teams, but I would not call any of the organizations themselves a "team". Organizations can pull together through alignment to a common vision, a series of working groups and aligned teams, and occasionally "nested teams", where some folks have a natural knack for being members of, or straddling, more than one team with related goals, and can cross-pollinate both to create something that is far more than the sum of the parts. From experience, the team itself is usually between 6 and 12 people, preferably 8 (+/-2), which seems to follow a lot of the common anthropological data on effective human groups. A team with any more than 12 never seems to gel, as it allows people to overly specialize, or to hide or be overshadowed by others.

Above all, the most important item is having a common purpose or goal that everyone shares and believes in. The common thread, whether it be a project, a product, or a transformational mission, must be shared and believed in by everyone. That doesn't mean that people can't have different interpretations and opinions, even strong and contradicting ones. They also do not necessarily have to agree on the particular path. However, the belief in the high level end goal needs to be the same, and must be nearly a mantra. If it is not, then like the two year olds the team will never gel.

Service Engineering and Service Management: The Forgotten Elements of the Virtuous Circle of Success

The understanding of the importance of the customer's role in ensuring fit-for-purpose products has improved significantly over the past several years, as has, to some extent, the interaction between developers and the customer community. However, in the many years that I have been building highly scalable services, one of the biggest challenges I have faced is developing an appreciation and understanding of the roles that Service Engineering and Service Management play in developing and ensuring an effective service. Ensuring that these two groups are ever present in the conversation, either as separate entities or as roles within more traditional Engineering and Operations organisations, has rarely come easily. This oversight is one of the more costly sins that limit the ability of organisations to truly exploit and achieve what is possible with cloud services.

To help appreciate the functions these two roles play in the ongoing collaborative discussion that develops and improves services, let me describe what each in my experience has looked like when implemented effectively.

Service Engineering

Service Engineering is perhaps the most difficult to describe, as the functions of the role span a variety of traditional roles, including Release Engineering as well as IT Operations. Working in conjunction with the development community, this group develops the frameworks that transform bare metal and bits of byte code into reproducible, on-demand services that can flex within the constraints of the business. They are, in essence, the providers of the foundation of the service. They ensure that there is effective physical inventory, and that configuration information is authoritatively captured in a way that allows for reproduction as well as traversal for better understanding of dependencies. They design key elements of the datacenter, including networks, storage, and compute, and look for standardization and modularization where possible. They manage automated OS, hypervisor and application installation and configuration. They ensure effective application packaging, and consistent and repeatable builds and installations of the entire stack (a sketch of what such captured configuration can look like follows below). They usually understand the breadth and depth of the stack, and try to provide as much of an "ant farm" level of visibility into it as possible. They have a very keen eye towards service performance and tuning, and can often be found digging deeper into kernels and database design. They are sometimes called "Systems Engineers", "Reliability Engineers", "Release Engineers", "Operations Engineers", and "Infrastructure Engineers".
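
To illustrate the idea of authoritatively captured configuration that can be replayed, here is a minimal sketch; the role names, packages, and build steps are hypothetical and stand in for whatever framework a given shop actually uses.

```python
# Hypothetical role definitions: the authoritative record from which a host can be
# rebuilt from bare metal, and against which drift can be detected.
ROLES = {
    "web-frontend": {
        "os_image": "base-os-2010.11",
        "packages": ["httpd", "app-runtime"],
        "config":   {"listen_port": 80, "upstream": "app-tier"},
        "monitors": ["http_200_check"],
    },
    "db-primary": {
        "os_image": "base-os-2010.11",
        "packages": ["db-engine"],
        "config":   {"buffer_pool_gb": 16, "replica": "db-replica-1"},
        "monitors": ["replication_lag"],
    },
}

def build_plan(role_name):
    """Expand a role into the ordered, repeatable steps used to build (or rebuild) a host."""
    role = ROLES[role_name]
    steps = [f"image {role['os_image']}"]
    steps += [f"install {pkg}" for pkg in role["packages"]]
    steps += [f"configure {key}={value}" for key, value in role["config"].items()]
    steps += [f"enable monitor {mon}" for mon in role["monitors"]]
    return steps

if __name__ == "__main__":
    for step in build_plan("web-frontend"):
        print(step)
```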

Service Management

Service Management provides the management wrap around a service. They develop and tune monitoring systems, service desks, dashboards, and other items that provide a better understanding of performance and health, and that help capture and understand patterns that may be of use to the rest of the organisation, as well as to the customer. They help developers and service engineers develop hooks and harnesses that they can interrogate to better understand, collect, merge, interpret and roll up key aspects of the service (see the sketch below). They need to understand dependency trees, understand what is most meaningful to customers, engineering, and the business, and effectively represent it, as this information is critical to effective service exploitation, risk management, and contractual reporting. They understand and tune IT Operations support processes and change management, and must work especially closely with Service Engineering to capture and help coordinate any automated changes.
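
Here is a small, hypothetical sketch of that kind of hook and roll-up: each component exposes a health reading, and the readings are rolled up along the dependency tree into a view that is meaningful to the customer and the business. The component names and readings are assumptions for illustration only.

```python
# Hypothetical dependency tree and the latest readings from each component's hook.
DEPENDENCIES = {
    "checkout-service": ["payment-gateway", "inventory-db"],
    "payment-gateway":  [],
    "inventory-db":     [],
}

HOOK_READINGS = {
    "checkout-service": {"healthy": True,  "p99_ms": 180},
    "payment-gateway":  {"healthy": True,  "p99_ms": 320},
    "inventory-db":     {"healthy": False, "p99_ms": 45},
}

def rolled_up_health(service):
    """A service is only as healthy as itself plus everything it depends on."""
    reading = HOOK_READINGS[service]
    ok, worst_p99 = reading["healthy"], reading["p99_ms"]
    for dep in DEPENDENCIES[service]:
        dep_ok, dep_p99 = rolled_up_health(dep)
        ok = ok and dep_ok
        worst_p99 = max(worst_p99, dep_p99)
    return ok, worst_p99

if __name__ == "__main__":
    ok, p99 = rolled_up_health("checkout-service")
    print("customer-facing view:", "OK" if ok else "DEGRADED", f"(worst p99 {p99} ms)")
```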

Engagement in the Lifecycle

Both of these groups provide the key elements by which a service is provided and managed. However, they mostly produce what most would consider non-functionals, the items that are not very whiz-bang for the business and are simply assumed to be there by the customer. They are usually poorly understood, rarely invested in, and ignored like plumbing until it breaks. They are typically blindly squeezed as a cost line, and are sometimes viewed as a form of janitorial function that can do magic voodoo with computers. Business people often fear putting them in front of customers, afraid that they will scare them with their quirkiness, or that they might reveal something about the infrastructure that might terrify them. Good engineers understand their importance, but often do not have enough cycles to work with them while focused on more feature development. QA often misses that effective Service Management captures much of the same sorts of data, albeit usually unfortunately only in a production sense, that they need to assure the level of product quality that is expected. These Service teams themselves often underappreciate their own role if not shown the light.

By bringing these teams to the table, there is a much more effective grasp of the entire service lifecycle. Service Engineering can help understand the demands for flex and agility, and provide both the understanding and the groundwork and execution needed to deliver it. Service Management can help with understanding the services themselves, whether to better understand user experience, user exploitation, code and service quality, or performance and capacity, and present these back to each of the key audiences in the ways that are most meaningful to them. Working together with the rest of the players within the lifecycle, they can rapidly and continuously improve. They can reduce waste, and delight customers through collaboration and preemptive service management.

Friday 25 June 2010

Inventory of the Cloud: It’s Just Vapor Without It

Imagine you run a logistics business, complete with warehouses, lorries, ships and cargo aircraft. However, you quickly notice that your customers are completely up in arms. They are saying that it seems to take absolutely forever for deliveries, and even when they eventually are made, they are often delivered to the wrong place or are the wrong items. Your staff say they are moving everything as quickly as they can, but there just are not enough hands or vehicles to deal with the demand. They point to some of the warehouses where there are packages stacked high in the parking lot. Just at this point, a critical order from your most important client hits. You realize that not only do you not know where the inventory is, but you aren’t even sure where the vehicles to transport it are, or how much capacity each of those vehicles can hold.

There can be no argument that this is an extremely ineffective way to run a logistics business. Logistics requires knowing where everything is at any given time, whether it is product, capacity or capability. A mom-and-pop operation might be able to manage their operation in their head or with intuition, with perhaps only a couple sets of hands that handle a package from beginning to end. As soon as the number of moving parts increases beyond what one can easily see and count, and the number of actors involved increases beyond a small handful, tracking becomes difficult and the business can no longer be managed so informally.

I would argue that a Cloud service provider has exactly the same challenges. A traditional IT house can probably manage a small pile of applications on a few dozen servers with a small handful of people who all know each other. Tracking and managing rather simple services with a smattering of dependencies can usually be brute forced to resolution by restarting and tweaking various bits. However, the scale and dynamism of Cloud not only causes the number of dependencies to explode, but also makes it impossible for anyone to track the entire constantly changing dependency map in their head.

Much like a logistics company, what is required are means for managing and tracking all forms of inventory as they flow and change throughout the service lifecycle. The first form of inventory is facility capabilities. Logistics requires always knowing the various qualities of warehouses and distribution hubs. Logistics companies have learned, much like some more advanced Cloud companies, that warehouses and hubs need to flex to the needs of business dynamics. While there will always be some core facilities, large contracts or changes in regulations require the ability to flex capabilities up and down, either by building temporary facilities or by stationing container yards in strategic locations. Cloud requires knowing the location and capabilities, as well as the real-time capacities and environmentals, of datacenters. Flexing these sorts of facilities has become increasingly important with the shifting nature of the global economy and the regulatory landscape around data protection and taxation.

The second form of inventory is the capabilities of the workhorses of each of the respective industries. In logistics this would be the lorries, ships and aircraft. The capabilities, configurations and locations of these assets must be constantly tracked to allow the business to know how best to serve their customers, both when the capabilities are available and when they are carrying customer product. Missing a lorry with refrigeration capabilities in a particular location could mean the difference between whether or not a product is delivered fresh. If a ship rather than an aircraft is available, delivery speed is compromised in exchange for bulk. In Cloud, this second form of inventory would be the servers, virtual machines, networks, and storage. Failing to understand and constantly track at this level can hinder an organisation from maximizing its capabilities, or worse, lead to service failure.

The third form of inventory is the actual packages themselves. Whether this is a physical package as in logistics, or a software package and its configuration in the Cloud, these are both in fact little more than mechanisms that are dynamically configured to handle the actual items that the customer cares about. The contents of a single package can be meaningful in themselves, or only useful when bundled with other packages of relevant components. They come in all sorts of configurations, though standards exist in both industries (albeit much more nascent in IT) and are preferred. The quality and knowledge of those configurations, as well as the attributes attached to the package, whether in a shipping label or a software configuration, will heavily determine the level of success or failure of the job from the customer’s perspective. The customer can manage the quality of service by paying for higher or lower service levels, and can prioritize more important things over the less important.
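
To make the three forms of inventory a little more concrete, here is a minimal sketch of tracking them side by side; the fields, names, and the placement rule are assumptions meant only to show the shape of the problem, not a real inventory system.

```python
from dataclasses import dataclass

@dataclass
class Facility:            # first form: datacenters, their capabilities and environmentals
    name: str
    region: str
    power_kw_free: float

@dataclass
class Workhorse:           # second form: servers, VMs, networks, storage
    host: str
    facility: str
    cpu_free: int
    disk_free_gb: int

@dataclass
class Package:             # third form: software packages and their configuration
    name: str
    version: str
    placed_on: str = ""    # which workhorse currently carries it

def place(package, workhorses, cpu_needed, disk_needed):
    """Placement is only possible when capacity is actually tracked."""
    for w in workhorses:
        if w.cpu_free >= cpu_needed and w.disk_free_gb >= disk_needed:
            w.cpu_free -= cpu_needed
            w.disk_free_gb -= disk_needed
            package.placed_on = w.host
            return w
    raise RuntimeError("no tracked capacity available")

if __name__ == "__main__":
    dc = Facility("dc-west", "us-west", power_kw_free=120.0)
    fleet = [Workhorse("web01", "dc-east", cpu_free=2, disk_free_gb=40),
             Workhorse("web02", dc.name, cpu_free=8, disk_free_gb=200)]
    pkg = Package("billing-app", "1.4.2")
    print(pkg.name, "placed on", place(pkg, fleet, cpu_needed=4, disk_needed=100).host)
```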

All this might sound like a bit of a stretched metaphor, but there is one other detail that these two industries share. Both look to heavily track and tune all of these forms of inventory, and their optimisation throughout the system is absolutely key to success. However, while the customer might be curious about the details of the inventory, and may even have a hand in creating and managing the package, none of these things is what is perceived as the actual thing of value by the end customer. In fact, the customer only cares that the service was delivered on time, at an acceptable price point, with the quality and security that they would expect. Both industries must never lose sight of this, but all the same they must know that inventory and configuration management are their means for managing the service.

Sunday 20 June 2010

Cloud Doesn’t Necessarily Mean a Hypervisor is Involved

I spend a lot of time speaking with customers, vendors, analysts, and "cloud providers" about IT Cloud services. It never ceases to amaze me how many equate Cloud with a guest Operating System running under a VMWare or Xen hypervisor. This incredibly narrow view not only entirely misses the point of Cloud, but focuses on a particular instantiation that has an incredibly low barrier to entry into the market.

Let me provide an example from another field that most people would understand: transportation. Imagine you were to describe transportation. You might talk about moving people and/or things, individually or together, long or short distances, fast or slow, by air, ground and/or water. You might cover some of the various methods, like cars versus trains, ships versus barges, helicopters versus fixed wing aircraft; or the ways that they are powered, like petrol versus coal versus diesel versus wind or some such. All are valid ways to describe such a topic. One might argue that focusing in on any one segment could hinder your understanding of the topic.

Say that people started to equate transportation with a petrol powered bus, not only ignoring but eliminating all the other options out there. Immediately there would be problems. Some people would complain that a bus doesn’t help them easily move things that are not people, and that transportation must be a bad thing because sometimes buses get overcrowded, too slow, or don’t go where they want to go. Companies start labelling themselves “transportation leaders”, because they too can easily go out and get a bus and a driver, and add their own pinstriping or unique seats to their offering. Potential customers start to roll their eyes and say “transportation was so yesterday” because they too dabbled with buying their own bus but couldn’t see what all the excitement was about when they packed their employees in and out of it.

This might sound silly, and so does equating Cloud with a hypervisor. Cloud is the ability to consume a service, whether a simple infrastructure service like an OS on a hypervisor or a much more complex one like CRM or billing, on demand (i.e. as soon as one wants it, like electricity or telephony), as much or as little as one wants, when one wants it, knowing that it will work predictably within certain service level thresholds (electricity will be there, with the right wattage; bandwidth will be there with the agreed throughput, acceptable interference and uptime). The power really comes from the speed, cost, and flexibility with which those services can be provided. It needs to be as easy to build out one’s back office as simply going to a web page with a credit card and selecting where your customer information is (say, in Salesforce), tying it to logistics information (provided, say, by UPS) and accounts receivable (provided, say, by HSBC), into billing, inventory, and integration platforms that can flex as demand shrinks or grows. It should also be easy to immediately select and provision IT and telephony items for a new office, as well as to create and rapidly resize a new web presence in a far-flung market to support the waxing and waning of new marketing campaigns.

A guest operating system under a hypervisor can provide some of these capabilities, much like a petrol powered bus can provide some transportation capabilities. Cloud is far more than that.