Monday 20 September 2010

The Art of Data Classification and Management

Anyone who has tried to build and manage dynamic infrastructure and services knows that data handling is one of the most complex parts. While it is slowly becoming more obvious to people how to generalize and template infrastructure, operating systems, and application and service stacks, many still get stuck on how best to deal with the underlying data. Data feels unduly "sticky" to them. Its size often makes it difficult to move en masse on demand, it changes frequently, its nature is often poorly understood, and to people who treat systems as black boxes it is unpredictable in enough ways to make them nervous.

This perceived stickiness is a serious problem. It also highlights one of the biggest weaknesses of most companies that try to move into the cloud space: not understanding the nature and uses of the entire ecosystem. Being able to deploy, move around, and even virtualize systems is novel and potentially useful, but without a true end-to-end understanding and model of the service and how it is used, gaps form that limit the usefulness of the cloud or, worse, run roughshod over the service quality expectations of the user community. This increases the FUD around dynamic services and muddies the water, making it harder for people to see their real strengths.

Back to data. In my experience, the nature of data in any system can and should be understood, tracked, and put into classifications for better management and usability. One form of classification is the dynamism of the data. Is it static, either not changing at all or changing only very slowly over time? Things like birth dates fit into this category, as your birthday never changes. Is the data dynamic, meaning that it actively changes and the patterns of change can be captured, tracked, and replayed 90%+ of the time? Most common transactions in a database fit this pattern, which is why redo logs, and the heavy tuning of how they are applied and replayed on backup or replica databases, are such an important skill in an HA environment. Is the data interstitial, caught between the customer and the system or service that the customer is interacting with? This is the data that will almost certainly be lost if the underlying service goes away during the interaction, and it needs to be minimized to an amount the customer is willing to accept. Understanding the patterns by which customers interact with the system, and the states the system can be left in during an outage, not only helps set customer expectations but also allows for a much better understanding of how to dynamically manage an environment in a way that maximizes service quality, resilience, utilization, and cost effectiveness. Mapping the data across these classifications improves the understanding of the environment in ways that greatly increase the effectiveness of targeted data management techniques.
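
To make the dynamism classification concrete, here is a minimal Python sketch of tagging data elements as static, dynamic, or interstitial. The service, field names, and mappings are invented purely for illustration; they are not from any particular system.

from enum import Enum

class Dynamism(Enum):
    STATIC = "static"              # rarely or never changes (e.g. a birth date)
    DYNAMIC = "dynamic"            # changes actively, but the changes can be captured and replayed
    INTERSTITIAL = "interstitial"  # in flight between the customer and the service; lost on outage

# Hypothetical data elements from an imaginary order-processing service.
data_dynamism = {
    "customer.birth_date": Dynamism.STATIC,
    "orders.line_items":   Dynamism.DYNAMIC,       # recoverable from redo/transaction logs
    "checkout.cart_state": Dynamism.INTERSTITIAL,  # lost if the service dies mid-purchase
}

for element, kind in data_dynamism.items():
    print(element, "->", kind.value)

Even a simple mapping like this makes it obvious which data deserves replayable change capture and which data the customer must be prepared to lose.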

There are other classifications that may be useful as well. Is the data relational or block? Is it big (such as media) or small (simple text)? Is the data latency tolerant (RSS feeds) or intolerant (financial feeds)? Is it quickly processed with rapid access (a transactional system or distributed hash table) or batch oriented (such as data processing in a map-reduce or data warehouse cluster)? Is the service distributed globally or centrally located? Are the data's consistency requirements absolute, or is it resilient to lazy updates? Does it need to be secured, or are there other regulatory or legal constraints that need to be taken into account while storing and handling the data? Each of these, and others, helps with understanding the end-to-end system and allows for a much more targeted battery of approaches and tools that can be used to manage the environment effectively. Often I build a matrix of sorts that lets people map and understand the nature of the various types of data, to assist with such targeting and to break the mindset that all data is created (and thus should be treated) equally, or the worse and more common mistake that it must all be treated at the same lowest common denominator.
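
As a rough sketch of the kind of matrix described above, the Python below lays out one row per data type against the classification axes just listed. The rows, axis values, and handling notes are invented examples under assumed names, not prescriptions for any real environment.

from dataclasses import dataclass

@dataclass
class DataProfile:
    name: str         # what the data is
    dynamism: str     # static / dynamic / interstitial
    shape: str        # relational / block
    size: str         # big / small
    latency: str      # tolerant / intolerant
    access: str       # rapid / batch
    locality: str     # global / central
    consistency: str  # absolute / lazy-update tolerant
    regulated: bool   # security, legal, or regulatory constraints

# An illustrative matrix; every entry here is hypothetical.
matrix = [
    DataProfile("customer profile",   "static",       "relational", "small", "tolerant",   "rapid", "global",  "lazy",     True),
    DataProfile("order transactions", "dynamic",      "relational", "small", "intolerant", "rapid", "central", "absolute", True),
    DataProfile("media assets",       "static",       "block",      "big",   "tolerant",   "batch", "global",  "lazy",     False),
    DataProfile("session cart state", "interstitial", "relational", "small", "intolerant", "rapid", "central", "lazy",     False),
]

# Targeted handling falls out of the matrix instead of one lowest-common-denominator policy.
for row in matrix:
    if row.dynamism == "interstitial":
        print(row.name, ": minimize it, and set customer expectations for loss")
    elif row.dynamism == "dynamic":
        print(row.name, ": capture and replay change logs to replicas")
    else:
        print(row.name, ": replicate lazily with cheap, widely distributed copies")

The point of the exercise is not the particular columns but the conversation it forces: once each data type has its own row, treating everything identically stops looking like a safe default.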