Analytics
This is one of the fastest growing and most exciting areas of information management. The combination of Analytic Techniques, Big Data, AI and Real-Time processing (amongst others) means that its potential is only just being understood.
It is so vast and moves so fast that this site will have to gradually increase its coverage of the area over time.
Analytics Over Time (a personal view, but with a point).
Over my IT career I have been closely involved with Analytics on a number of occasions. Over the years the capabilities and the prevailing lines of thought have changed. It's worth describing some of these shifts to provide insight into where we are currently and the lessons we can learn from them.
Data Warehousing; capture everything and you will find gold.
I was involved in the early days of data warehousing, building an Oracle database of energy trading data.
This was quite new and we used some interesting techniques to get the huge (tiny by today's standards) amounts of data into it. We built the core warehouse in 3rd normal form (more or less) so it was possible to query the core warehouse from many different angles and still get the right answer. There was an underlying assumption that hidden in this data set were many pieces of data gold waiting to be mined out. We built more focussed data marts to cover specific areas, but the source of value was felt to be the core warehouse itself.
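To make the idea concrete, here is a minimal sketch of a normalised core, using SQLite and invented table names rather than the original Oracle schema. The point it illustrates is why a (roughly) 3NF core can be sliced from different angles and still agree with itself, because each fact is stored only once.

```python
# A minimal sketch of a normalised (roughly 3NF) core, using SQLite for brevity.
# The schema and data are illustrative only, not the original trading warehouse.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE counterparty (cp_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product      (prod_id INTEGER PRIMARY KEY, commodity TEXT);
CREATE TABLE trade        (trade_id INTEGER PRIMARY KEY,
                           cp_id INTEGER REFERENCES counterparty(cp_id),
                           prod_id INTEGER REFERENCES product(prod_id),
                           trade_date TEXT, volume REAL, price REAL);
""")
con.executemany("INSERT INTO counterparty VALUES (?, ?)",
                [(1, "Acme Power"), (2, "Northern Gas")])
con.executemany("INSERT INTO product VALUES (?, ?)",
                [(10, "electricity"), (11, "gas")])
con.executemany("INSERT INTO trade VALUES (?, ?, ?, ?, ?, ?)",
                [(100, 1, 10, "2024-01-05", 50.0, 42.10),
                 (101, 2, 11, "2024-01-05", 80.0, 31.55),
                 (102, 1, 11, "2024-01-06", 20.0, 30.90)])

# Because each fact is stored once, different "angles" agree with each other:
by_counterparty = con.execute("""
    SELECT c.name, SUM(t.volume * t.price) AS exposure
    FROM trade t JOIN counterparty c ON c.cp_id = t.cp_id
    GROUP BY c.name""").fetchall()

by_commodity = con.execute("""
    SELECT p.commodity, SUM(t.volume * t.price) AS exposure
    FROM trade t JOIN product p ON p.prod_id = t.prod_id
    GROUP BY p.commodity""").fetchall()

print(by_counterparty)  # totals sliced by counterparty
print(by_commodity)     # same underlying facts, sliced by commodity
```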
Data Warehousing Without the Warehouse; you know where the gold is, go straight to it and cut out the expensive middleman
After a time the focus shifted away from the 3NF core data warehouse; those data mining initiatives didn't find much gold after all. Instead the data marts became the focus. Analytics had proved its worth and this was where the value was to be found. While it was still deemed necessary to understand the data in 3NF in a single data model, you didn't have to implement that model; you only used it to derive the data for your marts and then put the data straight in there: quicker and cheaper, but still effective.
The models the marts were based on tended to be enterprise-wide, focussing on reuse and on exacting data quality.
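By contrast with the 3NF core above, the "straight to the mart" approach looked roughly like the sketch below: the enterprise model exists only as a mapping document, and a wide, denormalised table is loaded directly. The table, columns and figures here are purely illustrative.

```python
# A minimal sketch of loading a denormalised mart directly, with no physical
# 3NF core in between. Names and numbers are illustrative, not from any project.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE trading_mart (     -- one wide, pre-joined table per subject area
    trade_date   TEXT,
    counterparty TEXT,
    commodity    TEXT,
    exposure     REAL           -- volume * price, pre-calculated on load
)""")

# The ETL derives these rows from the source systems, using the enterprise
# model only as the mapping document.
rows = [
    ("2024-01-05", "Acme Power",   "electricity", 2105.0),
    ("2024-01-05", "Northern Gas", "gas",         2524.0),
    ("2024-01-06", "Acme Power",   "gas",          618.0),
]
con.executemany("INSERT INTO trading_mart VALUES (?, ?, ?, ?)", rows)

# Reporting queries become single-table scans: quicker and cheaper to build,
# at the cost of only answering the questions the mart was designed for.
print(con.execute(
    "SELECT commodity, SUM(exposure) FROM trading_mart GROUP BY commodity"
).fetchall())
```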
During this period query tools became mainstream and part of everyday working life.
Big Data, the precocious upstart
As time went on, the way it tends to, I became aware of a different way of doing analytics, with slightly different aims and a different paradigm. The idea was that the amount of data we are generating is growing exponentially (true) and that if we could capture and analyse it we could answer practically any question. A limiting factor up to that point had been storage and compute power, but technology and pricing had changed such that it was feasible to buy a lot of small servers and run them in parallel, providing the storage and power to churn through huge datasets and produce analytics that nobody had been able to produce before. The advent of Hadoop MapReduce and the associated HDFS storage made this possible.
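For anyone who never met it, the pattern Hadoop popularised looks roughly like this single-machine sketch. The records and keys are invented; real Hadoop distributed the map and reduce steps across the cluster, with the data blocks sitting in HDFS.

```python
# A minimal, single-machine sketch of the MapReduce pattern that Hadoop ran
# across many small servers; the dataset and keys are illustrative only.
from collections import defaultdict
from itertools import chain

# Pretend each partition lives on a different node in HDFS.
partitions = [
    ["gas buy", "power sell", "gas sell"],
    ["power buy", "gas buy"],
]

def map_phase(record):
    """Emit (key, 1) pairs -- here, counting trades per commodity."""
    commodity, _side = record.split()
    yield (commodity, 1)

def reduce_phase(key, values):
    """Combine all values seen for one key."""
    return key, sum(values)

# Map: run independently over every partition (in Hadoop, in parallel per block).
mapped = chain.from_iterable(map_phase(r) for p in partitions for r in p)

# Shuffle: group intermediate pairs by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: one reducer per key produces the final counts.
print(dict(reduce_phase(k, v) for k, v in grouped.items()))
# {'gas': 3, 'power': 2}
```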
From Slow Burn to Being the Norm
Gradually ideas from the big data world started breaking into the normal analytics discourse. Other technologies came along that could exploit different aspects of this data and apply new statistical techniques and AI. The speed of data that could be analysed increased to near real time and then, with predictive analytics, almost into the future. These became the norm; data lakes were being filled everywhere.
The Punk Takes Over
Along with the rise of big data and associated technologies came a new attitude towards managing data. The old methods of managing data were - in some quarters - seen as pointless, a brake on developing analytics and delivering value. The old rules didn't matter: find the data and put it together as quickly as possible to unlock value ahead of your rivals.
The Punk Trips Over
Some of the drawbacks of this approach gradually became apparent. Analyses weren't matching or consistent, giving inaccurate results, and data could be missed when it should have come from two sources but only one was plugged in. Industrialising hastily constructed data feeds often took much longer than it took to construct the analytics. It was hard to know exactly what was meant by some results because none of it was consistent or written down. One bank I worked at implemented a beautiful big data platform and poured system after system into it. One day, and many millions of pounds later, the business asked if they could get data out of it. The answer was no. Nobody knew what the data from each of the eagerly loaded systems actually was, and so nobody knew the levels of security to be applied. The whole thing had to be torn down and restarted because of a fundamental failure in Information Management.
A More Balanced Approach
I think we are now approaching a more balanced view of how to make the most practical use of an analytics platform, using a hybrid approach. The notion of a Lakehouse is becoming quotidian, with an old-school warehouse as part of it for the data that needs to be accurately analysed from multiple sides or reused across different analytics approaches. A big data repository also forms part of the Lakehouse, to take the huge and often unstructured data that might come from a social media feed or a database log. An area for real-time analytics is usually present too, bringing analytics away from reports and considered decisions and into the headphones of the call centre staff, suggesting a best course of action for the client they are dealing with at that moment.
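As a toy illustration of that hybrid idea (the zone names and routing rules below are my own, not any vendor's product), the decision about where a dataset lands might look something like this:

```python
# A toy sketch of routing datasets to the zones of a hybrid "Lakehouse".
# Zone names and rules are my own illustration, not a particular product.
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    structured: bool            # does it fit a governed, modelled schema?
    velocity: str               # "batch" or "streaming"
    needs_reconciliation: bool  # must different views of it agree exactly?

def choose_zone(ds: Dataset) -> str:
    if ds.velocity == "streaming":
        return "real-time zone"   # feeds next-best-action style analytics
    if ds.structured and ds.needs_reconciliation:
        return "warehouse zone"   # modelled, governed, multi-angle queries
    return "lake zone"            # raw or semi-structured, schema-on-read

feeds = [
    Dataset("general_ledger", structured=True, velocity="batch", needs_reconciliation=True),
    Dataset("social_media", structured=False, velocity="batch", needs_reconciliation=False),
    Dataset("call_centre_events", structured=True, velocity="streaming", needs_reconciliation=False),
]
for f in feeds:
    print(f.name, "->", choose_zone(f))
```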
What has - to my mind - come to be understood is that, while you can break the rules, you have to know that you are doing it and understand how that affects the types of analytics you can do with the data.
For example, you need pretty much perfect data to manage your accounts and financial reporting. But for sentiment analysis from a social media feed, less so; indeed, you can't get that level of perfection from most social media feeds, but you can still get some really valuable analytics.
So, the principles that I believe are required as a minimum are that you document:
- What each data item is; its definition
- Where it came from; its lineage
- Who to ask questions of it or get decisions made about it
- What the quality of that data is
- What the format and velocity of that data is
- What level of security needs to be applied to it
Not much to ask, and there could always be more, but it's not a bad catalogue to keep us out of trouble.
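As a rough sketch of how that catalogue might be recorded for a single data item (the field names simply mirror the list above and aren't taken from any particular cataloguing tool):

```python
# A minimal sketch of a catalogue entry covering the points listed above.
# Field names and the example values are illustrative only.
from dataclasses import dataclass

@dataclass
class CatalogueEntry:
    name: str             # the data item
    definition: str       # what it is
    lineage: str          # where it came from
    steward: str          # who to ask questions of / get decisions from
    quality: str          # known quality level or caveats
    format_velocity: str  # format and how fast it arrives
    security: str         # classification / handling rules

entry = CatalogueEntry(
    name="customer_sentiment_score",
    definition="Model-derived sentiment (-1 to 1) per customer interaction",
    lineage="Social media feed -> sentiment model -> lake zone",
    steward="Head of Customer Insight",
    quality="Best effort; source feed is incomplete and unverified",
    format_velocity="JSON events, near real time",
    security="Internal only; no direct identifiers retained",
)
print(entry)
```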