November 10
MODERN DATA PLATFORM ARCHITECTURE - PART 1

PART 1 - EVOLVING OUT OF TRADITIONAL DATA WAREHOUSING APPROACHES

Moore’s law tells us that computational power doubles roughly every two years, and hence computing and storage costs are expected to keep falling - all the time.

Big Data is a computing paradigm that was created specifically to deal with the challenges of the very large amounts of sprawling data being created; parallel computing is used to handle the disparate volume, velocity and variety of data at the expense of certain aspects such as atomic integrity. (This isn’t a blog about Big Data, so we’ll move on quickly from there.)

This perception of ever-growing compute power, together with cheaper and faster storage and access, has led some corporations to be careless with how they treat their data. (At a 30,000ft level, this can be summed up as “why sort, just search” - the old practice of categorising data can’t keep up.)

However, as human interaction behaviours and organisational value chains transition more towards the digital realm, the magnitude of the data challenge (i.e. how to make use of it and extract value from it) continues to grow.

A paper by IDC (International Data Corporation) estimates that global data usage will grow from 33 zettabytes to 175 zettabytes between 2018 and 2025. This represents a ~27% per annum growth rate (not quite doubling, but close).
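
As a quick sanity check on that figure, the implied compound annual growth rate can be worked out from the IDC numbers above in a couple of lines:

```python
# Compound annual growth rate implied by the IDC forecast:
# 33 ZB in 2018 growing to 175 ZB in 2025 (a 7-year span).
start_zb, end_zb, years = 33, 175, 2025 - 2018

cagr = (end_zb / start_zb) ** (1 / years) - 1
print(f"Implied growth rate: {cagr:.1%} per annum")  # ~26.9%, i.e. roughly 27%
```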

So in short, organisations are not, and never will be, entirely free of the need to apply some information or data management discipline to their business intelligence and decision-making capabilities.

Information Pyramid.png

Large organisations contain a multitude of source operational / transaction systems whose results need to be rolled up into group-level summaries, both for statutory / regulatory reporting purposes and for business performance reporting. Although using the same underlying data, reports for different purposes may roll up through different hierarchies.

For example, business performance management (BPM) may want a product profitability view with product categorisation definitions that differ from regulatory definitions.

To date, there have been two predominant approaches to data warehousing: Bill Inmon’s top-down approach and Ralph Kimball’s bottom-up approach. These approaches share the same goal: find out what’s going on in your organisation.

Kimball Approach.png

You can think of Kimball’s approach as asking divisions to submit a set of metrics to head office and leaving it up to them to collate that information. The benefits of this approach include:

  • The group-level data mart only needs to contain a set of summary and curated data and is not overloaded with too-fine-grain data that may adversely impact data warehouse performance.
  • It leaves divisions free to use and control their divisional data marts. For example, divisions may want a more real-time view of their sales, but only need to report to group once a day or week etc.
  • Business / product divestments are easy, as divisional data marts can be included / excluded rather than replumbing numerous source systems.

The key downside is the potential, and frequent occurrence, of overlapping data marts with duplicate data that follow different definitions, causing data integrity issues. Furthermore, it is sometimes difficult to prescribe consistent metric and data definitions across federated data warehouse and integration teams. The difficulty in determining the number of unique customers an organisation has is a case in point.
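
The unique-customer problem can be made concrete with a minimal sketch (mart names and customer IDs are hypothetical): summing each mart’s distinct counts double-counts any customer who appears in more than one mart.

```python
# Hypothetical customer IDs held by two overlapping divisional data marts.
retail_mart = {"C001", "C002", "C003"}
wealth_mart = {"C002", "C003", "C004"}

# Naive roll-up: sum each mart's distinct count -> double-counts overlaps.
naive_total = len(retail_mart) + len(wealth_mart)   # 6

# A correct group-level view requires deduplicating across marts.
true_total = len(retail_mart | wealth_mart)         # 4

print(naive_total, true_total)
```

At enterprise scale the deduplication is far harder than a set union, because the marts rarely share a common customer key in the first place - which is exactly the consistency problem described above.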

INMON APPROACH TO DATA WAREHOUSING

The Inmon approach, by contrast, seeks to take direct feeds that are subject-oriented; that is, to take all customer and product data from each data source and integrate these subject areas together.

Furthermore, the Enterprise Data Warehouse is time-variant and non-volatile; it takes periodic snapshots so point-in-time changes can be compared, and the data is typically read-only.
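
A minimal sketch of what time-variant and non-volatile mean in practice (the structures and values are hypothetical): each load appends a dated snapshot that is never modified afterwards, so point-in-time comparisons remain possible.

```python
from datetime import date

# Time-variant, non-volatile store: each load appends a snapshot keyed by
# date; existing snapshots are never updated in place (read-only history).
snapshots = {}

def load_snapshot(as_of, rows):
    assert as_of not in snapshots, "snapshots are read-only once written"
    snapshots[as_of] = rows

load_snapshot(date(2023, 1, 31), {"C001": {"balance": 100}})
load_snapshot(date(2023, 2, 28), {"C001": {"balance": 140}})

# Point-in-time comparison across two snapshots.
jan = snapshots[date(2023, 1, 31)]["C001"]["balance"]
feb = snapshots[date(2023, 2, 28)]["C001"]["balance"]
print(f"Balance change: {feb - jan}")  # 40
```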

The primary benefit of this approach is that large amounts of data can be brought into an enterprise repository (an Enterprise Data Warehouse / EDW) and downstream users (typically Finance / Risk / Strategy folk) can create a broad variety of consistent, broadly applicable metrics without having to go back to the transactional systems.

The downsides of this approach are:

  • It takes a long time to source data from all systems across the enterprise
  • Utility of the EDW comes very late, i.e. it requires a critical mass of data to be ingested
  • Introduction of new features (derived from source system attributes) is costly, as it may require a large set of source files to be revisited
  • A highly skilled team is required to perform data modelling in a way that suits enterprise and divisional BI users (this is the hardest part to get right)

These two approaches have essentially one key difference: to pre-aggregate or not to pre-aggregate…that is the question.
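
That one difference can be sketched in a few lines (the divisions, amounts and structures are hypothetical): a Kimball-style mart submits pre-aggregated summaries to group, while an Inmon-style EDW keeps the transaction grain and aggregates on demand.

```python
from collections import defaultdict

# Hypothetical sales records at transaction grain.
sales = [
    {"division": "retail", "amount": 10},
    {"division": "retail", "amount": 15},
    {"division": "wealth", "amount": 20},
]

# Kimball-style: each division pre-aggregates before sending to group,
# so the group mart only ever sees the summary.
pre_aggregated = {"retail": 25, "wealth": 20}

# Inmon-style: the EDW retains the grain and aggregates on demand,
# so the same detail can later serve other roll-ups.
on_demand = defaultdict(int)
for row in sales:
    on_demand[row["division"]] += row["amount"]

assert dict(on_demand) == pre_aggregated
```

The results agree here, but only the grain-retaining store can answer a new question (say, a different product hierarchy) without going back to the source systems.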

CHALLENGES WITH A TRADITIONAL DATA WAREHOUSE

Unfortunately, I would say that both of these approaches - and indeed the notion of a standalone traditional data warehouse - are inadequate. The shortcomings include:

  1. In the OLAP world, data largely flows only one way in batch fashion and is disconnected from the OLTP or operational world
  2. Lack of ability to merge streamed data with batch (enterprise holistic) data to form a complete but real-time view
  3. Modelled, conformed and integrated data serves to simplify data volumes and standardise data definitions, but is inflexible to change and obfuscates detail
  4. Obfuscation of granular data reduces future data utility and limits further feature engineering and AI/ML approaches to predictive model building
  5. Lack of governance and control over data warehouse outputs, i.e. they are not seen as information assets or products. E.g. there is a need to track a particular output data set for retention
  6. Coarse-grain data controls over privacy, user access and records management mean obligations such as GDPR’s right to be forgotten are difficult, if not impossible, to enforce.

In the next edition we will cover new data integration patterns and platform architectures.

Check back here shortly for the next edition or get in touch with our team to find out more.
