A Quick Guide to Data Catalogue Prototyping
What is Data Cataloguing?
As the name suggests, data cataloguing is the practice of identifying all of the business’s data before recording information about those data in an organised inventory. Overall, a data catalogue gives us a clear picture of the whole data landscape. In a sense, it’s like doing a data ‘stock-take’.
Why Catalogue Data?
A data catalogue is a powerful tool which enables people to discover, understand and trust the data they need, when they need it. People who need to use data in their jobs cannot do those jobs effectively when they struggle to find accurate, complete and trustworthy data. In those circumstances, people often spend more time searching for data than actually using data to generate analyses and valuable insights. When people struggle to find and access trusted data, the organisation can suffer from poor decisions, a slow pace of activity and restrictions on growth and competitiveness.
This picture, courtesy of Bloor Research, nicely illustrates the situation that a data catalogue aims to remedy. Quite simply, the less time people spend searching for data, the more time they will have to use and share data, which are of course the activities that generate data-driven benefits.
Image courtesy of https://www.bloorresearch.com/technology/data-catalogues/
Why Start With a Prototype?
In short – to see what we’re dealing with before jumping in with expensive tools and lengthy approaches! For businesses that haven’t yet started cataloguing their data, a prototype data catalogue provides a way of quickly and inexpensively getting started by gaining clarity on what data the business has, what it’s about, where it is and why the business has it. This commonly proves to be beneficial by:
- Prioritising the scope and focus of subsequent data cataloguing, to avoid time and cost wasted on ‘digging around’ in non-critical datasets.
- Informing decisions about the longer-term use of commercial data cataloguing tools, to optimise their value and cost-effectiveness.
- Fostering engagement and buy-in from key stakeholders, to accelerate data catalogue development and time-to-value.
Businesses that have already begun cataloguing their data can also benefit from applying prototyping techniques to fill gaps in their data catalogue deployment, for instance by quickly gaining a ‘big picture’ view of the data landscape, which drives ‘checks and balances’ on progress and reinforces planning and stakeholder communications.
How Does a Prototype Data Catalogue Work?
A prototype data catalogue needs to encompass just three layers of information:
- Layer 1 – Data Relationships.
- Layer 2 – Data Flows.
- Layer 3 – Data Lineages.
Data relationships are simply the data links that exist between the various systems and processes in the business. For example, we may know or see that data moves between our website and CRM system. That’s a data relationship. Think of a data relationship as a data ‘pipe’. At this first stage we’re focused on finding all of these ‘data pipes’, but we’re not yet interested in what’s in those pipes. That comes next…
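As a rough sketch of this first layer, the ‘data pipes’ can be recorded as nothing more than pairs of systems. Only the website-to-CRM link comes from the example above; the other system names and the helper function are assumptions made up for illustration.

```python
# Layer 1 sketch: each data relationship is an unordered pair of systems (a 'pipe').
# System names other than Website/CRM are hypothetical.
relationships = {
    frozenset({"Website", "CRM"}),
    frozenset({"CRM", "Finance system"}),
    frozenset({"Website", "Payment gateway"}),
}


def systems_linked_to(system, relationships):
    """Return the systems that share a data relationship with `system`."""
    linked = set()
    for pair in relationships:
        if system in pair:
            # Add the other end of the pipe.
            linked |= pair - {system}
    return sorted(linked)


print(systems_linked_to("Website", relationships))
```

Note that nothing here says what travels through each pipe, or in which direction. At this stage that is deliberate.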
Data flows are the actual movements of data through each data relationship. When we analyse data flows, we’re looking inside each of the ‘data pipes’ to see what kinds of data are moving and in which direction. Data flows are defined by their topic or ‘theme’. That is to say, what the data are about. Examples of data topics are ‘customer’, ‘payment’, ‘address’, and so on.
We don’t want to get too detailed, as that would mean we end up taking too long and creating a data dictionary rather than a data catalogue. So for example, while identifying that some data relationships contain data about an ‘address’, we won’t zoom in to a level of detail that recognises ‘house number’, ‘postcode’, and so on. We would however recognise the difference between ‘customer address’ and ‘supplier address’, as they are two different topics of data, albeit similar ones.
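To make the second layer concrete, here is a minimal sketch in which each data flow records a direction (source and target) and a topic, and nothing more detailed than that. The `DataFlow` class, the `topics_between` helper and the system names are all hypothetical, chosen only to illustrate the level of detail described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFlow:
    source: str  # system the data leaves
    target: str  # system the data arrives at
    topic: str   # what the data are about, e.g. 'customer address'


# Hypothetical flows. Note there is no 'house number' or 'postcode':
# that level of detail belongs in a data dictionary, not a catalogue.
flows = [
    DataFlow("Website", "CRM", "customer"),
    DataFlow("Website", "CRM", "customer address"),
    DataFlow("Procurement", "Finance system", "supplier address"),
    DataFlow("CRM", "Website", "customer"),
]


def topics_between(source, target, flows):
    """List the topics flowing from `source` to `target`, in that direction only."""
    return sorted({f.topic for f in flows if f.source == source and f.target == target})


print(topics_between("Website", "CRM", flows))
```

Keeping ‘customer address’ and ‘supplier address’ as separate topic values preserves the distinction between the two without breaking either one down into fields.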
Data lineages are a number of data flows arranged in a sequence, according to a specific scenario. We use them to understand the whole ‘upstream’ and ‘downstream’ big picture of how data moves in a particular situation. A very simple example would be a customer online order process, where we would expect to see that the first data flow is the customer’s selection of a product into the basket, the second data flow would be the customer’s details during checkout, and the third data flow might be the customer’s payment details, before the process completes with a fourth flow of order confirmation data.
In reality, the data lineage would very likely be more complex than that, but the principle is the same. We want to see the sequence in which data flows occur, so that we can understand the ‘upstream’ and ‘downstream’ states of the data involved in the process. Data lineages can be quite time-consuming to analyse, so unlike a data catalogue, which is created as a holistic foundation of our data understanding, data lineages are typically analysed only when necessary.
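The online order example can be sketched as an ordered list of flows, from which the ‘upstream’ and ‘downstream’ parts fall out by position. The four topics follow the example above; the system names and the `upstream_and_downstream` helper are assumptions for illustration only.

```python
# Layer 3 sketch: a lineage is an ordered list of data flows for one scenario.
# Each flow is (source, target, topic); the system names are hypothetical.
online_order_lineage = [
    ("Customer", "Website", "product selection"),
    ("Customer", "Website", "customer details"),
    ("Customer", "Payment gateway", "payment details"),
    ("Website", "Customer", "order confirmation"),
]


def upstream_and_downstream(lineage, topic):
    """Split a lineage into the flows before and after the flow carrying `topic`."""
    topics = [flow[2] for flow in lineage]
    position = topics.index(topic)  # raises ValueError if the topic is absent
    return lineage[:position], lineage[position + 1:]


upstream, downstream = upstream_and_downstream(online_order_lineage, "payment details")
print("Upstream:", [flow[2] for flow in upstream])
print("Downstream:", [flow[2] for flow in downstream])
```

A real lineage may branch or loop rather than run in a straight line, in which case a simple list like this would no longer be enough.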
How is a Data Catalogue Prototyped?
Quite simply it’s a matter of gathering the right information and knowledge, and then recording and presenting it in ways that are beneficial.
To create a prototype data catalogue, knowledge and information about data are gathered through analysis of IT infrastructure, and through consultation with business and IT stakeholders. And because we’re just prototyping, we can use simple spreadsheets and diagrams. This will allow us to understand a lot more about the size and shape of data in the organisation, and what we need from a data catalogue long-term. From there, we can use what we’ve learnt from the data catalogue prototype to make well-informed decisions about the cost-benefits of using more advanced or specialised data cataloguing tools and technology.
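If the gathered information is held in a script rather than typed straight into a spreadsheet, one possible approach is to write it out as CSV, which any spreadsheet tool can open. This is only a sketch: the column names, the sample rows and the `catalogue_to_csv` helper are hypothetical.

```python
import csv
import io

# Hypothetical catalogue entries: one row per data flow.
rows = [
    {"source": "Website", "target": "CRM", "topic": "customer"},
    {"source": "Website", "target": "Payment gateway", "topic": "payment"},
]


def catalogue_to_csv(rows):
    """Render catalogue rows as CSV text that a spreadsheet tool can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["source", "target", "topic"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(catalogue_to_csv(rows))
```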
Here’s a model which illustrates how it all comes together.