Data Lineage and Data Quality: Two Vital Elements for Enterprise Success

“An enterprise CEO really ought to be able to ask a question that involves connecting data across the organization, be able to run a company effectively, and especially, to be able to respond to unexpected events,” said Ian Rowlands, speaking at the DATAVERSITY® Enterprise Data World 2016 Conference. In his presentation, he discussed how Data Lineage has become an essential tool for helping everyone in the enterprise, at all levels, get the most value from its data:

“Most organizations are missing this ability to connect the data together. It used to be really hard to get CEOs to care about data. A trend that we’ve seen increasingly over the past several years is that CEOs have been getting more and more irritated at the inability to get value out of their data.”

Rowlands promised big rewards for those who can confidently deliver accurate results. He said, “If you do everything that I [recommend] in this 40 minutes or so, you will have an extremely robust environment in which to be very confident about the results that you deliver.” Rowlands is the Vice President of Product Management for ASG Software Solutions.

“There’s been a shift from ‘we’ve got to do this data stuff so we don’t get into trouble,’” said Rowlands, “to ‘I want to get value for all the money I’m spending on data.’ So that’s the challenge. And for me it comes down to ‘how do you ensure the quality of end results?’”


Rowlands paraphrased Sir Timothy Berners-Lee, to illustrate: “We can launch a rocket into space, we can bring it back and land on a barge, but we can’t get two numbers to match on a report.”

Rowlands cited his extensive career in the data industry, recalling the phrase “garbage in, garbage out,” which has long been considered truth: get good data in, and you’ll get good results out. He remarked:

“Only, it isn’t so! All too often, there is data which has been certified as ‘good quality source data,’ and yet by the time the results crop up on a report, the rocket is missing the barge and splashing into the sea.”

Data moves. It changes. It gets misunderstood. Before you can guarantee good results, these “roadblocks” must be removed.

Roadblocks to High Quality Data

  • Data Movement: Whenever data is moved, he said, there is potential for things to go wrong. “If all we do is ensure that the data is good at the entry point, but we don’t continue checking it throughout the flow from beginning to end, then we’re at risk.” Understanding results depends on knowing what moved where, and when.
  • Data Transformation: “Data doesn’t just move – it changes.” Aggregation, manipulation, and ETL all occur along the way, so it’s important to know what processes are at work, he said.
  • Data Interpretation: Sometimes the data is right but the results are wrong because no one understands what the data is saying, Rowlands noted. Data can be created for one purpose and used for another, or created by one team using one set of terms and used by another team that uses different terminology. “It’s extremely important that we maintain a connection between the physical data and the business assets. Otherwise, this gets to be very tricky.”
  • Data Selection: Knowing which data is critical, which subsets are important, and how the data is categorized – all of these are important to ensure consistency across the enterprise, he said. “You absolutely do have to know which things are important.”
  • Broken Data: “Sometimes the data really is broken.” Good Data Quality tools are now available, he said, that can provide rules for accuracy, consistency, conformity, completeness, timeliness, and uniqueness anywhere in the Data Lineage (a rough sketch of such rules follows this list). “These things can go wrong at any point in the movement of data from one place to another.” The issue gets even more complicated with Big Data, or in any case where the use of the data is not determined until after it has been ingested: if you build a Data Lake and only later start deciding how you’re going to use that data, the quality of that data depends in part on the use to which you put it.
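
To make those dimensions concrete, here is a minimal Python sketch of how such rules might be expressed; the record fields and the two rules are hypothetical, not drawn from any particular product:

    from dataclasses import dataclass
    from typing import Callable

    # A rule pairs one quality dimension (accuracy, consistency,
    # conformity, completeness, timeliness, uniqueness) with a check
    # that can be run at any point in the movement of data.
    @dataclass
    class QualityRule:
        dimension: str
        description: str
        check: Callable[[dict], bool]  # True if the record passes

    # Hypothetical rules for a customer record.
    RULES = [
        QualityRule("completeness", "customer_id must be present",
                    lambda r: bool(r.get("customer_id"))),
        QualityRule("conformity", "country must be a 2-letter ISO code",
                    lambda r: len(r.get("country", "")) == 2),
    ]

    def audit(record):
        """Return the description of every rule the record violates."""
        return [rule.description for rule in RULES if not rule.check(record)]

    print(audit({"customer_id": "C-001", "country": "USA"}))
    # -> ['country must be a 2-letter ISO code']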

“If you don’t move these [roadblocks] out of the way, the chances of the CEO being able to connect the dots in his enterprise are fairly sparse,” said Rowlands.

Taking a “Deep Dive” into Your Data

Rowlands presented seven critical areas to examine, showing how those areas might be classified according to four different phases of development: chaotic, initial, progressive and dynamic.

  • The chaotic phase is considered “incidental,” as in “being driven by something happening” or a mad scramble to find out what’s going on. For 25-30 years that was the norm.
  • The initial phase is structured by applications: for certain key applications with clear owners, there is an understanding of how data moves around.
  • In the progressive phase, it’s an organized program, a disciplined approach by domain or by line of business that says, “we are going to select the critical data elements, and bit by bit we are going to build a view of a data landscape in which we have a full understanding of what’s happening,” said Rowlands.
  • The dynamic phase is where, “automatically, whenever there is a change to a business issue that affects data, Data Lineage is recalculated.”

Areas to Examine

These core areas can be used to “build out a framework that says, ‘here are the things we consider as essential to Data Intelligence.’” Data Intelligence is the practice that allows you to make better decisions and respond more rapidly to business opportunities and regulations. “Once the assessment is done,” he says, you get to “decide where you are, and where you actually want to be.”

1. “Where” Data Lineage

To illustrate the “where” of Data Lineage, Rowlands provided a use case with data in multiple places, languages and states:

  • Discover a Cross-Platform Inventory
    • Mainframe
    • Distributed
    • Hadoop
    • Models, Applications, ETL Databases, Warehouses, Data Lakes, BI Reports, Data Quality
    • Rule-Based Lineage Stitching
  • Understand Multi-Layer Data Lineage
    • Business Traceability
    • Critical Data Element Lineage
    • Multi-Layer Technical Lineage
    • Issue Management – to resolve Data Lineage issues
    • Unpack ETL – to understand transformation details
    • Analyze Application Transformations
  • Present in Use-Case Related Formats
    • Business Traceability
    • Data Element Lineage Reports with “Trace Direct” to Highlight Paths
    • Compliance Exports
    • “Lineage Anywhere” – embedded in tools of your choice
    • Lineage Snapshots – to understand changes in Data Lineage

He expanded on the process for understanding the “where” of Data Lineage, breaking down each place where data might live in the enterprise, and adding,

“Business traceability is the connection of key business processes and the flow of logical entities from business process to business process. Technical lineage is the flow of physical data through underlying applications, services, data stores – that’s a very physical level.”
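
As a sketch of how those two layers might be held together, the toy structure below keeps business flows and technical flows in one graph and records the connection between physical data and business assets that Rowlands calls essential. All node names are hypothetical:

    from collections import defaultdict

    class LineageGraph:
        """Toy multi-layer lineage: business and technical flows in one place."""
        def __init__(self):
            self.edges = defaultdict(set)  # node -> downstream nodes
            self.layer = {}                # node -> "business" or "technical"
            self.links = {}                # physical data -> business asset

        def add_flow(self, src, dst, layer):
            self.edges[src].add(dst)
            self.layer[src] = self.layer[dst] = layer

        def link(self, physical, business_asset):
            # Maintain the connection between physical data and business assets.
            self.links[physical] = business_asset

    g = LineageGraph()
    g.add_flow("Open Account", "Assess Risk", layer="business")    # traceability
    g.add_flow("crm.accounts", "warehouse.dim_account", layer="technical")
    g.link("crm.accounts", "Customer")  # physical table -> logical entity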

2. “How” Data Lineage

Understanding the “How” of Data Lineage is very connected to the “Where” of Data Lineage and makes aggregation and manipulation visible, he said:

  • Use “Where” information to find applications or tools that manipulate data, or
  • Use information gathered automatically

Rowlands said that historically, the “Where” of Data Lineage was all that was needed, but things have changed:

“Now there is increasing demand, especially from auditors and regulators, for you to be able to demonstrate not just where the movement occurred, but how things changed. And you have to be able to do that in a repeatable, reliable, defensible, and increasingly accelerated manner. The scary thing about regulation to me is not that there is a demand for credibility, but that there is a demand for on-demand demonstration of credibility.”
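
A minimal sketch of what that demand implies in practice: each hop records not just source and target but the transformation applied, so the “how” can be produced on demand. The names and fields here are illustrative, not any vendor’s schema:

    from datetime import datetime, timezone

    transformation_log = []

    def record_hop(source, target, transformation, ran_by):
        """Capture the 'how' alongside the 'where' for later audit."""
        transformation_log.append({
            "source": source,
            "target": target,
            "transformation": transformation,  # e.g. the SQL or ETL step
            "ran_by": ran_by,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def show_how(target):
        """Answer the auditor on demand: how did data reach this target?"""
        return [h for h in transformation_log if h["target"] == target]

    record_hop("staging.orders", "warehouse.fct_orders",
               "SUM(amount) GROUP BY order_date", ran_by="nightly_etl")
    print(show_how("warehouse.fct_orders"))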

3. Data Understanding

Terminology can vary from department to department, so putting data terminology and historical changes to terminology in one central place is important. Knowing what a term means and agreeing on what the term means are two different steps, he said. What’s needed is a process for managing changes to data terminology, identifying where those terms are used and who owns them.

This can start with a spreadsheet, and in practice, that’s “a decent place to start for one application or two applications, but where it gets to be that you’re really adding business value, is when you can get a workflow-driven, collaborative process.” When changes are automatically routed to end users and higher-ups, “at that point, it’s really valuable,” Rowlands said.
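
A minimal sketch of that progression, with hypothetical fields: the glossary entry carries an owner, the places the term is used, and pending changes that a workflow would route for approval rather than apply silently:

    from dataclasses import dataclass, field

    @dataclass
    class GlossaryTerm:
        name: str
        definition: str
        owner: str                                # who approves changes
        used_in: list = field(default_factory=list)
        pending: list = field(default_factory=list)

        def propose_change(self, new_definition, proposed_by):
            # Route the change to the owner instead of editing in place;
            # a real workflow would also notify end users and higher-ups.
            self.pending.append({"definition": new_definition,
                                 "proposed_by": proposed_by})

    term = GlossaryTerm("Active Customer",
                        "A customer with a purchase in the last 12 months",
                        owner="Sales Data Steward",
                        used_in=["crm", "warehouse", "quarterly_report"])
    term.propose_change("A customer with a purchase in the last 6 months",
                        proposed_by="marketing")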

4. Organization

Rowlands advocates creation of Reference Data on spreadsheets, in a managed data store, or ideally, a managed collaborative process. Without proper Reference Data, he says, “As you start aggregating data for reporting purposes, things will go in the wrong buckets, and when the right data goes in the wrong buckets you still get the wrong answers.”
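
A minimal sketch of the “right buckets” point, with made-up codes: one shared mapping from raw codes to reporting categories, consulted by every aggregation, so an unmapped code fails loudly instead of landing somewhere wrong:

    # Managed reference data: raw channel code -> reporting bucket.
    CHANNEL_CODES = {
        "WEB": "Online",
        "APP": "Online",
        "TEL": "Call Center",
        "BR":  "Branch",
    }

    def bucket_for(code):
        # Failing loudly beats silently aggregating into the wrong bucket.
        try:
            return CHANNEL_CODES[code]
        except KeyError:
            raise ValueError(f"Unmapped channel code: {code!r}")

    print(bucket_for("APP"))   # -> Online
    bucket_for("FAX")          # raises ValueError: Unmapped channel code: 'FAX'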

5. Data Quality

The objective here is to be able to look at the end-to-end flow, he said. “I want to know when things have been transformed, and of course I want to know what things mean.” The valid values are important, he said, “but I also need to see how data quality is moving from one place to another.” Data Quality is therefore one of the essential capabilities in an enterprise capabilities framework.

Many tools are now available to ensure Data Quality; Rowlands called them “some of the most impressive Data Management tools that are around.”
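
As a rough sketch of seeing “how data quality is moving from one place to another,” the snippet below takes the same completeness measure at each hop of a made-up flow, so a drop can be pinned to a specific movement:

    def completeness(records, fld):
        """Fraction of records with a non-empty value for the field."""
        return sum(1 for r in records if r.get(fld)) / max(len(records), 1)

    def profile_flow(hops, fld):
        # hops: (hop_name, records) pairs in flow order
        return [(name, completeness(recs, fld)) for name, recs in hops]

    source  = [{"email": "a@x.com"}, {"email": "b@x.com"}]
    staging = [{"email": "a@x.com"}, {"email": ""}]   # value lost in transit
    print(profile_flow([("source", source), ("staging", staging)], "email"))
    # -> [('source', 1.0), ('staging', 0.5)]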

6. Managing the Process

Rowlands advocates documentation to understand the roles of all involved. Having a workflow that automates flagging, investigation and resolution of issues is useful.

“I’d really like to be able to just log an issue and then have my workflows send it off to a Data Steward, and then it’s not my problem anymore,” he said. “Report the concern at the point of use and not in a separate and isolated operation.”
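
A minimal sketch of that workflow, with hypothetical steward assignments: the concern is logged at the point of use, and routing takes it from there, so it is no longer the reporter’s problem:

    STEWARDS = {"crm": "alice", "warehouse": "bob"}  # domain -> steward

    issues = []

    def log_issue(domain, report, description):
        """Log a concern at the point of use; routing handles the rest."""
        issue = {
            "domain": domain,
            "report": report,                 # where the concern was noticed
            "description": description,
            "assigned_to": STEWARDS.get(domain, "data_office"),
            "status": "open",
        }
        issues.append(issue)
        return issue  # the reporter's involvement ends here

    log_issue("warehouse", "Q3 Revenue Report",
              "Region totals do not match the regional dashboards")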

7. Communication

It’s vital to build communication into your plan, because multiple stakeholders will want to see information in different ways.

“This is the core of Data Intelligence. You cannot understand your data unless you can pull off this little trick. This little trick is actually nothing more complicated in terms of description than to say, ‘I’ve got a report. How did the information get onto that report?’ If you can’t pull that trick off for your critical data elements, you are never going to be able to put your hand on your heart and say, ‘my reports properly reflect the data.’”
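
Mechanically, the “trick” is an upstream walk over lineage edges. A minimal sketch with a hypothetical edge map: start from the report field and collect every contributing source:

    UPSTREAM = {  # target -> the sources it is derived from
        "report.q3_revenue":    ["warehouse.fct_orders"],
        "warehouse.fct_orders": ["staging.orders"],
        "staging.orders":       ["crm.orders", "web.orders"],
    }

    def trace(node, seen=None):
        """Return every upstream node that feeds the given one."""
        seen = set() if seen is None else seen
        for src in UPSTREAM.get(node, []):
            if src not in seen:
                seen.add(src)
                trace(src, seen)
        return seen

    print(sorted(trace("report.q3_revenue")))
    # -> ['crm.orders', 'staging.orders', 'warehouse.fct_orders', 'web.orders']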

Data Intelligence is about minimizing doubts, increasing trust and speeding the process, so that you can respond to business opportunities and regulatory challenges quickly and confidently, he said, “And if you do that, you will be heroes, your managers will love you, and you will be fat, rich, and happy.”

 
