It’s not just you; data pipelines break. The average organisation experiences about 70 data issues per year for every 1,000 tables in its environment.
What separates high-performing data teams is what happens after the incident. Some catch and resolve these issues quickly, often before they are noticed by internal or external data consumers. Other teams only notice after they have received an urgent email and take days or even weeks to determine the root cause.
This is problematic because the consequences of bad data escalate based on who discovers the issue. The longer an issue lingers, the greater the chance it negatively impacts the business—for example, when customers or the media discover it.
This is where data lineage comes in. Data lineage is a map that traces the connections between your data assets—typically pipelines, transformation models, tables in the data warehouse or data lake, and business intelligence (BI) tools. This visualisation helps data teams understand the journey data takes from when it is first ingested until it is consumed.
It’s a critical capability because robust data lineage can dramatically lower both a data team’s time to resolution and the number of data incidents. It does this by enabling teams to trace issues to their root cause quickly and by surfacing the data assets that matter most to the business, allowing them to prioritise their efforts accordingly.
However, to navigate data quality issues efficiently, you must draw the right map. Here are four must-have data lineage features, whether you build or buy your solution.
1. Data Lineage Must Be Automated
Some data teams used to draw their data lineage maps manually. In fact, most still maintain a high-level, tool-level overview of the integrations across their modern data stack.
While that can be helpful for onboarding or executive briefings, static depictions of data lineage cannot capture the level of detail required for troubleshooting. There are simply too many moving parts, with high degrees of interconnectivity, constantly being modified by multiple parties.
Luckily, data lineage can be automated. The metadata that illustrates how each table is connected can be pulled and parsed from the SQL logs of a data warehouse or the metastore of a data lake. As the relationships between assets change, so too does the metadata, which can be used to update the lineage map automatically.
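To make this concrete, here is a minimal sketch (not any particular vendor's implementation) of how table-level lineage edges could be derived from logged SQL. It assumes you can pull raw statements from the warehouse's query history and uses the open-source sqlglot parser; the table names and the lineage_edges helper are hypothetical.

```python
# A minimal sketch: derive table-level lineage edges from warehouse query logs.
# Assumes each log entry is the raw SQL of a completed statement; sqlglot
# (pip install sqlglot) is used only for parsing. Names are illustrative.
from sqlglot import exp, parse_one


def table_name(table: exp.Table) -> str:
    """Render a table reference as schema.table, ignoring aliases."""
    return ".".join(part for part in (table.db, table.name) if part)


def lineage_edges(sql: str) -> set[tuple[str, str]]:
    """Return (source_table, target_table) edges for statements that write data."""
    stmt = parse_one(sql)
    if not isinstance(stmt, (exp.Create, exp.Insert)):
        return set()  # a plain SELECT creates no lineage edge

    target = stmt.find(exp.Table)   # the table being written to
    select = stmt.find(exp.Select)  # the query feeding it
    if target is None or select is None:
        return set()

    sources = {table_name(t) for t in select.find_all(exp.Table)}
    return {(src, table_name(target)) for src in sources if src != table_name(target)}


# Example: one statement lifted from a (hypothetical) query log.
log_entry = """
    CREATE TABLE analytics.daily_revenue AS
    SELECT o.order_date, SUM(o.amount) AS revenue
    FROM raw.orders AS o
    JOIN raw.currencies AS c ON o.currency_id = c.id
    GROUP BY o.order_date
"""
print(lineage_edges(log_entry))
# edges from raw.orders and raw.currencies into analytics.daily_revenue
```

In practice you would run something like this over every statement in the log and merge the edges into a graph that refreshes as the underlying metadata changes.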
2. Data Lineage Must Be End-to-End
It’s not just the relationship between tables within the data warehouse that matters. Today, modern data platforms are built with integrations across several layers, including ingestion, transformation/orchestration, storage, and visualisation.
Data lineage should provide the complete, end-to-end picture across these layers because changes in one system can impact how data behaves in others. For example, an analytics engineer could accidentally introduce bad code while modifying a dbt model, which could create data anomalies within the data warehouse.
By extending lineage to the BI layer, you can also see which data consumers depend on which data assets. This allows you to understand the impact of issues with, or changes to, specific assets. It can also help with strategic planning when reorganising the data team into a more decentralised structure, such as a data mesh.
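As a rough sketch of what that end-to-end view enables, the example below models a few hypothetical assets from ingestion through BI as a directed graph using the networkx library, and answers both the impact question (what breaks downstream?) and the root cause question (what lies upstream?).

```python
# A minimal sketch of impact analysis on an end-to-end lineage graph using
# networkx (pip install networkx). Asset names are hypothetical examples
# spanning ingestion, transformation, warehouse, and BI layers.
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edges_from([
    ("fivetran.salesforce_sync", "raw.opportunities"),         # ingestion -> warehouse
    ("raw.opportunities", "dbt.stg_opportunities"),            # warehouse -> dbt model
    ("dbt.stg_opportunities", "analytics.pipeline_summary"),   # dbt -> mart table
    ("analytics.pipeline_summary", "looker.sales_dashboard"),  # warehouse -> BI
])

# If a dbt model change introduces bad code, everything downstream is at risk.
impacted = nx.descendants(lineage, "dbt.stg_opportunities")
print(sorted(impacted))  # ['analytics.pipeline_summary', 'looker.sales_dashboard']

# Conversely, root cause analysis walks upstream from the broken dashboard.
suspects = nx.ancestors(lineage, "looker.sales_dashboard")
print(sorted(suspects))
```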
3. Data Lineage Must Be at the Field Level
Sometimes data engineers, analysts, or other team members need to drill down to understand the provenance of a particular field. With large datasets, some fields may be more reliable or relevant than others.
You could write a series of exploratory queries (SELECT TOP 5, LIMIT 5, and the like) to poke through tables and work out which fields are reliable, or you could leverage field-level lineage to review the upstream tables and columns feeding a specific field.
Table-level lineage can reveal the upstream tables on which a report depends, but field-level lineage can pinpoint the single field in the single upstream table that feeds the one data point you care about. That greatly improves your team’s efficiency.
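For illustration, the sketch below maps each output column of a single (hypothetical) transformation to the source columns it references, again using sqlglot for parsing. A real field-level lineage tool would chain these mappings across every model and resolve table aliases back to physical tables, which this sketch leaves out.

```python
# A minimal sketch of field-level lineage for one transformation, using
# sqlglot (pip install sqlglot). Table and column names are illustrative only.
from sqlglot import exp, parse_one

sql = """
    CREATE TABLE analytics.daily_revenue AS
    SELECT o.order_date,
           SUM(o.amount * c.usd_rate) AS revenue_usd
    FROM raw.orders AS o
    JOIN raw.currencies AS c ON o.currency_id = c.id
    GROUP BY o.order_date
"""

select = parse_one(sql).find(exp.Select)
field_lineage = {}
for projection in select.expressions:  # each output column of the SELECT
    sources = {f"{col.table}.{col.name}" for col in projection.find_all(exp.Column)}
    field_lineage[projection.alias_or_name] = sources

print(field_lineage["revenue_usd"])  # {'o.amount', 'c.usd_rate'} (alias-qualified)
```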
4. Data Lineage Must Have Context
Data lineage is important for understanding the connections between assets, but it must go further and help data teams understand the context around each of them. For example, once you’ve traced a data issue to the most upstream table, you need to understand (see the sketch after this list):
- Who owns this table?
- How frequently is it used?
- Has this table had issues before?
- What recent changes have been made to the code generating this table or to the queries being run on it?
- Is it a pipeline issue, or is the problem with the data itself?
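Continuing the earlier networkx sketch, that context can live directly on the lineage graph as node attributes. The attribute names and values below are hypothetical placeholders for metadata that would normally come from a data catalogue, incident history, and version control.

```python
# A minimal sketch: attach incident-response context to a lineage node as
# networkx node attributes. Attribute names and values are hypothetical and
# would be populated from a catalogue, incident history, and git in practice.
import networkx as nx

lineage = nx.DiGraph()
lineage.add_node(
    "raw.opportunities",
    owner="growth-data-eng",        # who owns this table
    queries_last_30d=1_240,         # how frequently it is used
    past_incidents=3,               # has it had issues before?
    last_code_change="2024-03-02",  # recent changes to the generating job
    freshness_sla_hours=6,          # helps separate pipeline vs. data issues
)

# At incident time, the on-call engineer reads the context straight off the map.
print(lineage.nodes["raw.opportunities"]["owner"])
```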
Different solutions provide data lineage, but they differ in the context they attach to it. Data catalogues include helpful information on how data is used within the organisation as part of their lineage offerings, emphasising governance and a shared understanding. Data observability platforms also focus on how data is consumed, but with an emphasis on data health and other root cause analysis context.
Putting Lineage to Work
Data lineage is rapidly evolving. However, a map is only helpful if you understand where you want to go. Determine the end goal or the initiatives already in place that will benefit from data lineage.
Some common use cases beyond accelerated data anomaly resolution include:
- Better knowledge management of key data assets.
- Expanding access via a self-service data initiative.
- Ensuring a common understanding of metrics and terminology.
- Creating more accountability and ownership within the data team.
- Understanding data flows to help the transition to a data mesh.
Ultimately, data lineage is more than a technical map; it is the foundation of a data-aware culture. It replaces assumptions with evidence, accelerates problem-solving, and builds the trust necessary for an organisation to truly become data-driven.