

Metadata management constitutes a key prerequisite for enterprises as they engage in data analytics and governance. Today, however, the context of data is often only manually documented by subject matter experts, and lacks completeness and reliability due to the complex nature of data pipelines. Thus, collecting data lineage, which describes the origin, structure, and dependencies of data, in an automated fashion increases the quality of provided metadata and reduces manual effort, making it critical for the development and operation of data pipelines.

In our practice report, we propose an end-to-end solution that digests lineage via (Py-)Spark execution plans. We build upon the open-source component Spline, allowing us to reliably consume lineage metadata and identify interdependencies. We map the digested data into an expandable data model, enabling us to extract graph structures for both coarse- and fine-grained data lineage. Lastly, our solution visualizes the extracted data lineage via a modern web app, and integrates with BMW Group’s soon-to-be open-sourced Cloud Data Hub.
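To illustrate the harvesting step, the snippet below shows one way to attach the Spline agent to a PySpark session, following Spline's documented Spark-agent integration, so that execution plans are captured on every action. It is a minimal sketch: the bundle version, producer URL, and dataset paths are assumptions, not our production configuration.

```python
from pyspark.sql import SparkSession

# Minimal sketch: wire the Spline agent into a PySpark session so that
# every executed query plan is harvested and shipped to a lineage backend.
spark = (
    SparkSession.builder
    .appName("lineage-enabled-job")
    # Spline agent bundle for Spark 3.x / Scala 2.12 (version is illustrative)
    .config("spark.jars.packages",
            "za.co.absa.spline.agent.spark:spark-3.3-spline-agent-bundle_2.12:2.0.0")
    # Register Spline's listener so execution plans are captured on each action
    .config("spark.sql.queryExecution.listeners",
            "za.co.absa.spline.harvester.listener.SplineQueryExecutionEventListener")
    # Endpoint of the lineage producer API receiving the harvested plans (placeholder)
    .config("spark.spline.producer.url", "http://localhost:8080/producer")
    .getOrCreate()
)

# Any subsequent action triggers plan harvesting; paths are hypothetical.
orders = spark.read.parquet("s3://bucket/raw/orders")
orders.groupBy("customer_id").count() \
    .write.mode("overwrite").parquet("s3://bucket/curated/order_counts")
```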

With the increased adoption of cloud services over the last few years, organizations have found easy means to collect, store, and analyze vast amounts of structured and unstructured data from heterogeneous sources. Starting in the 2010s, distributed data-processing frameworks such as Apache Spark have increasingly gained momentum to help data engineers and scientists process data at the scale of petabytes. Gartner states, however, that about 80 percent of data lakes do not effectively collect metadata, as it is easier to create and dump data than to curate it. Hence, a major part of preparing data consists of handling missing and erroneous metadata.

As a response to this, organizations introduce data catalogs to document (previously) ungoverned data on their platforms, maintaining an inventory of their data and providing documentation capabilities for it. While this helps data consumers to better understand and grasp the context of data, such catalogs are often not able to sufficiently capture the provenance of datasets given the complex set of variables involved, even with thriving data communities in place.
Data lineage, also referred to as data provenance, surfaces the origins and transformations of data and provides valuable context for data providers and consumers. Applying the concept of lineage tackles the aforementioned drawbacks by extending largely manually curated data catalogs with rich, human-readable, and automatically generated context on datasets. In technical terms, it includes the origin, sequence of processing steps, and final state of a dataset. For retrospective workflow provenance, we typically differentiate between coarse-grained and fine-grained lineage. While the former describes the interconnections of pipelines, databases, and tables, the latter exposes details on the applied transformations that generate and transform data. Hence, fine-grained lineage enables our platform users to monitor, comprehend, and debug complex data pipelines.
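To make the two granularities concrete, here is a minimal sketch that models both as one property graph; all class and field names are illustrative assumptions, not the data model used in our solution.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative node types for a lineage graph (names are hypothetical).
# Coarse-grained lineage links datasets and the jobs that read or write them;
# fine-grained lineage additionally records the transformations applied
# inside a job.

@dataclass
class Dataset:
    uri: str                      # e.g. "s3://bucket/curated/order_counts"

@dataclass
class Operation:
    kind: str                     # e.g. "Project", "Filter", "Aggregate"
    expression: str               # the applied transformation, human-readable

@dataclass
class Job:
    name: str
    inputs: List[Dataset]         # coarse-grained: datasets flowing in
    outputs: List[Dataset]        # coarse-grained: datasets flowing out
    operations: List[Operation] = field(default_factory=list)  # fine-grained

# Coarse-grained edge: raw orders -> aggregation job -> order counts,
# enriched with the fine-grained operation performed inside the job.
job = Job(
    name="order-aggregation",
    inputs=[Dataset("s3://bucket/raw/orders")],
    outputs=[Dataset("s3://bucket/curated/order_counts")],
    operations=[Operation("Aggregate", "groupBy(customer_id).count()")],
)
```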
In the context of data lakes, there typically exist two types of pipelines (both are sketched below):

Batch-based: ingestion and transformation pipelines operate on more traditional sources of data (typically relational data).

Stream-based: ingestion and transformation pipelines process and act on an unbounded stream of data close to real time.
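The following PySpark fragments sketch the shape of each pipeline type; the connection string, topic name, and storage paths are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("pipeline-shapes").getOrCreate()

# Batch-based: read a snapshot from a relational source, transform, write once.
# JDBC URL and table name are placeholders (a matching driver jar is assumed).
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/sales")
    .option("dbtable", "public.orders")
    .load()
)
orders.filter(col("status") == "shipped") \
    .write.mode("overwrite").parquet("s3://bucket/curated/shipped_orders")

# Stream-based: consume an unbounded stream and continuously append results
# close to real time. Kafka broker and topic are placeholders.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "order-events")
    .load()
)
query = (
    events.selectExpr("CAST(value AS STRING) AS payload")
    .writeStream.format("parquet")
    .option("path", "s3://bucket/raw/order_events")
    .option("checkpointLocation", "s3://bucket/checkpoints/order_events")
    .start()
)
```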