Data Lineage
Overview
Pentaho offers you the ability to visualize the end-to-end flow of your data across PDI transformations and jobs, providing you with valuable insights to help you maintain meaningful data. This ability to track your data from source systems to target applications allows you to take advantage of third-party tools, such as Meta Integration Technology (MITI) and yEd, to track and view specific data.
Once lineage tracking is enabled, PDI will generate a .graphml file every time you run a transformation. You can then open this file using a third-party tool, such as yEd, to view a tree diagram of the data. By parsing through and teasing out the different parts of the graph, you can gain an end-to-end view into a specific element of data from origin to target. This ability can aid you in both data lineage and impact analysis:
- Data lineage provides the ability to discover the origins of an element of data and describes the sequence of jobs and transformations which have occurred up to the point of the request for the lineage information.
- Impact analysis is the reverse flow of information which can be used to trace the use and consumption of a data item, typically for the purpose of managing change or assessing and auditing access.
Sample Use Cases
Data lineage and impact analysis can be applicable in several ways.
As an ETL Developer:
- There are changes in my source system, such as fields that are added, deleted, or renamed. What parts of my ETL processes need to adapt? (Impact Analysis)
- I need additional information in my target system, such as for reports. What sources can provide this additional information? (Data Lineage)
As a Data Steward:
- There is a need for auditability and transparency to determine where data is coming from. A global, company-wide, metadata repository needs data lineage information from different systems and applications, i.e. very fine-grained metadata.
- What elements (fields, tables, etc.) in my ETL processes are never used? How many times is a specific element used in some or all of my ETL processes?
As a Report/Business User:
- Is my data accurate?
- I want to find reports which include specific information from a source, such as a field. This process is "data discovery." For example, are there any data sources which include sales and gender? Are there any reports which include sales and zip codes?
As a Troubleshooting Operator:
- The numbers in the report are wrong (or suspected to be wrong). Which processes (transformations, jobs) are involved, so that I can determine where these numbers are coming from?
- A job or transformation did not finish successfully. Which target tables and fields that are used in the reports are affected?
As an Administrator:
- For documentation and auditing purposes, I want to have a report on external sources and target fields, tables, and databases of my ETL processes. I need the data for a specific date and version.
- To ensure compliance, I want to validate naming conventions of artifacts (fields, tables, etc.)
- For integration into third-party data lineage tools, I want a flexible way of exporting the collected data lineage information.
Architecture
Pentaho's data lineage capabilities allow you to take advantage of tools from Meta Integration Technology (MITI). If you use a lot of different systems and applications, you can track and visualize specific data across these systems using Pentaho's lineage capabilities and third-party tools such as MITI and yEd.
Setup
Modify …\system\karaf\etc\pentaho.metaverse.cfg on the PDI client (and on the Pentaho Server when needed); the individual settings are listed below and sketched together after this list:
- Enable lineage explicitly by setting lineage.execution.runtime=on
- Adjust the output folder for the lineage GraphML files if needed: lineage.execution.output.folder=./pentaho-lineage-output
- Set lineage.execution.generation.strategy=latest (this is the default)
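Taken together, the lineage-related entries in pentaho.metaverse.cfg would look roughly like the following sketch (the output folder shown is the default; adjust it to your environment):
# Lineage settings in pentaho.metaverse.cfg (sketch; other entries in the file are unchanged)
lineage.execution.runtime=on
lineage.execution.output.folder=./pentaho-lineage-output
lineage.execution.generation.strategy=latest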
After the execution of a job or transformation, the GraphML files are generated in the defined folder.
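For example, assuming lineage is enabled on a PDI client, you could run a transformation with the Pan command-line tool and then look for the generated GraphML files in the configured output folder (the transformation path below is purely illustrative):
# Run a transformation with Pan (illustrative file path)
./pan.sh -file=/path/to/my-transformation.ktr
# List the generated lineage GraphML files in the configured output folder
ls ./pentaho-lineage-output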
API
It is also possible to access the GraphML information via a Pentaho Server API. REST endpoints are available to retrieve the lineage-related artifacts.
Below are some example curl commands which exercise the REST endpoints available on the Pentaho Server. These calls use basic authentication. For more information on the various ways to authenticate with the Pentaho Server, see the "Authenticate with the Server Before Making Service Calls" topic on the Pentaho Documentation site.
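The value after "Basic" in these examples is simply the Base64 encoding of user:password; the token used below corresponds to the example admin/password credentials and can be reproduced as follows:
# Encode the example credentials for the Basic authentication header
echo -n 'admin:password' | base64
# Output: YWRtaW46cGFzc3dvcmQ=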
For more detailed information about the REST endpoints available, you can go to the Pentaho Wiki to view the attached Enunciate file.
Get all lineage-related artifacts
curl --header "Authorization: Basic YWRtaW46cGFzc3dvcmQ=" http://localhost:8080/pentaho/osgi/cxf/lineage/api/download/all
Get all lineage from a given date forward
curl --header "Authorization: Basic YWRtaW46cGFzc3dvcmQ=" http://localhost:8080/pentaho/osgi/cxf/lineage/api/download/all/20150706
Get all lineage between two dates
curl --header "Authorization: Basic YWRtaW46cGFzc3dvcmQ=" http://localhost:8080/pentaho/osgi/cxf/lineage/api/download/all/20150101/20150706
Get all of the lineage artifacts for a specific file in the DI repo
curl --request POST --header "Content-Type: application/json" --header "Authorization: Basic YWRtaW46cGFzc3dvcmQ=" --data '{"path": "/LOCAL DI REPO/home/admin/dataGrid-dummy"}' http://localhost:8080/pentaho/osgi/cxf/lineage/api/download/file
Get all lineage-related artifacts for a specific file in the DI repo between two dates
curl --request POST --header "Content-Type: application/json" --header "Authorization: Basic YWRtaW46cGFzc3dvcmQ=" --data '{"path": "/LOCAL DI REPO/home/admin/dataGrid-dummy"}' http://localhost:8080/pentaho/osgi/cxf/lineage/api/download/file/20150701/20150707
Invalid date request
curl --header "Authorization: Basic YWRtaW46cGFzc3dvcmQ=" http://localhost:8080/pentaho/osgi/cxf/lineage/api/download/all/20159999