Pentaho Data Integration
Pentaho Data Integration (PDI) provides the Extract, Transform, and Load (ETL) capabilities that facilitate the process of capturing, cleansing, and storing data in a uniform and consistent format that is accessible and relevant to end users and IoT technologies.
Common uses of Pentaho Data Integration include:
- Data migration between different databases and applications
- Loading huge data sets into databases, taking full advantage of cloud, clustered, and massively parallel processing environments
- Data cleansing with steps ranging from very simple to very complex transformations
- Data integration, including the ability to leverage real-time ETL as a data source for Pentaho Reporting
- Data warehouse population with built-in support for slowly changing dimensions and surrogate key creation
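To make the last item more concrete, here is a minimal Python sketch of a Type 2 slowly changing dimension update with surrogate keys. It illustrates the general technique only and is not how PDI implements it. The function name `apply_scd2` and the field names (`sk`, `customer_id`, `city`, `valid_from`, `valid_to`) are hypothetical and chosen for this example.

```python
from datetime import date


def apply_scd2(dimension, incoming, today):
    """Apply a Type 2 slowly changing dimension update in place.

    dimension: list of row dicts with the (hypothetical) keys
        sk, customer_id, city, valid_from, valid_to.
        valid_to is None for the current version of a customer.
    incoming: dict mapping customer_id -> city as seen in the source system.
    today: date used to close old versions and open new ones.
    """
    # The surrogate key is a simple counter, independent of the natural key.
    next_sk = max((row["sk"] for row in dimension), default=0) + 1
    # Index the current version of each customer by its natural key.
    current = {r["customer_id"]: r for r in dimension if r["valid_to"] is None}

    for customer_id, city in incoming.items():
        row = current.get(customer_id)
        if row is not None and row["city"] == city:
            continue  # attribute unchanged: keep the current version
        if row is not None:
            row["valid_to"] = today  # close the old version, keep its history
        dimension.append({
            "sk": next_sk,
            "customer_id": customer_id,
            "city": city,
            "valid_from": today,
            "valid_to": None,
        })
        next_sk += 1
    return dimension


dim = [{"sk": 1, "customer_id": "C1", "city": "Oslo",
        "valid_from": date(2020, 1, 1), "valid_to": None}]
apply_scd2(dim, {"C1": "Bergen", "C2": "Paris"}, date(2024, 6, 1))
for row in dim:
    print(row)
```

After the call, the original `C1` row is closed rather than overwritten, and two new rows with fresh surrogate keys hold the current values. Fact tables can then reference `sk` to point at the version that was valid when the fact occurred.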
Using the PDI Client
PDI Client (Spoon) is a desktop application that you install on your workstation. Use it to build transformations and to schedule and run jobs:
- Watch these two short videos:
- Familiarize yourself with the interface
Work with Repositories
We recommend that you use the Pentaho Repository for enterprise deployments.
See Repository Configuration and Management.
Using the Data Integration Perspective
PDI workflows are built using steps or entries joined by hops that pass data from one item to the next. These workflows are built within two basic file types:
- Transformations perform ETL tasks.
- Jobs orchestrate ETL activities such as defining the flow, dependencies, and execution preparation.
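The idea of steps joined by hops can be pictured with a small Python sketch. This is a conceptual analogy only, not PDI code: each generator function stands in for a step, and passing one generator to the next stands in for a hop. All function names here (`read_rows`, `cleanse`, `load`, `run_transformation`) are hypothetical.

```python
def read_rows(lines):
    """Input step: turn raw 'name,amount' lines into row dictionaries."""
    for line in lines:
        name, amount = line.split(",", 1)
        yield {"name": name, "amount": amount}


def cleanse(rows):
    """Transform step: trim whitespace and drop rows with a non-numeric amount."""
    for row in rows:
        try:
            value = float(row["amount"].strip())
        except ValueError:
            continue  # discard rows that cannot be parsed
        yield {"name": row["name"].strip().title(), "amount": value}


def load(rows, target):
    """Output step: append rows to the target and return how many were written."""
    written = 0
    for row in rows:
        target.append(row)
        written += 1
    return written


def run_transformation(lines):
    """Wire the steps together; each nested call plays the role of a hop."""
    target = []
    written = load(cleanse(read_rows(lines)), target)
    return target, written


rows, count = run_transformation([" alice , 10.5", "bob,n/a", "carol,3"])
print(count, rows)
```

Rows flow through the chain one at a time, so no step needs the whole data set in memory. In this analogy, a job would be the outer script that decides when `run_transformation` runs and what happens if it fails.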
See also: Using Transformations and Jobs, Additional Features, and Step and Entry Reference.
Using the Schedule Perspective in PDI
Schedule transformations and jobs to run at specific times.
All about Scheduling | Learn how to schedule transformations and jobs.
PDI Administration
Learn about system requirements, the permissions needed for license and security management, and how to build ETL solutions and perform data analytics tasks in PDI and Pentaho Business Analytics.
Supported Technologies | View the full list of hardware and software requirements for PDI and Pentaho Business Analytics.
Installation and Licenses | Use one of the available methods to install PDI and Pentaho Business Analytics.
Configuration and Management | Get started creating ETL solutions and data analytics tasks, manage servers, and fine-tune performance: PDI Tools and User Management, Server Management, Performance Improvement.
Advanced PDI Concepts
Learn about developing custom plugins to extend or embed PDI functionality, sharing plugins, streamlining the data modeling process, connecting to Big Data sources, ways to maintain meaningful data, and more.
Use the Command Line with PDI | Kitchen, Pan, and Carte are command-line tools for executing jobs and transformations modeled in Spoon.
Embed and Extend PDI | Learn how to develop custom plugins that extend PDI functionality or embed the engine into your own Java applications.
Data Services | Use a Data Service to query the output of a step as if the data were stored in a physical table. Read about how to turn a transformation into a data service.
Marketplace | Use the Marketplace to download, install, and share plugins developed by Pentaho and members of the user community.
Data Lineage | Use Data Lineage to track your data from source systems to target applications and take advantage of third-party tools, such as Meta Integration Technology (MITI) and yEd, to track and view specific data.
Big Data and Streamlined Data Refinery | Use transformation steps to connect to a variety of Big Data sources, including Hadoop, NoSQL databases such as MongoDB, and analytical databases. Work through step-by-step tutorials, move beyond the basics, and learn how to edit transformations and metadata models.
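As a rough illustration of the command-line entry above, the following Python sketch assembles an argument list for Pan (transformations, `.ktr` files) or Kitchen (jobs, `.kjb` files). It assumes the Linux-style `-file=`, `-level=`, and `-param:NAME=value` options. Verify the option names against the command-line reference for your PDI version. The helper name `build_pdi_command`, the `script_dir` argument, and the example paths are hypothetical.

```python
import shlex


def build_pdi_command(path, level="Basic", params=None, script_dir="."):
    """Return an argv list that runs a transformation or job from the shell.

    Assumption: pan.sh and kitchen.sh live in script_dir and accept the
    -file=, -level=, and -param:NAME=value options.
    """
    if path.endswith(".ktr"):
        tool = "pan.sh"      # Pan executes transformations
    elif path.endswith(".kjb"):
        tool = "kitchen.sh"  # Kitchen executes jobs
    else:
        raise ValueError(f"expected a .ktr or .kjb file, got: {path}")

    argv = [f"{script_dir}/{tool}", f"-file={path}", f"-level={level}"]
    # Sort the parameters so the generated command is reproducible.
    for name, value in sorted((params or {}).items()):
        argv.append(f"-param:{name}={value}")
    return argv


argv = build_pdi_command("/etl/load_sales.ktr", params={"RUN_DATE": "2024-06-01"})
print(shlex.join(argv))  # shell-quoted form, suitable for logging
```

The sketch only builds and prints the command. To execute it, the list could be passed to `subprocess.run(argv, check=True)` on a machine where PDI is installed.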