Hitachi Vantara Lumada and Pentaho Documentation



An explanation of the common uses and key benefits of PDI.

Pentaho Data Integration (PDI) is an extract, transform, and load (ETL) solution that uses an innovative metadata-driven approach.

PDI includes the DI Server, a design tool, three utilities, and several plugins.

Common Uses

Pentaho Data Integration is a flexible tool that addresses a broad range of use cases, including:

  • Data warehouse population with built-in support for slowly changing dimensions and surrogate key creation
  • Data migration between different databases and applications
  • Loading large data sets into databases, taking full advantage of cloud, clustered, and massively parallel processing environments
  • Data cleansing with steps ranging from very simple to very complex transformations
  • Data integration, including the ability to use real-time ETL as a data source for Pentaho Reporting
  • Rapid prototyping of ROLAP schemas
  • Hadoop functions: Hadoop job execution and scheduling, simple Hadoop MapReduce design, Amazon EMR integration
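As an illustration of the first use case above, the following is a minimal sketch of a Type 2 slowly changing dimension with surrogate key creation. It is plain Python, not PDI code and not PDI's implementation; the names `DimRow`, `apply_scd2`, and the `city` attribute are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from datetime import date
from itertools import count
from typing import Iterator, List, Optional


@dataclass
class DimRow:
    """One version of a dimension record (hypothetical layout)."""
    surrogate_key: int               # warehouse-generated key, unique per version
    natural_key: str                 # business key from the source system
    city: str                        # the tracked attribute
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_scd2(dimension: List[DimRow], natural_key: str, city: str,
               as_of: date, key_gen: Iterator[int]) -> DimRow:
    """Insert a new version only when the tracked attribute changed."""
    current = next(
        (r for r in dimension
         if r.natural_key == natural_key and r.valid_to is None),
        None,
    )
    if current is not None and current.city == city:
        return current               # nothing changed: keep the current version
    if current is not None:
        current.valid_to = as_of     # close the old version
    new_row = DimRow(next(key_gen), natural_key, city, as_of)
    dimension.append(new_row)
    return new_row


dimension: List[DimRow] = []
keys = count(1)                      # simple sequential surrogate keys
apply_scd2(dimension, "C-100", "Boston", date(2020, 1, 1), keys)
apply_scd2(dimension, "C-100", "Boston", date(2020, 6, 1), keys)  # no change
apply_scd2(dimension, "C-100", "Denver", date(2021, 3, 1), keys)  # new version
```

After these three calls the dimension holds two rows for customer `C-100`: the closed Boston version and the current Denver version, each with its own surrogate key, so historical facts can still join to the version that was valid at the time.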

Key Benefits

Pentaho Data Integration features and benefits include:

  • Installs in minutes; you can be productive in one afternoon
  • 100% Java with cross-platform support for Windows, Linux, and Macintosh
  • Easy-to-use graphical designer with over 100 out-of-the-box mapping objects, including inputs, transforms, and outputs
  • Simple plug-in architecture for adding your own custom extensions
  • Enterprise Data Integration server providing security integration, scheduling, and robust content management including full revision history for jobs and transformations
  • Integrated designer (Spoon) combining ETL with metadata modeling and data visualization, providing a single environment for rapidly developing new business intelligence solutions
  • Streaming engine architecture provides the ability to work with extremely large data volumes
  • Enterprise-class performance and scalability with a broad range of deployment options, including dedicated, clustered, and cloud-based ETL servers
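To make the streaming idea in the list above more concrete, here is a minimal sketch of row-at-a-time processing using Python generators. It is only an analogy, not PDI's engine: the step names `read_rows`, `keep_large`, and `add_tax` are hypothetical, and a simple generator chain is assumed here in place of whatever buffering and threading the real engine uses. The point it shows is that each row flows through all steps before the next is read, so the full data set never has to fit in memory.

```python
from typing import Dict, Iterable, Iterator

Row = Dict[str, float]


def read_rows(n: int) -> Iterator[Row]:
    """Input step: produce rows lazily instead of building a full list."""
    for i in range(n):
        yield {"id": i, "amount": i * 10}


def keep_large(rows: Iterable[Row], minimum: float) -> Iterator[Row]:
    """Filter step: pass through only rows at or above the minimum amount."""
    for row in rows:
        if row["amount"] >= minimum:
            yield row


def add_tax(rows: Iterable[Row], rate: float) -> Iterator[Row]:
    """Transform step: add a derived field to each row."""
    for row in rows:
        yield {**row, "tax": row["amount"] * rate}


# Wiring the steps together does no work yet; rows are pulled on demand.
pipeline = add_tax(keep_large(read_rows(1_000_000), minimum=50), rate=0.25)
first = next(pipeline)  # only the first few input rows have been generated
```

Here `first` is the row with `id` 5, because rows 0 through 4 have amounts below 50 and are dropped by the filter step; the remaining rows are not generated until something consumes the pipeline further.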