Hitachi Vantara Lumada and Pentaho Documentation

Develop your PDI solution

This workflow helps you to set up and configure the DI development and test environments, then build, test, and tune your Pentaho DI Solution prototype. This process is similar to the Trial Download Evaluation experience, except that you will be completely configuring the Pentaho Server for data integration and working with your own ETL developers.

If you need extra help, Pentaho Professional Services is available. The goal of this workflow is to learn DI implementation best practices and deploy your DI solution to a production server. Most development and testing for DI occurs in Spoon.

Before you begin developing your DI solution, we recommend that you attend Pentaho training classes to learn how to install and configure the Pentaho Server, as well as how to develop data models.

This section is grouped into parts that guide you through the development of your DI solution. These parts are iterative, and you might move back and forth between them during development. For example, as you tune a job, you might find that although your solution produces the right results, it takes a long time to run. In that case, you might need to rebuild a transformation to improve its efficiency, and then retest it.

Design DI solution

Design helps you think critically about the problem you want to solve and possible solutions. Consider these questions as you gather your requirements and design the solution.
  • Output

    What does the overall solution look like? What questions are you posing and how do you want the answers formatted?

  • Data Sources

    What type(s) of data sources are you querying? Where are they located? How much data do you need to process? Are you using big data? Are you using relational or non-relational data sources? Will you have a target data source? If so, where is it located?

  • Content/Processing

    What data quality issues do you have? How is the input data mapped to the output data? Where do you want to process the content, in PDI or in the data source? What hardware will you include in your development environment? Will you need one or more quality assurance test environments or production environments?

Also, consider templates or standards, naming conventions, and other requirements of your end users if you have them. Consider how you will back up your data as well.

Set up development environment

Setting up the environment includes installing and configuring PDI on development computers, configuring clustering if needed, and connecting to data sources. If you have one or more quality assurance environments, you will need to set those up also.
Table 1. PDI Set Up Checklist
Task | Do This | Objective
Verify System Requirements
  • Acquire one or more servers that meet the requirements.
  • Obtain the correct drivers for your system.
Obtain Software and Install PDI
  • Get the software from your Sales Support representative.
  • Install the software.
  • Start the Pentaho Server and Spoon.
Install licenses for the Pentaho Server
  • Add all relevant Pentaho licenses.
Connect to the Pentaho Repository
  • Connect to the Pentaho Repository.
Apply Advanced Security (if needed)
  • Determine whether you need to apply DI Advanced Security.

Build and test solution

During this step, you develop transformations, jobs, and models, then test what you have developed. You will tune the transformations, jobs, and models for optimal performance.

Development occurs in the Spoon design tool. Spoon's streamlined design tightly couples the build and test activities so that you can easily perform them iteratively. Spoon has perspectives that help you perform ETL and visualize data. Spoon also provides a scheduling perspective that can be used to automate testing.

Testing encompasses verifying the quality of transformations and jobs, reviewing visualizations, and debugging issues. One common method of testing is to include steps in a transformation or job that calculate hash totals, checksums, record counts, and so forth to determine whether data is being processed properly. You can also visualize your data in Analyzer and Report Designer and review the results as you develop. This not only helps you find errors and processing issues, but also gives you a head start on user acceptance testing if you show these reports to your customers or business analysts to get early feedback.
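
The record count, hash total, and checksum checks mentioned above can be illustrated outside of Spoon. The following is a minimal Python sketch, not a PDI feature: the function names, the field names (`order_id`, `amount`), and the sample rows are all hypothetical, and it assumes the source and target rows have already been extracted into dictionaries.

```python
import hashlib
from decimal import Decimal


def summarize(rows, total_field, key_fields):
    """Return the record count, hash total, and checksum for a list of dict rows."""
    count = 0
    hash_total = Decimal("0")
    digest = hashlib.sha256()
    # Sort by key so the checksum does not depend on row order.
    ordered = sorted(rows, key=lambda r: tuple(str(r[k]) for k in key_fields))
    for row in ordered:
        count += 1
        # Decimal avoids floating-point drift when summing amounts.
        hash_total += Decimal(str(row[total_field]))
        # Canonical text form of the row: fields in alphabetical order.
        canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
        digest.update(canonical.encode("utf-8"))
        digest.update(b"\n")
    return {"count": count, "hash_total": hash_total, "checksum": digest.hexdigest()}


def compare(source_rows, target_rows, total_field, key_fields):
    """Report, per control total, whether source and target agree."""
    src = summarize(source_rows, total_field, key_fields)
    tgt = summarize(target_rows, total_field, key_fields)
    return {name: src[name] == tgt[name] for name in src}


# Hypothetical sample data: the same rows, loaded in a different order.
source = [
    {"order_id": 1, "amount": "10.50"},
    {"order_id": 2, "amount": "4.25"},
]
target = [
    {"order_id": 2, "amount": "4.25"},
    {"order_id": 1, "amount": "10.50"},
]

result = compare(source, target, "amount", ["order_id"])
print(result)
```

If a row is dropped or a value is altered during processing, at least one of the three comparisons reports `False`, which points you to the transformation that needs debugging.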

One basic question is how to determine the number of transformations and jobs needed, as well as the order in which they should be executed. A good rule of thumb is to create one transformation for each combination of source system and target table. You can often identify these combinations in your mapping documents. Once you have identified the number of transformations that you need, you can use the same process to determine the number of jobs that you need. When considering the order of execution for transformations and jobs, consider how referential integrity is enforced. Run the transformations for target tables that have no dependencies first, then run the transformations that depend on those tables, and so forth.
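
The rule of thumb and the ordering guideline above can be sketched as a small planning script. This is an illustration only, written in Python rather than built in Spoon: the mapping rows, table names, and dependency map are hypothetical, and it assumes you can list each target table's foreign-key dependencies by hand. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical rows from a mapping document: (source system, target table).
mapping_rows = [
    ("crm", "dim_customer"),
    ("erp", "dim_product"),
    ("erp", "fact_sales"),
    ("crm", "dim_customer"),  # repeated row, counted only once
]

# Hypothetical referential-integrity dependencies:
# target table -> tables it references through foreign keys.
table_dependencies = {
    "fact_sales": {"dim_customer", "dim_product"},
}


def plan_transformations(rows):
    """One transformation per distinct (source system, target table) pair."""
    return sorted(set(rows))


def execution_order(transformations, dependencies):
    """Order transformations so referenced tables are loaded first."""
    by_target = {}
    for transformation in transformations:
        by_target.setdefault(transformation[1], []).append(transformation)

    sorter = TopologicalSorter()
    for transformation in transformations:
        target = transformation[1]
        # Every transformation that loads a referenced table must run earlier.
        predecessors = [
            t
            for table in sorted(dependencies.get(target, ()))
            for t in by_target.get(table, [])
        ]
        sorter.add(transformation, *predecessors)
    return list(sorter.static_order())


transformations = plan_transformations(mapping_rows)
order = execution_order(transformations, table_dependencies)
print(f"{len(transformations)} transformations needed")
for source_system, target_table in order:
    print(f"{source_system} -> {target_table}")
```

In this sample, the two dimension loads have no dependencies and can run in either order, while the `fact_sales` load always comes after both. A circular dependency raises `graphlib.CycleError`, which is a useful signal that the load order needs to be rethought.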

Table 2. Build and Test Checklist - Spoon
Task | Do This | Objective
Understand the Basics
  • Review information about the process and perspectives.
Review most often used steps and entries
  • Review the available transformation steps and determine how you can use them for your solution.
  • Review the job entry references to identify which entries can be used in your solution.
Create and Run Transformations
  • Identify the transformations needed for your job and implement them.
  • Save the transformation.
  • Run transformations locally.
Create and Run a Job
  • Create a job.
  • Arrange transformations in a job so that they execute logically.
  • Run a job.
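
The checklist above assumes you run transformations and jobs from within Spoon. If you also want to repeat those runs from a script while testing, PDI ships the Pan (transformations) and Kitchen (jobs) command line tools. The following Python sketch is one possible wrapper, under stated assumptions: the installation directory and file paths are hypothetical, the helper names are invented for illustration, and it assumes a Linux-style install where the tools are `pan.sh` and `kitchen.sh`.

```python
import subprocess
from pathlib import Path

# Pan runs transformations (.ktr); Kitchen runs jobs (.kjb).
SCRIPTS = {".ktr": "pan.sh", ".kjb": "kitchen.sh"}


def build_command(tool_dir, file_path, log_level="Basic"):
    """Build the Pan or Kitchen command line for a .ktr or .kjb file."""
    file_path = Path(file_path)
    suffix = file_path.suffix.lower()
    if suffix not in SCRIPTS:
        raise ValueError(f"Expected a .ktr or .kjb file, got: {file_path}")
    return [
        str(Path(tool_dir) / SCRIPTS[suffix]),
        f"-file={file_path}",
        f"-level={log_level}",
    ]


def run(tool_dir, file_path, log_level="Basic"):
    """Run the file and return True if the tool exits with status 0."""
    completed = subprocess.run(build_command(tool_dir, file_path, log_level), check=False)
    return completed.returncode == 0


# Hypothetical paths; nothing is executed here, the command is only printed.
command = build_command("/opt/pentaho/data-integration", "/etl/load_customers.ktr")
print(" ".join(command))
```

Calling `run(...)` with the same arguments would execute the transformation and report success or failure, which makes it easy to repeat the same run after each change.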

Tune solution

Fine-tune transformations and jobs to optimize performance. This involves using tools such as the DI Operation and Audit Mart to determine where bottlenecks or other performance issues occur, and then addressing them.
Table 3. Tune Checklist
Task | Do This | Objective
Review the Performance Tuning Checklist and Make Changes to Transformations and Jobs
  • Get familiar with things that you can do to optimize performance.
  • Apply tuning tips as needed.
Consider other performance tuning options
  • Learn how to apply transactional databases.
  • Learn how to use logs to tune transformations and jobs.
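
As a rough illustration of using logged metrics to locate a bottleneck, the following Python sketch ranks steps by throughput. It assumes you have already exported per-step row counts and run times from your logs. The `StepMetric` structure, the step names, and the numbers are hypothetical and do not reflect the actual schema of any PDI log table.

```python
from dataclasses import dataclass


@dataclass
class StepMetric:
    step: str         # step name as it appears in the transformation
    rows_read: int    # rows the step processed
    seconds: float    # time the step spent running

    @property
    def rows_per_second(self):
        # A step with no measured run time cannot be rated as slow.
        return self.rows_read / self.seconds if self.seconds > 0 else float("inf")


def slowest_steps(metrics, limit=3):
    """Return the steps with the lowest throughput, slowest first."""
    return sorted(metrics, key=lambda m: m.rows_per_second)[:limit]


# Hypothetical metrics for one transformation run.
metrics = [
    StepMetric("Table input", 100_000, 20.0),
    StepMetric("Lookup customer", 100_000, 80.0),
    StepMetric("Table output", 100_000, 40.0),
]

for metric in slowest_steps(metrics, limit=2):
    print(f"{metric.step}: {metric.rows_per_second:.0f} rows/s")
# prints:
# Lookup customer: 1250 rows/s
# Table output: 2500 rows/s
```

Low throughput is only a starting point: a step can appear slow because it is waiting on a slower step upstream or downstream, so confirm the finding by rerunning the transformation after each change.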

Next steps

These resources will be helpful to you as you prepare to Go Live for Production: