Develop and Optimize a Pentaho Data Service
Develop a Pentaho Data Service
A Pentaho Data Service is a single, virtual table that contains the output of a transformation step. The Pentaho Data Service virtual table is JDBC-compliant data source that can be queried using SQL. You and others can connect to and query a Pentaho Data Service from within Pentaho or with some other JDBC-compliant tools like RStudio or SQuirreL. You do need to have access to the Pentaho Server and repository to publish and query the data service. To learn more about why you might want to use a Pentaho Data Service, see Turn Transformation Step Results into a Pentaho Data Service.
Create a Pentaho Data Service
To create a data service, name it, then select the step that outputs the data you want to be exposed to the data service. The virtual table that the data service creates has the same name as the data service.
- Create or open a transformation in Spoon. It is helpful to review Pentaho Data Service SQL Support Reference and Other Development Considerations as you create or review the transformation so that you understand what SQL is supported.
- Save the transformation to the Pentaho Data Server.
- Right-click the step in the transformation that outputs the data you want to make available to the data service, then select Data Services > New ....
- Enter a name for the data service in Service Name (Virtual Table Name) text box. Make sure the name is unique; no other data service stored locally or published to the Pentaho Server should have the same name.
- Make sure the Output Step is correct. If you need to change it, select an option from the drop down list.
- Click OK to exit the window and to save the data service. The data service badge is added to the transformation step as highlighted in the following graphic.
- A recommended but optional step, is to follow the steps in Test a Pentaho Data Service so that you can test your data service. Testing can help you correct, refine, and optimize your data service so it runs more efficiently.
Open, Edit or Delete a Pentaho Data Service
The following explains how to open a data service for editing or viewing, and how to delete a data service.
To Do This | Do One of These in Spoon |
---|---|
Open a Data Service for Editing or Viewing |
|
Delete a Data Service |
|
Test a Pentaho Data Service
After you’ve created a Pentaho Data Service, test it to make sure that it runs properly and generates the data you need. Testing can uncover bottlenecks in your transformation and can help you discover places where optimization techniques can be applied.
Run a Basic Test
To run a basic test of the data service, complete these steps.
- Do one of these things to open the Test Data Service window. The Test Data Service window is where you conduct your test.
- In the Data Service window, click Test Data Service.
- In the Spoon View tab, click Data Services. Right-click the data service you want, then select Test....
- Right-click the step in the transformation that the data service is attached to, then select Data Services > Test....
- There are a couple of optional settings that you can adjust.
- Log Level sets the amount of detail shown in the logs the test generates. Log results appear in the Query Results, Optimized Queries, Service Transformation Logging, Generated Transformation Logging, Service Metrics, and SQL Trans Metrics tabs. These tabs are detailed in Examine Test Results.
- Max Rows lets you set the maximum number of rows you want to see in your test results.
- To run the test, click Execute SQL.
- Examine the test results using the instructions in Examine Test Results.
- If you need to run another test, clear the cache, then run the test again.
- Click OK to exit the window.
- Optionally, you can choose to add an optimization if you want to make it run more efficiently.
- Publish the Pentaho Data Service.
Run an Optimization Test
If you have added an optimization of your data service, you will need to test it. To do this, first follow the instructions in the Run a Basic Test section, and then modify your query in the SQL area to pass a parameter for the optimization. Here is an example:
SELECT * FROM Mars WHERE rover=’Curiosity’
- To preview the optimization click Preview Optimization in the Test Data Service window.
- Use Examine Test Results instructions to help you interpret your test results.
Examine Test Results
Test results appear in the tabs in the bottom half of the Test Data Service window. The following table describes what appears in the tabs and provides test tips.
Tab | Description | Test Tips |
---|---|---|
Query Results | Shows events that occurred during the test run, as well as the query results. | Check this tab to make sure that you are getting the correct results. |
Optimized Queries | Displays processing information and the result of optimizations. Optimizations are techniques that can speed processing and help the Pentaho Data Service run more efficiently. | Examine this tab for results of applied optimizations. This can help you verify that the optimization was applied correctly. Consider using these results along with Service Metrics and Generated Transformation Logging results to see where optimizations are needed, and to assess whether optimizations are helping the transformation to run more efficiently. This tab shows the starting query as well as the input step queries after push down optimizations have been applied. There could be several input queries displayed in the tab: one for each input step that supports push down. |
Service Transformation Logging | Shows logs from the service transformation as it runs. The service transformation is the one that you created. When the transformation runs, PDI generates a second transformation (the generated transformation) that is written in SQL. | Review this tab to search for problems in the way the transformation is designed or built. If you see errors, review your transformation and any optimizations you might have applied. It is helpful to review the Service Transformation Logging and Generated Transformation Logging tabs together. |
Generated Transformation Logging | Shows logs from the generated transformation as it runs. The service transformation is the one that you created. When the transformation runs, PDI generates a second transformation (the generated transformation) that is written in SQL. | Check this tab for problems with the way the PDI transformation was generated. If you find issues, look at Pentaho Data Service SQL Support Reference and Other Development Considerations to make sure the SQL in the transformation is supported. It is helpful to review the Service Transformation Logging and Generated Transformation Logging tabs together. |
Service Metrics | Shows a GANTT chart of the amount of time, in milliseconds it took to run the different parts of the service transformation. | Examine the contents of this tab to find bottlenecks and to identify candidates for step optimization. Compare the Service Metrics and SQL Trans Metrics tabs to see if the same bottlenecks exist. If different bottlenecks occur, consider refining your transformation to eliminate or reduce them. |
SQL Trans Metrics | Shows a GANTT chart of the amount of time, in milliseconds it took to run the generated transformation. | Examine the contents of this tab to find bottlenecks and to identify candidates for step optimization. Compare the Service Metrics and SQL Trans Metrics tabs to see if the same bottlenecks exist. If different bottlenecks occur, consider refining your transformation to eliminate or reduce them. |
Optimize a Pentaho Data Service
As you test your data service, you might notice that there are certain bottlenecks, or parts of the transformation that could run more efficiently. If you want to improve the performance of your data service, apply an optimization technique. Here are a few techniques that are specifically designed for Pentaho Data Services.
Read the Best Practice - Pentaho Data Integration Performance Tuning article to learn about other general design and optimization techniques that can improve the performance of your transformation.
Optimization Technique | When to Use |
---|---|
Adjusts how long data results are cached. Consider using this technique if either of the following apply.
| |
Handles input step queries at the source. Consider using this technique if both of the following apply.
| |
Handles step queries at the source. Consider using this technique if both of the following apply.
|
Apply the Service Cache Optimization
This optimization technique stores the Pentaho Data Service in a cache. By default, caching is enabled and the results are stored for an hour, but you can adjust the cache so that the results of the data service are stored for one second or longer. Caching is a great technique to apply if you have moderately-sized data result sets, and you anticipate follow-on (subsequent) queries.
How the Service Cache Optimization Technique Works
If you run the data service during the time that the data service results are cached, PDI will run your query against the cached data set instead of running the entire transformation again only if certain conditions are met. These conditions are determined by the other optimization techniques you choose to apply as well as whether the query results are a subset of the cached data set.
Enabling the cache tells PDI to store the results of the data service transformation for the length of time you specify.
If you do not combine the Service Cache optimization with other optimization techniques and if you query the data service more than once before the cache expires, the query will run against the cached data set if the cached data set contains all of the matching records in the original data set. For example, if the initial run of the data service populates the cache with the results of SELECT * FROM employee, a subsequent data service call that retrieves a subset of the data (e.g. SELECT * FROM employee WHERE region = "South") will be run against the cached data set. For this to happen, the original query must have been run against the full data set.
When you run a test in in the Test Data Service window, then adjust the query and run the test again, the query is not run against the cache. This is because when you run a test, the results only return a certain number of records (100, 500, or 1000, depending on what you selected for Max Rows). For example, you could run a test that uses the SELECT * FROM employee query to return the first 100 records from a 5000 record table. The initial test query would return the first 100 records from the employee table. Let's assume that the cached result set consists of records for 50 males and 50 females. If you run a second test, but adjust the query so that you only show records for females (SELECT * FROM employee WHERE gender="F") running the query against the cached results would result in only 50 records being returned instead of 100. To avoid this and to provide more accurate results, PDI runs the transformation again so that the second test query is run against the table and returns the first 100 matching results.
Adjust the Cache Duration
To adjust the cache duration, complete the following steps.
- Click the Service Cache tab in the Data Service window.
- Make sure the cache is enabled making sure the Enable Caching checkbox is checked.
- Modify the Cache Duration (seconds) field.
- Click OK.
- Test the optimization.
- Publish the Pentaho Data Service.
Disable the Cache
To disable the cache, complete the following steps.
- Click the Service Cache tab in the Data Service window.
- Deselect Enable Caching to disable the cache.
- Click OK.
Clear the Cache
To clear the cache, complete the following steps.
- Click the Service Cache tab in the Data Service window.
- Deselect Enable Caching to disable the cache.
- Click OK to close the window.
- Open the Data Service window again, and click the Service Cache tab.
- Select Enable Caching to re-enable the cache.
Apply a Query Pushdown Optimization
Use the Query Pushdown optimization technique to translate the WHERE clause in a SQL query run against a data service to a corresponding WHERE clause in the Table Input or MongoDB Input steps. Queries in these input steps are then filtered down and more efficiently handled at the data source.
How the Query Pushdown Optimization Technique Works
To apply Query Pushdown optimization, first set the input step optimization values, then add the optimization parameter to the input step query.
The optimization requires the creation of a parameter that takes the place of the WHERE clause, like this: SELECT * FROM employee WHERE ${countryParam}.
If you combine this optimization technique with Service Cache optimization and if you query the data service more than once before the cache expires, the query will run against the cached data set if the cached data set contains all of the matching records in the original data set. Also, when you run a test in in the Test Data Service window, then adjust the query and run the test again, the query is not run against the cache. This is because when you run a test, the results only return a certain number of records (100, 500, or 1000, depending on what you selected for Max Rows). For more information on the optimization technique, see the Apply a Service Cache Optimization section.
Add the Query Pushdown Parameter to the Table Input or MongoDB Input Steps
The input step allows you to query the data source. These instructions explain how to add a parameter that will act as a WHERE clause in the SQL query or the equivalent type of clause in the MongoDB query.
- Create a transformation that has a Table Input or MongoDB Input step.
- Run the transformation to make sure it executes properly.
- Double-click the input step (Table Input or MongoDB Input) that contains the query you want to optimize.
- Add a parameter to the location of the WHERE clause value, like this:
- SQL Query Example: SELECT * FROM media WHERE ${countryParam}
- MongoDB Query Example: {$match : ${mongoDbParam}}
- Press [CTRL]+[SPACE] to view the list of parameters. Click on a parameter from the list to add it to the query.
- Select the Replace Variables in Script? checkbox.
- Click the OK button.
- Set up the Query Pushdown Parameter Optimization.
Set Up Query Pushdown Parameter Optimization
To set up this optimization, complete these steps.
- Open the Data Service window, then click the Query Pushdown tab.
- Click the + button near the Parameters label.
- In the Create Parameter window add the name of an optimization parameter you created in the input step’s SQL query.
- Click OK.
- Select the step that contains the parameter from Step Name.
- In the Definitions area of the window, enter the Data Service Field and the Step Field you want to parameterize.
- The Data Service Field contains the transformation's name of the field you want to parameterize. The name should be as it appears in the transformation's output field. For example, you might have renamed the cty field from your data source to country. You would enter country in the Data Service Field.
- The Step Field should contain the data source's name of the field you want to parameterize. For example, if you wanted to parameterize the cty field in a MySQL database, enter cty in the Step Field.
- Optionally, you can click the Get Optimizations button to automatically generate input step optimizations based on the output fields for the step on which you've created the data service and input fields.
- Click the OK button to save and close the window.
- Test the optimization.
- Publish the Pentaho Data Service.
Disable the Query Pushdown Optimization
To disable this optimization, click the Disable an Optimization checkbox in the Data Services window.
Apply a Parameter Pushdown Optimization
The Parameter Pushdown optimization technique can be applied to any step in the transformation. Although similar to the Query Pushdown optimization, it differs not only because it can applied to any step, but also because it maps a field value to a parameter in a simple WHERE clause in which a parameter is assigned a specific value using the equal operator like this: WHERE region= "South".
How the Parameter Pushdown Optimization Technique Works
To set up the other the parameter pushdown optimization, first set up the optimization, then add the parameter to the transformation step.
If you combine this optimization technique with Service Cache optimization and if you query the data service more than once before the cache expires, the query will run against the cached data set if the cached data set contains all of the matching records in the original data set. Also, when you run a test in in the Test Data Service window, then adjust the query and run the test again, the query is not run against the cache. This is because when you run a test, the results only return a certain number of records (100, 500, or 1000, depending on what you selected for Max Rows). For more information on the optimization technique, see the Apply a Service Cache Optimization section.
Add the Parameter Pushdown Parameter to the Step
These instructions explain how to add a parameter that will act to limit an input step (such as REST Client) in some way.
- Create a transformation.
- Run the transformation to make sure it executes properly.
- Add the parameter you want to optimize to one of the steps. Typically, you would want to specify the parameter in a JSON or REST Client step. If you use a filtered step, you would need to use the Get Variables step before the filtered step.
- Click the OK button.
- Set up the Parameter Pushdown Optimization.
Set Up Parameter Pushdown Optimization
To set up the parameter pushdown optimization, complete these steps.
- Open the Data Service window, then click the Parameter Pushdown tab.
- In the WHERE Clause Column, click in a cell and enter the Data Service Field name from the existing fields listed in the pull down menu. The Data Service Field name is the virtual table field name in your WHERE clause. Press [ENTER]. Possible values are defaulted for Transformation Parameter and Value Format.
- Adjust the name of the Transformation Parameter as needed. The name is something that you create; it must be unique in the data service. Type the name in the Transformation Parameter field or in the window transformation properties. (The parameter is created in the transformation properties or is something you specify in a field).
- If needed, add a prefix or suffix to the default %s in Value Format field. For example, if you want to format a value, consider using [value]=%s. In most cases, the default %s is sufficient formatting.
- Save the optimization and exit the window by clicking OK.
- Test the optimization.
- Publish the Pentaho Data Service.
Publish a Pentaho Data Service
To publish a data service, save the transformation that contains the data service to the EE Repository on the Pentaho Repository. Complete these steps.
- The transformation should already be saved to the Pentaho Repository. Since the Pentaho Repository is located in the Pentaho Server. Make sure that any external assets needed by the transformation can be accessed remotely, even if the transformation is run by a different user.
- Open a browser and go to the Pentaho Server and log in. If you have installed the Pentaho Server locally, the URL is usually localhost:8080.
- Validate that the data server was published by listing the data services on the server like this:
http://<Pentaho Server Host:Port>/pentaho/kettle/listServices
- Now you are ready to share the data service with others.
Monitor a Pentaho Data Service
To monitor a data service, type one of the following commands into your browser. Replace the <Pentaho Server Host:Port> of the Pentaho Server with the host name or IP address and the port of the Pentaho Server you are running the data service on. You will need to have access to the Pentaho Server and be authenticated to run the following commands.
List Names of Data Services on the Pentaho Server | List Status of Data Services on the Pentaho Server |
---|---|
http://<Pentaho Server Host:Port>/pentaho/kettle/listServices | http://<Pentaho Server Host:Port>/pentaho/kettle/status |
- If you want to query the Pentaho Data Service from another reporting tool, see the instructions in Connect to a Pentaho Data Service.
- Read the Share a Pentaho Data Service with Others section also to see what types of information you will need to supply to others using the data service.
Share a Pentaho Data Service with Others
Now that you’ve created and tested a data service, you can share it with other people so that they can connect to and query it. Here are some things that you’ll need to do.
- Make sure the person or group that you want to share the data service with has permissions to:
- Connect, read, and execute a query on the Pentaho Server to retrieve data service data.
- Read and execute the transformation that contains the data service.
- If you have optimized the data service, provide the person or group with names and definitions for any parameters that you have created. This will allow them to use the parameters when they query the data service.
- Provide the person or group the instructions in Connect to a Pentaho Data Service.