Pentaho Data Services
Prototyping a data model can be time consuming, particularly when it involves setting up databases, creating the data model, building a data warehouse, and then negotiating access so that analysts can visualize the data and provide feedback.
One way to streamline this process is to make the output of a transformation step a Pentaho Data Service. The output of the transformation step is exposed by the data service so that the output data can be queried as if it were stored in a physical table, even though the results of the transformation are not stored in a physical database. Instead, results are published to the Pentaho Server as a virtual table.
You must have a Pentaho Server and repository to publish the data service.
The virtual table is a JDBC-compliant data source that you and others can connect to and query with SQL, provided you can access the server and the transformation. The Pentaho Data Service can be connected to or queried by JDBC-compliant tools such as Pentaho Report Designer, Interactive Reporting, and CTools, as well as other compatible tools like RStudio, DBVisualizer, or SQuirreL.
The Pentaho Data Service can also be used in instances where building and maintaining a data warehouse is impractical or inefficient, especially when you need to quickly blend and visualize fast-moving or quickly evolving data sets on the fly. For example, if you want to compare your product prices with your competitors' prices, you can create a transformation that blends prices from your in-house data sources with competitor prices. Then, you can convert the output step in the transformation into a Pentaho Data Service that creates a virtual table for querying when you connect to the Pentaho Server. You or others can connect to and query the virtual table, as you would any other JDBC data source, to visualize the results in Analyzer or another tool.
The Pentaho Data Service also has a testing tool. This tool generates several logs and reports that you can use to refine the data service and determine where to apply specialized optimizations. You can also define parameters that others can use to pose customized queries. For example, you can create a data service that publishes a virtual “fact” table of a moderately sized research dataset to a Pentaho Server. You can test and add optimizations and parameters, such as gender or test type, so that the data service runs more quickly. Then, you can share connection and parameter information with a group of researchers, who can query the virtual table. Researchers can use Pentaho Interactive Reporting, a dashboard created with CTools, or an application of their choice, such as RStudio, to analyze and visualize the research dataset.
Pentaho Data Services supports a subset of SQL. For more details on what is supported, see Pentaho Data Service SQL Support Reference and Other Development Considerations. Also, see our Components Reference for a complete list of traditional supported data sources.
Create a Regular or Streaming Pentaho Data Service
You can create either a regular data service or a streaming data service. A streaming data service is commonly used when creating a streaming data dashboard with CTools.
Perform the following steps to create a data service, name it, and select the step whose output you want to expose through the data service:
- Create or open a transformation in the PDI client. It is helpful to review Pentaho Data Service SQL Support Reference and Other Development Considerations as you create or review the transformation so that you understand which SQL commands are supported.
- Save the transformation to the Pentaho Server.
- Right-click the transformation step in the transformation that outputs the data you want to make available as a data service, then select Data Services > New.
- Enter a unique name for the data service in the Service Name (Virtual Table Name) text box. The virtual table that the data service creates has the same name as the data service.
Ensure the name is unique; no other data service stored locally or published to the Pentaho Server should have the same name.
- Verify that the Output Step is the step you selected to become the data service in Step 3. If you need to change it, select the correct step name from the list.
- If you are working with streaming data, select Streaming for the Data Service Type.
- Click OK to save the data service and exit the window. The data service badge is added to the transformation step icon.
- Although optional, it is recommended that you follow the steps in Test a Pentaho Data Service to test your data service. Testing can help you correct, refine, and optimize your data service so it runs more efficiently.
Data Service Badge
After you create a Pentaho Data Service from a step, a data service badge is added to that step. The badge designates whether the step has a related regular or streaming data service, as shown in the following images:
These badges help indicate which steps in your transformations are also data services and whether they are regular or streaming.
Open or Edit a Pentaho Data Service
To open a data service for viewing or editing, perform one of the following actions in the PDI client:
- In the View tab of the Explore pane, click Data Services. Right-click the name of the data service you want to open or edit, then select Edit.
- Right-click the step, indicated by the data service badge, that contains the data service, then select Data Services > Edit.
Delete a Pentaho Data Service
To delete a data service, perform one of the following actions in the PDI client:
- In the View tab of the Explore pane, click Data Services. Right-click the name of the data service you want to delete, then select Delete.
- Right-click the step, indicated by the data service badge, that contains the data service, then select Data Services > Delete.
Test a Pentaho Data Service
After creating your data service, test it to ensure that it runs properly and generates the data you need. Testing can uncover bottlenecks in your transformation and can help you discover places where optimization techniques can be applied.
Run a Basic Test
To run a basic test on a regular data service, perform the following steps:
- Verify Data Service Type is set to Regular.
- Perform one of the following actions to open the Test Data Service window:
- In the Data Service window, click Test Data Service.
- In the View tab of the PDI client Explore pane, click Data Services. Right-click the name of the data service you want, then select Test.
- Right-click the step attached to the data service as indicated by the data service badge, then select Data Services > Test.
- If needed, adjust the following optional settings:
- Log Level sets the amount of detail shown in the logs the test generates. Log results appear in the Query Results, Optimized Queries, Service Transformation Logging, Generated Transformation Logging, Service Metrics, and SQL Trans Metrics tabs. These tabs are detailed in Examine Test Results.
- Max Rows lets you set the maximum number of rows you want to see in your test results.
- To run the test, click Execute SQL.
- Examine the test results using the instructions in Examine Test Results.
- If you need to run another test, clear the cache, then run the test again.
- Click OK to exit the window.
- Optionally, add an optimization to make the data service run more efficiently.
- Publish the Pentaho Data Service.
Run a Streaming Optimization Test
To test streaming data, the records of the stream must be partitioned into windows (batching) for processing. How the records are batched depends on the window mode you choose. A window can be time-based or row-based. A time-based window collects the rows that arrive within a specified interval of time. A row-based window is created once the specified number of rows has been collected for processing.
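To illustrate how row-based windowing behaves, the following Python sketch batches a stream of records into windows. The function name and logic are illustrative only and do not reflect Pentaho's internal implementation; `window_size` and `every` correspond to the Window Size and Every settings described below.

```python
def row_based_windows(rows, window_size, every):
    """Batch a list of records into row-based windows.

    window_size: number of rows each window holds.
    every: number of rows that elapse before a new window starts.
    Windows overlap when every < window_size.
    """
    windows = []
    for start in range(0, len(rows), every):
        window = rows[start:start + window_size]
        if window:
            windows.append(window)
    return windows

# With a window size of 4 rows and a new window every 2 rows,
# consecutive windows overlap by 2 rows:
print(row_based_windows(list(range(8)), window_size=4, every=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7]]
```

A time-based window works analogously, except that `window_size` and `every` are measured in milliseconds rather than rows.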
To run an optimization test on a streaming data service, perform the following steps:
- Verify Data Service Type is set to Streaming.
- Perform one of the following actions to open the Test Data Service window:
- In the Data Service window, click Test Data Service.
- In the View tab of the PDI client Explore pane, click Data Services. Right-click the name of the data service you want to test, then select Test.
- Right-click the step in the transformation (identified by the data service badge) that attaches to the data service, then select Data Services > Test.
- Select the window mode (Time Based or Row Based) used for processing during the test, and specify the times (in milliseconds) or number of rows depending on the window mode for the following settings:
- Window Size defines the number of rows that a window will have (row-based), or the time frame, in milliseconds, for capturing new rows to a window (time-based).
- Every sets the number of rows (row-based), or milliseconds (time-based) that should elapse before creating a new window.
- Limit sets the maximum number of milliseconds (row-based) or rows (time-based) to wait for a new window to be generated.
- If needed, adjust the following optional settings:
- Log Level sets the amount of detail shown in the logs the test generates. Log results appear in the Query Results, Optimized Queries, Service Transformation Logging, Generated Transformation Logging, Service Metrics, and SQL Trans Metrics tabs. These tabs are detailed in Examine Test Results.
- Max Rows lets you set the maximum number of rows you want to see in your test results.
- To run the test, click Execute SQL.
- Examine the test results using the instructions in Examine Test Results.
- Click OK to exit the window.
- Optionally, add an optimization to make the data service run more efficiently.
- Publish the Pentaho Data Service.
Run an Optimization Test
If you have added an optimization to your data service, you will need to test it. First, follow the instructions in the Run a Basic Test section, and then modify your query in the SQL area to pass a parameter for the optimization, as shown in the following example:
SELECT * FROM Mars WHERE rover='Curiosity'
- To preview the optimization, click Preview Optimization in the Test Data Service window.
- Use the Examine Test Results instructions to help you interpret your test results.
Examine Test Results
Test results appear in the tabs in the bottom half of the Test Data Service window.
When you test or run a data service, there are two transformations that run: a service transformation and a generated transformation. The service transformation is the one that you created. When the transformation runs, PDI generates a second transformation (the generated transformation) based on the executed SQL.
The following table describes what appears in the tabs and possible test tips on how to proceed:
Tab | Description | Test Tips |
---|---|---|
Query Results | Shows events that occurred during the test run, as well as the query results. | Review this tab to ensure that you are getting the correct results. |
Optimized Queries | Displays processing information and the results of optimizations. Optimizations are techniques that can speed processing and help the Pentaho Data Service run more efficiently. | Review this tab for the results of applied optimizations. The results can help you verify that an optimization was applied correctly. Consider using these results along with the Service Metrics and Generated Transformation Logging results to see where optimizations are needed, and to assess whether optimizations are helping the transformation run more efficiently. This tab shows the starting query as well as the input step queries after push-down optimizations have been applied. Several input queries may be displayed in the tab, one for each input step that supports push down. |
Service Transformation Logging | Shows logs from the service transformation as it runs. | Review this tab for problems in the way the transformation is designed or built. If you see errors, review your transformation and any optimizations you might have applied. It is helpful to review the Service Transformation Logging and Generated Transformation Logging tabs together. |
Generated Transformation Logging | Shows logs from the generated transformation as it runs. | Review this tab for problems in the way the PDI transformation was generated. If you find issues, look at Pentaho Data Service SQL Support Reference and Other Development Considerations to ensure the SQL in the transformation is supported. It is also helpful to review the Service Transformation Logging and Generated Transformation Logging tabs together. |
Service Metrics | Shows a Gantt chart of the amount of time, in milliseconds, that it took to run the different parts of the service transformation. | Review the contents of this tab to find bottlenecks and to identify candidates for step optimization. Compare the Service Metrics and SQL Trans Metrics tabs to see if the same bottlenecks exist. If different bottlenecks occur, consider refining your transformation to eliminate or reduce them. |
SQL Trans Metrics | Shows a Gantt chart of the amount of time, in milliseconds, that it took to run the generated transformation. | Review the contents of this tab to find bottlenecks and to identify candidates for step optimization. Compare the Service Metrics and SQL Trans Metrics tabs to see if the same bottlenecks exist. If different bottlenecks occur, consider refining your transformation to eliminate or reduce them. |
Publish a Pentaho Data Service
To publish your data service, save the transformation containing the data service to the Pentaho Repository.
Perform the following steps to validate that your data service has been published:
- Because the Pentaho Repository resides on the Pentaho Server, ensure that any external assets needed by the transformation can be accessed remotely, even if the transformation is run by a different user.
- Open a browser, go to the Pentaho Server, and log in. If you have installed the Pentaho Server locally, the URL is usually localhost:8080.
- Validate that your data service was published by listing the data services on the server, as shown in the following example:
http://<Pentaho Server Host:Port>/pentaho/kettle/listServices
You are now ready to share the data service with others.
Share a Pentaho Data Service with Others
With your data service now created and tested, you can share it with others so they can identify it, connect to it, and query it.
Perform the following actions to share your data service:
- Ensure the person or group with whom you want to share your data service has the following permissions:
- Connect, read, and execute a query on the Pentaho Server to retrieve data service data.
- Read and execute the transformation that contains the data service.
- Provide the person or group with the unique data service name as saved to the Pentaho Repository.
- If you have optimized the data service, provide the person or group with names and definitions for any parameters that you have created so they can use the parameters when they query the data service.
- Provide the person or group with the instructions in Connect to a Pentaho Data Service from a Pentaho Tool or Connect to the Pentaho Data Service from a Non-Pentaho Tool depending on how the service is used.
Connect to the Pentaho Data Service from a Pentaho Tool
A Pentaho Data Service is a virtual table that contains the output of a PDI transformation. You can connect to and query a regular Pentaho Data Service from any Pentaho tool, such as Report Designer, the PDI client (Spoon), and Analyzer. You can connect to and query a streaming Pentaho Data Service from a dashboard created with CTools.
To connect and query the Pentaho Data Service, you need to know the data service name and have permission to run the transformation and to access the Pentaho Server where it is published.
Connecting to the data service from another Pentaho tool is the same as connecting to a database. For information on connecting to a database, refer to Define Data Connections for the Pentaho Server. The following tables describe the parameters needed to make a Pentaho Data Service connection:
Required Parameters | Description |
---|---|
Connection Name | The unique name of the data service you want to access |
Connection Type | Pentaho Data Services |
Access | Native (JDBC) |
Hostname | Host name or IP address of the Pentaho Server. By default, this is localhost if running the Pentaho Server locally. |
Port Number | Port number of the Pentaho Server the data service will run on. The default is 8080. |
Web App Name | Name of the web application. The default value is pentaho, which is typically used by the other Pentaho tools. |
Username | Name of a user who has permission to run the data service. |
Password | Password for a user who has permission to run the data service. |
You can also set the following optional parameters.
Optional Parameters | Description |
---|---|
proxyhostname | Proxy server for HTTP connections. |
proxyport | Proxy server port. |
nonproxyhosts | Hosts that do not use the proxy server. If there is more than one host name, separate them with commas. |
debugtrans | Optional name of the file where the generated transformation is stored, so it can be debugged. Specify the name of the transformation or a path plus the name of the transformation. Example: /tmp/debug.ktr. |
PARAMETER_[optionname]=value | Sets the value for a parameter in the transformation. |
secure | Set this parameter to TRUE to use the secure HTTPS protocol to connect to the data service. If you omit this parameter or set it to FALSE, the unsecured HTTP protocol is used. |
Connect to the Pentaho Data Service from a Non-Pentaho Tool
A Pentaho Data Service is a virtual table that contains the output of a step in a PDI transformation. You can identify, connect to, and query the Pentaho Data Service from a non-Pentaho tool, like RStudio or SQuirreL. For a streaming data service, you must use a JDBC driver from your non-Pentaho tool to access streaming operations.
To connect and query the Pentaho Data Service, you need to know the data service name and have permission to run the transformation and to access the Pentaho Server where it is published.
To connect to and run a Pentaho Data Service from a non-Pentaho tool, like SQuirreL or Beaker, you need to install the data service driver files, then create a connection to the data service from your tool. The driver is bundled with PDI. If you want someone who does not have PDI to connect to your data service, you will need to download the driver and give it to them so that they can install it.
Step 1: Download the Pentaho Data Service JDBC Driver
You can download the driver using the PDI client or manually.
To download the Pentaho Data Service JDBC Driver using the PDI client, perform the following steps:
- Open the transformation and identify the data service step indicated by the data service badge.
- Perform one of the following actions:
- Right-click the step, then select Data Services > Driver Details.
- In the transformation's View tab, open the Data Services folder, then right-click the data service and select Driver Details.
- In the Data Services window, click Driver Details.
- In the Driver Details window, click Get Driver.
- In the Save As window, indicate where you want to save the Pentaho-Data-Service-Driver.zip file, then click Save.
- Click Close to exit the window.
- Follow the instructions to Install the Pentaho Data Service JDBC Driver.
To manually download the Pentaho Data Service JDBC Driver perform the following steps:
- Go to the <pentaho installation directory>/design-tools/data-integration/Data Service JDBC Driver directory and copy the zipped driver file from it.
- Follow the instructions to Install the Pentaho Data Service JDBC Driver.
Step 2: Install the Pentaho Data Service JDBC Driver
To install the Pentaho Data Service JDBC Driver, complete these steps.
- Extract the driver files that you downloaded in the Download the Pentaho Data Service JDBC Driver instructions and copy them into the JDBC directory of your application. For example, if you want to access the data service from RStudio, copy the driver files into the same directory as the other JDBC drivers. If you are not sure where this is, check the documentation for that application.
- Stop and restart the application.
- In the application, make a connection to the data service, using the instructions in Connect to the Pentaho Data Service from a Non-Pentaho Tool.
Step 3: Create a Connection from a Non-Pentaho Tool
Once the driver is installed, you need to create the connection to the Pentaho Data Service. For many tools, you can connect by specifying a connection object. Review the required and optional connection parameters in Connect to the Pentaho Data Service from a Pentaho Tool.
You will also need the JDBC driver class from the following table.
Parameter | Value |
---|---|
JDBC Driver Class | org.pentaho.di.trans.dataservice.jdbc.ThinDriver |
The JDBC connection string uses the following format:
jdbc:pdi://<Pentaho Server Hostname:Port>/kettle?option=value&option=value
The following example shows how you might format a connection string:
jdbc:pdi://localhost:8080/kettle?webappname=pentaho
The webappname is required if the data service is running on the Pentaho Server.
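For tools that accept a raw JDBC URL, the connection string can be assembled programmatically. The following Python sketch builds the string from the parts described above; the helper name is hypothetical, while the jdbc:pdi URL format and the option names come from this document.

```python
from urllib.parse import urlencode

def build_pdi_url(host, port, webappname="pentaho", **options):
    """Assemble a jdbc:pdi connection string for a Pentaho Data Service.

    Extra keyword arguments become option=value pairs on the URL,
    e.g. secure="TRUE". Values are percent-encoded.
    """
    params = {"webappname": webappname, **options}
    return f"jdbc:pdi://{host}:{port}/kettle?{urlencode(params)}"

print(build_pdi_url("localhost", 8080))
# jdbc:pdi://localhost:8080/kettle?webappname=pentaho
print(build_pdi_url("reports.example.com", 8443, secure="TRUE"))
# jdbc:pdi://reports.example.com:8443/kettle?webappname=pentaho&secure=TRUE
```

The resulting string is what you would pass to your tool's JDBC connection dialog, together with the driver class from the table above.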
Query a Pentaho Data Service
You can use SQL to query a data service. You can also add a conditional statement to the query if the data service transformation uses a parameter. The parameter must be added to the transformation by an ETL developer before the data service is created. Refer to Develop a Pentaho Data Service for more details. You can then assign a value to the parameter in your query.
To find the name of the table to query, connect to the data service, then use your tool's explorer to find the table name. The name of the table is usually the same as the name of the data service.
The Pentaho Data Service has the following limitations:
- There are SQL limitations for queries. See the Pentaho Data Service SQL Support Reference and Other Development Considerations article for more details.
- Although you can use any Pentaho-supported input source, data services can only be queried with SQL.
Example
The following example of a SQL query shows how to include a parameter. The example queries the employeeList data service for records, and when the data service transformation is executed, any use of the parameter employeeRegion in it will be substituted with the value USA EAST. This parameter could be in any transformation step that allows variable or parameter substitution.
You can also assign a value to the parameter in the Connection Properties window.
Use the following sample syntax:
SELECT * FROM '<data service name>' WHERE PARAMETER('<parameter_name>') = '<parameter_value>'
The following query would be used for our example:
SELECT * FROM 'employeeList' WHERE PARAMETER('employeeRegion')='USA EAST'
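If you generate such queries in code, the quoting can be handled by a small helper. This is a hypothetical Python helper built around the query syntax shown above, not part of any Pentaho API; note that it does no escaping of quote characters inside values.

```python
def parameter_query(service_name, parameter_name, value):
    """Build a data service query that assigns a value to a
    transformation parameter via the PARAMETER() clause."""
    return (f"SELECT * FROM '{service_name}' "
            f"WHERE PARAMETER('{parameter_name}') = '{value}'")

print(parameter_query("employeeList", "employeeRegion", "USA EAST"))
# SELECT * FROM 'employeeList' WHERE PARAMETER('employeeRegion') = 'USA EAST'
```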
Monitor a Pentaho Data Service
To monitor a data service, enter one of the following URLs into your browser:
List Names of Data Services on the Pentaho Server | List Status of Data Services on the Pentaho Server |
---|---|
http://<Pentaho Server Host:Port>/pentaho/kettle/listServices | http://<Pentaho Server Host:Port>/pentaho/kettle/status |
Replace <Pentaho Server Host:Port> with the host name or IP address and the port of the Pentaho Server running the data service. You must have access to the Pentaho Server and be authenticated to run these commands.
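These monitoring endpoints can also be called from a script. The sketch below only builds the request URL and an HTTP Basic authentication header; the endpoint paths come from the URLs above, while the helper name and the example credentials are illustrative.

```python
import base64

def monitoring_request(host_port, user, password, endpoint="status"):
    """Build the monitoring URL and a Basic auth header for it.

    endpoint is either "status" or "listServices".
    """
    url = f"http://{host_port}/pentaho/kettle/{endpoint}"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return url, {"Authorization": f"Basic {token}"}

url, headers = monitoring_request("localhost:8080", "admin", "password")
print(url)
# http://localhost:8080/pentaho/kettle/status
```

You could then pass the URL and header to any HTTP client to retrieve the service list or status page.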