Embed Pentaho Data Integration
You can build and run PDI transformations and jobs in other Java applications. Examples of these types of PDI transformations and jobs are available in the kettle-sdk-embedding-samples directory of the accompanying project within the sample code package. This sample project has a set of dependencies you should consider before using these example transformations and jobs.
Get started with embedding PDI
Consider the following dependencies while embedding PDI:
- Complete set of dependent PDI files
- Default OSGi features for PDI
- Kettle (Non-OSGi) plugins
- PDI Enterprise Edition license file
Complete set of dependent PDI files
The following PDI directories contain a complete set of all JAR files needed:
- pentaho/design-tools/data-integration/lib
- pentaho/design-tools/data-integration/libswt/<os>
- pentaho/design-tools/data-integration/classes
These dependencies must be included in your class path. You can copy the directories into your project’s directory structure or specify the path directly to your PDI installation, as shown in the following example code:
java -classpath "lib/*;libswt/linux/*;classes/*" MyApp.java
java -classpath "$PDI_DI_DIR/lib/*;$PDI_DI_DIR/libswt/linux/*;$PDI_DI_DIR/classes/*" MyApp.java
OSGi features for PDI
To use the OSGi features of PDI, make the pentaho/design-tools/data-integration/system directory available to your application. This directory is required for proper Karaf initialization. You can use either of the following methods to specify this directory:
- Copy the pentaho/design-tools/data-integration/system directory into the <working directory>/system directory of your application.
- Set the pentaho.user.dir system property to point to the PDI pentaho/design-tools/data-integration directory, either through the command line option -Dpentaho.user.dir=<pdi install>/data-integration or directly in your code, for example: System.setProperty("pentaho.user.dir", new File("<pdi install>/data-integration").getAbsolutePath());
Kettle (non-OSGi) plugins
Make the Kettle (non-OSGi) plugins available to your application. With a standard install, the Kettle engine looks for plugins in either <working directory>/plugins or <user.home>/.kettle/plugins. You can use either of the following methods to make the default Kettle plugins available:
- Copy the pentaho/design-tools/data-integration/plugins directory into the <working directory>/plugins directory of your application.
- Set the KETTLE_PLUGIN_BASE_FOLDERS system property to point to the PDI pentaho/design-tools/data-integration directory, either through the command line option -DKETTLE_PLUGIN_BASE_FOLDERS=<pdi install>/data-integration or directly in your code, for example: System.setProperty("KETTLE_PLUGIN_BASE_FOLDERS", new File("<pdi install>/data-integration").getAbsolutePath());
Once the plugin locations are properly configured, you can add custom plugins to those locations. You can also add custom plugins in other locations, as long as they are registered with the appropriate implementation of PluginTypeInterface prior to initializing the Kettle environment, as shown in the following code example:
StepPluginType.getInstance().getPluginFolders().add(
    new PluginFolder("<path to the plugin folder>", false, true));
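If you prefer to configure both of these locations from code instead of command line options, set the system properties before the Kettle environment is initialized. The following is a minimal sketch; the class name and installation path are placeholders, not part of the sample project:

import java.io.File;

import org.pentaho.di.core.KettleEnvironment;

public class EmbeddedPdiSetup {
  public static void main(String[] args) throws Exception {
    // Placeholder install location; replace with your own PDI path.
    String pdiDir = new File("/opt/pentaho/design-tools/data-integration").getAbsolutePath();

    // Point Karaf initialization at the PDI system directory and
    // register the default (non-OSGi) Kettle plugin folder.
    System.setProperty("pentaho.user.dir", pdiDir);
    System.setProperty("KETTLE_PLUGIN_BASE_FOLDERS", pdiDir);

    // Initialize the Kettle environment only after the properties are in place.
    KettleEnvironment.init();

    // ... build and run transformations or jobs here ...

    KettleEnvironment.shutdown();
  }
}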
Pentaho license file
Before initializing the Kettle environment, you must install the PDI Enterprise Edition license file for each user account. Then, to ensure that the Pentaho Server uses the same location to store and retrieve your Pentaho license, you must also create a PENTAHO_INSTALLED_LICENSE_PATH system environment variable for each account. The location of your license path must be available to the user accounts that run the Pentaho Server. For information about installing the license and setting the variable path, see Manage licenses using the command line interface.
Sample class scenarios
For each of the following embedding scenarios, a sample class can be executed as a stand-alone Java application:
- Run Transformations
- Run Jobs
- Dynamically Build Transformations
- Dynamically Build Jobs
Each sample has an associated unit test. To run an individual sample, execute the following command:
mvn test -Dtest=<sample unit test class>
The following sections describe how to use these samples as templates for embedding PDI in your applications.
Run transformations
The org.pentaho.di.sdk.samples.embedding.RunningTransformations class is an example of how to run a PDI transformation from Java code in a stand-alone application. This class sets parameters and executes the sample transformations in the pentaho/design-tools/data-integration/etl directory. You can run a transformation from its KTR file using runTransformationFromFileSystem() or from a PDI repository using runTransformationFromRepository().
Consider the following general steps while running an embedded transformation:
Procedure
1. Initialize the Kettle environment. Always make the first call to KettleEnvironment.init() whenever you are working with the PDI APIs.
2. Prepare the transformation. The definition of a PDI transformation is represented by a TransMeta object. You can load this object from a KTR file, a PDI repository, or generate it dynamically. To query the declared parameters of the transformation definition, use listParameters(). To set the assigned values, use setParameterValue().
3. Execute the transformation. An executable Trans object is derived from the TransMeta object that is passed to the constructor. The Trans object starts, then executes asynchronously. To ensure that all steps of the Trans object have completed, call waitUntilFinished().
4. Evaluate the outcome. After the Trans object completes, you can access the result using getResult(). The Result object can be queried for success by evaluating getNrErrors(). This method returns zero (0) on success and a non-zero value when there are errors. To get more information, retrieve the transformation log lines.
5. Shut down listeners. When the transformations have completed, call KettleEnvironment.shutdown() to ensure the proper shutdown of all Kettle listeners.
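The following is a minimal sketch of these steps, not the sample class itself; the KTR path and parameter name are placeholders:

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.Result;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunTransformationSketch {
  public static void main(String[] args) throws Exception {
    // 1. Initialize the Kettle environment.
    KettleEnvironment.init();
    try {
      // 2. Prepare the transformation from a KTR file (placeholder path).
      TransMeta transMeta = new TransMeta("etl/parametrized_transformation.ktr");

      // 3. Derive an executable Trans object, assign a parameter value
      //    (placeholder parameter name), and execute asynchronously.
      Trans trans = new Trans(transMeta);
      trans.setParameterValue("some.parameter", "someValue");
      trans.execute(null);
      trans.waitUntilFinished();

      // 4. Evaluate the outcome: zero errors means success.
      Result result = trans.getResult();
      System.out.println("Errors: " + result.getNrErrors());
    } finally {
      // 5. Shut down all Kettle listeners.
      KettleEnvironment.shutdown();
    }
  }
}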
Run jobs
The org.pentaho.di.sdk.samples.embedding.RunningJobs class is an example of how to run a PDI job from Java code in a stand-alone application. This class sets parameters and executes the job in etl/parametrized_job.kjb. You can run the job from the .kjb file using runJobFromFileSystem() or from a repository using runJobFromRepository().
Consider the following general steps while running an embedded job:
Procedure
1. Initialize the Kettle environment. Always make the first call to KettleEnvironment.init() whenever you are working with the PDI APIs.
2. Prepare the job. The definition of a PDI job is represented by a JobMeta object. You can load this object from a KJB file, a PDI repository, or generate it dynamically. To query the declared parameters of the job definition, use listParameters(). To set the assigned values, use setParameterValue().
3. Execute the job. An executable Job object is derived from the JobMeta object that is passed to the constructor. The Job object starts, then executes in a separate thread. To wait for the job to complete, call waitUntilFinished().
4. Evaluate the outcome. After the Job completes, you can access the result using getResult(). The returned Result object can be queried for success by calling its own getResult() method, which returns true on success and false on failure. To get more information, retrieve the job log lines.
5. Shut down listeners. When the jobs have completed, call KettleEnvironment.shutdown() to ensure the proper shutdown of all Kettle listeners.
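A comparable sketch for a job is shown below; it is not the sample class itself, and the parameter name is a placeholder:

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.Result;
import org.pentaho.di.job.Job;
import org.pentaho.di.job.JobMeta;

public class RunJobSketch {
  public static void main(String[] args) throws Exception {
    // 1. Initialize the Kettle environment.
    KettleEnvironment.init();
    try {
      // 2. Prepare the job definition from a KJB file; no repository is used.
      JobMeta jobMeta = new JobMeta("etl/parametrized_job.kjb", null);
      jobMeta.setParameterValue("some.parameter", "someValue"); // placeholder parameter name

      // 3. Derive the executable Job, start it in a separate thread, and wait for it.
      Job job = new Job(null, jobMeta);
      job.start();
      job.waitUntilFinished();

      // 4. Result.getResult() returns true on success, false on failure.
      Result result = job.getResult();
      System.out.println("Job succeeded: " + result.getResult());
    } finally {
      // 5. Shut down all Kettle listeners.
      KettleEnvironment.shutdown();
    }
  }
}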
Dynamically build transformations
The org.pentaho.di.sdk.samples.embedding.GeneratingTransformations class is an example of a dynamic transformation. This class generates a transformation definition and saves it to a KTR file.
Consider the following general steps while dynamically building a transformation:
Procedure
1. Initialize the Kettle environment. Always make the first call to KettleEnvironment.init() whenever you are working with the PDI APIs.
2. Create and configure a transformation definition object. A transformation definition is represented by a TransMeta object. Create this object using the default constructor. The transformation definition includes the name, the declared parameters, and the required database connections.
3. Populate the TransMeta object with transformation steps. The data flow of a transformation is defined by steps that are connected by hops. Perform the following tasks to populate the object with a transformation step:
   - Create the step by instantiating its class directly and configure it by using its get and set methods. Transformation steps reside in sub-packages of org.pentaho.di.trans.steps. For example, to use the Get File Names step, create an instance of org.pentaho.di.trans.steps.getfilenames.GetFileNamesMeta and use its get and set methods to configure it.
   - Obtain the step ID string. Each PDI step has an ID that can be retrieved from the PDI plugin registry. A simple way to retrieve the step ID is to call PluginRegistry.getInstance().getPluginId(StepPluginType.class, theStepMetaObject).
   - Create an instance of org.pentaho.di.trans.step.StepMeta by passing the step ID string, the name, and the configured step object to the constructor. An instance of StepMeta encapsulates the step properties, as well as controls the placement of the step on the PDI client (Spoon) canvas and connections to hops. Once the StepMeta object has been created, call setDrawn(true) and setLocation(x,y) to make sure the step appears correctly on the PDI client canvas.
   - Add the step to the transformation by calling addStep() on the transformation definition object.
4. Connect the hops. Once steps have been added to the transformation definition, they need to be connected by hops. To create a hop, create an instance of org.pentaho.di.trans.TransHopMeta, passing in the From and To steps as arguments to the constructor. Add the hop to the transformation definition by calling addTransHop().
Results
The generated transformation definition can be saved to a KTR file by calling getXML() and opening the file in the PDI client for inspection. The sample class org.pentaho.di.sdk.samples.embedding.GeneratingTransformations generates an example transformation of this kind.
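As an illustration of the procedure, the following is a minimal sketch that builds a two-step transformation (a Row Generator connected to a Dummy step); the step choice, names, and coordinates are arbitrary, and the output is printed instead of written to a file:

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.plugins.PluginRegistry;
import org.pentaho.di.core.plugins.StepPluginType;
import org.pentaho.di.trans.TransHopMeta;
import org.pentaho.di.trans.TransMeta;
import org.pentaho.di.trans.step.StepMeta;
import org.pentaho.di.trans.steps.dummytrans.DummyTransMeta;
import org.pentaho.di.trans.steps.rowgenerator.RowGeneratorMeta;

public class GenerateTransformationSketch {
  public static void main(String[] args) throws Exception {
    KettleEnvironment.init();

    // Create an empty transformation definition and name it.
    TransMeta transMeta = new TransMeta();
    transMeta.setName("generated_transformation");

    PluginRegistry registry = PluginRegistry.getInstance();

    // Configure a step object directly, then wrap it in a StepMeta.
    RowGeneratorMeta generatorMeta = new RowGeneratorMeta();
    generatorMeta.setDefault();
    String generatorId = registry.getPluginId(StepPluginType.class, generatorMeta);
    StepMeta generatorStep = new StepMeta(generatorId, "Generate rows", generatorMeta);
    generatorStep.setDrawn(true);
    generatorStep.setLocation(100, 100);
    transMeta.addStep(generatorStep);

    // A second step to receive the rows.
    DummyTransMeta dummyMeta = new DummyTransMeta();
    String dummyId = registry.getPluginId(StepPluginType.class, dummyMeta);
    StepMeta dummyStep = new StepMeta(dummyId, "Dummy", dummyMeta);
    dummyStep.setDrawn(true);
    dummyStep.setLocation(300, 100);
    transMeta.addStep(dummyStep);

    // Connect the two steps with a hop.
    transMeta.addTransHop(new TransHopMeta(generatorStep, dummyStep));

    // Serialize the definition; this XML could be written to a .ktr file.
    System.out.println(transMeta.getXML());

    KettleEnvironment.shutdown();
  }
}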
Dynamically build jobs
The org.pentaho.di.sdk.samples.embedding.GeneratingJobs class is an example of a dynamic job. This class generates a job definition and saves it to a KJB file.
Consider the following general steps while dynamically building a job:
Procedure
1. Initialize the Kettle environment. Always make the first call to KettleEnvironment.init() whenever you are working with the PDI APIs.
2. Create and configure a job definition object. A job definition is represented by a JobMeta object. Create this object using the default constructor. The job definition includes the name, the declared parameters, and the required database connections.
3. Populate the JobMeta object with job entries. The control flow of a job is defined by job entries that are connected by hops. Perform the following tasks to populate the object with a job entry:
   - Create the entry by instantiating its class directly and configure it by using its get and set methods. Job entries reside in sub-packages of org.pentaho.di.job.entries. For example, to use the File Exists job entry, create an instance of org.pentaho.di.job.entries.fileexists.JobEntryFileExists and use setFilename() to configure it. The Start entry is implemented by org.pentaho.di.job.entries.special.JobEntrySpecial.
   - Create an instance of org.pentaho.di.job.entry.JobEntryCopy by passing the entry created in the previous step to the constructor. An instance of JobEntryCopy encapsulates the properties of an entry, as well as controls the placement of the entry on the PDI client canvas and connections to hops. Once created, call setDrawn(true) and setLocation(x,y) to make sure the entry appears correctly on the PDI client canvas.
   - Add the entry to the job by calling addJobEntry() on the job definition object. It is possible to place the same entry in several places on the canvas by creating multiple instances of JobEntryCopy and passing in the same entry instance.
4. Connect the hops. Once entries have been added to the job definition, they need to be connected by hops. To create a hop, create an instance of org.pentaho.di.job.JobHopMeta, passing in the From and To entries as arguments to the constructor. Configure the hop accordingly: make it a green or red hop by calling setConditional() and setEvaluation(true/false), or call setUnconditional() if it is an unconditional hop. Add the hop to the job definition by calling addJobHop().
Results
The generated job definition can be saved to a KJB file by calling getXML() and opening the file in the PDI client for inspection. The sample class org.pentaho.di.sdk.samples.embedding.GeneratingJobs generates an example job of this kind.
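The following is a minimal sketch of the job procedure, wiring a Start entry to a File Exists entry; the entry names and the checked file path are placeholders:

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.job.JobHopMeta;
import org.pentaho.di.job.JobMeta;
import org.pentaho.di.job.entries.fileexists.JobEntryFileExists;
import org.pentaho.di.job.entries.special.JobEntrySpecial;
import org.pentaho.di.job.entry.JobEntryCopy;

public class GenerateJobSketch {
  public static void main(String[] args) throws Exception {
    KettleEnvironment.init();

    // Create an empty job definition and name it.
    JobMeta jobMeta = new JobMeta();
    jobMeta.setName("generated_job");

    // Every job needs a Start entry, implemented by JobEntrySpecial.
    JobEntrySpecial start = new JobEntrySpecial("START", true, false);
    JobEntryCopy startCopy = new JobEntryCopy(start);
    startCopy.setDrawn(true);
    startCopy.setLocation(100, 100);
    jobMeta.addJobEntry(startCopy);

    // A File Exists entry, configured through its setters.
    JobEntryFileExists fileExists = new JobEntryFileExists("Check file");
    fileExists.setFilename("/tmp/some_file.txt"); // placeholder path
    JobEntryCopy fileExistsCopy = new JobEntryCopy(fileExists);
    fileExistsCopy.setDrawn(true);
    fileExistsCopy.setLocation(300, 100);
    jobMeta.addJobEntry(fileExistsCopy);

    // Connect the entries with an unconditional hop from Start.
    JobHopMeta hop = new JobHopMeta(startCopy, fileExistsCopy);
    hop.setUnconditional();
    jobMeta.addJobHop(hop);

    // Serialize the definition; this XML could be written to a .kjb file.
    System.out.println(jobMeta.getXML());

    KettleEnvironment.shutdown();
  }
}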
Obtain logging information
When you need more information about how transformations and jobs execute, you can view PDI log lines and text.
PDI collects log lines in a central place. The org.pentaho.di.core.logging.KettleLogStore class manages all log lines and provides methods for retrieving the log text for specific entities. To retrieve log text or log lines, supply the log channel ID generated by PDI during runtime. You can obtain the log channel ID by calling getLogChannelId(), which is part of LoggingObjectInterface. Jobs, transformations, job entries, and transformation steps all implement this interface.
For example, assuming the job variable is an instance of a running or completed job, the following code shows how you retrieve the job's log lines:
LoggingBuffer appender = KettleLogStore.getAppender();
String logText = appender.getBuffer(job.getLogChannelId(), false).toString();
The main methods in the sample classes org.pentaho.di.sdk.samples.embedding.RunningJobs
and org.pentaho.di.sdk.samples.embedding.RunningTransformations
retrieve log information from the executed transformation or job in this manner.
Expose a transformation or job as a web service
You can run a PDI transformation or job as part of a web service by developing one of the following implementations:
- Write a servlet that maps incoming parameters for a transformation step or job entry and executes them as part of the request cycle.
- Use the Carte server or the Pentaho Server directly by building a transformation that writes its output to the HTTP response of the Carte server. Then, specify the Pass Output to Servlet option in the Text Output, XML Output, JSON Output, or scripting steps to write output to the HTTP response. For an example, run the pentaho/design-tools/data-integration/samples/transformations/Servlet Data Example.ktr sample transformation on Carte.
Use non-native plugins
To use non-native plugins with an embedded Pentaho Server, you must configure the server to find where the plugins reside. How you configure the server depends on whether your plugin is a directory with associated files or a single JAR file.
If your plugins are directories with associated files, register the directories by setting the KETTLE_PLUGIN_BASE_FOLDERS system property just before the call to KettleEnvironment.init(), as shown in the following example for the plugins and plugins2 directories:
System.setProperty("KETTLE_PLUGIN_BASE_FOLDERS", "C:\\pentaho\\data-integration\\plugins,c:\\plugins2"); KettleEnvironment.init();
If your plugin is a single JAR file, annotate the classes for the plugin and include them in the class path, then set the KETTLE_PLUGIN_CLASSES system property to register the fully qualified class names just before the call to KettleEnvironment.init(), as shown in the following example for a jsonoutput plugin:
System.setProperty("KETTLE_PLUGIN_CLASSES","org.pentaho.di.trans.steps.jsonoutput.JsonOutputMeta"); KettleEnvironment.init();
If you have custom transformation steps or job entries, you must use one of the above two methods to configure the locations where the embedded server will search for your custom transformation steps or custom job entries.