Hitachi Vantara Lumada and Pentaho Documentation

Copy files to a Hadoop YARN cluster

If a job that runs on a YARN cluster needs other files to execute, such as variables from your local copy of kettle.properties, those files must be copied to the YARN cluster. An easy way to do this is to add the files to the YARN Workspace folder. At runtime, PDI copies every file in the YARN Workspace folder to the YARN cluster. This feature is well suited to jobs that move through the development, testing, and staging lifecycle, because the job uses the configuration files in the KETTLE_HOME directory of the environment in which it runs.

Caution: Files in the YARN Workspace folder are copied to the YARN cluster every time you run a job that starts the YARN Kettle Cluster. If you don't want to overwrite files of the same name that are already on the YARN Kettle Cluster, delete those files from the YARN Workspace folder. Then, in the Start a PDI Cluster on YARN entry window, clear the appropriate checkboxes in the Copy Local Resource Files to YARN section.

Add files to the YARN Workspace folder

These instructions explain how to configure the Start a PDI Cluster on YARN entry so that the following files are copied at runtime to the YARN Workspace folder and then to the YARN cluster: kettle.properties, shared.xml, and repositories.xml. They also explain how to manually copy additional files to the folder.

If the job is run from your local installation, the configuration files from your KETTLE_HOME directory are copied to the YARN Workspace folder. If the job is scheduled or is run on a Pentaho Server, the configuration files from the server's configured KETTLE_HOME are copied to the YARN Workspace folder.
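Before running the job, it can help to confirm which configuration files are actually present in the environment. The following is a minimal Python sketch, not part of PDI. It assumes the usual Kettle convention that these files live in a `.kettle` directory under KETTLE_HOME, or under the user's home directory when KETTLE_HOME is not set. The function name `find_kettle_config_files` is hypothetical.

```python
import os
from pathlib import Path

# The three files named in the Copy Local Resource Files to YARN section.
CONFIG_FILES = ("kettle.properties", "shared.xml", "repositories.xml")


def find_kettle_config_files(env=None):
    """Map each configuration file name to its path, or to None if it is missing.

    Assumption: the files live in <KETTLE_HOME>/.kettle, falling back to the
    user's home directory when KETTLE_HOME is not set.
    """
    env = os.environ if env is None else env
    base = Path(env.get("KETTLE_HOME") or Path.home())
    kettle_dir = base / ".kettle"
    return {
        name: (kettle_dir / name if (kettle_dir / name).is_file() else None)
        for name in CONFIG_FILES
    }


if __name__ == "__main__":
    for name, path in find_kettle_config_files().items():
        print(f"{name}: {path or 'not found'}")
```

Running the script on the machine that executes the job (your workstation or the Pentaho Server host) shows which of the three files are available to be copied.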

Complete these steps:

Procedure

  1. Set the active YARN Hadoop cluster using the instructions found in Configuring Pentaho for Your Hadoop Distro and Version.

  2. Complete the instructions in the Additional Configuration for YARN shims article.

  3. In Spoon, create or open a job that contains the Start a PDI Cluster on YARN entry.

  4. Open the Start a PDI Cluster on YARN entry.

  5. Select any combination of the kettle.properties, shared.xml, and repositories.xml checkboxes in the Copy Local Resource Files to YARN section of the window.

  6. Save and close the Start a PDI Cluster on YARN entry.

  7. If you want to copy other files to the cluster, manually copy them to the YARN Workspace folder here: pentaho-big-data-plugin/plugins/pentaho-kettle-yarn-plugin/workspace.

  8. Save and run the job.

Results

At runtime, whichever of the kettle.properties, shared.xml, and repositories.xml files you selected are copied to the YARN Workspace folder and then to the YARN cluster.
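The manual copy of additional files into the YARN Workspace folder can also be scripted. The following is a minimal Python sketch under stated assumptions: the function name `copy_to_yarn_workspace` is hypothetical, and `parent_dir` stands for whatever directory contains pentaho-big-data-plugin in your installation, which this article does not name.

```python
import shutil
from pathlib import Path

# Relative location of the YARN Workspace folder, as given in this article.
WORKSPACE_SUBPATH = Path(
    "pentaho-big-data-plugin/plugins/pentaho-kettle-yarn-plugin/workspace"
)


def copy_to_yarn_workspace(parent_dir, files):
    """Copy each file into the YARN Workspace folder and return the new paths.

    parent_dir is the directory that contains pentaho-big-data-plugin
    (an assumption; supply the location used by your installation).
    """
    workspace = Path(parent_dir) / WORKSPACE_SUBPATH
    if not workspace.is_dir():
        raise FileNotFoundError(f"YARN Workspace folder not found: {workspace}")
    copied = []
    for f in files:
        src = Path(f)
        # An existing file with the same name in the workspace is overwritten.
        copied.append(Path(shutil.copy2(src, workspace / src.name)))
    return copied
```

The sketch refuses to create the workspace folder itself, so a mistyped `parent_dir` fails loudly instead of silently copying files to the wrong place.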

Delete files from the YARN Workspace folder

To delete files from the YARN Workspace folder, remove them manually. The YARN Workspace folder is located here: pentaho-big-data-plugin/plugins/pentaho-kettle-yarn-plugin/workspace.
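If you clean the folder regularly, the removal can be scripted. The following is a minimal Python sketch, not a PDI feature. The function name `remove_from_yarn_workspace` and the `dry_run` flag are hypothetical, and `parent_dir` is assumed to be the directory that contains pentaho-big-data-plugin in your installation.

```python
from pathlib import Path

# Relative location of the YARN Workspace folder, as given in this article.
WORKSPACE_SUBPATH = Path(
    "pentaho-big-data-plugin/plugins/pentaho-kettle-yarn-plugin/workspace"
)


def remove_from_yarn_workspace(parent_dir, names, dry_run=True):
    """Remove the named files from the YARN Workspace folder.

    Returns the paths that were removed, or that would be removed when
    dry_run is True. Names that do not match a file are skipped.
    """
    workspace = Path(parent_dir) / WORKSPACE_SUBPATH
    matched = []
    for name in names:
        # Use only the final path component so the target stays in the workspace.
        target = workspace / Path(name).name
        if target.is_file():
            if not dry_run:
                target.unlink()
            matched.append(target)
    return matched
```

Calling the function with the default `dry_run=True` first lists what would be deleted; pass `dry_run=False` to remove the files.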