Copy files to a Hadoop YARN cluster
A job that runs on a YARN cluster may need other files to execute, such as variables defined in your local copy of kettle.properties. Those files must be copied to the YARN cluster. An easy way to do this is to add them to the YARN Workspace folder. At runtime, PDI copies all of the files in the YARN Workspace folder to the YARN cluster. This feature is well suited for jobs that move through the development, testing, and staging lifecycle, because the job uses the configuration files from the KETTLE_HOME directory of the environment in which it runs.
Add files to the YARN Workspace folder
These instructions explain how to configure the Start a PDI Cluster on YARN entry so that the following files are copied at runtime to the YARN Workspace folder and then to the YARN cluster: kettle.properties, shared.xml, and repositories.xml. They also explain how to manually copy additional files to the folder.
If the job is run from your local installation, the configuration files from your KETTLE_HOME directory are copied to the YARN Workspace folder. If the job is scheduled or is run on a Pentaho Server, the configuration files from the server's configured KETTLE_HOME are copied to the YARN Workspace folder.
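Before running the job, it can help to check which of the three configuration files actually exist in the environment the job will run from. The following is a minimal sketch, not part of PDI. It assumes the usual Kettle layout, in which the configuration files live in a .kettle folder under KETTLE_HOME, or under the user's home directory when KETTLE_HOME is not set. The function names kettle_config_dir and present_config_files are hypothetical.

```python
import os
from pathlib import Path

# The three files the Start a PDI Cluster on YARN entry can copy.
CONFIG_FILES = ("kettle.properties", "shared.xml", "repositories.xml")


def kettle_config_dir(env=None):
    """Return the folder assumed to hold the Kettle configuration files.

    Assumption: files live in <KETTLE_HOME>/.kettle, or in
    <user home>/.kettle when KETTLE_HOME is not set.
    """
    env = os.environ if env is None else env
    base = env.get("KETTLE_HOME")
    root = Path(base) if base else Path.home()
    return root / ".kettle"


def present_config_files(config_dir):
    """List which of the three configuration files exist in config_dir."""
    return [name for name in CONFIG_FILES if (Path(config_dir) / name).is_file()]


if __name__ == "__main__":
    folder = kettle_config_dir()
    print(folder, present_config_files(folder))
```

Run the check on the machine that will launch the job (your workstation or the Pentaho Server), because that environment's KETTLE_HOME determines which files are copied.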
Complete these steps:
Procedure
Set the active YARN Hadoop cluster using the instructions found in Configuring Pentaho for Your Hadoop Distro and Version.
Complete the instructions in the Additional Configuration for YARN shims article.
In Spoon, create or open a job that contains the Start a PDI Cluster on YARN entry.
Open the Start a PDI Cluster on YARN entry.
Select any combination of the kettle.properties, shared.xml, and repositories.xml checkboxes in the Copy Local Resource Files to YARN section of the window.
Save and close the Start a PDI Cluster on YARN entry.
To copy other files to the cluster, manually copy them to the YARN Workspace folder, which is located here: pentaho-big-data-plugin/plugins/pentaho-kettle-yarn-plugin/workspace.
Save and run the job.
Results
At runtime, whichever of the kettle.properties, shared.xml, and repositories.xml files you selected are copied to the YARN Workspace folder and then to the YARN cluster.
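The manual copy of additional files described in the procedure can also be scripted. The following is a minimal sketch, not a Pentaho utility. The function name copy_to_workspace is hypothetical, and the caller must supply the full path to the workspace folder, because the location of pentaho-big-data-plugin depends on your installation.

```python
import shutil
from pathlib import Path


def copy_to_workspace(files, workspace_dir):
    """Copy each file in `files` into the YARN Workspace folder.

    `workspace_dir` is the full path to
    .../pentaho-big-data-plugin/plugins/pentaho-kettle-yarn-plugin/workspace
    on your installation (the prefix is installation-specific).
    Returns the list of destination paths.
    """
    workspace = Path(workspace_dir)
    if not workspace.is_dir():
        raise FileNotFoundError(f"YARN Workspace folder not found: {workspace}")

    copied = []
    for src in map(Path, files):
        if not src.is_file():
            raise FileNotFoundError(f"Not a file: {src}")
        dest = workspace / src.name
        shutil.copy2(src, dest)  # overwrites an existing file of the same name
        copied.append(dest)
    return copied
```

The sketch fails early if the workspace folder does not exist instead of creating it, so that a mistyped path does not silently produce a folder that PDI never reads.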
Delete files from the YARN Workspace folder
To delete files from the YARN Workspace folder, remove them manually. The YARN Workspace folder is located here: pentaho-big-data-plugin/plugins/pentaho-kettle-yarn-plugin/workspace.
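If you clean up the folder regularly, the manual removal can be scripted as well. The following is a minimal sketch under the same assumptions as any script against this folder: the function name remove_from_workspace is hypothetical, and the caller passes the full, installation-specific path to the workspace folder.

```python
from pathlib import Path


def remove_from_workspace(names, workspace_dir):
    """Delete the named files from the YARN Workspace folder.

    Only the final path component of each name is used, so the
    function never deletes anything outside `workspace_dir`.
    Names that do not exist in the folder are skipped.
    Returns the list of paths that were removed.
    """
    workspace = Path(workspace_dir)
    removed = []
    for name in names:
        target = workspace / Path(name).name
        if target.is_file():
            target.unlink()
            removed.append(target)
    return removed
```

Files removed this way are no longer copied to the YARN cluster the next time the job runs.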