Manage Hadoop configurations through PDI

Within PDI, a Hadoop configuration is the collection of Hadoop libraries required to communicate with a specific version of Hadoop and related tools, such as Hive, HBase, Sqoop, or Pig.

Hadoop configurations are defined in the plugin.properties file and are designed to be easily configured within PDI by changing the active.hadoop.configuration property. The plugin.properties file resides in the pentaho-big-data-plugin/ folder.
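
For example, switching PDI to a different configuration is a one-line change to that property. The configuration name shown here is illustrative:

# plugins/pentaho-big-data-plugin/plugin.properties
# Name of the Hadoop configuration PDI loads at startup
active.hadoop.configuration=hdp30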

All Hadoop configurations share a basic structure. Elements of the structure are defined in the table following this code block.

configuration/
|-- lib/
|   |-- client/
|   |-- pmr/
|   `-- *.jar
|-- config.properties
|-- core-site.xml
`-- configuration-implementation.jar
Configuration Element | Definition
lib/ | Libraries specific to the version of Hadoop this configuration was created to communicate with.
client/ | Libraries that are only required on a Hadoop client, for instance hadoop-core-* or hadoop-client-*.
pmr/ | Jar files that contain libraries required for parsing data in input/output formats or otherwise outside of any PDI-based execution.
*.jar | All other libraries required for the Hadoop configuration, other than client-only or special pmr jar files, that need to be available to the entire JVM of Hadoop job tasks.
config.properties | Contains metadata and configuration options for this Hadoop configuration. Provides a way to define a configuration name, additional classpath, and native libraries the configuration requires. See the comments in this file for more details.
core-site.xml | Configuration file that can be replaced to set a site-specific configuration; for example, hdfs-site.xml would be used to configure HDFS.
configuration-implementation.jar | File that must be replaced in order to communicate with this configuration.
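
As a sketch, the config.properties entries described above might look like the following. The property names and values are illustrative; consult the comments in the actual file shipped with your configuration:

# config.properties (illustrative sketch)
# User-friendly name for this Hadoop configuration
name=My Custom Hadoop Distribution
# Additional class path entries this configuration requires (comma-separated)
classpath=
# Native libraries this configuration requires
library.path=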

Create a new Hadoop configuration

If you have a Hadoop distribution that is not supported by Pentaho, or you have modified your Hadoop installation in such a way that it is no longer compatible with Pentaho, you may need to create a new Hadoop configuration.

Changing which version of Hadoop PDI can communicate with requires you to swap the appropriate jar files within the plugin directory and then update the plugin.properties file.

Caution: Creating a new Hadoop configuration is not officially supported by Pentaho. Please inform Pentaho support regarding your requirements.

Procedure

  1. Identify which Hadoop configuration most closely matches the version of Hadoop you want to communicate with.

  2. Copy this configuration's folder, then paste and rename the copy. If you compare the default configurations that are included, the differences between them are apparent.

    The name of this folder will be the name of your new configuration.
  3. Copy the jar files for your specified Hadoop version.

  4. Paste the jar files into the lib/ directory.

  5. Change the active.hadoop.configuration= property in the plugins/pentaho-big-data-plugin/plugin.properties file to match your specific Hadoop configuration.

    This property configures which distribution of Hadoop to use when communicating with a Hadoop cluster and must match the name of the folder you created in Step 2. Update this property if you are using a version other than the default Hadoop version. A sketch of the whole procedure follows these steps.
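
Taken together, the steps above might look like the following shell session. The folder names and paths here are illustrative assumptions, including the hadoop-configurations/ subfolder in which the configurations are presumed to reside; adjust them to your installation:

cd plugins/pentaho-big-data-plugin/hadoop-configurations

# Steps 1-2: copy the closest matching configuration and rename the copy;
# the copy's folder name becomes the new configuration's name
cp -r hdp30 my-custom-hadoop

# Steps 3-4: copy the jar files for your Hadoop version into lib/
cp /path/to/hadoop/jars/*.jar my-custom-hadoop/lib/

# Step 5: edit plugin.properties (one level up) so that it reads:
#   active.hadoop.configuration=my-custom-hadoop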

Include or exclude classes or packages for a Hadoop configuration

You have the option to include or exclude classes or packages from loading with a Hadoop configuration.

Configure these options within the plugin.properties file located at plugins/pentaho-big-data-plugin. For additional information, see the comments within the plugin.properties file.

  • Include Additional Class Paths or Libraries

    To include additional class paths, native libraries, or a user-friendly configuration name, include the directory in the classpath property within the big data plugin.properties file, as shown in the snippet after this list.

  • Exclude Classes or Packages

    To exclude classes or packages from duplicate loading by a Hadoop configuration class loader, include them in the ignored.classes property within the plugin.properties file. This is necessary when logging libraries expect a single class to be shared by all class loaders, as with Apache Commons Logging, for example.
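
Both of the properties above live in the same plugin.properties file. A minimal sketch, with illustrative paths and package names:

# plugins/pentaho-big-data-plugin/plugin.properties
# Additional directories to add to the configuration's class path
classpath=/opt/custom-hadoop/extra-libs
# Classes or packages the Hadoop configuration class loader should not load
# itself, so that a single copy is shared by all class loaders
ignored.classes=org.apache.commons.logging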