Within PDI, a Hadoop configuration is the collection of Hadoop libraries required to communicate with a specific version of Hadoop and related tools, such as Hive, HBase, Sqoop, or Pig.
Hadoop configurations are defined in the plugin.properties file and are designed to be easily configured within PDI by changing the active hadoop.configuration property. The plugin.properties file resides in the
All Hadoop configurations share a basic structure. The elements of the structure are defined after the following listing:
configuration/
|-- lib/
|   |-- pmr/
|   `-- *.jar
|-- config.properties
|-- core-site.xml
`-- configuration-implementation.jar
|lib/ |Libraries specific to the version of Hadoop with which this configuration was created to communicate.
|pmr/ |JAR files that contain libraries required for parsing data in input/output formats or otherwise outside of any PDI-based execution.
|*.jar |All other libraries required for the Hadoop configuration that are not client-only or special PMR JAR files and that need to be available to the entire JVM of Hadoop job tasks.
|config.properties |Contains metadata and configuration options for this Hadoop configuration. It provides a way to define a configuration name, additional classpath, and native libraries that the configuration requires. See the comments in this file for more details.
|core-site.xml |Configuration file that can be replaced to set a site-specific configuration. For example, hdfs-site.xml would be used to configure HDFS.
|configuration-implementation.jar |File that must be replaced to communicate with this configuration.
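As a quick sanity check, the layout described above can be verified with a short script. The following is a minimal sketch, not part of PDI: the function name missing_elements is hypothetical, and it assumes the implementation JAR sits at the top level of the configuration directory under a file name that varies, so it only checks that some *.jar file is present there.

```python
from pathlib import Path

# Expected layout, taken from the structure listing above.
REQUIRED_DIRS = ["lib", "lib/pmr"]
REQUIRED_FILES = ["config.properties", "core-site.xml"]


def missing_elements(config_dir):
    """Return the expected elements that are absent from config_dir."""
    root = Path(config_dir)
    missing = [d + "/" for d in REQUIRED_DIRS if not (root / d).is_dir()]
    missing += [f for f in REQUIRED_FILES if not (root / f).is_file()]
    # Assumption: the implementation JAR sits at the top level of the
    # configuration directory; its real file name varies, so only the
    # presence of some *.jar file is checked.
    if not any(root.glob("*.jar")):
        missing.append("configuration-implementation.jar")
    return missing
```

An empty list means every expected element was found; anything else names the pieces to add before selecting the configuration.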
Include or Exclude Classes or Packages for a Hadoop Configuration
You have the option to include or exclude classes or packages from loading with a Hadoop configuration.
Configure these options within the plugin.properties file located at plugins/pentaho-big-data-plugin. For additional information, see the comments within the plugin.properties file.
Include Additional Class Paths or Libraries
To include additional class paths, native libraries, or a user-friendly configuration name, add the directory to the classpath property in the Big Data plugin's plugin.properties file.
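If the property is edited by script rather than by hand, the change might look like the sketch below. It is illustrative only: the helper add_classpath_entry is hypothetical, and it assumes that multiple classpath entries are separated by commas, which should be confirmed against the comments in the properties file itself.

```python
def add_classpath_entry(properties_text, directory):
    """Return properties_text with directory appended to the classpath property."""
    lines = properties_text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Skip comments and lines without a key=value pair.
        if stripped.startswith(("#", "!")) or "=" not in stripped:
            continue
        key, value = stripped.split("=", 1)
        if key.strip() == "classpath":
            # Assumption: multiple entries are separated by commas.
            entries = [e.strip() for e in value.split(",") if e.strip()]
            if directory not in entries:
                entries.append(directory)
            lines[i] = "classpath=" + ",".join(entries)
            break
    else:
        # No classpath property yet: add one at the end.
        lines.append("classpath=" + directory)
    return "\n".join(lines) + "\n"
```

The function leaves comments and other properties untouched and does nothing if the directory is already listed, so it can be run repeatedly without duplicating entries.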
Exclude Classes or Packages
To exclude classes or packages from duplicate loading by a Hadoop configuration class loader, include them in the ignored.classes property within the plugin.properties file. This is necessary when a library expects a single class to be shared by all class loaders, as is the case with logging libraries such as Apache Commons Logging.
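To see which classes a given ignored.classes value would cover, the property can be read and tested as in the sketch below. Both helper names are hypothetical, and two assumptions are made: entries are comma-separated, and an entry matches either that exact class or any class inside that package. The actual class loader may apply different matching rules.

```python
def parse_ignored_classes(properties_text):
    """Return the entries of the ignored.classes property as a list."""
    for line in properties_text.splitlines():
        stripped = line.strip()
        # Skip comments and lines without a key=value pair.
        if stripped.startswith(("#", "!")) or "=" not in stripped:
            continue
        key, value = stripped.split("=", 1)
        if key.strip() == "ignored.classes":
            # Assumption: entries are separated by commas.
            return [e.strip() for e in value.split(",") if e.strip()]
    return []


def is_ignored(class_name, ignored):
    """Check whether class_name falls under any ignored class or package."""
    # Assumption: an entry matches the exact class name or any class
    # located in that package or one of its subpackages.
    return any(
        class_name == entry or class_name.startswith(entry + ".")
        for entry in ignored
    )
```

For example, with an entry of org.apache.commons.logging, the class org.apache.commons.logging.Log would be treated as ignored, while a class in an unrelated package would not.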