Hitachi Vantara Lumada and Pentaho Documentation

Post-install system configurations

After installing Lumada Data Catalog, you must perform the following system configuration changes:

  • Updating HiveServer2 with LDC JARs
  • Installing JDBC drivers
  • Configure CDH
  • Configuring HDP 3.1
  • Configure LDC ports
  • Customize the session timeout
  • Set the Spark log location
  • Setting logging properties
  • Set a temporary file location for the web server

Updating Hive Server with LDC JARs

How you update HiveServer2 with the LDC JARs depends on your Hadoop distribution. Data Catalog supports the creation of Hive tables for any resource in HDFS by using third-party and custom functions contained in JAR files. HiveServer2 must recognize these JAR files so that the applications reading these tables can use the same JARs.

You need to update the following two files:

  • Hive SerDe JAR (hive-serde-1.2.2.jar)
  • Data Catalog Hive format JAR (ldc-hive-formats-6.1.1.jar)

These JARs are shipped with the Data Catalog installation package and are located in the <Agent Location>/ext path.

How you place these JAR files depends on the following Hadoop distributions:

  • For CDH and EMR, copy the JAR files to the HiveServer2 hive/auxlib directory.
  • For HDP, copy the JAR files to the hive/lib directory.

After you have copied these files to the correct directory, you must configure them.
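
The placement rule above can be sketched in Python. This is an illustrative helper, not part of the Data Catalog distribution: the function names are hypothetical, the Hive home directory is whatever your cluster uses, and the HDP target is assumed to be hive/lib, matching the hive.aux.jars.path example later on this page.

```python
import shutil
from pathlib import Path

# Subdirectory of the Hive home that receives the JARs for each distribution.
# The HDP entry is an assumption based on the hive/lib path used elsewhere on this page.
AUX_SUBDIR = {"CDH": "auxlib", "EMR": "auxlib", "HDP": "lib"}


def hive_jar_dir(distribution: str, hive_home: str) -> Path:
    """Return the directory that should receive the JARs for a distribution."""
    key = distribution.upper()
    if key not in AUX_SUBDIR:
        raise ValueError(f"unsupported distribution: {distribution}")
    return Path(hive_home) / AUX_SUBDIR[key]


def copy_jars(agent_ext: str, distribution: str, hive_home: str,
              jar_names: list[str]) -> list[Path]:
    """Copy each named JAR from <Agent Location>/ext into the Hive directory."""
    target = hive_jar_dir(distribution, hive_home)
    # copy2 preserves timestamps and mode bits; it fails if the target directory is missing.
    return [Path(shutil.copy2(Path(agent_ext) / name, target)) for name in jar_names]
```

For example, copy_jars("/opt/ldc/agent/ext", "CDH", "<hive home>", ["hive-serde-1.2.2.jar"]) would copy the SerDe JAR into the auxlib directory under the Hive home you supply.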

Configure HiveServer2 for LDC

If you are using HiveServer2, you must configure it for Data Catalog because all users must have access to the same custom JARs when creating and accessing the generated Hive views. Data Catalog requires a designated location on the HDFS or S3 system that the Data Catalog service user can access and where all users have read and write access to the custom JARs. As a best practice, review your cluster configuration to determine the best method.

Perform the following steps to configure HiveServer2 for Data Catalog:

Procedure

  1. Point HiveServer2 to the location of the custom JAR files using one of the following three methods:

    • Place the Hive SerDe JAR and Data Catalog Hive format JAR in the hive/auxlib directory in the HiveServer2 node.
    • Copy the JARs to an alternate location on the HiveServer2 node and set the path of the alternate directory in the hive.aux.jars.path in hive-site.xml.

      For example:

      <property>
          <name>hive.aux.jars.path</name>
          <value>/usr/hdp/2.6.2.14-5/hive/lib</value>
          <description>
            The location of the plugin jars that contain implementations of the user defined functions and serdes.
          </description>
      </property>
    • Copy the JARs to the hive/auxlib directory or an alternate location on the HiveServer2 node and set the path of this directory as the HIVE_AUX_JARS_PATH in the hive-env.sh file.

      For example:

      $ export HIVE_AUX_JARS_PATH=/usr/hdp/2.6.2.14-5/hive/lib

    Note: The HiveServer2 and Data Catalog service users must have -rw-r--r-- permissions to these JARs. If the Data Catalog service user does not have ownership of the SerDe JAR, format and schema discovery will not work.
  2. Copy the ldc-hive-formats-6.1.0.jar file from the installation directory <Agent Location>/ext/ to the location you specified on the HiveServer2 node.

    For example:

    $ cp /opt/ldc/agent/ext/ldc-hive-formats-6.1.0.jar \
         /opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hive/auxlib/
    
  3. (EMR only) Restart the Hive-Catalog service using the following commands:

    $ sudo stop hive-hcatalog-server
    $ sudo start hive-hcatalog-server
  4. Restart HiveServer2 using the following commands:

    $ sudo stop hive-server2
    $ sudo start hive-server2
  5. Copy the custom JAR files to the specified HDFS or S3 system directories.

  6. Copy the custom JAR files to the Application Server directories in the <LDC Install Dir>/app-server/ext directory.

    For example:

    <LDC Install Dir>$ cp agent/ext/ldc-hive-formats-6.1.0.jar app-server/ext/
    <LDC Install Dir>$ cp agent/ext/hive-serde-1.2.2.jar app-server/ext/
    <LDC Install Dir>$ vi agent/conf/configuration.json
    
  7. Edit the Agent’s configuration.json file in the <LDC Install Dir>/agent/conf directory and the Application Server’s configuration.json file in the <LDC Install Dir>/app-server/conf directory with the following changes:

    1. Specify the location of the custom JAR files on the HDFS or S3 system.

    2. In the ldc.profile.customSerde.url key, set value to the HDFS or S3 path that contains the custom SerDe JARs, as in the following example:

      "ldc.profile.customSerde.url" : {
      "value" : "hdfs://hdp.wld.com:8020/user/<yourcompany>/custom-jars",
      "type" : "STRING",
      "restartRequired" : true,
      "readonly" : true,
      "description" : "HDFS(hdfs:///) / S3(s3://) path containing custom serde jars to be added to spark sql at runtime",
      "label" : "Custom Serde Jar Location",
      "category" : "MISC",
      "defaultValue" : "",
      "visible" : false,
      "uiConfig" : false
      },
      
  8. Save and close the files.
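
Two parts of this procedure lend themselves to a quick script: the -rw-r--r-- permission requirement in the Note, and the copy of the custom JARs to the HDFS location. The following Python sketch is illustrative only. The function names are hypothetical, the target directory is whatever you set in ldc.profile.customSerde.url, and it assumes the standard hdfs dfs command-line client is available on the node. It builds the commands as argument lists without running them, so you can review them first.

```python
import stat

# Permission string required by the Note in step 1.
EXPECTED_MODE = "-rw-r--r--"


def has_expected_mode(st_mode: int) -> bool:
    """True if a file's st_mode renders as -rw-r--r-- (a regular file with mode 644)."""
    return stat.filemode(st_mode) == EXPECTED_MODE


def hdfs_upload_commands(local_jars: list[str], target_dir: str) -> list[list[str]]:
    """Build, but do not run, the hdfs commands that upload JARs to target_dir."""
    commands = [["hdfs", "dfs", "-mkdir", "-p", target_dir]]
    for jar in local_jars:
        # -f overwrites an existing copy of the JAR in the target directory.
        commands.append(["hdfs", "dfs", "-put", "-f", jar, target_dir])
    return commands
```

To check a file on disk, pass os.stat(path).st_mode to has_expected_mode. Each command list can be handed to subprocess.run once you have confirmed the target path.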

Installing JDBC drivers

To include data in Data Catalog from the supported relational database sources MySQL, MSSQL Server, Oracle, Redshift, SAP-HANA, Snowflake, or Teradata, you must place the appropriate JDBC driver in both the app-server/ext and agent/ext directories of the Data Catalog installation.

In addition to the legacy RDBMSs, Data Catalog also supports Amazon Aurora, SAP-HANA, Snowflake, and PostgreSQL.

Consider the following information for specific drivers:

  • Because Aurora is an AWS-hosted, fully managed database based on MySQL, Data Catalog processes Aurora databases using the MySQL driver.
  • Data Catalog is not sensitive to the specific version of a driver, so you may want to use the same version of a driver that you know works for other applications in your environment.
  • The MySQL Connector driver version 5.1.17 should not be used.
  • When using HiveServer2 in High Availability mode or with Spark 2, you also need to copy the Hive JDBC driver to the following directories:
    • Data Catalog installation directory
    • app-server/ext
    • agent/ext
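
A small check can confirm that a driver is present in both ext directories and that it is not the MySQL Connector version called out above. The Python sketch below is illustrative: the function names are hypothetical, and the version test assumes the driver file follows the common mysql-connector-java-<version> naming pattern, which may not hold for a renamed JAR.

```python
import re
from pathlib import Path

# Directories, relative to the Data Catalog installation, that need the driver.
EXT_DIRS = ("app-server/ext", "agent/ext")
BLOCKED_MYSQL_VERSION = "5.1.17"


def missing_from(ldc_home: str, jar_name: str) -> list[str]:
    """Return the ext directories under ldc_home that do not contain jar_name."""
    return [d for d in EXT_DIRS if not (Path(ldc_home) / d / jar_name).is_file()]


def is_blocked_mysql_driver(jar_name: str) -> bool:
    """True if the file name looks like MySQL Connector version 5.1.17."""
    # Assumed naming pattern: mysql-connector-java-<version>[-suffix].jar
    match = re.search(r"mysql-connector-java-(\d+(?:\.\d+)*)", jar_name)
    return bool(match) and match.group(1) == BLOCKED_MYSQL_VERSION
```

An empty list from missing_from means the JAR was found in both directories.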

Configure CDH

Some versions of CDH using Spark2 may cause the Data Catalog Schema discovery to fail with an IllegalArgumentException Illegal Pattern XXX error. If you receive this error, perform the following steps:

Procedure

  1. Stop the Spark2, Application Server, and Agent services.

  2. Copy the commons-lang3 JAR from Cloudera's /parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars directory to the /ext directories of the Data Catalog Agent and Application Server using the following commands:

    $ cp /opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/commons-lang3-3.5.jar \
         /opt/ldc/app-server/ext/

    $ cp /opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/commons-lang3-3.5.jar \
         /opt/ldc/agent/ext/

  3. In the Spark2 configuration Advanced section of the Cloudera Manager, add the code spark.executor.extraClassPath=/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/commons-lang3-3.5.jar to the Spark2 Client Advanced Configuration Snippet (Safety Valve) in spark2-conf/spark-defaults.conf.

  4. Deploy the Spark2 client configuration.

  5. Restart the Spark2, Agent, and Application Server services.

Configuring HDP 3.1

For HDP 3.1, you must set up Hive and Spark to work together and enable Hive support in Data Catalog.

Enable Hive and Spark together in HDP 3.1

Using Ambari, perform the following steps to set up Hive and Spark to work together with Data Catalog on the HDP 3.1 platform:

Procedure

  1. Enable Hive low-latency analytical processing (LLAP) using the instructions at Setting Up LLAP.

  2. Configure Hive to work with Spark using the instructions at Spark-Hive Connection.

  3. If your Hive tables use non-CSV SerDes, edit the custom hive-site.xml file to add the following property: metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader

  4. Restart Hive services.

Enable HDP3.1 support

Perform the following steps to enable HDP3.1 support in Data Catalog:

Procedure

  1. Shut down the Agent service.

  2. Open the agent/conf/configuration.json file with a text editor and add or enable the following property.

    Note: Ensure that the file does not contain non-printable characters.

    "ldc.discovery.sparkWarehouseConnector": {
    "value": true,
    "type": "BOOLEAN",
    "restartRequired": false,
    "readOnly": false,
    "description": "Need to be switched on to utilize Spark Hive connector. Please make sure that the jar is available at a local path configured in jettyStart",
    "label": "Spark Hive Connector",
    "category": "DISCOVERY",
    "defaultValue": false,
    "visible": true
    }
    
  3. Save and close the file.

  4. Navigate to the /opt/ldc/agent/bin directory and open the ldc script file with a text editor.

  5. Verify that a local folder is specified for the SPARK_HIVE_CONNECTOR_JAR_DIR.

    Note: You may have to change the folder path depending on where the HDP software is installed. For example: SPARK_HIVE_CONNECTOR_JAR_DIR=/usr/hdp/current/hive_warehouse_connector
  6. Restart the Agent service.

    Note: For post-installation system configurations of MapR, EMR, and HDInsight, refer to those sections in the installation guide.
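
Because the procedure warns against non-printable characters in configuration.json, it can be worth scanning the file after editing it. The following Python sketch is one way to do that. The function names are hypothetical, and "non-printable" is interpreted here as any character for which Python's str.isprintable() returns False, other than tab, carriage return, and newline. That interpretation is an assumption, not a definition taken from the product.

```python
import json

# Whitespace control characters that are normal in a text file.
ALLOWED_CONTROL = {"\n", "\r", "\t"}


def find_non_printable(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, repr) for each non-printable character, 1-based."""
    hits = []
    # Split on "\n" only, so unusual line-break characters are reported, not hidden.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if not ch.isprintable() and ch not in ALLOWED_CONTROL:
                hits.append((line_no, col_no, repr(ch)))
    return hits


def is_valid_json(text: str) -> bool:
    """True if the text parses as JSON with Python's strict parser."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True
```

A non-breaking space pasted from a web page is a typical hit. The JSON check also rejects a string value that was accidentally split across two lines.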

Configure LDC ports

By default, Data Catalog processes listen on the following ports:
  • 4242
  • 4039
  • 8082

Perform the following steps if you need to change these ports:

Procedure

  1. Run the command-line portion of the Application Server, Metadata Server, and Agent installers.

  2. Stop all services. To check whether any Data Catalog processes are still running, use the following command:

    $ ps -ef | grep ldc

    Note: If the Data Catalog services are still running, stop them using the following commands:

    APP-SERVER-HOME$ bin/app-server stop

    META-SERVER-HOME$ bin/metadata-server stop

    AGENT-HOME$ bin/agent stop

  3. Change the port numbers in the configuration locations listed in the following table:

    Service             Default port  Configuration location
    Application Server  8082 (HTTP)   APP-SERVER-HOME/conf/install.properties, LDC_JETTY_HTTP_PORT=8082
    Application Server  4039          APP-SERVER-HOME/conf/install.properties, LDC_JETTY_HTTP_PORT=4039
    Metadata Server     4242          METADATA-SERVER-HOME/conf/application.yaml, port: 4242

  4. Restart the services using the following commands:

    APP-SERVER-HOME$ bin/app-server start

    META-SERVER-HOME$ bin/metadata-server start

    AGENT-HOME$ bin/agent start
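
Before choosing a replacement port, it can help to confirm that nothing is already listening on it. The Python sketch below is a minimal example of such a check using a plain TCP connection attempt. The function name and the host default are assumptions for illustration, and a successful connection only shows that some process is listening, not which one.

```python
import socket

# Default Data Catalog ports listed in this section.
DEFAULT_PORTS = (4242, 4039, 8082)


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0
```

Running [p for p in DEFAULT_PORTS if port_in_use(p)] on the Data Catalog host lists the default ports that currently have a listener.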

Customize the session timeout

You can customize the timeout interval by changing the ldc.shiro.global.timeout value in the Application Server script found in the <LDC-HOME>/app-server/bin directory. When Data Catalog does not detect any user activity for a specified amount of time, the Application Server times out and automatically logs out the current user. By default, this timeout interval is set to 1200000 milliseconds (20 minutes).

Perform the following steps to change the session timeout interval:

Procedure

  1. Stop the Application Server using the following command: <APP-SERVER-HOME>$ bin/app-server stop

  2. Open the script named app-server in the <LDC-HOME>/app-server/bin directory with a text editor.

  3. Locate the ldc.shiro.global.timeout property in the WEBAPP_OPTS definition and change its value in milliseconds, as shown in the following example:

    WEBAPP_OPTS="-Dldc.webapp.war=${LDC_WEBAPP_WAR} \
    -Dldc.webapp.extra.classpath=${EXTRA_CLASSPATH} \
    -Dldc.webapp.override.descriptor=${JETTY_BASE}/etc/ldc-override-descriptor.xml \
    -Dldc.plugins.dir=${PLUGINS_DIR} \
    -Dldc.home=${LDC_INSTALL_DIR} \
    -Dldc.shiro.global.timeout=1200000 \
    -Dldc.setup.mode=${SETUP_MODE}"
    
  4. Save your changes and close the script file.

  5. Start the Application Server using the following command: <APP-SERVER-HOME>$ bin/app-server start
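
Because the timeout is expressed in milliseconds, it is easy to be off by a factor when editing it by hand. The following Python sketch converts minutes to milliseconds and rewrites the -Dldc.shiro.global.timeout value in the script text. The function names are hypothetical, and the sketch assumes the property appears exactly once in the script, as in the example above.

```python
import re

# Matches the property name and captures it so only the number is replaced.
TIMEOUT_PATTERN = re.compile(r"(-Dldc\.shiro\.global\.timeout=)\d+")


def minutes_to_ms(minutes: float) -> int:
    """Convert a timeout in minutes to whole milliseconds."""
    return int(minutes * 60 * 1000)


def set_timeout(script_text: str, minutes: float) -> str:
    """Return script_text with the session timeout set to the given minutes."""
    new_text, count = TIMEOUT_PATTERN.subn(
        rf"\g<1>{minutes_to_ms(minutes)}", script_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one timeout setting, found {count}")
    return new_text
```

For instance, the default of 20 minutes corresponds to 1200000, and a 30-minute timeout would be written as 1800000.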

Set the Spark log location

If you want to change the location of the Spark driver output logs, you must change it to a location that can be written to by the Data Catalog service user account on each HDFS data node.

Perform the following steps to change the location:

Procedure

  1. Navigate to the <LDC App-Server>/conf/ directory.

  2. Open the log4j-driver.xml file in a text editor.

  3. Replace the ${waterlinedata.log.dir} variable with an absolute path, as shown in the following example:

    <appender name="logFile" class="org.apache.log4j.RollingFileAppender">
       <param name="file" value="${ldc.log.dir}/wd-jobs.log" />
       <param name="DatePattern" value="'.'yyyy-MM-dd" />
       ...
    </appender>
  4. Save and close the log4j-driver.xml file.
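
The substitution in step 3 can be sketched as a plain text replacement. This Python example is illustrative only: the function name is hypothetical, and the variable name defaults to ${ldc.log.dir} because that is what the appender example shows. Pass a different name if your file uses another variable.

```python
import posixpath


def use_absolute_log_dir(xml_text: str, log_dir: str,
                         variable: str = "${ldc.log.dir}") -> str:
    """Return xml_text with the log directory variable replaced by log_dir."""
    if not posixpath.isabs(log_dir):
        raise ValueError(f"log directory must be an absolute path: {log_dir}")
    if variable not in xml_text:
        raise ValueError(f"{variable} not found in the configuration text")
    # Drop a trailing slash so the result does not contain a double slash.
    return xml_text.replace(variable, log_dir.rstrip("/"))
```

Calling it with the file parameter line from the example and /var/log/ldc would yield a value of /var/log/ldc/wd-jobs.log, where /var/log/ldc is only a placeholder for a directory the service user can write to.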

Setting logging properties

You can change the level of the messages recorded in a log by changing the level values in the logging control files. Data Catalog jobs use the Apache log4j logging API for logging messages reported by the application. Data Catalog produces messages with the following levels:

  • ERROR
  • WARN
  • INFO
  • DEBUG
  • TRACE

By default, the console output displays messages at the INFO level, while the logs are set to the DEBUG level. The locations of the logging control files are listed in the following table:

Operation               Output location             Logging controls
LDC Application Server  Application Server console  wd-ui.log4j2 (<APP-SERVER-HOME>/conf/wd-ui-log4j2.xml)
Spark jobs              LDC Agent console           The <COMMAND> --verbose option, where <COMMAND> is a Data Catalog job such as format, schema, or profile

Note: You may run jobs in local mode to debug job-related issues using the following command:

AGENT-HOME$ bin/ldc <COMMAND> --master local

Set a temporary file location for the web server

If your environment cannot manage the /tmp directory and Data Catalog is using it for web server intermediate storage, you can change the temporary file location used by the Data Catalog web server. By default, the Data Catalog web application server creates temporary files in the /tmp folder on the computer where Data Catalog is installed. If this directory is deleted or if the /tmp location does not have enough space, the web server may fail.

Perform the following steps to set a different temporary file location for the web server:

Procedure

  1. Navigate to the <APP-SERVER-HOME>/<jetty-distribution>/ldc-base/webapps/ directory and open the web server configuration file ldc.xml with a text editor.

  2. Add the following <Call> code to the <Configure> element as shown below, replacing the second <Arg> with the new temporary file location:

    <Configure class="org.eclipse.jetty.webapp.WebAppContext">
    ...
      <Call name="setAttribute">
        <Arg>org.eclipse.jetty.webapp.basetempdir</Arg>
        <Arg>/path/to/new/tmp/dir</Arg>
      </Call>
    ...
    </Configure>
  3. Save and close the file.

  4. Restart Data Catalog services using the following commands:

    APP-SERVER-HOME$ bin/app-server restart

    META-SERVER-HOME$ bin/metadata-server restart

    AGENT-HOME$ bin/agent restart
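
Before pointing the web server at a new temporary directory, you may want to confirm that the directory exists, is writable, and has enough free space. The Python sketch below shows one way to check this. The function name and the free-space threshold are hypothetical, since this page does not state how much space the web server needs.

```python
import os
import shutil
import tempfile

# The platform's default temporary directory, typically /tmp on Linux.
DEFAULT_TMP = tempfile.gettempdir()


def temp_dir_problems(min_free_bytes: int, path: str = DEFAULT_TMP) -> list[str]:
    """Return reasons why path is unsuitable as a temp directory; empty if none."""
    if not os.path.isdir(path):
        return [f"{path} does not exist or is not a directory"]
    problems = []
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path} is not writable by the current user")
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append(f"{path} has {free} bytes free, below {min_free_bytes}")
    return problems
```

Run the check as the Data Catalog service user, because os.access evaluates permissions for the user running the script.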