
Post-install system configurations


After installing Lumada Data Catalog, perform the following system configuration changes.

Update Hive Server with Lumada Data Catalog JARs

Data Catalog supports creating a Hive table for any resource in HDFS, and uses certain third-party and custom functions and SerDes contained in JAR files. These JARs must be recognized by HiveServer2 so that applications reading these tables have access to the same JARs.

The following two JAR files must be deployed:

  • Hive SerDe JAR
  • Data Catalog Hive format JAR

These JARs are shipped with the Data Catalog installation package and are found under the <Agent Location>/ext path.

The placement of these JARs varies by distribution. For CDH and EMR, place the JARs in the hive/auxlib directory of the HiveServer2 node. For HDP, the location is usually hive/lib.

For example:

  • CDH: /opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hive/auxlib
  • EMR: /usr/lib/hive/auxlib
  • HDP: /usr/hdp/2.6.2.14-5/hive/lib

To configure HiveServer2 for Data Catalog, see Configure HiveServer2 for Data Catalog.

Configure HiveServer2 for Data Catalog

Perform the following steps to configure HiveServer2 for Data Catalog:

Procedure

  1. Point HiveServer2 to the location of the custom functions.

    There are three ways to achieve this configuration; review your cluster configuration to determine the appropriate option:
    • Place the JARs in the hive/auxlib directory in the HiveServer2 node.

      or

    • Place the JARs in the hive/auxlib directory or some other location on the HiveServer2 node and set the contents of this directory in the hive.aux.jars.path in hive-site.xml.

      For example:

      <property>
          <name>hive.aux.jars.path</name>
          <value>/usr/hdp/2.6.2.14-5/hive/lib</value>
          <description>
            The location of the plugin jars that contain implementations of the user defined functions and serdes.
          </description>
      </property>

      or

    • Place the JARs in the hive/auxlib directory or some other location on the HiveServer2 node and set the contents of this directory in the HIVE_AUX_JARS_PATH in the hive-env.sh.

      For example:

      $ export HIVE_AUX_JARS_PATH=/usr/hdp/2.6.2.14-5/hive/lib

    Make sure that the permissions on these JARs are correct for use by HiveServer2 and the Data Catalog service user. The hive-serdes-1.0.jar placed in the Data Catalog install path must have permissions for the Data Catalog service user. If the Data Catalog service user does not own this SerDe JAR, format/schema discovery fails with a permission denied error.
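    For example, the following sketch fixes ownership and read permissions, assuming a service user named ldcuser and the install paths used elsewhere in this article (substitute your own service user and paths):

    $ chown ldcuser: /opt/waterlinedata/agent/ext/hive-serdes-1.0.jar
    $ chmod 644 /opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hive/auxlib/*.jar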

  2. Move the Data Catalog JAR files to the chosen location on the HiveServer2 node.

    The JAR file is provided in the installation at <Agent Location>/ext/waterlinedata-hive-formats-5.0-Spark2.jar. Move this file to the destination directory on the HiveServer2 host.

    For example:

    $ cp /opt/waterlinedata/agent/ext/waterlinedata-hive-formats-5.0-Spark2.jar \
         /opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hive/auxlib/.
  3. (EMR only) Restart the Hive-Catalog service.

    In an EMR environment, in addition to restarting HiveServer2, you must also restart the Hive-Catalog service, as follows:

    $ sudo stop hive-hcatalog-server
    $ sudo start hive-hcatalog-server
    $ sudo stop hive-server2
    $ sudo start hive-server2
  4. Restart HiveServer2.

    To allow other applications to read Hive tables created from inside Data Catalog, HiveServer2 must be restarted after replacing these files. The Data Catalog installation does not depend on this restart, but the Hive functionality is not available until HiveServer2 is restarted.
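    The restart mechanism depends on your distribution; for example, on CDH you would restart HiveServer2 from Cloudera Manager, and on HDP from Ambari. On hosts that manage Hive with systemd, a restart might look like the following sketch (the service name is an assumption; check your distribution):

    $ sudo systemctl restart hive-server2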

  5. Update the Data Catalog configuration file.

    To ensure that all users have access to the same custom JARs when creating and accessing the generated Hive views, Data Catalog requires a designated location on HDFS/S3 that is accessible by the Data Catalog service user and in which the custom JARs are readable and writable by all users.

    This location must then be specified in Data Catalog's configuration.json file, located under the <WLD Install Dir>/agent/conf and <WLD Install Dir>/app-server/conf directories.

    <WLD Install Dir>$ cp agent/ext/waterlinedata-hive-formats-5.0-Spark2.jar /location/on/hdfs_s3/for/custom/jars
    <WLD Install Dir>$ cp agent/ext/hive-serde-1.0.1.jar /location/on/hdfs_s3/for/custom/jars

    These JARs also need to be copied to the app-server, under the <WLD Install Dir>/app-server/ext directory.

    <WLD Install Dir>$ cp agent/ext/waterlinedata-hive-formats-5.0-Spark2.jar app-server/ext/

    <WLD Install Dir>$ cp agent/ext/hive-serde-1.0.1.jar app-server/ext/

    <WLD Install Dir> $ vi agent/conf/configuration.json

    Search for waterlinedata.profile.customSerde.url and specify the HDFS/S3 path for the Hive SerDes.
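    For illustration only, the updated entry might look like the following sketch; the property name comes from this step, while the path value is an example placeholder:

    "waterlinedata.profile.customSerde.url": {
      "value": "hdfs:///location/on/hdfs_s3/for/custom/jars"
    }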


    Repeat the above in the app-server configuration.json.

Collect JDBC drivers

To include data from relational sources such as MySQL, Microsoft SQL Server, Oracle, Redshift, SAP HANA, Snowflake, or Teradata in Data Catalog, you need to place the appropriate JDBC driver in the Data Catalog installation, in the app-server/ext and agent/ext directories (for an example, see the sketch after the list below).

In addition to the legacy RDBMSs, Data Catalog also supports Amazon Aurora, SAP HANA, Snowflake, and PostgreSQL.

Points to remember:

  1. Although Aurora is an AWS-hosted, fully managed database based on MySQL, Aurora databases are processed by Data Catalog using the MySQL driver.
  2. Typically, Data Catalog is not sensitive to the specific version of the driver, so use the same version that you know works for other applications in your environment. That said, avoid using MySQL Connector driver 5.1.17.
  3. When using HiveServer2 in High Availability mode or with Spark 2, the Hive JDBC driver also needs to be placed in the Data Catalog installation, in the app-server/ext and agent/ext directories.
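For example, a minimal sketch of placing a MySQL driver, assuming the JAR was downloaded to /tmp and a default install path (both are example values):

$ cp /tmp/mysql-connector-java-5.1.48.jar /opt/waterlinedata/app-server/ext/
$ cp /tmp/mysql-connector-java-5.1.48.jar /opt/waterlinedata/agent/ext/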

Configurations on CDH

Some versions of CDH using Spark2 cause Data Catalog schema discovery to fail with an IllegalArgumentException: Illegal pattern XXX.

This is a known Cloudera issue, documented as CDH-46402; a workaround is described at Serialization Error.

If the above workaround does not work, perform the following steps:

Procedure

  1. Copy the commons-lang3 JAR from Cloudera's /parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars directory to Data Catalog's agent and app-server /ext directories:

    $ cp /opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/commons-lang3-3.5.jar \
         /opt/waterlinedata/app-server/ext/
    
    $ cp /opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/commons-lang3-3.5.jar \
         /opt/waterlinedata/agent/ext/
  2. In Cloudera Manager, in the Spark2 configuration Advanced section, add the following line to the Spark2 Client Advanced Configuration Snippet (Safety Valve) for spark2-conf/spark-defaults.conf:

    spark.executor.extraClassPath=/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/commons-lang3-3.5.jar
  3. Deploy the Spark2 client configuration.

  4. Restart Spark2.

Configurations on HDP 3.1

To enable Hive and Spark to work together when deploying Data Catalog on the HDP 3.1 platform, perform the following additional steps. Make all of these changes through Ambari.

Procedure

  1. Enable Hive LLAP if it is not already enabled.

    Detailed instructions are available at Setting Up LLAP.
  2. Configure Hive to work with Spark as described in Configure Spark-Hive Connection.

  3. If the Hive tables use non-CSV SerDes, add the following property in the custom hive-site section:

    metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader

  4. Restart Hive services.

Enable HDP3.1 support in Data Catalog

Perform the following steps to enable HDP3.1 support in Data Catalog:

Procedure

  1. Shut down services.

  2. Add and/or enable the following property in agent/conf/configuration.json.

    Take care not to break the JSON syntax when editing the file.

    "waterlinedata.discovery.sparkWarehouseConnector": {
      "value": true,
      "type": "BOOLEAN",
      "restartRequired": false,
      "readOnly": false,
      "description": "Need to be switched on to utilize Spark Hive connector. Please make sure that the jar is available at a local path configured in jettyStart",
      "label": "Spark Hive Connector",
      "category": "DISCOVERY",
      "defaultValue": false,
      "visible": true
    }
  3. In the Data Catalog script, check the location specified for SPARK_HIVE_CONNECTOR_JAR_DIR at line #332, and ensure that the folder exists locally.

    You may have to change the folder path depending on how the HDP software is installed:

    SPARK_HIVE_CONNECTOR_JAR_DIR=/usr/hdp/current/hive_warehouse_connector
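    For example, you can verify that the directory and the connector JAR exist (the path is the default shown above):

    $ ls -l /usr/hdp/current/hive_warehouse_connector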
  4. Restart services.

    For EMR, HDInsight, and MapR, refer to the respective dedicated sections in the Installation Guide.

Configure ports

Data Catalog processes listen on ports 4242, 4039, and 8082 by default.
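To check which of these ports are already in use before making changes, you can run a quick check; a minimal sketch using ss (any equivalent tool, such as netstat, works too):

$ ss -ltn | grep -E ':(4242|4039|8082)'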

Follow the steps below if you need to change which ports are used.

Before you begin

Run the command-line portion of the installer.

Procedure

  1. Check whether the Data Catalog services are running using the following command:

    $ ps -ef | grep ldc
  2. If the Data Catalog services are running, stop them using the following commands:

    APP-SERVER-HOME$ bin/app-server stop
    META-SERVER-HOME$ bin/metadata-server stop
    AGENT-HOME$ bin/agent stop
  3. Change the conflicting port numbers in the locations listed in the following table.

    The services, default ports, and configuration locations are as follows:
    Service              Default port   Configuration location
    Application Server   8082 (HTTP)    APP-SERVER-HOME/conf/install.properties
                                        LDC_JETTY_HTTP_PORT=8082
    Application Server   4039           APP-SERVER-HOME/conf/install.properties
                                        LDC_JETTY_HTTP_PORT=4039
    Metadata Server      4242           METADATA-SERVER-HOME/conf/application.yml
                                        port: 4242
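    For example, to move the application server's HTTP port to 9090 (an example value), edit APP-SERVER-HOME/conf/install.properties:

    LDC_JETTY_HTTP_PORT=9090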
  4. Restart services.

    APP-SERVER-HOME$ bin/app-server start
    META-SERVER-HOME$ bin/metadata-server start
    AGENT-HOME$ bin/agent start

Customize session timeout

If the Data Catalog does not detect any user activity for a specified amount of time, the application server times out and automatically logs out the current user.

By default, this timeout interval is set to 20 minutes (20 × 60 × 1000 = 1200000 milliseconds). You can customize this timeout interval by changing the ldc.shiro.global.timeout value in the app-server script found in the <LDC-HOME>/app-server/bin path.

To change the session timeout interval:

Procedure

  1. Stop the application server using the following command:

    <APP-SERVER-HOME>$ bin/app-server stop
  2. Open the app-server script located in the <LDC-HOME>/app-server/bin path.

  3. Search for the ldc.shiro.global.timeout property located under the WEBAPP_OPTS definition, and update its value in milliseconds.

    WEBAPP_OPTS="-Dldc.webapp.war=${LDC_WEBAPP_WAR} \
    -Dldc.webapp.extra.classpath=${EXTRA_CLASSPATH} \
    -Dldc.webapp.override.descriptor=${JETTY_BASE}/etc/ldc-override-descriptor.xml \
    -Dldc.plugins.dir=${PLUGINS_DIR} \
    -Dldc.home=${LDC_INSTALL_DIR} \
    -Dldc.shiro.global.timeout=1200000 \
    -Dldc.setup.mode=${SETUP_MODE}"
  4. Save your changes and exit the script.

  5. Start the application server using the following command:

    <APP-SERVER-HOME>$ bin/app-server start

Set the Spark log location

If needed, set the location of the Spark driver output logs to a location that is writable by the Data Catalog service user on each HDFS data node. Set this location in <LDC App-Server>/conf/log4j-driver.xml. Replace the variable ${waterlinedata.log.dir} with an absolute path.

<appender name="logFile" class="org.apache.log4j.RollingFileAppender">
   <param name="file" value="${waterlinedata.log.dir}/wd-jobs.log" />
   <param name="DatePattern" value="'.'yyyy-MM-dd" />
   ...
</appender>
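For example, after substituting an absolute path (the directory shown is only an example), the first parameter might read:

<param name="file" value="/var/log/waterlinedata/wd-jobs.log" />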

Setting logging properties

Data Catalog jobs use the Apache log4j logging API and its conventions for identifying the level of messages reported by the application. Data Catalog produces messages with the levels ERROR, WARN, INFO, DEBUG, and TRACE. By default, the console output is set to display messages at the INFO level and more severe, while logs are set to the DEBUG level. You can increase or reduce the severity of the messages recorded in a given log by adjusting the levels in the logging control files.

Operation               Output Location     Logging Controls
LDC Application Server  console, wd-ui.log  <APP-SERVER-HOME>/conf/wd-ui-log4j2.xml
Spark Jobs              WLDAgent console    Option: <COMMAND> --verbose

where <COMMAND> can be any Data Catalog job, such as format, schema, or profile.

Note: You may run jobs in local mode to debug job-related issues:

AGENT-HOME$ bin/ldc <COMMAND> --master local
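For example, a verbose profile run in local mode might look like the following sketch (profile is one of the job commands mentioned above; the exact job arguments depend on your environment):

AGENT-HOME$ bin/ldc profile --verbose --master local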

Set a temporary file location for the web server

By default, the Data Catalog web application server creates temporary files in the /tmp folder on the computer where Data Catalog is installed. If this directory is cleared or if the /tmp location does not contain enough space, the web server may fail. If your environment cannot manage the /tmp directory and Data Catalog is using it for web server intermediate storage, you can change the temporary file location used by the Data Catalog web server.

Perform the following steps to set a different temporary file location for the web server:

Procedure

  1. Locate and edit the web server configuration file:

    <APP-SERVER-HOME>/jetty-distribution-9.4.18.v20190429/ldc-base/webapps/ldc.xml
  2. Add the following <Call> snippet under the <Configure> element, replacing the second <Arg> with the new temporary file location:

    <Configure class="org.eclipse.jetty.webapp.WebAppContext">
    ...
      <Call name="setAttribute">
        <Arg>org.eclipse.jetty.webapp.basetempdir</Arg>
        <Arg>/path/to/new/tmp/dir</Arg>
      </Call>
    ...
    </Configure>
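    Make sure the new location exists and is writable by the Data Catalog service user before restarting; a minimal sketch, assuming a service user named ldcuser (an example name):

    $ mkdir -p /path/to/new/tmp/dir
    $ chown ldcuser: /path/to/new/tmp/dir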
  3. Restart Data Catalog services using the following commands:

    APP-SERVER-HOME$ bin/app-server restart
    META-SERVER-HOME$ bin/metadata-server restart
    AGENT-HOME$ bin/agent restart