Post-install system configurations

Updating Hive Server with LDC JARs

How you update Hive Server with the LDC JARs depends on your Hadoop distribution. Data Catalog supports the creation of Hive tables for any resource in HDFS by using third-party and custom functions contained in JAR files. HiveServer2 must recognize these JAR files so that the applications reading these tables can use the same JARs.

You need to update the following two files:

Hive SerDe JAR (hive-serde-1.2.2.jar)
Data Catalog Hive format JAR (ldc-hive-formats-6.1.1.jar)

These JARs are shipped with the Data Catalog installation package and are located in the <Agent Location>/ext path.

Where you place these JAR files depends on your Hadoop distribution:

For CDH and EMR, copy the JAR files to the HiveServer2 hive/auxlib directory.
For HDP, copy the JAR files to the hive/lib directory.

After you have copied these files to the correct directory, you must configure them.

Configure HiveServer2 for LDC

If you are using HiveServer2, you must configure it for Data Catalog because all users must have access to the same custom JARs when creating and accessing the generated Hive views. Data Catalog requires a designated location on the HDFS or S3 system that is accessible by the Data Catalog service user, and the custom JARs in that location must have read and write access for all users. As a best practice, review your cluster configuration to determine the best method.

Perform the following steps to configure HiveServer2 for Data Catalog:

Procedure

Point HiveServer2 to the location of the custom JAR files using one of the following three methods:
- Place the Hive SerDe JAR and Data Catalog Hive format JAR in the hive/auxlib directory in the HiveServer2 node.
- Copy the JARs to an alternate location on the HiveServer2 node and set the path of the alternate directory in the hive.aux.jars.path property in hive-site.xml.
  For example:
```
<property>
    <name>hive.aux.jars.path</name>
    <value>/usr/hdp/2.6.2.14-5/hive/lib</value>
    <description>
      The location of the plugin jars that contain implementations of the user defined functions and serdes.
    </description>
</property>
```
- Copy the JARs to the hive/auxlib directory or an alternate location on the HiveServer2 node and set the path of this directory as HIVE_AUX_JARS_PATH in hive-env.sh.
  For example:
  $ export HIVE_AUX_JARS_PATH=/usr/hdp/2.6.2.14-5/hive/lib
Note: The HiveServer2 and Data Catalog service user must have -rw-r--r-- permissions to these JARs. If the Data Catalog service user does not have ownership of the SerDe JAR, format and schema discovery will not work.
Copy the ldc-hive-formats-6.1.0.jar file from the installation directory <Agent Location>/ext/ to the location you specified on the HiveServer2 node.

For example:
```
$ cp /opt/ldc/agent/ext/ldc-hive-formats-6.1.0.jar \
     /opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hive/auxlib/
```
(EMR only) Restart the Hive-Catalog service using the following commands:
```
$ sudo stop hive-hcatalog-server
$ sudo start hive-hcatalog-server
```

Restart HiveServer2 using the following commands:

$ sudo stop hive-server2
$ sudo start hive-server2

Copy the custom JAR files to the specified HDFS or S3 system directories.


Copy the custom JAR files to the Application Server's <LDC Install Dir>/app-server/ext directory.

For example:

<LDC Install Dir>$ cp agent/ext/ldc-hive-formats-6.1.0.jar app-server/ext/
<LDC Install Dir>$ cp agent/ext/hive-serde-1.2.2.jar app-server/ext/
<LDC Install Dir>$ vi agent/conf/configuration.json

Edit the Agent’s configuration.json file in the <LDC Install Dir>/agent/conf directory and the Application Server’s configuration.json file in the <LDC Install Dir>/app-server/conf directory with the following changes:

Specify the location of the custom JAR files on the HDFS or S3 system.

In the ldc.profile.customSerde.url key, set the value to the HDFS or S3 path that contains the custom SerDe JARs, as in the following example:

"ldc.profile.customSerde.url" : {
"value" : "hdfs://hdp.wld.com:8020/user/<yourcompany>/custom-jars",
"type" : "STRING",
"restartRequired" : true,
"readonly" : true,
"description" : "HDFS(hdfs:///) / S3(s3://) path containing custom serde jars to be added to spark sql at runtime",
"label" : "Custom Serde Jar Location",
"category" : "MISC",
"defaultValue" : "",
"visible" : false,
"uiConfig" : false
},

Save and close the files.
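
The JAR placement and the -rw-r--r-- permission requirement from this procedure can be sketched as a small shell helper. This is a minimal sketch, not part of the Data Catalog tooling: the function names install_aux_jars and jar_mode are hypothetical, and the destination directory is whichever location you chose for HiveServer2.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Data Catalog): copy the named JARs
# into a destination directory and set mode 644 (-rw-r--r--) on each copy.
# Usage: install_aux_jars DEST_DIR JAR...
install_aux_jars() {
    dest_dir=$1
    shift
    mkdir -p "$dest_dir" || return 1
    for jar in "$@"; do
        cp "$jar" "$dest_dir"/ || return 1
        chmod 644 "$dest_dir/$(basename "$jar")" || return 1
    done
}

# Print the permission string of a file as shown by ls -l.
jar_mode() {
    ls -l "$1" | cut -c1-10
}
```

For example, install_aux_jars followed by the auxlib path and the two JAR paths under <Agent Location>/ext copies both files, and jar_mode on each copy lets you confirm the permissions. File ownership is not changed by this sketch, so you may still need to adjust it for the Data Catalog service user.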

Installing JDBC drivers

To include data in Data Catalog from the supported relational database sources MySQL, MSSQL Server, Oracle, Redshift, SAP-HANA, Snowflake, or Teradata, you must place the appropriate JDBC driver in the app-server/ext and agent/ext directories of the Data Catalog installation.

In addition to the legacy RDBMSs, Data Catalog also supports Amazon Aurora, SAP-HANA, Snowflake, and PostgreSQL.

Consider the following information for specific drivers:

Because Aurora is an AWS-hosted, fully managed database based on MySQL, Data Catalog processes Aurora databases using the MySQL driver.
Data Catalog is not sensitive to the specific version of a driver, so you may want to use the same version of a driver that you know works for other applications in your environment.
Do not use version 5.1.17 of the MySQL Connector driver.
When using HiveServer2 in High Availability mode or with Spark 2, you also need to copy the Hive JDBC driver to the following directories:
- Data Catalog installation directory
- app-server/ext
- agent/ext

Configure CDH

Some versions of CDH using Spark2 may cause the Data Catalog Schema discovery to fail with an IllegalArgumentException Illegal Pattern XXX error. If you receive this error, perform the following steps:

Procedure

Stop the Spark2, Application Server, and Agent services.
Copy the commons-lang3 JAR from Cloudera's /parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars directory to Data Catalog's Agent and Application Server /ext directories using the following commands:
$ cp /opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/commons-lang3-3.5.jar /opt/ldc/app-server/ext/
$ cp /opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/commons-lang3-3.5.jar /opt/ldc/agent/ext/
In the Spark2 configuration Advanced section of the Cloudera Manager, add the code spark.executor.extraClassPath=/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/commons-lang3-3.5.jar to the Spark2 Client Advanced Configuration Snippet (Safety Valve) in spark2-conf/spark-defaults.conf.
Deploy the Spark2 client configuration.
Restart the Spark2, Agent, and Application Server services.

Configuring HDP 3.1

For HDP 3.1, you must set up Hive and Spark to work together and enable Hive support in Data Catalog.

Enable Hive and Spark together in HDP 3.1

Using Ambari, perform the following steps to set up Hive and Spark to work together with Data Catalog on the HDP 3.1 platform:

Procedure

Enable Hive low-latency analytical processing (LLAP) using the instructions at Setting Up LLAP.
Configure Hive to work with Spark using the instructions at Spark-Hive Connection.
If your Hive tables use non-CSV SerDes, edit the custom hive-site.xml file to add the following code: metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader
Restart Hive services.

Enable HDP3.1 support

Perform the following steps to enable HDP3.1 support in Data Catalog:

Procedure

Shut down the Agent service.

Open the agent/conf/configuration.json file with a text editor and add or enable the following property.

Note: Ensure that the file does not contain non-printable characters.

"ldc.discovery.sparkWarehouseConnector": {
"value": true,
"type": "BOOLEAN",
"restartRequired": false,
"readOnly": false,
"description": "Need to be switched on to utilize Spark Hive connector. Please make sure that the jar is available at a local path configured in jettyStart",
"label": "Spark Hive Connector",
"category": "DISCOVERY",
"defaultValue": false,
"visible": true
}

Save and close the file.
Navigate to the /opt/ldc/agent/bin directory and open the ldc script file with a text editor.
Verify that a local folder is specified for the SPARK_HIVE_CONNECTOR_JAR_DIR.

Note: You may have to change the folder path depending on where the HDP software is installed. For example, SPARK_HIVE_CONNECTOR_JAR_DIR=/usr/hdp/current/hive_warehouse_connector
Restart the Agent service.

Note: For post-installation system configurations of MapR, EMR, and HDInsight, refer to those sections in the installation guide.
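
The earlier note about non-printable characters in configuration.json can be checked from the shell before you restart the Agent. The following is a minimal sketch: the function name has_nonprintable is hypothetical, and the check assumes that only printable ASCII characters and whitespace are acceptable in the file, so it also flags typographic (curly) quotation marks that are often pasted in from formatted documents.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) when the file contains any byte
# that is neither printable ASCII nor whitespace, for example a curly quote
# or a control character. LC_ALL=C makes grep classify single bytes.
# Usage: has_nonprintable FILE
has_nonprintable() {
    LC_ALL=C grep -q '[^[:print:][:space:]]' "$1"
}
```

For example, running has_nonprintable on agent/conf/configuration.json and testing its exit status tells you whether the file needs cleaning. Replacing -q with -n prints the matching lines with their line numbers.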

Configure LDC ports

By default, Data Catalog processes listen on the following ports:

4242
4039
8082

Perform the following steps if you need to change these ports:

Procedure

Run the command-line portion of the Application Server, Metadata Server, and Agent installers.
Check whether any Data Catalog services are still running using the following command:
$ ps -ef | grep ldc
Note: If the Data Catalog services are still running, stop them using the following commands:
APP-SERVER-HOME$ bin/app-server stop
META-SERVER-HOME$ bin/metadata-server stop
AGENT-HOME$ bin/agent stop

Change the port numbers in the configuration locations listed in the following table:

Service	Default port	Configuration location
Application Server	8082 (HTTP)	`APP-SERVER-HOME/conf/install.properties`: `LDC_JETTY_HTTP_PORT=8082`
Application Server	4039	`APP-SERVER-HOME/conf/install.properties`: `LDC_JETTY_HTTP_PORT=4039`
Metadata Server	4242	`METADATA-SERVER-HOME/conf/application.yaml`: `port: 4242`

Restart the services using the following commands:

APP-SERVER-HOME$ bin/app-server start

META-SERVER-HOME$ bin/metadata-server start

AGENT-HOME$ bin/agent start
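
The port change in install.properties can also be scripted instead of edited by hand. This is a minimal sketch under stated assumptions: the function name set_property is hypothetical, the file is assumed to hold one KEY=VALUE pair per line, and the key is assumed to contain no characters that are special in a regular expression (true for LDC_JETTY_HTTP_PORT).

```shell
#!/bin/sh
# Hypothetical helper: set KEY=VALUE in a properties-style file.
# Only lines that already start with "KEY=" are rewritten; all other
# lines are copied unchanged. Missing keys are not added.
# Usage: set_property FILE KEY VALUE
set_property() {
    file=$1
    key=$2
    value=$3
    tmp_file=$(mktemp) || return 1
    sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp_file" || return 1
    # Write back through cat so the original file keeps its owner and mode.
    cat "$tmp_file" > "$file" || return 1
    rm -f "$tmp_file"
}
```

For example, set_property APP-SERVER-HOME/conf/install.properties LDC_JETTY_HTTP_PORT 8090 changes the HTTP port, where 8090 is only an illustrative value. Run it while the services are stopped, as in the procedure above.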

Customize the session timeout

You can customize the timeout interval by changing the ldc.shiro.global.timeout value in the Application Server script found in the <LDC-HOME>/app-server/bin directory. When Data Catalog does not detect any user activity for a specified amount of time, the Application Server times out and automatically logs out the current user. By default, this timeout interval is set to 1200000 milliseconds (20 minutes).

Perform the following steps to change the session timeout interval:

Procedure

Stop the Application Server using the command: <APP-SERVER-HOME>$ bin/app-server stop
Open the script named app-server in the <LDC-HOME>/app-server/bin directory with a text editor.

Locate the ldc.shiro.global.timeout property in the WEBAPP_OPTS definition and change its value in milliseconds, as shown in the following example:

WEBAPP_OPTS="-Dldc.webapp.war=${LDC_WEBAPP_WAR} \
-Dldc.webapp.extra.classpath=${EXTRA_CLASSPATH} \
-Dldc.webapp.override.descriptor=${JETTY_BASE}/etc/ldc-override-descriptor.xml \
-Dldc.plugins.dir=${PLUGINS_DIR} \
-Dldc.home=${LDC_INSTALL_DIR} \
-Dldc.shiro.global.timeout=1200000 \
-Dldc.setup.mode=${SETUP_MODE}"

Save your changes and close the script file.
Start the Application Server using the command: <APP-SERVER-HOME>$ bin/app-server start
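
Because ldc.shiro.global.timeout is expressed in milliseconds, it can help to compute the value from minutes rather than typing it by hand. This is a minimal sketch, and the function name minutes_to_ms is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert a whole number of minutes to milliseconds.
# Usage: minutes_to_ms MINUTES
minutes_to_ms() {
    echo $(( $1 * 60 * 1000 ))
}
```

For example, minutes_to_ms 20 gives the default value of 1200000, and the result for any other interval can be pasted into the -Dldc.shiro.global.timeout option in WEBAPP_OPTS.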

Set the Spark log location

If you want to change the location of the Spark driver output logs, you must change it to a location that can be written to by the Data Catalog service user account on each HDFS data node.

Perform the following steps to change the location:

Procedure

Navigate to the <LDC App-Server>/conf/ directory.
Open the log4j-driver.xml file in a text editor.

Replace the ${waterlinedata.log.dir} variable with an absolute path, as shown in the following example:

<appender name="logFile" class="org.apache.log4j.RollingFileAppender">
   <param name="file" value="${ldc.log.dir}/wd-jobs.log" />
   <param name="DatePattern" value="'.'yyyy-MM-dd" />
   ...
</appender>

Save and close the log4j-driver.xml file.
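
Before pointing the log file at a new absolute path, you can confirm that the directory exists and is writable. This is a minimal sketch: the function name check_log_dir is hypothetical, and it tests access only for the user running it on the local machine, so it would need to be run as the Data Catalog service user on each node.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when the path is an existing directory
# that the current user can write to.
# Usage: check_log_dir DIR
check_log_dir() {
    [ -d "$1" ] && [ -w "$1" ]
}
```

For example, check_log_dir followed by the intended log directory returns a zero exit status when the directory is usable and a non-zero status otherwise.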

Setting logging properties

You can change the level of the messages recorded in a log by changing the level values in the logging control files. Data Catalog jobs use the Apache log4j logging API for logging messages reported by the application. Data Catalog produces messages with the following levels:

ERROR
WARN
INFO
DEBUG
TRACE

By default, the console output displays messages at the INFO level, while the logs are set to the DEBUG level. The locations of the logging control files are listed in the following table:

Operation	Output Location	Logging Controls
LDC Application Server	Application Server console wd-ui.log4j2	`<APP-SERVER-HOME>/conf/wd-ui-log4j2.xml`
Spark Jobs	`LDCAgent` console	Use the `<COMMAND> --verbose` option, where `<COMMAND>` is a Data Catalog job such as format, schema, or profile. Note: You can run jobs in local mode to debug job-related issues using the following command: `AGENT-HOME$ bin/ldc <COMMAND> --master local`

Set a temporary file location for the web server

If your environment cannot manage the /tmp directory and Data Catalog is using it for web server intermediate storage, you can change the temporary file location used by the Data Catalog web server. By default, the Data Catalog web application server creates temporary files in the /tmp folder on the computer where Data Catalog is installed. If this directory is deleted or if the /tmp location does not contain enough space, the web server may fail.

Perform the following steps to set a different temporary file location for the web server:

Procedure

Navigate to the <APP-SERVER-HOME>/<jetty-distribution>/ldc-base/webapps/ directory and open the web server configuration file ldc.xml with a text editor.

Add the following <Call> code to the <Configure> element as shown below, replacing the second <Arg> with the new temporary file location:

<Configure class="org.eclipse.jetty.webapp.WebAppContext">
...
  <Call name="setAttribute">
    <Arg>org.eclipse.jetty.webapp.basetempdir</Arg>
    <Arg>/path/to/new/tmp/dir</Arg>
  </Call>
...
</Configure>

Save and close the file.
Restart Data Catalog services using the following commands:

APP-SERVER-HOME$ bin/app-server restart

META-SERVER-HOME$ bin/metadata-server restart

AGENT-HOME$ bin/agent restart
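
Because the web server may fail when its temporary location runs out of space, it can be useful to check the free space of the new directory before you configure it. This is a minimal sketch: the function names free_kb and has_space are hypothetical, the threshold is yours to choose, and the parsing assumes that the file system name reported by df contains no spaces.

```shell
#!/bin/sh
# Hypothetical helper: print the available space, in 1024-byte blocks, of the
# file system that holds the given directory (POSIX df output format).
# Usage: free_kb DIR
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical helper: succeed when at least MIN_KB blocks are available.
# Usage: has_space DIR MIN_KB
has_space() {
    [ "$(free_kb "$1")" -ge "$2" ]
}
```

For example, has_space /path/to/new/tmp/dir 1048576 succeeds when about 1 GB or more is available, where both the path and the threshold are illustrative.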
