
Post-install system configurations

After installing Lumada Data Catalog, you must perform several post-installation tasks.

Installing JDBC drivers

To include data in Data Catalog from the supported relational database sources MySQL, Microsoft SQL Server, Oracle, Redshift, SAP HANA, Snowflake, or Teradata, you must place the appropriate JDBC driver in the app-server/ext and agent/ext directories of the Data Catalog installation.

In addition to these legacy RDBMSs, Data Catalog also supports Amazon Aurora, SAP HANA, Snowflake, and PostgreSQL.

Consider the following information for specific drivers:

  • Because Aurora is an AWS-hosted, fully managed database based on MySQL, Aurora databases are processed by Data Catalog using the MySQL driver.
  • Data Catalog is not sensitive to the specific version of a driver, so you may want to use the same version of a driver that you know works for other applications in your environment.
  • The MySQL Connector driver version 5.1.17 should not be used.
  • When using HiveServer2 in High Availability mode or with Spark 2, you also need to copy the Hive JDBC driver to the following directories in the Data Catalog installation directory:
    • app-server/ext
    • agent/ext
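
For example, the following is a minimal sketch of placing a MySQL driver in both directories; the driver file name and the /opt/ldc installation path are placeholders for your environment:

  # the driver file name and /opt/ldc install path below are placeholders
  $ cp mysql-connector-java-8.0.28.jar /opt/ldc/app-server/ext/
  $ cp mysql-connector-java-8.0.28.jar /opt/ldc/agent/ext/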

Updating Hive Server with LDC JARs

Note: You only need to update the Hive server if you have standalone agents in your Data Catalog configuration.

How you update the Hive server with LDC JARs depends on your Hadoop distribution. Data Catalog supports the creation of Hive tables for any resource in HDFS by using third-party and custom functions contained in JAR files. These JAR files must be recognized by HiveServer2 so that applications reading these tables can use the same JARs.

You need to update the following two files:

  • Hive SerDe JAR (hive-serde-1.2.2.jar)
  • Data Catalog Hive format JAR (ldc-hive-formats-6.1.1.jar)

These JARs are shipped with the Data Catalog installation package and are located in the <Agent Location>/ext path.

Where you place these JAR files depends on your Hadoop distribution:

  • For CDH and EMR, copy the JAR files to the HiveServer2 hive/auxlib directory.
  • For HDP, copy the JAR files to the hive/lib directory.

After you have copied these files to the correct directory, you must configure them.

Configure HiveServer2 for LDC

Note: You only need to perform this task if you have standalone agents in your Data Catalog configuration.

If you are using HiveServer2, you must configure it for Data Catalog because all users must have access to the same custom JARs when creating and accessing the generated Hive views. Data Catalog requires a designated location on the HDFS or S3 system that is accessible by the Data Catalog service user and where the custom JARs have read and write access for all users. As a best practice, review your cluster configuration to determine the best method.

Perform the following steps to configure HiveServer2 for Data Catalog:

Procedure

  1. Point HiveServer2 to the location of the custom JAR files using one of the following three methods:

    • Place the Hive SerDe JAR and Data Catalog Hive format JAR in the hive/auxlib directory in the HiveServer2 node.
    • Copy the JARs to an alternate location on the HiveServer2 node and set the path of the alternate directory in the hive.aux.jars.path property in hive-site.xml.

      For example:

      <property>
          <name>hive.aux.jars.path</name>
          <value>/usr/hdp/2.6.2.14-5/hive/lib</value>
          <description>
            The location of the plugin jars that contain implementations of the user defined functions and serdes.
          </description>
      </property>
    • Copy the JARs to the hive/auxlib directory or an alternate location on the HiveServer2 node and set the path of this directory as HIVE_AUX_JARS_PATH in hive-env.sh.

      For example:

      $ export HIVE_AUX_JARS_PATH=/usr/hdp/2.6.2.14-5/hive/lib

    Note: The HiveServer2 and Data Catalog service users must have -rw-r--r-- permissions on these JARs. If the Data Catalog service user does not have ownership of the SerDe JAR, format and schema discovery will not work.
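
    For example, assuming the JARs are in /usr/hdp/2.6.2.14-5/hive/lib and ldcuser is a placeholder for the Data Catalog service user, ownership and permissions could be set as follows:

    # ldcuser is a placeholder for your Data Catalog service user
    $ sudo chown ldcuser /usr/hdp/2.6.2.14-5/hive/lib/hive-serde-1.2.2.jar
    $ sudo chmod 644 /usr/hdp/2.6.2.14-5/hive/lib/*.jar
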
  2. Copy the ldc-hive-formats-6.1.0.jar file from the installation directory <Agent Location>/ext/ to the location you specified on the HiveServer2 node.

    For example:

    $ cp /opt/ldc/agent/ext/ldc-hive-formats-6.1.0.jar \
         /opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hive/auxlib/
    
  3. (EMR only) Restart the Hive-Catalog service using the following commands:

    $ sudo stop hive-hcatalog-server
    $ sudo start hive-hcatalog-server
  4. Restart HiveServer2 using the following commands:

    $ sudo stop hive-server2
    $ sudo start hive-server2
  5. Copy the custom JAR files to the specified HDFS or S3 system directories.

    For example:

    <LDC Install Dir>$ cp agent/ext/ldc-hive-formats-6.1.0.jar app-server/ext/
    <LDC Install Dir>$ cp agent/ext/hive-serde-1.2.2.jar app-server/ext/
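
    If the designated location is on HDFS, a sketch of the upload, assuming the /user/<yourcompany>/custom-jars path shown in the configuration example in step 7, might look like:

    # assumes the custom-jars location configured in step 7
    <LDC Install Dir>$ hdfs dfs -put agent/ext/ldc-hive-formats-6.1.0.jar /user/<yourcompany>/custom-jars/
    <LDC Install Dir>$ hdfs dfs -put agent/ext/hive-serde-1.2.2.jar /user/<yourcompany>/custom-jars/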
    
  6. Copy the custom JAR files to the Application Server's <LDC Install Dir>/app-server/ext directory.

    For example:

    <LDC Install Dir>$ cp agent/ext/ldc-hive-formats-6.1.0.jar app-server/ext/
    <LDC Install Dir>$ cp agent/ext/hive-serde-1.2.2.jar app-server/ext/
  7. Edit the Agent’s configuration.json file in the <LDC Install Dir>/agent/conf directory and the Application Server’s configuration.json file in the <LDC Install Dir>/app-server/conf directory with the following changes:

    1. Specify the location of the custom JAR files on the HDFS or S3 system.

    2. In the ldc.profile.customSerde.url key, specify the HDFS or S3 path for the Hive SerDes, as in the following example:

      "ldc.profile.customSerde.url" : {
        "value" : "hdfs://hdp.wld.com:8020/user/<yourcompany>/custom-jars",
        "type" : "STRING",
        "restartRequired" : true,
        "readonly" : true,
        "description" : "HDFS(hdfs:///) / S3(s3://) path containing custom serde jars to be added to spark sql at runtime",
        "label" : "Custom Serde Jar Location",
        "category" : "MISC",
        "defaultValue" : "",
        "visible" : false,
        "uiConfig" : false
      },
      
  8. Save and close the files.

Configure CDH

Note: This task is necessary only if you have standalone agents in your Data Catalog configuration and are using CDH.

Some versions of CDH using Spark2 may cause the Data Catalog Schema discovery to fail with an IllegalArgumentException Illegal Pattern XXX error. If you receive this error, perform the following steps:

Procedure

  1. Stop the Spark2, Application Server, and Agent services.

  2. Copy the commons-lang3 JAR from Cloudera's /parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars directory to Data Catalog's Agent and Application Server /ext directories using the following commands:

    $ cp /opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/commons-lang3-3.5.jar \
         /opt/ldc/app-server/ext/

    $ cp /opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/commons-lang3-3.5.jar \
         /opt/ldc/agent/ext/

  3. In the Spark2 configuration Advanced section of the Cloudera Manager, add the following line to the Spark2 Client Advanced Configuration Snippet (Safety Valve) for spark2-conf/spark-defaults.conf:

    spark.executor.extraClassPath=/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/commons-lang3-3.5.jar

  4. Deploy the Spark2 client configuration.

  5. Restart the Spark2, Agent, and Application Server services.

Configuring HDP 3.1

If you have standalone agents and are using HDP 3.1, you must set up Hive and Spark to work together and enable Hive support in Data Catalog.

Enable Hive and Spark together in HDP 3.1

Note: This task is necessary only if you have standalone agents in your Data Catalog configuration and are using HDP.

Using Ambari, perform the following steps to set up Hive and Spark to work together with Data Catalog on the HDP 3.1 platform:

Procedure

  1. Enable Hive low-latency analytical processing (LLAP) using the instructions at Setting Up LLAP.

  2. Configure Hive to work with Spark using the instructions at Spark-Hive Connection.

  3. If your Hive tables use non-CSV SerDes, edit the custom hive-site.xml file to add metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader.
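
    If you edit hive-site.xml directly rather than adding the key-value pair through Ambari, the property would look like the following sketch:

    <!-- sketch: the same key-value pair expressed as a hive-site.xml property -->
    <property>
        <name>metastore.storage.schema.reader.impl</name>
        <value>org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader</value>
    </property>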

  4. Restart Hive services.

Enable HDP 3.1 support

Note: This task is needed only if you have standalone agents in your Data Catalog configuration and are using HDP.

Perform the following steps to enable HDP 3.1 support:

Procedure

  1. Shut down the Agent service.

  2. Open the agent/conf/configuration.json file with a text editor and add or enable the following property.

    Note: Ensure that the file does not contain non-printable characters.
    "ldc.discovery.sparkWarehouseConnector": {
      "value": true,
      "type": "BOOLEAN",
      "restartRequired": false,
      "readOnly": false,
      "description": "Need to be switched on to utilize Spark Hive connector. Please make sure that the jar is available at a local path configured in jettyStart",
      "label": "Spark Hive Connector",
      "category": "DISCOVERY",
      "defaultValue": false,
      "visible": true
    }
    
  3. Save and close the file.

  4. Navigate to the /opt/ldc/agent/bin directory and open the ldc script file with a text editor.

  5. Verify that a local folder is specified for the SPARK_HIVE_CONNECTOR_JAR_DIR.

    Note: You may have to change the folder path depending on where the HDP software is installed. For example, SPARK_HIVE_CONNECTOR_JAR_DIR=/usr/hdp/current/hive_warehouse_connector
  6. Restart the Agent service.

Set the Spark log location

Note: You only need to perform this task if you have standalone agents in your Data Catalog configuration.

If you want to change the location of the Spark driver output logs, you must change it to a location that can be written to by the Data Catalog service user account on each HDFS data node.

Perform the following steps to change the location:

Procedure

  1. Navigate to the <LDC App-Server>/conf/ directory.

  2. Open the log4j-driver.xml file in a text editor.

  3. Replace the ${ldc.log.dir} variable with an absolute path. The variable appears in the appender definition shown in the following example:

    <appender name="logFile" class="org.apache.log4j.RollingFileAppender">
       <param name="file" value="${ldc.log.dir}/wd-jobs.log" />
       <param name="DatePattern" value="'.'yyyy-MM-dd" />
       ...
    </appender>
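
    For instance, assuming /var/log/ldc is a placeholder for a directory writable by the Data Catalog service user on each data node, the modified line would read:

    <!-- /var/log/ldc is a placeholder path -->
    <param name="file" value="/var/log/ldc/wd-jobs.log" />
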
  4. Save and close the log4j-driver.xml file.

Configure an agent

Agents are typically installed under the /opt/ldc/agent/ directory on the edge nodes of the cluster.

To configure an agent, you must edit the application.yml file in the <AGENT-HOME>/conf/ directory.

Agent scripts

There are three scripts available under the agent installation:

  • agent

    The agent script includes operational and maintenance aspects of the agent. Use the --help option, as shown below, to list all commands supported by the agent script.

    Agent Install$ bin/agent --help
    
    Usage: ./agent [command] [options]
    
    Commands:
              start : start the application
               stop : stop the application
            restart : restart the application
             status : prints the status of the application (running/stopped)
                log : print the last 200 lines of the application log.
                      Use with -f or --follow to follow
           jobs-log : print the last 200 lines of the jobs log.
                      Use with -f or --follow to follow
           register : register the agent with app-server
                      Use with --endpoint <endpoint> 
                               --agent-id <agent-id> 
                               --agent-token <token> and 
                                --cert-fingerprint <ssl SHA256 fingerprint of secure endpoint>
    
    Options:
         -h, --help : displays this help
      -v, --version : prints product version information
       -f, --follow : for log or jobs-log command, follows the log file
            --debug : for start or restart commands, starts the agent in debug mode
          --suspend : when used with --debug, suspends the JVM until the debugger is attached
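
    For example, registering this agent with an Application Server might look like the following sketch, where the endpoint, agent ID, token, and fingerprint are placeholders for values issued by your Application Server:

    # all values below are placeholders
    Agent Install$ bin/agent register --endpoint https://<app-server-host>:<port> \
                       --agent-id <agent-id> \
                       --agent-token <agent-token> \
                       --cert-fingerprint <ssl SHA256 fingerprint of secure endpoint>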
  • ldc-util

    The ldc-util script includes utility jobs that supplement Data Catalog. Use the --help option, as shown below, to list all commands supported by the ldc-util script.

    <AGENT-HOME>$ bin/ldc-util --help
    
    Usage: ./ldc-util [command] [options]
    
    Commands:
                      encrypt : encrypt plain text password for use in configuration files
         restore_builtin_tags : recreate built-in tags and related discovery cache
                    reencrypt : re-encrypt passwords that are encrypted with old key.
            seedRolesAndUsers : seed roles and users.
    
    Options:
         -h, --help : displays this help
    
    Examples:
            ./ldc-util encrypt
            ./ldc-util restore_builtin_tags
            ./ldc-util reencrypt
            ./ldc-util seedRolesAndUsers
  • ldc

    The ldc script offers legacy, functional Data Catalog jobs on the agent side. Use the --help option, as shown below, to list all commands supported by the ldc script.
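
    For example:

    <AGENT-HOME>$ bin/ldc --help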