Post-install system configurations
After installing Lumada Data Catalog, you must perform the following system configuration changes:
- Updating HiveServer2 with LDC JARs
- Installing JDBC drivers
- Configure CDH
- Configuring HDP 3.1
- Configure LDC ports
- Customize the session timeout
- Set the Spark log location
- Setting logging properties
- Set a temporary file location for the web server
Updating Hive Server with LDC JARs
Updating Hive Server with LDC JARs depends on your Hadoop distribution. Data Catalog supports the creation of HIVE tables for any resource in HDFS by using third party and custom functions contained in JAR files. These JAR files must be recognized by the HiveServer2 so the applications reading these tables can use the same JARs.
You need to update the following two files:
- Hive SerDe JAR (hive-serde-1.2.2.jar)
- Data Catalog Hive format JAR (ldc-hive-formats-6.1.1.jar)
These JARs are shipped with the Data Catalog installation package and are located in the <Agent Location>/ext path.
How you place these JAR files depends on the following Hadoop distributions:
- For CDH and EMR, copy the JAR files to the Hive Server2 hive/auxlib directory.
- For HDP, copy the JAR files to the hive/li directory.
After you have copied these files to the correct directory, you must configure them.
Configure HiveServer2 for LDC
Perform the following steps to configure HiveServer2 for Data Catalog:
Procedure
Point HiveServer2 to the location of the custom JAR files using one of the following three methods:
- Place the Hive SerDe JAR and Data Catalog Hive format JAR in the hive/auxlib directory in the HiveServer2 node.
- Copy the JARs to an alternate location on the HiveServer2 node and set the path of the alternate directory in the hive.aux.jars.path in hive-site.xml.
For example:
<property> <name>hive.aux.jars.path</name> <value>/usr/hdp/2.6.2.14-5/hive/lib</value> <description> The location of the plugin jars that contain implementations of the user defined functions and serdes. </description> </property>
- Copy the JARs in the hive/auxlib directory or an alternate location on the HiveServer2 node and set the path of this directory as the HIVE_AUX_JARS_PATH in the hive-env.sh.
For example:
$ export HIVE_AUX_JARS_PATH=/usr/hdp/2.6.2.14-5/hive/lib
NoteThe HiveServer2 and Data Catalog service user must have-rw-r--r--
permissions to these JARs. If the Data Catalog service user does not have ownership of this SerDe JAR, the format and schema discovery will not work.Copy the ldc-hive-formats-6.1.0.jar file from the installation directory <Agent Location>/ext/ to the location you specified on the HiveServer2 node.
For example:
$ cp /opt/ldc/agent/ext/ ldc-hive-formats-6.1.0.jar\ /opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hive/auxlib/
(EMR only) Restart the Hive-Catalog service using the following commands:
$ sudo stop hive-hcatalog-server $ sudo start hive-hcatalog-server
Restart HiveServer2 using the following commands:
$ sudo stop hive-server2 $ sudo start hive-server2
Copy the custom JAR files to the specified HDFS or S3 system directories.
For example:
<LDC Install Dir>$ cp agent/ext/ ldc-hive-formats-6.1.0.jar app-server/ext/ <LDC Install Dir>$ cp agent/ext/ hive-serde-1.2.2.jar app-server/ext/<LDC Install Dir>$ vi agent/conf/configuration.json
Copy the custom JAR files to the Application Server directories in the <LDC Install Dir>/app-server/ext directory.
For example:
<LDC Install Dir>$ cp agent/ext/ ldc-hive-formats-6.1.0.jar app-server/ext/ <LDC Install Dir>$ cp agent/ext/ hive-serde-1.2.2.jar app-server/ext/ <LDC Install Dir>$ vi agent/conf/configuration.json
Edit the Agent’s configuration.json file in the <LDC Install Dir>/agent/conf directory and the Application Server’s configuration.json file in the <LDC Install Dir>/app-server/conf directory with the following changes:
Specify the location of the custom JAR files on the HDFS or S3 system.
In the
ldc.profile.customSerde.url
key, specify the HDFS/SE path for the Hive SerDes as in the following example:" ldc.profile.customSerde.url " : { "value" : "hdfs://hdp.wld.com:8020/user/<yourcompany>/custom-jars",<path to custom SerDe jars> "type" : "STRING", "restartRequired” : true, "readonly" : true, "description" : "HDFS(hdfs:///) / S3(s3://) path containing custom serde jars to be added to spark sql at runtime", "label" : "Custom Serde Jar Location", "category" : "MISC", "defaultValue" : "", "visible" : false, "uiConfig" : false },
Save and close the files.
Installing JDBC drivers
To include data in Data Catalog from the supported relational database sources MySQL, MSSQL Server, Oracle, Redshift, SAH-Hana, Snowflake or Teradata, you must place the appropriate JDBC driver in the Data Catalog installation app-server/ext and the agent/ext directories.
In addition to the legacy RDBMS's Data Catalog also supports Amazon Aurora, SAP-HANA, Snowflake, and PostgreSQL.
You should consider of the following information for specific drivers:
- Because Aurora is an AWS-hosted, fully-managed database, based on MySQL, Aurora databases are processed byData Catalog using the MYSQL driver.
- Data Catalog is not sensitive to the specific version of a driver, so you may want to use the same version of a driver that you know works for other applications in your environment.
- The MySQL Connector driver version 5.1.17 should not be used.
- When using Hive Server 2 in High Availability mode or with Spark 2, you also need to copy the Hive JDBC driver to the following directories:
- Data Catalog installation directory
- app-server/ext
- agent/ext
Configure CDH
IllegalArgumentException Illegal Pattern XXX
error. If you receive this error, perform the following steps:Procedure
Stop the Spark2, Application Server, and Agent services.
Copy the commons-lang3 jar from Cloudera's /parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars directory to Data Catalog's Agent and Application Server /ext directories using the following comands:
$ cp /opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/commons-lang3-3.5.jar \ /opt/ldc/app-server/ext/
$ cp /opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/commons-lang3-3.5.jar \ /opt/ldc/agent/ext/
In the
section of the Cloudera Manager, add the codespark.executor.extraClassPath=/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/commons-lang3-3.5.jar
to the Spark2 Client Advanced Configuration Snippet (Safety Valve) in spark2-conf/spark-defaults.conf.Deploy the Spark2 client configuration.
Restart the Spark2, Agent, and Application Server services.
Configuring HDP 3.1
For HDP 3.1, you must set up Hive and Spark to work together and enable Hive support in Data Catalog.
Enable Hive and Spark together in HDP 3.1
Procedure
Enable Hive low-latency analytical processing (LLAP) using the instructions at Setting Up LLap
Configure Hive to work with Spark using the instructions at Spark-Hive Connection.
If your Hive tables use non-CSV SerDes, edit the custom hivesite.xml file to add the following code:
metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader
Restart Hive services.
Enable HDP3.1 support
Procedure
Shutdown the Agent service.
Open the agent/conf/configuration.json file with a text editor and add or enable the following property.
NoteEnsure that the file does not contain non-printable characters."ldc.discovery.sparkWarehouseConnector": { "value": true, "type": "BOOLEAN", "restartRequired": false, "readOnly": false, "description": "Need to be switched on to utilize Spark Hive connector. Please make sure that the jar is available at a local path configured in jettyStart ", "label": "Spark Hive Connector", "category": "DISCOVERY", "defaultValue": false, "visible": true }
Save and close the file.
Navigate to the /opt/ldc/agent/bin directory and open the ldc script file with a text editor.
Verify that a local folder is specified for the
SPARK_HIVE_CONNECTOR_JAR_DIR
.NoteYou may have to change the folder path depending on where the HDP software is installed. For example, SPARK_HIVE_CONNECTOR_JAR_DIR=/usr/hdp/current/hive_warehouse_connectorRestart the Agent service.
Configure LDC ports
- 4242
- 4039
- 8082
Perform the following steps if you need to change these ports:
Procedure
Run the command-line portion of the Application Server, Metadata Server, and Agent installers.
Stop all services using the following command:
$ ps -ef | grep ldc
NoteIf the Data Catalog services are still running, stop them using the following commands:APP-SERVER-HOME$ bin/app-server stop
META-SERVER-HOME$ bin/metadata-server stop
AGENT-HOME$ bin/agent stop
Change the port numbers in the configuration locations listed in the following table:
Service Default port Configuration location Application Server 8082 (HTTP)
APP-SERVER-HOME$ /conf/install.properties LDC_JETTY_HTTP_PORT=8082
Application Server 4039 APP-SERVER-HOME$ /conf/install.properties LDC_JETTY_HTTP_PORT=4039
Metadata Server 4242 METADATA-SERVER-HOME$ /conf/application.yaml port: 4242 Restart the services using the following commands:
APP-SERVER-HOME$ bin/app-server start
META-SERVER-HOME$ bin/metadata-server start
AGENT-HOME$ bin/agent start
Customize the session timeout
ldc.shiro.global.timeout
value in the Application Server script found in the <LDC-HOME>/app-server/bin
directory. When Data Catalog does not detect any user activity for a specified amount of time, the Application Server times out and automatically logs out the current user. By default, this timeout interval is set to 1200000 milliseconds (20 minutes). Perform the following steps to change the session timeout interval:
Procedure
Stop the Application Server using the command
<APP-SERVER-HOME>$ bin/app-server stop
Open the script named
app-server
in the LDC-HOME/app-server/bin directory with a text editor.Locate the ldc.shiro.global.timeout property in the
WEBAPP_OPTS
definition and change its value in milliseconds, as shown in the following example:WEBAPP_OPTS="-Dldc.webapp.war=${LDC_WEBAPP_WAR} \ -Dldc.webapp.extra.classpath=${EXTRA_CLASSPATH} \ -Dldc.webapp.override.descriptor=${JETTY_BASE}/etc/ldc-override-descriptor.xml \ -Dldc.plugins.dir=${PLUGINS_DIR} \ -Dldc.home=${LDC_INSTALL_DIR} \ -Dldc.shiro.global.timeout=1200000 -Dldc.setup.mode=${SETUP_MODE}"
Save your changes and close the script file.
Start the Application Server using the command:
<APP-SERVER-HOME>$ bin/app-server start
Set the Spark log location
Perform the following steps to change the location:
Procedure
Navigate to the <LDC App-Server>/conf/ directory.
Open the log4j-driver.xml file in a text editor.
Replace the
==${waterlinedata.log.dir}==
variable with an absolute path, as shown in the following example:<appender name="logFile" class="org.apache.log4j.RollingFileAppender"> <param name="file" value="${ldc.log.dir}/wd-jobs.log" /> <param name="DatePattern" value="'.'yyyy-MM-dd" /> ... </appender>
Save and close the log4j-driver.xml file.
Setting logging properties
You can change the level of the messages recorded in a log by changing the level values in the logging control files. Data Catalog jobs use the Apache log4j logging API for logging messages reported by the application. Data Catalog produces messages with the following levels:
- ERROR
- WARN
- INFO
- DEBUG
- TRACE
By default, the console output displays messages at the INFO level, while the logs are set to the DEBUG level. The locations of the logging control files are listed in the following table:
Operation | Output Location | Logging Controls |
LDC Application Server |
Application Server console wd-ui.log4j2 | <APP-SERVER-HOME>/conf/wd-ui-log4j2.xml |
Spark Jobs | LDCAgent console | You can use the <COMMAND> --verbose option, Where the Data Catalog job can be of any format, schema, or profile, for example.NoteYou may run jobs in local mode to debug job-related issues using the following command:
|
Set a temporary file location for the web server
Perform the following steps to set a different temporary file location for the web server:
Procedure
Navigate to the <APP-SERVER-HOME>/<jetty-distribution>/ldc-base/webapps/ directory and open the web server configuration file ldc.xml with a text editor.
Add the following
<Call>
code to the<Configure>
element as shown below, replacing the second<Arg>
with the new temporary file location:<Configure class="org.eclipse.jetty.webapp.WebAppContext"> ... <Call name="setAttribute"> <Arg>org.eclipse.jetty.webapp.basetempdir</Arg> <Arg>/path/to/new/tmp/dir</Arg> </Call> ... </Configure>
Save and close the file.
Restart Data Catalog services using the following commands:
APP-SERVER-HOME$ bin/app-server restart
LDC Meta-Server$ bin/metadata-server restart
AGENT-HOME$ bin/agent restart