Standalone agent installation
If you have Hadoop-based data sources, such as HDFS and Hive, or if some of your data sources are separated from your Kubernetes cluster by high latency or low bandwidth, you need standalone agents closer to those data sources or on the edge nodes of your Hadoop cluster. These remote agents connect to the main Data Catalog deployment in Kubernetes to store metadata.
In this type of installation, make sure your environment meets the system requirements and validate the system components before you begin the installation. Follow the pre-installation steps to set up your environment before installing Data Catalog.
Pre-Installation steps for Standalone agent installation
Before starting the Lumada Data Catalog installation process, you must first set up all the dependent external components, such as setting up the Data Catalog service user and collecting other essential information about your cluster configuration. These prerequisites are applicable to most Hadoop distributions and include steps to help you prepare your Hadoop environment for the Data Catalog installation.
Basic pre-install preparation includes the following tasks:
Review system requirements
Review the external components and applications that Data Catalog supports. See System requirements for Standalone agent installation.
Verify components
Verify the proper functioning of the various Hadoop components that Data Catalog interacts with. See Component validation for Standalone agent installation.
You must also make the following changes outside of the Lumada Data Catalog environment or on other cluster nodes:
- Validate component functioning
- Configure components for Data Catalog
If you do not have control over or access to these components, you may need to plan in advance to perform these changes or contact your system administrator.
System requirements for Standalone agent installation
Lumada Data Catalog runs on an edge node in a Hadoop cluster and requires specific external components and applications to operate optimally. This section provides a list of those components and applications along with details of their use and versions we support. If you have questions about your particular computing environment, please contact support at the Hitachi Vantara Lumada and Pentaho Support Portal.
Processing engine
Data Catalog uses Apache Spark for profiling jobs against HDFS and Hive data. The application code is transferred to each cluster node and is executed against the data that resides on that node. The results are collected in either the MongoDB repository, the Discovery Cache, or both. Data Catalog then runs Spark jobs against the resulting metadata to determine suggestions for business term associations. See Discovery Cache storage.
Data Catalog jobs are standard Spark jobs, so your organizational practices for tuning cluster operations apply to running and tuning Data Catalog jobs.
Hadoop distributions
Data Catalog must be installed on the edge node within the Hadoop cluster. The Hadoop and Hive clients installed on this edge node must have access to the Hadoop NameNode and HiveServer2.
The following table lists the supported distributions and the compatible applications for each Hadoop distribution on the edge node.
| Distribution | Component | Version | Notes |
|---|---|---|---|
| Cloudera | CDH | 6.1.1 | - |
| | Apache Spark™ | 2.4 | - |
| | HDFS | 3.0.0 | - |
| | HIVE | 2.1.1 | - |
| Cloudera | CDP | 7.1.7 | - |
| | Apache Spark | 2.4.7 | - |
| | HDFS | 3.1.1 | - |
| | HIVE | 3.1.3 | - |
| | Apache Atlas | 2.1.0 | - |
| Amazon | EKS | 1.21.x | - |
| Amazon | EMR | 6.0.0 | - |
| | Apache Spark | 2.4.4 | - |
| | HDFS | 3.2.1 | - |
| | HIVE | 3.1.2 | - |
| HortonWorks | HDP | 3.1.0 | - |
| | Apache Spark | 2.3.2 | - |
| | HDFS | 3.1.1 | - |
| | HIVE | 3.1.0 | - |
| | Apache Atlas | 1.1.0 | - |
| MEP (MapR Ecosystem Pack) | MapR | 6.0.1 | - |
| | Apache Spark | 2.2.1 | - |
| | MapR-FS | 6.2.0.0.20200915234957.GA-1 | - |
| | HIVE | 2.3.6 | - |
| Microsoft Azure | HDI | 4.0 | - |
| | Apache Spark | 2.4.4 | - |
| | HDFS | 3.1.1 | - |
| | HIVE | 3.1.0 | - |
| | Cloud/Blob ADL | - | - |
Component validation for Standalone agent installation
You must verify the proper functioning of the various components that interact with Data Catalog.
Perform the following validation:
- Validate the Hadoop configuration.
Once the configuration is verified, you can start configuring the components for Data Catalog compatibility.
Validate Hadoop configuration
To install Data Catalog, you need to validate that your existing Hadoop components are running and communicating properly among themselves.
Perform the following steps to prepare for Data Catalog installation by validating each of the places where Data Catalog interacts with Hadoop:
- Verify the file system URI.
- Check cluster status.
- Verify HDFS and Hive access.
- Validate access to Hadoop components through the browser.
- Verify HDFS discovery metadata storage.
Verify the file system URI
Procedure
Use the following command to find the host name for your cluster:
$ hdfs getconf -confKey fs.defaultFS
Navigate to the core-site.xml file on the cluster.
Edit the core-site.xml to verify the fs.defaultFS parameter is set to the correct host name.
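If you want to script this comparison, the following sketch shows one way to extract the fs.defaultFS value from a copy of core-site.xml. The file contents, path, and host name here are illustrative assumptions; on a live cluster, point the script at your real core-site.xml and compare the result with the output of `hdfs getconf -confKey fs.defaultFS`.

```shell
# Sketch: parse fs.defaultFS out of core-site.xml. A sample file is created
# here for illustration; on a cluster, use the real file, for example
# /etc/hadoop/conf/core-site.xml (path is an assumption).
CORE_SITE=$(mktemp)
cat > "$CORE_SITE" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
EOF

# Pull the <value> element that follows the fs.defaultFS <name> element.
configured_uri=$(awk '/<name>fs.defaultFS<\/name>/{found=1}
  found && /<value>/{gsub(/.*<value>|<\/value>.*/, ""); print; exit}' "$CORE_SITE")
echo "fs.defaultFS is set to: $configured_uri"
rm -f "$CORE_SITE"
```

If the parsed value differs from what `hdfs getconf` reports, the edge node may be reading a stale client configuration.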
Check cluster status
Procedure
Verify that HDFS, MapReduce, and YARN are running.
If Hive is configured for your cluster, Hive and its constituent components, such as Hive Metastore, HiveServer2, WebHCat Server, and whichever database Hive uses (such as MySQL), must be running.
If you do not use a cluster management tool, such as Ambari, Cloudera Manager, or MapR Control System, check individual services by running the command line for the component, as shown in the following example codes:
$ hdfs dfsadmin -report
$ yarn version
$ beeline (!quit to exit)
Verifying HDFS and Hive access
Data Catalog depends on the cluster authorization system to manage user access to HDFS resources. An end user (other than the Data Catalog service user) who has access to HDFS files and Hive tables should have the same access to those files and tables from within Data Catalog. Be sure to identify such end users and their access to the HDFS files and Hive tables.
You can use the following applications to check end-user access:
- Hue or Apache Ambari
- Beeswax
For a Hortonworks cluster, you must perform additional steps if you are running HiveServer2 in High-Availability (HA) mode.
Verify HDFS and Hive access using Hue or Apache Ambari
Procedure
Navigate to your existing data in HDFS or load new data.
Verify that you can access files you own as well as files for which you have access through group membership.
If you cannot sign into Hue or Ambari or access HDFS files from inside one of these tools or from the command line, ask your Hadoop administrator for appropriate credentials.
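The same access check can also be made from the command line. These commands are illustrative; the paths are assumptions and should be replaced with locations that exist on your cluster:

```
# List your own home directory and a path you can read through group membership.
hdfs dfs -ls /user/$USER
hdfs dfs -ls /data/shared
```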
Verify HDFS and Hive access using Beeswax or Beeline
Perform the following steps:
Procedure
Verify that you can access the existing databases and tables.
If you cannot sign into Beeline or cannot access Hive tables with Beeswax, ask your Hadoop administrator for the applicable credentials. Determine whether your cluster uses Apache Ranger or Apache Sentry for access control, and run profiling jobs to test table-level access.
Verify that the Data Catalog service user has table-level access (not column-level access) to Hive tables that you want included in your catalog.
Note: The HDFS-Sentry plug-in does not support column-level access control for access from Spark SQL.
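As a quick check of table-level access, you can connect and query as the Data Catalog service user. The host, database, and table names below are placeholders:

```
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" -n ldcuser
-- then, inside the beeline session:
show tables;
select * from <some_table> limit 5;
```

If `show tables` succeeds but the `select` fails, access is likely being restricted below the table level, which Data Catalog profiling requires.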
Copy the JDBC standalone JAR for HiveServer2 in HA mode
Perform the following steps to copy the file to these components:
Procedure
Stop the Data Catalog services.
Locate the JDBC standalone JAR file in your Hive library.
For example, /usr/hdp/<rightVersion>/hive/lib/hive-jdbc-1.2.xxx-standalone.jar.
Copy the JAR file to <LDC-HOME>/app-server/ext/ and to <LDC-HOME>/agent/ext/.
Restart the Data Catalog services.
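The locate-and-copy steps above might look like the following. The version embedded in the JAR name and the <LDC-HOME> location are examples; substitute the values from your environment:

```
ls /usr/hdp/<rightVersion>/hive/lib/hive-jdbc-*-standalone.jar
cp /usr/hdp/<rightVersion>/hive/lib/hive-jdbc-*-standalone.jar <LDC-HOME>/app-server/ext/
cp /usr/hdp/<rightVersion>/hive/lib/hive-jdbc-*-standalone.jar <LDC-HOME>/agent/ext/
```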
Validate access to Hadoop components through the browser
If your cluster is configured to use Kerberos, perform the following steps to configure the browser with a Kerberos plug-in and verify you have valid user credentials.
Procedure
Start a browser on a computer other than the edge node where you are installing Data Catalog.
Verify that you can sign into the following components for your cluster:
| Component | Access URL |
|---|---|
| Hue (CDH, MapR) | http://<HDFS file system host>:8888 |
| Ambari (HDP) | http://<HDFS file system host>:8080 |
| Cloudera Manager (CDH) | http://<HDFS file system host>:7180 |
| MapR Control System (MapR) | http://<HDFS file system host>:8443 |
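On a Kerberos-enabled cluster you can also verify SPNEGO authentication from the command line before troubleshooting the browser plug-in. This sketch assumes you hold a valid ticket; the principal and host are placeholders, and Ambari's port is used as an example:

```
kinit <your-principal>
curl --negotiate -u : http://<HDFS file system host>:8080/
```

An HTTP 200 response indicates your ticket is being accepted; a 401 suggests a Kerberos configuration problem rather than a browser issue.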
Verify HDFS discovery metadata storage
Perform the following steps to verify read, write, and execute access to permanent storage on HDFS:
Procedure
Navigate to storage on HDFS established for Data Catalog.
This location is usually /user/<Data Catalog service user (ldcuser)>/.
Determine whether the Data Catalog service user has read, write, and execute access to this location.
If the service user does not have access, either grant access or contact your Hadoop administrator for access.
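One way to confirm read, write, and execute access is to list the directory and then create and remove a test file as the service user. The path below assumes the default location mentioned above:

```
hdfs dfs -ls /user/ldcuser/
hdfs dfs -touchz /user/ldcuser/.ldc_access_test
hdfs dfs -rm /user/ldcuser/.ldc_access_test
```

If any of these commands fails with a permission error, the service user lacks the corresponding access.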
Before you begin (Installing on Kubernetes with standalone agents)
Before you begin to install Data Catalog on Kubernetes with standalone agents, you must obtain the standalone agent binary file.
Contact Hitachi Vantara support to obtain access to the following artifact:
- Standalone agent binary: ldc-agent-7.0.0.run
Set up CDH 7.1.3 (Kerberos enabled)
Use the following steps to set up the agent in a Kerberos-enabled CDH 7.1.3 environment:
- Execute run file and configure the agent
- Verify the application configuration
- Obtain a valid Kerberos ticket for the user registering/running the agent
- Register the agent
- Verify the connection
Execute run file and configure agent for CDH 7.1.3
Procedure
Copy the run file into the instance where you will run it.
Enter the following command:
sudo sh ldc-7.0.1-agent.run
Enter your responses similarly to this example:
```
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
             LUMADA DATA CATALOG AGENT INSTALLER
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 1. Express Install (Requires superuser access)
 2. Custom Install (Runs with non-sudo access)
 3. Upgrade
 4. Exit
Enter your choice [1-4]: 1
Enter the name of the Lumada Data Catalog service user [ldcuser]: ldcuser
Enter install location [/opt/ldc] : /opt/ldc_example
Enter log location [/var/log/ldc] : /opt/ldc_example/logs
Enter Appserver endpoint [http://localhost:3000] : 31083
Enter the name of the agent [suggested agent name] : <agent name>
Enter HIVE version [3.1.2]: <HIVE version>
Is Kerberos enabled? [y/N]: Y
Full path to Lumada Data Catalog service user keytab : /home/ldcuser/ldcuser.keytab
Lumada Data Catalog service user's fully qualified principal : ldcuser@<your company>.COM
~~~~~~~~~~~~~~~~~~~~~~~
   SELECTION SUMMARY
~~~~~~~~~~~~~~~~~~~~~~~
Lumada Data Catalog service user : ldcuser
Install location     : /opt/ldc_example/ldc (will be created)
Log location         : /opt/ldc_example/logs/ldc (will be created)
Kerberos enabled     : true
Kerberos keytab path : /home/ldcuser/ldcuser.keytab
Kerberos principal   : ldcuser@<your company>.COM
AppServer endpoint   : https://<hostname>:31083
Agent ID             : <agent name>
Proceed? [Y/n]: Y
```
Results
After the run file executes, the agent files are set up in the agent directory under your install location, for example /opt/ldc/agent.
Verify the application configuration
Use the following steps to verify the application configuration in the conf/application.yml file:
Note: When pointing the agent at its destination, you must use the correct DNS name. If you are using a Mac, you can check the certificate's DNS value in Keychain Access.
Procedure
Make sure your URL uses wss instead of ws, as in this line:

```
url: wss://<your host>:31083/wsagent
```

Verify the file contains https for the graphql-service-url and that the secure port is 31083, as in this line:

```
graphql-service-url: https://<hostname>:31083/graphql/
```
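These two checks can be scripted with grep. The sample application.yml below is fabricated for illustration; in practice, point the greps at your real conf/application.yml.

```shell
# Create a sample application.yml for illustration only; on a real agent host
# use conf/application.yml under the agent install directory (an assumption).
APP_YML=$(mktemp)
cat > "$APP_YML" <<'EOF'
home-server:
  url: wss://myhost.example.com:31083/wsagent
  graphql-service-url: https://myhost.example.com:31083/graphql/
EOF

# The agent URL must use wss (TLS), not plain ws.
wss_ok=$(grep -c 'url: wss://' "$APP_YML")
# The GraphQL endpoint must use https on the secure port 31083.
https_ok=$(grep -c 'graphql-service-url: https://.*:31083/' "$APP_YML")
echo "wss lines: $wss_ok, https lines: $https_ok"
rm -f "$APP_YML"
```

A count of zero for either pattern means the corresponding line needs to be corrected before the agent can connect securely.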
Results
```
home-server:
  isSecure: "false"
  url: wss://ldc7-app-server.ldc-test.svc.cluster.local:31083/wsagent
  agentId: agent-220412
  trust-store-name: ldc-truststore
  trust-store-password: OBF:1vv91v8s1v9u1sw01vnw1w8t1unz1w8x1vn61svy1v8s1v9u1vu1
  key-store-type: PKCS12
  key-store: classpath:keystore
  key-store-name: ldc-keystore
  key-store-password: OBF:1eoi1rwv1k051mqv1isz1itn1ms71jzt1rvz1eng
  graphql-service-url: https://ldc7-app-server.ldc-test.svc.cluster.local:31083/graphql/
  isDefault: "${LDC_APP_SERVER_DEFAULT:false}"
  token: ...
```
Set up AWS EMR (Non-Kerberos)
- Access the master node for EMR
- Configure agent for Non-Kerberos setup
- Start agent
- Verify connection
- Troubleshooting EMR