Standalone agent installation
If you have Hadoop-based data sources, such as HDFS and Hive, or if some of your data sources are separated from your Kubernetes cluster by high latency or low bandwidth, you need standalone agents closer to those data sources or on the edge nodes of your Hadoop cluster. These remote agents connect to the main Data Catalog deployment in Kubernetes to store metadata.
In this type of installation, make sure your environment meets the system requirements and validate the system components before you begin the installation. Follow the pre-installation steps to set up your environment before installing Data Catalog.
Pre-Installation steps for Standalone agent installation
Before starting the Lumada Data Catalog installation process, you must first set up all the dependent external components, such as setting up the Data Catalog service user and collecting other essential information about your cluster configuration. These prerequisites are applicable to most Hadoop distributions and include steps to help you prepare your Hadoop environment for the Data Catalog installation.
Basic pre-install preparation includes the following tasks:
Review system requirements
Review the external components and applications that Data Catalog supports. See System requirements for Standalone agent installation.
Verify components
Verify the proper functioning of the various Hadoop components that Data Catalog interacts with. See Component validation for Standalone agent installation.
You must also make the following changes outside of the Lumada Data Catalog environment or on other cluster nodes:
- Validate component functioning
- Configure components for Data Catalog
If you do not have control over or access to these components, you may need to plan in advance to perform these changes or contact your system administrator.
System requirements for Standalone agent installation
Lumada Data Catalog runs on an edge node in a Hadoop cluster and requires specific external components and applications to operate optimally. This section provides a list of those components and applications along with details of their use and versions we support. If you have questions about your particular computing environment, please contact support at the Hitachi Vantara Lumada and Pentaho Support Portal.
Processing engine
Data Catalog uses Apache Spark for profiling jobs against HDFS and Hive data. The application code is transferred to each cluster node and is executed against the data that resides on that node. The results are collected in either the MongoDB repository, the Discovery Cache, or both. Data Catalog then runs Spark jobs against the resulting metadata to determine suggestions for business term associations. See Discovery Cache storage.
Data Catalog jobs are standard Spark jobs, so your organizational practices for tuning cluster operations apply to running and tuning Data Catalog jobs.
Hadoop distributions
Data Catalog must be installed on the edge node within the Hadoop cluster. The Hadoop and Hive clients installed on this edge node must have access to the Hadoop NameNode and HiveServer2.
The following table lists the supported distributions and the compatible applications for each Hadoop distribution on the edge node.
| Distribution | Component | Version | Notes |
|---|---|---|---|
| Cloudera | CDH | 6.1.1 | - |
| | Apache Spark™ | 2.4 | - |
| | HDFS | 3.0.0 | - |
| | HIVE | 2.1.1 | - |
| Cloudera | CDP | 7.1.7 | - |
| | Apache Spark | 2.4.7 | - |
| | HDFS | 3.1.1 | - |
| | HIVE | 3.1.3 | - |
| | Apache Atlas | 2.1.0 | - |
| Amazon | EKS | 1.21.x | - |
| Amazon | EMR | 6.0.0 | - |
| | Apache Spark | 2.4.4 | - |
| | HDFS | 3.2.1 | - |
| | HIVE | 3.1.2 | - |
| HortonWorks | HDP | 3.1.0 | - |
| | Apache Spark | 2.3.2 | - |
| | HDFS | 3.1.1 | - |
| | HIVE | 3.1.0 | - |
| | Apache Atlas | 1.1.0 | - |
| MEP (MapR Ecosystem Pack) | MapR | 6.0.1 | - |
| | Apache Spark | 2.2.1 | - |
| | MapR-FS | 6.2.0.0.20200915234957.GA-1 | - |
| | HIVE | 2.3.6 | - |
| Microsoft Azure | HDI | 4.0 | - |
| | Apache Spark | 2.4.4 | - |
| | HDFS | 3.1.1 | - |
| | HIVE | 3.1.0 | - |
| | Cloud/Blob ADL | - | - |
Component validation for Standalone agent installation
You must verify the proper functioning of the various components that interact with Data Catalog.
Perform the following validation:
- Validate the Hadoop configuration.
Once the configuration is verified, you can start configuring the components for Data Catalog compatibility.
Validate Hadoop configuration
To install Data Catalog, you need to validate that your existing Hadoop components are running and communicating properly among themselves.
Perform the following steps to prepare for Data Catalog installation by validating each of the places where Data Catalog interacts with Hadoop:
- Verify the file system URI.
- Check cluster status.
- Verify HDFS and Hive access.
- Validate access to Hadoop components through the browser.
- Verify HDFS discovery metadata storage.
Verify the file system URI
Procedure
Use the following command to find the host name for your cluster:
$ hdfs getconf -confKey fs.defaultFS
Navigate to the core-site.xml file on the cluster.
Edit the core-site.xml to verify the fs.defaultFS parameter is set to the correct host name.
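If you want to script this comparison, the following sketch shows one way to extract the fs.defaultFS value from a copy of core-site.xml. The file contents, path, and host name here are illustrative assumptions; on a live cluster, point the script at your real core-site.xml and compare the result with the output of `hdfs getconf -confKey fs.defaultFS`.

```shell
# Sketch: parse fs.defaultFS out of core-site.xml. A sample file is created
# here for illustration; on a cluster, use the real file, for example
# /etc/hadoop/conf/core-site.xml (path is an assumption).
CORE_SITE=$(mktemp)
cat > "$CORE_SITE" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
EOF

# Pull the <value> element that follows the fs.defaultFS <name> element.
configured_uri=$(awk '/<name>fs.defaultFS<\/name>/{found=1}
  found && /<value>/{gsub(/.*<value>|<\/value>.*/, ""); print; exit}' "$CORE_SITE")
echo "fs.defaultFS is set to: $configured_uri"
rm -f "$CORE_SITE"
```

If the parsed value differs from what `hdfs getconf` reports, the edge node may be reading a stale client configuration.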
Check cluster status
Procedure
Verify that HDFS, MapReduce, and YARN are running.
If Hive is configured for your cluster, Hive and its constituent components, such as Hive Metastore, HiveServer2, WebHCat Server, and whichever database Hive uses (such as MySQL), must be running.
If you do not use a cluster management tool, such as Ambari, Cloudera Manager, or MapR Control System, check individual services by running the command line for the component, as shown in the following example codes:
$ hdfs dfsadmin -report
$ yarn version
$ beeline (!quit to exit)
Verifying HDFS and Hive access
Data Catalog depends on the cluster authorization system to manage user access to HDFS resources. An end user (other than the Data Catalog service user) who has access to HDFS files and Hive tables should have the same access to those files and tables from within Data Catalog. Be sure to identify such end users and their access to the HDFS files and Hive tables.
You can use the following applications to check end-user access:
- Hue or Apache Ambari
- Beeswax
For a Hortonworks cluster, you must perform additional steps if you are running HiveServer2 in High-Availability (HA) mode.
Verify HDFS and Hive access using Hue or Apache Ambari
Procedure
Navigate to your existing data in HDFS or load new data.
Verify that you can access files you own as well as files for which you have access through group membership.
If you cannot sign into Hue or Ambari or access HDFS files from inside one of these tools or from the command line, ask your Hadoop administrator for appropriate credentials.
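The same access check can also be made from the command line. These commands are illustrative; the paths are assumptions and should be replaced with locations that exist on your cluster:

```
# List your own home directory and a path you can read through group membership.
hdfs dfs -ls /user/$USER
hdfs dfs -ls /data/shared
```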
Verify HDFS and Hive access using Beeswax or Beeline
Perform the following steps:
Procedure
Verify that you can access the existing databases and tables.
If you cannot sign into Beeline or cannot access Hive tables with Beeswax, ask your Hadoop administrator for the applicable credentials. Determine whether your cluster uses Apache Ranger or Apache Sentry for access control, and run profiling jobs to test table-level access.
Verify that the Data Catalog service user has table-level access (not column-level access) to Hive tables that you want included in your catalog.
Note: The HDFS-Sentry plug-in does not support column-level access control for access from Spark SQL.
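As a quick check of table-level access, you can connect and query as the Data Catalog service user. The host, database, and table names below are placeholders:

```
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" -n ldcuser
-- then, inside the beeline session:
show tables;
select * from <some_table> limit 5;
```

If `show tables` succeeds but the `select` fails, access is likely being restricted below the table level, which Data Catalog profiling requires.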
Copy the JDBC standalone JAR for HiveServer2 in HA mode
Perform the following steps to copy the file to these components:
Procedure
Stop the Data Catalog services.
Locate the JDBC standalone JAR file in your Hive library.
For example, /usr/hdp/<rightVersion>/hive/lib/hive-jdbc-1.2.xxx-standalone.jar.
Copy the JAR file to <LDC-HOME>/app-server/ext/ and to <LDC-HOME>/agent/ext/.
Restart the Data Catalog services.
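The locate-and-copy steps above might look like the following. The version embedded in the JAR name and the <LDC-HOME> location are examples; substitute the values from your environment:

```
ls /usr/hdp/<rightVersion>/hive/lib/hive-jdbc-*-standalone.jar
cp /usr/hdp/<rightVersion>/hive/lib/hive-jdbc-*-standalone.jar <LDC-HOME>/app-server/ext/
cp /usr/hdp/<rightVersion>/hive/lib/hive-jdbc-*-standalone.jar <LDC-HOME>/agent/ext/
```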
Validate access to Hadoop components through the browser
If your cluster is configured to use Kerberos, perform the following steps to configure the browser with a Kerberos plug-in and verify you have valid user credentials.
Procedure
Start a browser on a computer other than the edge node where you are installing Data Catalog.
Verify that you can sign into the following components for your cluster:
| Component | Access URL |
|---|---|
| Hue (CDH, MapR) | http://<HDFS file system host>:8888 |
| Ambari (HDP) | http://<HDFS file system host>:8080 |
| Cloudera Manager (CDH) | http://<HDFS file system host>:7180 |
| MapR Control System (MapR) | http://<HDFS file system host>:8443 |
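On a Kerberos-enabled cluster you can also verify SPNEGO authentication from the command line before troubleshooting the browser plug-in. This sketch assumes you hold a valid ticket; the principal and host are placeholders, and Ambari's port is used as an example:

```
kinit <your-principal>
curl --negotiate -u : http://<HDFS file system host>:8080/
```

An HTTP 200 response indicates your ticket is being accepted; a 401 suggests a Kerberos configuration problem rather than a browser issue.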
Verify HDFS discovery metadata storage
Perform the following steps to verify read, write, and execute access to permanent storage on HDFS:
Procedure
Navigate to storage on HDFS established for Data Catalog.
This location is usually /user/<Data Catalog service user (ldcuser)>/.
Determine whether the Data Catalog service user has read, write, and execute access to this location.
If the service user does not have access, either grant access or contact your Hadoop administrator for access.
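One way to confirm read, write, and execute access is to list the directory and then create and remove a test file as the service user. The path below assumes the default location mentioned above:

```
hdfs dfs -ls /user/ldcuser/
hdfs dfs -touchz /user/ldcuser/.ldc_access_test
hdfs dfs -rm /user/ldcuser/.ldc_access_test
```

If any of these commands fails with a permission error, the service user lacks the corresponding access.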
Before you begin (Installing on Kubernetes with standalone agents)
Before you begin to install Data Catalog on Kubernetes with standalone agents, you must obtain the standalone agent binary file.
Contact Hitachi Vantara support to obtain access to the following artifact:
- Standalone agent binary: ldc-agent-7.0.0.run
Set up CDH 7.1.3 (Kerberos enabled)
Use the following steps to set up the agent in a Kerberos-enabled CDH 7.1.3 environment:
- Execute run file and configure the agent
- Verify the application configuration
- Obtain a valid Kerberos ticket for the user registering/running the agent
- Register the agent
- Verify the connection
Execute run file and configure agent for CDH 7.1.3
Procedure
Copy the run file into the instance where you will run it.
Enter the following command:
sudo sh ldc-7.0.1-agent.run
Enter your responses similarly to this example:
```
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
             LUMADA DATA CATALOG AGENT INSTALLER
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 1. Express Install (Requires superuser access)
 2. Custom Install (Runs with non-sudo access)
 3. Upgrade
 4. Exit
Enter your choice [1-4]: 1
Enter the name of the Lumada Data Catalog service user [ldcuser]: ldcuser
Enter install location [/opt/ldc] : /opt/ldc_example
Enter log location [/var/log/ldc] : /opt/ldc_example/logs
Enter Appserver endpoint [http://localhost:3000] : 31083
Enter the name of the agent [suggested agent name] : <agent name>
Enter HIVE version [3.1.2]: <HIVE version>
Is Kerberos enabled? [y/N]: Y
Full path to Lumada Data Catalog service user keytab : /home/ldcuser/ldcuser.keytab
Lumada Data Catalog service user's fully qualified principal : ldcuser@<your company>.COM
~~~~~~~~~~~~~~~~~~~~~~~
   SELECTION SUMMARY
~~~~~~~~~~~~~~~~~~~~~~~
Lumada Data Catalog service user : ldcuser
Install location     : /opt/ldc_example/ldc (will be created)
Log location         : /opt/ldc_example/logs/ldc (will be created)
Kerberos enabled     : true
Kerberos keytab path : /home/ldcuser/ldcuser.keytab
Kerberos principal   : ldcuser@<your company>.COM
AppServer endpoint   : https://<hostname>:31083
Agent ID             : <agent name>
Proceed? [Y/n]: Y
```
Results
After the run file executes, the agent files are set up in the agent directory under your install location, for example /opt/ldc/agent.
Verify the application configuration
Use the following steps to verify the application configuration in the conf/application.yml file:
Note: When pointing the agent at its destination, you must use the correct DNS name. If you are using a Mac, you can check the certificate's DNS value in Keychain Access.
Procedure
Make sure your URL uses wss instead of ws, as in this line:

```
url: wss://<your host>:31083/wsagent
```

Verify the file contains https for the graphql-service-url and that the secure port is 31083, as in this line:

```
graphql-service-url: https://<hostname>:31083/graphql/
```
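These two checks can be scripted with grep. The sample application.yml below is fabricated for illustration; in practice, point the greps at your real conf/application.yml.

```shell
# Create a sample application.yml for illustration only; on a real agent host
# use conf/application.yml under the agent install directory (an assumption).
APP_YML=$(mktemp)
cat > "$APP_YML" <<'EOF'
home-server:
  url: wss://myhost.example.com:31083/wsagent
  graphql-service-url: https://myhost.example.com:31083/graphql/
EOF

# The agent URL must use wss (TLS), not plain ws.
wss_ok=$(grep -c 'url: wss://' "$APP_YML")
# The GraphQL endpoint must use https on the secure port 31083.
https_ok=$(grep -c 'graphql-service-url: https://.*:31083/' "$APP_YML")
echo "wss lines: $wss_ok, https lines: $https_ok"
rm -f "$APP_YML"
```

A count of zero for either pattern means the corresponding line needs to be corrected before the agent can connect securely.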
Results
```
home-server:
  isSecure: "false"
  url: wss://ldc7-app-server.ldc-test.svc.cluster.local:31083/wsagent
  agentId: agent-220412
  trust-store-name: ldc-truststore
  trust-store-password: OBF:1vv91v8s1v9u1sw01vnw1w8t1unz1w8x1vn61svy1v8s1v9u1vu1
  key-store-type: PKCS12
  key-store: classpath:keystore
  key-store-name: ldc-keystore
  key-store-password: OBF:1eoi1rwv1k051mqv1isz1itn1ms71jzt1rvz1eng
  graphql-service-url: https://ldc7-app-server.ldc-test.svc.cluster.local:31083/graphql/
  isDefault: "${LDC_APP_SERVER_DEFAULT:false}"
  token: ...
```
Set up AWS EMR (Non-Kerberos)
- Access the master node for EMR
- Configure agent for Non-Kerberos setup
- Start agent
- Verify connection
- Troubleshooting EMR