Component validations
You must verify the proper functioning of the various Hadoop components that interact with Data Catalog.
Perform the following validations:
- Validate Hadoop configuration.
- Validate Spark environment variables.
- Validate the user authentication method.
Once these validations are complete, you can begin configuring the components for Data Catalog compatibility, starting with service user configuration. See Configure the Data Catalog service user for details.
Validate Hadoop configuration
To install Data Catalog, you need to validate that your existing Hadoop components are running and communicating properly among themselves.
Perform the following steps to prepare for Data Catalog installation by validating each of the places where Data Catalog interacts with Hadoop:
- Verify the file system URI.
- Check cluster status.
- Verify HDFS and Hive access.
- Validate access to Hadoop components through the browser.
- Verify HDFS discovery metadata storage.
Verify the file system URI
Procedure
Use the following command to find the host name for your cluster:
$ hdfs getconf -confKey fs.defaultFS
Navigate to the core-site.xml file on the cluster.
Edit the core-site.xml to verify the fs.defaultFS parameter is set to the correct host name.
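If you want to compare the two values in a script, a little shell parsing isolates the host name from the URI. The following is a minimal sketch; the `hdfs://namenode.example.com:8020` value is a hypothetical example of what `hdfs getconf -confKey fs.defaultFS` might return:

```shell
# Hypothetical value, as returned by:  hdfs getconf -confKey fs.defaultFS
fs_default="hdfs://namenode.example.com:8020"

# Strip the scheme and the port to isolate the host name, then compare it
# with the fs.defaultFS host configured in core-site.xml.
host="${fs_default#*://}"   # drop "hdfs://"
host="${host%%:*}"          # drop ":8020"
echo "$host"
```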
Check cluster status
Procedure
Verify that HDFS, MapReduce, and YARN are running.
If Hive is configured for your cluster, Hive and its constituent components, such as Hive Metastore, HiveServer2, WebHCat Server, and the database that Hive uses, such as MySQL, must be running.
If you do not use a cluster management tool, such as Ambari, Cloudera Manager, or MapR Control System, check individual services by running the command line for the component, as shown in the following example codes:
$ hdfs dfsadmin -report
$ yarn version
$ beeline (!quit to exit)
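If you prefer to script the status check, the `hdfs dfsadmin -report` output can be parsed for the live datanode count. A sketch, assuming the report contains a hypothetical line like the one below:

```shell
# Hypothetical line from the output of:  hdfs dfsadmin -report
report_line="Live datanodes (3):"

# Extract the number of live datanodes; a value of 0 would indicate
# that HDFS is not healthy.
live=$(printf '%s\n' "$report_line" | sed -n 's/^Live datanodes (\([0-9][0-9]*\)).*/\1/p')
echo "$live"
```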
Verify HDFS and Hive access
Data Catalog depends on the cluster authorization system to manage user access to HDFS resources. A user other than the Data Catalog service user who has access to HDFS files and Hive tables should also be able to access those files and tables from Data Catalog. Be sure to identify such end users and verify their access to the HDFS files and Hive tables.
You can use the following applications to check end-user access:
- Hue or Apache Ambari
- Beeswax
For a Hortonworks cluster, you must perform additional steps if you are running HiveServer2 in High-Availability (HA) mode.
Verify HDFS and Hive access using Hue or Apache Ambari
Procedure
Navigate to your existing data in HDFS or load new data.
Verify that you can access files you own as well as files for which you have access through group membership.
If you cannot sign into Hue or Ambari or access HDFS files from inside one of these tools or from the command line, ask your Hadoop administrator for appropriate credentials.
Verify HDFS and Hive access using Beeswax or Beeline
Perform the following steps:
Procedure
Verify that you can access the existing databases and tables.
If you cannot sign in to Beeline or cannot access Hive tables with Beeswax, ask your Hadoop administrator for the applicable credentials.
Profile the jobs to test table-level access and determine whether your cluster uses Apache Ranger or Apache Sentry for access control.
Verify that the Data Catalog service user has table-level access (not column-level access) to Hive tables that you want included in your catalog.
Note: Column-level access control for access from Spark SQL is not supported by the HDFS-Sentry plug-in.
Copy the JDBC standalone JAR for HiveServer2 in HA mode
Perform the following steps to copy the file to these components:
Procedure
Stop the Data Catalog services.
Locate the JDBC standalone JAR file in your Hive library.
For example, /usr/hdp/<rightVersion>/hive/lib/hive-jdbc-1.2.xxx-standalone.jar.
Copy the JAR file to <LDC-HOME>/app-server/ext/ and to <LDC-HOME>/agent/ext/.
Restart the Data Catalog services.
Validate access to Hadoop components through the browser
If your cluster is configured to use Kerberos, perform the following steps to configure the browser with a Kerberos plug-in and verify you have valid user credentials.
Procedure
Start a browser from a computer other than the edge node where you are installing Data Catalog.
Verify that you can sign into the following components for your cluster:
- Hue (CDH, MapR): http://<HDFS file system host>:8888
- Ambari (HDP): http://<HDFS file system host>:8080
- Cloudera Manager (CDH): http://<HDFS file system host>:7180
- MapR Control System (MapR): http://<HDFS file system host>:8443
Verify HDFS discovery metadata storage
Perform the following steps to verify read, write, and execute access to permanent storage on HDFS:
Procedure
Navigate to storage on HDFS established for Data Catalog.
This location is usually /user/<Data Catalog service user (ldcuser)>/.
Determine whether the Data Catalog service user has read, write, and execute access to this location.
If the service user does not have access, either grant access or contact your Hadoop administrator for access.
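You can read the access mode directly from the directory listing. The following sketch checks the owner bits of a hypothetical permission string, such as one printed by `hdfs dfs -ls -d /user/ldcuser`; it assumes the service user owns the directory:

```shell
# Hypothetical permission string from:  hdfs dfs -ls -d /user/ldcuser
perms="drwxr-x---"

# Characters 2-4 are the owner's permission bits; the service user
# needs rwx here (assuming it owns the directory).
owner_bits=$(printf '%s' "$perms" | cut -c2-4)
if [ "$owner_bits" = "rwx" ]; then
  echo "service user has read, write, and execute access"
fi
```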
Validate Spark environment variables
Perform the following steps to run the smoke test:
Procedure
Open spark-shell:
$ /usr/bin/spark-shell
Import the following packages to create SparkConf, SparkContext, and HiveContext objects:
$ import org.apache.spark.SparkConf
$ import org.apache.spark.SparkContext
$ import org.apache.spark.sql.hive.HiveContext
Create a new SQL context and run a test query against a table you can access:
$ val sqlContext = new HiveContext(sc);
$ sqlContext.sql("select count(*) from database.table").collect().foreach(println)
Validate the user authentication method
Before installing Data Catalog on a secure cluster, you need to identify what security measures your cluster employs for authentication so you can integrate Data Catalog to use the sanctioned security channels. You need to know the authentication method and configuration details. You also need to validate that you can access your authentication system.
You can use the following authentication schemes for end-user access to Data Catalog.
SSH configuration
If you are configured to use SSH for user authentication, the Data Catalog application server communicates with the host system on the listen address and port defined in /etc/ssh/sshd_config. By default, the port is set to 22. If your organization uses a different convention, update the port (authPort) in <App-Server Location>/conf/login-ssh.properties file.
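For example, if your hosts listen for SSH on port 2222 (a hypothetical value used here only for illustration), the entry in <App-Server Location>/conf/login-ssh.properties might look like:

```
# Hypothetical override: match the Port directive in /etc/ssh/sshd_config
authPort=2222
```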
SSH is a reliable security mechanism that has one limitation: it assumes the password authentication mechanism is available to the application server. For that reason, it does not work on systems that use Amazon AWS, Google Compute, or other cloud configurations.
Kerberos authentication
- Method 1: All user interactions are controlled through Kerberos, including
logins through the browser and connections made by the service user to run the web
application server and to start jobs. If you choose this method, then set your user
authentication to use Kerberos, and specify a valid Kerberos-credentialed user as the
initial administrator user.
The user must be configured to validate using a password. Be sure to test the connection with this user's Kerberos credentials.
- Method 2: Only service user operations are controlled through Kerberos. If you choose this method, your user authentication can be set to use SSH, and specify valid Kerberos credentials only in the Data Source connections.
Perform the following steps to check for the Kerberos client on your machine:
Procedure
Use the following command to check if the Kerberos client is installed and configured to contact the KDC:
$ kinit
This command should prompt you for the current user's password. If the command does not prompt you for the current user's password, then your machine is not yet configured with Kerberos. Work with your Kerberos administrator to do the following: install Kerberos, add this computer to the Kerberos database, and generate a keytab for this computer as a Kerberos application server.
Enter any value to exit the command.
Verify the Kerberos configuration file (/etc/krb5.conf) includes a description of the realm in which Data Catalog resides, as shown in the following example for a server in a sample company called "Acme":
[libdefaults]
  default_realm = ACME.COM
  dns_lookup_realm = false
  dns_lookup_kdc = false
  ticket_lifetime = 24h
  renew_lifetime = 7d
  forwardable = true
[realms]
  ACME.COM = {
    kdc = server1.acme.com:88
    admin_server = server1.acme.com:88
  }
[domain_realm]
  .acme.com = ACME.COM
  acme.com = ACME.COM
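The realm in your Kerberos principal must match the default_realm entry in /etc/krb5.conf. A small sketch of that check, assuming a hypothetical principal of the kind printed by klist:

```shell
# Hypothetical principal, as shown in the output of:  klist
principal="ldcuser@ACME.COM"

# The realm is everything after the "@"; it should match the
# default_realm entry in /etc/krb5.conf.
realm="${principal#*@}"
echo "$realm"
```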
Next steps
- The Hadoop service is running.
- The active user has access to the Hadoop application.
- The current user has a valid ticket: run klist from a terminal on the client computer.
- The browser is configured to use Kerberos when accessing secure sites.
- A Kerberos KDC is accessible from this computer.
LDAP authentication
You can configure Data Catalog to look up user authentication information in the directory service to verify the credentials of users logging on to Data Catalog.
This method uses the corporate LDAP directory (Active Directory) for user authentication. Validate the LDAP URL that Data Catalog should use to connect to the LDAP server.
This URL can include filters to indicate a specific group or groups that should have access to Data Catalog. For restricting LDAP Scope for user and group search, see LDAP Configuration.
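As an illustration only (the host, port, base DN, and group name below are hypothetical, and the exact filter syntax your Data Catalog version accepts is described in LDAP Configuration), an LDAP URL that restricts the search to members of one group might look like:

```
ldap://ldap.acme.com:389/ou=users,dc=acme,dc=com??sub?(memberOf=cn=ldc-users,ou=groups,dc=acme,dc=com)
```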
When a user is removed from the LDAP directory, Data Catalog promptly reflects the change.
Local database authentication for cloud deployments
Amazon Web Services (AWS) and Google Cloud Platform do not support a password authentication mechanism for managing users. These platforms use SSH key-based authentication as opposed to password authentication.
Data Catalog does not support SSH keys for authentication on cloud deployments. Instead, it uses a local database to store user credentials. This method does not supersede the cloud provider's security, and it does not override the operating system's security model.
The user list grants access to the Data Catalog web application only. Data Catalog respects the access permissions granted by the file system. The user list can include user names configured in the operating system. Listed users that are not mirrored in the operating system see only files that can be read by all users.
MapR authentication (with or without security)
Check that the MapR ticket generation utility is installed and configured to access the CLDB node using the following command:
$ maprlogin
If this command prompts for the current user's password, then this node is configured as part of a secure MapR cluster. In that case, configuring Data Catalog on a secure MapR cluster requires additional steps during the Data Catalog installation. See Installation for MapR for more information.