Component validations

You must verify the proper functioning of the various Hadoop components that interact with Data Catalog.

Perform the following validations:

  1. Validate Hadoop configuration.
  2. Validate Spark environment variables.
  3. Validate the user authentication method.

After you complete these validations, you can start configuring the components for Data Catalog compatibility, beginning with service user configuration. See Configure the Data Catalog service user for details.

Validate Hadoop configuration

To install Data Catalog, you need to validate that your existing Hadoop components are running and communicating properly among themselves.

Perform the following steps to prepare for Data Catalog installation by validating each of the places where Data Catalog interacts with Hadoop:

  1. Verify the file system URI.
  2. Check cluster status.
  3. Verify HDFS and Hive access.
  4. Validate access to Hadoop components through the browser.
  5. Verify HDFS discovery metadata storage.

Verify the file system URI

Perform the following steps to identify the host name (<HDFS file system host>) for the Hadoop file system:

Procedure

  1. Use the following command to find the host name for your cluster:

    $ hdfs getconf -confKey fs.defaultFS
  2. Navigate to the core-site.xml file on the cluster.

  3. Edit the core-site.xml file to verify that the fs.defaultFS parameter is set to the correct host name, as shown in the example below.
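
For reference, the fs.defaultFS property in core-site.xml typically looks like the following. The port shown (8020) is a common NameNode default and is only an assumption; your cluster may use a different port.

    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://<HDFS file system host>:8020</value>
    </property>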

Check cluster status

Perform the following steps to verify your Hadoop services are running and active:

Procedure

  1. Verify that HDFS, MapReduce, and YARN are running.

  2. If Hive is configured for your cluster, verify that Hive and its constituent components are running, including Hive Metastore, HiveServer2, WebHCat Server, and the database that Hive uses, such as MySQL.

  3. If you do not use a cluster management tool, such as Ambari, Cloudera Manager, or MapR Control System, check individual services by running the command-line tool for each component, as shown in the following examples:

    • $ hdfs dfsadmin -report
    • $ yarn version
    • $ beeline (!quit to exit)
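
As a further spot check, you can confirm that YARN nodes are healthy and that HiveServer2 is listening on its port. The commands below assume the default HiveServer2 port (10000) and that netstat is installed; adjust them for your environment.

    $ yarn node -list
    $ netstat -tln | grep 10000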

Verify HDFS and Hive access

Data Catalog depends on the cluster authorization system to manage user access to HDFS resources. A user other than the Data Catalog service user who has access to HDFS files and Hive tables should also have access to those files and tables from Data Catalog. Be sure to identify such end users and their access to the HDFS files and Hive tables.

You can use the following applications to check end-user access:

  • Hue or Apache Ambari
  • Beeswax

For a Hortonworks cluster, you must perform additional steps if you are running HiveServer2 in High Availability (HA) mode.

Verify HDFS and Hive access using Hue or Apache Ambari

Perform the following steps with Hue or Apache Ambari to verify HDFS and Hive access:

Procedure

  1. Navigate to your existing data in HDFS or load new data.

  2. Verify that you can access files you own as well as files for which you have access through group membership.

    If you cannot sign into Hue or Ambari, or cannot access HDFS files from inside one of these tools or from the command line, ask your Hadoop administrator for appropriate credentials.
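
If you prefer to verify from the command line instead, a quick check is to list a directory you own and read a file you can access through group membership. The paths below are placeholders for your environment.

    $ hdfs dfs -ls /user/<your user name>
    $ hdfs dfs -cat /path/to/group-readable/file | head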

Verify HDFS and Hive access using Beeswax or Beeline

You can verify HDFS and Hive access with Beeswax or Beeline. Beeswax is accessible through Hue and Ambari, and Beeline is accessible through the Hive command line.

Perform the following steps:

Procedure

  1. Verify that you can access the existing databases and tables.

    If you cannot sign into Beeline or cannot access Hive tables with Beeswax, then ask your Hadoop administrator for the applicable credentials.
  2. Determine whether your cluster uses Apache Ranger or Apache Sentry for access control, and profile jobs to test table-level access.

  3. Verify that the Data Catalog service user has table-level access (not column-level access) to Hive tables that you want included in your catalog.

    Note: Column-level access control for access from Spark SQL is not supported by the HDFS-Sentry plug-in.
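
For example, you can connect with Beeline and run a few quick checks. The connection URL below is a placeholder; it must match your HiveServer2 host, port, and security settings (for example, a Kerberos principal on a secured cluster).

    $ beeline -u "jdbc:hive2://<HiveServer2 host>:10000/default"
    > show databases;
    > show tables;
    > !quit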

Copy the JDBC standalone JAR for HiveServer2 in HA mode

If you want to run HiveServer2 in HA mode on a Hortonworks cluster, you need to copy the standalone JAR to the LDC Application Server and to the LDC Agent.

Perform the following steps to copy the file to these components:

Procedure

  1. Stop the Data Catalog services.

  2. Locate the JDBC standalone JAR file in your Hive library.

    For example, /usr/hdp/<rightVersion>/hive/lib/hive-jdbc-1.2.xxx-standalone.jar.
  3. Copy the JAR file to <LDC-HOME>/app-server/ext/ and to <LDC-HOME>/agent/ext/.

  4. Restart the Data Catalog services.
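
The copy in step 3 is a straightforward file operation, for example as follows. The paths are placeholders and the JAR file name depends on your HDP release.

    $ cp /usr/hdp/<rightVersion>/hive/lib/hive-jdbc-*-standalone.jar <LDC-HOME>/app-server/ext/
    $ cp /usr/hdp/<rightVersion>/hive/lib/hive-jdbc-*-standalone.jar <LDC-HOME>/agent/ext/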

Validate access to Hadoop components through the browser

You need access to the cluster host from the remote computer.

If your cluster is configured to use Kerberos, perform the following steps to configure the browser with a Kerberos plug-in and verify you have valid user credentials.

Procedure

  1. Start a browser on a computer other than the edge node where you are installing Data Catalog.

  2. Verify that you can sign into the following components for your cluster:

    Component                      Access URL
    Hue (CDH, MapR)                http://<HDFS file system host>:8888
    Ambari (HDP)                   http://<HDFS file system host>:8080
    Cloudera Manager (CDH)         http://<HDFS file system host>:7180
    MapR Control System (MapR)     http://<HDFS file system host>:8443
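
If you only need to confirm that these web UIs are reachable from the remote computer, you can also probe them with curl before opening a browser. The example below assumes Ambari on port 8080; on a Kerberos-enabled cluster, the --negotiate option makes curl use your ticket (curl must be built with GSS support).

    $ curl -I http://<HDFS file system host>:8080
    $ curl --negotiate -u : -I http://<HDFS file system host>:8080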

Verify HDFS discovery metadata storage

You need permanent storage on HDFS for the metadata used for discovery operations. The location of this storage is used by discovery profiling, tag discovery, and lineage discovery jobs. The Data Catalog service user needs read, write, and execute access to this location.

Perform the following steps to verify read, write, and execute access to permanent storage on HDFS:

Procedure

  1. Navigate to the storage location on HDFS established for Data Catalog.

    This location is usually /user/<Data Catalog service user (ldcuser)>/.
  2. Determine if the Data Catalog service user has read, write, and execute access to this location.

  3. If the service user does not have access, either grant access or contact your Hadoop administrator for access.
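
A quick way to confirm the permissions from the command line is shown below, assuming the default location (/user/ldcuser) and the ldcuser service user. The group name and the chown/chmod settings are assumptions; run those two commands only if access is missing and you have HDFS administrator rights.

    $ hdfs dfs -ls /user/ldcuser
    $ hdfs dfs -touchz /user/ldcuser/.ldc_write_test && hdfs dfs -rm /user/ldcuser/.ldc_write_test
    $ hdfs dfs -chown ldcuser:ldcuser /user/ldcuser
    $ hdfs dfs -chmod 770 /user/ldcuser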

Validate Spark environment variables

You need to run a smoke test to verify that all underlying Spark environment variables are correctly set up. Data Catalog jobs run in a manner similar to a Spark SQL context.

Perform the following steps to run the smoke test:

Procedure

  1. Open spark-shell:

    $ /usr/bin/spark-shell

  2. Import the following packages to create SparkConf, SparkContext, and HiveContext objects:

    scala> import org.apache.spark.SparkConf

    scala> import org.apache.spark.SparkContext

    scala> import org.apache.spark.sql.hive.HiveContext

  3. Create a new Hive-enabled SQL context and run a test query against a table you can access:

    scala> val sqlContext = new HiveContext(sc)

    scala> sqlContext.sql("select count(*) from database.table").collect().foreach(println)

Validate the user authentication method

Before installing Data Catalog on a secure cluster, you need to identify what security measures your cluster employs for authentication so you can integrate Data Catalog to use the sanctioned security channels. You need to know the authentication method and configuration details. You also need to validate that you can access your authentication system.

You can use the following authentication schemes for end-user access to Data Catalog.

SSH configuration

If your environment is configured to use SSH for user authentication, the Data Catalog application server communicates with the host system on the listen address and port defined in /etc/ssh/sshd_config. By default, the port is set to 22. If your organization uses a different convention, update the port (authPort) in the <App-Server Location>/conf/login-ssh.properties file.
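
To confirm the values in use, you can check the SSH daemon configuration and, if needed, set the matching value in login-ssh.properties. The port value 2222 below is only an example.

    $ grep -Ei '^(Port|ListenAddress)' /etc/ssh/sshd_config

    # In <App-Server Location>/conf/login-ssh.properties
    authPort=2222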

SSH is a reliable security mechanism, but it has one limitation: it assumes that password authentication is available to the application server. For that reason, it does not work on Amazon AWS, Google Compute, or other cloud configurations.

Kerberos authentication

If your cluster is controlled with Kerberos authentication, you can use one of the following two methods to configure Data Catalog to interact with the Key Distribution Center (KDC):
  • Method 1: All user interactions are controlled through Kerberos, including logins through the browser and connections made by the service user to run the web application server and to start jobs. If you choose this method, then set your user authentication to use Kerberos, and specify a valid Kerberos-credentialed user as the initial administrator user.

    The user must be configured to validate using a password. Be sure to test the connection with this user's Kerberos credentials.

  • Method 2: Only service user operations are controlled through Kerberos. If you choose this method, set your user authentication to use SSH, and specify valid Kerberos credentials only in the Data Source connections.

Perform the following steps to check for the Kerberos client on your machine:

Procedure

  1. Use the following command to check if the Kerberos client is installed and configured to contact the KDC:

    $ kinit

    This command should prompt you for the current user's password. If the command does not prompt you for the current user's password, then your machine is not yet configured with Kerberos. Work with your Kerberos administrator to do the following: install Kerberos, add this computer to the Kerberos database, and generate a keytab for this computer as a Kerberos application server.

  2. Enter any value and exit the command.

  3. Verify the Kerberos configuration file (/etc/krb5.conf) includes a description of the realm in which Data Catalog resides, as shown in the following example for a server in a sample company called "Acme":

    [libdefaults]
        default_realm = ACME.COM
        dns_lookup_realm = false
        dns_lookup_kdc = false
        ticket_lifetime = 24h
        renew_lifetime = 7d
        forwardable = true
    [realms]
        ACME.COM = {
            kdc = server1.acme.com:88
            admin_server = server1.acme.com:88
        }
    [domain_realm]
        .acme.com = ACME.COM
        acme.com = ACME.COM

Next steps

If you are not able to sign in, check the following in your Kerberos environment:
  • The Hadoop service is running.
  • The active user has access to the Hadoop application.
  • The current user has a valid ticket: run klist from a terminal on the client computer.
  • The browser is configured to use Kerberos when accessing secure sites.
  • A Kerberos KDC is accessible from this computer.
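
For example, you can confirm that the current user holds a valid ticket, and obtain one if necessary, with the standard Kerberos client tools. The principal below is a placeholder that uses the sample ACME.COM realm from the configuration above.

    $ klist
    $ kinit <user>@ACME.COM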

LDAP authentication

You can configure Data Catalog to look up user authentication information in the directory service to verify the credentials of the user logging on to Data Catalog.

This method uses the corporate LDAP directory (Active Directory) for user authentication. Validate the LDAP URL that Data Catalog should use to connect to the LDAP server.

This URL can include filters to indicate a specific group or groups that should have access to Data Catalog. To restrict the LDAP scope for user and group searches, see LDAP Configuration.
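
One way to validate the URL and search base before configuring Data Catalog is to run a test query with ldapsearch. All names below are placeholders for your directory; the sAMAccountName attribute applies to Active Directory.

    $ ldapsearch -H ldaps://<LDAP host>:636 \
        -D "<bind DN>" -W \
        -b "<user search base>" "(sAMAccountName=<user>)"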

When a user is removed from LDAP, Data Catalog promptly reflects the removal.

Local database authentication for cloud deployments

Amazon Web Services (AWS) and Google Cloud Platform do not support a password authentication mechanism for managing users. These platforms use SSH key-based authentication as opposed to password authentication.

Data Catalog does not support SSH keys for authentication on cloud deployments. It uses a local database to store user credentials. This method does not supersede the cloud provider's security and it does not override the operating system's security concepts.

The user list grants access to the Data Catalog web application only. Data Catalog respects the access permissions granted by the file system. The user list can include user names configured in the operating system. Listed users that are not mirrored in the operating system see only files that can be read by all users.

MapR authentication (with or without security)

Check that the MapR ticket generation utility is installed and configured to access the CLDB node using the following command:

$ maprlogin password

If this command prompts for the current user's password, then this node is configured as part of a secure MapR cluster. In that case, configuring Data Catalog over a secure MapR cluster requires additional steps during the Data Catalog installation. See Installation for MapR for more information.