Installing Lumada Data Catalog on CDH, HDP, or CDP

These instructions assume that you are a systems administrator installing Lumada Data Catalog on a CDH, HDP, or CDP platform distribution. Ensure you have completed the Component validations before proceeding with installation. The following sections describe how to download and run the installers.

Requirements

Data Catalog requires the following external components:

  • Solr
  • Postgres

Solr and PostgreSQL (Postgres) must be installed before you install Data Catalog. See the instructions for installing and configuring these components, and gather the following information before installing Data Catalog. The actual values are specific to your environment.

Solr Connection Details

  • Solr URL: http://hostname:8983/solr
  • ZooKeeper Znode: hostname:2181/solr
  • Collection Name: wdcollection

For installation details, see Downloading and installing Solr.

Postgres Connection Details

  • URL: jdbc:postgresql://hostname:5432/postgres
  • Username: ldcuser
  • Password: ldcuser
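
As a minimal sketch, the connection values above can be derived from host names with a few shell helpers. The function names are hypothetical and are not part of Data Catalog. The ports (8983, 2181, 5432), the /solr znode, and the postgres database name are the sample values from this section and may differ in your environment.

```shell
# Hypothetical helpers (not shipped with Data Catalog) that print the
# connection values to record before installation.
# Default ports mirror the sample values in this section; override as needed.

solr_url() {            # usage: solr_url HOST [PORT]
    printf 'http://%s:%s/solr\n' "$1" "${2:-8983}"
}

zookeeper_znode() {     # usage: zookeeper_znode HOST [PORT]
    printf '%s:%s/solr\n' "$1" "${2:-2181}"
}

postgres_jdbc_url() {   # usage: postgres_jdbc_url HOST [PORT] [DATABASE]
    printf 'jdbc:postgresql://%s:%s/%s\n' "$1" "${2:-5432}" "${3:-postgres}"
}
```

For example, solr_url hostname prints the Solr URL in the same form as the sample value, which you can paste into the installer later.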

Download the Data Catalog packages

Download the Data Catalog distribution from the location provided by Data Catalog. If your organization has subscribed to support, you can find the location through the Hitachi Vantara Lumada and Pentaho Support Portal.

Installers

You should obtain access to three installers, where X is the specific version of the package you want to install:

  • ldc-app-server-X.run
  • ldc-metadata-server-X.run
  • ldc-agent-X.run

Installation sequence

The installation of the Data Catalog packages must occur in the following order:

  1. Lumada Data Catalog Application Server
  2. Lumada Data Catalog Metadata Server
  3. Lumada Data Catalog Agents
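
The installer names and the required order above can be captured in a small helper, for example to confirm that all three files were downloaded before you start. This is only a sketch: the function name is hypothetical, and it assumes the installers sit in the current directory.

```shell
# Hypothetical helper: print the installer file names in the required
# installation order for a given version string.
installers_in_order() {     # usage: installers_in_order VERSION
    for component in app-server metadata-server agent; do
        printf 'ldc-%s-%s.run\n' "$component" "$1"
    done
}

# Example: list the files for version 6.0.0 and flag any that are missing
# from the current directory.
installers_in_order 6.0.0 | while read -r f; do
    if [ -f "$f" ]; then
        echo "found:   $f"
    else
        echo "missing: $f"
    fi
done
```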

Installing Data Catalog packages

The following installation is a generic installation on a non-Kerberized environment. For environment-specific installations, see the related articles in Installing Lumada Data Catalog.

If you cannot find installation instructions for your specific environment, contact support through the Hitachi Vantara Lumada and Pentaho Support Portal.

Install the Lumada Data Catalog Application Server

The Lumada Data Catalog (LDC) Application Server is installed in two parts: a command line part and a browser-based part. Before you begin the Data Catalog installation, you must have a user who has root access permissions or sudo permissions. You must also provision a directory for a storage location, typically in HDFS, where the Data Catalog service user can store and access Data Catalog’s computed fingerprints.

An HDFS privileged user, typically hdfs, can use the following commands to create a storage location:

hdfs dfs -mkdir /user/ldcuser

hdfs dfs -chown -R ldcuser:ldcuser /user/ldcuser
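
If your service user is not named ldcuser, the two commands above change accordingly. The following sketch prints the commands for a given user without running them. The function name is hypothetical, and it assumes the storage location is the user's HDFS home directory, as in the example above.

```shell
# Hypothetical dry-run helper: print the HDFS commands that create a home
# directory for the given service user and hand over its ownership.
# Nothing is executed against the cluster.
storage_commands() {        # usage: storage_commands USER [GROUP]
    user=$1
    group=${2:-$1}          # group defaults to the user name
    printf 'hdfs dfs -mkdir /user/%s\n' "$user"
    printf 'hdfs dfs -chown -R %s:%s /user/%s\n' "$user" "$group" "$user"
}

storage_commands ldcuser
```

After reviewing the output, an HDFS privileged user can run the printed commands as they are.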

Use the following steps to install Data Catalog.

Part 1: Perform command line installation

Procedure

  1. Stop all Data Catalog processes that are running.

  2. Begin the installation using one of the following two commands:

    If you are NOT using Kerberos authentication, use this command:

    # sudo bash ./ldc-app-server-6.0.0.run

    If you are using Kerberos authentication, use this command:

    # sudo bash ./ldc-app-server-*.run -- --no-exec

    Important: To accommodate Kerberos, a manual step is required between the command line and the browser-based parts of the installation. This command prevents the web server from starting. Before resuming Data Catalog installation, you must perform the steps in Set up Kerberos. When you have finished, proceed to Part 2: Perform installation in the browser.
  3. As a user with root access, from the command line interface, run the executable file you downloaded from the Lumada and Pentaho Support Portal according to your command choice, above.

  4. Enter 1 to use Express Install.

    The installer file is uncompressed and verified.
  5. At the prompts, complete your entries and selections for the following items:

    • The name of the Lumada Data Catalog service user.
    • The installation location.
    • The log location.
    • The installed Solr server version.
    • Whether Kerberos is enabled.
    • Whether to link hdfs-site.xml, hive-site.xml, and core-site.xml to the Lumada Data Catalog installation.
    • The full path to core-site.xml file.
    • The full path to hdfs-site.xml file.
    • The full path to hive-site.xml file.
  6. Review the summary of your selections, then enter Y(es) when ready to proceed.

    The directories are created, the LDC Application Server is installed, and services are started.

Results

This completes the command line part of the installation process.

Next steps

Proceed to Part 2: Perform installation in the browser.

Important: If you are performing the LDC Application Server installation for the Kerberos special case, restart the LDC Application Server using the restart command from the LDC Application Server's home directory before proceeding.

Part 2: Perform installation in the browser

The installation process continues in a browser window.

Perform the following steps to continue the installation:

Procedure

  1. Browse to the setup link at http://<LDC node>:8082/setup.

    The browser opens the Welcome to Lumada Data Catalog page.
  2. Click Let's get started.

    The browser opens the Lumada Data Catalog End User License Agreement page.
  3. Read the license terms and conditions, then select the check box to accept the license agreement and click I agree.

    A license is granted and the Connect with Solr page appears.
  4. Enter the following fields and settings on the Connect with Solr page to set up the Data Catalog Solr collection repository:

    1. In Solr Client Mode, select the client mode that corresponds to your Solr installation.

      These are the same values that you configured in Install Solr and Create the collection.
    2. In the Solr Server Url field, enter the URL of Solr server.

    3. In the Solr Zookeeper Ensemble field, enter the ZooKeeper ensemble.

    4. In the Lumada Data Catalog Collection Name field, enter the Solr collection name for your collection.

    5. In Solr Authentication Mode, select an authentication mode.

      Depending on the Solr authentication mode for your Solr implementation, you may have to enter the Solr credentials or ensure that a valid Kerberos ticket is active at the time of this installation.
    6. Click Test Connection.

      The Connection Successful message appears.
      Note: If the test does not succeed, verify that Solr is running and that the Data Catalog service user has access to the collection.
    7. Click Next step.

    The Connect with Postgres page appears.
  5. Enter the following fields and settings to set up the Postgres database, which is used for Data Catalog audit logs and Discussions:

    1. In the Postgres Driver Class field, enter the class used to communicate with the Postgres database.

    2. In the Url field, enter the location of the Postgres database installation.

    3. In the Postgres User field, enter the username used to access Postgres.

    4. In the Postgres Password field, enter the password for the above username used to access Postgres.

    5. Click Test Connection.

      The Postgres Connection Successful message appears.
      Note: If the test does not succeed, verify that Postgres is running and that the Postgres user has access to the database.
    6. Click Next step.

      The Large properties storage page appears.
  6. Enter the following details about the storage location on your cluster (typically HDFS), which is used to store the metadata information required for running Data Catalog jobs and identifying tags:

    1. In the Large Metadata Storage Uri field, verify the URI that the installer detected automatically.

      If the detected URI is incorrect, enter the URI of the HDFS name node. If High Availability (HA) is enabled, this URI will be the HA URI of the HDFS service.
    2. In the Parent Path field, enter the path of the parent where you want to store the metadata. This path is typically the home directory of the Data Catalog user.

      This path must be write-accessible by the Data Catalog service user. When running jobs, a .ldc_hdfs_metadata directory is created under this path.
    3. Click Test Connection.

      The Connection Successful message appears.

      Note: If you have not already configured the Hadoop proxy settings of the Data Catalog service user, the Test Connection might fail. Verify that the client configuration has been propagated to the entire cluster and that the cluster has been restarted. In addition, check the following true-distributed setups for accessibility:
      • For jobs involved in reading or writing to a large properties storage in an HDFS cluster, the namenodes and datanodes of that cluster need to be network-accessible from the cluster where the jobs are executed.
      • If the clusters involved are configured with different Kerberos realms with no mutual trust, the jobs will fail. For such scenarios, it is best to configure the large properties storage to be a neutral accessible system, such as an S3 bucket.
    4. Click Next step.

    The Repository Bootstrap page appears.
  7. Ensure the following repository and roles bootstrap processes successfully complete:

    • The Solr schema is created.
    • The Postgres schema is created.
    • The Roles, Job sequences, and Built-in tags are bootstrapped.
  8. When the process finishes, click Next step.

    The Authentication method page appears.

Next steps

Proceed according to the type of authentication you want to use:

Use LDAP for authentication

This procedure assumes that you have already configured the authentication method that will validate the users logging into Data Catalog. For more information, see Validating the user authentication method.

Perform the following steps to configure Data Catalog to use LDAP to validate users who log in to the web application:

Procedure

  1. In Authentication Type, select LDAP.

  2. For LDAP Auth Mode, select the authentication mode.

    See LDAP search modes for details.
  3. For LDAP Url, enter the URL for the authentication type.

  4. For Auth Identity Pattern, enter the identity pattern for the authentication.

    The pattern must contain the username literal, which is replaced by the actual user ID when a user logs in.
  5. For Lumada Data Catalog Administrator, enter a user as the administrator who manages the Data Catalog.

    As a best practice, enter ldcuser. However, you can enter a different name here now or later. This user is granted the Data Catalog Administrator role, which is configured to have access to all data sources and tag domains. Use this login to add additional users and to continue configuration tasks. See Role-based access control (RBAC) for information.
  6. For Test Authentication, enter the user credentials of the administrator in the Username and Password fields.

  7. Click Test Login.

    The Login successful message appears.
  8. Click Next step.

    The LDC Metadata Server details page appears.

Use SSH for authentication

This procedure assumes that you have already configured the authentication method that will validate the users logging into Data Catalog. For more information, see Validating the user authentication method.

Perform the following steps to configure Data Catalog to use SSH to validate users who log in to the web application:

Procedure

  1. In Authentication Type, select SSH.

  2. For SSH Host, enter the host to connect to for SSH authentication.

  3. For SSH Port, enter the standard SSH port 22 or another port if configured separately.

  4. For Host Fingerprint, enter the SHA256 sum of the RSA host key used to verify the host.

    This field is automatically filled if Detect is used.
  5. For Lumada Data Catalog Administrator, enter a user as the administrator who manages the Data Catalog.

    As a best practice, enter ldcuser. However, you can enter a different name here now or later. This user is granted the Data Catalog Administrator role, which is configured to have access to all data sources and tag domains. Use this login to add additional users and to continue configuration tasks. See Role-based access control (RBAC) for information.
  6. For Test Authentication, enter the user credentials of the administrator in the Username and Password fields.

  7. Click Test Login.

    The Login successful message appears.
  8. Click Next step.

    The LDC Metadata Server details page appears.
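
If you prefer to check the Host Fingerprint value from step 4 independently of Detect, one way to compute a SHA-256 fingerprint of an RSA host key is with standard OpenSSH tools. This is a sketch under assumptions: the function name is hypothetical, and whether the field expects exactly the OpenSSH format (SHA256: followed by a base64 string) should be confirmed against the value that Detect fills in.

```shell
# Print the SHA-256 fingerprint of a public key file in OpenSSH format,
# for example "SHA256:...". Requires OpenSSH's ssh-keygen.
host_key_fingerprint() {    # usage: host_key_fingerprint PUBLIC_KEY_FILE
    # ssh-keygen -l prints: <bits> <fingerprint> <comment> (<type>)
    ssh-keygen -l -E sha256 -f "$1" | awk '{print $2}'
}

# Typical use on the SSH host itself (the path is the common OpenSSH
# default and may differ on your system):
#   host_key_fingerprint /etc/ssh/ssh_host_rsa_key.pub
#
# Or from another machine, fetching the key first (HOST is a placeholder):
#   ssh-keyscan -t rsa HOST > /tmp/host_rsa.pub
#   host_key_fingerprint /tmp/host_rsa.pub
```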

Use Kerberos for authentication

This procedure assumes that you have already configured the authentication method that will validate the users logging into Data Catalog. For more information, see Validating the user authentication method.

Perform the following steps to configure Data Catalog to use Kerberos to validate users who log in to the web application:

Procedure

  1. For Authentication Type, select KERBEROS.

  2. For Lumada Data Catalog Administrator, enter a user as the administrator who manages the Data Catalog.

    As a best practice, enter ldcuser. However, you can enter a different name here now or later. This user is granted the Data Catalog Administrator role, which is configured to have access to all data sources and tag domains. Use this login to add additional users and to continue configuration tasks. See Role-based access control (RBAC) for information.
  3. For Test Authentication, enter the user credentials of the administrator in the Username and Password fields.

  4. Click Test Login.

    The Login successful message appears.
  5. Click Next step.

    The LDC Metadata Server details page appears.

Get the Lumada Data Catalog Metadata Server command token

The LDC Application Server installation automatically creates a Lumada Data Catalog (LDC) Metadata Server command token, which is needed when you initialize and register the LDC Metadata Server with the LDC Application Server.

Perform the following steps to copy the command token and finish the LDC Application Server setup:

Procedure

  1. Click the copy icon to copy the LDC Metadata Server command token from the LDC Metadata Server details page, and then save the contents locally for later use.

    You need this information when installing the LDC Metadata Server.
  2. Click Next step.

    The Restart page appears.
  3. Click Restart to apply the changes.

    After the changes are applied, Data Catalog is ready.

    The Welcome page appears.

    You may have to restart the Data Catalog services using a command line to make sure the changes are applied successfully: $ bin/app-server restart. If Page not found is returned, check the setup.log file in the /var/log/ldc directory for insights into the failure.

  4. Log in with your Data Catalog administrator credentials.

    If you select the Remember me check box, Data Catalog remembers only the username of the current user. If your organization policy does not permit username retention, you can disable this feature by setting the value of the ldc.web.login.AutoCompleteAllowed property in <LDC App-Server>/conf/configuration.json to false.

    Some browsers, however, may permit the auto-fill form feature, which is outside the control of Data Catalog.

    Important: Whenever you restart the Postgres server, you must also restart the LDC Application Server and, if they are installed, the LDC Metadata Server and the LDC Agent.

Install the Lumada Data Catalog Metadata Server

Before you begin

If you have not restarted the LDC Application Server since installing it, do so before performing the steps below.

The command token installs the Lumada Data Catalog (LDC) Metadata Server binaries and configuration at the specified location, connects to the LDC Application Server at the specified endpoint, authenticates, and fetches the configuration from the LDC Application Server (for example, Solr server connection parameters). It publishes its own public-host, port, TLS (on/off) flag, and certificate fingerprint to the LDC Application Server.

Follow the steps below to install the LDC Metadata Server:

Procedure

  1. If you have the previously generated LDC Metadata Server command token, go to step 3. Otherwise, proceed to the next step.

  2. (Optional) Retrieve the command token:

    1. Navigate to Manage and then click Network.

    2. Click Metadata Server and then select metadata-rest-server.

    3. Click 1 selected, and then select Install Metadata Rest Server from the drop-down menu.

      The Install Metadata Server pane appears.
    4. Click the copy icon and save the token to a location.

    5. Click Close to close the pane.

  3. Log off Data Catalog.

  4. From a command line interface, run the executable file you downloaded from the Lumada and Pentaho Support Portal along with the LDC Metadata Server command token on the node where you want to install the LDC Metadata Server.

    [root@docker b1318]# sudo bash ./ldc-metadata-server-6.0.1.run -- --init \
                                        --endpoint http://docker.ldc.com:8082 \
                                        --client-id metadata-rest-server \
                                        --token 270831cf-1141-4a7f-adc9-7e452b33d4d8 \
                                        --public-host docker.ldc.com \
                                        --port 4242 \
                                        --public-port 4242
    

    Refer to the following list for a description of each argument:

    • --init

      Initialize: synchronize the repository configuration from the LDC Application Server.

    • --endpoint

      The URL of the LDC Application Server you want to connect to

    • --token

      Authentication token

    • --public-host

      Public host of the LDC Metadata Server to be reported to LDC Agents when they subsequently register. "Public" does not necessarily mean the internet facing public hostname/IP. It only means the hostname/IP that is routable from all the LDC Agents. If all the LDC Agents are part of a private subnet, this should be the private hostname/IP of the LDC Metadata Server host.

    • --port

      Port on which to run

    • --public-port

      The public port to advertise. You may want to specify this port, especially in a Kubernetes environment, in addition to the (local/internal) port that the LDC Metadata Server uses to communicate with the LDC Application Server.

    • --cert-fingerprint

      The SHA-256 fingerprint of the certificate. Use this argument when the endpoint argument points to the LDC Application Server's SSL port and the LDC Application Server is serving the default self-signed certificate.

    • --no-tls

      (Optional) Instructs the LDC Metadata Server to listen on a plain HTTP socket and not on a TLS socket.

    The LDC Metadata Server installer is verified and extracted.
  5. Enter 1 to use Express Install.

  6. At the prompts, complete your entries and selections:

    • The name of the Lumada Data Catalog service user
    • The installation location
    • The log location
    • Is Kerberos enabled
    • The installed Solr server version

    The LDC Metadata Server connectivity settings are saved in the following configuration properties on the LDC Application Server:

    • ldc.metadata.server.host
    • ldc.metadata.server.port
    • ldc.metadata.server.isSecure
    • ldc.metadata.server.fingerprint

    The relevant parameters on the LDC Metadata Server are saved in the application.yml file under the conf directory.

Results

The LDC Metadata Server initializes and starts listening for connections.

Note: Errors during installation can occur if the LDC Metadata Server is unable to connect to the LDC Application Server to fetch the Solr connection parameters, or to publish its host, port, or fingerprint information to the LDC Application Server.

If you have errors, or if you need to change parameters after installation, you can run the following LDC Metadata Server script from a command line, instead of the full installation:

/opt/ldc/metadata-server/bin/metadata-server init \
                                               --endpoint http://docker.ldc.com:8082 \
                                               --client-id metadata-rest-server \
                                               --token 270831cf-1141-4a7f-adc9-7e452b33d4d8 \
                                               --public-host docker.ldc.com \
                                               --port 4242 \
                                               --public-port 4242 \
                                               --cert-fingerprint bb4648da8f32d63959a89e1bb0bcba5d2146b0557fb52123cb3eb73fbc8ef265
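
The --cert-fingerprint value in the example above is a SHA-256 digest written as lowercase hexadecimal without separators, whereas OpenSSL prints fingerprints in uppercase with colons. The following sketch converts one form to the other. The function name is hypothetical, and it assumes that the format shown in the example above is the one the script expects.

```shell
# Hypothetical helper: convert OpenSSL's fingerprint output, such as
# "sha256 Fingerprint=BB:46:48:...", into lowercase hex without colons,
# matching the shape of the --cert-fingerprint example.
normalize_fingerprint() {   # reads one line on stdin
    sed -e 's/^.*=//' | tr -d ':' | tr 'A-F' 'a-f'
}

# Typical use against the TLS port of a running LDC Application Server
# (HOST and PORT are placeholders):
#   openssl s_client -connect HOST:PORT </dev/null 2>/dev/null \
#     | openssl x509 -noout -fingerprint -sha256 \
#     | normalize_fingerprint
```

If you connect through a reverse proxy, the certificate returned by this check is the proxy's certificate, not the one generated by the LDC Application Server.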

Install Lumada Data Catalog Agents

You begin Lumada Data Catalog Agents installation by creating a local LDC Agent to service the data sources on your local cluster. During the installation process, the LDC Application Server automatically creates the LDC Agents command token which is needed when you initialize and register the LDC Agent with the LDC Application Server. Afterward, you will need to perform additional installations for each on-premises and Cloud cluster that will be used for data discovery. It is a best practice to install LDC Agents close to your data sources to speed processing and minimize any latency issues when publishing the metadata back to the centralized catalog.

Follow the steps below to install LDC Agents:

Procedure

  1. Browse to Data Catalog's Welcome page at http://<LDC node>:8082 and log in with your Data Catalog administrator credentials.

    The Welcome to Lumada Data Catalog page opens.
  2. Navigate to Manage and then click Agents.

  3. Click Create Agent.

    The Create Agent dialog box opens.
  4. In the Name field, enter a name for the Agent, and then in the Description field, enter a description for the Agent.

  5. Click Add.

    The Register Agent dialog box appears.
  6. Click the copy icon.

  7. From a command line interface, run the executable file you downloaded from the Lumada and Pentaho Support Portal along with the Agents command token on the cluster where you want to install LDC Agents:

    [root@docker b1318]# sudo bash ./ldc-agent-6.0.0.run -- --register \
                             --endpoint http://docker.ldc.com:8082 \
                             --agent-id ra1f0dfb95a22446ff \
                             --agent-token 995a72e9-0930-4514-9306-e2aade5db1fa
    

    Refer to the following list for a description of each argument:

    • endpoint

      "Public" URL of the LDC Application Server that is exposed to the LDC Agent. If the LDC Application Server and the LDC Agent are in the same network, this can be the private host name/IP and port of the LDC Application Server.

    • agent-id and agent-token

      Authentication parameters for the LDC Agent to connect to the central catalog.

    • cert-fingerprint

      SHA-256 fingerprint of the certificate served by the LDC Application Server endpoint. Note that this fingerprint can be different from the generated fingerprint if you are connecting via a reverse proxy, and you will need to substitute the correct fingerprint.

    The step does the following:

    • Installs the agent binaries and configuration at the specified location.
    • Connects to the LDC Application Server at the specified endpoint, authenticates using the token, and registers itself.
    • Fetches the LDC Metadata Server connection parameters like the host, port, TLS/SSL flag and fingerprint and connects to the LDC Metadata Server to test the connectivity.
    • Opens a websocket connection to the LDC Application Server and stays connected waiting for further commands.
  8. Enter 1 to use Express Install.

  9. At the prompts, complete your entries and selections:

    • The name of the Lumada Data Catalog service user
    • The installation location
    • The log location
    • The Hive version
    • Is Kerberos enabled
  10. Review the summary of your selections, then enter Y when ready to proceed.

    The installation completes. LDC Agents opens a web socket connection to the LDC Application Server, as shown by the green Registered and Connected indicators.

  11. Install LDC Agents on each of your additional data source clusters.

    The LDC Agent's connection configuration to the LDC Application Server is stored in the conf/application.yml. The Agent's connection configuration to the LDC Metadata Server is stored in the conf/meta-client-configuration.json, and is used by jobs to publish metadata to the LDC Metadata Server.

Results

This completes installation of LDC Agents. If there are errors during installation, or if you need to change parameters after installation, you can run the following LDC Agents script, instead of the full installation:

/opt/ldc/agent/bin/agent register --endpoint http://b5:8082 \
                                            --agent-id ra3145b5dc2941434e \
                                            --agent-token 5e15a8dd-b995-4a4f-b3b7-2e7f9205110e
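
When you register agents on several clusters, it can help to assemble the register command from variables and review it before running it. The following is a dry-run sketch only. The function name is hypothetical, the script path is the sample path shown above, and passing --cert-fingerprint to the register script is an assumption based on the argument list in step 7.

```shell
# Hypothetical dry-run helper: print an agent register command.
# Nothing is executed; review the output and run it yourself.
agent_register_command() {  # usage: agent_register_command ENDPOINT AGENT_ID AGENT_TOKEN [CERT_FINGERPRINT]
    printf '/opt/ldc/agent/bin/agent register --endpoint %s --agent-id %s --agent-token %s' \
        "$1" "$2" "$3"
    # Assumption: the optional fingerprint is passed as --cert-fingerprint.
    if [ -n "${4:-}" ]; then
        printf ' --cert-fingerprint %s' "$4"
    fi
    printf '\n'
}

agent_register_command http://b5:8082 ra3145b5dc2941434e 5e15a8dd-b995-4a4f-b3b7-2e7f9205110e
```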

Next steps

This completes installation of Data Catalog. Log in with your Data Catalog administrator credentials to begin cataloging your data lake.

Installation with special cases

In some cases, you may need to pause installation of the LDC Application Server to make changes before Data Catalog installation can continue. These special cases are explained in the following articles.

Use custom ports

If another application on the cluster is using one of the ports (8082, 4039) that the LDC Application Server uses, or if a previous version of Data Catalog is running on an edge node, the installer will warn you that the expected ports have conflicts.

Perform the following steps to resolve detected port conflicts during installation:

Procedure

  1. If you have an existing version of Data Catalog running, exit the installer by pressing Ctrl+C.

  2. Stop any Data Catalog services.

  3. From a command line, execute the following code to restart the installer: ./ldc-app-server-*.run -- --no-exec

  4. Complete the command-line portion of the installation, as described in Part 1: Perform command line installation.

  5. Switch to the service user and modify the port numbers in the conf/install.properties file as shown below:

    <app-server> $ vi conf/install.properties
    
        #============================================
        # Jetty Related configs
        #============================================
    
        JETTY_MEMORY_ARGS="-Xss2m -Xms512m -Xmx6144m"
    
        LDC_JETTY_HTTP_PORT=8082
        LDC_JETTY_HTTPS_PORT=4039
        LDC_WEB_DAEMON_PORT=4082
        LDC_SERVICE_USER=ldcuser
        LDC_LOG_DIR=/var/log/ldc
  6. Restart services to complete the installation: $ bin/app-server start --setup
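
The edit in step 5 can also be scripted instead of made in vi. The following is a minimal sketch: the function name is hypothetical, the replacement port numbers are arbitrary illustrations, and it assumes each property appears once at the start of a line, as in the listing above.

```shell
# Hypothetical helper: set KEY=VALUE in a properties file by replacing the
# existing line for KEY. A backup copy is kept as FILE.bak.
# Note: values containing the | character would need a different delimiter.
set_install_property() {    # usage: set_install_property FILE KEY VALUE
    sed -i.bak "s|^$2=.*|$2=$3|" "$1"
}

# Example (run as the service user from the LDC Application Server home):
#   set_install_property conf/install.properties LDC_JETTY_HTTP_PORT 8182
#   set_install_property conf/install.properties LDC_JETTY_HTTPS_PORT 4139
```

Lines that do not match the key are left untouched, so comments and the other settings in install.properties are preserved.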

Change temporary directory if not writable

Data Catalog's installer expands the archive in a temporary directory specified by the $TMPDIR environment variable. The default directory is /tmp. The installer needs write permission to the temporary directory. If the directory is not writable, change the $TMPDIR variable to a different writable directory as shown in the following example:

mkdir test
export TMPDIR=$PWD/test
./ldc-app-server-*.run
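
Before launching the installer, you can check which candidate directory is actually writable. The following sketch picks the first usable directory from a list. The function name and the $HOME/ldc-tmp fallback are hypothetical choices for illustration.

```shell
# Hypothetical helper: print the first candidate directory that exists and
# is writable by the current user; return non-zero if none qualifies.
first_writable_dir() {      # usage: first_writable_dir DIR...
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Example: prefer the current TMPDIR, then /tmp, then a directory under $HOME.
#   mkdir -p "$HOME/ldc-tmp"
#   TMPDIR=$(first_writable_dir "${TMPDIR:-/tmp}" /tmp "$HOME/ldc-tmp") && export TMPDIR
#   ./ldc-app-server-*.run
```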

Set up Kerberos

This task is for setting up Kerberos for Data Catalog.
Note: You can skip this task if:
  • Your cluster is NOT configured with Kerberos.
  • You answered Y(es) to the Is Kerberos enabled? prompt during the installation process and supplied the full path of the keytab file and the principal details.

If your cluster is configured with Kerberos and you need to set up Kerberos for Data Catalog, perform the following steps:

Procedure

  1. Create a conf/keytab.properties file with the following contents:

    keyTabPath=/home/ldc/ldcuser.keytab
    principal=ldcuser@CORP.ACME.COM

    See the following list for a description of each argument:

    • keyTabPath

      Specifies the full file system path to the keytab, for example, /home/ldc/ldcuser.keytab. This path must be readable by the Data Catalog service user.

    • principal

      Specifies the full service principal of the service user, for example, ldcuser@CORP.ACME.COM.

  2. Get a ticket using kinit.

  3. Link core-site.xml, hdfs-site.xml, and hive-site.xml into the LDC Application Server's conf/ directory.

    You can typically find these files under the /etc/ path. If the files are in a different location on your system, locate them and substitute the discovered paths in the following commands:

    $ ln -s /etc/hadoop/conf/core-site.xml /opt/ldc/app-server/conf/
    $ ln -s /etc/hadoop/conf/hdfs-site.xml /opt/ldc/app-server/conf/
    $ ln -s /etc/hive/conf/hive-site.xml /opt/ldc/app-server/conf/
    
  4. Use the following command to restart the LDC Application Server in setup mode:

    /opt/ldc/app-server/bin/app-server restart --setup

Next steps

Continue setup of the LDC Application Server in a browser as described in Part 2: Perform installation in the browser.

For LDC Agents installation, confirm the following:

  • LDC Agents has a valid ticket when adding data sources and running jobs.
  • The keytab.properties file in the agent/conf directory has contents similar to those shown below:

    keyTabPath=/home/ldc/ldcuser.keytab
    principal=ldcuser@CORP.ACME.COM

    If this file does not exist, then create it. Make sure this keytab file is readable for the LDC Agents service user.
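
Creating the keytab.properties file can be scripted for both the LDC Application Server and the LDC Agent. The following is a minimal sketch: the function name is hypothetical, and the /opt/ldc/agent/conf path in the example is an assumption based on the sample installation paths used in this article.

```shell
# Hypothetical helper: write keytab.properties into a conf directory after
# checking that the keytab is readable by the current user.
write_keytab_properties() { # usage: write_keytab_properties CONF_DIR KEYTAB PRINCIPAL
    if [ ! -r "$2" ]; then
        echo "keytab not readable: $2" >&2
        return 1
    fi
    printf 'keyTabPath=%s\nprincipal=%s\n' "$2" "$3" > "$1/keytab.properties"
}

# Example (run as the service user so the readability check is meaningful):
#   write_keytab_properties /opt/ldc/agent/conf /home/ldc/ldcuser.keytab ldcuser@CORP.ACME.COM
#   kinit -kt /home/ldc/ldcuser.keytab ldcuser@CORP.ACME.COM
```

Running the check as the service user matters because the keytab must be readable by that user, not only by root.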

Note: The LDC Metadata Server does not need any specific changes.